Clustering by fast search and find of density peaks

Alex Rodriguez and Alessandro Laio

Cluster analysis is aimed at classifying elements into categories on the basis of their similarity. Its applications range from astronomy to bioinformatics, bibliometrics, and pattern recognition. We propose an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This idea forms the basis of a clustering procedure in which the number of clusters arises intuitively, outliers are automatically spotted and excluded from the analysis, and clusters are recognized regardless of their shape and of the dimensionality of the space in which they are embedded. We demonstrate the power of the algorithm on several test cases.

Clustering algorithms attempt to classify elements into categories, or clusters, on
the basis of their similarity. Several different clustering strategies have been proposed (1), but no consensus has been reached even on the definition of a cluster. In K-means (2) and K-medoids (3) methods, clusters are groups of data characterized by a small distance to the cluster center. An objective function, typically the sum of the distances to a set of putative cluster centers, is optimized (3–6) until the best cluster center candidates are found. However, because a data point is always assigned to the nearest center, these approaches are not able to detect nonspherical clusters (7). In distribution-based algorithms, one attempts to reproduce the observed realization of data points as a mix of predefined probability distribution functions (8); the accuracy of such methods depends on the capability of the trial probability to represent the data.

Clusters with an arbitrary shape are easily detected by approaches based on the local density of data points. In density-based spatial clustering of applications with noise (DBSCAN) (9), one chooses a density threshold, discards as noise the points in regions with densities lower than this threshold, and assigns to different clusters disconnected regions of high density. However, choosing an appropriate threshold can be nontrivial, a drawback not present in the mean-shift clustering method (10, 11). There, a cluster is defined as a set of points that converge to the same local maximum of the density distribution function. This method allows the finding of nonspherical clusters but works only for data defined by a set of coordinates and is computationally costly.

Here, we propose an alternative approach. Similar to the K-medoids method, it has its basis only in the distance between data points. Like DBSCAN and the mean-shift method, it is able to detect nonspherical clusters and to automatically find the correct number of clusters. The cluster centers are defined, as in the mean-shift method, as local maxima in the density of data points. However, unlike the mean-shift method, our procedure does not require embedding the data in a vector space and maximizing explicitly the density field for each data point.

The algorithm has its basis in the assumptions that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any points with a higher local density. For each data point $i$, we compute two quantities: its local density $\rho_i$ and its distance $\delta_i$ from points of higher density. Both these quantities depend only on the distances $d_{ij}$ between data points, which are assumed to satisfy the triangular inequality. The local density $\rho_i$ of data point $i$ is defined as

$$\rho_i = \sum_j \chi(d_{ij} - d_c) \tag{1}$$

where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise, and $d_c$ is a cutoff distance. Basically, $\rho_i$ is equal to the number of points that are closer than $d_c$ to point $i$. The algorithm is sensitive only to the relative magnitude of $\rho_i$ in different points, implying that, for large data sets, the results of the analysis are robust with respect to the choice of $d_c$. The quantity $\delta_i$ is measured by computing the minimum distance between point $i$ and any other point with higher density:

$$\delta_i = \min_{j : \rho_j > \rho_i} d_{ij} \tag{2}$$

For the point with highest density, we conventionally take $\delta_i = \max_j d_{ij}$. Note that $\delta_i$ is much larger than the typical nearest-neighbor distance only for points that are local or global maxima of the density. Thus, cluster centers are recognized as points for which the value of $\delta_i$ is anomalously large.
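As a concrete illustration, here is a minimal NumPy sketch (not the authors' code) of how $\rho_i$, $\delta_i$, and each point's nearest higher-density neighbor can be computed from a precomputed distance matrix; the function name and the tie-breaking by sort order are our own illustrative choices.

```python
import numpy as np

def density_peaks_quantities(D, d_c):
    """Compute rho (Eq. 1), delta (Eq. 2), and each point's nearest
    neighbor of higher density from a pairwise distance matrix D."""
    n = D.shape[0]
    # Eq. 1: rho_i = number of points closer than d_c to point i
    # (subtract 1 to exclude the point itself, which sits at distance 0).
    rho = (D < d_c).sum(axis=1) - 1

    delta = np.empty(n)
    nearest_higher = np.full(n, -1)
    order = np.argsort(-rho)          # indices sorted by decreasing density
    for rank, i in enumerate(order):
        if rank == 0:
            # Highest-density point: delta_i = max_j d_ij by convention.
            delta[i] = D[i].max()
        else:
            # Points earlier in the ordering have higher (or equal) density;
            # ties in rho are broken by the sort order.
            higher = order[:rank]
            j = higher[np.argmin(D[i, higher])]
            delta[i] = D[i, j]
            nearest_higher[i] = j
    return rho, delta, nearest_higher
```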

This observation, which is the core of the algorithm, is illustrated by the simple example in Fig. 1. Figure 1A shows 28 points embedded in a two-dimensional space. We find that the density maxima are at points 1 and 10, which we identify as cluster centers. Figure 1B shows the plot of $\delta_i$ as a function of $\rho_i$ for each point; we will call this representation the decision graph. The value of $\delta$ for points 9 and 10, with similar values of $\rho$, is very different: Point 9 belongs to the cluster of point 1, and several other points with a higher $\rho$ are very close to it, whereas the nearest neighbor of higher density of point 10 belongs to another cluster. Hence, as anticipated, the only points of high $\delta$ and relatively high $\rho$ are the cluster centers. Points 26, 27, and 28 have a relatively high $\delta$ and a low $\rho$ because they are isolated; they can be considered as clusters composed of a single point, namely, outliers.

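The decision graph is just a scatter plot of $\delta_i$ against $\rho_i$; a minimal Matplotlib sketch, assuming the `rho` and `delta` arrays from the helper above:

```python
import matplotlib.pyplot as plt

def decision_graph(rho, delta):
    """Plot delta against rho; candidate cluster centers are the points
    with anomalously large delta (upper part of the graph)."""
    plt.scatter(rho, delta, s=15)
    plt.xlabel(r"$\rho$")
    plt.ylabel(r"$\delta$")
    plt.title("Decision graph")
    plt.show()
```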

After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. The cluster assignment is performed in a single step, in contrast with other clustering algorithms where an objective function is optimized iteratively (2, 8).
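The single-sweep assignment can be sketched as follows, reusing the `rho` and `nearest_higher` arrays from the earlier helper and a list of center indices read off the decision graph (all names are illustrative):

```python
import numpy as np

def assign_clusters(rho, nearest_higher, centers):
    """One-pass assignment: visiting points in order of decreasing density
    guarantees that each point's nearest higher-density neighbor is
    already labeled when the point is reached."""
    labels = np.full(len(rho), -1)
    for k, c in enumerate(centers):
        labels[c] = k                 # seed the chosen cluster centers
    # Assumes the global density maximum is among the chosen centers,
    # since it has no neighbor of higher density to inherit from.
    for i in np.argsort(-rho):
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels
```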

In cluster analysis, it is often useful to measure quantitatively the reliability of an assignment. In approaches based on the optimization of a function (2, 8), its value at convergence is also a natural quality measure. In methods like DBSCAN (9), one considers reliable the points with density values above a threshold, which can lead to low-density clusters, such as those in Fig. 2E, being classified as noise. In our algorithm, we do not introduce a noise-signal cutoff. Instead, we first find for each cluster a border region, defined as the set of points assigned to that cluster but being within a distance $d_c$ from data points belonging to other clusters. We then find, for each cluster, the point of highest density within its border region. We denote its density by $\rho_b$. The points of the cluster whose density is higher than $\rho_b$ are considered part of the cluster core (robust assignation). The others are considered part of the cluster halo (suitable to be considered as noise).
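A minimal sketch of this core/halo split, under the same assumptions as the previous snippets (distance matrix `D`, densities `rho`, and integer `labels` as a NumPy array from the assignment step):

```python
import numpy as np

def core_halo_split(D, rho, labels, d_c):
    """Mark each point as core or halo using the border density rho_b:
    a point is in the border region of its cluster if it lies within
    d_c of a point assigned to a different cluster."""
    n_clusters = labels.max() + 1
    in_border = (D < d_c) & (labels[:, None] != labels[None, :])
    border_points = in_border.any(axis=1)

    rho_b = np.zeros(n_clusters)      # highest density in each border region
    for i in np.where(border_points)[0]:
        rho_b[labels[i]] = max(rho_b[labels[i]], rho[i])

    is_core = rho > rho_b[labels]     # halo points are the complement
    return is_core, rho_b
```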

In order to benchmark our procedure, let us first consider the test case in Fig. 2. The data points are drawn from a probability distribution with nonspherical and strongly overlapping peaks (Fig. 2A); the probability values corresponding to the maxima differ by almost an order of magnitude. In Fig. 2, B and C, 4000 and 1000 points, respectively, are drawn from the distribution in Fig. 2A. In the corresponding decision graphs (Fig. 2, D and E), we observe only five points with a large value of $\delta$ and a sizeable density. These points are represented in the graphs as large solid circles and correspond to cluster centers. After the centers have been selected, each point is assigned either to a cluster or to the halo. The algorithm captures the position and shape of the probability peaks, even those corresponding to very different densities (blue and light green points in Fig. 2C) and nonspherical peaks. Moreover, points assigned to the halo correspond to regions that, by visual inspection of the probability distribution in Fig. 2A, would not be assigned to any peak.

To demonstrate the robustness of the procedure more quantitatively, we performed the analysis by drawing 10,000 points from the distribution in Fig. 2A, considering as a reference the cluster assignment obtained on that sample. We then obtained reduced samples by retaining only a fraction of points and performed cluster assignment for each reduced sample independently. Figure 2F shows, as a function of the size of the reduced sample, the fraction of points assigned to a cluster different than the one they were assigned to in the reference case. The fraction of misclassified points remains well below 1% even for small samples containing 1000 points.

Varying $d_c$ for the data in Fig. 2B produced mutually consistent results (fig. S1). As a rule of thumb, one can choose $d_c$ so that the average number of neighbors is around 1 to 2% of the total number of points in the data set. For data sets composed by a small number of points, $\rho_i$ might be affected by large statistical errors. In these cases, it might be useful to estimate the density by more accurate measures (10, 11).
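This rule of thumb translates directly into a quantile of the pairwise distances; a small sketch, where the `frac` default of 2% is the upper end of the range suggested in the text:

```python
import numpy as np

def choose_dc(D, frac=0.02):
    """Pick d_c so that each point has, on average, about frac * N
    neighbors within d_c: take the frac-quantile of the unique
    off-diagonal pairwise distances."""
    n = D.shape[0]
    upper = D[np.triu_indices(n, k=1)]
    return np.quantile(upper, frac)
```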

Next, we benchmarked the algorithm on the test cases presented in Fig. 3. For computing the density in cases with few points, we adopted the exponential kernel described in (11). In Fig. 3A, we consider a data set from (12), obtaining results comparable to those of the original article, where it was shown that other commonly used methods fail. In Fig. 3B, we consider an example with 15 clusters with high overlap in data distribution, taken from (13); our algorithm successfully determines the cluster structure of the data set. In Fig. 3C, we consider the test case for the FLAME (fuzzy clustering by local approximation of membership) approach (14), with results comparable to those of the original method. In the data set originally introduced to illustrate the performance of path-based spectral clustering (15), shown in Fig. 3D, our algorithm correctly finds the three clusters without the need of generating a connectivity graph. As a comparison, in figs. S3 and S4 we show the cluster assignations obtained by K-means (2) for these four test cases and for the example in Fig. 2. Even if the K-means optimization is performed with use of the correct value of K, the assignations are, in most of the cases, not compliant with visual intuition.
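For small samples, a smooth kernel replaces the hard cutoff $\chi$ of Eq. 1. The exact kernel used here follows (11); the Gaussian variant below is one common choice in published implementations and is shown only as an illustrative assumption, not as the paper's formula:

```python
import numpy as np

def rho_smooth(D, d_c):
    """Smooth density estimate: a Gaussian kernel of width d_c replaces
    the step function of Eq. 1, reducing statistical noise when the
    number of points is small."""
    K = np.exp(-(D / d_c) ** 2)
    return K.sum(axis=1) - 1.0        # subtract the self term (d_ii = 0)
```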

The method is robust with respect to changes in the metric that do not significantly affect the distances below $d_c$, that is, that keep the density estimator in Eq. 1 unchanged. Clearly, the distance in Eq. 2 will be affected by such a change of metric, but it is easy to realize that the structure of the decision graph (in particular, the number of data points with a large value of $\delta$) is a consequence of the ranking of the density values, not of the actual distance between faraway points. Examples demonstrating this statement are shown in fig. S5.

Our approach only requires measuring (or computing) the distance between all the pairs of data points and does not require parameterizing a probability distribution (8) or a multidimensional density function (10). Therefore, its performance is not affected by the intrinsic dimensionality of the space in which the data points are embedded. We verified that, in a test case with 16 clusters in 256 dimensions (16), the algorithm finds the number of clusters and assigns the points correctly (fig. S6). For a data set with 210 measurements of seven x-ray features for three types of wheat seeds from (17), the algorithm correctly predicts the existence of three clusters and classifies correctly 97% of the points assigned to the cluster cores (figs. S7 and S8).

We also applied the approach to the Olivetti Face Database (18), a widespread benchmark for machine learning algorithms, with the aim of identifying, without any previous training, the number of subjects in the database. This data set poses a serious challenge to our approach because the "ideal" number of clusters (namely, of distinct subjects) is comparable with the number of elements in the data set (namely, of different images, 10 for each subject). This makes a reliable estimate of the densities difficult. The similarity between two images was computed by following (19). The density is estimated by a Gaussian kernel (11) with variance $d_c = 0.07$. For such a small set, the density estimator is unavoidably affected by large statistical errors; thus, we assign images to a cluster following a slightly more restrictive criterion than in the preceding examples. An image is assigned to the same cluster of its nearest image with higher density only if their distance is smaller than $d_c$. As a consequence, the images farther than $d_c$ from any other image of higher density remain unassigned. In Fig. 4, we show the results of an analysis performed for the first 100 images in the data set. The decision graph (Fig. 4A) shows the presence of several distinct density maxima. Unlike in other examples, their exact number is not clear, a consequence of the sparsity of the data points. A hint for choosing the number of centers is provided by the plot of $\gamma_i = \rho_i \delta_i$ sorted in decreasing order (Fig. 4B). This graph shows that this quantity, which is by definition large for cluster centers, starts growing anomalously below a rank order of 9. Therefore, we performed the analysis by using nine centers. In Fig. 4D, we show with different colors the clusters corresponding to these centers. Seven clusters correspond to different subjects, showing that the algorithm is able to "recognize" 7 subjects out of 10. An eighth subject appears split into two different clusters. When the analysis is performed on all 400 images of the database, the decision graph again does not allow recognizing clearly the number of clusters (fig. S9). However, in Fig. 4C we show that, by adding more and more putative centers, about 30 subjects can be recognized unambiguously (fig. S9). When more centers are included, the images of some of the subjects are split in two clusters, but all the clusters remain pure, namely, include only images of the same subject. Following (20), we also computed the fraction of pairs of images of the same subject correctly associated with the same cluster ($r_{\mathrm{true}}$) and the fraction of pairs of images of different subjects erroneously assigned to the same cluster ($r_{\mathrm{false}}$). If one does not apply the cutoff at $d_c$ in the assignation (namely, if one applies our algorithm in its general formulation), one obtains $r_{\mathrm{true}} \approx 68\%$ and $r_{\mathrm{false}} \approx 1.2\%$ with ~42 to ~50 centers, a performance comparable to a state-of-the-art approach for unsupervised image categorization (20).
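The $\gamma$ ranking used as a hint here is straightforward to compute; a short sketch reusing the earlier quantities (plotting the curve and choosing the cutoff rank are left to the user):

```python
import numpy as np

def gamma_ranking(rho, delta):
    """Return gamma_i = rho_i * delta_i sorted in decreasing order,
    together with the corresponding point indices; a kink in this
    curve suggests how many centers to keep (cf. Fig. 4B)."""
    gamma = rho * delta
    order = np.argsort(-gamma)
    return gamma[order], order

# e.g. gamma_sorted, order = gamma_ranking(rho, delta)
#      centers = order[:9]   # nine centers for the first 100 Olivetti images
```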

Last, we benchmarked the clustering algorithm on the analysis of a molecular dynamics trajectory of trialanine in water at 300 K (21). In this case, clusters will approximately correspond to kinetic basins, namely, independent conformations of the system that are stable for a substantial time and separated by free-energy barriers that are crossed only rarely on a microscopic time scale. We first analyzed the trajectory by a standard approach (22) based on a spectral analysis of the kinetic matrix, whose eigenvalues are associated with the relaxation times of the system. A gap is present after the seventh eigenvalue (fig. S10), indicating that the system has eight basins; in agreement with that, our cluster analysis (fig. S10) gives rise to eight clusters, including conformations in a one-to-one correspondence with those defining the kinetic basins (22).

Identifying clusters with density maxima, as is done here and in other density-based clustering algorithms (9, 10), is a simple and intuitive choice but has an important drawback. If one generates data points at random, the density estimated for a finite sample size is far from uniform and is instead characterized by several maxima. However, the decision graph allows us to distinguish genuine clusters from the density ripples generated by noise. Qualitatively, only in the former case are the points corresponding to cluster centers separated by a sizeable gap in $\rho$ and $\delta$ from the other points. For a random distribution, one instead observes a continuous distribution in the values of $\rho$ and $\delta$. Indeed, we performed the analysis for sets of points generated at random from a uniform distribution in a hypercube. The distances between data points entering in Eqs. 1 and 2 are computed with periodic boundary conditions on the hypercube. This analysis shows that, for randomly distributed data points, the quantity $\gamma_i = \rho_i \delta_i$ is distributed according to a power law, with an exponent that depends on the dimensionality of the space in which the points are embedded. The distributions of $\gamma$ for data sets with genuine clusters, like those in Figs. 2 to 4, are strikingly different from power laws, especially in the region of high $\gamma$ (fig. S11). This observation may provide the basis for a criterion for the automatic choice of the cluster centers as well as for statistically validating the reliability of an analysis performed with our approach.
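This null model can be reproduced in a few lines; a sketch assuming the `density_peaks_quantities` helper from earlier (the sample size, dimensionality, and cutoff below are arbitrary illustrative values):

```python
import numpy as np

def gamma_random_uniform(n=2000, dim=2, d_c=0.05, seed=0):
    """gamma = rho * delta for points drawn uniformly in a unit hypercube,
    with distances computed under periodic boundary conditions; for such
    pure noise, gamma is expected to follow a power law (cf. fig. S11)."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, dim))
    diff = np.abs(X[:, None, :] - X[None, :, :])
    diff = np.minimum(diff, 1.0 - diff)       # periodic boundary conditions
    D = np.sqrt((diff ** 2).sum(axis=-1))
    rho, delta, _ = density_peaks_quantities(D, d_c)
    return rho * delta
```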
