
Hierarchical clustering is a method of cluster analysis that is used to cluster similar data points together. Hierarchical clustering follows either the top-down or bottom-up method of clustering.

What is Clustering?

Clustering is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar to each other than to data points in other clusters.


  • Points in the same cluster are close to each other.
  • Points in different clusters are far apart.
(Image by Author), Sample 2-dimension Dataset

In the above sample 2-dimensional dataset, the data visibly forms 3 clusters that are far apart, and points in the same cluster are close to each other.


There are several types of clustering algorithms besides hierarchical clustering, such as k-means clustering, DBSCAN, and many more.

This article explains hierarchical clustering and its types.


There are two types of hierarchical clustering methods:


  1. Divisive Clustering
  2. Agglomerative Clustering

Divisive Clustering:

The divisive clustering algorithm is a top-down clustering approach: initially, all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.


Steps of Divisive Clustering:

  1. Initially, all points in the dataset belong to one single cluster.
  2. Partition the cluster into the two least similar clusters.
  3. Proceed recursively to form new clusters until the desired number of clusters is obtained.
1st Image: All the data points belong to one cluster. 2nd Image: One cluster is separated from the previous single cluster. 3rd Image: One more cluster is separated from the previous set of clusters.

In the above sample dataset, there are 3 clusters that are far apart from each other, so we stopped after getting 3 clusters.

If we continue and separate out more clusters, the result below is obtained.


(Image by Author), Sample dataset separated into 4 clusters

How to choose which cluster to split?

Compute the sum of squared errors (SSE) of each cluster and choose the one with the largest value. In the 2-dimensional dataset below, the data points are currently separated into 2 clusters. To separate them further and form a 3rd cluster, find the SSE of the points in the red cluster and of the points in the blue cluster.

(Image by Author), Sample dataset separated into 2 clusters

The cluster with the largest SSE value is split into 2 clusters, which forms one new cluster. In the above image, the red cluster has the larger SSE, so it is split into 2 clusters, giving 3 clusters in total.
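
The selection rule can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the helper names (`centroid`, `sse`, `cluster_to_split`) and the toy "red" and "blue" coordinates are assumptions made up for the example, and SSE is taken to mean the sum of squared Euclidean distances from each point to its cluster centroid.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Mean of the points along each dimension."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sse(points: List[Point]) -> float:
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def cluster_to_split(clusters: Dict[str, List[Point]]) -> str:
    """Return the label of the cluster with the largest SSE."""
    return max(clusters, key=lambda label: sse(clusters[label]))


# Hypothetical toy data: a spread-out "red" cluster and a tight "blue" one.
clusters = {
    "red": [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)],
    "blue": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)],
}

chosen = cluster_to_split(clusters)  # "red", because its points are more spread out
```

With these made-up points the red cluster has an SSE of 32 and the blue cluster an SSE of 2, so red is the one selected for splitting.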

How to split the above-chosen cluster?

Once we have decided which cluster to split, the question arises of how to split the chosen cluster into 2 clusters. One way is to use Ward's criterion and look for the split that gives the largest reduction in SSE.
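
One way to read this criterion is: among all two-way partitions of the chosen cluster, keep the one that maximizes SSE(parent) - (SSE(left) + SSE(right)). The sketch below does this by brute force. It is only an assumption about how the split could be found, the names `sse` and `best_split` and the sample points are hypothetical, and the exhaustive search is exponential in the cluster size, so it is usable only for very small clusters. Practical implementations typically use a heuristic such as a 2-means step instead.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sse(points: List[Point]) -> float:
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    c = [sum(p[d] for p in points) / n for d in range(len(points[0]))]
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def best_split(points: List[Point]) -> Tuple[List[Point], List[Point], float]:
    """Try every two-way partition and keep the largest SSE reduction."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to split a cluster")
    parent = sse(points)
    best = None
    # Point 0 always stays on the left so each partition is visited once.
    for k in range(n - 1):  # k extra points join point 0; right is never empty
        for extra in combinations(range(1, n), k):
            left_idx = {0, *extra}
            left = [points[i] for i in range(n) if i in left_idx]
            right = [points[i] for i in range(n) if i not in left_idx]
            reduction = parent - (sse(left) + sse(right))
            if best is None or reduction > best[2]:
                best = (left, right, reduction)
    return best


# Hypothetical cluster that visibly contains two groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
left, right, reduction = best_split(points)
```

For the sample points, the three points near the origin end up in `left` and the two points near (5, 5) in `right`.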

How to handle the noise or outlier?

An outlier or noise point can end up forming a new cluster of its own. To handle noise in the dataset, use a threshold as a termination criterion, that is, do not generate clusters that are too small.
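
A size threshold of this kind could be checked before a proposed split is accepted. The function name `accept_split`, the default `min_size=3`, and the sample points are assumptions chosen for illustration. The text does not specify a particular threshold value.

```python
from typing import List, Sequence

Point = Sequence[float]


def accept_split(left: List[Point], right: List[Point], min_size: int = 3) -> bool:
    """Reject a split if either resulting cluster is smaller than min_size."""
    return len(left) >= min_size and len(right) >= min_size


# Hypothetical proposed split: a dense group plus a single far-away outlier.
dense = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
outlier = [(50.0, 50.0)]

split_allowed = accept_split(dense, outlier)  # the outlier-only cluster is too small
```

When the check fails, the cluster is left as it is and the outlier does not become a cluster of its own.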

Agglomerative Clustering:

Agglomerative clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are then merged as one moves up the hierarchy.


Steps of Agglomerative Clustering:

  1. Initially, each data point is a cluster of its own.
  2. Take the two nearest clusters and join them to form one single cluster.
  3. Repeat step 2 until you obtain the desired number of clusters.
1st Image: Each data point is a cluster of its own. 2nd Image: The two nearest clusters (surrounded by a black oval) join together to form a single cluster.

In the above sample dataset, it is observed that 2 clusters are far separated from each other. So we stopped after getting 2 clusters.
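
The three steps can be sketched as a simple loop. This is a naive illustration, not the author's code: the names `closest_point_distance` and `agglomerate` and the sample points are hypothetical, and "nearest" is assumed here to mean the distance between the two closest points of a pair of clusters, which is only one of several possible choices. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def closest_point_distance(a: List[Point], b: List[Point]) -> float:
    """Distance between the two closest points of clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points: List[Point], n_clusters: int) -> List[List[Point]]:
    """Merge the two nearest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]  # step 1: every point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        # Step 2: find the pair of clusters with the smallest distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = closest_point_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
        # Step 3: the loop repeats until the desired number is reached.
    return clusters


# Hypothetical data with two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0)]
result = agglomerate(points, n_clusters=2)
```

On the sample points the loop stops with one cluster near the origin and one near (8, 8). Because every pair of clusters is rescanned on each iteration, this version is slow and is meant only to make the steps concrete.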

(Image by Author), Sample dataset separated into 2 clusters

How to join two clusters to form one cluster?

To obtain the desired number of clusters, the number of clusters needs to be reduced from the initial n clusters (n equals the total number of data points). Which two clusters to combine is decided by computing the similarity between them.

There are some methods which are used to calculate the similarity between two clusters:


  • Distance between the two closest points in the two clusters (single linkage).
  • Distance between the two farthest points in the two clusters (complete linkage).
  • The average distance over all pairs of points, one from each cluster (average linkage).
  • Distance between the centroids of the two clusters (centroid linkage).
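
The four measures listed above can be written as small functions. The function names and the two sample clusters are assumptions for illustration, and Euclidean distance is assumed throughout (via `math.dist`, Python 3.8 or later).

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def single_linkage(a: List[Point], b: List[Point]) -> float:
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage(a: List[Point], b: List[Point]) -> float:
    """Distance between the two farthest points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)


def average_linkage(a: List[Point], b: List[Point]) -> float:
    """Mean distance over all pairs with one point from each cluster."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_linkage(a: List[Point], b: List[Point]) -> float:
    """Distance between the centroids of the two clusters."""
    def centroid(points: List[Point]) -> List[float]:
        return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]
    return math.dist(centroid(a), centroid(b))


# Hypothetical clusters: two vertical pairs of points, 3 units apart.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]

distances = {
    "single": single_linkage(a, b),
    "complete": complete_linkage(a, b),
    "average": average_linkage(a, b),
    "centroid": centroid_linkage(a, b),
}
```

For any pair of clusters the average value lies between the single and complete values, which is one way to sanity-check an implementation.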

Each of the above similarity metrics has its own pros and cons.


Implementation:

(Code by Author)

Conclusion:

In this article, we have discussed the intuition behind agglomerative and divisive hierarchical clustering algorithms. One disadvantage of hierarchical algorithms is that they are not well suited to large datasets because of their high space and time complexity.

