Hierarchical Clustering: Agglomerative and Divisive, Explained

Hierarchical clustering is a method of cluster analysis that groups similar data points together. It proceeds either top-down (divisive) or bottom-up (agglomerative).

What is Clustering?

Clustering is an unsupervised machine learning technique that divides a population into several clusters such that data points in the same cluster are similar to each other and data points in different clusters are dissimilar.

Points in the same cluster are close to each other.
Points in different clusters are far apart.

(Image by Author), Sample 2-dimensional dataset

In the sample 2-dimensional dataset above, the data visibly form 3 clusters that are far apart, and points within each cluster are close to each other.
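
The "close within, far apart between" property can be checked numerically. The following is a minimal sketch on made-up coordinates (the article's dataset is not available, so the points and the names `clusters`, `mean_intra_distance`, and `mean_inter_distance` are assumptions for illustration). It compares the average distance inside each cluster with the average distance between clusters.

```python
import math
from itertools import combinations, product

# Hypothetical 2-D toy data: three well-separated groups (not the article's dataset).
clusters = {
    "a": [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4)],
    "b": [(8.0, 8.0), (8.4, 7.6), (7.7, 8.3)],
    "c": [(1.0, 9.0), (1.3, 8.6), (0.7, 9.4)],
}

def mean_intra_distance(points):
    """Average Euclidean distance over all pairs of points inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(points_a, points_b):
    """Average Euclidean distance over all pairs that cross the two clusters."""
    pairs = list(product(points_a, points_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

intra = {name: mean_intra_distance(pts) for name, pts in clusters.items()}
inter = {
    (m, n): mean_inter_distance(clusters[m], clusters[n])
    for m, n in combinations(clusters, 2)
}

largest_intra = max(intra.values())
smallest_inter = min(inter.values())
print(f"largest within-cluster distance:   {largest_intra:.2f}")
print(f"smallest between-cluster distance: {smallest_inter:.2f}")
```

For a well-clustered dataset, the largest within-cluster value should come out much smaller than the smallest between-cluster value.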

Hierarchical clustering is one of several families of clustering algorithms; others include k-means clustering and DBSCAN.

This article explains hierarchical clustering and its two types:

Divisive Clustering
Agglomerative Clustering

Divisive Clustering:

Divisive clustering is a top-down approach: initially, all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.

Steps of Divisive Clustering:

Initially, all points in the dataset belong to one single cluster.
Partition the cluster into the two least similar clusters.
Proceed recursively to form new clusters until the desired number of clusters is obtained.
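
The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the author's code: the cluster to split is the one with the largest sum of squared errors, and each split uses a small 2-means loop seeded with the farthest-apart pair of points. The helper names (`centroid`, `sse`, `bisect`, `divisive`) and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    center = centroid(points)
    return sum(math.dist(p, center) ** 2 for p in points)

def bisect(points, iterations=10):
    """Split one cluster in two with a small 2-means loop.

    Assumption: the seeds are the farthest-apart pair of points, which keeps
    the sketch deterministic. Expects at least two distinct points.
    """
    centers = list(max(combinations(points, 2), key=lambda pair: math.dist(*pair)))
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[nearer].append(p)
        centers = [centroid(half) for half in halves]
    return list(halves)

def divisive(points, n_clusters):
    """Top-down clustering: start with one cluster, split until n_clusters exist."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        if sse(clusters[worst]) == 0:
            break  # nothing left to split (single points or exact duplicates)
        clusters.extend(bisect(clusters.pop(worst)))
    return clusters

# Hypothetical toy data: three well-separated groups.
groups = [
    [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4)],
    [(8.0, 8.0), (8.4, 7.6), (7.7, 8.3)],
    [(1.0, 9.0), (1.3, 8.6), (0.7, 9.4)],
]
points = [p for group in groups for p in group]
result = divisive(points, 3)
for cluster in result:
    print(cluster)
```

On this toy data the first split separates one group from the other two, and the second split separates the remaining pair, mirroring the image sequence described in the text.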

**1st Image:** All the data points belong to one cluster.
**2nd Image:** One cluster is separated from the previous single cluster.
**3rd Image:** One further cluster is separated from the previous set of clusters.

In the sample dataset above, there are 3 clusters that are far apart from each other, so we stop after obtaining 3 clusters.

If we continue separating further clusters, the result is shown below.

(Image by Author), Sample dataset separated into 4 clusters

How to choose which cluster to split?

Compute the sum of squared errors (SSE) of each cluster and choose the one with the largest value. In the 2-dimensional dataset below, the data points are currently separated into 2 clusters; to form a 3rd cluster, compute the SSE of the red cluster and of the blue cluster.

(Image by Author), Sample dataset separated into 2 clusters

The cluster with the largest SSE is split into 2 clusters, which adds one new cluster. In the image above, the red cluster has the larger SSE, so it is split in two, giving 3 clusters in total.
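
As a small worked example of this selection rule, the sketch below computes the SSE of each cluster as the sum of squared Euclidean distances from its points to its centroid, then picks the cluster with the largest value. The coordinates for the red and blue clusters are invented for illustration, since the article's data is not given.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    center = centroid(points)
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for point in points
    )

# Hypothetical coordinates; the article's red and blue clusters are not given.
clusters = {
    "red": [(0, 0), (4, 0), (0, 4), (4, 4)],
    "blue": [(10, 10), (11, 10), (10, 11), (11, 11)],
}
sse_by_cluster = {name: sse(points) for name, points in clusters.items()}
to_split = max(sse_by_cluster, key=sse_by_cluster.get)

print(sse_by_cluster)  # {'red': 32.0, 'blue': 2.0}
print(to_split)        # red
```

The red cluster is more spread out around its centroid, so it has the larger SSE and is the one selected for splitting.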

How to split the above-chosen cluster?

Once we have decided which cluster to split, the question becomes how to split it into 2 clusters. One way is to use Ward's criterion: among the candidate splits, choose the one that yields the largest reduction in SSE.
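
One way to read this criterion is: for a candidate split into a left and a right part, the reduction is SSE(parent) minus the sum of SSE(left) and SSE(right), and the split with the largest reduction wins. The sketch below tries every two-way partition of a tiny cluster. This exhaustive search is an assumption made for clarity only; it grows exponentially with cluster size, so practical implementations rely on heuristics. The function name `best_split` and the data are hypothetical.

```python
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    center = centroid(points)
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for point in points
    )

def best_split(points):
    """Return (reduction, left, right) for the split with the largest SSE reduction.

    Requires at least two points. Exhaustive, so only usable on tiny clusters.
    """
    n = len(points)
    parent_sse = sse(points)
    best = None
    # Point 0 always goes left, so each partition is visited exactly once.
    for extra in range(n - 1):
        for chosen in combinations(range(1, n), extra):
            left_idx = {0, *chosen}
            left = [p for i, p in enumerate(points) if i in left_idx]
            right = [p for i, p in enumerate(points) if i not in left_idx]
            reduction = parent_sse - (sse(left) + sse(right))
            if best is None or reduction > best[0]:
                best = (reduction, left, right)
    return best

# Hypothetical cluster made of two tight triangles.
cluster = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
reduction, left, right = best_split(cluster)
print(left)
print(right)
print(f"SSE reduction: {reduction:.2f}")
```

On this data the chosen split puts each triangle in its own part, which is the partition that leaves the least within-cluster error.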

How to handle noise or outliers?

An outlier or noisy point can end up forming a cluster of its own. To handle noise in the dataset, use a threshold as a termination criterion, that is, do not generate clusters that are too small.
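
A minimal sketch of such a threshold, assuming it is expressed as a minimum number of points per cluster (the article does not say how the threshold is defined, so `min_size` and `accept_split` are hypothetical names):

```python
def accept_split(left, right, min_size=3):
    """Reject a split when either part has fewer than min_size points."""
    return len(left) >= min_size and len(right) >= min_size

# Hypothetical candidate splits.
dense_part = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
outlier_part = [(9.0, 9.0)]
other_part = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

print(accept_split(dense_part, outlier_part))  # False
print(accept_split(dense_part, other_part))    # True
```

A divisive procedure could call a check like this before committing a split and stop, or try another cluster, when it fails.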

Agglomerative Clustering:

Agglomerative clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy.

Steps of Agglomerative Clustering:

Initially, each data point is a cluster of its own.
Take the two nearest clusters and join them to form one single cluster.
Repeat the previous step until you obtain the desired number of clusters.
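
The steps above can be sketched as a naive loop. This is an illustration rather than the author's implementation: "nearest" is taken here to mean the smallest distance between any two points of the two clusters, which is one of several possible choices, and the names `closest_point_distance` and `agglomerative` are hypothetical. The loop recomputes every pairwise cluster distance at each step, so it is only suitable for small inputs.

```python
import math
from itertools import combinations

def closest_point_distance(cluster_a, cluster_b):
    """Distance between the two closest points of two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, n_clusters, cluster_distance=closest_point_distance):
    """Bottom-up clustering: start with singletons, merge until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # j > i, so removing cluster j leaves index i valid.
        clusters[i].extend(clusters.pop(j))
    return clusters

# Hypothetical toy data: two well-separated groups.
groups = [
    [(0, 0), (0, 1), (1, 0)],
    [(9, 9), (9, 10), (10, 9)],
]
points = [p for group in groups for p in group]
result = agglomerative(points, 2)
for cluster in result:
    print(cluster)
```

Passing a different `cluster_distance` function changes which clusters count as nearest without touching the merge loop.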

**1st Image:** Each data point is a cluster of its own.
**2nd Image:** The two nearest clusters (surrounded by a black oval) join together to form a single cluster.

In the sample dataset above, 2 clusters are far apart from each other, so we stop after obtaining 2 clusters.

How to join two clusters to form one cluster?

To obtain the desired number of clusters, the number of clusters must be reduced from the initial n (where n is the total number of data points). At each step, the two clusters to merge are chosen by computing the similarity between clusters.

Several methods are used to calculate the similarity between two clusters:

Distance between the two closest points in the two clusters (single linkage).
Distance between the two farthest points in the two clusters (complete linkage).
The average distance between all pairs of points across the two clusters (average linkage).
Distance between the centroids of the two clusters (centroid linkage).

Each of the above similarity metrics has its own pros and cons.
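
The four measures listed above can be written as short functions. This sketch assumes Euclidean distance between points; the function names and the two example clusters are made up for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def closest_points(a, b):
    """Distance between the two closest points of the two clusters."""
    return min(math.dist(p, q) for p in a for q in b)

def farthest_points(a, b):
    """Distance between the two farthest points of the two clusters."""
    return max(math.dist(p, q) for p in a for q in b)

def average_distance(a, b):
    """Average distance over all pairs with one point from each cluster."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters: two vertical pairs of points, 3 units apart.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (3, 2)]

for measure in (closest_points, farthest_points, average_distance, centroid_distance):
    print(f"{measure.__name__}: {measure(cluster_a, cluster_b):.3f}")
```

The closest-point value is never larger than the average, and the average is never larger than the farthest-point value, which is one reason the choice of measure changes which clusters get merged first.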

Implementation:

(Code by Author)

Conclusion:

In this article, we have discussed the intuition behind agglomerative and divisive hierarchical clustering algorithms. One disadvantage of hierarchical algorithms is that they are not suitable for large datasets because of their high space and time complexity.
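
To give a rough sense of the space cost, the sketch below counts the pairwise distances among n points, n(n - 1)/2, and estimates the memory needed to hold them. It assumes an implementation that stores every pairwise distance as a 64-bit float, which the article does not state; the function names are hypothetical.

```python
def pairwise_distance_count(n_points):
    """Number of distinct point pairs, n * (n - 1) / 2."""
    return n_points * (n_points - 1) // 2

def distance_matrix_gib(n_points, bytes_per_value=8):
    """Approximate size in GiB of all pairwise distances (64-bit floats assumed)."""
    return pairwise_distance_count(n_points) * bytes_per_value / 2**30

for n in (1_000, 10_000, 100_000):
    print(n, pairwise_distance_count(n), f"{distance_matrix_gib(n):.2f} GiB")
```

Because the count grows quadratically, a tenfold increase in data points requires roughly a hundred times more memory under this assumption.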

Thank You for Reading

Translated from: https://towardsdatascience.com/hierarchical-clustering-agglomerative-and-divisive-explained-342e6b20d710
