Clustering Dendrograms

Agglomerative clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are similar to each other and data points in different clusters are dissimilar.

Points in the same cluster are close to each other.
Points in different clusters are far apart.

(Image by Author), Sample 2-dimension Dataset

In the above sample 2-dimensional dataset, it is visible that the data forms 3 clusters that are far apart from one another, and that points in the same cluster are close to each other.

The intuition behind Agglomerative Clustering:

Agglomerative clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are then merged as one moves up the hierarchy.

Steps of Agglomerative Clustering:

1. Initially, each data point is a cluster of its own.
2. Take the two nearest clusters and join them to form a single cluster.
3. Repeat step 2 until you obtain the desired number of clusters.
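
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the author's code: it assumes Euclidean distance and takes "nearest" to mean the smallest distance between any two points of the two clusters. The names `single_link`, `agglomerate`, and the toy `points` are made up for this example, and the brute-force search is far too slow for real datasets.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_link(a, b):
    """Distance between the two closest points of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters):
    """Naive agglomerative clustering that follows the three steps literally."""
    # Step 1: every data point starts as a cluster of its own.
    clusters = [[p] for p in points]

    # Step 3: keep repeating step 2 until the desired number of clusters is left.
    while len(clusters) > n_clusters:
        # Step 2: find the pair of clusters with the smallest distance...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # ...and merge them (j > i, so removing j leaves index i valid).
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


# Hypothetical toy data: a group of three points, a group of two, and one lone point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
result = agglomerate(points, n_clusters=3)
```

Calling `agglomerate(points, n_clusters=1)` instead would run the merging all the way up to a single cluster, which is the full hierarchy that a dendrogram draws.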

**1st Image:** Each data point is a cluster of its own. **2nd Image:** The two nearest clusters (surrounded by a black oval) join together to form a single cluster.

In the above sample dataset, 2 clusters are clearly separated from each other, so we stop merging once 2 clusters remain.

(Image by Author), Sample dataset separated into 2 clusters

How to join two clusters to form one cluster?

To obtain the desired number of clusters, the number of clusters has to be reduced from the initial n clusters (where n is the total number of data points). Which two clusters to combine at each step is decided by computing the similarity between them.

Several methods are used to calculate the similarity between two clusters:

Distance between the two closest points in the two clusters (single linkage).
Distance between the two farthest points in the two clusters (complete linkage).
The average distance between all pairs of points drawn from the two clusters (average linkage).
Distance between the centroids of the two clusters (centroid linkage).
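
As a rough illustration of how these four measures differ, the sketch below computes each of them for the same pair of clusters. It assumes Euclidean distance; the function names and the two toy clusters `a` and `b` are invented for this example.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(a, b):
    """Distance between the two closest points, one taken from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Distance between the two farthest points, one taken from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs of points."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the centroids of the two clusters."""
    return dist(centroid(a), centroid(b))


# Hypothetical toy clusters.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]

for measure in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(f"{measure.__name__}: {measure(a, b):.3f}")
```

For any pair of clusters the single-linkage value is the smallest of the pairwise measures and the complete-linkage value is the largest, with the average in between, which is one reason the choice of measure changes which clusters get merged first.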

Each of the above similarity metrics has its own pros and cons.

Implementation of Agglomerative Clustering:

(Code by Author)

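
The author's code is not reproduced on this page. As a stand-in, here is a minimal sketch that assumes scikit-learn's `AgglomerativeClustering` is used; the synthetic three-blob data, the Ward linkage, and the variable names are assumptions made for this example and may differ from the original.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for the sample dataset: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# n_clusters is the desired number of clusters; linkage selects how the
# distance between two clusters is measured ("ward" is the library default).
model = AgglomerativeClustering(n_clusters=3, linkage="ward")

# fit_predict runs the merging and returns one cluster label per data point.
labels = model.fit_predict(X)
```

The `linkage` argument also accepts `"single"`, `"complete"`, and `"average"`, which correspond to the first three similarity measures listed earlier.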

How to obtain the optimal number of clusters?

The implementation of the agglomerative clustering algorithm takes the desired number of clusters as input. There are several ways to find the optimal number of clusters, k, such that the population is divided into k clusters in a way that:

Points in the same cluster are close to each other.

Points in different clusters are far apart.

By observing the dendrogram, one can find the desired number of clusters.

A dendrogram is a diagrammatic representation of the hierarchical relationship between the data points. It illustrates the arrangement of the clusters produced by the corresponding analysis and is used to observe the output of hierarchical (agglomerative) clustering.

Implementation of Dendrograms:

(Code by Author)

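
The original code is likewise missing here. One common way to build a dendrogram in Python is SciPy's `scipy.cluster.hierarchy` module, and the sketch below assumes that route. The synthetic data and the choice of Ward linkage are assumptions for illustration, not the author's settings.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical stand-in for the sample dataset: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Each row of Z describes one merge: the ids of the two clusters joined,
# the height (distance) at which they were joined, and the new cluster's size.
Z = linkage(X, method="ward")

# no_plot=True only computes the tree layout and returns it as a dict.
# Dropping that argument draws the dendrogram, which requires matplotlib.
tree = dendrogram(Z, no_plot=True)
```

With n data points there are always n - 1 merges, so `Z` has n - 1 rows, and the last row is the final merge that joins everything into one cluster.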

Download the sample 2-dimension dataset from here.


**Left Image:** The sample dataset. **Right Image:** The sample dataset separated into 3 clusters.

For the above sample dataset, it is observed that the optimal number of clusters is 3. But for a high-dimensional dataset, where the data cannot be visualized directly, dendrograms play an important role in finding the optimal number of clusters.

How to find the optimal number of clusters by observing the dendrograms:

(Image by Author), Dendrogram for the above sample dataset

In the above dendrogram plot, find the horizontal rectangle of maximum height that is not crossed by any horizontal dendrogram line, that is, the tallest band in which no merge takes place.

**Left:** Separating into 2 clusters. **Right:** Separating into 3 clusters.

The dendrogram can be cut anywhere inside the rectangle with the maximum height, and the number of vertical lines that the cut crosses gives the number of clusters. Here the optimal number of clusters is 3, as seen in the right part of the above image. The max-height rectangle is chosen because it represents the maximum Euclidean distance between clusters at the optimal number of clusters.
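
The same rule of thumb can be applied numerically instead of by eye: the height of each candidate rectangle is the gap between two consecutive merge heights. The sketch below assumes SciPy's `linkage` and `fcluster` with Ward linkage on synthetic three-blob data; all of these choices are assumptions for illustration, and the largest-gap rule is a heuristic rather than a guarantee.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical stand-in for the sample dataset: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

Z = linkage(X, method="ward")

# Column 2 holds the merge heights; for Ward linkage they never decrease.
heights = Z[:, 2]

# Height of each empty band between two consecutive merges.
gaps = np.diff(heights)

# The tallest band lies between merge i and merge i + 1.
i = int(np.argmax(gaps))

# Cutting inside that band keeps the first i + 1 merges and undoes the rest,
# which leaves n - (i + 1) clusters.
k = len(X) - (i + 1)

# Assign every point to one of the k clusters obtained by that cut.
labels = fcluster(Z, t=k, criterion="maxclust")
print("suggested number of clusters:", k)
```

On data whose clusters are less cleanly separated, the largest gap can easily fall at the very last merge and suggest 2 clusters, so the result is best treated as a starting point to compare against the plot.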

Conclusion:

In this article, we discussed the intuition behind the agglomerative hierarchical clustering algorithm. One disadvantage of the algorithm is that it is not suitable for large datasets because of its high space and time complexity. Even reading the dendrogram to find the optimal number of clusters becomes very difficult for a large dataset.

Thank You for Reading


Translated from: https://towardsdatascience.com/agglomerative-clustering-and-dendrograms-explained-29fc12b85f23
