How machines make sense of big data: an introduction to clustering algorithms

Take a look at the image below. It’s a collection of bugs and creepy-crawlies of different shapes and sizes. Take a moment to categorize them by similarity into a number of groups.

This isn’t a trick question. Start with grouping the spiders together.

Done? While there’s not necessarily a “correct” answer here, it’s most likely you split the bugs into four clusters. The spiders in one cluster, the pair of snails in another, the butterflies and moth into one, and the trio of wasps and bees into one more.

That wasn’t too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare — or a passion for entomology — you could probably even do the same with a hundred bugs.

For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together.

Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them.

With a hundred bugs — there’d be many times more solutions than there are particles in the known universe.

How many times more? By my calculation, approximately five hundred million billion billion times more. In fact, there are more than four million billion googol solutions (what’s a googol?).

For just a hundred objects.

Almost all of those solutions would be meaningless — yet from that unimaginable number of possible choices, you pretty quickly found one of the very few that clustered the bugs in a useful way.

We humans take for granted how good we are at categorizing and making sense of large volumes of data pretty quickly. Whether it’s a paragraph of text, or images on a screen, or a sequence of objects — humans are generally fairly efficient at making sense of whatever data the world throws at us.

Given that a key aspect of developing A.I. and machine learning is getting machines to quickly make sense of large sets of input data, what shortcuts are there available?

Here, you can read about three clustering algorithms that machines can use to quickly make sense of large datasets. This is by no means an exhaustive list — there are other algorithms out there — but they represent a good place to start!

You’ll find for each a quick summary of when you might use them, a brief overview of how they work, and a more detailed, step-by-step worked example. I believe it helps to understand an algorithm by actually carrying it out yourself.

If you’re really keen, you’ll find the best way to do this is with pen and paper. Go ahead — nobody will judge!

K-means clustering

Use when...

…you have an idea of how many groups you’re expecting to find a priori.

How it works

The algorithm randomly assigns each observation into one of k categories, then calculates the mean of each category. Next, it reassigns each observation to the category with the closest mean before recalculating the means. This step repeats over and over until no more reassignments are necessary.

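To make the loop concrete, here is a minimal one-dimensional sketch in Python (my own illustration; the function name and data layout are assumptions, not code from the original article):

import random

def k_means_1d(values, k, seed=0):
    random.seed(seed)
    # Randomly assign each observation to one of k categories
    labels = [random.randrange(k) for _ in values]
    while True:
        # Calculate the mean of each category (an empty group gets an infinite mean)
        means = []
        for g in range(k):
            members = [v for v, lab in zip(values, labels) if lab == g]
            means.append(sum(members) / len(members) if members else float("inf"))
        # Reassign each observation to the category with the closest mean
        new_labels = [min(range(k), key=lambda g: abs(v - means[g])) for v in values]
        if new_labels == labels:  # no more reassignments necessary: done
            return labels, means
        labels = new_labels

goals = [5, 20, 11, 5, 3, 19, 30, 3, 15]  # players A to I from the example below
print(k_means_1d(goals, k=3))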

Worked example

Take a group of nine football (or ‘soccer’) players who have each scored a certain number of goals this season (say in the range 3–30). Let’s divide them into separate clusters — say three.

Step 1 requires us to randomly split the players into three groups and calculate the means of each.

Group 1
Player A (5 goals), Player B (20 goals), Player C (11 goals)
Group Mean = (5 + 20 + 11) / 3 = 12 goals

Group 2
Player D (5 goals), Player E (3 goals), Player F (19 goals)
Group Mean = 9 goals

Group 3
Player G (30 goals), Player H (3 goals), Player I (15 goals)
Group Mean = 16 goals

Step 2: For each player, reassign them to the group with the closest mean. E.g., Player A (5 goals) is assigned to Group 2 (mean = 9). Then recalculate the group means.

Group 1 (Old Mean = 12 goals)
Player C (11 goals)
New Mean = 11 goals

Group 2 (Old Mean = 9 goals)
Player A (5 goals), Player D (5 goals), Player E (3 goals), Player H (3 goals)
New Mean = 4 goals

Group 3 (Old Mean = 16 goals)
Player G (30 goals), Player I (15 goals), Player B (20 goals), Player F (19 goals)
New Mean = 21 goals

Repeat Step 2 over and over until the group means no longer change. For this somewhat contrived example, this happens on the next iteration. Stop! You have now formed three clusters from the dataset!

Group 1 (Old Mean = 11 goals)
Player C (11 goals), Player I (15 goals)
Final Mean = 13 goals

Group 2 (Old Mean = 4 goals)
Player A (5 goals), Player D (5 goals), Player E (3 goals), Player H (3 goals)
Final Mean = 4 goals

Group 3 (Old Mean = 21 goals)
Player G (30 goals), Player B (20 goals), Player F (19 goals)
Final Mean = 23 goals

With this example, the clusters could correspond to the players’ positions on the field — such as defenders, midfielders and attackers.

K-means works here because we could have reasonably expected the data to fall naturally into these three categories.

In this way, given data on a range of performance statistics, a machine could do a reasonable job of estimating the positions of players from any team sport — useful for sports analytics, and indeed any other purpose where classification of a dataset into predefined groups can provide relevant insights.

Finer details

There are several variations on the algorithm described here. The initial method of ‘seeding’ the clusters can be done in one of several ways.

Here, we randomly assigned every player into a group, then calculated the group means. This causes the initial group means to tend towards being similar to one another, which ensures greater repeatability.

An alternative is to seed the clusters with just one player each, then start assigning players to the nearest cluster. The returned clusters are more sensitive to the initial seeding step, reducing repeatability in highly variable datasets.

However, this approach may reduce the number of iterations required to complete the algorithm, as the groups will take less time to diverge.

An obvious limitation to K-means clustering is that you have to provide a priori assumptions about how many clusters you’re expecting to find.

There are methods to assess the fit of a particular set of clusters. For example, the Within-Cluster Sum-of-Squares is a measure of the variance within each cluster.

The ‘better’ the clusters, the lower the overall WCSS.

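As a rough sketch of that measure (my own illustration, not from the article), the WCSS of the final K-means clusters above could be computed like this:

# Within-Cluster Sum-of-Squares for the final clusters from the worked example
clusters = [
    [11, 15],      # Group 1, mean 13
    [5, 5, 3, 3],  # Group 2, mean 4
    [30, 20, 19],  # Group 3, mean 23
]

wcss = 0.0
for members in clusters:
    mean = sum(members) / len(members)
    wcss += sum((x - mean) ** 2 for x in members)

print(wcss)  # lower is 'better' for a given number of clusters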

Hierarchical clustering

Use when...

…you wish to uncover the underlying relationships between your observations.

How it works

A distance matrix is computed, where the value of cell (i, j) is a distance metric between observations i and j.

Then, pair the closest two observations and calculate their average. Form a new distance matrix, merging the paired observations into a single object.

From this distance matrix, pair up the closest two observations and calculate their average. Repeat until all observations are grouped together.

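If you would rather let a library do the bookkeeping, SciPy’s hierarchical clustering implements this kind of agglomerative merge-and-average procedure. A minimal sketch, assuming SciPy is installed (the data is the whale table from the worked example below):

import numpy as np
from scipy.cluster.hierarchy import linkage

# Typical body lengths in metres for BD, RD, PW, KW, HW, FW (see the table below)
lengths = np.array([[3.0], [3.6], [6.5], [7.5], [15.0], [20.0]])

# 'centroid' linkage merges the closest pair and represents it by its average,
# mirroring the step-by-step example that follows
Z = linkage(lengths, method="centroid")
print(Z)  # each row: the two clusters merged, their distance, and the new cluster size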

Worked example

Here’s a super-simplified dataset about a selection of whale and dolphin species. As a trained biologist, I can assure you we normally use much more detailed datasets for things like reconstructing phylogeny.

For now though, we’ll just look at the typical body lengths for these six species. We’ll be using just two repeated steps.

Species          Initials  Length(m)
Bottlenose Dolphin     BD        3.0
Risso's Dolphin        RD        3.6
Pilot Whale            PW        6.5
Killer Whale           KW        7.5
Humpback Whale         HW       15.0
Fin Whale              FW       20.0

Step 1: compute a distance matrix between each species. Here, we’ll use the Euclidean distance — how far apart are the data points?

Read this exactly as you would a distance chart in a road atlas. The difference in length between any pair of species can be looked up by reading the value at the intersection of the relevant row and column.

     BD   RD   PW   KW   HW
RD  0.6
PW  3.5  2.9
KW  4.5  3.9  1.0
HW 12.0 11.4  8.5  7.5
FW 17.0 16.4 13.5 12.5  5.0
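
For reference, the same chart can be produced programmatically; a sketch assuming SciPy is available (this is my own illustration, not the author’s code):

import numpy as np
from scipy.spatial.distance import pdist, squareform

lengths = np.array([[3.0], [3.6], [6.5], [7.5], [15.0], [20.0]])  # BD, RD, PW, KW, HW, FW
D = squareform(pdist(lengths, metric="euclidean"))
print(np.round(D, 1))  # the lower triangle matches the chart above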

Step 2: Pair up the two closest species. Here, this will be the Bottlenose & Risso’s Dolphins, with an average length of 3.3m.

Repeat Step 1 by recalculating the distance matrix, but this time merge the Bottlenose & Risso’s Dolphins into a single object with length 3.3m.

    [BD, RD]   PW   KW   HW
PW       3.2
KW       4.2   1.0
HW      11.7   8.5  7.5
FW      16.7  13.5 12.5  5.0

Next, repeat Step 2 with this new distance matrix. Here, the smallest distance is between the Pilot & Killer Whales, so we pair them up and take their average — which gives us 7.0m.

Then, we repeat Step 1 — recalculate the distance matrix, but now we’ve merged the Pilot & Killer Whales into a single object of length 7.0m.

         [BD, RD]  [PW, KW]    HW
[PW, KW]      3.7
HW           11.7       8.0
FW           16.7      13.0   5.0

Next, repeat Step 2 with this distance matrix. The smallest distance (3.7m) is between the two merged objects — so now merge them into an even bigger object, and take the average (which is 5.2m).

Then, repeat Step 1 and compute a new distance matrix, having merged the Bottlenose & Risso’s Dolphins with the Pilot & Killer Whales.

[[BD, RD] , [PW, KW]]    HW
HW                   9.8
FW                  14.8   5.0

Next, repeat Step 2. The smallest distance (5.0m) is between the Humpback & Fin Whales, so merge them into a single object, and take the average (17.5m).

Then, it’s back to Step 1 — compute the distance matrix, having merged the Humpback & Fin Whales.

[[BD, RD] , [PW, KW]]
[HW, FW]                  12.3

Finally, repeat Step 2 — there is only one distance (12.3m) in this matrix, so pair everything into one big object. Now you can stop! Look at the final merged object:

[[[BD, RD],[PW, KW]],[HW, FW]]

It has a nested structure (think JSON), which allows it to be drawn up as a tree-like graph, or 'dendrogram'.

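To actually draw the dendrogram, something like the following works (a sketch assuming SciPy and matplotlib are installed; not code from the article):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

lengths = np.array([[3.0], [3.6], [6.5], [7.5], [15.0], [20.0]])
Z = linkage(lengths, method="centroid")
dendrogram(Z, labels=["BD", "RD", "PW", "KW", "HW", "FW"])
plt.ylabel("merge distance (m)")
plt.show()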

It reads in much the same way a family tree might. The nearer two observations are on the tree, the more similar or closely-related they are taken to be.

The structure of the dendrogram gives insight into how the dataset is structured.

In this example, there are two main branches, with Humpback Whale and Fin Whale on one side, and the Bottlenose Dolphin/Risso’s Dolphin and Pilot Whale/Killer Whale on the other.

In evolutionary biology, much larger datasets with many more specimens and measurements are used in this way to infer taxonomic relationships between them.

Outside of biology, hierarchical clustering has applications in data mining and machine learning contexts.

The cool thing is that this approach requires no assumptions about the number of clusters you’re looking for.

You can split the returned dendrogram into clusters by “cutting” the tree at a given height. This height can be chosen in a number of ways, depending on the resolution at which you wish to cluster the data.

For instance, looking at the dendrogram above, if we draw a horizontal line at height = 10, we’d intersect the two main branches, splitting the dendrogram into two sub-graphs. If we cut at height = 2, we’d be splitting the dendrogram into four clusters: [BD, RD], [PW, KW], HW and FW.
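
In SciPy terms, ‘cutting’ the tree is what fcluster does; a quick sketch, reusing the linkage matrix Z from the earlier snippets:

from scipy.cluster.hierarchy import fcluster

print(fcluster(Z, t=10, criterion="distance"))  # two clusters: the two main branches
print(fcluster(Z, t=2, criterion="distance"))   # four clusters: [BD, RD], [PW, KW], HW, FW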

Finer details

There are essentially three aspects in which hierarchical clustering algorithms can vary from the one given here.

Most fundamental is the approach — here, we have used an agglomerative process, whereby we start with individual data points and iteratively cluster them together until we’re left with one large cluster.

An alternative (but more computationally intensive) approach is to start with one giant cluster, and then proceed to divide the data into smaller and smaller clusters until you’re left with isolated data points.

There are also a range of methods that can be used to calculate the distance matrices. For many purposes, the Euclidean distance (think Pythagoras’ Theorem) will suffice, but there are alternatives that may be more applicable in some circumstances.

Finally, the linkage criterion can also vary. Clusters are linked according to how close they are to one another, but the way in which we define ‘close’ is flexible.

In the example above, we measured the distances between the means (or ‘centroids’) of each group and paired up the nearest groups. However, you may want to use a different definition.

For example, each cluster is made up of several discrete points. You could define the distance between two clusters to be the minimum (or maximum) distance between any of their points — as illustrated in the figure below.

There are still other ways of defining the linkage criterion, which may be suitable in different contexts.

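In SciPy, the linkage criterion is just a parameter; a sketch showing how the merge heights change with it ('single' is the minimum-distance rule and 'complete' the maximum-distance rule mentioned above):

import numpy as np
from scipy.cluster.hierarchy import linkage

lengths = np.array([[3.0], [3.6], [6.5], [7.5], [15.0], [20.0]])
for method in ("single", "complete", "centroid"):
    Z = linkage(lengths, method=method)
    print(method, np.round(Z[:, 2], 2))  # the merge heights differ by criterion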

Graph Community Detection

Use when...

…you have data that can be represented as a network, or ‘graph’.

How it works

A graph community is very generally defined as a subset of vertices which are more connected to each other than with the rest of the network.

Various algorithms exist to identify communities, based upon more specific definitions. Algorithms include, but are not limited to: Edge Betweenness, Modularity-Maximisation, Walktrap, Clique Percolation, Leading Eigenvector…

Worked example

Graph theory, or the mathematical study of networks, is a fascinating branch of mathematics that lets us model complex systems as an abstract collection of ‘dots’ (or vertices) connected by ‘lines’ (or edges).

Perhaps the most intuitive case-studies are social networks.

Here, the vertices represent people, and edges connect vertices who are friends/followers. However, any system can be modelled as a network if you can justify a method to meaningfully connect different components.

Among the more innovative applications of graph theory to clustering are feature extraction from image data and the analysis of gene regulatory networks.

As an entry-level example, take a look at this quickly put-together graph. It shows the eight websites I most recently visited, linked according to whether their respective Wikipedia articles link out to one another.

You could assemble this data manually, but for larger-scale projects, it’s much quicker to write a Python script to do the same. Here’s one I wrote earlier.

The vertices are colored according to their community membership, and sized according to their centrality. See how Google and Twitter are the most central?

Also, the clusters make pretty good sense in the real-world (always an important performance indicator).

The yellow vertices are generally reference/look-up sites; the blue vertices are all used for online publishing (of articles, tweets, or code); and the red vertices include YouTube, which was of course founded by former PayPal employees. Not bad deductions for a machine.

Aside from being a useful way to visualize large systems, the real power of networks comes from their mathematical analysis. Let’s start by translating our nice picture of the network into a more mathematical format. Below is the adjacency matrix of the network.

          GH Gl  M  P  Q  T  W  Y
GitHub    0  1  0  0  0  1  0  0
Google    1  0  1  1  1  1  1  1
Medium    0  1  0  0  0  1  0  0
PayPal    0  1  0  0  0  1  0  1
Quora     0  1  0  0  0  1  1  0
Twitter   1  1  1  1  1  0  0  1
Wikipedia 0  1  0  0  1  0  0  0
YouTube   0  1  0  1  0  1  0  0

The value at the intersection of each row and column records whether there is an edge between that pair of vertices.

For instance, there is an edge between Medium and Twitter (surprise, surprise!), so the value where their rows/columns intersect is 1. Similarly, there is no edge between Medium and PayPal, so the intersection of their rows/columns returns 0.

Encoded within the adjacency matrix are all the properties of this network — it gives us the key to start unlocking all manner of valuable insights.

For a start, summing any column (or row) gives you the degree of each vertex — i.e., how many others it is connected to. This is commonly denoted with the letter k.

Likewise, summing the degrees of every vertex and dividing by two gives you L, the number of edges (or ‘links’) in the network. The number of rows/columns gives us N, the number of vertices (or ‘nodes’) in the network.

Knowing just k, L, N and the value of each cell in the adjacency matrix A lets us calculate the modularity of any given clustering of the network.

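Here is how those quantities fall out of the adjacency matrix in a few lines of NumPy (my own sketch; the row order follows the matrix above):

import numpy as np

A = np.array([
    [0, 1, 0, 0, 0, 1, 0, 0],  # GitHub
    [1, 0, 1, 1, 1, 1, 1, 1],  # Google
    [0, 1, 0, 0, 0, 1, 0, 0],  # Medium
    [0, 1, 0, 0, 0, 1, 0, 1],  # PayPal
    [0, 1, 0, 0, 0, 1, 1, 0],  # Quora
    [1, 1, 1, 1, 1, 0, 0, 1],  # Twitter
    [0, 1, 0, 0, 1, 0, 0, 0],  # Wikipedia
    [0, 1, 0, 1, 0, 1, 0, 0],  # YouTube
])

k = A.sum(axis=1)  # degree of each vertex
L = k.sum() // 2   # number of edges: half the summed degrees
N = A.shape[0]     # number of vertices
print(k, L, N)     # degrees; L = 14 edges; N = 8 vertices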

Say we’ve clustered the network into a number of communities. We can use the modularity score to assess the ‘quality’ of this clustering.

A higher score will show we’ve split the network into ‘accurate’ communities, whereas a low score suggests our clusters are more random than insightful. The image below illustrates this.

Modularity can be calculated using the formula below:

M = (1/2L) Σ_{i,j=1}^{N} [ A_ij − (k_i k_j)/2L ] δ(c_i, c_j)

That’s a fair amount of math, but we can break it down bit by bit and it’ll make more sense.

M is of course what we’re calculating — modularity.

1/2L tells us to divide everything that follows by 2L, i.e., twice the number of edges in the network. So far, so good.

The Σ symbol tells us we’re summing up everything to the right, and lets us iterate over every row and column in the adjacency matrix A.

For those unfamiliar with sum notation, the i, j = 1 and the N work much like nested for-loops in programming. In Python, you’d write it as follows:

total = 0
for i in range(N):        # iterate over every row of A...
    for j in range(N):    # ...and every column
        ans = ...         # stuff with i and j as indices
        total += ans      # 'total' avoids shadowing Python's built-in sum

So what is #stuff with i and j in more detail?

Well, the bit in brackets tells us to subtract ( k_i k_j ) / 2L from A_ij.

A_ij is simply the value in the adjacency matrix at row i, column j.

The values of k_i and k_j are the degrees of each vertex — found by adding up the entries in row i and column j respectively. Multiplying these together and dividing by 2L gives us the expected number of edges between vertices i and j if the network were randomly shuffled up.

Overall, the term in the brackets reveals the difference between the network’s real structure and the expected structure it would have if randomly reassembled.

Playing around with the values shows that it returns its highest value when A_ij = 1, and ( k_i k_j ) / 2L is low. This means we see a higher value if there is an ‘unexpected’ edge between vertices i and j.

Finally, we multiply the bracketed term by whatever the last few symbols refer to.

The δ(c_i, c_j) is the fancy-sounding but totally harmless Kronecker-delta function. Here it is, explained in Python:

def kroneckerDelta(ci, cj):
    if ci == cj:
        return 1
    else:
        return 0

kroneckerDelta("A", "A")  # returns 1
kroneckerDelta("A", "B")  # returns 0

Yes — it really is that simple. The Kronecker-delta function takes two arguments, and returns 1 if they are identical, otherwise, zero.

This means that if vertices i and j have been put in the same cluster, then δ(c_i, c_j) = 1. Otherwise, if they are in different clusters, the function returns zero.

As we are multiplying the bracketed term by this Kronecker-delta function, we find that for the nested sum Σ, the outcome is highest when there are lots of ‘unexpected’ edges connecting vertices assigned to the same cluster.

As such, modularity is a measure of how well-clustered the graph is into separate communities.

Dividing by 2L bounds the upper value of modularity at 1. Modularity scores near to or below zero indicate the current clustering of the network is really no use. The higher the modularity, the better the clustering of the network into separate communities.

By maximising modularity, we can find the best way of clustering the network.

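Translated directly into Python, the whole formula looks something like this (a sketch: A is the NumPy adjacency matrix from the earlier snippet, and the community labels c are my reading of the colouring described above):

def modularity(A, c):
    # M = (1/2L) * sum over i, j of [A_ij - k_i*k_j/2L] * delta(c_i, c_j)
    k = A.sum(axis=1)
    L = k.sum() / 2
    N = A.shape[0]
    M = 0.0
    for i in range(N):
        for j in range(N):
            if c[i] == c[j]:  # the Kronecker-delta term
                M += A[i, j] - k[i] * k[j] / (2 * L)
    return M / (2 * L)

# 0 = reference/look-up, 1 = publishing, 2 = PayPal/YouTube (order: GH, Gl, M, P, Q, T, W, Y)
c = [1, 0, 1, 2, 0, 1, 0, 2]
print(modularity(A, c))  # roughly 0.07 for this hypothetical labelling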

Notice that we have to pre-define how the graph is clustered to find out how ‘good’ that clustering actually is.

Unfortunately, employing brute force to try out every possible way of clustering the graph to find which has the highest modularity score would be computationally impossible beyond a very limited sample size.

Combinatorics tells us that for a network of just eight vertices, there are 4140 different ways of clustering them. A network twice the size would have over ten billion possible ways of clustering the vertices.

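These counts are the Bell numbers, and you can reproduce them with the Bell-triangle recurrence (a short sketch of my own):

def bell(n):
    # Number of ways to partition n labelled items; the last element of row n is B(n)
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

print(bell(8))   # 4140
print(bell(16))  # 10480142147, i.e. over ten billion
print(bell(10))  # 115975, the ten insects from the introduction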

Doubling the network again (to a very modest 32 vertices) would give 128 septillion possible ways, and a network of eighty vertices would be cluster-able in more ways than there are atoms in the observable universe.

Instead, we have to turn to a heuristic method that does a reasonably good job at estimating the clusters that will produce the highest modularity score, without trying out every single possibility.

This is an algorithm called Fast-Greedy Modularity-Maximization, and it’s somewhat analogous to the agglomerative hierarchical clustering algorithm described above. Instead of merging according to distance, ‘Mod-Max’ merges communities according to changes in modularity.

Here’s how it goes:

Begin by initially assigning every vertex to its own community, and calculating the modularity of the whole network, M.

Step 1 requires that for each community pair linked by at least a single edge, the algorithm calculates the resultant change in modularity ΔM if the two communities were merged into one.

Step 2 then takes the pair of communities that produce the biggest increase in ΔM, which are then merged. Calculate the new modularity M for this clustering, and keep a record of it.

Repeat steps 1 and 2 — each time merging the pair of communities for which doing so produces the biggest gain in ΔM, then recording the new clustering pattern and its associated modularity score M.

Stop when all the vertices are grouped into one giant cluster. Now the algorithm checks the records it kept as it went along, and identifies the clustering pattern that returned the highest value of M. This is the returned community structure.

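This fast-greedy scheme is what NetworkX ships as greedy_modularity_communities (the Clauset-Newman-Moore algorithm); a minimal sketch on the website graph from earlier, assuming NetworkX is installed:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# The edge list read off the adjacency matrix above
G = nx.Graph([
    ("GitHub", "Google"), ("GitHub", "Twitter"),
    ("Google", "Medium"), ("Google", "PayPal"), ("Google", "Quora"),
    ("Google", "Twitter"), ("Google", "Wikipedia"), ("Google", "YouTube"),
    ("Medium", "Twitter"), ("PayPal", "Twitter"), ("PayPal", "YouTube"),
    ("Quora", "Twitter"), ("Quora", "Wikipedia"), ("Twitter", "YouTube"),
])

for community in greedy_modularity_communities(G):
    print(sorted(community))
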
Finer details

Whew! That was computationally intensive, at least for us humans.

Graph theory is a rich source of computationally challenging, often NP-hard problems — yet it also has incredible potential to provide valuable insights into complex systems and datasets.

Just ask Larry Page, whose eponymous PageRank algorithm — which helped propel Google from start-up to basically world domination in less than a generation — was based entirely in graph theory.

Community detection is a major focus of current research in graph theory, and there are plenty of alternatives to Modularity-Maximization, which while useful, does have some drawbacks.

For a start, its agglomerative approach often sees small, well-defined communities swallowed up into larger ones. This is known as the resolution limit — the algorithm will not find communities below a certain size.

Another challenge is that rather than having one distinct, easy-to-reach global peak, the Mod-Max approach actually tends to produce a wide ‘plateau’ of many similar high modularity scores — making it somewhat difficult to truly identify the absolute maximum score.

Other algorithms use different ways to define and approach community detection.

Edge-Betweenness is a divisive algorithm, starting with all vertices grouped in one giant cluster. It proceeds to iteratively remove the least ‘important’ edges in the network, until all vertices are left isolated. This produces a hierarchical structure, with similar vertices closer together in the hierarchy.

Another algorithm is Clique Percolation, which takes into account possible overlap between graph communities.

Yet another set of algorithms are based on random-walks across the graph, and then there are spectral clustering methods which start delving into the eigendecomposition of the adjacency matrix and other matrices derived therefrom. These ideas are used in feature extraction in, for example, areas such as computer vision.

It’d be well beyond the scope of this article to give each algorithm its own in-depth worked example. Suffice to say that this is an active area of research, providing powerful methods to make sense of data that even a generation ago would have been extremely difficult to process.

Conclusion

Hopefully this article has informed and inspired you to better understand how machines can make sense of data. The future is a rapidly changing place, and many of those changes will be driven by what technology becomes capable of in the next generation or two.

As outlined in the introduction, machine learning is an extraordinarily ambitious field of research, in which massively complex problems require solving in as accurate and as efficient a way possible. Tasks that come naturally to us humans require innovative solutions when taken on by machines.

There’s still plenty of progress to be made, and whoever contributes the next breakthrough idea will no doubt be generously rewarded. Maybe someone reading this article will be behind the next powerful algorithm?

All great ideas have to start somewhere!

Translated from: https://www.freecodecamp.org/news/how-machines-make-sense-of-big-data-an-introduction-to-clustering-algorithms-4bd97d4fbaba/
