Classification of Caribbean Coral Reefs Using K-Means

Have you ever seen a Caribbean reef? Well, if you haven’t, prepare yourself.

Today, we will be answering a question that, at face value, appears quite simple: “What does a Caribbean reef look like?” However, this question can be decomposed into many complex layers. So, to avoid ambiguity, let’s refine the question to: “What are the non-mobile components of a Caribbean reef, and how are they related?”

That seems reasonable; we’ll have to look at fish another day.

Now we’re not going to roll out beautiful images of underwater cities teeming with diversity. Instead, we have bar charts. Without further ado, let’s dive in.

What Makes up a Typical Reef?

To start, we have developed a baseline graph (Figure 1) of the components of all Caribbean reefs. Here we have the median percent cover for nine substrate types. Now, if you haven’t conducted a scuba transect before, it may be helpful to break down the above sentence. First, percent cover is how coral reef composition is measured: in other words, from a bird’s-eye view, what percent of the sea floor is hard coral, sponge, rock, etc. Second, substrate types are broad categories of sea floor, such as silt or sand. If you’re curious about the sampling methods or specific substrate definitions, check this out.

Ok, so in Figure 1 we’re looking at the median value for each of the nine substrate types. For example, in the Hard Coral column, we can see that hard coral’s median percent cover is roughly 17%. Good to know.

Diving deeper into the chart, it appears that most Caribbean reefs are primarily composed of four substrate types: rock, hard coral, nutrient indicator algae (NI Algae), and sand. Together, these four categories account for 91% of the sum of the median values. On the other hand, recently killed coral (RK Coral) and silt both have median values of 0, so they’re relatively rare.

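If you want to see how a number like that 91% comes about, here is a minimal sketch. The survey values, the substrate names, and the helper functions `median_cover` and `top_share` are all made up for illustration; they are not the Reef Check data or the original code.

```python
from statistics import median

# Hypothetical percent-cover surveys (one dict per transect); not Reef Check data.
surveys = [
    {"rock": 30, "hard_coral": 20, "ni_algae": 25, "sand": 15, "sponge": 5, "silt": 0},
    {"rock": 40, "hard_coral": 15, "ni_algae": 10, "sand": 25, "sponge": 10, "silt": 0},
    {"rock": 20, "hard_coral": 25, "ni_algae": 30, "sand": 20, "sponge": 0, "silt": 5},
]

def median_cover(rows):
    """Median percent cover per substrate across all surveys."""
    return {name: median(row[name] for row in rows) for name in rows[0]}

def top_share(medians, n):
    """Fraction of the summed medians contributed by the n largest substrates."""
    ranked = sorted(medians.values(), reverse=True)
    return sum(ranked[:n]) / sum(ranked)

medians = median_cover(surveys)
share = top_share(medians, 4)
print(medians)
print(f"top-4 share of summed medians: {share:.0%}")
```

Note that medians are taken per substrate, so they need not add up to 100% the way a single survey does.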

We have learned that Caribbean reefs are rocky and sandy. Lovely.

But here’s an alarming analogy: the average number of children per US family is 1.93. If we take that number to be representative of the data, we might conclude that most families have 1.93 children, which I find hard to believe. Even worse, we have no understanding of the underlying distribution that led to an average of 1.93. There could be one family with 94 children and 99 families with one child each. Instead, it would be useful to see if there are common counts for the number of kids per family.

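To make the point concrete, here is a small sketch with two invented sets of 100 families. Both are hypothetical and only chosen so that they share a mean of 1.93; the helper `summarize` is likewise just for illustration.

```python
from collections import Counter
from statistics import mean

# Two hypothetical sets of 100 families with the same mean number of children.
skewed = [94] + [1] * 99  # one huge family, 99 one-child families
typical = [0] * 20 + [1] * 15 + [2] * 35 + [3] * 16 + [4] * 10 + [5] * 4

def summarize(families):
    """Return the mean and the most common number of children."""
    most_common_count = Counter(families).most_common(1)[0][0]
    return mean(families), most_common_count

for name, families in (("skewed", skewed), ("typical", typical)):
    avg, mode = summarize(families)
    print(f"{name}: mean={avg:.2f}, most common={mode}")
```

Same average, very different families, which is why looking for common groupings tells us more than a single summary number.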

K-Means Demo

Applying this logic to reef composition, we will explore whether there are groups of coral reefs using the above substrate categories. This is where unsupervised classification comes into play. Unsupervised algorithms fit data where we don’t know the “correct” answer, and one of the simplest methods of all is the k-means algorithm.

Without getting too technical, k-means attempts to split data into k clusters. The algorithm does this by minimizing the squared distance from the center of each cluster (the cluster mean) to all points in that cluster. And because of this simple fitting criterion, it’s really easy to interpret. So let’s see an example…

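For the curious, the idea can be sketched in a few lines of plain Python. This is the textbook version (Lloyd's algorithm) with random starting centers, not necessarily the code behind the charts in this post, and the `reefs` values are invented.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm: returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # nothing moved, so the algorithm has converged
        centers = new_centers
    return centers, labels

# Hypothetical (hard coral %, NI algae %) pairs, not Reef Check data.
reefs = [(30, 5), (28, 8), (35, 4), (5, 40), (8, 35), (4, 45)]
centers, labels = kmeans(reefs, k=2)
print(centers)
print(labels)
```

The two steps inside the loop are the whole algorithm: assign every point to its nearest center, then move each center to the mean of its points, and repeat until nothing changes.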

[Figure 2. Source: Reef Check.]

In Figure 2 we have created two clusters (k=2 in this case) using two substrate categories: hard coral and nutrient indicator algae. As you can see, there appears to be a clear divide between the two clusters. But let’s not get into interpretation quite yet.

Instead, let’s consider the case where we add another variable. Here, the k-means algorithm would categorize each point using three dimensions instead of two. But as you increase the number of dimensions, you lose the ability to visualize; it’s pretty hard to think in five or eight dimensions. However, we can still see where the cluster centers are numerically located in hyperspace.

Now that we have a basic understanding of what k-means does, let’s move on to the interesting graphs.

Top 4 Substrate Types (k=3)

In Figure 3 (below) we have fit three clusters (k=3) using the four most prevalent substrate types. Each bar represents a substrate category. The height of each bar represents the difference between the cluster mean and the overall mean for that substrate. Blue bars correspond to a cluster mean greater than the entire category’s mean; conversely, red bars correspond to a cluster mean less than the entire category’s mean.

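The bar heights are simple to compute once each reef has a cluster label. Here is a rough sketch, assuming hypothetical percent-cover rows and labels; the function name `cluster_deviations` is mine, not from the original code.

```python
from statistics import mean

def cluster_deviations(rows, labels):
    """For each cluster, return {substrate: cluster mean - overall mean}."""
    names = list(rows[0])
    overall = {n: mean(r[n] for r in rows) for n in names}
    result = {}
    for lab in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == lab]
        result[lab] = {n: mean(m[n] for m in members) - overall[n] for n in names}
    return result

# Hypothetical percent-cover rows and cluster labels, for illustration only.
rows = [
    {"sand": 50, "rock": 20}, {"sand": 40, "rock": 30},
    {"sand": 10, "rock": 60}, {"sand": 20, "rock": 50},
]
labels = [0, 0, 1, 1]

devs = cluster_deviations(rows, labels)
for lab, by_substrate in devs.items():
    for name, d in by_substrate.items():
        colour = "blue" if d > 0 else "red"
        print(f"cluster {lab} {name}: {d:+.1f} ({colour})")
```

A positive value would be drawn as a blue bar and a negative one as a red bar.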

[Figure 3. Source: Reef Check.]

When classifying Caribbean reefs into three clusters, there appear to be sensible groupings: sand-dominated, rock-dominated, and algae-dominated. Interestingly, hard coral showed relatively little change even though it was the second most abundant substrate category. Conversely, nutrient indicator algae, which are often found on degraded reefs, had an extremely high signal relative to their abundance.

We can also observe that sand-dominated reefs allowed for the highest quantity of hard coral at roughly 10 percentage points more than the total data average. Rock-dominated reefs were net positive but had little impact on hard corals. And finally, as most people would expect, the evil nutrient indicator algae appears to have a fairly strong negative impact on all other substrate types.

Ok, we’re starting to get somewhere. Now let’s increase the number of substrate types by including all categories that had a median value greater than zero: only silt and recently killed coral were not included.

Non-Zero-Median Substrate Types (k=3)

[Figure 4. Source: Reef Check.]

In Figure 4 it appears the categories we found above hold steady. Sand/rubble-dominated reefs seem to support the most life, with above-average values in hard coral, soft coral, and sponge. Rocky reefs also exhibit life-supporting ability, although less than their sandy counterparts. And finally, nutrient indicator algae reefs show below-average percent cover in all other substrate types observed.

Now you might be wondering what the deal is with NI Algae. Well, nutrient indicator algae are often found on degraded reefs because they thrive in waters with elevated nutrient levels, such as nitrogen and phosphorus; Reef Check added this category to monitor the infamous algal blooms. Conversely, these high levels of nutrients can be harmful to corals. Thus, we would expect to see an inverse relationship between nutrient indicator algae and the other living substrate types, namely sponges, soft corals, and hard corals.

This stuff is pretty cool.

Fitting Using the Non-Zero Substrate Values (k=4)

In our final chart, we will try increasing the number of clusters to four because who’s to say there are only three types of Caribbean reefs? Well, technically there are statistical methods to show reasonable values that k can take. In this case the elbow method was implemented and three to five clusters were deemed sensible.

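In case the elbow method is new to you: fit k-means for a range of k, record the total squared distance from each point to its cluster center (often called inertia), and look for the k where the improvement levels off. Below is a minimal sketch on made-up 2-D data. The farthest-point starting rule is my own choice to keep the example deterministic, and the helper names are hypothetical; none of this is the original analysis code.

```python
import math
import random

def farthest_point_init(points, k):
    """Deterministic start: the first point, then repeatedly the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def inertia_for_k(points, k, iterations=50):
    """Run Lloyd's algorithm and return the within-cluster sum of squared distances."""
    centers = farthest_point_init(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (an empty group keeps its center).
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

# Hypothetical 2-D data with three loose groups, for illustration only.
rng = random.Random(1)
data = [(cx + rng.gauss(0, 2), cy + rng.gauss(0, 2))
        for cx, cy in [(10, 10), (40, 15), (25, 45)] for _ in range(30)]

inertias = {k: inertia_for_k(data, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia should fall steeply up to k=3 and then flatten out, and that bend is the "elbow". On real survey data the bend is usually softer, which is how a range such as three to five can end up looking sensible.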

[Figure 5. Source: Reef Check.]

As shown in Figure 5, and as expected, a fourth category has emerged. Boasting extremely high values of hard and soft corals, this coral-dominated category appears to be the “healthiest” of the four.

Now why did increasing the number of clusters suddenly create this magical healthy reef category? Well, with only three clusters, the high levels of hard and soft corals were lumped into the sand-dominated and rock-dominated classifications. By allowing for a fourth category, the data could be subset more cleanly.

In a similar vein, why can’t we conclude that there are five types of reefs? To answer your outstanding question, k-means with k=5 was plotted; however, the categories created were not intuitive. Moreover, because four central substrate categories compose 91% of the median total, limiting to four clusters is intuitive.

Ok, final question: how can we tell whether three or four clusters is better? Another outstanding question, but unfortunately there isn’t a clear answer.

From an ecological perspective, there is no reason why rock and sand-dominated reefs can’t support corals and sponges, which argues for k=3. It’s also simpler. However, by creating four clusters we can develop clear-cut classifications that appear to correspond to health, which argues for k=4. Those categories are:

  1. High health: coral-dominated
  2. Medium health: sand/rubble-dominated, rock-dominated
  3. Low health: algae-dominated

As with many applied statistics problems, humans have to make judgement calls based on subject-matter knowledge. Here, there are good arguments for both k=3 and k=4.

Conclusion

I’m glad you now understand why bar charts are superior to pretty pictures. Even though you have no idea what a Caribbean reef looks like, you have a better understanding of what makes up a Caribbean reef (which is pretty cool).

What else can we conclude?

  1. Caribbean reefs tend to be dominated by sand, rock, hard coral, and nutrient indicator algae. However, ratios differ greatly at the tails of the distributions.
  2. One of the most consistent reef classifications was algae-dominated reefs. Algal blooms tend to occur in areas with high levels of sunlight, nutrients, and CO2 (the nutrient enrichment that drives them is called eutrophication), so from an ecological standpoint, it makes sense that coral cover would have an inverse relationship with algae. That being said, further research is required, specifically a species breakdown of the NI algae.
  3. All classifications that are not dominated by nutrient indicator algae have the ability to support coral. That being said, sand-dominated reefs show a higher “life capacity” than rock-dominated reefs.

Got any other ideas?

Sources

  • Algae can function as indicators of water pollution. (n.d.). Retrieved August 21, 2020, from http://www.walpa.org/waterline/june-2012/algae-can-function-as-indicators-of-water-pollution/

  • Barott, K. L., Rodriguez-Mueller, B., Youle, M., Marhaver, K. L., Vermeij, M. J., Smith, J. E., & Rohwer, F. L. (2011). Microbial to reef scale interactions between the reef-building coral Montastraea annularis and benthic algae. Proceedings of the Royal Society B: Biological Sciences, 279(1733), 1655–1664. doi:10.1098/rspb.2011.2155

  • Duffin, P. (2020, January 13). Average number of own children per family, U.S. Retrieved August 20, 2020, from https://www.statista.com/statistics/718084/average-number-of-own-children-per-family/

The data were collected by Reef Check, a coral conservation non-profit that trains volunteer divers to collect marine data. There were 1576 unique entries for the Caribbean, ranging from 1997-05-24 to 2019-08-24. The date of the dive was not taken into account; however, in future iterations it would be interesting to see how these cluster centers change over time. The only modification to the traditional k-means algorithm was the inclusion of weights that correspond to the median percent cover of each substrate category.

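One possible reading of that weighting, sketched below, is a per-feature weight inside the distance calculation, so that differences in abundant substrates count for more than differences in rare ones. This is an assumption about how the weights might be applied, not a description of the linked code, and the weights, centers, and points are invented.

```python
def weighted_sq_dist(point, center, weights):
    """Squared Euclidean distance with a per-feature weight on each term."""
    return sum(w * (x - c) ** 2 for x, c, w in zip(point, center, weights))

def assign(points, centers, weights):
    """Assignment step of k-means using the weighted distance."""
    return [
        min(range(len(centers)), key=lambda j: weighted_sq_dist(p, centers[j], weights))
        for p in points
    ]

# Hypothetical values: features are (rock %, sponge %); weights mimic median cover.
weights = (0.30, 0.05)
centers = [(30, 5), (50, 20)]
points = [(38, 18), (48, 6)]

print(assign(points, centers, weights))  # weighted assignment
print(assign(points, centers, (1, 1)))   # unweighted, for comparison
```

In this toy case the first point switches clusters once the rock dimension is weighted more heavily, which shows how such weights can change the groupings.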

Here is the code.

Note: These are my findings. If you would like to contact me, leave a message here. All criticisms are welcome.

Original article: https://medium.com/data-diving/classification-of-caribbean-coral-reefs-using-k-means-51a66997a989
