Goal

In this chapter, we will understand the concepts of K-Means clustering and how the algorithm works.

Theory

We will explain this with a commonly used example.

T-shirt size problem

Consider a company that is going to release a new T-shirt model to the market. Obviously, it will have to manufacture the model in different sizes to fit people of all sizes. So the company collects data on people's heights and weights and plots them on a graph, as below:

image

The company can't make T-shirts in every possible size. Instead, it divides people into Small, Medium and Large, and manufactures only these three sizes, which should fit everyone. This grouping of people into three groups can be done by k-means clustering, and the algorithm gives us the best three sizes to satisfy all the people. If it doesn't, the company can divide people into more groups, maybe five, and so on. See the image below:

image

How does it work?

This algorithm is an iterative process. We will explain it step-by-step with the help of images.

Consider a set of data as below (you can think of it as the T-shirt problem). We need to cluster this data into two groups.

image

Step 1 - The algorithm randomly chooses two centroids, C1 and C2 (sometimes two of the data points are taken as the initial centroids).
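Step 1 can be sketched in a few lines of plain Python. This is only an illustration: the (height, weight) values are made up, and `init_centroids` is a hypothetical helper name, not part of any library. It uses the variant mentioned above, where existing data points are picked as the initial centroids.

```python
import random

# Hypothetical (height in cm, weight in kg) samples, invented for illustration.
data = [(150.0, 50.0), (155.0, 54.0), (160.0, 58.0),
        (175.0, 75.0), (180.0, 80.0), (185.0, 86.0)]

def init_centroids(points, k, seed=None):
    """Pick k distinct data points to serve as the initial centroids."""
    rng = random.Random(seed)  # a seed makes the choice repeatable
    return rng.sample(points, k)

c1, c2 = init_centroids(data, k=2, seed=0)
print("C1 =", c1, "C2 =", c2)
```

Because the starting centroids are random, different seeds can lead to different intermediate steps, which is one reason initialization strategies are a common refinement of the algorithm.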

Step 2 - It calculates the distance from each point to both centroids. If a data point is closer to C1, it is labelled '0'. If it is closer to C2, it is labelled '1' (if there are more centroids, the labels continue with '2', '3', etc.).
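A minimal sketch of Step 2, assuming Euclidean distance and made-up (height, weight) data; `assign_labels` is a hypothetical helper name. Each point gets the index of its nearest centroid, so index 0 corresponds to C1 and index 1 to C2.

```python
import math

def assign_labels(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        # On a tie, index() returns the first (lowest-numbered) centroid.
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical data and centroids, invented for illustration.
data = [(150.0, 50.0), (155.0, 54.0), (160.0, 58.0),
        (175.0, 75.0), (180.0, 80.0), (185.0, 86.0)]
centroids = [(155.0, 54.0), (180.0, 80.0)]  # C1, C2

labels = assign_labels(data, centroids)
print(labels)  # [0, 0, 0, 1, 1, 1]
```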

In our case, we will color all points labelled '0' red and all points labelled '1' blue. So we get the following image after the above operations.

image

Step 3 - Next we calculate the average of all the blue points and of all the red points separately, and these averages become our new centroids. That is, C1 and C2 shift to the newly calculated centroids. (Remember, the images shown do not use true values and are not to scale; they are for demonstration only.)
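Step 3 amounts to a per-cluster mean. The sketch below assumes the labels from Step 2 are already available; the data values and the `update_centroids` name are hypothetical, and raising an error on an empty cluster is just one possible choice.

```python
def update_centroids(points, labels, k):
    """Return the coordinate-wise mean of the points assigned to each label."""
    centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: treat an empty cluster as an error in this sketch.
            raise ValueError(f"cluster {j} has no points")
        # zip(*members) groups the coordinates dimension by dimension.
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids

# Hypothetical data and labels, invented for illustration.
data = [(150.0, 50.0), (155.0, 54.0), (160.0, 58.0),
        (175.0, 75.0), (180.0, 80.0), (185.0, 86.0)]
labels = [0, 0, 0, 1, 1, 1]

new_c1, new_c2 = update_centroids(data, labels, k=2)
print("new C1 =", new_c1)
print("new C2 =", new_c2)
```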

Then perform Step 2 again with the new centroids and label the data '0' and '1'.

So we get the result below:

image

Now Step 2 and Step 3 are iterated until both centroids converge to fixed points. *(Or the iteration may be stopped according to criteria we provide, such as a maximum number of iterations or a specific accuracy being reached.)* These points are such that the sum of distances between the data points and their corresponding centroids is minimal. Put simply, the sum of distances C1↔Red_Points and C2↔Blue_Points is minimal (distance here is typically the squared Euclidean distance, which is the quantity that the averaging in Step 3 minimizes).

\[ \text{minimize} \left[ J = \sum_{\text{all Red\_Points}} \text{distance}(C1, \text{Red\_Point}) + \sum_{\text{all Blue\_Points}} \text{distance}(C2, \text{Blue\_Point}) \right] \]
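Putting the steps together, the whole loop might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: the data is made up, `kmeans`, `max_iter` and `tol` are hypothetical names, J is computed with squared Euclidean distance, and an empty cluster simply keeps its previous centroid.

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-6, seed=None):
    """Alternate Step 2 and Step 3 until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: random data points
    labels = []
    for _ in range(max_iter):
        # Step 2: label each point with the index of its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: move each centroid to the mean of its points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # assumption: keep old one
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # stopping criterion: centroids have converged
            break
    # Objective J, assuming squared Euclidean distance.
    J = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, J

# Hypothetical (height in cm, weight in kg) samples, invented for illustration.
data = [(150.0, 50.0), (155.0, 54.0), (160.0, 58.0),
        (175.0, 75.0), (180.0, 80.0), (185.0, 86.0)]

labels, centroids, J = kmeans(data, k=2, seed=0)
print(labels, centroids, J)
```

Which group ends up labelled '0' depends on the random start, so only the grouping itself should be compared between runs, not the label numbers.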

The final result looks roughly like this:

image

So this is just an intuitive understanding of K-Means clustering. For more details and the mathematical explanation, please read any standard machine learning textbook or check the links in Additional Resources. This is only the top layer of K-Means clustering. There are many modifications to this algorithm, such as how to choose the initial centroids, how to speed up the iteration process, etc.

Additional Resources

  1. Machine Learning Course, Video lectures by Prof. Andrew Ng (Some of the images are taken from this)

Exercises

Reposted from: https://my.oschina.net/u/2306127/blog/626531