Clustering

kmeans (k-means clustering)

Finds the centers of clusters and groups the input samples around them.

C++: double kmeans(InputArray data, int K, InputOutputArray bestLabels, TermCriteria criteria, int attempts, int flags, OutputArray centers=noArray())

Python:cv2.kmeans(data, K, criteria, attempts, flags[, bestLabels[, centers]]) → retval, bestLabels, centers

C: int cvKMeans2(const CvArr* samples, int cluster_count, CvArr* labels, CvTermCriteria termcrit, int attempts=1, CvRNG* rng=0, int flags=0, CvArr* _centers=0, double* compactness=0)

Python:cv.KMeans2(samples, nclusters, labels, termcrit, attempts=1, flags=0, centers=None) → float


Parameters:

samples – Floating-point matrix of input samples, one row per sample.
data – Data for clustering.
cluster_count – Number of clusters to split the set by.
K – Number of clusters to split the set by.
labels – Input/output integer array that stores the cluster indices for every sample.
criteria – The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy. The accuracy is specified as criteria.epsilon. As soon as each of the cluster centers moves by less than criteria.epsilon on some iteration, the algorithm stops.
termcrit – The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy.
attempts – Number of times the algorithm is executed using different initial labellings. The algorithm returns the labels that yield the best compactness (see the last function parameter).
rng – CvRNG state initialized by RNG().
flags – Flag that can take the following values:
- KMEANS_RANDOM_CENTERS Select random initial centers in each attempt.
- KMEANS_PP_CENTERS Use kmeans++ center initialization by Arthur and Vassilvitskii [Arthur2007].
- KMEANS_USE_INITIAL_LABELS During the first (and possibly the only) attempt, use the user-supplied labels instead of computing them from the initial centers. For the second and further attempts, use random or semi-random centers. Use one of the KMEANS_*_CENTERS flags to specify the exact method.
centers – Output matrix of the cluster centers, one row per each cluster center.
_centers – Output matrix of the cluster centers, one row per each cluster center.
compactness – The returned value that is described below.
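
The termination rule described for criteria (stop at the maximum number of iterations, or once every center moves by less than criteria.epsilon) can be sketched in plain C++ as follows. This is only an illustration of the documented rule, not OpenCV's internal code; the names Center, allCentersSettled and shouldStop are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 2-D center type, used only for this illustration.
struct Center { double x, y; };

// True when every center moved by less than epsilon between two iterations.
bool allCentersSettled(const std::vector<Center>& before,
                       const std::vector<Center>& after,
                       double epsilon) {
    assert(before.size() == after.size());
    for (std::size_t k = 0; k < before.size(); ++k) {
        const double shift = std::hypot(after[k].x - before[k].x,
                                        after[k].y - before[k].y);
        if (shift >= epsilon) return false;  // this center is still moving
    }
    return true;
}

// Combined rule: iteration budget exhausted OR all centers have settled.
bool shouldStop(int iteration, int maxIterations,
                const std::vector<Center>& before,
                const std::vector<Center>& after,
                double epsilon) {
    return iteration >= maxIterations ||
           allCentersSettled(before, after, epsilon);
}
```

With an epsilon of 1.0, a center that moved by 0.5 counts as settled; with an epsilon of 0.25 it does not, so the loop would continue until the iteration limit.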

The function kmeans implements a k-means algorithm that finds the centers of cluster_count clusters and groups the input samples around the clusters. As an output, labels_i contains a 0-based cluster index for the sample stored in the i-th row of the samples matrix.

The function returns the compactness measure, which is computed as

$$\sum_i \| \texttt{samples}_i - \texttt{centers}_{\texttt{labels}_i} \|^2$$

after every attempt. The best (minimum) value is chosen, and the corresponding labels and compactness value are returned by the function. Alternatively, you can use only the core of the function: set the number of attempts to 1, initialize the labels each time using a custom algorithm, pass them with the flags = KMEANS_USE_INITIAL_LABELS flag, and then choose the best (most compact) clustering yourself.
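
As a rough illustration of the compactness measure (the sum of squared distances from each sample to the center it is assigned to), here is a minimal standard-library sketch. The Point2 type and the function names are hypothetical and are not part of OpenCV.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D point type, used only for this illustration.
struct Point2 { double x, y; };

// Squared Euclidean distance between two points.
double squaredDistance(const Point2& a, const Point2& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Sum over all samples of the squared distance to the assigned center.
// labels[i] is the 0-based cluster index of samples[i].
double compactness(const std::vector<Point2>& samples,
                   const std::vector<int>& labels,
                   const std::vector<Point2>& centers) {
    assert(samples.size() == labels.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        const int k = labels[i];
        assert(k >= 0 && static_cast<std::size_t>(k) < centers.size());
        sum += squaredDistance(samples[i], centers[k]);
    }
    return sum;
}
```

When several attempts are run, the attempt with the smallest value of this sum is the one whose labels would be kept.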

Note

An example on K-means clustering can be found at opencv_source_code/samples/cpp/kmeans.cpp
(Python) An example on K-means clustering can be found at opencv_source_code/samples/python2/kmeans.py

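Assuming that each cluster is represented by a single center point, we can derive the objective function that k-means optimizes. Suppose we have N data points to be divided into K clusters; what k-means does is minimize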

$$J = \sum_{n=1}^N\sum_{k=1}^K r_{nk} \|x_n-\mu_k\|^2$$

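where $r_{nk}$ is 1 if data point n is assigned to cluster k and 0 otherwise. Finding $r_{nk}$ and $\mu_k$ that minimize $J$ directly is not easy, but we can proceed iteratively. First fix $\mu_k$ and choose the optimal $r_{nk}$: it is easy to see that assigning each data point to its nearest center makes $J$ as small as possible. Next fix $r_{nk}$ and find the optimal $\mu_k$. Differentiating $J$ with respect to $\mu_k$ and setting the derivative to zero shows that, at the minimum of $J$, $\mu_k$ must satisfy: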

$$\mu_k=\frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}$$

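In other words, $\mu_k$ should be the mean of all data points in cluster k. Because each of the two steps minimizes $J$ with respect to one set of variables, $J$ can only decrease (or stay the same) and never increase, which guarantees that k-means eventually reaches a local minimum. Although k-means is not guaranteed to find the global optimum, this is already a good result for a problem of this kind and an algorithm of this complexity.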
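
To make the differentiation step explicit, here is a short worked derivation using the same symbols ($J$, $r_{nk}$, $x_n$, $\mu_k$). It only fills in the intermediate algebra:

$$\frac{\partial J}{\partial \mu_k} = -2\sum_{n=1}^N r_{nk}(x_n-\mu_k) = 0 \;\Longrightarrow\; \sum_{n=1}^N r_{nk}x_n = \mu_k\sum_{n=1}^N r_{nk} \;\Longrightarrow\; \mu_k=\frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}$$

The last step divides by $\sum_n r_{nk}$, so it assumes that at least one data point is assigned to cluster k.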

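The concrete steps of the k-means algorithm can be summarized as follows:

1. Choose initial values for the K centers. This is usually done with a problem-specific heuristic or, in most cases, by random selection. Because k-means does not guarantee a global optimum, and whether it converges to one depends heavily on the initial values, it is common to run k-means several times with different initial values and keep the best result.
2. Assign each data point to the cluster whose center is nearest to it.
3. Compute the new center of each cluster using the formula for $\mu_k$ above, that is, the mean of the points assigned to it.
4. Repeat from step 2 until the maximum number of iterations is reached or the value of $J$ changes by less than a threshold between consecutive iterations.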
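
The steps above can be sketched in standard C++ without OpenCV. This is a minimal illustration under stated assumptions, not the OpenCV implementation: the initial centers are supplied by the caller (step 1), an empty cluster simply keeps its previous center, and the names Pt, KMeansResult, sqDist and runKMeans are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 2-D point type, used only for this sketch.
struct Pt { double x, y; };

struct KMeansResult {
    std::vector<Pt> centers;  // final cluster centers (mu_k)
    std::vector<int> labels;  // cluster index of each data point
    double J;                 // objective for the returned centers and labels
    int iterations;           // number of assign/update rounds performed
};

double sqDist(const Pt& a, const Pt& b) {
    const double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// `centers` holds the caller-chosen initial centers (step 1).
KMeansResult runKMeans(const std::vector<Pt>& data, std::vector<Pt> centers,
                       int maxIter, double threshold) {
    assert(!data.empty() && !centers.empty() && maxIter > 0);
    std::vector<int> labels(data.size(), 0);
    double prevJ = std::numeric_limits<double>::max();
    double J = prevJ;
    int iter = 0;
    while (iter < maxIter) {
        ++iter;
        // Step 2: assign every point to its nearest center.
        for (std::size_t i = 0; i < data.size(); ++i) {
            int bestK = 0;
            double best = sqDist(data[i], centers[0]);
            for (std::size_t k = 1; k < centers.size(); ++k) {
                const double d = sqDist(data[i], centers[k]);
                if (d < best) { best = d; bestK = static_cast<int>(k); }
            }
            labels[i] = bestK;
        }
        // Step 3: move each center to the mean of its assigned points.
        std::vector<Pt> sums(centers.size(), Pt{0.0, 0.0});
        std::vector<std::size_t> counts(centers.size(), 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            sums[labels[i]].x += data[i].x;
            sums[labels[i]].y += data[i].y;
            ++counts[labels[i]];
        }
        for (std::size_t k = 0; k < centers.size(); ++k) {
            if (counts[k] == 0) continue;  // assumption: keep the old center
            centers[k].x = sums[k].x / static_cast<double>(counts[k]);
            centers[k].y = sums[k].y / static_cast<double>(counts[k]);
        }
        // Evaluate J for the updated centers and the current labels.
        J = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i)
            J += sqDist(data[i], centers[labels[i]]);
        // Step 4: stop once J improved by less than the threshold.
        if (prevJ - J < threshold) break;
        prevJ = J;
    }
    return KMeansResult{centers, labels, J, iter};
}
```

To mirror the "run several times and keep the best" advice from step 1, a caller could invoke runKMeans with different initial centers and keep the result with the smallest J.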

OpenCV implementation:

#include "opencv2/highgui/highgui.hpp"
#include "opencv2/core/core.hpp"
#include <iostream>

using namespace cv;
using namespace std;

int main( int /*argc*/, char** /*argv*/ )
{
    const int MAX_CLUSTERS = 5;
    // At most 5 clusters are generated, so 5 colors are enough.
    Scalar colorTab[] =
    {
        Scalar(0, 0, 255),
        Scalar(0, 255, 0),
        Scalar(255, 100, 100),
        Scalar(255, 0, 255),
        Scalar(0, 255, 255)
    };

    Mat img(500, 500, CV_8UC3);
    RNG rng(12345); // random number generator

    for(;;)
    {
        int k, clusterCount = rng.uniform(2, MAX_CLUSTERS+1);
        int i, sampleCount = rng.uniform(1, 1001);
        // The samples: a 2-channel column vector whose elements are Point2f.
        Mat points(sampleCount, 1, CV_32FC2), labels;

        clusterCount = MIN(clusterCount, sampleCount);
        // Stores the cluster centers found by kmeans.
        Mat centers(clusterCount, 1, points.type());

        /* generate random sample from multigaussian distribution */
        for( k = 0; k < clusterCount; k++ )
        {
            Point center;
            center.x = rng.uniform(0, img.cols);
            center.y = rng.uniform(0, img.rows);
            // The samples may not divide evenly; the last cluster takes
            // whatever remains.
            Mat pointChunk = points.rowRange(k*sampleCount/clusterCount,
                                             k == clusterCount - 1 ? sampleCount :
                                             (k+1)*sampleCount/clusterCount);
            // Every cluster has the same variance; only the mean differs.
            rng.fill(pointChunk, CV_RAND_NORMAL, Scalar(center.x, center.y),
                     Scalar(img.cols*0.05, img.rows*0.05));
        }

        // Shuffle the points before clustering. Note that points and
        // pointChunk share the same data.
        randShuffle(points, 1, &rng);

        // Run 3 attempts and keep the best one; centers are initialized
        // with the kmeans++ method.
        kmeans(points, clusterCount, labels,
               TermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0),
               3, KMEANS_PP_CENTERS, centers);

        img = Scalar::all(0);

        for( i = 0; i < sampleCount; i++ )
        {
            int clusterIdx = labels.at<int>(i);
            Point ipt = points.at<Point2f>(i);
            circle( img, ipt, 2, colorTab[clusterIdx], CV_FILLED, CV_AA );
        }

        imshow("clusters", img);

        char key = (char)waitKey(); // wait indefinitely for a key press
        if( key == 27 || key == 'q' || key == 'Q' ) // 'ESC'
            break;
    }

    return 0;
}
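
The rowRange arithmetic in the sample decides how many generated points go to each cluster. The following standalone sketch reproduces only that index calculation, so the split can be inspected without OpenCV; chunkRanges is a hypothetical helper name, not an OpenCV function.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the half-open row range [begin, end) for each cluster, using the
// same integer arithmetic as the rowRange call in the sample.
std::vector<std::pair<int, int> > chunkRanges(int sampleCount, int clusterCount) {
    assert(sampleCount >= clusterCount && clusterCount > 0);
    std::vector<std::pair<int, int> > ranges;
    for (int k = 0; k < clusterCount; ++k) {
        const int begin = k * sampleCount / clusterCount;
        // The last cluster ends at sampleCount and so absorbs any remainder.
        const int end = (k == clusterCount - 1)
                            ? sampleCount
                            : (k + 1) * sampleCount / clusterCount;
        ranges.push_back(std::make_pair(begin, end));
    }
    return ranges;
}
```

For example, 10 samples and 3 clusters give the ranges [0, 3), [3, 6) and [6, 10), so the last cluster receives 4 points.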

References: http://www.cnblogs.com/tornadomeet/archive/2012/11/23/2783709.html

http://blog.csdn.net/heavendai/article/details/7029465
