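Brief introduction:

K-means clustering is an unsupervised learning method: without being given any labels, it partitions the data into a specified number of groups, K.

It places similar objects in the same cluster and dissimilar objects in different clusters. What counts as "similar" depends on the similarity measure that is chosen.

K-means finds K clusters in a given data set. It is called K-means because it finds K distinct clusters, and the center of each cluster is computed as the mean of the values in that cluster.

The number of clusters is specified by the user. Each cluster is described by its centroid, that is, the center of all the points in the cluster.

The biggest difference between clustering and classification is that in classification the target classes are known in advance, whereas in clustering they are not.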

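Idea:

Take scattered data samples like those in the first figure, and suppose we want to group them into two clusters. We start by picking the coordinates of two points at random, shown as the two cross marks in the second figure. The first step is to have every point choose a side: each point computes its distance to the two cross marks and joins the nearer one. After this first round we get the situation in the third figure. Clearly the cross marks are no longer well placed, so each side needs a new mark. The rule is: average the x-coordinates and the y-coordinates of all points on the red side, and that averaged position becomes the red side's new mark. The blue side chooses its new mark in the same way.
The new marks differ from the previous ones, don't they? That means this round of refinement had an effect.

Next comes the second round: every point again computes its distance to the two marks and joins the nearer one, which produces a new arrangement. We then move the marks to their new positions and compare them with the previous round. They have changed again, which is exactly what we expect. We keep repeating these two steps until the marks barely move, and at that point the clustering is complete.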
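The two alternating steps described above (each point joins its nearest mark, then each mark moves to the mean of its points) can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the implementation used later in this post: the helper names `assign`, `update` and `simple_kmeans`, the sample points and the fixed starting marks are all made up for the example.

```python
import math

def assign(points, marks):
    """Step 1: each point picks the index of its nearest mark."""
    return [min(range(len(marks)), key=lambda j: math.dist(p, marks[j]))
            for p in points]

def update(points, labels, marks):
    """Step 2: move each mark to the mean of the points that chose it."""
    new_marks = []
    for j, old in enumerate(marks):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:              # no point chose this mark: leave it in place
            new_marks.append(old)
            continue
        xs, ys = zip(*members)
        new_marks.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return new_marks

def simple_kmeans(points, marks, max_rounds=100):
    """Repeat the two steps until the marks stop moving."""
    labels = assign(points, marks)
    for _ in range(max_rounds):
        new_marks = update(points, labels, marks)
        if new_marks == marks:       # marks unchanged: converged
            break
        marks = new_marks
        labels = assign(points, marks)
    return marks, labels

# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
marks, labels = simple_kmeans(points, marks=[(0.0, 0.0), (5.0, 5.0)])
print(marks, labels)
```

Starting marks are fixed here so that the run is reproducible; the functions later in the post choose them at random.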

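1. Load the data

from numpy import *               # the functions below use mat, zeros, sqrt, power, inf, nonzero, mean, random, ...
import numpy as np                # used by showCluster
import matplotlib.pyplot as plt   # used by the plotting functions

# Note: mat builds a numpy matrix. NumPy 2.x removed the mat alias; asmatrix is its replacement.


def loadDataSet(fileName):
    '''
    Load a data set.
    :param fileName: path to a tab-separated text file
    :return: the data as a numpy matrix
    '''
    # Start with an empty list
    dataSet = []
    # Read the file line by line (the with-block closes it afterwards)
    with open(fileName) as fr:
        for line in fr:
            # Split each line into its fields
            curLine = line.strip().split('\t')
            # Convert every field to float for the later computations
            fltLine = list(map(float, curLine))
            dataSet.append(fltLine)
    return mat(dataSet)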

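2. Compute the distance between vectors

K-means needs the distance between each data point and each centroid. Common choices are the Euclidean distance and the Manhattan distance; the Euclidean distance is used here.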
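To make the difference between the two distances concrete, here is a small standard-library sketch. The function names `euclidean` and `manhattan` and the two sample points are assumptions chosen for illustration; they are not part of the code in this post.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences per coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, so switching the distance function can change which centroid counts as nearest.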

def distEclud(vecA, vecB):
    '''
    Euclidean distance between two vectors.
    :param vecA: first row vector
    :param vecB: second row vector
    :return: the distance as a scalar
    '''
    return sqrt(sum(power(vecA - vecB, 2)))

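3. Randomly generate k points as the initial centroids

def randCent(dataMat, k):
    '''
    Build a set of k random centroids for the given data set.
    The random centroids must lie within the bounds of the data set. This is done by
    finding the minimum and maximum of every dimension, then drawing random numbers
    between 0 and 1.0 and scaling them by the range and shifting them by the minimum.
    :param dataMat: data set
    :param k: number of centroids
    :return: a (k, n) matrix of centroids
    '''
    # Number of samples and number of features
    m, n = dataMat.shape
    # Initialise the centroids as a (k, n) matrix of zeros
    centroids = mat(zeros((k, n)))
    # Loop over the features
    for j in range(n):
        # Minimum of column j
        minJ = dataMat[:, j].min()
        # Range of column j
        rangeJ = float(dataMat[:, j].max() - minJ)
        # Random values inside [minJ, minJ + rangeJ) for this coordinate of every centroid
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids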
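The same bounding-box idea can be checked in isolation without NumPy. The sketch below is an illustrative stand-in, not the function above: `random_centroids`, its `seed` parameter and the sample points are hypothetical names and data introduced for the example.

```python
import random

def random_centroids(points, k, seed=None):
    """Draw k random centroids inside the per-feature bounding box of the data."""
    rng = random.Random(seed)
    columns = list(zip(*points))              # one tuple of values per feature
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    # lo + (hi - lo) * u with u in [0, 1) always stays within [lo, hi]
    return [tuple(lo + (hi - lo) * rng.random() for lo, hi in zip(lows, highs))
            for _ in range(k)]

points = [(1.0, 5.0), (3.0, -2.0), (2.0, 0.5)]
cents = random_centroids(points, k=2, seed=0)
print(cents)
```

Passing a seed makes a run repeatable, which is convenient when comparing clustering results.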

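4. Implementing the K-means algorithm

def kMeans(dataMat, k, distMeas=distEclud, createCent=randCent):
    '''
    Create k centroids, assign every point to its nearest centroid, then recompute the centroids.
    This is repeated until the cluster assignment of the data points no longer changes.
    :param dataMat: data set
    :param k: number of clusters
    :param distMeas: distance function
    :param createCent: function that creates the initial centroids
    :return: the centroids and the per-point assignment matrix
    '''
    # Number of samples and number of features
    m, n = dataMat.shape
    # clusterAssment stores the assignment of every point in two columns:
    # column 0 is the cluster index, column 1 is the error (the squared distance from
    # the point to its centroid, used later to judge the quality of the clustering)
    clusterAssment = mat(zeros((m, 2)))
    # Create k random centroids
    centroids = createCent(dataMat, k)
    # Flag that tells whether another iteration is needed
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # For every point, find the nearest centroid by trying all of them
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                # Distance from the point to centroid j, using distMeas (distEclud by default)
                distJI = distMeas(centroids[j, :], dataMat[i, :])
                # Keep the smallest distance and the index of that centroid
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            # If any point changes cluster, run another iteration
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            # Store the nearest centroid's index and the squared distance
            clusterAssment[i, :] = minIndex, minDist ** 2
        # print(centroids)
        # Update every centroid
        for cent in range(k):
            # Select all points currently assigned to this cluster
            ptsInClust = dataMat[nonzero(clusterAssment[:, 0].A == cent)[0]]
            # Skip empty clusters so that the centroid does not become NaN
            if len(ptsInClust) > 0:
                # axis=0 averages down the columns
                centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment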

5. Visualising the data

def showCluster(dataSet, k, clusterAssment, centroids):
    fig = plt.figure()
    plt.title("K-means")
    ax = fig.add_subplot(111)
    data = []

    # Collect the points of every cluster
    for cent in range(k):
        ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
        data.append(ptsInClust)

    # Scatter plot of the data points (only four colours and markers are listed, so k <= 4)
    for cent, c, marker in zip(range(k), ['r', 'g', 'b', 'y'], ['^', 'o', '*', 's']):
        ax.scatter(data[cent][:, 0], data[cent][:, 1], s=80, c=c, marker=marker)

    # Plot the centroids
    ax.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], s=1000, c='black', marker='+', alpha=1)
    ax.set_xlabel('X Label')
    ax.set_ylabel('Y Label')
    plt.show()

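Clustering the test data with K-means usually works well, and the clusters are clearly separated, as shown in the figure:

Sometimes, however, the result is poor, because K-means converges to a local minimum rather than the global minimum, as shown in the figure:

The two lower clusters have been almost merged into one, while the upper-left cluster has been split in two. How can this be fixed? With the bisecting K-means algorithm.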
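The local-minimum problem can be reproduced on a tiny example. The sketch below is a simplified stand-in for the kMeans function above, written with the standard library only; the function name `lloyd`, the four sample points and both sets of starting centers are assumptions chosen so that the effect is easy to see.

```python
def lloyd(points, centers, rounds=50):
    """Plain k-means from the given starting centers; returns (centers, sse)."""
    def d2(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    for _ in range(rounds):
        # Assignment step: index of the nearest center for every point
        labels = [min(range(len(centers)), key=lambda j: d2(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        new = []
        for j, c in enumerate(centers):
            mem = [p for p, lab in zip(points, labels) if lab == j]
            if mem:
                new.append((sum(p[0] for p in mem) / len(mem),
                            sum(p[1] for p in mem) / len(mem)))
            else:
                new.append(c)
        if new == centers:
            break
        centers = new
    # Sum of squared errors: squared distance of every point to its nearest center
    sse = sum(min(d2(p, c) for c in centers) for p in points)
    return centers, sse

# Four corners of a wide, flat rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
_, sse_bad = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])    # splits top from bottom
_, sse_good = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])   # splits left from right
print(sse_bad, sse_good)   # 16.0 1.0
```

Both runs have converged, since neither set of centers moves any more, yet the first one is stuck with a much larger error. Which result you get depends only on the starting centers.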

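Some notes:

In every iteration each existing cluster is tentatively split in two, and the split that reduces the sum of squared errors (SSE) the most is the one that is kept.
When kMeans() is called with the number of clusters set to 2, it returns two clusters numbered 0 and 1. The cluster numbered 0 is renumbered to the index of the cluster that was split, and the cluster numbered 1 is renumbered to the current total number of clusters, len(centList), which appends one new cluster to the existing ones.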
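The renumbering rule in the second note is easy to get wrong, so here is a tiny sketch of it on plain lists. The function `relabel_split` and its arguments are hypothetical names introduced for illustration; the real code below does the same thing with matrix indexing.

```python
def relabel_split(sub_labels, parent_index, n_clusters):
    """Map the 0/1 labels of a 2-way split back into the global numbering.

    Label 0 keeps the index of the cluster that was split; label 1 becomes
    the next free index, which equals the current number of clusters.
    """
    return [parent_index if lab == 0 else n_clusters for lab in sub_labels]

# Cluster 2 (out of 3 existing clusters: 0, 1, 2) is split in two.
print(relabel_split([0, 1, 1, 0], parent_index=2, n_clusters=3))  # [2, 3, 3, 2]
```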

def biKmeans(dataMat, k, distMeas=distEclud):
    '''
    Given a data set, the desired number of clusters and a distance function,
    return the clustering result.
    :param dataMat: data set
    :param k: desired number of clusters
    :param distMeas: distance function
    :return: the centroids and the per-point assignment matrix
    '''
    m, n = dataMat.shape
    # Matrix holding the cluster index and squared error of every point
    clusterAssment = mat(zeros((m, 2)))
    # Centroid of the whole data set; all centroids are kept in a list
    centroid0 = mean(dataMat, axis=0).tolist()[0]
    centList = [centroid0]
    # Squared error of every point with respect to that first centroid
    for j in range(m):
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataMat[j, :]) ** 2
    # Keep splitting clusters until the desired number is reached
    while (len(centList) < k):
        # Start with an infinite SSE so that the first candidate split is always accepted
        lowestSSE = inf
        # Try splitting every current cluster to find the best one to split
        for i in range(len(centList)):
            # Treat all points of cluster i as a small data set of their own
            ptsInCurrCluster = dataMat[nonzero(clusterAssment[:, 0].A == i)[0], :]
            # Run kMeans with k=2 on it: this gives two centroids and the error of every point
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            # Error of this split = error inside the split cluster + error of all other points
            sseSplit = sum(splitClustAss[:, 1])
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print('sseSplit, and notSplit: ', sseSplit, sseNotSplit)
            # Remember this split if it gives the lowest total SSE so far
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # kMeans with k=2 labels its two clusters 0 and 1.
        # Relabel 1 first (to the new index), then 0 (to the index of the split cluster).
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        # Replace the centroid of the split cluster with the first new centroid
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        # Append the second new centroid
        centList.append(bestNewCents[1, :].tolist()[0])
        # Write the new assignments and errors back for the points of the split cluster
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment

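Bisecting K-means is more stable because the initial centroids are no longer K random points. Instead, a single centroid is first computed as the mean of all the data and is then split into two in the best possible way. After that, the algorithm decides which of the existing clusters to split next (in the code above, the one whose split lowers the total SSE the most) and splits it into two. The number of clusters thus grows by one at each step, and the process stops once there are k clusters.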
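The "split one cluster at a time" strategy can be sketched on one-dimensional data, where it is easy to follow by hand. This is a deliberately simplified illustration, not a rewrite of biKmeans: the names `two_means_1d` and `bisecting_1d` are hypothetical, and the 2-way split starts from the minimum and maximum value instead of random centroids so that the output is deterministic.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors of a 1-D cluster around its own mean."""
    c = mean(values)
    return sum((v - c) ** 2 for v in values)

def two_means_1d(values, rounds=20):
    """2-way split of 1-D values; starting centers are min and max (an assumption)."""
    c0, c1 = min(values), max(values)
    left, right = list(values), []
    for _ in range(rounds):
        left = [v for v in values if abs(v - c0) <= abs(v - c1)]
        right = [v for v in values if abs(v - c0) > abs(v - c1)]
        if not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right

def bisecting_1d(values, k):
    """Start with one cluster and repeatedly keep the split with the lowest total SSE."""
    clusters = [list(values)]
    while len(clusters) < k:
        best = None
        for i, cluster in enumerate(clusters):
            if len(set(cluster)) < 2:        # nothing to split
                continue
            left, right = two_means_1d(cluster)
            rest = sum(sse(c) for j, c in enumerate(clusters) if j != i)
            total = sse(left) + sse(right) + rest
            if best is None or total < best[0]:
                best = (total, i, left, right)
        if best is None:                     # no cluster can be split further
            break
        _, i, left, right = best
        clusters[i] = left                   # label 0 keeps the old index
        clusters.append(right)               # label 1 gets the next free index
    return clusters

result = bisecting_1d([1, 2, 3, 10, 11, 12, 30, 31], k=3)
print(result)   # [[1, 2, 3], [30, 31], [10, 11, 12]]
```

The first split separates 30 and 31 from the rest; the second split is applied to the large remaining cluster because that lowers the total SSE far more than splitting the pair 30, 31.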

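At the end of the chapter, Machine Learning in Action gives a case study that shows one application of clustering: a group of friends wants to go on an outing and has picked several places to visit. They want to take a taxi to the central point of a group of places and walk from there, and they need the best plan for doing so.

Applying bisecting K-means is not difficult in itself, but I was stumped by a different question: the locations are given as latitude and longitude, so how do you compute the distance between two coordinates on the surface of the Earth?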

https://www.cnblogs.com/softfair/p/distance_of_two_latitude_and_longitude_points.html

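The author of that post explains it very thoroughly. The basic idea is to reduce the three-dimensional problem to planar ones. The diagram of the principle is shown below:

The code needs one more helper function that computes the distance between two coordinates: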

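def distSLC(vecA, vecB):
    '''
    Return the distance between two points on the surface of the Earth, in kilometres
    (6371.0 is the Earth's mean radius in km).
    Given the longitude and latitude of two points, the spherical law of cosines
    gives the distance between them.
    :param vecA: row vector [longitude, latitude] of the first point, in degrees
    :param vecB: row vector [longitude, latitude] of the second point, in degrees
    :return: the great-circle distance
    '''
    # Longitude and latitude are in degrees, but sin() and cos() expect radians.
    # Divide by 180 and multiply by pi to convert.
    a = sin(vecA[0, 1] * pi / 180) * sin(vecB[0, 1] * pi / 180)
    b = cos(vecA[0, 1] * pi / 180) * cos(vecB[0, 1] * pi / 180) * cos(pi * (vecB[0, 0] - vecA[0, 0]) / 180)
    return arccos(a + b) * 6371.0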
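The formula can be sanity-checked on its own with the standard math module. The sketch below is a standalone variant, not the helper above: the name `great_circle_km` and its argument order are assumptions, and it additionally clamps the cosine to [-1, 1], because rounding errors can push it slightly outside that range and make acos fail.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius, same constant as in distSLC

def great_circle_km(lon_a, lat_a, lon_b, lat_b):
    """Spherical law of cosines; all inputs are in degrees."""
    lat_a, lat_b = math.radians(lat_a), math.radians(lat_b)
    dlon = math.radians(lon_b - lon_a)
    cos_angle = (math.sin(lat_a) * math.sin(lat_b)
                 + math.cos(lat_a) * math.cos(lat_b) * math.cos(dlon))
    # Guard against values such as 1.0000000000000002 caused by rounding
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM

# Two points on the equator, 90 degrees of longitude apart:
# the distance should be a quarter of the Earth's circumference.
quarter = great_circle_km(0.0, 0.0, 90.0, 0.0)
print(quarter, math.pi / 2 * EARTH_RADIUS_KM)
```

Checking against a case with a known answer, such as a quarter of the circumference, is a quick way to catch a mix-up between degrees and radians or between longitude and latitude.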

Use bisecting K-means to find the centroids, then show the result on a map:

def clusterClubs(fileName, imgName, numClust=5):
    '''
    Wrap parsing of the text file, clustering and plotting in one function.
    :param fileName: path of the text data
    :param imgName: path of the map image
    :param numClust: desired number of clusters
    :return:
    '''
    datList = []
    # Fields 3 and 4 (0-based) of every line hold latitude and longitude;
    # they are stored as [longitude, latitude]
    with open(fileName) as fr:
        for line in fr:
            lineArr = line.split('\t')
            datList.append([float(lineArr[4]), float(lineArr[3])])
    datMat = mat(datList)
    # Run biKmeans with distSLC as the distance function
    myCentroids, clustAssing = biKmeans(datMat, numClust, distMeas=distSLC)
    # Create a figure and a rectangle that determines which part of the figure is drawn on
    fig = plt.figure()
    rect = [0.1, 0.1, 0.8, 0.8]
    # List of marker shapes for the scatter plot
    scatterMarkers = ['s', 'o', '^', '8', 'p', 'd', 'v', 'h', '>', '<']
    axprops = dict(xticks=[], yticks=[])
    ax0 = fig.add_axes(rect, label='ax0', **axprops)
    # imread turns the image into a matrix, imshow draws that matrix
    imgP = plt.imread(imgName)
    ax0.imshow(imgP)
    # Draw a second plot on top of the same figure; this allows two coordinate
    # systems without any scaling or shifting
    ax1 = fig.add_axes(rect, label='ax1', frameon=False)
    # Draw every cluster with a marker taken from scatterMarkers
    for i in range(numClust):
        ptsInCurrCluster = datMat[nonzero(clustAssing[:, 0].A == i)[0], :]
        # i % len(scatterMarkers) cycles through the markers when there are more clusters than markers
        markerStyle = scatterMarkers[i % len(scatterMarkers)]
        ax1.scatter(ptsInCurrCluster[:, 0].flatten().A[0], ptsInCurrCluster[:, 1].flatten().A[0], marker=markerStyle, s=90)
    # Mark the cluster centers with crosses
    ax1.scatter(myCentroids[:, 0].flatten().A[0], myCentroids[:, 1].flatten().A[0], marker='+', s=300)
    plt.show()


fileName = './kmeans/places.txt'
imgName = './kmeans/Portland.png'
clusterClubs(fileName, imgName, numClust=5)

K-Means in sklearn

`sklearn.cluster`.k_means

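sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)

Note: this signature and the parameter descriptions below come from an older scikit-learn release. In scikit-learn 1.0 and later, precompute_distances and n_jobs no longer exist, and the algorithm values "auto" and "full" have since been replaced by "lloyd".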


Parameters:

X : array-like or sparse matrix, shape (n_samples, n_features)

The observations to cluster.

n_clusters : int

The number of clusters to form as well as the number of centroids to generate.

init : {‘k-means++’, ‘random’, or ndarray, or a callable}, optional

Method for initialization, default to ‘k-means++’:

‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

‘random’: choose k observations (rows) at random from the data as the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, k and a random state and return an initialization.
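To give a feel for what the 'k-means++' option does, the sketch below shows the basic seeding idea with the standard library: the first center is picked uniformly at random, and every further center is picked with probability proportional to its squared distance from the nearest center chosen so far. This is a simplified illustration, not scikit-learn's actual implementation; the function name `kmeans_pp_init`, its `seed` parameter and the sample points are assumptions, and the data is assumed to contain at least k distinct points.

```python
import random

def kmeans_pp_init(points, k, seed=None):
    """Basic k-means++ seeding: spread the initial centers out over the data."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]           # first center: uniform random pick
    while len(centers) < k:
        # Squared distance from every point to its nearest already-chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        # Far-away points get a proportionally higher chance of being picked;
        # already-chosen points have weight 0 and cannot be picked again
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
init = kmeans_pp_init(points, k=3, seed=1)
print(init)
```

Because the starting centers tend to land in different regions of the data, runs started this way are less likely to end in a poor local minimum than runs started from purely random positions.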

precompute_distances : {‘auto’, True, False}

Precompute distances (faster but takes more memory).

‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances

n_init : int, optional, default: 10

Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, optional, default 300

Maximum number of iterations of the k-means algorithm to run.

verbose : boolean, optional

Verbosity mode.

tol : float, optional

The relative increment in the results before declaring convergence.

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

copy_x : boolean, optional

When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.

n_jobs : int

The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.

algorithm : “auto”, “full” or “elkan”, default=”auto”

K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.

return_n_iter : bool, optional

Whether or not to return the number of iterations.

Returns:

centroid : float ndarray with shape (k, n_features)

Centroids found at the last iteration of k-means.

label : integer ndarray with shape (n_samples,)

label[i] is the code or index of the centroid the i’th observation is closest to.

inertia : float

The final value of the inertia criterion (sum of squared distances to the closest centroid for all observations in the training set).

best_n_iter : int

Number of iterations corresponding to the best results. Returned only if return_n_iter is set to True.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the data set
dataMat = []
# Raw string, so that the backslashes in the Windows path are not read as escape sequences.
# Note: this is an absolute path; change it to wherever testSet.txt is on your machine.
with open(r"E:\python_code\Python_algorithm\ml\MachineLearning-master\input/10.KMeans/testSet.txt") as fr:
    for line in fr:
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))    # convert every element to float
        dataMat.append(fltLine)

# Train the model
km = KMeans(n_clusters=4)       # initialise
km.fit(dataMat)                 # fit
km_pred = km.predict(dataMat)   # predict the cluster of every sample
print(km_pred)
centers = km.cluster_centers_   # centroids

# Visualise the result
plt.scatter(np.array(dataMat)[:, 1], np.array(dataMat)[:, 0], c=km_pred)
plt.scatter(centers[:, 1], centers[:, 0], c="r")
plt.show()
