K-Means Clustering: Principles and Annotated Code

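0. How the algorithm works:

The process in brief:
a: Start from the raw data
b: Choose the initial centroids
c: Assign each point to its nearest centroid
d: Average the points in each cluster and move the centroid to that mean
e: Reassign the points to the updated centroids
f: Update the centroids again, repeating until the iteration limit is reached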
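
Steps c and d can be sketched as a single assign-and-update pass in NumPy. This is a minimal illustration, not the class defined later in this post; the function name `kmeans_step` and the four toy points are made up for the example.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One K-means pass: assign each point to its nearest centroid, then re-average."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the nearest centroid for every point
    labels = np.argmin(dists, axis=1)
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        # Leave a centroid where it is if no point was assigned to it
        if np.any(labels == k):
            new_centroids[k] = points[labels == k].mean(axis=0)
    return labels, new_centroids

# Hypothetical toy data: two points near the origin, two near (10, 10)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])
labels, centroids = kmeans_step(points, centroids)
print(labels)     # [0 0 1 1]
print(centroids)
```

Calling `kmeans_step` repeatedly with its own output corresponds to steps e and f.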

1. Code implementation:

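# K-means tutorial
# 0. Import dependencies
import numpy as np
import matplotlib.pyplot as plt
# Generate clustering data directly with sklearn
# (the old sklearn.datasets.samples_generator module is gone from recent scikit-learn releases)
from sklearn.datasets import make_blobs

# 1. Load the data
x, y = make_blobs(n_samples=100, centers=6, random_state=1234, cluster_std=0.6)
# make_blobs generates a data set for clustering
# n_samples: number of sample points
# centers: number of cluster centers, i.e. the number of distinct labels
# random_state: random seed, so the generated data is reproducible
# cluster_std: standard deviation of each cluster
# print(x.shape)  x is a 100 x 2 matrix holding the x and y coordinates
plt.figure(figsize=(6, 6))            # set the figure size
plt.scatter(x[:, 0], x[:, 1], c=y)    # scatter plot
plt.show()

# 2. Algorithm implementation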
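# Import the distance function from scipy; it uses Euclidean distance by default
from scipy.spatial.distance import cdist

class K_Means(object):
    # Initialization: n_clusters (K), number of iterations max_iter, initial centroids
    def __init__(self, n_clusters=6, max_iter=300, centroids=[]):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        # np.float was removed from NumPy; the builtin float is equivalent here
        self.centroids = np.array(centroids, dtype=float)

    # Training method: runs the K-means clustering process on the raw data
    def fit(self, data):
        # If no initial centroids were given, pick random points from data
        if self.centroids.shape == (0,):  # an empty array means no initial centroids
            # Draw n_clusters random integers in [0, number of rows) and use them as row indices
            # (randint samples with replacement, so the same row can be drawn twice)
            self.centroids = data[np.random.randint(0, data.shape[0], self.n_clusters), :].astype(float)
        # Start iterating
        for _ in range(self.max_iter):
            # 1. Distance matrix, 100 x 6 here: distance from every point to each of the 6 centroids
            distances = cdist(data, self.centroids)
            # 2. For each point, take the index of the nearest centroid as its cluster
            c_ind = np.argmin(distances, axis=1)  # 1-D array of length 100
            # 3. Average the points of each cluster and update the centroid coordinates
            for i in range(self.n_clusters):  # i takes the values 0-5
                # Skip clusters that no point was assigned to
                if i in c_ind:
                    # c_ind == i is a boolean mask that selects the rows of data in cluster i;
                    # axis=0 averages down each column, giving one coordinate pair
                    self.centroids[i] = np.mean(data[c_ind == i], axis=0)
                    # About np.mean and its axis parameter, for an m x n matrix:
                    #   axis not set: mean of all m*n values, returns a scalar
                    #   axis=0: collapses the rows, one mean per column, n values
                    #   axis=1: collapses the columns, one mean per row, m values

    # Prediction method
    def predict(self, samples):
        # Same as above: compute the distance matrix, then take the index of the nearest centroid
        distances = cdist(samples, self.centroids)
        c_ind = np.argmin(distances, axis=1)
        return c_ind

# Example: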
dist = np.array([[121, 221, 32, 43],   # each row is one of 5 points; each column is the distance to one centroid
                 [121, 1, 12, 23],
                 [65, 21, 2, 43],
                 [1, 221, 32, 43],
                 [21, 11, 22, 3]])
c_ind = np.argmin(dist, axis=1)
print(c_ind)                 # index of the nearest centroid for each row of dist
x_new = x[0:5]               # first 5 rows of the data set x from above
print(x_new)
print(c_ind == 2)            # boolean mask: True where the point belongs to cluster 2
print(x_new[c_ind == 2])     # rows of x_new that belong to cluster 2
print(np.mean(x_new[c_ind == 2], axis=0))  # mean of each column over those rows

# 3. Test
# Define a function that draws one subplot
def plotKMeans(x, y, centroids, subplot, title):
    # Select the subplot; 121 means the first cell of a 1-row, 2-column grid
    plt.subplot(subplot)
    plt.scatter(x[:, 0], x[:, 1], c='r')
    # Draw the centroids, one color per centroid
    plt.scatter(centroids[:, 0], centroids[:, 1], c=np.arange(len(centroids)), s=100)
    plt.title(title)

kmeans = K_Means(max_iter=300, centroids=np.array([[2, 1], [2, 2], [2, 3], [2, 4], [2, 5], [2, 6]]))
plt.figure(figsize=(16, 6))
plotKMeans(x, y, kmeans.centroids, 121, 'Initial State')
# Run the clustering
kmeans.fit(x)
plotKMeans(x, y, kmeans.centroids, 122, 'Final State')
# Predict the cluster of new data points
x_new = np.array([[0, 0], [10, 7]])
# y_pred = kmeans.predict(x_new)
print(kmeans.centroids)
# print(y_pred)
plt.scatter(x_new[:, 0], x_new[:, 1], s=100, c='black')
plt.show()
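
The `fit` method above always runs all `max_iter` passes. A common refinement is to stop as soon as the centroids stop moving. The sketch below shows one way to do that under stated assumptions: the function name `fit_until_stable`, the tolerance `tol`, and the two-blob toy data are illustrative and not part of the class above, and distances are computed with NumPy broadcasting rather than `cdist` so the block runs without SciPy.

```python
import numpy as np

def fit_until_stable(data, centroids, max_iter=300, tol=1e-6):
    """K-means loop that exits early once the centroids move less than tol."""
    centroids = np.array(centroids, dtype=float)
    for n_iter in range(1, max_iter + 1):
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute each non-empty cluster's mean
        updated = centroids.copy()
        for k in range(len(centroids)):
            if np.any(labels == k):
                updated[k] = data[labels == k].mean(axis=0)
        # Total centroid movement in this pass
        shift = np.linalg.norm(updated - centroids)
        centroids = updated
        if shift < tol:
            break
    return centroids, labels, n_iter

# Hypothetical data: two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels, n_iter = fit_until_stable(data, [[1.0, 1.0], [4.0, 4.0]])
print(n_iter)
print(centroids)
```

On well-separated data like this the loop should exit after a handful of passes instead of 300; on harder data the tolerance trades accuracy for run time.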

2. Output:

[2 1 2 0 3]
[[-0.02708305  5.0215929 ]
 [-5.49252256  6.27366991]
 [-5.37691608  1.51403209]
 [-5.37872006  2.16059225]
 [ 9.58333171  8.10916554]]
[ True False  True False False]
[[-0.02708305  5.0215929 ]
 [-5.37691608  1.51403209]]
[-2.70199956  3.26781249]
[[ 5.76444812 -4.67941789]
 [-2.89174024 -0.22808556]
 [-5.89115978  2.33887408]
 [-4.53406813  6.11523454]
 [-1.15698106  5.63230377]
 [ 9.20551979  7.56124841]]
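
The boolean-mask and column-mean lines of this output can be checked by hand without regenerating the data. The sketch below copies the five printed rows and the printed index array into literals (rounded to the eight decimals shown, so the last digit may differ slightly) and repeats the masking and averaging; the variable names `mask`, `col_mean`, `row_mean`, and `all_mean` are chosen for illustration.

```python
import numpy as np

# First five rows of x and the argmin result, copied from the printed output
x_new = np.array([
    [-0.02708305, 5.0215929],
    [-5.49252256, 6.27366991],
    [-5.37691608, 1.51403209],
    [-5.37872006, 2.16059225],
    [9.58333171, 8.10916554],
])
c_ind = np.array([2, 1, 2, 0, 3])

mask = c_ind == 2                         # True for rows 0 and 2
col_mean = np.mean(x_new[mask], axis=0)   # one mean per column
row_mean = np.mean(x_new[mask], axis=1)   # one mean per selected row
all_mean = np.mean(x_new[mask])           # single scalar over all selected values
print(mask)
print(col_mean)
```

`col_mean` is the quantity the centroid update uses: the average x coordinate and the average y coordinate of the points in one cluster.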

3. References:

尚硅谷 (Atguigu) instructor: 武晟然

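4. A note to myself:

As heaven moves with unceasing vigor, so should a person of character strive without rest to improve.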
