When grouping discrete data points, one of the clustering methods we most often reach for is KMeans.
KMeans is a distance-based, exclusive partitioning method: given a value K, the algorithm divides the data into K partitions, each of which is one cluster. No cluster is empty, and every point belongs to exactly one cluster.

Traditional KMeans Clustering

  1. Randomly pick K points from the data as the centroids of the K clusters;
  2. Measure every remaining point, assign it to the cluster whose centroid is nearest, and recompute the centroid of that cluster;
  3. Repeat step 2 until the centroids stop changing, or change by less than a threshold;
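For intuition, here is a tiny hand-worked 1-D run with points {1, 2, 10, 11} and K = 2. Suppose the randomly chosen initial centroids are 1 and 2. The first assignment yields clusters {1} and {2, 10, 11}, with updated centroids 1 and about 7.67. The next assignment moves 2 over, yielding {1, 2} and {10, 11} with centroids 1.5 and 10.5, after which nothing changes and the algorithm stops.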

We generally use Euclidean distance as the distance measure. To keep a poor random initialization from preventing a successful clustering, a maximum number of iterations is also set.
Two NumPy functions that the traditional KMeans implementation relies on are worth introducing first (a short demo follows the list):
- np.linalg.norm: computes a norm; here the L2 norm
- np.tile: repeats (tiles) an array to a given shape
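
As a minimal sketch (independent of the listing below), the two helpers behave like this:

import numpy as np

a = np.array([3.0, 4.0])
print(np.linalg.norm(a))   # L2 norm of the vector: 5.0
print(np.tile(a, (3, 1)))  # stack a into 3 identical rows -> shape (3, 2)

The full implementation follows.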

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection


def normalize(X, axis=-1, p=2):
    """Normalize the data set X along the given axis.
    :param X: the input data set
    :param axis: the axis along which to compute the norm
    :param p: the order of the Lp norm
    :return: X scaled so each slice along axis has unit Lp norm
    """
    lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))
    lp_norm[lp_norm == 0] = 1
    return X / np.expand_dims(lp_norm, axis)


def euclidean_distance(one_sample, X):
    """Calculate the squared Euclidean distance between one sample and every row of X.
    :param one_sample: the sample to compare
    :param X: all sample points in the data set
    :return: an array of shape (X.shape[0],)
    """
    one_sample = one_sample.reshape(1, -1)
    X = X.reshape(X.shape[0], -1)
    # Tile the sample to X's shape, subtract, square, and sum per row.
    distances = np.power(np.tile(one_sample, (X.shape[0], 1)) - X, 2).sum(axis=1)
    return distances


class Kmeans():
    def __init__(self, X, k=2, max_iterations=500, varepsilon=0.0001):
        """
        :param X: the data set
        :param k: the number of clusters
        :param max_iterations: the max number of iterations
        :param varepsilon: the convergence threshold
        """
        self.X = X
        self.k = k
        self.max_iterations = max_iterations
        self.varepsilon = varepsilon

    def init_random_centroids(self):
        """Randomly choose k sample points as the initial centroids.
        :return: an array of shape (k, n_features)
        """
        n_samples, n_features = np.shape(self.X)
        centroids = np.zeros((self.k, n_features))
        for i in range(self.k):
            centroid = self.X[np.random.choice(range(n_samples))]
            centroids[i] = centroid
        return centroids

    def closest_centroid(self, sample, centroids):
        """Get the index of the centroid nearest to sample.
        :param sample: the sample to assign
        :param centroids: all current centroids
        :return: the index of the nearest centroid
        """
        distances = euclidean_distance(sample, centroids)
        closest_index = np.argmin(distances)
        return closest_index

    def create_clusters(self, centroids):
        """Assign every sample to the cluster of its nearest centroid.
        :param centroids: all current centroids
        :return: a list of k lists holding the sample indices of each cluster
        """
        clusters = [[] for _ in range(self.k)]
        for sample_i, sample in enumerate(self.X):
            centroid_i = self.closest_centroid(sample, centroids)
            clusters[centroid_i].append(sample_i)
        return clusters

    def update_centroids(self, clusters):
        """Recompute each centroid as the mean of the samples assigned to it.
        :param clusters: the clusters from the last assignment step
        :return: an array of shape (k, n_features) with the new centroids
        """
        n_features = np.shape(self.X)[1]
        centroids = np.zeros((self.k, n_features))
        for i, cluster in enumerate(clusters):
            centroid = np.mean(self.X[cluster], axis=0)
            centroids[i] = centroid
        return centroids

    def get_cluster_labels(self, clusters):
        """Label every sample with the index of the cluster it belongs to.
        :param clusters: all clusters
        :return: an array of class labels of shape (n_samples,)
        """
        y_pred = np.zeros(np.shape(self.X)[0])
        for cluster_i, cluster in enumerate(clusters):
            for sample_i in cluster:
                y_pred[sample_i] = cluster_i
        return y_pred

    def predict(self):
        """Cluster the whole data set and return a label per sample."""
        # Step 1: randomly pick k centroids from all samples.
        centroids = self.init_random_centroids()
        # Assign and update until the centroids move less than the
        # threshold, or the max number of iterations is reached.
        for _ in range(self.max_iterations):
            # Assignment step.
            clusters = self.create_clusters(centroids)
            former_centroids = centroids
            # Update step.
            centroids = self.update_centroids(clusters)
            # Stop once no centroid coordinate moved more than the threshold.
            diff = centroids - former_centroids
            if (np.abs(diff) < self.varepsilon).all():
                break
        return self.get_cluster_labels(clusters)


def main():
    X, y = datasets.make_blobs(n_samples=10000,
                               n_features=3,
                               centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                               cluster_std=[0.2, 0.1, 0.2, 0.2],
                               random_state=10)
    clf = Kmeans(X, k=4)
    y_pred = clf.predict()

    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111, projection='3d')
    ax.view_init(elev=30, azim=20)
    # To plot the data with the true labels y instead, swap y_pred for y below.
    for label in range(4):
        ax.scatter(X[y_pred == label][:, 0],
                   X[y_pred == label][:, 1],
                   X[y_pred == label][:, 2])
    plt.show()


if __name__ == '__main__':
    main()
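
As a sanity check on the hand-rolled version, scikit-learn's built-in estimator can be run on the same blobs; this sketch relies only on the standard sklearn.cluster.KMeans API:

from sklearn import datasets
from sklearn.cluster import KMeans as SkKMeans

X, _ = datasets.make_blobs(n_samples=10000, n_features=3,
                           centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                           cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=10)
# fit_predict returns one label per sample, comparable to our predict().
print(SkKMeans(n_clusters=4, n_init=10).fit_predict(X)[:10])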

Figure: the original data colored by the true labels

Figure: the data colored by the predicted cluster labels

Optimizing KMeans: Bisecting KMeans Clustering

Because traditional KMeans chooses its initial centroids at random, the result carries a lot of uncertainty and swings between good and bad runs. Bisecting KMeans builds on traditional KMeans by improving how the centroids are chosen: it derives them through repeated two-way splits, which makes the clustering far less sensitive to the initial pick.
The idea is as follows:

  1. Run KMeans on the data set to split it into two clusters, and compute each cluster's centroid;
  2. Among all current clusters, take the one with the largest SSE and split it into two with KMeans, computing the centroids of the new clusters;
  3. Repeat step 2 until there are K clusters;

Here SSE (Sum of Squares due to Error), the residual sum of squares, is the sum of the squared distances from every point in a cluster to that cluster's centroid.
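
As a small self-contained sketch (the helper name cluster_sse is ours, not part of the listing below), the SSE of a single cluster can be computed like this:

import numpy as np

def cluster_sse(points, centroid):
    # Sum of squared Euclidean distances from each point to the centroid.
    return np.sum(np.power(points - centroid, 2))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(cluster_sse(points, points.mean(axis=0)))  # 4/3 = 1.333...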

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
from KMeans import Kmeans  # the Kmeans class from the first listing, saved as KMeans.py


def euclidean_distance(one_sample, X):
    """Calculate the squared Euclidean distance between one sample and every row of X.
    :param one_sample: the sample to compare
    :param X: all sample points in the data set
    :return: an array of shape (X.shape[0],)
    """
    one_sample = np.array(one_sample).reshape(1, -1)
    X = np.array(X).reshape(np.shape(X)[0], -1)
    distances = np.power(np.tile(one_sample, (X.shape[0], 1)) - X, 2).sum(axis=1)
    return distances


def sse(centroid, points):
    """Calculate the SSE of one cluster.
    :param centroid: the centroid of the cluster
    :param points: the sample points belonging to the cluster
    :return: the sum of squared distances from every point to the centroid
    """
    return np.sum(euclidean_distance(centroid, points))


class Bisecting_KMeans:
    def __init__(self, X, k=4, max_iterations=500, varepsilon=0.0001):
        """
        :param X: the data set
        :param k: the number of clusters
        :param max_iterations: the max number of iterations
        :param varepsilon: the convergence threshold
        """
        self.X = X
        self.k = k
        self.max_iterations = max_iterations
        self.varepsilon = varepsilon
        # Each cluster is stored as an integer array of sample indices into X.
        self.clusters = []

    def bisect(self, indices):
        """Split the samples at the given indices into two clusters with KMeans.
        :param indices: indices of the samples to split
        :return: two index arrays, one per new cluster
        """
        kmeans = Kmeans(self.X[indices], k=2,
                        max_iterations=self.max_iterations,
                        varepsilon=self.varepsilon)
        labels = kmeans.predict()
        return indices[labels == 0], indices[labels == 1]

    def calculate_sse(self):
        """Calculate the SSE of every current cluster.
        :return: a list of SSE values, one per cluster
        """
        sse_list = []
        for indices in self.clusters:
            points = self.X[indices]
            centroid = np.mean(points, axis=0)
            sse_list.append(sse(centroid, points))
        return sse_list

    def get_cluster_labels(self):
        """Label every sample with the index of the cluster it belongs to."""
        y_pred = np.zeros(np.shape(self.X)[0])
        for cluster_i, indices in enumerate(self.clusters):
            y_pred[indices] = cluster_i
        return y_pred

    def predict(self):
        """Bisect until there are k clusters, then return the labels."""
        # Step 1: split the whole data set into two clusters.
        all_indices = np.arange(self.X.shape[0])
        self.clusters = list(self.bisect(all_indices))
        # Step 2: repeatedly split the cluster with the largest SSE.
        while len(self.clusters) < self.k:
            sse_list = self.calculate_sse()
            worst = int(np.argmax(sse_list))
            indices = self.clusters.pop(worst)
            self.clusters.extend(self.bisect(indices))
        return self.get_cluster_labels()


def main():
    X, y = datasets.make_blobs(n_samples=10000,
                               n_features=3,
                               centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                               cluster_std=[0.2, 0.1, 0.2, 0.2],
                               random_state=10)
    clf = Bisecting_KMeans(X, k=4)
    y_pred = clf.predict()

    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111, projection='3d')
    ax.view_init(elev=30, azim=20)
    # To plot the data with the true labels y instead, swap y_pred for y below.
    for label in range(4):
        ax.scatter(X[y_pred == label][:, 0],
                   X[y_pred == label][:, 1],
                   X[y_pred == label][:, 2])
    plt.show()


if __name__ == '__main__':
    main()
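
For comparison, scikit-learn added a built-in sklearn.cluster.BisectingKMeans in version 1.1; assuming that version or newer is installed, the same blobs can be clustered in a few lines:

from sklearn import datasets
from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1

X, _ = datasets.make_blobs(n_samples=10000, n_features=3,
                           centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                           cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=10)
print(BisectingKMeans(n_clusters=4, random_state=10).fit_predict(X)[:10])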

If you spot any problems or shortcomings, please point them out.
