When grouping discrete data points, one of the clustering methods we most often reach for is KMeans.
KMeans is a distance-based, exclusive partitioning method: given a value K, the algorithm divides the data into K partitions, each of which is one cluster. No cluster is empty, and every point belongs to exactly one cluster.

Traditional KMeans Clustering

  1. Randomly pick K points from the data as the centroids of the K clusters;
  2. Measure every remaining point, assign it to the cluster whose centroid is nearest, and recompute the centroid of that cluster;
  3. Repeat step 2 until the centroids stop changing, or change by less than a threshold;
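For intuition, here is a tiny hand-worked 1-D run with points {1, 2, 10, 11} and K = 2. Suppose the randomly chosen initial centroids are 1 and 2. The first assignment yields clusters {1} and {2, 10, 11}, with updated centroids 1 and about 7.67. The next assignment moves 2 over, yielding {1, 2} and {10, 11} with centroids 1.5 and 10.5, after which nothing changes and the algorithm stops.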

We generally use Euclidean distance as the distance measure. To keep a poor random initialization from preventing a successful clustering, a maximum number of iterations is also set.
Two NumPy functions that the traditional KMeans implementation relies on are worth introducing first (a short demo follows the list):
- np.linalg.norm: computes a norm; here the L2 norm
- np.tile: repeats (tiles) an array to a given shape
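
As a minimal sketch (independent of the listing below), the two helpers behave like this:

import numpy as np

a = np.array([3.0, 4.0])
print(np.linalg.norm(a))   # L2 norm of the vector: 5.0
print(np.tile(a, (3, 1)))  # stack a into 3 identical rows -> shape (3, 2)

The full implementation follows.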

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection


def normalize(X, axis=-1, p=2):
    """Normalize the data set X along the given axis.
    :param X: the input data set
    :param axis: the axis along which to compute the norm
    :param p: the order of the Lp norm
    :return: X scaled so each slice along axis has unit Lp norm
    """
    lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))
    lp_norm[lp_norm == 0] = 1
    return X / np.expand_dims(lp_norm, axis)


def euclidean_distance(one_sample, X):
    """Calculate the squared Euclidean distance between one sample and every row of X.
    :param one_sample: the sample to compare
    :param X: all sample points in the data set
    :return: an array of shape (X.shape[0],)
    """
    one_sample = one_sample.reshape(1, -1)
    X = X.reshape(X.shape[0], -1)
    # Tile the sample to X's shape, subtract, square, and sum per row.
    distances = np.power(np.tile(one_sample, (X.shape[0], 1)) - X, 2).sum(axis=1)
    return distances


class Kmeans():
    def __init__(self, X, k=2, max_iterations=500, varepsilon=0.0001):
        """
        :param X: the data set
        :param k: the number of clusters
        :param max_iterations: the max number of iterations
        :param varepsilon: the convergence threshold
        """
        self.X = X
        self.k = k
        self.max_iterations = max_iterations
        self.varepsilon = varepsilon

    def init_random_centroids(self):
        """Randomly choose k sample points as the initial centroids.
        :return: an array of shape (k, n_features)
        """
        n_samples, n_features = np.shape(self.X)
        centroids = np.zeros((self.k, n_features))
        for i in range(self.k):
            centroid = self.X[np.random.choice(range(n_samples))]
            centroids[i] = centroid
        return centroids

    def closest_centroid(self, sample, centroids):
        """Get the index of the centroid nearest to sample.
        :param sample: the sample to assign
        :param centroids: all current centroids
        :return: the index of the nearest centroid
        """
        distances = euclidean_distance(sample, centroids)
        closest_index = np.argmin(distances)
        return closest_index

    def create_clusters(self, centroids):
        """Assign every sample to the cluster of its nearest centroid.
        :param centroids: all current centroids
        :return: a list of k lists holding the sample indices of each cluster
        """
        clusters = [[] for _ in range(self.k)]
        for sample_i, sample in enumerate(self.X):
            centroid_i = self.closest_centroid(sample, centroids)
            clusters[centroid_i].append(sample_i)
        return clusters

    def update_centroids(self, clusters):
        """Recompute each centroid as the mean of the samples assigned to it.
        :param clusters: the clusters from the last assignment step
        :return: an array of shape (k, n_features) with the new centroids
        """
        n_features = np.shape(self.X)[1]
        centroids = np.zeros((self.k, n_features))
        for i, cluster in enumerate(clusters):
            centroid = np.mean(self.X[cluster], axis=0)
            centroids[i] = centroid
        return centroids

    def get_cluster_labels(self, clusters):
        """Label every sample with the index of the cluster it belongs to.
        :param clusters: all clusters
        :return: an array of class labels of shape (n_samples,)
        """
        y_pred = np.zeros(np.shape(self.X)[0])
        for cluster_i, cluster in enumerate(clusters):
            for sample_i in cluster:
                y_pred[sample_i] = cluster_i
        return y_pred

    def predict(self):
        """Cluster the whole data set and return a label per sample."""
        # Step 1: randomly pick k centroids from all samples.
        centroids = self.init_random_centroids()
        # Assign and update until the centroids move less than the
        # threshold, or the max number of iterations is reached.
        for _ in range(self.max_iterations):
            # Assignment step.
            clusters = self.create_clusters(centroids)
            former_centroids = centroids
            # Update step.
            centroids = self.update_centroids(clusters)
            # Stop once no centroid coordinate moved more than the threshold.
            diff = centroids - former_centroids
            if (np.abs(diff) < self.varepsilon).all():
                break
        return self.get_cluster_labels(clusters)


def main():
    X, y = datasets.make_blobs(n_samples=10000,
                               n_features=3,
                               centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                               cluster_std=[0.2, 0.1, 0.2, 0.2],
                               random_state=10)
    clf = Kmeans(X, k=4)
    y_pred = clf.predict()

    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111, projection='3d')
    ax.view_init(elev=30, azim=20)
    # To plot the data with the true labels y instead, swap y_pred for y below.
    for label in range(4):
        ax.scatter(X[y_pred == label][:, 0],
                   X[y_pred == label][:, 1],
                   X[y_pred == label][:, 2])
    plt.show()


if __name__ == '__main__':
    main()
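
As a sanity check on the hand-rolled version, scikit-learn's built-in estimator can be run on the same blobs; this sketch relies only on the standard sklearn.cluster.KMeans API:

from sklearn import datasets
from sklearn.cluster import KMeans as SkKMeans

X, _ = datasets.make_blobs(n_samples=10000, n_features=3,
                           centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                           cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=10)
# fit_predict returns one label per sample, comparable to our predict().
print(SkKMeans(n_clusters=4, n_init=10).fit_predict(X)[:10])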

Figure: the original data colored by the true labels

Figure: the data colored by the predicted cluster labels

Optimizing KMeans: Bisecting KMeans Clustering

Because traditional KMeans chooses its initial centroids at random, the result carries a lot of uncertainty and swings between good and bad runs. Bisecting KMeans builds on traditional KMeans by improving how the centroids are chosen: it derives them through repeated two-way splits, which makes the clustering far less sensitive to the initial pick.
The idea is as follows:

  1. Run KMeans on the data set to split it into two clusters, and compute each cluster's centroid;
  2. Among all current clusters, take the one with the largest SSE and split it into two with KMeans, computing the centroids of the new clusters;
  3. Repeat step 2 until there are K clusters;

Here SSE (Sum of Squares due to Error), the residual sum of squares, is the sum of the squared distances from every point in a cluster to that cluster's centroid.
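
As a small self-contained sketch (the helper name cluster_sse is ours, not part of the listing below), the SSE of a single cluster can be computed like this:

import numpy as np

def cluster_sse(points, centroid):
    # Sum of squared Euclidean distances from each point to the centroid.
    return np.sum(np.power(points - centroid, 2))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(cluster_sse(points, points.mean(axis=0)))  # 4/3 = 1.333...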

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
from KMeans import Kmeans  # the Kmeans class from the first listing, saved as KMeans.py


def euclidean_distance(one_sample, X):
    """Calculate the squared Euclidean distance between one sample and every row of X.
    :param one_sample: the sample to compare
    :param X: all sample points in the data set
    :return: an array of shape (X.shape[0],)
    """
    one_sample = np.array(one_sample).reshape(1, -1)
    X = np.array(X).reshape(np.shape(X)[0], -1)
    distances = np.power(np.tile(one_sample, (X.shape[0], 1)) - X, 2).sum(axis=1)
    return distances


def sse(centroid, points):
    """Calculate the SSE of one cluster.
    :param centroid: the centroid of the cluster
    :param points: the sample points belonging to the cluster
    :return: the sum of squared distances from every point to the centroid
    """
    return np.sum(euclidean_distance(centroid, points))


class Bisecting_KMeans:
    def __init__(self, X, k=4, max_iterations=500, varepsilon=0.0001):
        """
        :param X: the data set
        :param k: the number of clusters
        :param max_iterations: the max number of iterations
        :param varepsilon: the convergence threshold
        """
        self.X = X
        self.k = k
        self.max_iterations = max_iterations
        self.varepsilon = varepsilon
        # Each cluster is stored as an integer array of sample indices into X.
        self.clusters = []

    def bisect(self, indices):
        """Split the samples at the given indices into two clusters with KMeans.
        :param indices: indices of the samples to split
        :return: two index arrays, one per new cluster
        """
        kmeans = Kmeans(self.X[indices], k=2,
                        max_iterations=self.max_iterations,
                        varepsilon=self.varepsilon)
        labels = kmeans.predict()
        return indices[labels == 0], indices[labels == 1]

    def calculate_sse(self):
        """Calculate the SSE of every current cluster.
        :return: a list of SSE values, one per cluster
        """
        sse_list = []
        for indices in self.clusters:
            points = self.X[indices]
            centroid = np.mean(points, axis=0)
            sse_list.append(sse(centroid, points))
        return sse_list

    def get_cluster_labels(self):
        """Label every sample with the index of the cluster it belongs to."""
        y_pred = np.zeros(np.shape(self.X)[0])
        for cluster_i, indices in enumerate(self.clusters):
            y_pred[indices] = cluster_i
        return y_pred

    def predict(self):
        """Bisect until there are k clusters, then return the labels."""
        # Step 1: split the whole data set into two clusters.
        all_indices = np.arange(self.X.shape[0])
        self.clusters = list(self.bisect(all_indices))
        # Step 2: repeatedly split the cluster with the largest SSE.
        while len(self.clusters) < self.k:
            sse_list = self.calculate_sse()
            worst = int(np.argmax(sse_list))
            indices = self.clusters.pop(worst)
            self.clusters.extend(self.bisect(indices))
        return self.get_cluster_labels()


def main():
    X, y = datasets.make_blobs(n_samples=10000,
                               n_features=3,
                               centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                               cluster_std=[0.2, 0.1, 0.2, 0.2],
                               random_state=10)
    clf = Bisecting_KMeans(X, k=4)
    y_pred = clf.predict()

    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111, projection='3d')
    ax.view_init(elev=30, azim=20)
    # To plot the data with the true labels y instead, swap y_pred for y below.
    for label in range(4):
        ax.scatter(X[y_pred == label][:, 0],
                   X[y_pred == label][:, 1],
                   X[y_pred == label][:, 2])
    plt.show()


if __name__ == '__main__':
    main()
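
For comparison, scikit-learn added a built-in sklearn.cluster.BisectingKMeans in version 1.1; assuming that version or newer is installed, the same blobs can be clustered in a few lines:

from sklearn import datasets
from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1

X, _ = datasets.make_blobs(n_samples=10000, n_features=3,
                           centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                           cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=10)
print(BisectingKMeans(n_clusters=4, random_state=10).fit_predict(X)[:10])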

If you spot any problems or shortcomings, please point them out.
