K-means算法代码详解及Demo

最近比较忙公众号更新的就不太及时，请各位大佬见谅，但是我依旧每天坚持学习。那今天大管就给各位小伙伴献上K-means算法的sklearn使用方法，以及在文章末尾我们使用K-Means算法对图片进行矢量化，即在保证图片质量的前提下来减少图片的使用(可以理解为压缩图片)。想回顾K-Means理论的小伙伴可以点击文章末尾的连接。

K-means

KMeans算法通过试着将样本分离到n组方差相等的情况下对数据进行聚类，使惯性或聚类内平方和最小化。该算法要求指定集群的数量。它可以很好地扩展到大量的样本，并且已经在许多不同领域的广泛应用领域中使用。k-means算法将一组样本分成不相交的簇，每个簇用该簇中样本的均值来描述，这种方法通常被称为簇的“质心”。

步骤

基本来说，该算法有三个步骤。第一步选择初始质心，最基本的方法是从数据集中选择样本。初始化之后，K-means由其他两个步骤之间的循环组成。第一步将每个样本分配到其最近的质心。第二步通过取分配给每个质心的所有样本的平均值来创建新的质心。计算新旧质心之间的差值，算法重复最后两步，直到该值小于阈值。换句话说，它重复，直到质心不明显移动。

#调用函数

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')

#参数Parameters

## n_clusters : int, optional, default: 8 要聚类的数目即簇数。默认为8.

## init : {‘k-means++’, ‘random’ or an ndarray} 选择初始化方法。

## n_init : int, default: 10 对于不同的初始值，k-means算法的运行时间。

## max_iter : int, default: 300 k-means算法一次运行的最大迭代次数

## tol : float, default: 1e-4 对于收敛的精确度

## precompute_distances : {‘auto’, True, False} 是否要预计算距离，‘auto’自动，建议当样本数大于1200万不要开启。

## verbose : int, default 0 冗长模式

## random_state : int, RandomState instance or None (default) 确定聚类中心初始化的随机数生成。使用int使随机性具有确定性

## copy_x : boolean, optional 在预计算距离时，先将数据中心放在首位是比较精确的

## n_jobs : int or None, optional (default=None) 用于计算的作业数量。这是通过并行计算每个n_init来实现的

## algorithm : “auto”, “full” or “elkan”, default=”auto” 选择算法风格，经典的是full表示稀疏数据，auto选择elkan表示密集数据

#属性Attributes

## cluster_centers_ : array, [n_clusters, n_features] 簇的中心坐标

## labels_ : 每个点的标签

## inertia_ : float 样本到它们最近的簇中心的距离的平方的和

## n_iter_ : int 运行的迭代次数

#代码举例

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],[ 1.,  2.]])

#方法Methods

### fit(X, y=None, sample_weight=None) 根据训练数据来拟合模型

参数Parameters

X : array-like or sparse matrix, shape=(n_samples, n_features) 训练数据集

y : Ignored 不使用

sample_weight : array-like, shape (n_samples,), optional x中每个样本的权重，如果没有，则所有样本的权重相等

### fit_predict(X, y=None, sample_weight=None) 计算预测聚类

参数Parameters

X : array-like or sparse matrix, shape=(n_samples, n_features) 新样本数据集

y : Ignored 不使用

sample_weight : array-like, shape (n_samples,), optional x中每个样本的权重，如果没有，则所有样本的权重相等

返回值Return

abels : array, shape [n_samples,] 每个样本所属聚类的簇

### fit_transform(X, y=None, sample_weight=None) 计算聚类并将X转换为聚类距离空间

参数同上

返回值Return

X_new : array, shape [n_samples, k] X在新的空间中变换

### get_params(deep=True)

返回值Return

params : mapping of string to any 参数名称映射到它们的值

### predict(X, sample_weight=None) 预测X中每个样本所属于的最近的聚类

参数Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features] 要预测的样本

sample_weight : array-like, shape (n_samples,), optional 样本的权重

返回值Return

labels : array, shape [n_samples,] 每个样本所属的簇

### score(X, y=None, sample_weight=None) 与k均值目标上的X值相反

参数同上

返回值Return

score : float 与k均值目标上的X值相反

### transform(X) 将X转换为簇距离空间

参数同上

返回值Reutrn

X_new : array, shape [n_samples, k] 装换后的X值

#实例

对颐和园(中国)的图像进行像素矢量量化(VQ)，将显示图像所需的颜色从96615种减少到64种，同时保持整体外观质量。在本例中，像素在3d空间中表示，使用K-means找到64个颜色簇。在图像处理文献中，K-means(聚类中心)得到的码本称为调色板。使用单个字节，最多可以寻址256种颜色，而RGB编码需要每个像素3个字节。例如，GIF文件格式就使用了这样一个调色板，为了进行比较，还显示了使用随机码本(随机提取的颜色)的量化图像。

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import timen_colors = 64# Load the Summer Palace photo
china = load_sample_image("china.jpg")# Convert to floats instead of the default 8 bits integer coding. Dividing by
# 255 is important so that plt.imshow behaves works well on float data (need to
# be in the range [0-1])
china = np.array(china, dtype=np.float64) / 255# Load Image and transform to a 2D numpy array.
w, h, d = original_shape = tuple(china.shape)
assert d == 3
image_array = np.reshape(china, (w * h, d))print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0)[:1000]
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print("done in %0.3fs." % (time() - t0))# Get labels for all points
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(image_array)
print("done in %0.3fs." % (time() - t0))codebook_random = shuffle(image_array, random_state=0)[:n_colors]
print("Predicting color indices on the full image (random)")
t0 = time()
labels_random = pairwise_distances_argmin(codebook_random,image_array,axis=0)
print("done in %0.3fs." % (time() - t0))def recreate_image(codebook, labels, w, h):"""Recreate the (compressed) image from the code book & labels"""d = codebook.shape[1]image = np.zeros((w, h, d))label_idx = 0for i in range(w):for j in range(h):image[i][j] = codebook[labels[label_idx]]label_idx += 1return image# Display all results, alongside original image
plt.figure(1)
plt.clf()
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)plt.figure(2)
plt.clf()
plt.axis('off')
plt.title('Quantized image (64 colors, K-Means)')
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))plt.figure(3)
plt.clf()
plt.axis('off')
plt.title('Quantized image (64 colors, Random)')
plt.imshow(recreate_image(codebook_random, labels_random, w, h))
plt.show()

下图显示的是处理前，处理后以及随机选取出出力后的图片。可以明显看到k-means算法处理后的图片还原度更高，而随机处理后图片在颜色上有了很大的失真。

内容下载机器学习资料请扫描下方二维码关注小编公众号:程序员大管

K-means算法代码详解及Demo相关推荐

DDA画线算法+代码详解-直线扫描算法之一
#DDA画线算法+代码详解-直线扫描算法之一本文目录结构如下 1.直线扫描算法简介 2.DDA直线扫描算法 2.1 公式推理 1.求斜率K: 2.当|K| <= 1 时 3.当|K| > ...
KNN算法（K最近邻算法）详解
K 最近邻的核心数学知识是距离的计算和权重的计算.我们把需要预测的点作为中心点,然后计算其周围一定半径内的已知点距其的距离,挑选前 k 个点,进行投票,这 k 个点中,哪个类别的点多,该预测点就被判定 ...
图像拼接之APAP算法代码详解
apap 算法:mdlt matlab 很多内置函数都是对列操作,如mean() 1. VLFEAT库检测和匹配 SIFT 关键点 kp1,kp2,matches 2. 关键点坐标齐次化:(x,y, ...
zuc算法代码详解_最短路算法-dijkstra代码与案例详解
引言在研究路径选择和流量分配等交通问题时,常常会用到最短路算法.用最短路算法解决交通问题存在两个难点: 一.算法的选择和程序的编写.最短路算法有很多种改进算法和启发式算法,这些算法的效率不同,适用的 ...
SoundTouch变调编译以及算法代码详解
1. ubuntu20编译源码地址:https://codeberg.org/soundtouch/soundtouch 安装必要依赖 sudo apt-get install automake a ...
knn算法代码详解（以鸢尾花数据为例）
KNN 欧氏距离: d ( x , y ) = ∑ k = 1 n ( x k − y k ) 2 d(x,y)=\sqrt{\sum_{k=1}^{n}{(x_k-y_k)^2}} d(x,y)=k ...
CLAHE算法代码详解
转载 https://www.cnblogs.com/jsxyhelu/p/6435601.html?utm_source=debugrun&utm_medium=referral
对比学习：MoCo代码详解
MoCo算法代码详解本文代码来源: 1.导入包 2.参数设置 3.数据预处理 4. 模型 4.1moment update key encoder 4.2进队出队 4.3shuffle 4.4损失计 ...
ADMM,ISTA,FISTA算法步骤详解，MATLAB代码，求解LASSO优化问题
ADMM,ISTA,FISTA算法步骤详解,MATLAB代码,求解LASSO优化问题原创文章!转载需注明来源:©️ Sylvan Ding's Blog ❤️ 实验目的了解 ADMM, ISTA, ...

K-means算法代码详解及Demo

K-means算法代码详解及Demo相关推荐

最新文章

热门文章