Spark：聚类算法

Kmeans聚类

KMeans算法的基本思想是初始随机给定K个簇中心，按照最邻近原则把待分类样本点分到各个簇。然后按平均法重新计算各个簇的质心，从而确定新的簇心。一直迭代，直到簇心的移动距离小于某个给定的值。
K-Means聚类算法主要分为三个步骤：
(1)第一步是为待聚类的点寻找聚类中心
(2)第二步是计算每个点到聚类中心的距离，将每个点聚类到离该点最近的聚类中去
(3)第三步是计算每个聚类中所有点的坐标平均值，并将这个平均值作为新的聚类中心
反复执行(2)、(3)，直到聚类中心不再进行大范围移动或者聚类次数达到要求为止

如何选择聚类数目how do we know how many clusters to make?Within Set Sum of Squared Error (WSSSE)

use the "elbow method." This means we evaluate range of values for the number of clusters and check the within set sum of squared error (WSSSE). Plotting the number of clusters against WSSSE will show improvement (minimization) somewhere along its progression. Choose the number of clusters at which large improvements in WSSSE stop.也就是角点处角最小的那个角点对应的k值。

使用函数computeCost(rdd)

具体实现

from math import sqrtdef error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)]))WSSSE = (parsedData.map(lambda point:error(point)).reduce(lambda x, y: x+y))
print('Within Set Sum of Squared Error = ' + str(WSSSE))

[ K-means with Spark & Hadoop]

kmeans中心点的计算

对于初始化聚类中心点，我们可以在输入的数据集中随机地选择k 个点作为中心点，但是随机选择初始中心点可能会造成聚类的结果和数据的实际分布相差很大。k-means++就是选择初始中心点的一种算法。

k-means++算法选择初始中心点的基本思想就是：初始的聚类中心之间的相互距离要尽可能远。初始化过程如下。

1）从输入的数据点集合中随机选择一个点作为第一个聚类中心；

2）对于数据集中的每一个点x，计算它与最近聚类中心（指已选择的聚类中心）的距离D(x)；

3）选择一个新的数据点作为新的聚类中心，选择的原则是：D(x)较大的点，被选取作为聚类中心的概率较大；

4）重复2）和3），直到k 个聚类中心被选出来；

5）利用这k 个初始的聚类中心来运行标准的KMeans 算法。

从上面的算法描述可以看到，算法的关键是第3 步，如何将D(x)反映到点被选择的概率上。一种算法如下。

1）随机从点集D 中选择一个点作为初始的中心点。

2）计算每一个点到最近中心点的距离i S ，对所有i S 求和得到sum。

3）然后再取一个随机值，用权重的方式计算下一个“种子点”。取随机值random（0<random<sum），对点集D 循环，做random - = i S 运算，直到random < 0，那么点i 就是下一个中心点。

4）重复2）和3），直到k 个聚类中心被选出来。

5）利用这k 个初始的聚类中心来运行标准的KMeans 算法。

MLlib 的KMeans 聚类模型对于计算样本属于哪个中心点，采用了一种快速查找、计算距离的方法，其方法如下。

首先定义lowerBoundOfSqDist 距离公式，假设中心点center 是(a1,b1)，需要计算的点point是2 2 (a ,b )，那么lowerBoundOfSqDist 是：

可轻易证明lowerBoundOfSqDist 将会小于或等于EuclideanDist，因此在进行距离比较的时候，先计算很容易计算的lowerBoundOfSqDist（只需要计算center、point 的L2 范数）。如果lowerBoundOfSqDist 都不小于之前计算得到的最小距离bestDistance，那真正的欧氏距离也不可能小于bestDistance 了。因此在这种情况下就不需要去计算欧氏距离了，省去了很多计算工作。

如果lowerBoundOfSqDist 小于bestDistance，则进行距离的计算，调用fastSquaredDistance，该方法是一种快速计算距离的方法。

fastSquaredDistance方法会先计算一个精度，有关精度的计算val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)，如果在精度满足条件的情况下，欧式距离sqDist = sumSquaredNorm - 2.0 * v1.dot(v2)，sumSquaredNorm即为，2.0 * v1.dot(v2)即为。这也是之前将norm计算出来的好处。如果精度不满足要求，则进行原始的距离计算公式了，即调用Vectors.sqdist(v1, v2)。

Debugs

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

for k in np.arange(2000, 60000, 2000):km_model = km.train(ll_rdd, k=k)

改成

for k in range(2000, 60000, 2000):km_model = km.train(ll_rdd, k=k)

就可以了。spark现在真不稳定。。。

pyspark.mllib.clustering module

BisectingKMeans

A bisecting k-means algorithm based on the paper “A comparison ofdocument clustering techniques” by Steinbach, Karypis, and Kumar,with modification to fit Spark.The algorithm starts from a single cluster that contains all points.Iteratively it finds divisible clusters on the bottom level andbisects each of them using k-means, until there are k leafclusters in total or no leaf clusters are divisible.The bisecting steps of clusters on the same level are groupedtogether to increase parallelism. If bisecting all divisibleclusters on the bottom level would result more than k leafclusters, larger clusters get higher priority.

[BisectingKMeansModel]

[BisectingKMeans]

KMeans

class pyspark.mllib.clustering.KMeans[source]

classmethod train(rdd, k, maxIterations=100, runs=1, initializationMode='k-means||', seed=None, initializationSteps=5, epsilon=0.0001, initialModel=None)[source]

Parameters:	rdd – Training points as an RDD of Vector or convertiblesequence types. k – Number of clusters to create. maxIterations – Maximum number of iterations allowed.(default: 100) runs – This param has no effect since Spark 2.0.0. 之前版本的runs：并行度，默认1； initializationMode – The initialization algorithm. This can be either “random” or“k-means\|\|”.(default: “k-means\|\|”) seed – Random seed value for cluster initialization. Set as None togenerate seed based on system time.(default: None) initializationSteps – Number of steps for the k-means\|\| initialization mode.This is an advanced setting – the default of 5 is almostalways enough.(default: 5) epsilon – Distance threshold within which a center will be considered tohave converged. If all centers move less than this Euclideandistance, iterations are stopped.(default: 1e-4) initialModel – Initial cluster centers can be provided as a KMeansModel objectrather than using the random or k-means\|\| initializationModel.(default: None)

Parameters:

rdd – Training points as an RDD of Vector or convertiblesequence types.
k – Number of clusters to create.
maxIterations – Maximum number of iterations allowed.(default: 100)
runs – This param has no effect since Spark 2.0.0. 之前版本的runs：并行度，默认1；
initializationMode – The initialization algorithm. This can be either “random” or“k-means||”.(default: “k-means||”)
seed – Random seed value for cluster initialization. Set as None togenerate seed based on system time.(default: None)
initializationSteps – Number of steps for the k-means|| initialization mode.This is an advanced setting – the default of 5 is almostalways enough.(default: 5)
epsilon – Distance threshold within which a center will be considered tohave converged. If all centers move less than this Euclideandistance, iterations are stopped.(default: 1e-4)
initialModel – Initial cluster centers can be provided as a KMeansModel objectrather than using the random or k-means|| initializationModel.(default: None)

Note: lz发现没有自定义距离度量函数的接口。。。跪了

computeCost(rdd)[source]

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the givendata.

predict(x)[source]

Find the cluster that each of the points belongs to in thismodel.

Parameters:	x – A data point (or RDD of points) to determine cluster index.
Returns:	Predicted cluster index or an RDD of predicted cluster indicesif the input is an RDD.

Note: collect返回后是一个list类型。

save(sc, path)[source]¶

Save this model to the given path.Note: path是一个目录，不能是一个文件名，否则报错。

[KMeansModel]

[KMeans]

[streaming k-means in the Spark documentation]

皮皮blog

pyspark.ml.clustering module

这个使用pandas dataframe运行kmeans总是出错，官方文档不完善，没搞定。

pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist.'

from: Spark：聚类算法

ref:

Spark：聚类算法相关推荐

spark聚类算法的聚类效果评估指标“轮廓系数”细节
什么是轮廓系数? 参见百度百科,讲的很清楚此处不赘述: https://baike.baidu.com/item/%E8%BD%AE%E5%BB%93%E7%B3%BB%E6%95%B0/173616 ...
基于Spark的机器学习实践 (九) - 聚类算法
0 相关源码 1 k-平均算法(k-means clustering)概述 1.1 回顾无监督学习 ◆ 分类.回归都属于监督学习 ◆ 无监督学习是不需要用户去指定标签的 ◆ 而我们看到的分类.回归算法 ...
离线轻量级大数据平台Spark之MLib机器学习库聚类算法KMeans实例
1.KMeans算法所谓聚类问题,就是给定一个元素集合D,其中每个元素具有n个可观察属性,使用某种算法将D划分成k个子集,要求每个子集内部的元素之间相异度尽可能低,而不同子集的元素相异度尽可能高.其 ...
Spark ML - 聚类算法
http://ihoge.cn/2018/ML2.html Spark ML - 聚类算法 1.KMeans快速聚类首先到UR需要的包: import org.apache.spark.ml.clu ...
Spark MLlib之K-Means聚类算法
聚类算法聚类,Cluster analysis,有时也被翻译为簇类,其核心任务是:将一组目标object划分为若干个簇,每个簇之间的object尽可能的相似,簇与簇之间的object尽可能的相异.聚 ...
Spark mlib KMeans聚类算法
1. 聚类 1.1 什么是聚类? 所谓聚类问题,就是给定一个元素集合D,其中每个元素具有n个可观察属性,使用算法将集合D划分成k个子集,要求每个子集内部的元素之间相异度尽可能低,而不同子集的元素相异度 ...
大数据进阶必修课！Spark实战KMeans聚类算法
目录 7.SparkMLlib KMeans聚类算法 7.1 KMeans聚类算法 7.2 算法源码分析 7.3 应用实战 7.3.1 数据说明 7.3.2 代码详解 7.SparkMLlib KMe ...
混合高斯模型_大数据小白入门高斯混合模型(GMM)聚类算法
导读高斯混合模型(Gaussian Mixture Model)通常简称GMM,是一种业界广泛使用的聚类算法,属于生成式模型,它假设所有的数据样本都是由某一个给定参数的多元高斯分布所生成的.从中 ...
通过K-means对iris数据集进行处理 Kmeans聚类算法实例
Python from sklearn import datasets import matplotlib.pyplot as plt from sklearn.cluster import KMea ...
java spark k-means算法
配置配置请看我的其他文章点击跳转 spark官方文档点击跳转官方文档数据训练数据实体类用了swagger和lombok 不需要的可以删掉 import io.swagger.annota ...