Clustering
This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.
Table of Contents
• K-means
• Latent Dirichlet allocation (LDA)
• Bisecting k-means
• Gaussian Mixture Model (GMM)
• Power Iteration Clustering (PIC)
K-means
k-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called kmeans||.
KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala” in the Spark repo.
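As a supplementary sketch (not part of the original example, and assuming Spark 2.4 or later), KMeans also exposes a distanceMeasure parameter that switches from the default squared Euclidean distance to cosine distance; the evaluator can be configured to match. The names cosineKMeans, cosineModel, cosineEvaluator, and cosineSilhouette below are illustrative, not from the source:

// A minimal sketch, assuming Spark 2.4+: k-means with cosine distance
// instead of the default squared Euclidean distance.
val cosineKMeans = new KMeans()
  .setK(2)
  .setSeed(1L)
  .setDistanceMeasure("cosine")
val cosineModel = cosineKMeans.fit(dataset)

// Configure the evaluator consistently with the training distance measure.
val cosineEvaluator = new ClusteringEvaluator().setDistanceMeasure("cosine")
val cosineSilhouette = cosineEvaluator.evaluate(cosineModel.transform(dataset))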
Latent Dirichlet allocation (LDA)
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.LDA

// Loads data.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_lda_libsvm_data.txt")

// Trains an LDA model.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)

val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)
println(s"The lower bound on the log likelihood of the entire corpus: $ll")
println(s"The upper bound on perplexity: $lp")

// Describe topics.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)

// Shows the result.
val transformed = model.transform(dataset)
transformed.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/LDAExample.scala” in the Spark repo.
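Building on the example above, here is a minimal sketch (not in the original page) of selecting the EM optimizer explicitly and performing the expert-level cast to DistributedLDAModel mentioned earlier; the names emLDA, emModel, distributedModel, and localModel are illustrative:

import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}

// Train with the EM optimizer; the resulting model is distributed.
val emLDA = new LDA().setK(10).setMaxIter(10).setOptimizer("em")
val emModel = emLDA.fit(dataset)

// Models produced by the EM optimizer can be cast to DistributedLDAModel.
val distributedModel = emModel.asInstanceOf[DistributedLDAModel]
// A distributed model can be converted to a smaller local model if needed.
val localModel = distributedModel.toLocal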
Bisecting k-means
Bisecting k-means is a kind of hierarchical clustering using a divisive (or “top-down”) approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.
BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains a bisecting k-means model.
val bkm = new BisectingKMeans().setK(2).setSeed(1)
val model = bkm.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala” in the Spark repo.
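One tuning knob worth noting, shown here as a sketch that is not part of the original example (the names bkmTuned and tunedModel are illustrative): minDivisibleClusterSize controls which clusters are eligible for further splitting during the top-down descent.

// A minimal sketch: minDivisibleClusterSize >= 1.0 is interpreted as a minimum
// number of points; a value in (0, 1) as a minimum fraction of all points.
val bkmTuned = new BisectingKMeans()
  .setK(4)
  .setSeed(1)
  .setMinDivisibleClusterSize(1.0)
val tunedModel = bkmTuned.fit(dataset)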
Gaussian Mixture Model (GMM)
A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.
GaussianMixture is implemented as an Estimator and generates a GaussianMixtureModel as the base model.

Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.GaussianMixture

// Loads data
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Trains Gaussian Mixture Model
val gmm = new GaussianMixture()
.setK(2)
val model = gmm.fit(dataset)

// output parameters of mixture model
for (i <- 0 until model.getK) {
  println(s"Gaussian $i:\nweight=${model.weights(i)}\n" +
      s"mu=${model.gaussians(i).mean}\nsigma=\n${model.gaussians(i).cov}\n")
}
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala” in the Spark repo.
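Since a GMM gives soft assignments, the fitted model's transform output also carries a per-point membership probability vector alongside the hard prediction; a short sketch (not in the original example) of inspecting it:

// The "probability" output column holds the posterior membership
// probabilities over the k Gaussians; "prediction" is the most likely one.
model.transform(dataset)
  .select("features", "prediction", "probability")
  .show(false)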
Power Iteration Clustering (PIC)
Power Iteration Clustering (PIC) is a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
spark.ml’s PowerIterationClustering implementation takes the following parameters:
• k: the number of clusters to create
• initMode: the initialization algorithm (either "random" or "degree")
• maxIter: the maximum number of iterations
• srcCol: the name of the input column for source vertex IDs
• dstCol: the name of the input column for destination vertex IDs
• weightCol: the name of the weight column
Examples
Refer to the Scala API docs for more details.
import org.apache.spark.ml.clustering.PowerIterationClustering

val dataset = spark.createDataFrame(Seq(
(0L, 1L, 1.0),
(0L, 2L, 1.0),
(1L, 2L, 1.0),
(3L, 4L, 1.0),
(4L, 0L, 0.1)
)).toDF("src", "dst", "weight")

val model = new PowerIterationClustering().
setK(2).
setMaxIter(20).
setInitMode("degree").
setWeightCol("weight")

val prediction = model.assignClusters(dataset).select(“id”, “cluster”)

// Shows the cluster assignment
prediction.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala” in the Spark repo.
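Note that assignClusters returns a DataFrame of (id, cluster) pairs rather than transforming the input. Here is a sketch (not in the original example; the name assignments is illustrative) of joining the assignments back onto the edge list:

// Join each edge's source vertex to its assigned cluster.
val assignments = model.assignClusters(dataset)
dataset.join(assignments, dataset("src") === assignments("id"))
  .select("src", "dst", "weight", "cluster")
  .show(false)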
