Data Mining with Spark MLlib

The official Spark website has many machine learning examples with detailed explanations. Link: MLlib: Main Guide - Spark 3.2.1 Documentation (apache.org)

K-means (command-line method)

cd spark-3.1.1-bin-hadoop2.7/bin/

./spark-shell

Then enter the following lines one at a time:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
val dataset = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
val predictions = model.transform(dataset)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
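The session above fixes k = 2. A minimal sketch of how one might compare several values of k by their silhouette scores in the same spark-shell session follows. The candidate values in `candidateKs` and the name `silhouetteByK` are illustrative assumptions, not part of the original walkthrough, and the path is the one used above, so adjust it to your own installation.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Same sample file as above; adjust the path to your own installation.
val dataset = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")
val evaluator = new ClusteringEvaluator()

// Candidate cluster counts (an arbitrary, illustrative choice).
val candidateKs = Seq(2, 3)

// Fit one model per k and record its silhouette score on the same data.
val silhouetteByK = candidateKs.map { k =>
  val model = new KMeans().setK(k).setSeed(1L).fit(dataset)
  val score = evaluator.evaluate(model.transform(dataset))
  (k, score)
}

silhouetteByK.foreach { case (k, score) => println(s"k = $k, silhouette = $score") }
```

A higher silhouette generally indicates better separated clusters, so the k with the largest score is a reasonable starting point rather than a definitive answer.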

Multinomial logistic regression (command-line method)

Enter the following code one line at a time:

import org.apache.spark.ml.classification.LogisticRegression
val training = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val lrModel = lr.fit(training)
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: \n${lrModel.interceptVector}")
val trainingSummary = lrModel.summary
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(println)
println("False positive rate by label:")
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) => println(s"label $label: $rate") }
println("True positive rate by label:")
trainingSummary.truePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) => println(s"label $label: $rate") }
println("Precision by label:")
trainingSummary.precisionByLabel.zipWithIndex.foreach { case (prec, label) => println(s"label $label: $prec") }
println("Recall by label:")
trainingSummary.recallByLabel.zipWithIndex.foreach { case (rec, label) => println(s"label $label: $rec") }
println("F-measure by label:")
trainingSummary.fMeasureByLabel.zipWithIndex.foreach { case (f, label) => println(s"label $label: $f") }
val accuracy = trainingSummary.accuracy
val falsePositiveRate = trainingSummary.weightedFalsePositiveRate
val truePositiveRate = trainingSummary.weightedTruePositiveRate
val fMeasure = trainingSummary.weightedFMeasure
val precision = trainingSummary.weightedPrecision
val recall = trainingSummary.weightedRecall
println(s"Accuracy: $accuracy\nFPR: $falsePositiveRate\nTPR: $truePositiveRate\n" + s"F-measure: $fMeasure\nPrecision: $precision\nRecall: $recall")
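The metrics above come from `lrModel.summary`, so they are computed on the same data the model was trained on. A minimal sketch of checking accuracy on held-out data in spark-shell follows. The 80/20 split, the seed, and the names `trainSet`, `testSet`, `heldOutModel` and `testAccuracy` are illustrative assumptions, not part of the original walkthrough.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Same sample file as above; adjust the path to your own installation.
val data = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt")

// Hold out 20% of the rows for testing (split ratio and seed are arbitrary choices).
val Array(trainSet, testSet) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// Same hyperparameters as the session above, but fitted on the training split only.
val heldOutModel = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).fit(trainSet)
val testPredictions = heldOutModel.transform(testSet)

// Accuracy on rows the model has not seen during fitting.
val accuracyEvaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")
val testAccuracy = accuracyEvaluator.evaluate(testPredictions)
println(s"Held-out accuracy: $testAccuracy")
```

Comparing this value with the training accuracy printed earlier gives a rough indication of whether the model is overfitting.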

Random forest (compiling and packaging with sbt)

1. Create the randomforest folder

cd spark-3.1.1-bin-hadoop2.7/Test/sparkmllib/

mkdir randomforest

2. Inside the randomforest folder, create the nested source directories with mkdir -p src/main/scala

cd randomforest/

mkdir -p src/main/scala

3. Create simple.sbt in the randomforest folder

vim simple.sbt

Add the following content to simple.sbt:

name := "Simple Project"
version := "1.6.1"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.1.1"

4. Create the randomforest.scala file in the scala directory

cd src/main/scala/

vim randomforest.scala

Add the following content to the file:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.sql.SparkSession

object RandomForestClassifierExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("RandomForestClassifierExample")
      .getOrCreate()

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labelsArray(0))

    // Chain indexers and forest in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${(1.0 - accuracy)}")

    // Stage 2 of the fitted pipeline is the trained random forest.
    val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
    println(s"Learned classification forest model:\n ${rfModel.toDebugString}")

    spark.stop()
  }
}

5. Package (run from the randomforest directory)

cd ../../..

/home/ZQ/sbt/sbt package

6. Run (from the randomforest directory)

/home/ZQ/spark-3.1.1-bin-hadoop2.7/bin/spark-submit --class "RandomForestClassifierExample" ./target/scala-2.12/simple-project_2.12-1.6.1.jar
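The packaged job prints the whole forest through toDebugString, which can be long. As a lighter way to inspect a trained forest, the sketch below runs in spark-shell rather than through sbt and lists the five features with the largest importance. It is a simplified assumption-laden variant, not the program above: it trains directly on the raw `label` and `features` columns without the indexers or the train/test split, and the names `forest` and `topFeatures` as well as the seed are illustrative.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Same sample file as the sbt program; adjust the path to your own installation.
val data = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

// Train on the raw columns (no indexers, no hold-out split) to keep the sketch short.
val forest = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features").setNumTrees(10).setSeed(1L).fit(data)

// featureImportances is a vector with one entry per feature;
// pair each entry with its index and keep the five largest.
val topFeatures = forest.featureImportances.toArray.zipWithIndex.sortBy { case (importance, _) => -importance }.take(5)

topFeatures.foreach { case (importance, index) => println(f"feature $index%d: $importance%.4f") }
```

In the sbt program, the same `featureImportances` member is available on `rfModel`, so an equivalent printout could be added before `spark.stop()`.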
