The Spark website has many machine learning examples with detailed explanations. Link: MLlib: Main Guide - Spark 3.2.1 Documentation (apache.org)

K-means (command line)

cd spark-3.1.1-bin-hadoop2.7/bin/
./spark-shell

Then enter the following lines one at a time:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// Load the sample data shipped with Spark (libsvm format).
val dataset = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")
// Train a k-means model with 2 clusters and a fixed seed.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Assign every row to a cluster.
val predictions = model.transform(dataset)
// Evaluate the clustering with the silhouette score.
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
// Print the learned cluster centers.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
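
The evaluator above reports a silhouette score based on squared Euclidean distance. To make that number less of a black box, here is a minimal plain-Scala sketch of the textbook silhouette definition. It is an illustration under assumptions, not Spark's internal implementation: the object name SilhouetteSketch, the helper names, and the six 2-D points are all hypothetical, and the quadratic loop only suits tiny inputs.

```scala
object SilhouetteSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Mean silhouette over all points, given one cluster label per point.
  def silhouette(points: Seq[Point], labels: Seq[Int]): Double = {
    require(points.length == labels.length, "one label per point")
    require(labels.distinct.size >= 2, "need at least two clusters")
    val indexed = points.zip(labels).zipWithIndex
    val scores = indexed.map { case ((p, c), i) =>
      // Distances to the other members of the point's own cluster.
      val own = indexed.collect {
        case ((q, d), j) if d == c && j != i => squaredDistance(p, q)
      }
      if (own.isEmpty) 0.0 // convention: a singleton cluster scores 0
      else {
        val a = own.sum / own.size
        // Smallest mean distance to the members of any other cluster.
        val b = indexed
          .filter { case ((_, d), _) => d != c }
          .groupBy { case ((_, d), _) => d }
          .values
          .map(g => g.map { case ((q, _), _) => squaredDistance(p, q) }.sum / g.size)
          .min
        val denom = math.max(a, b)
        if (denom == 0.0) 0.0 else (b - a) / denom
      }
    }
    scores.sum / scores.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: two tight, well-separated groups.
    val points = Seq(
      Array(0.0, 0.0), Array(0.1, 0.1), Array(0.2, 0.2),
      Array(9.0, 9.0), Array(9.1, 9.1), Array(9.2, 9.2)
    )
    val labels = Seq(0, 0, 0, 1, 1, 1)
    println(s"Silhouette = ${silhouette(points, labels)}")
  }
}
```

A score close to 1 means points sit much closer to their own cluster than to the nearest other cluster; values near 0 or below suggest overlapping or misassigned clusters.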

Multinomial logistic regression (command line)

Enter the following code line by line:

import org.apache.spark.ml.classification.LogisticRegression
val training = spark.read.format("libsvm").load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val lrModel = lr.fit(training)
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: \n${lrModel.interceptVector}")
val trainingSummary = lrModel.summary
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(println)
println("False positive rate by label:")
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) => println(s"label $label: $rate") }
println("True positive rate by label:")
trainingSummary.truePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) => println(s"label $label: $rate") }
println("Precision by label:")
trainingSummary.precisionByLabel.zipWithIndex.foreach { case (prec, label) => println(s"label $label: $prec") }
println("Recall by label:")
trainingSummary.recallByLabel.zipWithIndex.foreach { case (rec, label) => println(s"label $label: $rec") }
println("F-measure by label:")
trainingSummary.fMeasureByLabel.zipWithIndex.foreach { case (f, label) => println(s"label $label: $f") }
val accuracy = trainingSummary.accuracy
val falsePositiveRate = trainingSummary.weightedFalsePositiveRate
val truePositiveRate = trainingSummary.weightedTruePositiveRate
val fMeasure = trainingSummary.weightedFMeasure
val precision = trainingSummary.weightedPrecision
val recall = trainingSummary.weightedRecall
println(s"Accuracy: $accuracy\nFPR: $falsePositiveRate\nTPR: $truePositiveRate\n" + s"F-measure: $fMeasure\nPrecision: $precision\nRecall: $recall")
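
The weighted values printed above are averages of the per-label values, with each label weighted by its share of the true instances. The plain-Scala sketch below shows that relationship on a small confusion matrix. Everything in it is an assumption made for illustration: the object name WeightedMetricsSketch, the helper names, and the 3-class matrix are hypothetical and are not taken from the sample data or from Spark's source.

```scala
object WeightedMetricsSketch {
  // Hypothetical confusion matrix: rows are true labels, columns are predicted labels.
  val confusion: Array[Array[Double]] = Array(
    Array(5.0, 1.0, 0.0),
    Array(2.0, 6.0, 1.0),
    Array(0.0, 1.0, 4.0)
  )

  val labels: Range = confusion.indices
  val total: Double = confusion.map(_.sum).sum

  def truePositives(l: Int): Double = confusion(l)(l)
  def actualCount(l: Int): Double = confusion(l).sum            // row sum
  def predictedCount(l: Int): Double = confusion.map(_(l)).sum  // column sum

  // Guarded division so an empty row or column yields 0 instead of NaN.
  private def ratio(n: Double, d: Double): Double = if (d == 0.0) 0.0 else n / d

  def precision(l: Int): Double = ratio(truePositives(l), predictedCount(l))
  def recall(l: Int): Double = ratio(truePositives(l), actualCount(l))

  // Average a per-label metric, weighting each label by its share of true instances.
  def weighted(metric: Int => Double): Double =
    labels.map(l => metric(l) * actualCount(l) / total).sum

  val accuracy: Double = labels.map(truePositives).sum / total

  def main(args: Array[String]): Unit = {
    labels.foreach { l =>
      println(s"label $l: precision=${precision(l)} recall=${recall(l)}")
    }
    println(s"Accuracy: $accuracy")
    println(s"Weighted precision: ${weighted(precision)}")
    println(s"Weighted recall: ${weighted(recall)}")
  }
}
```

With this weighting, the weighted recall always equals the accuracy, because each label's recall multiplied by its instance count is just that label's number of correct predictions.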

Random forest (compiled and packaged with sbt)

1. Create the randomforest directory

cd spark-3.1.1-bin-hadoop2.7/Test/sparkmllib/
mkdir randomforest

2. Inside randomforest, create the nested source directory src/main/scala

cd randomforest/
mkdir -p src/main/scala

3. Create simple.sbt inside randomforest

vim simple.sbt

Add the following to simple.sbt:

name := "Simple Project"
version := "1.6.1"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.1.1"

4. Create randomforest.scala in the src/main/scala directory

cd src/main/scala/
vim randomforest.scala

Add the following to the file:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.sql.SparkSession

object RandomForestClassifierExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("RandomForestClassifierExample")
      .getOrCreate()

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm")
      .load("file:///home/ZQ/spark-3.1.1-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labelsArray(0))

    // Chain indexers and forest in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${(1.0 - accuracy)}")

    val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
    println(s"Learned classification forest model:\n ${rfModel.toDebugString}")

    spark.stop()
  }
}

5. Package (run from the randomforest directory)

cd ../../..
/home/ZQ/sbt/sbt package

6. Run (from the randomforest directory)

/home/ZQ/spark-3.1.1-bin-hadoop2.7/bin/spark-submit --class "RandomForestClassifierExample" ./target/scala-2.12/simple-project_2.12-1.6.1.jar
