Spark 2: One-Hot Encoding, Standardization, PCA, and Clustering
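- // affairs: frequency of extramarital affairs over the past year
- // gender: gender
- // age: age
- // yearsmarried: years married
- // children: whether the respondent has children
- // religiousness: degree of religiousness (5-point scale: 1 = anti-religious, 5 = very religious)
- // education: education level
- // occupation: occupation (Gordon 7-category classification, reverse-numbered)
- // rating: self-rating of the marriage (5-point scale: 1 = very unhappy, 5 = very happy)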
- import org.apache.spark.sql.SparkSession
- import org.apache.spark.sql.Dataset
- import org.apache.spark.sql.Row
- import org.apache.spark.sql.DataFrame
- import org.apache.spark.sql.Column
- import org.apache.spark.sql.DataFrameReader
- import org.apache.spark.rdd.RDD
- import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
- import org.apache.spark.sql.Encoder
- import org.apache.spark.ml.linalg.Vectors
- import org.apache.spark.ml.feature.StringIndexer
- import org.apache.spark.ml.feature.OneHotEncoder
- import org.apache.spark.ml.feature.VectorAssembler
- import org.apache.spark.ml.feature.StandardScaler
- import org.apache.spark.ml.feature.PCA
- import org.apache.spark.ml.clustering.KMeans
- scala> val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
- scala>
- scala> // For implicit conversions like converting RDDs to DataFrames
- scala> import spark.implicits._
- scala> val data: DataFrame = spark.read.format("csv").option("header", true).load("hdfs://ns1/datafile/wangxiao/Affairs.csv")
- data: org.apache.spark.sql.DataFrame = [affairs: string, gender: string ... 7 more fields]
- scala>
- scala> data.cache
- res0: data.type = [affairs: string, gender: string ... 7 more fields]
- scala>
- scala> data.limit(10).show()
- scala>
- scala> // Cast the column types, listing the Double fields first and the String fields last
- scala> val data1 = data.select(
- | data("affairs").cast("Double"),
- | data("age").cast("Double"),
- | data("yearsmarried").cast("Double"),
- | data("religiousness").cast("Double"),
- | data("education").cast("Double"),
- | data("occupation").cast("Double"),
- | data("rating").cast("Double"),
- | data("gender").cast("String"),
- | data("children").cast("String"))
- data1: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 7 more fields]
- scala>
- scala> data1.printSchema()
- root
- |-- affairs: double (nullable = true)
- |-- age: double (nullable = true)
- |-- yearsmarried: double (nullable = true)
- |-- religiousness: double (nullable = true)
- |-- education: double (nullable = true)
- |-- occupation: double (nullable = true)
- |-- rating: double (nullable = true)
- |-- gender: string (nullable = true)
- |-- children: string (nullable = true)
- scala> data1.limit(10).show
- scala>
- scala> val dataDF = data1
- dataDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 7 more fields]
- scala>
- scala> dataDF.cache()
- res4: dataDF.type = [affairs: double, age: double ... 7 more fields]
- scala>
- scala> //###################################
- scala> val indexer = new StringIndexer().setInputCol("gender").setOutputCol("genderIndex").fit(dataDF)
- indexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_19a888aff882
- scala> val indexed = indexer.transform(dataDF)
- indexed: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 8 more fields]
- scala> // One-hot encoding; note that setDropLast is set to false
- scala> val encoder = new OneHotEncoder().setInputCol("genderIndex").setOutputCol("genderVec").setDropLast(false)
- encoder: org.apache.spark.ml.feature.OneHotEncoder = oneHot_f0f47e0b5b37
- scala> val encoded = encoder.transform(indexed)
- encoded: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 9 more fields]
- scala> encoded.show()
- scala>
- scala> val indexer1 = new StringIndexer().setInputCol("children").setOutputCol("childrenIndex").fit(encoded)
- indexer1: org.apache.spark.ml.feature.StringIndexerModel = strIdx_7e4d8c69b823
- scala> val indexed1 = indexer1.transform(encoded)
- indexed1: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 10 more fields]
- scala> val encoder1 = new OneHotEncoder().setInputCol("childrenIndex").setOutputCol("childrenVec").setDropLast(false)
- encoder1: org.apache.spark.ml.feature.OneHotEncoder = oneHot_9a8906781325
- scala> val encoded1 = encoder1.transform(indexed1)
- encoded1: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 11 more fields]
- scala> encoded1.show()
- scala>
- scala> val encodeDF: DataFrame = encoded1
- encodeDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 11 more fields]
- scala> encodeDF.show()
- scala> encodeDF.printSchema()
- root
- |-- affairs: double (nullable = true)
- |-- age: double (nullable = true)
- |-- yearsmarried: double (nullable = true)
- |-- religiousness: double (nullable = true)
- |-- education: double (nullable = true)
- |-- occupation: double (nullable = true)
- |-- rating: double (nullable = true)
- |-- gender: string (nullable = true)
- |-- children: string (nullable = true)
- |-- genderIndex: double (nullable = true)
- |-- genderVec: vector (nullable = true)
- |-- childrenIndex: double (nullable = true)
- |-- childrenVec: vector (nullable = true)
- scala>
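The two encoding steps above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the object `OneHotSketch`, its method names, and the alphabetical tie-break are hypothetical, not part of the Spark API. It shows that `StringIndexer` assigns the index 0 to the most frequent label, and that `setDropLast(false)` keeps one slot per category.

```scala
// Plain-Scala sketch (no Spark) of what StringIndexer followed by OneHotEncoder produces.
object OneHotSketch {
  // Assign indices by descending frequency, as StringIndexer does by default.
  // Breaking ties alphabetically is an assumption of this sketch.
  def fitIndex(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // With dropLast = false every category gets its own slot;
  // with dropLast = true the last category is encoded as all zeros.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean): Vector[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }
}
```

For a binary column, keeping both slots means one slot is always 1 minus the other. That is consistent with the `pca.pc` output later in the transcript, where the two gender rows and the two children rows are exact negatives of each other.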
- scala> //#################################
- scala> val assembler = new VectorAssembler().setInputCols(Array("affairs", "age", "yearsmarried", "religiousness", "education", "occupation", "rating", "genderVec", "childrenVec")).setOutputCol("features")
- assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_8ccd528981cd
- scala>
- scala> val vecDF: DataFrame = assembler.transform(encodeDF)
- vecDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 12 more fields]
- scala> vecDF.select("features").show
- 16/11/05 15:56:14 WARN Executor: 1 block locks were not released by TID = 11:
- [rdd_17_0]
- +--------------------+
- |            features|
- +--------------------+
- |[0.0,37.0,10.0,3....|
- |[0.0,27.0,4.0,4.0...|
- |[0.0,32.0,15.0,1....|
- |[0.0,57.0,15.0,5....|
- |[0.0,22.0,0.75,2....|
- |[0.0,32.0,1.5,2.0...|
- |[0.0,22.0,0.75,2....|
- |[0.0,57.0,15.0,2....|
- |[0.0,32.0,15.0,4....|
- |[0.0,22.0,1.5,4.0...|
- |[0.0,37.0,15.0,2....|
- |[0.0,27.0,4.0,4.0...|
- |[0.0,47.0,15.0,5....|
- |[0.0,22.0,1.5,2.0...|
- |[0.0,27.0,4.0,4.0...|
- |[0.0,37.0,15.0,1....|
- |[0.0,37.0,15.0,2....|
- |[0.0,22.0,0.75,3....|
- |[0.0,22.0,1.5,2.0...|
- |[0.0,27.0,10.0,2....|
- +--------------------+
- only showing top 20 rows
- scala>
- scala> // Standardization (zero mean, unit standard deviation)
- scala> val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(true)
- scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_2e35fbc29084
- scala>
- scala> // Compute summary statistics by fitting the StandardScaler.
- scala> val scalerModel = scaler.fit(vecDF)
- scalerModel: org.apache.spark.ml.feature.StandardScalerModel = stdScal_2e35fbc29084
- scala>
- scala> // Standardize each feature to zero mean and unit standard deviation.
- scala> val scaledData: DataFrame = scalerModel.transform(vecDF)
- scaledData: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 13 more fields]
- scala> // scaledData:DataFrame = [features: vector, scaledFeatures: vector]
- scala>
- scala> scaledData.select("features", "scaledFeatures").show
- 16/11/05 15:56:20 WARN Executor: 1 block locks were not released by TID = 13:
- [rdd_17_0]
- +--------------------+--------------------+
- |            features|      scaledFeatures|
- +--------------------+--------------------+
- |[0.0,37.0,10.0,3....|[-0.4413500298573...|
- |[0.0,27.0,4.0,4.0...|[-0.4413500298573...|
- |[0.0,32.0,15.0,1....|[-0.4413500298573...|
- |[0.0,57.0,15.0,5....|[-0.4413500298573...|
- |[0.0,22.0,0.75,2....|[-0.4413500298573...|
- |[0.0,32.0,1.5,2.0...|[-0.4413500298573...|
- |[0.0,22.0,0.75,2....|[-0.4413500298573...|
- |[0.0,57.0,15.0,2....|[-0.4413500298573...|
- |[0.0,32.0,15.0,4....|[-0.4413500298573...|
- |[0.0,22.0,1.5,4.0...|[-0.4413500298573...|
- |[0.0,37.0,15.0,2....|[-0.4413500298573...|
- |[0.0,27.0,4.0,4.0...|[-0.4413500298573...|
- |[0.0,47.0,15.0,5....|[-0.4413500298573...|
- |[0.0,22.0,1.5,2.0...|[-0.4413500298573...|
- |[0.0,27.0,4.0,4.0...|[-0.4413500298573...|
- |[0.0,37.0,15.0,1....|[-0.4413500298573...|
- |[0.0,37.0,15.0,2....|[-0.4413500298573...|
- |[0.0,22.0,0.75,3....|[-0.4413500298573...|
- |[0.0,22.0,1.5,2.0...|[-0.4413500298573...|
- |[0.0,27.0,10.0,2....|[-0.4413500298573...|
- +--------------------+--------------------+
- only showing top 20 rows
- scala>
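What the scaler does to a single column can be sketched in plain Scala. This is a minimal illustration, assuming the sample standard deviation (n - 1 denominator) that Spark's `StandardScaler` documents; the object and method names are hypothetical.

```scala
// Plain-Scala sketch (no Spark) of z-score standardization for one column.
object StandardizeSketch {
  // Subtract the mean, then divide by the sample standard deviation (n - 1 denominator).
  def standardize(column: Seq[Double]): Seq[Double] = {
    require(column.size > 1, "need at least two values")
    val mean = column.sum / column.size
    val variance = column.map(x => math.pow(x - mean, 2)).sum / (column.size - 1)
    val std = math.sqrt(variance)
    // A constant column has zero spread; map it to 0.0 instead of dividing by zero.
    column.map(x => if (std == 0.0) 0.0 else (x - mean) / std)
  }
}
```

This also explains why every displayed row of `scaledFeatures` starts with the same value: the first feature is `affairs`, which is 0.0 in all 20 rows shown, so its standardized value is identical in each of them.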
- scala> //##########################
- scala> // Principal component analysis
- scala> val pca = new PCA().setInputCol("scaledFeatures").setOutputCol("pcaFeatures").setK(3).fit(scaledData)
- 16/11/05 15:56:21 WARN Executor: 1 block locks were not released by TID = 14:
- [rdd_17_0]
- 16/11/05 15:56:22 WARN Executor: 1 block locks were not released by TID = 15:
- [rdd_17_0]
- 16/11/05 15:56:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
- 16/11/05 15:56:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
- 16/11/05 15:56:25 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
- 16/11/05 15:56:25 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
- pca: org.apache.spark.ml.feature.PCAModel = pca_8569d580d6e4
- scala> pca.explainedVariance.values // proportion of variance explained by each component
- res11: Array[Double] = Array(0.28779526464781313, 0.23798543640278289, 0.11742828783633019)
- scala> pca.pc // principal component matrix (one column per component; rows follow the input feature order)
- res12: org.apache.spark.ml.linalg.DenseMatrix =
- -0.12034310848156521 0.05153952289637974 0.6678769450480689
- -0.42860623714516627 0.05417889891307473 -0.05592377098140197
- -0.44404074412877986 0.1926596811059294 -0.017025575192258197
- -0.12233707317255231 0.08053139375662526 -0.5093149296300096
- -0.14664751606128462 -0.3872166556211308 -0.03406819489501708
- -0.145543746024348 -0.43054860653839705 0.07841454709046872
- 0.17703994181974803 -0.12792784984216296 -0.5173229755329072
- 0.2459668445061567 0.4915809641798787 0.010477548320795945
- -0.2459668445061567 -0.4915809641798787 -0.010477548320795945
- -0.44420980045271047 0.240652448514566 -0.089356723885704
- 0.4442098004527103 -0.24065244851456588 0.08935672388570405
- scala> pca.extractParamMap()
- res13: org.apache.spark.ml.param.ParamMap =
- {
- pca_8569d580d6e4-inputCol: scaledFeatures,
- pca_8569d580d6e4-k: 3,
- pca_8569d580d6e4-outputCol: pcaFeatures
- }
- scala> pca.params
- res14: Array[org.apache.spark.ml.param.Param[_]] = Array(pca_8569d580d6e4__inputCol, pca_8569d580d6e4__k, pca_8569d580d6e4__outputCol)
- scala>
- scala> val pcaDF: DataFrame = pca.transform(scaledData)
- pcaDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 14 more fields]
- scala> // pcaDF:DataFrame = [features: vector, scaledFeatures: vector,pcaFeatures: vector]
- scala> pcaDF.cache()
- res15: pcaDF.type = [affairs: double, age: double ... 14 more fields]
- scala>
- scala> pcaDF.printSchema()
- root
- |-- affairs: double (nullable = true)
- |-- age: double (nullable = true)
- |-- yearsmarried: double (nullable = true)
- |-- religiousness: double (nullable = true)
- |-- education: double (nullable = true)
- |-- occupation: double (nullable = true)
- |-- rating: double (nullable = true)
- |-- gender: string (nullable = true)
- |-- children: string (nullable = true)
- |-- genderIndex: double (nullable = true)
- |-- genderVec: vector (nullable = true)
- |-- childrenIndex: double (nullable = true)
- |-- childrenVec: vector (nullable = true)
- |-- features: vector (nullable = true)
- |-- scaledFeatures: vector (nullable = true)
- |-- pcaFeatures: vector (nullable = true)
- scala> pcaDF.select("features", "scaledFeatures", "pcaFeatures").show
- 16/11/05 15:56:36 WARN Executor: 1 block locks were not released by TID = 18:
- [rdd_64_0]
- +--------------------+--------------------+--------------------+
- |            features|      scaledFeatures|         pcaFeatures|
- +--------------------+--------------------+--------------------+
- |[0.0,37.0,10.0,3....|[-0.4413500298573...|[0.27828160409293...|
- |[0.0,27.0,4.0,4.0...|[-0.4413500298573...|[2.42147114101165...|
- |[0.0,32.0,15.0,1....|[-0.4413500298573...|[0.18301418047489...|
- |[0.0,57.0,15.0,5....|[-0.4413500298573...|[-2.9795960667914...|
- |[0.0,22.0,0.75,2....|[-0.4413500298573...|[1.79299133565688...|
- |[0.0,32.0,1.5,2.0...|[-0.4413500298573...|[2.65694237441759...|
- |[0.0,22.0,0.75,2....|[-0.4413500298573...|[3.48234503794570...|
- |[0.0,57.0,15.0,2....|[-0.4413500298573...|[-2.4215838062079...|
- |[0.0,32.0,15.0,4....|[-0.4413500298573...|[-0.6964555195741...|
- |[0.0,22.0,1.5,4.0...|[-0.4413500298573...|[2.18771069800414...|
- |[0.0,37.0,15.0,2....|[-0.4413500298573...|[-2.4259075891377...|
- |[0.0,27.0,4.0,4.0...|[-0.4413500298573...|[-0.7743038356008...|
- |[0.0,47.0,15.0,5....|[-0.4413500298573...|[-2.6176149267534...|
- |[0.0,22.0,1.5,2.0...|[-0.4413500298573...|[2.95788535193022...|
- |[0.0,27.0,4.0,4.0...|[-0.4413500298573...|[2.50146472861263...|
- |[0.0,37.0,15.0,1....|[-0.4413500298573...|[-0.5123817022008...|
- |[0.0,37.0,15.0,2....|[-0.4413500298573...|[-0.9191740114044...|
- |[0.0,22.0,0.75,3....|[-0.4413500298573...|[2.97391491782863...|
- |[0.0,22.0,1.5,2.0...|[-0.4413500298573...|[3.17940505267806...|
- |[0.0,27.0,10.0,2....|[-0.4413500298573...|[0.74585406839527...|
- +--------------------+--------------------+--------------------+
- only showing top 20 rows
- scala>
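A quick way to judge whether k = 3 is enough is to accumulate the explained-variance shares. The sketch below is plain Scala that reuses the three values printed by `pca.explainedVariance.values` above; the object name and the helper `componentsFor` are hypothetical.

```scala
// Plain-Scala sketch: cumulative share of variance retained by the first components.
object ExplainedVarianceSketch {
  // Values copied from the pca.explainedVariance output in the transcript.
  val explained: Vector[Double] =
    Vector(0.28779526464781313, 0.23798543640278289, 0.11742828783633019)

  // Running total: share of variance retained by the first 1, 2, and 3 components.
  val cumulative: Vector[Double] = explained.scanLeft(0.0)(_ + _).tail

  // Smallest number of components whose cumulative share reaches the target, if any.
  def componentsFor(target: Double): Option[Int] =
    cumulative.zipWithIndex.collectFirst { case (share, i) if share >= target => i + 1 }
}
```

The three components together retain roughly 64% of the variance, so a target such as 90% would require refitting the PCA with a larger k.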
- scala> //#####################################
- scala>
- scala> // Note the maximum number of iterations and the silhouette coefficient
- scala> val KSSE = (2 to 10 by 1).par.toList.map { k =>
- | // Clustering
- | // Trains a k-means model.
- | val kmeans = new KMeans().setK(k).setSeed(1L).setFeaturesCol("pcaFeatures")
- | val model = kmeans.fit(pcaDF)
- |
- | // Evaluate clustering by computing Within Set Sum of Squared Errors.
- | val WSSSE = model.computeCost(pcaDF)
- |
- | (k, WSSSE)
- | }
- KSSE: List[(Int, Double)] = List((2,2876.20580405469), (3,1680.6647048004902), (4,1395.7184052948346), (5,1239.9362814229812), (6,999.2793106095127), (7,849.0071338527408), (8,737.8560221633246), (9,771.8211752483357), (10,655.7836351785677))
- scala>
- scala> KSSE.foreach(println)
- (2,2876.20580405469)
- (3,1680.6647048004902)
- (4,1395.7184052948346)
- (5,1239.9362814229812)
- (6,999.2793106095127)
- (7,849.0071338527408)
- (8,737.8560221633246)
- (9,771.8211752483357)
- (10,655.7836351785677)
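The (k, WSSSE) pairs above can be inspected with a small plain-Scala helper to look for an elbow. This is a sketch only: the object name and its fields are hypothetical, and the numbers are copied from the output above.

```scala
// Plain-Scala sketch: how much WSSSE drops with each additional cluster.
object ElbowSketch {
  // (k, WSSSE) pairs copied from the KSSE output in the transcript.
  val ksse: List[(Int, Double)] = List(
    (2, 2876.20580405469), (3, 1680.6647048004902), (4, 1395.7184052948346),
    (5, 1239.9362814229812), (6, 999.2793106095127), (7, 849.0071338527408),
    (8, 737.8560221633246), (9, 771.8211752483357), (10, 655.7836351785677))

  // Drop in WSSSE gained by moving from k - 1 to k clusters.
  val drops: List[(Int, Double)] =
    ksse.zip(ksse.tail).map { case ((_, prev), (k, cur)) => (k, prev - cur) }

  // Values of k where the cost went up instead of down.
  val increases: List[Int] = drops.collect { case (k, drop) if drop < 0 => k }
}
```

The largest drop occurs when going from 2 to 3 clusters, and the cost rises at k = 9. The optimal WSSSE cannot increase as k grows, so the rise suggests that the run for k = 9 stopped in a local optimum; raising `setMaxIter` or trying another seed may smooth the curve. In later Spark releases `computeCost` was deprecated in favor of `ClusteringEvaluator`, which reports the silhouette coefficient mentioned in the comment above.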
Source: ITPUB blog, http://blog.itpub.net/29070860/viewspace-2127855/. Please credit the source when reposting.