mmlspark-101: TrainClassifier
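Predict whether a person's income exceeds $50K.
Dataset download: https://www.kaggle.com/uciml/adult-census-income/data
Note!
Install mmlspark. This walkthrough uses version 0.17; parts of the API have changed, and the notebooks in the official Git repository target an older version.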
shell
pyspark --master=spark://Lord:7077 --packages Azure:mmlspark:0.17
The package is downloaded automatically.
from pyspark import SparkConf
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.mllib.evaluation import RankingMetrics, RegressionMetrics
from pyspark.sql.types import StringType, FloatType, IntegerType, LongType
read and clean data
spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages", "Azure:mmlspark:0.17").getOrCreate()
data = spark.read.csv('hdfs:///user/hadoop/adult.csv', inferSchema=True, header=True)
data.limit(10).toPandas()
| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 5 | 34 | Private | 216864 | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 3770 | 45 | United-States | <=50K |
| 6 | 38 | Private | 150601 | 10th | 6 | Separated | Adm-clerical | Unmarried | White | Male | 0 | 3770 | 40 | United-States | <=50K |
| 7 | 74 | State-gov | 88638 | Doctorate | 16 | Never-married | Prof-specialty | Other-relative | White | Female | 0 | 3683 | 20 | United-States | >50K |
| 8 | 68 | Federal-gov | 422013 | HS-grad | 9 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 3683 | 40 | United-States | <=50K |
| 9 | 41 | Private | 70037 | Some-college | 10 | Never-married | Craft-repair | Unmarried | White | Male | 0 | 3004 | 60 | ? | >50K |
withColumnRenamed
data = data.withColumnRenamed('education.num', 'education_num') \
    .withColumnRenamed('marital.status', 'marital_status') \
    .withColumnRenamed('capital.gain', 'capital_gain') \
    .withColumnRenamed('capital.loss', 'capital_loss') \
    .withColumnRenamed('hours.per.week', 'hours_per_week') \
    .withColumnRenamed('native.country', 'native_country')
data.printSchema()
root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: integer (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: integer (nullable = true)
 |-- capital_loss: integer (nullable = true)
 |-- hours_per_week: integer (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)
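The renaming above lists each dotted column by hand. As a sketch only, the same mapping can be derived from the header names in plain Python. The helper name `sanitize_column_names` is hypothetical, and the list of names is copied from the table shown earlier.

```python
def sanitize_column_names(columns):
    """Return the column names with every '.' replaced by '_'."""
    return [name.replace(".", "_") for name in columns]


# Column names as they appear in the CSV header shown earlier.
raw_columns = [
    "age", "workclass", "fnlwgt", "education", "education.num",
    "marital.status", "occupation", "relationship", "race", "sex",
    "capital.gain", "capital.loss", "hours.per.week", "native.country",
    "income",
]

clean_columns = sanitize_column_names(raw_columns)
print(clean_columns)

# With a Spark DataFrame, the cleaned names could then be applied in one call:
# data = data.toDF(*clean_columns)
```

Names without a dot pass through unchanged, so the helper is safe to apply to the whole header.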
EDA
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# add label column
add_label = F.udf(lambda income: 0 if income == '<=50K' else 1, IntegerType())
data = data.withColumn('label', add_label(data['income']))
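The labeling rule inside the UDF can be restated in plain Python to make its edge cases visible. This is a sketch with a hypothetical helper name, `income_to_label`, and made-up inputs: any value other than the exact string '<=50K', including a missing value or a string with stray whitespace, is mapped to 1.

```python
def income_to_label(income):
    """Same rule as the UDF: 0 for exactly '<=50K', 1 for anything else."""
    return 0 if income == "<=50K" else 1


# Made-up inputs chosen to show the edge cases of the rule.
examples = ["<=50K", ">50K", None, " <=50K"]
labels = [income_to_label(value) for value in examples]
print(labels)  # [0, 1, 1, 1]
```

If the income column could contain missing or untrimmed values, it may be worth checking its distinct values before relying on this rule.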
# Income above $50K
data[data.label == 1].describe().toPandas()[['summary','age','education_num','hours_per_week']]
| | summary | age | education_num | hours_per_week |
|---|---|---|---|---|
| 0 | count | 7841 | 7841 | 7841 |
| 1 | mean | 44.24984058155847 | 11.611656676444332 | 45.473026399693914 |
| 2 | stddev | 10.519027719851813 | 2.385128632665079 | 11.01297093020927 |
| 3 | min | 19 | 2 | 1 |
| 4 | max | 90 | 16 | 99 |
# Income at or below $50K
data[data.label == 0].describe().toPandas()[['summary','age','education_num','hours_per_week']]
| | summary | age | education_num | hours_per_week |
|---|---|---|---|---|
| 0 | count | 24720 | 24720 | 24720 |
| 1 | mean | 36.78373786407767 | 9.595064724919094 | 38.840210355987054 |
| 2 | stddev | 14.020088490824895 | 2.4361467923083993 | 12.31899464185489 |
| 3 | min | 17 | 1 | 1 |
| 4 | max | 90 | 16 | 99 |
Distributions of age, education_num, and hours_per_week
ages_0 = data[data.label == 0].select('age').collect()
ages_1 = data[data.label == 1].select('age').collect()
plt.figure(figsize=(10, 5))
sns.distplot(ages_0, label='<=$50K')
sns.distplot(ages_1, label='>$50K')
plt.xlabel('age',fontsize=15)
plt.legend()
People earning more than $50K are clearly older on the whole than those earning $50K or less.
edus_0 = data[data.label == 0].select('education_num').collect()
edus_1 = data[data.label == 1].select('education_num').collect()
plt.figure(figsize=(10, 5))
sns.distplot(edus_0, label='<=$50K')
sns.distplot(edus_1, label='>$50K')
plt.xlabel('education_num',fontsize=15)
plt.legend()
hours_per_week_0 = data[data.label == 0].select('hours_per_week').collect()
hours_per_week_1 = data[data.label == 1].select('hours_per_week').collect()
plt.figure(figsize=(10, 5))
sns.distplot(hours_per_week_0, label='<=$50K')
sns.distplot(hours_per_week_1, label='>$50K')
plt.xlabel('hours_per_week',fontsize=15)
plt.legend()
Split Data and Train Models
data = data.select(["age", "education", "education_num", "marital_status", "hours_per_week", "income"])
train, test = data.randomSplit([0.75, 0.25], seed=20200420)
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, NaiveBayes
# lr : LogisticRegression
# dt : decisionTree
# nb : NaiveBayes
model_lr = TrainClassifier(model=LogisticRegression(), labelCol="income", numFeatures=256).fit(train)
model_dt = TrainClassifier(model=DecisionTreeClassifier(), labelCol="income", numFeatures=256).fit(train)
model_nb = TrainClassifier(model=NaiveBayes(), labelCol="income", numFeatures=256).fit(train)
model_lr.write().overwrite().save("../models/LrModel.mml")
evaluate
from mmlspark import ComputeModelStatistics, TrainedClassifierModel
predictionModel = TrainedClassifierModel.load("../models/LrModel.mml")
prediction = predictionModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.toPandas()
| | evaluation_type | confusion_matrix | accuracy | precision | recall | AUC |
|---|---|---|---|---|---|---|
| 0 | Classification | DenseMatrix([[5766., 473.],\n [ 9... | 0.824068 | 0.674242 | 0.503083 | 0.869809 |
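The accuracy, precision, and recall columns are all functions of the confusion matrix. The sketch below shows that relationship in plain Python. It assumes the matrix is laid out with actual classes in rows and predicted classes in columns, and it uses made-up counts, since the matrix in the output above is truncated. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(confusion):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix.

    Assumed layout: rows are actual classes, columns are predicted classes,
    i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against empty denominators when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up counts for illustration only; not the values from the run above.
hypothetical = [[80, 20], [10, 90]]
stats = binary_metrics(hypothetical)
print(stats)
```

With these made-up counts the accuracy is 170/200, the precision 90/110, and the recall 90/100.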
dt_prediction = model_dt.transform(test)
nb_prediction = model_nb.transform(test)
dt_metrics = ComputeModelStatistics().transform(dt_prediction)
nb_metrics = ComputeModelStatistics().transform(nb_prediction)
dt_metrics.toPandas()
| | evaluation_type | confusion_matrix | accuracy | precision | recall | AUC |
|---|---|---|---|---|---|---|
| 0 | Classification | DenseMatrix([[5943., 296.],\n [11... | 0.825901 | 0.734052 | 0.419836 | 0.672863 |
nb_metrics.toPandas()
| | evaluation_type | confusion_matrix | accuracy | precision | recall | AUC |
|---|---|---|---|---|---|---|
| 0 | Classification | DenseMatrix([[5824., 415.],\n [10... | 0.818448 | 0.678295 | 0.44964 | 0.247985 |
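An AUC below 0.5, such as the 0.247985 reported for Naive Bayes above, means the scores rank negatives above positives more often than the reverse. The sketch below computes AUC as a pairwise ranking probability on made-up labels and scores, and shows that negating the scores turns an AUC of a into 1 - a. It illustrates the metric only and does not diagnose why the Naive Bayes run scored this way. The function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores where positives mostly receive lower scores.
labels = [0, 0, 1, 1]
scores = [0.6, 0.8, 0.3, 0.7]

auc_original = pairwise_auc(labels, scores)
auc_negated = pairwise_auc(labels, [-s for s in scores])
print(auc_original, auc_negated)  # 0.25 0.75
```

The quadratic loop is fine for a toy example; on real data a rank-based formula or a library routine would be the practical choice.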