mmlspark 101

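Predict whether a person's income exceeds $50k
Data download: https://www.kaggle.com/uciml/adult-census-income/data
Note!!!
Installing mmlspark: this post uses version 0.17. Some APIs have changed since earlier releases, and the notebooks in the official Git repository target an older version.
shell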

pyspark --master=spark://Lord:7077 --packages Azure:mmlspark:0.17


The package is downloaded automatically.

from pyspark import SparkConf
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.mllib.evaluation import RankingMetrics, RegressionMetrics
from pyspark.sql.types import StringType, FloatType, IntegerType, LongType

read and clean data

spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages", "Azure:mmlspark:0.17").getOrCreate()
data = spark.read.csv('hdfs:///user/hadoop/adult.csv', inferSchema=True, header=True)
data.limit(10).toPandas()
age workclass fnlwgt education education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country income
0 90 ? 77053 HS-grad 9 Widowed ? Not-in-family White Female 0 4356 40 United-States <=50K
1 82 Private 132870 HS-grad 9 Widowed Exec-managerial Not-in-family White Female 0 4356 18 United-States <=50K
2 66 ? 186061 Some-college 10 Widowed ? Unmarried Black Female 0 4356 40 United-States <=50K
3 54 Private 140359 7th-8th 4 Divorced Machine-op-inspct Unmarried White Female 0 3900 40 United-States <=50K
4 41 Private 264663 Some-college 10 Separated Prof-specialty Own-child White Female 0 3900 40 United-States <=50K
5 34 Private 216864 HS-grad 9 Divorced Other-service Unmarried White Female 0 3770 45 United-States <=50K
6 38 Private 150601 10th 6 Separated Adm-clerical Unmarried White Male 0 3770 40 United-States <=50K
7 74 State-gov 88638 Doctorate 16 Never-married Prof-specialty Other-relative White Female 0 3683 20 United-States >50K
8 68 Federal-gov 422013 HS-grad 9 Divorced Prof-specialty Not-in-family White Female 0 3683 40 United-States <=50K
9 41 Private 70037 Some-college 10 Never-married Craft-repair Unmarried White Male 0 3004 60 ? >50K

withColumnRenamed

data = data.withColumnRenamed('education.num', 'education_num')\
    .withColumnRenamed('marital.status', 'marital_status')\
    .withColumnRenamed('capital.gain', 'capital_gain')\
    .withColumnRenamed('capital.loss', 'capital_loss')\
    .withColumnRenamed('hours.per.week', 'hours_per_week')\
    .withColumnRenamed('native.country', 'native_country')
data.printSchema()
root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: integer (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: integer (nullable = true)
 |-- capital_loss: integer (nullable = true)
 |-- hours_per_week: integer (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)

EDA

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# add label column
add_label = F.udf(lambda income: 0 if income == '<=50K' else 1, IntegerType())
data = data.withColumn('label', add_label(data['income']))
# income greater than $50k
data[data.label == 1].describe().toPandas()[['summary','age','education_num','hours_per_week']]
summary age education_num hours_per_week
0 count 7841 7841 7841
1 mean 44.24984058155847 11.611656676444332 45.473026399693914
2 stddev 10.519027719851813 2.385128632665079 11.01297093020927
3 min 19 2 1
4 max 90 16 99
# income of $50k or less
data[data.label == 0].describe().toPandas()[['summary','age','education_num','hours_per_week']]
summary age education_num hours_per_week
0 count 24720 24720 24720
1 mean 36.78373786407767 9.595064724919094 38.840210355987054
2 stddev 14.020088490824895 2.4361467923083993 12.31899464185489
3 min 17 1 1
4 max 90 16 99
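
The two `count` rows above also give the class balance, which is worth keeping in mind when reading accuracy figures. A minimal sketch in plain Python, using only the counts printed above (the variable names are my own, not part of the original notebook):

```python
# Row counts taken from the two describe() outputs above.
n_high = 7841    # label == 1, income > $50K
n_low = 24720    # label == 0, income <= $50K

total = n_high + n_low
positive_rate = n_high / total
# Accuracy of a trivial model that always predicts the majority class (<=50K).
majority_baseline = n_low / total

print(f"total rows:        {total}")
print(f"share of >$50K:    {positive_rate:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Roughly three quarters of the rows fall in the `<=50K` class, so on a split with a similar balance a classifier needs an accuracy clearly above about 0.76 to beat always predicting `<=50K`.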

Distributions of age, education_num, and hours_per_week

ages_0 = data[data.label == 0].select('age').collect()
ages_1 = data[data.label == 1].select('age').collect()
plt.figure(figsize=(10, 5))
sns.distplot(ages_0, label='<=$50K')
sns.distplot(ages_1, label='>$50K')
plt.xlabel('age',fontsize=15)
plt.legend()

Clearly, people earning more than $50K are, on the whole, older than those earning $50K or less.
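
To put a rough number on that age gap, one option is a standardized mean difference (Cohen's d) computed from the `describe()` statistics shown earlier. This is an illustrative addition of mine, not a step from the original notebook; the values are copied (rounded) from the tables above and the variable names are hypothetical.

```python
import math

# Summary statistics for `age`, copied (rounded) from the describe() outputs.
n1, mean1, sd1 = 7841, 44.2498, 10.5190    # income > $50K
n0, mean0, sd0 = 24720, 36.7837, 14.0201   # income <= $50K

# Pooled standard deviation of the two groups.
pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n0 - 1) * sd0**2) / (n1 + n0 - 2))

mean_gap = mean1 - mean0
cohens_d = mean_gap / pooled_sd

print(f"mean age gap: {mean_gap:.2f} years")
print(f"Cohen's d:    {cohens_d:.2f}")
```

The mean gap is about seven and a half years, a bit more than half a pooled standard deviation, which matches the visible shift between the two curves.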

edus_0 = data[data.label == 0].select('education_num').collect()
edus_1 = data[data.label == 1].select('education_num').collect()
plt.figure(figsize=(10, 5))
sns.distplot(edus_0, label='<=$50K')
sns.distplot(edus_1, label='>$50K')
plt.xlabel('education_num',fontsize=15)
plt.legend()

hours_per_week_0 = data[data.label == 0].select('hours_per_week').collect()
hours_per_week_1 = data[data.label == 1].select('hours_per_week').collect()
plt.figure(figsize=(10, 5))
sns.distplot(hours_per_week_0, label='<=$50K')
sns.distplot(hours_per_week_1, label='>$50K')
plt.xlabel('hours_per_week',fontsize=15)
plt.legend()

Split Data and Train Models

data = data.select(["age", "education", "education_num", "marital_status", "hours_per_week", "income"])
train, test = data.randomSplit([0.75, 0.25], seed=20200420)
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, NaiveBayes
# lr : LogisticRegression
# dt : decisionTree
# nb : NaiveBayes
model_lr = TrainClassifier(model=LogisticRegression(), labelCol="income", numFeatures=256).fit(train)
model_dt = TrainClassifier(model=DecisionTreeClassifier(), labelCol="income", numFeatures=256).fit(train)
model_nb = TrainClassifier(model=NaiveBayes(), labelCol="income", numFeatures=256).fit(train)
model_lr.write().overwrite().save("../models/LrModel.mml")

evaluate

from mmlspark import ComputeModelStatistics, TrainedClassifierModel
predictionModel = TrainedClassifierModel.load("../models/LrModel.mml")
prediction = predictionModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.toPandas()
evaluation_type confusion_matrix accuracy precision recall AUC
0 Classification DenseMatrix([[5766., 473.],\n [ 9... 0.824068 0.674242 0.503083 0.869809
dt_prediction = model_dt.transform(test)
nb_prediction = model_nb.transform(test)
dt_metrics = ComputeModelStatistics().transform(dt_prediction)
nb_metrics = ComputeModelStatistics().transform(nb_prediction)
dt_metrics.toPandas()
evaluation_type confusion_matrix accuracy precision recall AUC
0 Classification DenseMatrix([[5943., 296.],\n [11... 0.825901 0.734052 0.419836 0.672863
nb_metrics.toPandas()
evaluation_type confusion_matrix accuracy precision recall AUC
0 Classification DenseMatrix([[5824., 415.],\n [10... 0.818448 0.678295 0.44964 0.247985
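
The three tables are easier to compare with a single number that balances precision and recall. The sketch below computes the F1 score from the precision and recall values reported above. It is plain Python; the dictionary layout and the helper name `f1_score` are my own choices, not part of mmlspark.

```python
# Precision, recall and AUC copied from the three metrics tables above.
reported = {
    "lr": {"precision": 0.674242, "recall": 0.503083, "auc": 0.869809},
    "dt": {"precision": 0.734052, "recall": 0.419836, "auc": 0.672863},
    "nb": {"precision": 0.678295, "recall": 0.449640, "auc": 0.247985},
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = {name: f1_score(m["precision"], m["recall"]) for name, m in reported.items()}

# Print the models from best to worst F1, next to their reported AUC.
for name in sorted(f1, key=f1.get, reverse=True):
    print(f"{name}: F1={f1[name]:.4f}  AUC={reported[name]['auc']:.4f}")
```

All three models land at a similar accuracy of about 0.82, but they differ much more in recall and AUC. The NaiveBayes AUC of 0.248 is below 0.5, meaning it ranks examples worse than chance would, so that figure is worth double-checking before drawing conclusions from it.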
