Processing the Adult census data with Spark DataFrame syntax and SQL: comparing education level and divorce rate
Introduction
This post works through the Machine-Learning-Databases human census data from Kaggle. The dataset is more than twenty years old, but its format is clear and well documented, which makes it a good dataset for getting started.
We use Spark's DataFrame API to do basic processing and output, covering the following:
- Creating a SparkSession
- Creating a DataFrame and basic operations: filter, groupBy, select, orderBy
- Simple SQL and Hive queries
- Saving a DataFrame with bucketing (bucketBy) and partitioning (partitionBy)
- Writing a DataFrame to, and reading it back from, CSV files
GitHub repository: Truedick23 - AdultBase
Setup
- Language: Scala 2.11
- Spark version: Spark 2.3.1
- IDE: IntelliJ IDEA 2018.2.3
- Build: sbt project
Example build.sbt:
(adjust to your own Spark version; look up the exact coordinates on MVN Repository: Apache Spark)
name := "AdultBase"
version := "0.1"
scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib-local
libraryDependencies += "org.apache.spark" %% "spark-mllib-local" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.3.1"
Getting to know the dataset
First, open the downloaded adult.names file. After line 96 you will find the following block, which documents the fields of adult.data and adult.test and their meanings:
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Next, inspect adult.data itself, paying attention to the column separator, the row separator, and the number of columns:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
.....
The column separator is ", " (comma followed by a space), which is not the usual CSV convention; the row separator is "\n", the newline character; and each row holds 15 fields. Getting these details right determines whether we can read the data correctly later.
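As a quick sanity check of the separator logic, plain Scala (no Spark required) is enough; the sample line below is the first row of adult.data shown above:

```scala
// Split one sample row on ", " (comma + space) and check the field count.
val line = "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, " +
  "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
val fields = line.split(", ")
println(fields.length)    // 15 fields per valid row
println(fields(0).toInt)  // the age field parses as Int: 39
```

Rows that do not split into exactly 15 fields (e.g. blank lines) are exactly what the `filter` step below discards.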
Writing the code
Creating the DataFrame
First we define an Adult case class based on the field list in adult.names:
case class Adult(age: Int, workclass: String, fnlwgt: Int,
                 education: String, education_num: Int,
                 maritial_status: String, occupation: String,
                 relationship: String, race: String,
                 sex: String, capital_gain: Int, capital_loss: Int,
                 hours_per_week: Int, native_country: String)
Then we build a SparkSession, which we use to read the data and create the DataFrame:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("AdultBaseDatabase")
  .getOrCreate()
Here master specifies that the program runs locally, and appName sets the name of the SparkSession.
Because our column separator is unusual, we take a more general approach: read the file with textFile, split each line with split, keep only lines with exactly 15 fields via filter, map each line to an Adult instance, and finally call toDF() to build the DataFrame:
import spark.implicits._

val df = spark.sparkContext
  .textFile("./data/machine-learning-databases/adult.data")
  .map(fields => fields.split(", "))
  .filter(lines => lines.length == 15)
  .map(attributes => Adult(
    attributes(0).toInt, attributes(1), attributes(2).toInt, attributes(3),
    attributes(4).toInt, attributes(5), attributes(6), attributes(7),
    attributes(8), attributes(9), attributes(10).toInt, attributes(11).toInt,
    attributes(12).toInt, attributes(13)))
  .toDF()
We can print the schema, the record count, and the first few rows to check the result:
df.printSchema()
println(df.count())
df.show(8)
Output:
root
 |-- age: integer (nullable = false)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = false)
 |-- education: string (nullable = true)
 |-- education_num: integer (nullable = false)
 |-- maritial_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: integer (nullable = false)
 |-- capital_loss: integer (nullable = false)
 |-- hours_per_week: integer (nullable = false)
 |-- native_country: string (nullable = true)
32561
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
|age| workclass|fnlwgt|education|education_num| maritial_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States|
| 50|Self-emp-not-inc| 83311|Bachelors| 13| Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States|
| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States|
| 53| Private|234721| 11th| 7| Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States|
| 28| Private|338409|Bachelors| 13| Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba|
| 37| Private|284582| Masters| 14| Married-civ-spouse| Exec-managerial| Wife|White|Female| 0| 0| 40| United-States|
| 49| Private|160187| 9th| 5|Married-spouse-ab...| Other-service|Not-in-family|Black|Female| 0| 0| 16| Jamaica|
| 52|Self-emp-not-inc|209642| HS-grad| 9| Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 45| United-States|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
only showing top 8 rows
There are 32561 records in total: the program first prints the schema as a tree, then the first 8 rows. Next, some quick experiments with the basic operations:
Basic operations
We use filter to select records satisfying certain conditions — pay attention to the expression syntax!
First we filter for people older than 70, then for records whose race is Other.
Note the use of the "$" column selector, and that equality is written "===":
val elderly_data = df.filter($"age" > 70)
println(elderly_data.count())
elderly_data.show(8)
df.filter($"race" === "Other").show(8)
Output:
540
+---+----------------+------+------------+-------------+------------------+---------------+--------------+-----+------+------------+------------+--------------+--------------+
|age| workclass|fnlwgt| education|education_num| maritial_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+----------------+------+------------+-------------+------------------+---------------+--------------+-----+------+------------+------------+--------------+--------------+
| 79| Private|124744|Some-college| 10|Married-civ-spouse| Prof-specialty|Other-relative|White| Male| 0| 0| 20| United-States|
| 76| Private|124191| Masters| 14|Married-civ-spouse|Exec-managerial| Husband|White| Male| 0| 0| 40| United-States|
| 71|Self-emp-not-inc|494223|Some-college| 10| Separated| Sales| Unmarried|Black| Male| 0| 1816| 2| United-States|
| 90| Private| 51744| HS-grad| 9| Never-married| Other-service| Not-in-family|Black| Male| 0| 2206| 40| United-States|
| 75| Private|314209| Assoc-voc| 11| Widowed| Adm-clerical| Not-in-family|White|Female| 0| 0| 20| Columbia|
| 77|Self-emp-not-inc|138714|Some-college| 10|Married-civ-spouse| Sales| Husband|White| Male| 0| 0| 40| United-States|
| 76|Self-emp-not-inc|174309| Masters| 14|Married-civ-spouse| Craft-repair| Husband|White| Male| 0| 0| 10| United-States|
| 80| ?|107762| HS-grad| 9| Widowed| ?| Not-in-family|White| Male| 0| 0| 24| United-States|
+---+----------------+------+------------+-------------+------------------+---------------+--------------+-----+------+------------+------------+--------------+--------------+
only showing top 8 rows
+---+---------+------+------------+-------------+------------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+
|age|workclass|fnlwgt| education|education_num| maritial_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+---------+------+------------+-------------+------------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+
| 25| Private| 32275|Some-college| 10|Married-civ-spouse| Exec-managerial| Wife|Other|Female| 0| 0| 40| United-States|
| 33| Private|110978|Some-college| 10| Divorced| Craft-repair|Other-relative|Other|Female| 0| 0| 40| United-States|
| 65| Private|161400| 11th| 7| Widowed| Other-service| Unmarried|Other| Male| 0| 0| 40| United-States|
| 28| Private|166481| 7th-8th| 4|Married-civ-spouse|Handlers-cleaners| Husband|Other| Male| 0| 2179| 40| Puerto-Rico|
| 44| Private|109339| 11th| 7| Divorced|Machine-op-inspct| Unmarried|Other|Female| 0| 0| 46| Puerto-Rico|
| 40| Private|237601| Bachelors| 13| Never-married| Sales| Not-in-family|Other|Female| 0| 0| 55| United-States|
| 26| Private|195105| HS-grad| 9| Never-married| Sales| Not-in-family|Other| Male| 0| 0| 40| United-States|
| 31| Private|175548| HS-grad| 9| Never-married| Other-service| Not-in-family|Other|Female| 0| 0| 35| United-States|
+---+---------+------+------------+-------------+------------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+
only showing top 8 rows
Both filter calls worked as expected; there are 540 records with age above 70.
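For reference, the same age predicate can be written in several equivalent ways in Spark 2.3 — a sketch; all three produce the same query plan:

```scala
import org.apache.spark.sql.functions.col

val a = df.filter($"age" > 70)      // $-interpolator (requires import spark.implicits._)
val b = df.filter(col("age") > 70)  // col() from org.apache.spark.sql.functions
val c = df.filter("age > 70")       // SQL expression string
```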
Next we count people by race and by education: groupBy merges records that share a value, count tallies them, and orderBy sorts the result:
df.groupBy($"race").count().orderBy("count").show()
df.groupBy($"education").count().orderBy("count").show()
Output:
+------------------+-----+
| race|count|
+------------------+-----+
| Other| 271|
|Amer-Indian-Eskimo| 311|
|Asian-Pac-Islander| 1039|
| Black| 3124|
| White|27816|
+------------------+-----+
+------------+-----+
| education|count|
+------------+-----+
| Preschool| 51|
| 1st-4th| 168|
| 5th-6th| 333|
| Doctorate| 413|
| 12th| 433|
| 9th| 514|
| Prof-school| 576|
| 7th-8th| 646|
| 10th| 933|
| Assoc-acdm| 1067|
| 11th| 1175|
| Assoc-voc| 1382|
| Masters| 1723|
| Bachelors| 5355|
|Some-college| 7291|
| HS-grad|10501|
+------------+-----+
Roughly speaking, white people and high-school graduates make up the majority of the sample.
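To see shares rather than raw counts, the race tally can be extended with a percentage column — a sketch; the column name pct is illustrative:

```scala
import org.apache.spark.sql.functions.{col, desc, lit, round}

// Collect the total once, then express each group count as a percentage of it.
val total = df.count()
df.groupBy($"race")
  .count()
  .withColumn("pct", round(col("count") / lit(total.toDouble) * 100, 2))
  .orderBy(desc("count"))
  .show()
```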
Next we use an SQL statement to select the record with a given fnlwgt and print it in a formatted way. createOrReplaceTempView registers the dataset as a temporary view for SQL queries, and getAs extracts individual columns from a row:
df.createOrReplaceTempView("adult")
val dude = spark.sql("SELECT * FROM adult WHERE fnlwgt=77516")
println(dude.map(info =>
  "Aged " + info.getAs[String]("age") +
  ", occupation is " + info.getAs[String]("occupation") +
  ", race is " + info.getAs[String]("race")).first())
Output:
Aged 39, occupation is Adm-clerical, race is White
Computing and comparing divorce rates across education levels
Since we want to print percentages, we import java.text.NumberFormat to format them. We then call filter several times to compute the divorce rate of each group, and print the formatted results:
import java.text.NumberFormat

val format = NumberFormat.getPercentInstance
format.setMaximumFractionDigits(4)

val total_divorce_rate: Double =
  df.filter($"maritial_status" === "Divorced").count.toDouble / df.count.toDouble
val doc_divorce_rate: Double =
  df.filter($"education" === "Doctorate" and $"maritial_status" === "Divorced").count.toDouble /
    df.filter($"education" === "Doctorate").count.toDouble
val preschool_divorce_rate: Double =
  df.filter($"education" === "Preschool" and $"maritial_status" === "Divorced").count.toDouble /
    df.filter($"education" === "Preschool").count.toDouble
val HS_divorce_rate: Double =
  df.filter($"education" === "HS-grad" and $"maritial_status" === "Divorced").count.toDouble /
    df.filter($"education" === "HS-grad").count.toDouble

println("Doctorate divorce rate: " + format.format(doc_divorce_rate))
println("Preschool divorce rate: " + format.format(preschool_divorce_rate))
println("High School graduation divorce rate: " + format.format(HS_divorce_rate))
println("Total divorce rate: " + format.format(total_divorce_rate))
Output:
Doctorate divorce rate: 7.9903%
Preschool divorce rate: 1.9608%
High School graduation divorce rate: 15.3604%
Total divorce rate: 13.6452%
These four numbers are the divorce rates of doctorate holders, people with no schooling, high-school graduates, and the whole population. The uneducated group's divorce rate is strikingly low, while the high-school-graduate group — the largest — exceeds the overall average.
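As an aside, the repeated filter/count calls above scan the data once per group; the same rate can be computed for every education level in a single groupBy pass, since the mean of a 0/1 indicator is exactly the fraction of divorced records. A sketch — the column name divorce_rate is illustrative, and the other column names match the schema defined earlier:

```scala
import org.apache.spark.sql.functions.{avg, desc, round, when}

df.groupBy($"education")
  .agg(round(avg(when($"maritial_status" === "Divorced", 1.0).otherwise(0.0)) * 100, 4)
    .as("divorce_rate"))
  .orderBy(desc("divorce_rate"))
  .show()
```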
Bucketing (bucketBy) and partitioning (partitionBy)
We partition the dataset by workclass and bucket it by relationship into 6 buckets, saving the result as a table named workclass_relationship_table:
df.write
  .partitionBy("workclass")
  .bucketBy(6, "relationship")
  .saveAsTable("workclass_relationship_table")
After running this, a new directory for the table appears under spark-warehouse.
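The saved table can then be loaded back by name through the same session, since its metadata lives in the session's metastore — a minimal sketch:

```scala
// Reload the partitioned, bucketed table by name and peek at it.
val saved = spark.table("workclass_relationship_table")
saved.printSchema()
saved.show(4)
```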
Saving to CSV, reading it back, and querying with Hive
The following code saves the marital status, fnlwgt, sex, occupation, and education of every doctorate-level record:
df.filter($"education" === "Doctorate")
  .select($"maritial_status", $"fnlwgt", $"sex", $"occupation", $"education")
  .write.csv("./data/doctorates_marriage_occupation_data")
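Note that csv() writes no header row by default, so reading the files back without a schema yields columns named _c0, _c1, and so on. A sketch of reading with explicit names — the DDL schema string here is an assumption matching the select above:

```scala
// Read the CSV files back with an explicit schema so columns keep their names.
val docs = spark.read
  .schema("maritial_status STRING, fnlwgt INT, sex STRING, occupation STRING, education STRING")
  .csv("./data/doctorates_marriage_occupation_data")
docs.show(4)
```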
We create a SparkSession with Hive support enabled for working with Hive tables:
import java.io.File

val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Doctorate Hive")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
The Hive statements below create a new table and load the data into it. The Doctorate case class describes the row layout, and the DDL spells out the field and line separators:
import spark.implicits._
import spark.sql

case class Doctorate(marriage: String, id: Int, sex: String, occupation: String, education: String)

sql("CREATE TABLE IF NOT EXISTS doc (marriage STRING, id INT, sex STRING, occupation STRING, education STRING) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
  "LINES TERMINATED BY '\n' STORED AS TEXTFILE")
sql("LOAD DATA LOCAL INPATH './data/doctorates_marriage_occupation_data' OVERWRITE INTO TABLE doc")
Then we tally the sex and occupation columns:
val doc_table = sql("SELECT * FROM doc")
val marriage_doc = doc_table.groupBy("marriage").count().orderBy("count")
doc_table.groupBy("sex").count().orderBy("count").show()
doc_table.groupBy("occupation").count().orderBy("count").show()
Output:
+------+-----+
| sex|count|
+------+-----+
|Female| 86|
| Male| 327|
+------+-----+
+-----------------+-----+
| occupation|count|
+-----------------+-----+
| Transport-moving| 1|
|Machine-op-inspct| 1|
| Farming-fishing| 1|
| Other-service| 1|
| Craft-repair| 2|
| Tech-support| 3|
| Adm-clerical| 5|
| Sales| 8|
| ?| 15|
| Exec-managerial| 55|
| Prof-specialty| 321|
+-----------------+-----+
Finally, we format each record as a sentence. Passing truncate = false to show prevents long string columns from being cut off in the output:
import org.apache.spark.sql.Row

val divorced_male = sql("SELECT * FROM doc WHERE sex='Male' AND marriage='Divorced' ORDER BY ID")
divorced_male.show(8)
val recordsDoc = divorced_male.map {
  case Row(marriage: String, id: Int, sex: String, occupation: String, education: String) =>
    val suffix = sex match {
      case "Male"   => "His"
      case "Female" => "Her"
    }
    s"Doc. $id is $sex, $suffix marriage status is $marriage, occupation is $occupation"
}
recordsDoc.show(8, truncate = false)
Output:
+----------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------+
|Doc. 42924 is Male, His marriage status is Divorced, occupation is Exec-managerial|
|Doc. 71772 is Male, His marriage status is Divorced, occupation is Prof-specialty |
|Doc. 140644 is Male, His marriage status is Divorced, occupation is Prof-specialty|
|Doc. 145333 is Male, His marriage status is Divorced, occupation is Prof-specialty|
|Doc. 148171 is Male, His marriage status is Divorced, occupation is Prof-specialty|
|Doc. 160120 is Male, His marriage status is Divorced, occupation is Adm-clerical |
|Doc. 170769 is Male, His marriage status is Divorced, occupation is Sales |
|Doc. 195343 is Male, His marriage status is Divorced, occupation is Prof-specialty|
+----------------------------------------------------------------------------------+
only showing top 8 rows
References
- Spark Docs - SparkSession
- Spark SQL, DataFrames and Datasets Guide