Introduction

This post works through the Adult census dataset from Machine-Learning-Databases on Kaggle. The dataset is more than twenty years old, but its format is well defined, which makes it a good dataset for getting started.

Using Spark's DataFrame API, we perform basic processing on the data and print the results, covering:

  1. Building a SparkSession
  2. Creating a DataFrame and applying filtering (filter), grouping (groupBy), column selection (select), and sorting (orderBy)
  3. Simple SQL and Hive operations
  4. Saving a DataFrame with bucketing (bucketBy) and partitioning (partitionBy)
  5. Writing a DataFrame to and reading it back from CSV files

GitHub repository: Truedick23 - AdultBase

Setup

  • Language: Scala 2.11
  • Spark version: Spark 2.3.1
  • Environment: IntelliJ IDEA 2018.2.3
  • Build tool: sbt project

Example build.sbt:
(adjust the versions to match your Spark installation; look them up on MVN Repository: Apache Spark)

name := "AdultBase"
version := "0.1"
scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib-local
libraryDependencies += "org.apache.spark" %% "spark-mllib-local" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.3.1"

Getting to Know the Dataset

First, open the downloaded adult.names file. After line 96 you will find the following block, which describes the format and meaning of the fields in adult.data and adult.test:

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Next, inspect adult.data itself, paying particular attention to its field separator, record separator, and the number of columns:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
.....

From this we can tell that the field separator is ", " (a comma followed by a space), which differs from the usual CSV convention; the record separator is "\n", the newline character; and each row carries 15 fields. Pinning down these three details determines whether we can read the data in correctly later.
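Before wiring this into Spark, the split-and-check rule can be sanity-tested in plain Scala on a single sample line (the line below is copied from adult.data):

```scala
// Quick plain-Scala check of the parsing rule described above (no Spark needed):
// split one adult.data line on ", " and confirm it yields 15 fields.
val sample = "39, State-gov, 77516, Bachelors, 13, Never-married, " +
  "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"

val fields = sample.split(", ")
println(fields.length)   // 15
println(fields(1))       // State-gov
println(fields(0).toInt) // 39 -- numeric fields parse with .toInt
```

If the separator were taken to be just "," the leading spaces would survive and the `.toInt` conversions below would fail, which is why the comma-plus-space split matters.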

Writing the Code

Creating the DataFrame

First, we define an Adult case class based on the field descriptions in adult.names:

    case class Adult(age: Int, workclass: String, fnlwgt: Int,
                     education: String, education_num: Int,
                     maritial_status: String, occupation: String,
                     relationship: String, race: String,
                     sex: String, capital_gain: Int, capital_loss: Int,
                     hours_per_week: Int, native_country: String)

Then we build a SparkSession via its builder, which we will use to read the data and create the DataFrame:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("AdultBaseDatabase")
      .getOrCreate()

Here master("local[*]") makes the program run on the local machine, and appName sets the name of the SparkSession.

Because the field separator is unusual, we take a fairly general approach: read the file with textFile, split each line with split, keep only lines with exactly 15 fields using filter, build an Adult instance from each line, and call toDF to obtain the DataFrame:

    import spark.implicits._

    val df = spark.sparkContext
      .textFile("./data/machine-learning-databases/adult.data")
      .map(fields => fields.split(", "))
      .filter(lines => lines.length == 15)
      .map(attributes => Adult(
        attributes(0).toInt, attributes(1), attributes(2).toInt, attributes(3),
        attributes(4).toInt, attributes(5), attributes(6), attributes(7),
        attributes(8), attributes(9), attributes(10).toInt, attributes(11).toInt,
        attributes(12).toInt, attributes(13)))
      .toDF()

As a sanity check, we can print its schema, the record count, and the first few rows:

    df.printSchema()
    println(df.count())
    df.show(8)

The output:

root
 |-- age: integer (nullable = false)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = false)
 |-- education: string (nullable = true)
 |-- education_num: integer (nullable = false)
 |-- maritial_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: integer (nullable = false)
 |-- capital_loss: integer (nullable = false)
 |-- hours_per_week: integer (nullable = false)
 |-- native_country: string (nullable = true)

32561
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
|age|       workclass|fnlwgt|education|education_num|     maritial_status|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
| 39|       State-gov| 77516|Bachelors|           13|       Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|  Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States|
| 38|         Private|215646|  HS-grad|            9|            Divorced|Handlers-cleaners|Not-in-family|White|  Male|           0|           0|            40| United-States|
| 53|         Private|234721|     11th|            7|  Married-civ-spouse|Handlers-cleaners|      Husband|Black|  Male|           0|           0|            40| United-States|
| 28|         Private|338409|Bachelors|           13|  Married-civ-spouse|   Prof-specialty|         Wife|Black|Female|           0|           0|            40|          Cuba|
| 37|         Private|284582|  Masters|           14|  Married-civ-spouse|  Exec-managerial|         Wife|White|Female|           0|           0|            40| United-States|
| 49|         Private|160187|      9th|            5|Married-spouse-ab...|    Other-service|Not-in-family|Black|Female|           0|           0|            16|       Jamaica|
| 52|Self-emp-not-inc|209642|  HS-grad|            9|  Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            45| United-States|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
only showing top 8 rows

There are 32,561 records in total. The program first prints the table's schema as a tree, then successfully prints the first 8 rows. Next, let's run a few simple experiments with the basic operations.

Basic Operations

Let's use filter to print records that satisfy certain conditions, minding the syntax. First we select people older than 70, then records whose race is Other. Note the $ prefix for columns, and that equality is written as === :

    val elderly_data = df.filter($"age" > 70)
    println(elderly_data.count())
    elderly_data.show(8)

    df.filter($"race" === "Other").show(8)

The output:

540
+---+----------------+------+------------+-------------+------------------+---------------+--------------+-----+------+------------+------------+--------------+--------------+
|age|       workclass|fnlwgt|   education|education_num|   maritial_status|     occupation|  relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+----------------+------+------------+-------------+------------------+---------------+--------------+-----+------+------------+------------+--------------+--------------+
| 79|         Private|124744|Some-college|           10|Married-civ-spouse| Prof-specialty|Other-relative|White|  Male|           0|           0|            20| United-States|
| 76|         Private|124191|     Masters|           14|Married-civ-spouse|Exec-managerial|       Husband|White|  Male|           0|           0|            40| United-States|
| 71|Self-emp-not-inc|494223|Some-college|           10|         Separated|          Sales|     Unmarried|Black|  Male|           0|        1816|             2| United-States|
| 90|         Private| 51744|     HS-grad|            9|     Never-married|  Other-service| Not-in-family|Black|  Male|           0|        2206|            40| United-States|
| 75|         Private|314209|   Assoc-voc|           11|           Widowed|   Adm-clerical| Not-in-family|White|Female|           0|           0|            20|      Columbia|
| 77|Self-emp-not-inc|138714|Some-college|           10|Married-civ-spouse|          Sales|       Husband|White|  Male|           0|           0|            40| United-States|
| 76|Self-emp-not-inc|174309|     Masters|           14|Married-civ-spouse|   Craft-repair|       Husband|White|  Male|           0|           0|            10| United-States|
| 80|               ?|107762|     HS-grad|            9|           Widowed|              ?| Not-in-family|White|  Male|           0|           0|            24| United-States|
+---+----------------+------+------------+-------------+------------------+---------------+--------------+-----+------+------------+------------+--------------+--------------+
only showing top 8 rows

+---+---------+------+------------+-------------+------------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+
|age|workclass|fnlwgt|   education|education_num|   maritial_status|       occupation|  relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+---------+------+------------+-------------+------------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+
| 25|  Private| 32275|Some-college|           10|Married-civ-spouse|  Exec-managerial|          Wife|Other|Female|           0|           0|            40| United-States|
| 33|  Private|110978|Some-college|           10|          Divorced|     Craft-repair|Other-relative|Other|Female|           0|           0|            40| United-States|
| 65|  Private|161400|        11th|            7|           Widowed|    Other-service|     Unmarried|Other|  Male|           0|           0|            40| United-States|
| 28|  Private|166481|     7th-8th|            4|Married-civ-spouse|Handlers-cleaners|       Husband|Other|  Male|           0|        2179|            40|   Puerto-Rico|
| 44|  Private|109339|        11th|            7|          Divorced|Machine-op-inspct|     Unmarried|Other|Female|           0|           0|            46|   Puerto-Rico|
| 40|  Private|237601|   Bachelors|           13|     Never-married|            Sales| Not-in-family|Other|Female|           0|           0|            55| United-States|
| 26|  Private|195105|     HS-grad|            9|     Never-married|            Sales| Not-in-family|Other|  Male|           0|           0|            40| United-States|
| 31|  Private|175548|     HS-grad|            9|     Never-married|    Other-service| Not-in-family|Other|Female|           0|           0|            35| United-States|
+---+---------+------+------------+-------------+------------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+
only showing top 8 rows

The filter worked: there are 540 records with age over 70.

Next we count people by race and by education level: groupBy groups the records that share a value, count tallies each group, and orderBy sorts the result:

    df.groupBy($"race").count().orderBy("count").show()
    df.groupBy($"education").count().orderBy("count").show()

The output:


+------------------+-----+
|              race|count|
+------------------+-----+
|             Other|  271|
|Amer-Indian-Eskimo|  311|
|Asian-Pac-Islander| 1039|
|             Black| 3124|
|             White|27816|
+------------------+-----+

+------------+-----+
|   education|count|
+------------+-----+
|   Preschool|   51|
|     1st-4th|  168|
|     5th-6th|  333|
|   Doctorate|  413|
|        12th|  433|
|         9th|  514|
| Prof-school|  576|
|     7th-8th|  646|
|        10th|  933|
|  Assoc-acdm| 1067|
|        11th| 1175|
|   Assoc-voc| 1382|
|     Masters| 1723|
|   Bachelors| 5355|
|Some-college| 7291|
|     HS-grad|10501|
+------------+-----+

Roughly speaking, White respondents and high-school graduates make up the majority of the sample.

Next we use a SQL statement to fetch the record with a particular fnlwgt and print it in a custom format. createOrReplaceTempView registers the dataset under a temporary view name so it can be queried with SQL, and getAs extracts individual column values from a row:

    df.createOrReplaceTempView("adult")

    val dude = spark.sql("SELECT * FROM adult WHERE fnlwgt=77516")
    println(dude.map(info =>
      "Aged " + info.getAs[String]("age") +
      ", occupation is " + info.getAs[String]("occupation") +
      ", race is " + info.getAs[String]("race")).first())

The output:

Aged 39, occupation is Adm-clerical, race is White

Comparing Divorce Rates Across Education Levels

Since we need to print percentages, we import java.text.NumberFormat to format them. We then apply filter repeatedly to compute the divorce rate of each group, and print the results:

    import java.text.NumberFormat

    val format = NumberFormat.getPercentInstance
    format.setMaximumFractionDigits(4)

    val total_divorce_rate: Double =
      df.filter($"maritial_status" === "Divorced").count.toDouble / df.count.toDouble
    val doc_divorce_rate: Double =
      df.filter($"education" === "Doctorate" and $"maritial_status" === "Divorced").count.toDouble /
        df.filter($"education" === "Doctorate").count.toDouble
    val preschool_divorce_rate: Double =
      df.filter($"education" === "Preschool" and $"maritial_status" === "Divorced").count.toDouble /
        df.filter($"education" === "Preschool").count.toDouble
    val HS_divorce_rate: Double =
      df.filter($"education" === "HS-grad" and $"maritial_status" === "Divorced").count.toDouble /
        df.filter($"education" === "HS-grad").count.toDouble

    println("Doctorate divorce rate: " + format.format(doc_divorce_rate))
    println("Preschool divorce rate: " + format.format(preschool_divorce_rate))
    println("High School graduation divorce rate: " + format.format(HS_divorce_rate))
    println("Total divorce rate: " + format.format(total_divorce_rate))

The output:

Doctorate divorce rate: 7.9903%
Preschool divorce rate: 1.9608%
High School graduation divorce rate: 15.3604%
Total divorce rate: 13.6452%

These four figures are the divorce rates of doctorate holders, people with no schooling (Preschool), high-school graduates, and the population as a whole. The uneducated group's divorce rate is strikingly low, while the largest group, high-school graduates, sits above the overall average.
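As an aside, the repeated filter-and-count pairs above scan df once per group. The same rate logic can be expressed as a single grouped pass; here is a minimal plain-Scala sketch of that idea, using a tiny made-up sample rather than the real data:

```scala
// One-pass sketch of the divorce-rate computation (hypothetical in-memory rows,
// not the actual dataset): group by education, then take each group's "Divorced" share.
val rows = Seq(
  ("Doctorate", "Divorced"),
  ("Doctorate", "Married-civ-spouse"),
  ("HS-grad",   "Divorced"),
  ("HS-grad",   "Divorced"),
  ("HS-grad",   "Never-married")
)

val divorceRates: Map[String, Double] = rows.groupBy(_._1).map {
  case (education, group) =>
    education -> group.count(_._2 == "Divorced").toDouble / group.size
}

println(divorceRates("Doctorate")) // 0.5
```

In Spark the analogous one-pass form would be a groupBy on education with an aggregated Divorced indicator, which avoids re-reading the DataFrame once per education level.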

Bucketing (bucketBy) and Partitioning (partitionBy)

Let's partition the dataset by the workclass column and bucket it by relationship into 6 buckets, saving the result as a new table named workclass_relationship_table:

    df.write
      .partitionBy("workclass")
      .bucketBy(6, "relationship")
      .saveAsTable("workclass_relationship_table")

After running this, a corresponding directory for the table appears under spark-warehouse.

Saving to and Reading from CSV, and Querying via Hive

The following code saves the marital status, ID (fnlwgt), sex, occupation, and education of every doctorate-level record:

    df.filter($"education" === "Doctorate")
      .select($"maritial_status", $"fnlwgt", $"sex", $"occupation", $"education")
      .write.csv("./data/doctorates_marriage_occupation_data")

We then create a SparkSession with Hive support enabled for the Hive table operations:

    import java.io.File

    val warehouseLocation = new File("spark-warehouse").getAbsolutePath
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Doctorate Hive")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

The following Hive statements create a new table and load the data into it. The Doctorate case class mirrors the table's columns, and the CREATE TABLE statement explicitly specifies the field and line separators:

    import spark.implicits._
    import spark.sql

    case class Doctorate(marriage: String, id: Int, sex: String, occupation: String, education: String)

    sql("CREATE TABLE IF NOT EXISTS doc (marriage STRING, id INT, sex STRING, occupation STRING, education STRING) " +
      "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
      "LINES TERMINATED BY '\n' STORED AS TEXTFILE")
    sql("LOAD DATA LOCAL INPATH './data/doctorates_marriage_occupation_data' OVERWRITE INTO TABLE doc")

The following statements tally the sex and occupation columns:

    val doc_table = sql("SELECT * FROM doc")
    val marriage_doc = doc_table.groupBy("marriage").count().orderBy("count")

    doc_table.groupBy("sex").count().orderBy("count").show()
    doc_table.groupBy("occupation").count().orderBy("count").show()

The output:

+------+-----+
|   sex|count|
+------+-----+
|Female|   86|
|  Male|  327|
+------+-----+

+-----------------+-----+
|       occupation|count|
+-----------------+-----+
| Transport-moving|    1|
|Machine-op-inspct|    1|
|  Farming-fishing|    1|
|    Other-service|    1|
|     Craft-repair|    2|
|     Tech-support|    3|
|     Adm-clerical|    5|
|            Sales|    8|
|                ?|   15|
|  Exec-managerial|   55|
|   Prof-specialty|  321|
+-----------------+-----+

Finally, we format and print each record with the code below; passing truncate = false to show keeps long string columns from being truncated in the output:

    import org.apache.spark.sql.Row

    val divorced_male = sql("SELECT * FROM doc WHERE sex='Male' AND marriage='Divorced' ORDER BY ID")
    divorced_male.show(8)

    val recordsDoc = divorced_male.map {
      case Row(marriage: String, id: Int, sex: String, occupation: String, education: String) =>
        val pronoun = sex match {
          case "Male"   => "His"
          case "Female" => "Her"
        }
        s"Doc. $id is $sex, $pronoun marriage status is $marriage, occupation is $occupation"
    }
    recordsDoc.show(8, truncate = false)

The output:

+----------------------------------------------------------------------------------+
|value                                                                             |
+----------------------------------------------------------------------------------+
|Doc. 42924 is Male, His marriage status is Divorced, occupation is Exec-managerial|
|Doc. 71772 is Male, His marriage status is Divorced, occupation is Prof-specialty |
|Doc. 140644 is Male, His marriage status is Divorced, occupation is Prof-specialty|
|Doc. 145333 is Male, His marriage status is Divorced, occupation is Prof-specialty|
|Doc. 148171 is Male, His marriage status is Divorced, occupation is Prof-specialty|
|Doc. 160120 is Male, His marriage status is Divorced, occupation is Adm-clerical  |
|Doc. 170769 is Male, His marriage status is Divorced, occupation is Sales         |
|Doc. 195343 is Male, His marriage status is Divorced, occupation is Prof-specialty|
+----------------------------------------------------------------------------------+
only showing top 8 rows
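The sex-to-pronoun choice buried in the Row pattern match above can be pulled out into a small standalone helper, which makes it easy to check in isolation:

```scala
// Standalone sketch of the pronoun selection used when formatting each record.
// Any value other than "Male"/"Female" would throw a MatchError, as in the original.
def pronoun(sex: String): String = sex match {
  case "Male"   => "His"
  case "Female" => "Her"
}

val line = s"Doc. 42924 is Male, ${pronoun("Male")} marriage status is Divorced"
println(line) // Doc. 42924 is Male, His marriage status is Divorced
```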

References

  • Spark Docs - SparkSession
  • Spark SQL, DataFrames and Datasets Guide
