(Repost) Notes on Spark's DataFrame union, unionAll, and unionByName methods
[Reason for reposting: the differences between the DataFrame union, unionAll, and unionByName methods are explained very clearly]
[Original post: https://blog.csdn.net/bowenlaw/article/details/102996825?utm_medium=distribute.pc_aggpage_search_result.none-task-blog-2~all~sobaiduend~default-5-102996825.nonecase&utm_term=spark%E4%B8%ADunion&spm=1000.2123.3001.4430]
Method descriptions:
union: merges two DataFrames by column position, not by column name; the result takes its column names from the left-hand table (in a.union(b), the column order follows a).
unionAll: same behavior as union; it has been deprecated since Spark 2.0 in favor of union (note the deprecation warning in the output below).
unionByName: merges by column name instead of by position.
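One point worth adding beyond the original post (standard Spark behavior, easy to verify yourself): none of these three methods removes duplicate rows, so all of them correspond to SQL's UNION ALL rather than UNION. A minimal sketch, assuming a spark-shell session where spark.implicits._ is already imported:

// All three merge methods keep duplicates (SQL UNION ALL semantics);
// chain .distinct() to get deduplicated SQL UNION semantics.
val x = Seq(("1", "ke")).toDF("id", "name")
val y = Seq(("1", "ke")).toDF("id", "name")
x.union(y).count()            // 2 -- the duplicate row is kept
x.union(y).distinct().count() // 1 -- deduplicated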
Example:
In table b, the values of the id_num and CST_NO columns are in swapped positions relative to table a:
var a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
a.show()

var b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("CST_NO", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
b.show()

var r = a.union(b)
r.show()

var p = a.unionAll(b)
p.show()

var t = a.unionByName(b)
t.show()
The output is:

a: org.apache.spark.sql.DataFrame = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

b: org.apache.spark.sql.DataFrame = [CST_NO: string, id_num: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|CST_NO|id_num|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ming|787878| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

r: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ming|787878| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

warning: there was one deprecation warning; re-run with -deprecation for details
p: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ming|787878| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

t: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
| 9999| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 9999| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
|787878| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
You can see that tables r and p were merged purely by position, while table t matches the corresponding columns up by name, which is the correct result. Note that unionByName throws an error if the two tables' column names are not exactly the same:

var a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")

var b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("vvv", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")

var t = a.unionByName(b)
t.show()
The error:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "CST_NO" among (vvv, id_num, distribution, dayId, AMOUNT_cnt, CLIENT_IP, CLIENT_MAC, PAYER_CODE_num, CHANNEL_CODE);
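Beyond the original post: on Spark 3.1 or later, unionByName also accepts a second parameter, allowMissingColumns, which fills columns that exist on only one side with null instead of throwing this exception. A sketch, reusing the mismatched a and b defined just above (requires Spark 3.1+):

// Spark 3.1+ only: CST_NO and vvv each exist on one side only;
// with allowMissingColumns = true they are filled with null instead of erroring.
var t2 = a.unionByName(b, allowMissingColumns = true)
t2.show()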
Note: all three methods require the two DataFrames to have the same number of columns (unionByName additionally requires the same column names, unless the Spark 3.1+ allowMissingColumns option above is used); otherwise they throw an error immediately.
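Finally, a practical workaround that is not in the original post: when two tables have identical column names in a different order (the situation in the first example), you can reorder one side with select and then union positionally, which gives the same result as unionByName. A sketch, assuming the a and b from the first example:

import org.apache.spark.sql.functions.col

// Reorder b's columns into a's column order, then union by position;
// equivalent to a.unionByName(b) when both sides share the same column names.
val aligned = a.union(b.select(a.columns.map(col): _*))
aligned.show()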