[Reposted because: it explains the differences between DataFrame's union, unionAll and unionByName methods very clearly]

[Original post: https://blog.csdn.net/bowenlaw/article/details/102996825?utm_medium=distribute.pc_aggpage_search_result.none-task-blog-2~all~sobaiduend~default-5-102996825.nonecase&utm_term=spark%E4%B8%ADunion&spm=1000.2123.3001.4430]

Method descriptions:

union: merges two DataFrames by column position rather than by column name; the result takes its column names and order from the first DataFrame (for a.union(b), the column order follows a).
unionAll: same behaviour as union (it is just a deprecated alias). Note that, unlike UNION in SQL, neither method removes duplicate rows (see the sketch below).
unionByName: merges by column name instead of by position.
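As a quick illustration of the duplicate-keeping behaviour, here is a minimal sketch (assuming this runs in spark-shell, or any SparkSession named spark with spark.implicits._ imported, as in the example below); appending distinct() gives SQL-UNION-style deduplication:

import spark.implicits._

val x = Seq(("1", "ke"), ("1", "ke")).toDF("id_num", "CST_NO")
val y = Seq(("1", "ke"), ("2", "ming")).toDF("id_num", "CST_NO")

// union keeps every row, duplicates included: 4 rows here
x.union(y).show()

// appending distinct() emulates SQL UNION semantics: only 2 distinct rows remain
x.union(y).distinct().show()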
Example:

In table b, the values of the id_num and CST_NO columns are in swapped order:

var a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
a.show()

var b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("CST_NO", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
b.show()

var r = a.union(b)
r.show()

var p = a.unionAll(b)
p.show()

var t = a.unionByName(b)
t.show()
The output is:

a: org.apache.spark.sql.DataFrame = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

b: org.apache.spark.sql.DataFrame = [CST_NO: string, id_num: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|CST_NO|id_num|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

r: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

warning: there was one deprecation warning; re-run with -deprecation for details
p: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

t: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|  9999|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  9999|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|787878|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
As the output shows, r and p are merged purely by position (b's swapped id_num/CST_NO values land in the wrong columns), whereas t matches up the corresponding columns, which is the correct result from unionByName. If the two DataFrames' column names are not exactly the same, unionByName throws an error:

var a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")

var b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("vvv", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")

var t = a.unionByName(b)
t.show()
This fails with: org.apache.spark.sql.AnalysisException: Cannot resolve column name "CST_NO" among (vvv, id_num, distribution, dayId, AMOUNT_cnt, CLIENT_IP, CLIENT_MAC, PAYER_CODE_num, CHANNEL_CODE);
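If DataFrames with genuinely different columns still need to be combined, two common workarounds are to add and align the missing column manually before a positional union, or (on Spark 3.1 and later only) to pass allowMissingColumns = true to unionByName. A minimal sketch, assuming the a and b from the failing example above:

import org.apache.spark.sql.functions.{col, lit}

// Option 1: add the column that b lacks, then reorder b's columns to match a
val bFixed = b
  .withColumn("CST_NO", lit(null).cast("string"))
  .select(a.columns.map(col): _*)
a.union(bFixed).show()

// Option 2 (Spark 3.1+): let unionByName fill missing columns with nulls
a.unionByName(b, allowMissingColumns = true).show()

Note that option 2 also keeps b's extra vvv column in the result (null for a's rows), so choose whichever column set you actually want.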

Note: all three methods require the two DataFrames to have the same number of columns; if the column counts differ, the call fails immediately with an error. (The row counts do not have to match.)
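A quick sketch of that constraint, again only an illustration with small made-up DataFrames in the same shell session:

// Different row counts are fine: union simply stacks the rows
val one   = Seq(("9", "zhao", "bj")).toDF("id_num", "CST_NO", "distribution")
val three = Seq(("1", "ke", "hb"), ("1", "ke", "hb"), ("200", "ming", "hlj"))
  .toDF("id_num", "CST_NO", "distribution")
one.union(three).show()   // 4 rows

// Different column counts fail at analysis time
val two = Seq(("1", "ke")).toDF("id_num", "CST_NO")
// three.union(two)   // throws org.apache.spark.sql.AnalysisException
//                    // (the two sides have a different number of columns)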
