From the official documentation

Spark 2.4 provides six overloads of the join operator on Dataset/DataFrame, as follows:

  • Overload 1:

def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.

// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")

// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");

right
Right side of the join.
joinExprs
Join expression.
joinType
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

Since 2.0.0

  • Overload 2:

def join(right: Dataset[_], joinExprs: Column): DataFrame

Inner join with another DataFrame, using the given join expression (joinType is omitted here because it defaults to inner).

// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")

Since 2.0.0

  • Overload 3:

def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame

Equi-join with another DataFrame using the given columns. A cross join with a predicate is specified as an inner join; if you would explicitly like to perform a cross join, use the crossJoin method (a sketch follows the last overload below). Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

right
Right side of the join operation.
usingColumns
Names of the columns to join on. These columns must exist on both sides (the join column names must be identical in both DataFrames).
joinType
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

Since 2.0.0

Note: if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference (an aliasing sketch also follows the last overload below).

  • Overload 4:

def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame

Inner equi-join with another DataFrame using the given columns. Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))

right
Right side of the join operation.
usingColumns
Names of the columns to join on. These columns must exist on both sides (the join column names must be identical in both DataFrames).

Since 2.0.0

Note: the same self-join caveat as in overload 3 applies.

  • Overload 5:

def join(right: Dataset[_], usingColumn: String): DataFrame

Inner equi-join with another DataFrame using the given column. Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")

right
Right side of the join operation.
usingColumn
Name of the column to join on. This column must exist on both sides (the join column name must be identical in both DataFrames).

Since 2.0.0

Note: the same self-join caveat as in overload 3 applies.

  • Overload 6:

def join(right: Dataset[_]): DataFrame

Join with another DataFrame. Behaves as an INNER JOIN and requires a subsequent join predicate.

right
Right side of the join operation.

Since 2.0.0
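
Two quick sketches for the notes above, assuming df1, df2, and import spark.implicits._ as in the earlier examples. First, the explicit Cartesian product: unlike df1.join(df2), which requires a subsequent join predicate, crossJoin pairs every left row with every right row.

// Explicit Cartesian product; no join predicate expected.
df1.crossJoin(df2).show()

Second, the self-join caveat: aliasing each side with .as(...) keeps columns addressable after the join. A minimal sketch against a hypothetical df_people(id, name, manager_id):

// Alias both sides of the self-join so either side's columns can still be referenced.
val employees = df_people.as("e")
val managers  = df_people.as("m")
employees.join(managers, $"e.manager_id" === $"m.id")
  .select($"e.name".as("employee"), $"m.name".as("manager"))
  .show()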

Quick summary

  • In day-to-day analytics, inner and left_outer are used most, full occasionally; the remaining types are rarely needed.
  • Check whether the two join columns have exactly the same name. If so, call the overloads that take usingColumn (a single String) or usingColumns (a Seq[String], normally used for two or more join columns, though a single column also works and then behaves like usingColumn). If the names differ, you must write a join expression (joinExprs).
  • How do you write a joinExprs: Column expression? The official examples above are actually quite clear, but they tend to baffle newcomers at first, so the simple cases below walk through it.

Worked examples

  • Test data
    Contents of name.txt:
1,Jack
2,Rose
3,Lily
4,Lucy
7,Rivers

Contents of age.txt:

1,18
2,19
3,20
4,21
5,22
6,23
  • Test code
    The SparkSession entry point is boilerplate, so it is not described further here:
 val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()

First, test with both join columns named id:

import spark.implicits._
val df_name = spark.read.textFile("./data/name.txt").map(_.split(",")).map(x => (x(0), x(1))).toDF("id", "name").cache()
val df_age = spark.read.textFile("./data/age.txt").map(_.split(",")).map(x => (x(0), x(1))).toDF("id", "age").cache()

(1) Inner join

df_name.join(df_age, usingColumn = "id").show()
df_name.join(df_age, usingColumns = Seq("id")).show()
df_name.join(df_age, usingColumns = Seq("id"), joinType = "inner").show()

Result:

+---+----+---+
| id|name|age|
+---+----+---+
|  1|Jack| 18|
|  2|Rose| 19|
|  3|Lily| 20|
|  4|Lucy| 21|
+---+----+---+

(2) Left outer join

df_name.join(df_age, usingColumns = Seq("id"), joinType = "left_outer").show()

Result:

+---+------+----+
| id|  name| age|
+---+------+----+
|  1|  Jack|  18|
|  2|  Rose|  19|
|  3|  Lily|  20|
|  4|  Lucy|  21|
|  7|Rivers|null|
+---+------+----+

Now test with different join column names (nid and aid, respectively):

import spark.implicits._
val df_name = spark.read.textFile("./data/name.txt").map(_.split(",")).map(x => (x(0), x(1))).toDF("nid", "name").cache()
val df_age = spark.read.textFile("./data/age.txt").map(_.split(",")).map(x => (x(0), x(1))).toDF("aid", "age").cache()

(1) Inner join

df_name.join(df_age, joinExprs = $"nid" === $"aid").show()
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "inner").show()
df_name.join(df_age).where(condition = $"nid" === $"aid").show()

Result:

+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

(2) Left outer join

df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_outer").show()

Result:

+---+------+----+----+
|nid|  name| aid| age|
+---+------+----+----+
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|
+---+------+----+----+

There is one more style (referencing columns directly through each DataFrame), which works whether or not the join column names match.
(1) Both join columns named id

df_name.join(df_age, df_name("id") === df_age("id")).show()

Result:

+---+----+---+---+
| id|name| id|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

This form is rarely left as-is, because once the result is persisted, how would you select id? Two identically named id columns are awkward. One workaround is sketched below.
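
A minimal sketch of that workaround, reusing df_name and df_age with both columns named id: since df_name("id") and df_age("id") each name one side unambiguously, the right-hand copy can simply be dropped (renaming one side with withColumnRenamed before the join works too).

// Join on an expression, then drop the right-hand id so the result keeps a single id column.
df_name.join(df_age, df_name("id") === df_age("id"))
  .drop(df_age("id"))
  .show()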

(2) Join columns named nid and aid

df_name.join(df_age, df_name("nid") === df_age("aid")).show()

Result:

+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

By this point you should have a clear picture of how to call join in each of these ways.

Differences among the 11 JoinType values

Strictly speaking this comparison is hardly necessary, since most jobs use only a few join types, but here is a quick side-by-side run:

println("--------inner-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "inner").show()
println("--------cross-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "cross").show()
println("--------outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "outer").show()
println("--------full-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "full").show()
println("--------full_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "full_outer").show()
println("--------left-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left").show()
println("--------left_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_outer").show()
println("--------right-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "right").show()
println("--------right_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "right_outer").show()
println("--------left_semi-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_semi").show()
println("--------left_anti-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_anti").show()

Result:

--------inner-------
+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+
--------cross-------
+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+
--------outer-------
+----+------+----+----+
| nid|  name| aid| age|
+----+------+----+----+
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|
+----+------+----+----+
--------full-------
+----+------+----+----+
| nid|  name| aid| age|
+----+------+----+----+
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|
+----+------+----+----+
--------full_outer-------
+----+------+----+----+
| nid|  name| aid| age|
+----+------+----+----+
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|
+----+------+----+----+
--------left-------
+---+------+----+----+
|nid|  name| aid| age|
+---+------+----+----+
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|
+---+------+----+----+
--------left_outer-------
+---+------+----+----+
|nid|  name| aid| age|
+---+------+----+----+
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|
+---+------+----+----+
--------right-------
+----+----+---+---+
| nid|name|aid|age|
+----+----+---+---+
|   1|Jack|  1| 18|
|   2|Rose|  2| 19|
|   3|Lily|  3| 20|
|   4|Lucy|  4| 21|
|null|null|  5| 22|
|null|null|  6| 23|
+----+----+---+---+
--------right_outer-------
+----+----+---+---+
| nid|name|aid|age|
+----+----+---+---+
|   1|Jack|  1| 18|
|   2|Rose|  2| 19|
|   3|Lily|  3| 20|
|   4|Lucy|  4| 21|
|null|null|  5| 22|
|null|null|  6| 23|
+----+----+---+---+
--------left_semi-------
+---+----+
|nid|name|
+---+----+
|  1|Jack|
|  2|Rose|
|  3|Lily|
|  4|Lucy|
+---+----+
--------left_anti-------
+---+------+
|nid|  name|
+---+------+
|  7|Rivers|
+---+------+
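
The last two deserve a quick gloss, since their output shape differs: left_semi keeps only the left-side rows (and only the left-side columns) that have a match on the right, while left_anti keeps the left-side rows with no match. A rough Spark SQL equivalent, assuming the two frames are registered as temp views (the view names here are illustrative):

df_name.createOrReplaceTempView("name")
df_age.createOrReplaceTempView("age")
// left_semi ~ rows of name that have a match in age (left columns only)
spark.sql("SELECT n.* FROM name n WHERE EXISTS (SELECT 1 FROM age a WHERE a.aid = n.nid)").show()
// left_anti ~ rows of name that have no match in age (left columns only)
spark.sql("SELECT n.* FROM name n WHERE NOT EXISTS (SELECT 1 FROM age a WHERE a.aid = n.nid)").show()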
