

  • First form:
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
Join with another DataFrame, using the given join expression.
The following performs a full outer join between df1 and df2.
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")
// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
right: Right side of the join.
joinExprs: Join expression.
joinType: Type of join to perform. Default inner. Must be one of:
inner, cross,
outer, full, full_outer,
left, left_outer,
right, right_outer,
left_semi, left_anti.
  • Second form:
def join(right: Dataset[_], joinExprs: Column): DataFrame
Inner join with another DataFrame, using the given join expression
(joinType is omitted here because it defaults to inner).
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
  • Third form:

def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
Equi-join with another DataFrame using the given columns.
A cross join with a predicate is specified as an inner join.
If you would explicitly like to perform a cross join use the crossJoin method.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
right: Right side of the join operation.
usingColumns: Names of the columns to join on. These columns must exist on both sides.
joinType: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
If you perform a self-join using this function without aliasing the input DataFrames,
you will NOT be able to reference any columns after the join,
since there is no way to disambiguate which side of the join you would like to reference.
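
The self-join caveat above can be worked around by aliasing both sides before joining. The following is a minimal sketch, assuming a local SparkSession; the `users` data and the alias names `a` and `b` are hypothetical and not taken from the Spark documentation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("self-join-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data, used only for illustration.
val users = Seq(("1", "Jack"), ("2", "Rose")).toDF("id", "name")

// Alias each side so columns can still be referenced unambiguously after the join.
val selfJoined = users.as("a")
  .join(users.as("b"), Seq("id"))
  .select($"a.name".as("left_name"), $"b.name".as("right_name"))

selfJoined.show()
```

Without the two `as(...)` calls, a reference such as `name` after the join would be ambiguous, because both sides expose a column with that name.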
  • Fourth form:

def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
Inner equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
right: Right side of the join operation.
usingColumns: Names of the columns to join on. These columns must exist on both sides.
If you perform a self-join using this function without aliasing the input DataFrames,
you will NOT be able to reference any columns after the join,
since there is no way to disambiguate which side of the join you would like to reference.
  • Fifth form:
def join(right: Dataset[_], usingColumn: String): DataFrame
Inner equi-join with another DataFrame using the given column.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
right: Right side of the join operation.
usingColumn: Name of the column to join on. This column must exist on both sides.
If you perform a self-join using this function without aliasing the input DataFrames,
you will NOT be able to reference any columns after the join,
since there is no way to disambiguate which side of the join you would like to reference.
  • Sixth form:
def join(right: Dataset[_]): DataFrame
Join with another DataFrame.
Behaves as an INNER JOIN and requires a subsequent join predicate.
right: Right side of the join operation.


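  • For everyday statistics work, inner and left_outer are used most often and full occasionally; the remaining join types are rarely needed.
  • Check whether the join key columns have exactly the same name on both sides. If they do, use one of the join overloads that take usingColumn (a String) or usingColumns (a Seq, normally used for two or more key columns; it also accepts a single column, in which case it behaves like usingColumn). Otherwise you have to write a join expression (joinExprs).
  • How do you write the joinExprs: Column expression? The official examples are actually quite clear, but they can be bewildering for newcomers at first, so the simple cases below walk through it.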
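
The rule of thumb in the list above can be sketched as follows. This is an illustration only, assuming a local SparkSession; the DataFrames `names`, `ages`, and `agesRenamed` hold hypothetical sample rows. It shows that joining on a shared column name keeps the key once, while joining on an expression keeps the key columns of both sides.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-choice-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data, used only for illustration.
val names = Seq(("1", "Jack"), ("2", "Rose")).toDF("id", "name")
val ages = Seq(("1", "18"), ("2", "19")).toDF("id", "age")

// Same key name on both sides: use usingColumn, the key appears once.
val byName = names.join(ages, "id")

// Different key names: write a join expression, both key columns are kept.
val agesRenamed = ages.withColumnRenamed("id", "aid")
val byExpr = names.join(agesRenamed, $"id" === $"aid")

println(byName.columns.mkString(", "))
println(byExpr.columns.mkString(", "))
```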


  • Test data
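
If the data files are not at hand, equivalent DataFrames can be built in memory. The sketch below is an assumption: the rows are inferred from the join results shown later in this post (ids 1, 2, 3, 4, 7 on the name side and 1 through 6 on the age side), and all values are kept as strings because the original code splits text lines.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-test-data").master("local[*]").getOrCreate()
import spark.implicits._

// Rows inferred from the outputs in this post; an assumption, not the original files.
val df_name = Seq(
  ("1", "Jack"), ("2", "Rose"), ("3", "Lily"), ("4", "Lucy"), ("7", "Rivers")
).toDF("id", "name").cache()

val df_age = Seq(
  ("1", "18"), ("2", "19"), ("3", "20"), ("4", "21"), ("5", "22"), ("6", "23")
).toDF("id", "age").cache()

df_name.show()
df_age.show()
```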


  • Code test
import org.apache.spark.sql.SparkSession

// Start a local SparkSession for the test
val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()

import spark.implicits._
// Split each comma-separated input line into a two-column DataFrame
val df_name = spark.read.textFile("./data/name.txt").map(_.split(",")).map(x => (x(0), x(1))).toDF("id", "name").cache()
val df_age = spark.read.textFile("./data/age.txt").map(_.split(",")).map(x => (x(0), x(1))).toDF("id", "age").cache()

(1) Inner join code

df_name.join(df_age, usingColumn = "id").show()
df_name.join(df_age, usingColumns = Seq("id")).show()
df_name.join(df_age, usingColumns = Seq("id"), joinType = "inner").show()


| id|name|age|
|  1|Jack| 18|
|  2|Rose| 19|
|  3|Lily| 20|
|  4|Lucy| 21|


df_name.join(df_age, usingColumns = Seq("id"), joinType = "left_outer").show()


| id|  name| age|
|  1|  Jack|  18|
|  2|  Rose|  19|
|  3|  Lily|  20|
|  4|  Lucy|  21|
|  7|Rivers|null|


import spark.implicits._
val df_name = spark.read.textFile("./data/name").map(_.split(",")).map(x => (x(0), x(1))).toDF("nid", "name").cache()
val df_age = spark.read.textFile("./data/age").map(_.split(",")).map(x => (x(0), x(1))).toDF("aid", "age").cache()


df_name.join(df_age, joinExprs = $"nid" === $"aid").show()
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "inner").show()
df_name.join(df_age).where(condition = $"nid" === $"aid").show()


|nid|name|aid|age|
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|


df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_outer").show()


|nid|  name| aid| age|
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|


df_name.join(df_age, df_name("id") === df_age("id")).show()


| id|name| id|age|
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
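
The result above carries two `id` columns, one from each side. One common way to keep a single key column is to drop the right-hand one by its DataFrame reference. The following is a minimal sketch, assuming in-memory stand-ins for `df_name` and `df_age` with hypothetical rows; `deduped` is a name introduced here for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("drop-duplicate-key").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical in-memory stand-ins for the two input DataFrames.
val df_name = Seq(("1", "Jack"), ("2", "Rose")).toDF("id", "name")
val df_age = Seq(("1", "18"), ("2", "19")).toDF("id", "age")

// Join on an expression, then drop the right side's key column by reference.
val deduped = df_name
  .join(df_age, df_name("id") === df_age("id"))
  .drop(df_age("id"))

deduped.show()
```

When the key names match on both sides, the usingColumn overload gives the same column layout without the extra `drop`.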



df_name.join(df_age, df_name("nid") === df_age("aid")).show()


|nid|name|aid|age|
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|




println("--------inner-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "inner").show()
println("--------cross-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "cross").show()
println("--------outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "outer").show()
println("--------full-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "full").show()
println("--------full_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "full_outer").show()
println("--------left-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left").show()
println("--------left_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_outer").show()
println("--------right-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "right").show()
println("--------right_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "right_outer").show()
println("--------left_semi-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_semi").show()
println("--------left_anti-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_anti").show()


--------inner-------
|nid|name|aid|age|
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|

--------cross-------
|nid|name|aid|age|
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|

--------outer-------
| nid|  name| aid| age|
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|

--------full-------
| nid|  name| aid| age|
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|

--------full_outer-------
| nid|  name| aid| age|
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|

--------left-------
|nid|  name| aid| age|
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|

--------left_outer-------
|nid|  name| aid| age|
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|

--------right-------
| nid|name|aid|age|
|   1|Jack|  1| 18|
|   2|Rose|  2| 19|
|   3|Lily|  3| 20|
|   4|Lucy|  4| 21|
|null|null|  5| 22|
|null|null|  6| 23|

--------right_outer-------
| nid|name|aid|age|
|   1|Jack|  1| 18|
|   2|Rose|  2| 19|
|   3|Lily|  3| 20|
|   4|Lucy|  4| 21|
|null|null|  5| 22|
|null|null|  6| 23|

--------left_semi-------
|nid|name|
|  1|Jack|
|  2|Rose|
|  3|Lily|
|  4|Lucy|

--------left_anti-------
|nid|  name|
|  7|Rivers|
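
The results above can be condensed into a row count per join type. The sketch below is an illustration under assumptions: it uses in-memory DataFrames whose rows are inferred from the outputs in this post rather than the original files, and `joinTypes` and `counts` are names introduced here.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-type-counts").master("local[*]").getOrCreate()
import spark.implicits._

// Rows inferred from the outputs in this post; an assumption, not the original files.
val df_name = Seq(
  ("1", "Jack"), ("2", "Rose"), ("3", "Lily"), ("4", "Lucy"), ("7", "Rivers")
).toDF("nid", "name")
val df_age = Seq(
  ("1", "18"), ("2", "19"), ("3", "20"), ("4", "21"), ("5", "22"), ("6", "23")
).toDF("aid", "age")

val joinTypes = Seq(
  "inner", "cross", "outer", "full", "full_outer",
  "left", "left_outer", "right", "right_outer", "left_semi", "left_anti"
)

// Run the same join expression with every join type and record the row count.
val counts = joinTypes.map { t =>
  t -> df_name.join(df_age, $"nid" === $"aid", t).count()
}.toMap

joinTypes.foreach(t => println(s"$t: ${counts(t)}"))
```

The counts group the eleven names into their synonyms: outer, full, and full_outer return the same rows, as do left and left_outer, and right and right_outer.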


