Apache Sedona (GeoSpark) Spatial Join: Source Code Walkthrough

Contents

Apache Sedona (GeoSpark) Spatial Join
Range join
Distance join
Source Code Walkthrough
- SedonaSQLRegistrator.registerAll(sparkSession)
- JoinQueryDetector
  - planSpatialJoin
  - planDistanceJoin
- TraitJoinQueryExec
  - doExecute
    - 1. Construct the SpatialRDDs
    - 2. doSpatialPartitioning
    - 3. spatialJoin
    - 4. extraCondition and JoinRow

Apache Sedona (GeoSpark) Spatial Join

Sedona's spatial operators fully support the Apache Spark SQL query optimizer. Sedona provides the following query optimization features:

Automatically optimizes range join query and distance join query.
Automatically performs predicate pushdown.

Range join

Introduction: Find geometries from A and geometries from B such that each geometry pair satisfies a certain predicate. Most predicates supported by SedonaSQL can trigger a range join.

Spark SQL Example:

SELECT *
FROM polygondf, pointdf
WHERE ST_Contains(polygondf.polygonshape,pointdf.pointshape)

SELECT *
FROM polygondf, pointdf
WHERE ST_Intersects(polygondf.polygonshape,pointdf.pointshape)

SELECT *
FROM pointdf, polygondf
WHERE ST_Within(pointdf.pointshape, polygondf.polygonshape)

Spark SQL Physical plan:

== Physical Plan ==
RangeJoin polygonshape#20: geometry, pointshape#43: geometry, false
:- Project [st_polygonfromenvelope(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), cast(_c2#2 as decimal(24,20)), cast(_c3#3 as decimal(24,20)), mypolygonid) AS polygonshape#20]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#31 as decimal(24,20)), cast(_c1#32 as decimal(24,20)), myPointId) AS pointshape#43]
   +- *FileScan csv

Note: All join queries in SedonaSQL are inner joins.
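
To make the semantics concrete, here is a minimal plain-Python sketch of what a range join computes, written as a naive nested loop. It is not Sedona's implementation, which partitions and indexes the data first. The `Envelope` class and the `range_join` function are hypothetical names used only for illustration, and polygons are simplified to axis-aligned rectangles.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Envelope:
    """Axis-aligned rectangle standing in for a polygon (illustrative only)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, p: Point) -> bool:
        # Strict containment: a point on the boundary is not contained.
        return self.min_x < p[0] < self.max_x and self.min_y < p[1] < self.max_y

    def intersects(self, p: Point) -> bool:
        # Intersection also accepts points that lie on the boundary.
        return self.min_x <= p[0] <= self.max_x and self.min_y <= p[1] <= self.max_y


def range_join(polygons: List[Envelope], points: List[Point],
               boundary_counts: bool = False) -> List[Tuple[Envelope, Point]]:
    """Inner join: emit every (polygon, point) pair that satisfies the predicate."""
    return [
        (poly, pt)
        for poly in polygons
        for pt in points
        if (poly.intersects(pt) if boundary_counts else poly.contains(pt))
    ]


polygons = [Envelope(0, 0, 2, 2), Envelope(5, 5, 6, 6)]
points = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0)]

contained = range_join(polygons, points)                         # contains-style predicate
touching = range_join(polygons, points, boundary_counts=True)    # intersects-style predicate
```

Because the join is an inner join, the second polygon and the point (9.0, 9.0) appear in neither result: unmatched rows from either side are dropped.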

Distance join

Introduction: Find geometries from A and geometries from B such that the Euclidean distance of each geometry pair is less than or equal to a certain distance.

Spark SQL Example:

Only consider fully within a certain distance

SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) < 2

Consider intersects within a certain distance

SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) <= 2

Spark SQL Physical plan:

== Physical Plan ==
DistanceJoin pointshape1#12: geometry, pointshape2#33: geometry, 2.0, true
:- Project [st_point(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), myPointId) AS pointshape1#12]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#21 as decimal(24,20)), cast(_c1#22 as decimal(24,20)), myPointId) AS pointshape2#33]
   +- *FileScan csv

Warning: Sedona doesn’t control the unit of the distance (degree or meter). The unit is the same as that of the geometry’s coordinates. To change the geometry’s unit, transform the coordinate reference system; see ST_Transform.
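
The difference between the two comparison operators can be illustrated with a small plain-Python sketch. It is a naive nested loop over points, not Sedona's algorithm, and the function name `distance_join` and the flag `include_boundary` are hypothetical. The distance is computed in whatever unit the coordinates use, which mirrors the warning above.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance_join(left: List[Point], right: List[Point], radius: float,
                  include_boundary: bool) -> List[Tuple[Point, Point]]:
    """Emit every pair whose Euclidean distance is within the radius.

    include_boundary=False models `ST_Distance(a, b) < radius`;
    include_boundary=True models `ST_Distance(a, b) <= radius`.
    """
    pairs = []
    for a in left:
        for b in right:
            d = math.dist(a, b)  # Euclidean distance, in coordinate units
            if d < radius or (include_boundary and d == radius):
                pairs.append((a, b))
    return pairs


left = [(0.0, 0.0)]
right = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

strict = distance_join(left, right, 2.0, include_boundary=False)
inclusive = distance_join(left, right, 2.0, include_boundary=True)
```

Only the inclusive variant keeps the point that lies exactly 2 units away.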

Source Code Walkthrough

SedonaSQLRegistrator.registerAll(sparkSession)

After the SparkSession has been initialized, SedonaSQLRegistrator.registerAll(sparkSession) must be called to register SedonaSQL's user-defined types, user-defined functions, and the optimized join query strategy.

JoinQueryDetector is the planning strategy for spatial joins. UdtRegistrator.registerAll() registers GeometryUDT and IndexUDT, and UdfRegistrator.registerAll(sqlContext) registers the custom UDFs, UDAFs, and so on.

JoinQueryDetector

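JoinQueryDetector extends Strategy, whose job is to convert logical plans into physical plans. As its apply method shows, JoinQueryDetector matches Join logical plan nodes and, based on the type of the condition held by the Join node, decides which kind of join to generate: RangeJoinExec or DistanceJoinExec. It passes in leftShape and rightShape, the two expressions that denote the geometry columns.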
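
The dispatch described above can be sketched in plain Python. This is only a toy model of the idea, not Sedona's Scala code: the classes `SpatialPredicate`, `DistancePredicate`, and `Join`, and the function `plan`, are hypothetical stand-ins, and the physical plans are represented as strings. Returning `None` models a strategy that declines a node so that other strategies can plan it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpatialPredicate:
    """Stand-in for a condition such as ST_Contains(a, b)."""
    name: str
    left_shape: str
    right_shape: str


@dataclass(frozen=True)
class DistancePredicate:
    """Stand-in for ST_Distance(a, b) < radius or <= radius."""
    left_shape: str
    right_shape: str
    radius: float
    inclusive: bool


@dataclass(frozen=True)
class Join:
    """Stand-in for a Join logical plan node; only the condition is modelled."""
    condition: object


def plan(node: object) -> Optional[str]:
    """Pick a physical join based on the type of the join condition."""
    if not isinstance(node, Join):
        return None  # not a join: this strategy does not apply
    cond = node.condition
    if isinstance(cond, SpatialPredicate):
        return f"RangeJoinExec({cond.left_shape}, {cond.right_shape})"
    if isinstance(cond, DistancePredicate):
        return f"DistanceJoinExec({cond.left_shape}, {cond.right_shape}, {cond.radius})"
    return None  # a non-spatial join condition is left to other strategies


range_plan = plan(Join(SpatialPredicate("ST_Contains", "polygonshape", "pointshape")))
distance_plan = plan(Join(DistancePredicate("pointshape1", "pointshape2", 2.0, True)))
other_plan = plan(Join("a.id = b.id"))
```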

Spark Join logical plan:

planSpatialJoin

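This method generates the RangeJoinExec physical plan. It first calls matchExpressionsToPlans to check whether the outputSet of the left and right child logical plans contains the geometry columns that the expressions refer to.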
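
A check of this kind could look roughly like the following plain-Python sketch. It assumes that each expression exposes the set of columns it references and that each plan exposes its output columns; the `Plan` tuple and the function `match_expressions_to_plans` are hypothetical Python stand-ins, not the Scala method itself. The sketch also swaps the plans when the expressions were written in the opposite order, which is an assumption about how such a matcher would reasonably behave.

```python
from typing import FrozenSet, NamedTuple, Optional, Tuple


class Plan(NamedTuple):
    """Stand-in for a logical plan: a name plus the columns it outputs."""
    name: str
    output_set: FrozenSet[str]


def match_expressions_to_plans(
    refs_a: FrozenSet[str], refs_b: FrozenSet[str], plan_a: Plan, plan_b: Plan
) -> Optional[Tuple[Plan, Plan]]:
    """Return the plans ordered so the first serves refs_a and the second refs_b."""
    if refs_a <= plan_a.output_set and refs_b <= plan_b.output_set:
        return (plan_a, plan_b)
    if refs_a <= plan_b.output_set and refs_b <= plan_a.output_set:
        return (plan_b, plan_a)  # expressions were given in the opposite order
    return None  # some referenced column is not produced by either child


polygondf = Plan("polygondf", frozenset({"polygonshape"}))
pointdf = Plan("pointdf", frozenset({"pointshape"}))

same_order = match_expressions_to_plans(
    frozenset({"polygonshape"}), frozenset({"pointshape"}), polygondf, pointdf)
swapped = match_expressions_to_plans(
    frozenset({"pointshape"}), frozenset({"polygonshape"}), polygondf, pointdf)
unmatched = match_expressions_to_plans(
    frozenset({"missing"}), frozenset({"pointshape"}), polygondf, pointdf)
```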

planDistanceJoin

This method generates the DistanceJoinExec physical plan. Its logic is largely the same as that of planSpatialJoin.

TraitJoinQueryExec

RangeJoinExec:

DistanceJoinExec:

As can be seen, the actual implementation logic of both DistanceJoinExec and RangeJoinExec lives in TraitJoinQueryExec.

TraitJoinQueryExec:

TraitJoinQueryExec is an interface (a Scala trait) that extends SparkPlan.

doExecute

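1. Construct the SpatialRDDs

The doExecute method first calls BindReferences.bindReference to bind leftShape and rightShape to the output of the left and right child physical plans. The eval method of the resulting bound expressions can then read the geometry column directly from the InternalRow produced by left and right.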
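
The idea behind binding can be shown with a small plain-Python sketch: a column name is resolved once to its position in the child's output, so that evaluating it against a row becomes a simple index lookup. The `BoundRef` class and the `bind_reference` function are hypothetical stand-ins, and rows are modelled as tuples rather than InternalRow objects.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BoundRef:
    """A column reference that has been resolved to an ordinal position."""
    ordinal: int

    def eval(self, row: Tuple) -> object:
        # No name lookup at evaluation time: just index into the row.
        return row[self.ordinal]


def bind_reference(column: str, output: List[str]) -> BoundRef:
    """Resolve a column name against a plan's output schema (done once per plan).

    Raises ValueError if the column is not part of the output.
    """
    return BoundRef(output.index(column))


left_output = ["id", "name", "polygonshape"]
left_shape = bind_reference("polygonshape", left_output)

row = (7, "park", "POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))")
geometry_value = left_shape.eval(row)
```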

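Next, the execute method of left and right is called to obtain the child RDDs. The toSpatialRddPair method is then called to produce the SpatialRDDs (the internal structure of SpatialRDD is not covered here): it reads the geometry column from each UnsafeRow and converts it into a Geometry object.

The toSpatialRdd method uses the deserialize method of GeometrySerializer, Sedona's own geometry serializer, to convert the geometry column it has read into Geometry objects.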

2. doSpatialPartitioning

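To perform a spatial join, the two SpatialRDDs must be partitioned in the same way. The code first determines the JoinSparitionDominantSide and then numPartitions.

In the doSpatialPartitioning method, dominantShapes is spatially partitioned with the partitioning scheme selected in sedonaConf. followerShapes then takes the partitioner of dominantShapes and is partitioned with that same partitioner.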
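
The following plain-Python sketch illustrates why the follower reuses the dominant side's partitioner: both sides then agree on which partition a given location belongs to. It uses a uniform grid over points for simplicity, which is an assumption made for illustration only. The `GridPartitioner` class and `spatial_partition` function are hypothetical, and clamping out-of-range points to the edge cells is a simplification of this sketch.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


class GridPartitioner:
    """Uniform grid built from the extent of the dominant side's points."""

    def __init__(self, sample: List[Point], cells_per_axis: int) -> None:
        xs = [p[0] for p in sample]
        ys = [p[1] for p in sample]
        self.min_x, self.min_y = min(xs), min(ys)
        # Guard against a zero-sized extent when all sample points coincide.
        self.width = (max(xs) - self.min_x) or 1.0
        self.height = (max(ys) - self.min_y) or 1.0
        self.n = cells_per_axis

    @property
    def num_partitions(self) -> int:
        return self.n * self.n

    def _cell(self, value: float, origin: float, extent: float) -> int:
        index = int((value - origin) / extent * self.n)
        # Clamp so that points on or beyond the extent land in an edge cell.
        return min(self.n - 1, max(0, index))

    def partition_id(self, p: Point) -> int:
        row = self._cell(p[1], self.min_y, self.height)
        col = self._cell(p[0], self.min_x, self.width)
        return row * self.n + col


def spatial_partition(points: List[Point],
                      partitioner: GridPartitioner) -> Dict[int, List[Point]]:
    """Group points by the partition id that the partitioner assigns."""
    parts: Dict[int, List[Point]] = {i: [] for i in range(partitioner.num_partitions)}
    for p in points:
        parts[partitioner.partition_id(p)].append(p)
    return parts


dominant = [(0.0, 0.0), (10.0, 10.0), (2.0, 3.0)]
follower = [(1.0, 1.0), (9.0, 9.0)]

partitioner = GridPartitioner(dominant, cells_per_axis=2)   # built from the dominant side
dominant_parts = spatial_partition(dominant, partitioner)
follower_parts = spatial_partition(follower, partitioner)   # same partitioner reused
```

With both sides partitioned by the same object, partition i of one side only needs to be compared with partition i of the other.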

3. spatialJoin

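First, a JoinParams object is constructed. It determines whether the join uses an index, whether pairs that only intersect on the boundary are considered, the index type, and the joinBuildSide.

Then the JoinQuery.spatialJoin method is called to perform the spatial join.

JoinQuery.spatialJoin

The method first checks that the CRS and the partitioning of the two SpatialRDDs are consistent.

It then constructs a JoinJudgement, which implements the FlatMapFunction2 interface. The judgement is passed to the zipPartitions operator and defines how the elements in the same partition of the two SpatialRDDs are spatially joined.

For example:

RightIndexLookupJudgement: leftRDD.spatialPartitionedRDD.zipPartitions(rightRDD.indexedRDD, judgement). It uses the per-partition spatial index of rightRDD.indexedRDD: each record of leftRDD.spatialPartitionedRDD is visited in turn and used to query the spatial index, which yields the record pairs that satisfy the spatial join.
DynamicIndexLookupJudgement: the partition index has to be built before the join is performed.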
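
The two judgements can be contrasted with a plain-Python sketch of a partition-wise join. Everything here is a simplified stand-in: `zip_partitions` mimics the pairing of same-numbered partitions, `SortedXIndex` is a toy one-dimensional index used in place of a real spatial index, left records are rectangles given as (min_x, min_y, max_x, max_y), and right records are points. None of these names come from Sedona.

```python
import bisect
from typing import Callable, Iterator, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


class SortedXIndex:
    """Toy index over points, sorted by x; stands in for a spatial index."""

    def __init__(self, points: List[Point]) -> None:
        self.points = sorted(points)
        self.xs = [p[0] for p in self.points]

    def query(self, min_x: float, max_x: float) -> List[Point]:
        # Candidate lookup: all points whose x lies in [min_x, max_x].
        lo = bisect.bisect_left(self.xs, min_x)
        hi = bisect.bisect_right(self.xs, max_x)
        return self.points[lo:hi]


def right_index_lookup(left: List[Rect], right_index: SortedXIndex
                       ) -> Iterator[Tuple[Rect, Point]]:
    """Stream the left records and probe the prebuilt index of the right side."""
    for rect in left:
        min_x, min_y, max_x, max_y = rect
        for pt in right_index.query(min_x, max_x):   # filter step via the index
            if min_y <= pt[1] <= max_y:              # refine step: exact check
                yield (rect, pt)


def dynamic_index_lookup(left: List[Rect], right: List[Point]
                         ) -> Iterator[Tuple[Rect, Point]]:
    """Build the index for this partition first, then join as above."""
    return right_index_lookup(left, SortedXIndex(right))


def zip_partitions(left_parts: list, right_parts: list, judgement: Callable) -> list:
    """Apply the judgement to each pair of same-numbered partitions."""
    result = []
    for left, right in zip(left_parts, right_parts):
        result.extend(judgement(left, right))
    return result


left_parts = [[(0.0, 0.0, 2.0, 2.0)], [(5.0, 5.0, 8.0, 8.0)]]
right_parts = [[(1.0, 1.0), (1.0, 3.0), (3.0, 1.0)], [(6.0, 6.0), (9.0, 6.0)]]

# Variant 1: the right side was indexed ahead of time, partition by partition.
indexed_parts = [SortedXIndex(points) for points in right_parts]
prebuilt = zip_partitions(left_parts, indexed_parts, right_index_lookup)

# Variant 2: the index is built on the fly inside each partition.
dynamic = zip_partitions(left_parts, right_parts, dynamic_index_lookup)
```

Both variants return the same pairs; they differ only in when the index is built.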

4. extraCondition and JoinRow

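Finally, Sedona supports pushing down predicates that filter the join result, so an extraCondition can be defined to filter the joined pairs.

The spatial join above operates only on the geometry columns of the two tables, whereas the desired result is the join of all columns of both tables. Because the Row object was stored earlier in the userData of each Geometry, the getUserData method is used to retrieve the UnsafeRow. An UnsafeRowJoiner is then constructed to join the two UnsafeRows, and the joined row is returned.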
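
This last step can be sketched in plain Python as follows. The `Geom` class, its `user_data` field, and the `join_rows` function are hypothetical stand-ins: rows are modelled as tuples and row joining as tuple concatenation, which only approximates what joining two UnsafeRows does.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Geom:
    """A geometry that carries its full source row as user data."""
    coords: Tuple[float, float]
    user_data: Tuple


def join_rows(pairs: Iterable[Tuple[Geom, Geom]],
              extra_condition: Optional[Callable[[Tuple], bool]] = None
              ) -> Iterator[Tuple]:
    """Turn matched geometry pairs into full joined rows, optionally filtered."""
    for left, right in pairs:
        joined = left.user_data + right.user_data    # all columns of both sides
        if extra_condition is None or extra_condition(joined):
            yield joined


park = Geom((1.0, 1.0), ("park", 1.0, 1.0))
cafe = Geom((1.0, 1.5), ("cafe", 4))
kiosk = Geom((1.2, 0.9), ("kiosk", 2))
matched_pairs = [(park, cafe), (park, kiosk)]

all_rows = list(join_rows(matched_pairs))
# Extra condition on a non-geometry column: keep rows whose last column is >= 3.
filtered_rows = list(join_rows(matched_pairs, lambda row: row[-1] >= 3))
```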