Comparison of commonly used Spark functions


Operator categories: transformations and actions

Operator usage demos

coalesce & repartition & partitionBy

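repartition is a special case of coalesce: it calls coalesce with the shuffle parameter set to true, which redistributes the records across the new partitions using a HashPartitioner. If the data in the existing partitions is skewed, repartition can be used to reshuffle it so that it is evenly distributed. The reshuffled partitions have a wide dependency on the original partitions.

With shuffle set to false, coalesce does not reshuffle; it merges partitions, for example merging 1000 original partitions into 100. The parent RDD and the child RDD are related by a narrow dependency.

With shuffle set to false, if the requested number of partitions is larger than the current number, coalesce keeps the current number of partitions. With shuffle set to true, the effect is the same as repartition.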

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

partitionBy is available on key/value RDDs and requires a partitioner, which also fixes the number of partitions.

var rdd2=rdd.partitionBy(new HashPartitioner(2))
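The partition-count rules above can be checked with a small sketch. It assumes a local SparkSession; the object name `PartitionCountDemo`, the `run` helper and the sample data are made up for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object PartitionCountDemo {
  // Returns the number of partitions produced by each call, in order.
  def run(): Seq[Int] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-count-demo")
      .getOrCreate()
    try {
      val rdd = spark.sparkContext.makeRDD(1 to 100, 10) // 10 partitions
      Seq(
        rdd.coalesce(5).getNumPartitions,                 // merge without shuffle
        rdd.coalesce(20).getNumPartitions,                // cannot grow without shuffle
        rdd.coalesce(20, shuffle = true).getNumPartitions, // grows, because it shuffles
        rdd.repartition(20).getNumPartitions,             // same as the line above
        rdd.map(x => (x, x)).partitionBy(new HashPartitioner(2)).getNumPartitions
      )
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(",")) // prints 5,10,20,20,2
}
```

The second entry is the case to watch: asking coalesce for more partitions without a shuffle silently leaves the count at 10.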

range

// range is a half-open interval: [start, end)
range(1,4,1)
// output: 1,2,3
// `to` is a closed interval: [start, end]
sc.makeRDD(1 to 5,2)
// output: 1,2,3,4,5

zip & zipWithIndex & zipWithUniqueId

zip

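1. If the two RDDs have different numbers of partitions, an exception is thrown: Can't zip RDDs with unequal numbers of partitions

2. If the two RDDs do not have the same number of elements in each partition, an exception is thrown: Can only zip RDDs with same number of elements in each partition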

zipPartitions

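zipPartitions combines several RDDs into a new RDD, partition by partition.

The RDDs being combined must have the same number of partitions, but there is no requirement on the number of elements inside each partition.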

var rdd1 = sparkSession.range(1, 4, 1).rdd
var rdd2 = sparkSession.range(4, 7, 1).rdd
var rdd3 = sparkSession.range(7, 10, 1).rdd
// zip combines two RDDs into a key/value RDD. It assumes both RDDs have the same number of
// partitions and the same number of elements in each partition; otherwise an exception is thrown.
var rdd5 = rdd1 zip rdd2 zip rdd3
/**
 * +-----+---+
 * |   _1| _2|
 * +-----+---+
 * |[1,4]|  7|
 * |[2,5]|  8|
 * |[3,6]|  9|
 * +-----+---+
 */
// zipWithIndex pairs each element with its index in the RDD.
var rdd6 = rdd1.zipWithIndex
/**
 * +---+---+
 * | _1| _2|
 * +---+---+
 * |  1|  0|
 * |  2|  1|
 * |  3|  2|
 * +---+---+
 */
var rdd7 = sparkSession.range(1, 10, 2).rdd
// zipWithUniqueId pairs each element with a unique ID, generated as follows:
// the first element of each partition gets the partition index as its ID;
// the N-th element of each partition gets (ID of the previous element) + (total number of partitions of the RDD).
var rdd8 = rdd7.zipWithUniqueId()
/**
 * +---+---+
 * | _1| _2|
 * +---+---+
 * |  1|  0|
 * |  3|  2|
 * |  5|  1|
 * |  7|  3|
 * |  9|  5|
 * +---+---+
 */
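zipPartitions is described above without an example, so here is a minimal sketch. It assumes a local SparkSession; `ZipPartitionsDemo`, `run` and the sample data are illustrative names. The two RDDs have the same number of partitions but different element counts, which zip would reject and zipPartitions accepts.

```scala
import org.apache.spark.sql.SparkSession

object ZipPartitionsDemo {
  // For each pair of aligned partitions, returns (size of left partition, size of right partition).
  def run(): Seq[(Int, Int)] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("zip-partitions-demo")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val numbers = sc.makeRDD(1 to 5, 2)              // partitions: [1,2] and [3,4,5]
      val letters = sc.makeRDD(Seq("x", "y", "z"), 2)  // partitions: [x] and [y,z]
      numbers
        .zipPartitions(letters) { (ns, ls) => Iterator((ns.size, ls.size)) }
        .collect()
        .toSeq
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run()) // one (leftSize, rightSize) pair per partition
}
```

The function passed to zipPartitions receives one iterator per input RDD, so how the partitions are combined is entirely up to the caller.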

mapPartitionsWithIndex

var rdd1 = sparkSession.sparkContext.makeRDD(Array((1, "A"), (2, "B"), (3, "C"), (4, "D")), 2)
// Works like mapPartitions, but the function takes two arguments; the first is the partition index.
var rdd2 = rdd1.mapPartitionsWithIndex { (partIdx, iter) =>
  val part_map = scala.collection.mutable.Map[String, List[(Int, String)]]()
  while (iter.hasNext) {
    val part_name = "part_" + partIdx
    val elem = iter.next()
    if (part_map.contains(part_name)) {
      // prepend the element to the list already stored for this partition
      var elems = part_map(part_name)
      elems ::= elem
      part_map(part_name) = elems
    } else {
      part_map(part_name) = List[(Int, String)](elem)
    }
  }
  part_map.iterator
}.collect()
/**
 * +------+--------------+
 * |    _1|            _2|
 * +------+--------------+
 * |part_0|[[2,B], [1,A]]|
 * |part_1|[[4,D], [3,C]]|
 * +------+--------------+
 */

map & mapValues

var rdd1 = sparkSession.sparkContext.makeRDD(Array((1, "A"), (2, "B"), (3, "C"), (4, "D")), 2)
// map operates on the whole (K, V) pair
rdd1.map(_ + "_").foreach(println(_))
/**
 * (1,A)_
 * (3,C)_
 * (2,B)_
 * (4,D)_
 */
// mapValues operates on V only and leaves K unchanged
var rdd2 = rdd1.mapValues(_ + "_")
/**
 * +---+---+
 * | _1| _2|
 * +---+---+
 * |  1| A_|
 * |  2| B_|
 * |  3| C_|
 * |  4| D_|
 * +---+---+
 */
// Swap key and value
rdd1.map(_.swap).foreach(println(_))
/**
 * (C,3)
 * (D,4)
 * (A,1)
 * (B,2)
 */
// Implementing mapValues with map
rdd1.map(x => (x._1, x._2 + "_")).foreach(println(_))
/**
 * (1,A_)
 * (2,B_)
 * (3,C_)
 * (4,D_)
 */

foldByKey

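val rdd4 = sparkSession.sparkContext.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1)))
// The zero value (2) is folded in once per key per partition, so the result below
// corresponds to all values of a key living in the same partition.
val rdd5 = rdd4.foldByKey(2)(_ + _).collect()
/**
 * +---+---+
 * | _1| _2|
 * +---+---+
 * |  B|  5|
 * |  A|  4|
 * |  C|  3|
 * +---+---+
 */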
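Because the initial value of foldByKey is applied once per key in every partition that holds that key, a non-neutral zero value makes the result depend on partitioning. The sketch below illustrates this with the same data as above; it assumes a local SparkSession, and `FoldByKeyDemo` and `run` are illustrative names.

```scala
import org.apache.spark.sql.SparkSession

object FoldByKeyDemo {
  // Returns the foldByKey(2)(_ + _) result for 1 partition and for 5 partitions.
  def run(): (Map[String, Int], Map[String, Int]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("fold-by-key-demo")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val data = Seq(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 1))
      // One partition: the zero value 2 is added once per key.
      val onePartition = sc.makeRDD(data, 1).foldByKey(2)(_ + _).collectAsMap().toMap
      // Five partitions (one element each): the zero value is added once per element.
      val fivePartitions = sc.makeRDD(data, 5).foldByKey(2)(_ + _).collectAsMap().toMap
      (onePartition, fivePartitions)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (one, five) = run()
    println(one)  // A -> 4, B -> 5, C -> 3
    println(five) // A -> 6, B -> 7, C -> 3
  }
}
```

Using a neutral zero value (0 for addition) avoids this dependence on the partition layout.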

groupByKey & reduceByKey & aggregateByKey & foldByKey

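reduceByKey first aggregates on the map side and then aggregates again on the reduce side. This reduces the volume of data that has to be shuffled and eases the pressure on network transfer. groupByKey aggregates only on the reduce side, so it is less efficient than reduceByKey.

foldByKey is similar to reduceByKey: both aggregate on the map side first and then on the reduce side. The difference is that foldByKey takes an extra argument, the initial value of the computation.

groupByKey merges the values of each key into a single sequence and does not accept a custom combining function. Spark has to move all key/value pairs across the cluster first, which is expensive for the network between nodes and adds transfer latency.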

val words = Array("one", "two", "two", "three", "three", "three")
val wordsRDD = sparkSession.sparkContext.parallelize(words).map(word => (word, 1))
val wordsCountWithGroup = wordsRDD.groupByKey().map(w => (w._1, w._2.sum)).collect()
val wordsCountWithReduce = wordsRDD.reduceByKey(_ + _).collect()
val wordsCountWithAggregate = wordsRDD.aggregateByKey(0)((u: Int, v) => u + v, _ + _).collect()
// foldByKey is shorthand for aggregateByKey where seqOp and combOp are the same function
val wordsCountWithFold = wordsRDD.foldByKey(0)(_ + _).collect()
val wordsCountWithCombine = wordsRDD.combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2).collect()

combineByKey

Note:

mergeValue is only applied to values inside the same partition
mergeCombiners is only applied to combiners coming from different partitions

/**
 * References:
 * https://www.jianshu.com/p/d7552ea4f882
 * https://cloud.tencent.com/developer/ask/98711
 * This function converts an RDD[K,V] into an RDD[K,C]. The types V and C may be the same or different.
 *
 * def combineByKey[C](
 *   createCombiner: V => C,
 *   mergeValue: (C, V) => C,
 *   mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
 *   combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
 * }
 *
 * Parameters:
 * createCombiner: combiner function that converts a V into a C; input is the V of RDD[K,V], output is a C
 * mergeValue: runs inside each partition; merges a C and a V into a C; input is (C,V), output is a C
 * mergeCombiners: merges results from different partitions; merges two C values into one C; input is (C,C), output is a C
 * numPartitions: number of partitions of the result RDD; by default the existing number of partitions is kept
 * partitioner: partitioning function, HashPartitioner by default
 * mapSideCombine: whether to combine on the map side, similar to the combiner in MapReduce; true by default
 * serializer: serializer class, null by default
 */
// Average score per subject
val scores = sparkSession.sparkContext.makeRDD(List(("chinese", 88), ("chinese", 90), ("math", 60), ("math", 87)), 2)
var avgScoresRdd = scores.combineByKey(
  (x: Int) => (x, 1),                                             // createCombiner: (sum, count) for the first value
  (c: (Int, Int), x: Int) => (c._1 + x, c._2 + 1),                // mergeValue: add a value within a partition
  (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2) // mergeCombiners: merge across partitions
)
sparkSession.createDataFrame(avgScoresRdd).show()
var avgScores = avgScoresRdd.map { case (key, value) => (key, value._1 / value._2.toFloat) }
//.map(x=>(x,(x._1/x._2))

cogroup & union

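cogroup is comparable to a SQL full outer join: it returns the records from both the left and the right RDD, and the side that has no match is empty. The number of partitions and the partitioner can be specified. The result holds each key together with one Iterable of values per input RDD.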

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
var rdd1 = sparkSession.sparkContext.makeRDD(Array(("A", "1"), ("B", "2")), 2)
var rdd2 = sparkSession.sparkContext.makeRDD(Array(("A", "3"), ("C", "4")), 2)
var rdd3 = sparkSession.sparkContext.makeRDD(Array(("A", "5"), ("C", "6"), ("D", "8")), 2)
rdd1.cogroup(rdd2, rdd3).collect().foreach(x => println("(" + x._1 + "," + x._2._1 + "," + x._2._2 + x._2._3 + ")"))
/**
 * output:
 * (B,CompactBuffer(2),CompactBuffer()CompactBuffer())
 * (D,CompactBuffer(),CompactBuffer()CompactBuffer(8))
 * (A,CompactBuffer(1),CompactBuffer(3)CompactBuffer(5))
 * (C,CompactBuffer(),CompactBuffer(4)CompactBuffer(6))
 */
rdd1.union(rdd2).collect().foreach(x => println("(" + x._1 + "," + x._2 + ")"))

join

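// join is comparable to a SQL inner join: it only returns the results for keys K present in both RDDs.
// join works on two RDDs at a time; to join more RDDs, chain several joins.
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
// leftOuterJoin is comparable to a SQL left outer join: the calling RDD drives the result and unmatched records are empty (None).
// It works on two RDDs at a time; to join more RDDs, chain several joins.
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
// rightOuterJoin is comparable to a SQL right outer join: the RDD passed as argument drives the result and unmatched records are empty (None).
// It works on two RDDs at a time; to join more RDDs, chain several joins.
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
// cogroup is comparable to a SQL full outer join: it returns the records of both RDDs, and unmatched sides are empty.
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]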

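Note:

rdd1.leftOuterJoin(rdd2) and rdd2.rightOuterJoin(rdd1) contain the same records, but the output layout differs. For both a left join and a right join, the value of the left-hand (calling) RDD comes first in the output, followed by the value of the right-hand (argument) RDD.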
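The difference in output layout is easiest to see side by side. The following is a small sketch under the assumption of a local SparkSession; `JoinDemo`, `run` and the sample data are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object JoinDemo {
  type Inner = Map[String, (String, String)]
  type LeftOuter = Map[String, (String, Option[String])]
  type RightOuter = Map[String, (Option[String], String)]

  // Returns (left join right, left leftOuterJoin right, right rightOuterJoin left).
  def run(): (Inner, LeftOuter, RightOuter) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("join-demo")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val left = sc.makeRDD(Seq(("A", "1"), ("B", "2")))
      val right = sc.makeRDD(Seq(("A", "3"), ("C", "4")))
      val inner = left.join(right).collect().toMap
      // Same records in both outer joins, but the value of the calling RDD always comes first.
      val leftOuter = left.leftOuterJoin(right).collect().toMap
      val rightOuter = right.rightOuterJoin(left).collect().toMap
      (inner, leftOuter, rightOuter)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (inner, leftOuter, rightOuter) = run()
    println(inner)      // only key A matches
    println(leftOuter)  // keys A and B, values as (V, Option[W])
    println(rightOuter) // keys A and B, values as (Option[V], W)
  }
}
```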

union & intersection & subtract

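subtractByKey is similar to subtract among the basic transformations, except that it compares keys only: it returns the elements whose key appears in the main RDD and does not appear in otherRDD. The number of output partitions and the partitioner can be specified.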
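A short sketch of the set-like operators in this section, assuming a local SparkSession; `SetOpsDemo`, `run` and the sample data are illustrative. Note that union keeps duplicates, while intersection removes them.

```scala
import org.apache.spark.sql.SparkSession

object SetOpsDemo {
  // Returns (union count, sorted intersection, sorted subtract, subtractByKey result).
  def run(): (Long, Seq[Int], Seq[Int], Seq[(String, Int)]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("set-ops-demo")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val a = sc.makeRDD(1 to 5)
      val b = sc.makeRDD(4 to 8)
      val unionCount = a.union(b).count()                   // duplicates 4 and 5 are kept
      val common = a.intersection(b).collect().sorted.toSeq // elements present in both
      val onlyInA = a.subtract(b).collect().sorted.toSeq    // elements of a not in b
      val pairsA = sc.makeRDD(Seq(("A", 1), ("B", 2)))
      val pairsB = sc.makeRDD(Seq(("A", 9)))
      // Only the key is compared: ("A", 1) is dropped even though the values differ.
      val byKey = pairsA.subtractByKey(pairsB).collect().toSeq
      (unionCount, common, onlyInA, byKey)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```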

transformation

map/mapValues/flatMap/mapPartitions/mapPartitionsWithIndex
filter
distinct: removes duplicate elements; it involves a shuffle, and the order of the returned elements is not guaranteed

action

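rdd.foreach
rdd.first
rdd.take(10): takes elements starting from the first element of the first partition, without sorting
rdd.takeOrdered(10): similar to top, but with the opposite sort order (ascending by default)
rdd.top(10): takes the first 10 elements in descending order by default; a custom ordering can be supplied
rdd.sortBy(x=>x._2,true): sorts by the second field of the RDD in ascending order (false for descending); note that sortBy itself is a transformation
rdd.countByValue(): counts the occurrences of each element of the RDD. It has nothing to do with the v of a (k,v) tuple; for a pair RDD the whole tuple is the element being counted.
rdd.aggregate(1)({(x:Int,y:Int)=>x+y},{(sum1:Int,sum2:Int)=>sum1+sum2})
rdd.fold(1)((x:Int,y:Int)=>x+y): shorthand for aggregate where seqOp and combOp are the same function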
saveAsTextFile,saveAsObjectFile,saveAsSequenceFile
rdd.takeSample
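The actions listed above can be compared on a tiny RDD. This is a sketch assuming a local SparkSession; `ActionsDemo`, `run` and the sample data are illustrative. With a zero value of 1 and two partitions, aggregate and fold add the zero value three times: once per partition and once more when the partition results are merged.

```scala
import org.apache.spark.sql.SparkSession

object ActionsDemo {
  // Returns (take(2), top(2), takeOrdered(2), countByValue, aggregate(1), fold(1)).
  def run(): (Seq[Int], Seq[Int], Seq[Int], Map[Int, Long], Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("actions-demo")
      .getOrCreate()
    try {
      val rdd = spark.sparkContext.makeRDD(Seq(3, 1, 2, 3), 2) // partitions: [3,1] and [2,3]
      val firstTwo = rdd.take(2).toSeq           // partition order, no sorting
      val largest = rdd.top(2).toSeq             // descending
      val smallest = rdd.takeOrdered(2).toSeq    // ascending
      val counts = rdd.countByValue().toMap      // occurrences of each element
      val aggregated = rdd.aggregate(1)((acc: Int, x: Int) => acc + x, (s1: Int, s2: Int) => s1 + s2)
      val folded = rdd.fold(1)((x: Int, y: Int) => x + y)
      (firstTwo, largest, smallest, counts, aggregated, folded)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The elements sum to 9, so both aggregate and fold return 9 + 3 = 12 here; with a different number of partitions the result would change unless the zero value is neutral.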

sparkSql

object aggregatesFun extends Catalogs_Tutorial {
  import org.apache.spark.sql.functions._
  questionsDataFrame
    .filter("id > 400 and id< 450")
    .filter("owner_userid is not null")
    .join(dfTags, dfQuestions.col("id").equalTo(dfTags("id")))
    .groupBy(dfQuestions.col("owner_userid"))
    .agg(avg("score"), max("answer_count"))
    //    .sparkSession.conf.set("retainGroupColumns",false) // whether the result shows the grouping column
    .show()
}
+------------+----------+-----------------+
|owner_userid|avg(score)|max(answer_count)|
+------------+----------+-----------------+
|         268|      26.0|                1|
|         136|      57.6|                9|
|         123|      20.0|                3|
+------------+----------+-----------------+

Statistical functions

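Basic statistical functions: avg, mean, max, min, sum
Advanced statistical functions: Pearson correlation (corr), covariance (cov), frequent items (freqItems), cross tabulation (crosstab), row/column conversion (pivot), sampling (sample), stratified sampling (sampleBy), frequency estimation (countMinSketch), Bloom filter
Showing summary statistics of a DataFrame: describe, which includes the standard deviation (stddev) as well as avg, max, min and count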

Hand-written wordCount

import org.apache.spark.{SparkConf, SparkContext}

object LocalWorldCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("my first spark local App")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:\\E:\\data\\worldCount.txt")
    val words = lines.flatMap(line => line.split(" "))
    val pairs = words.map(word => (word, 1))
    val worldCount = pairs.reduceByKey(_ + _)
    // Sort by count: swap to (count, word), sort ascending by key, then swap back
    val sortedWordCount = worldCount.map(pair => (pair._2, pair._1)).sortByKey(true).map(pair => (pair._2, pair._1))
    sortedWordCount.collect.foreach(println)
    sc.stop()
  }
}
// Equivalent SQL
lines.

Choosing operators

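Prefer mapPartitions / reduceByKey / foreachPartition over map / groupByKey / foreach.

Run coalesce after a filter.

Use repartitionAndSortWithinPartitions instead of repartition followed by a sort.

repartitionAndSortWithinPartitions is an operator recommended on the Spark website. The official advice is that if you need to sort after repartitioning, you can use repartitionAndSortWithinPartitions directly, because it sorts while the repartitioning shuffle is in progress. Doing the shuffle and the sort together is likely to perform better than shuffling first and sorting afterwards.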
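A minimal sketch of repartitionAndSortWithinPartitions, assuming a local SparkSession; `RepartitionSortDemo`, `run` and the sample data are illustrative. Each output partition is sorted by key, but there is no global order across partitions.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object RepartitionSortDemo {
  // Returns the content of each output partition, in partition order.
  def run(): Seq[Seq[(Int, String)]] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-sort-demo")
      .getOrCreate()
    try {
      val pairs = spark.sparkContext.makeRDD(Seq((3, "c"), (1, "a"), (4, "d"), (2, "b")), 2)
      pairs
        // HashPartitioner(2) sends even keys to partition 0 and odd keys to partition 1;
        // the keys are sorted inside each partition as part of the shuffle.
        .repartitionAndSortWithinPartitions(new HashPartitioner(2))
        .glom()      // one array per partition
        .collect()
        .map(_.toSeq)
        .toSeq
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().zipWithIndex.foreach { case (part, idx) => println(s"partition $idx: $part") }
}
```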

Reposted from: https://my.oschina.net/freelili/blog/3037961