个人主页zicesun.com

这里，从源码的角度总结一下Spark RDD算子的用法。

单值型Transformation算子

map

  /*** Return a new RDD by applying a function to all elements of this RDD.*/def map[U: ClassTag](f: T => U): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))}
复制代码

源码中有一个 sc.clean() 函数，它的所用是去除闭包中不能序列话的外部引用变量。Scala支持闭包，闭包会把它对外的引用(闭包里面引用了闭包外面的对像)保存到自己内部，这个闭包就可以被单独使用了，而不用担心它脱离了当前的作用域；但是在spark这种分布式环境里，这种作法会带来问题，如果对外部的引用是不可serializable的，它就不能正确被发送到worker节点上去了；还有一些引用，可能根本没有用到，这些没有使用到的引用是不需要被发到worker上的；实际上sc.clean函数调用的是ClosureCleaner.clean()；ClosureCleaner.clean()通过递归遍历闭包里面的引用，检查不能serializable的, 去除unused的引用；

map函数是一个粗粒度的操作，对于一个RDD来说，会使用迭代器对分区进行遍历，然后针对一个分区使用你想要执行的操作f, 然后返回一个新的RDD。其实可以理解为rdd的每一个元素都会执行同样的操作。

scala> val array = Array(1,2,3,4,5,6)
array: Array[Int] = Array(1, 2, 3, 4, 5, 6)

scala> val rdd = sc.app
appName   applicationAttemptId   applicationId

scala> val rdd = sc.parallelize(array, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> val mapRdd = rdd.map(x => x * 2)
mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

scala> mapRdd.collect().foreach(println)
2
4
6
8
10
12
复制代码

flatMap

flatMap方法与map方法类似，但是允许一次map方法中输出多个对象，而不是map中的一个对象经过函数转换成另一个对象。

  /***  Return a new RDD by first applying a function to all elements of this*  RDD, and then flattening the results.*/def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))}
复制代码

scala> val a = sc.parallelize(1 to 10, 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> a.flatMap(num => 1 to num).collect
res1: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)复制代码

mapPartitions

mapPartitions是map的另一个实现，map的输入函数应用与RDD的每个元素，但是mapPartitions的输入函数作用于每个分区，也就是每个分区的内容作为整体。

 /*** Return a new RDD by applying a function to each partition of this RDD.** `preservesPartitioning` indicates whether the input function preserves the partitioner, which* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.*/def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U],preservesPartitioning: Boolean = false): RDD[U] = withScope {val cleanedF = sc.clean(f)new MapPartitionsRDD(this,(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),preservesPartitioning)}
复制代码

scala> def myfunc[T](iter: Iterator[T]):Iterator[(T,T)]={ | var res = List[(T,T)]()| var pre = iter.next| while(iter.hasNext){|   var cur = iter.next|   res .::= (pre, cur)|   pre = cur| }| res.iterator| }
myfunc: [T](iter: Iterator[T])Iterator[(T, T)]

scala> val a = sc.parallelize(1 to 9,3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.mapPartitions
mapPartitions   mapPartitionsWithIndex

scala> a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
复制代码

mapPartitionWithIndex

mapPartitionWithIndex方法与mapPartitions方法类似，不同的是mapPartitionWithIndex会对原始分区的索引进行追踪，这样就可以知道分区所对应的元素，方法的参数为一个函数，函数的输入为整型索引和迭代器。

  /*** Return a new RDD by applying a function to each partition of this RDD, while tracking the index* of the original partition.** `preservesPartitioning` indicates whether the input function preserves the partitioner, which* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.*/def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U],preservesPartitioning: Boolean = false): RDD[U] = withScope {val cleanedF = sc.clean(f)new MapPartitionsRDD(this,(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),preservesPartitioning)}
复制代码

scala> val x = sc.parallelize(1 to 10, 3)
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> def myFunc(index:Int, iter:Iterator[Int]):Iterator[String]={|   iter.toList.map(x => index + "," + x).iterator| }
myFunc: (index: Int, iter: Iterator[Int])Iterator[String]

scala> x.mapPartitions
mapPartitions   mapPartitionsWithIndex

scala> x.mapPartitionsWithIndex(myFunc).collect
res1: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)
复制代码

foreach

foreach主要对每一个输入的数据对象执行循环操作，可以用来执行对RDD元素的输出操作。

  /*** Applies a function f to all elements of this RDD.*/def foreach(f: T => Unit): Unit = withScope {val cleanF = sc.clean(f)sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))}
复制代码

scala> var x = sc.parallelize(List(1 to 9), 3)
x: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> x.foreach(print)
Range(1, 2, 3, 4, 5, 6, 7, 8, 9)
复制代码

foreachPartition

foreachPartition方法和mapPartition的作用一样，通过迭代器参数对RDD中每一个分区的数据对象应用函数，区别在于使用的参数是否有返回值。

  /*** Applies a function f to each partition of this RDD.*/def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {val cleanF = sc.clean(f)sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))}
复制代码

scala> val b = sc.parallelize(List(1,2,3,4,5,6), 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> b.foreachPartition(x => println(x.reduce((a,b) => a +b)))
7
3
11
复制代码

glom

glom的作用与collec类似，collect是将RDD直接转化为数组的形式，而glom则是将RDD分区数据组装到数组类型的RDD中，每一个返回的数组包含一个分区的所有元素，按分区转化为数组，有几个分区就返回几个数组类型的RDD。

  /*** Return an RDD created by coalescing all elements within each partition into an array.*/def glom(): RDD[Array[T]] = withScope {new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))}
复制代码

下面的例子中，RDD a有三个分区，glom将a转化为由三个数组构成的RDD。

scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> a.glom.collect
res5: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))

scala> a.glom
res6: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[4] at glom at <console>:26
复制代码

union

union方法与++方法是等价的，将两个RDD去并集，取并集的过程中不会去重。

  /*** Return the union of this RDD and another one. Any identical elements will appear multiple* times (use `.distinct()` to eliminate them).*/def union(other: RDD[T]): RDD[T] = withScope {sc.union(this, other)}/*** Return the union of this RDD and another one. Any identical elements will appear multiple* times (use `.distinct()` to eliminate them).*/def ++(other: RDD[T]): RDD[T] = withScope {this.union(other)}
复制代码

scala> val a = sc.parallelize(1 to 4,2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 5,1)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.un
union   unpersist

scala> a.union(b).collect
res7: Array[Int] = Array(1, 2, 3, 4, 2, 3, 4, 5)复制代码

cartesian

计算两个RDD中每个对象的笛卡尔积

  /*** Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of* elements (a, b) where a is in `this` and b is in `other`.*/def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {new CartesianRDD(sc, this, other)}复制代码

cala> val a = sc.parallelize(1 to 4,2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 5,1)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.cartesian(b).collect
res8: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5), (4,2), (4,3), (4,4), (4,5))复制代码

groupBy

groupBy方法有三个重载方法，功能是讲元素通过map函数生成Key-Value格式，然后使用groupByKey方法对Key-Value进行聚合。

/*** Return an RDD of grouped items. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {groupBy[K](f, defaultPartitioner(this))}/*** Return an RDD of grouped elements. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K,numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {groupBy(f, new HashPartitioner(numPartitions))}/*** Return an RDD of grouped items. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null): RDD[(K, Iterable[T])] = withScope {val cleanF = sc.clean(f)this.map(t => (cleanF(t), t)).groupByKey(p)}复制代码

scala> val a = sc.parallelize(1 to 9,2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> a.groupBy(x => {if(x % 2 == 0) "even" else "odd"}).collect
res9: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9)))

scala> def myfunc(a: Int):Int={| a % 2| }
myfunc: (a: Int)Int

scala> a.groupBy(myfunc).collect
res10: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9)))

scala> a.groupBy(myfunc(_), 1).collect
res11: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9)))复制代码

filter

filter方法对输入元素进行过滤，参数是一个返回值为boolean的函数，如果函数对元素的运算结果为true，则通过元素，否则就将该元素过滤，不进入结果集。

  /*** Return a new RDD containing only the elements that satisfy a predicate.*/def filter(f: T => Boolean): RDD[T] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[T, T](this,(context, pid, iter) => iter.filter(cleanF),preservesPartitioning = true)}复制代码

scala> val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24

scala> val b = a.filter(x => x.length >= 4)
b: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at filter at <console>:25

scala> b.collect.foreach(println)
from
China
from
America复制代码

distinct

distinct方法将RDD中重复的元素去掉，只留下唯一的RDD元素。

  /*** Return a new RDD containing the distinct elements in this RDD.*/def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)}复制代码

scala> val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24

scala> val b = a.map(x => x.length)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at map at <console>:25

scala> val c = b.distinct
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at distinct at <console>:25

scala> c.foreach(println)
5
4
2
3
7复制代码

subtract

subtract方法就是求集合A-B，即把集合A中包含集合B的元素都删除，返回剩下的元素。

 /*** Return an RDD with the elements from `this` that are not in `other`.** Uses `this` partitioner/partition size, because even if `other` is huge, the resulting* RDD will be &lt;= us.*/def subtract(other: RDD[T]): RDD[T] = withScope {subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {subtract(other, new HashPartitioner(numPartitions))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T],p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {if (partitioner == Some(p)) {// Our partitioner knows how to handle T (which, since we have a partitioner, is// really (K, V)) so make a new Partitioner that will de-tuple our fake tuplesval p2 = new Partitioner() {override def numPartitions: Int = p.numPartitionsoverride def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)}// Unfortunately, since we're making a new p2, we'll get ShuffleDependencies// anyway, and when calling .keys, will not have a partitioner set, even though// the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be// partitioned by the right/real keys (e.g. p).this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys} else {this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys}}复制代码

scala> val a = sc.parallelize(1 to 9, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 5, 4)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24

scala> val c = a.subtract(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at subtract at <console>:27

scala> c.collect
res14: Array[Int] = Array(6, 8, 1, 7, 9)复制代码

persist与cache

cache,缓存数据，把RDD缓存到内存中，以便下次计算式再次被调用。persist是把RDD根据不同的级别进行持久化，通过参数指定持久化级别，如果不带参数则为默认持久化级别，即只保存到内存中，与Cache等价。

sample

sample方法的作用是随即对RDD中的元素进行采样，或得一个新的子RDD。根据参数制定是否放回采样，子集占总数的百分比和随机种子。

 /*** Return a sampled subset of this RDD.** @param withReplacement can elements be sampled multiple times (replaced when sampled out)* @param fraction expected size of the sample as a fraction of this RDD's size*  without replacement: probability that each element is chosen; fraction must be [0, 1]*  with replacement: expected number of times each element is chosen; fraction must be greater*  than or equal to 0* @param seed seed for the random number generator** @note This is NOT guaranteed to provide exactly the fraction of the count* of the given [[RDD]].*/def sample(withReplacement: Boolean,fraction: Double,seed: Long = Utils.random.nextLong): RDD[T] = {require(fraction >= 0,s"Fraction must be nonnegative, but got ${fraction}")withScope {require(fraction >= 0.0, "Negative fraction value: " + fraction)if (withReplacement) {new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)} else {new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)}}}复制代码

scala> val a = sc.parallelize(1 to 100, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24

scala> val b = a.sample(false, 0.2, 0)
b: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[32] at sample at <console>:25

scala> b.foreach(println)
5
19
20
26
27
29
30
57
40
61
45
68
73
50
75
79
81
85
89
99复制代码

键值对型transformation算子

groupByKey

类似于groupBy，将每一个相同的Key的Value聚集起来形成序列，可以使用默认的分区器和自定义的分区器。

  /*** Group the values for each key in the RDD into a single sequence. Allows controlling the* partitioning of the resulting key-value pair RDD by passing a Partitioner.* The ordering of elements within each group is not guaranteed, and may even differ* each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.** @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.*/def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {// groupByKey shouldn't use map side combine because map side combine does not// reduce the amount of data shuffled and requires all map side data be inserted// into a hash table, leading to more objects in the old gen.val createCombiner = (v: V) => CompactBuffer(v)val mergeValue = (buf: CompactBuffer[V], v: V) => buf += vval mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2val bufs = combineByKeyWithClassTag[CompactBuffer[V]](createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)bufs.asInstanceOf[RDD[(K, Iterable[V])]]}/*** Group the values for each key in the RDD into a single sequence. Hash-partitions the* resulting RDD with into `numPartitions` partitions. The ordering of elements within* each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.** @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.*/def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {groupByKey(new HashPartitioner(numPartitions))}复制代码

scala> val a = sc.parallelize(List("mk", "zq", "xwc", "fig", "dcp", "snn"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> val b = a.keyBy(x => x.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[34] at keyBy at <console>:25

scala> b.groupByKey.collect
res17: Array[(Int, Iterable[String])] = Array((2,CompactBuffer(mk, zq)), (3,CompactBuffer(xwc, fig, dcp, snn)))复制代码

combineByKey

comineByKey方法能够有效地讲键值对形式的RDD相同的Key的Value合并成序列形式，用户能自定义RDD的分区器和是否在Map端进行聚合操作。

  /*** Generic function to combine the elements for each key using a custom set of aggregation* functions. This method is here for backward compatibility. It does not provide combiner* classtag information to the shuffle.** @see `combineByKeyWithClassTag`*/def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C,mergeCombiners: (C, C) => C,partitioner: Partitioner,mapSideCombine: Boolean = true,serializer: Serializer = null): RDD[(K, C)] = self.withScope {combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,partitioner, mapSideCombine, serializer)(null)}/*** Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.* This method is here for backward compatibility. It does not provide combiner* classtag information to the shuffle.** @see `combineByKeyWithClassTag`*/def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C,mergeCombiners: (C, C) => C,numPartitions: Int): RDD[(K, C)] = self.withScope {combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)}复制代码

scala> val a = sc.parallelize(List("xwc", "fig","wc", "dcp", "zq", "znn", "mk", "zl", "hk", "lp"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24

scala> val b = sc.parallelize(List(1,2,2,3,2,1,2,2,2,3),2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[37] at parallelize at <console>:24

scala> val c = b.zip(a)
c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[38] at zip at <console>:27
scala> val d = c.combineByKey(List(_), (x:List[String], y:String)=>y::x, (x:List[String], y:List[String])=>x::: y)
d: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[39] at combineByKey at <console>:25

scala> d.collect
res18: Array[(Int, List[String])] = Array((2,List(zq, wc, fig, hk, zl, mk)), (1,List(xwc, znn)), (3,List(dcp, lp)))复制代码

上面的例子使用三个参数重载的方法，该方法的第一个参数createCombiner把元素V转换成另一类元素C，该例子中使用的参数是List(_),表示将输入元素放在List集合中；第二个参数mergeValue的含义是吧元素V合并到元素C中，该例子中使用的是(x:List[String],y:String)=>y::x,表示将y字符合并到x链表集合中；第三个参数的含义是讲两个C元素合并，该例子中使用的是（x:List[String], y:List[String]）=>x:::y, 表示把x链表集合中的内容合并到y链表中。

reduceByKey

使用一个reduce函数来实现对想要的Key的value的聚合操作，发送给reduce前会在map端本地merge操作，该方法的底层实现是调用combineByKey方法的一个重载方法。

 /*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce.*/def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.*/def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {reduceByKey(new HashPartitioner(numPartitions), func)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/* parallelism level.*/def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {reduceByKey(defaultPartitioner(self), func)}复制代码

scala> val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val b = a.map(x => (x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at map at <console>:25

scala> b.reduceByKey((a, b) => a + b ).collect
res1: Array[(Int, String)] = Array((2,wcza), (3,dcpfjgsnn))复制代码

sortByKey

根据Key值对键值对进行排序，如果是字符，则按照字典顺序排序，如果是数组则按照数字大小排序，可通过参数指定升序还是降序。

scala> val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> val b = sc.parallelize(1 to a.count.toInt,2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:26

scala> val c = a.zip(b)
c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[6] at zip at <console>:27

scala> c.sortByKey(true).collect
res2: Array[(String, Int)] = Array((dcp,1), (fjg,2), (snn,3), (wc,4), (za,5))复制代码

cogroup

scala> val a = sc.parallelize(List(1,2,2,3,1,3),2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24scala> val b = a.map(x => (x, "b"))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[11] at map at <console>:25scala> val c = a.map(x => (x, "c"))
c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[12] at map at <console>:25scala> b.cogroup(c).collect
res3: Array[(Int, (Iterable[String], Iterable[String]))] = Array((2,(CompactBuffer(b, b),CompactBuffer(c, c))), (1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b, b),CompactBuffer(c, c))))scala> val a = sc.parallelize(List(1,2,2,2,1,3),1)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24scala> val b = a.map(x => (x, "b"))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[16] at map at <console>:25scala> val c = a.map(x => (x, "c"))
c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at map at <console>:25scala> b.cogroup(c).collect
res4: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b, b, b),CompactBuffer(c, c, c))))复制代码

join

首先对RDD进行cogroup操作，然后对每个新的RDD下Key的值进行笛卡尔积操作，再返回结果使用flatmapValue方法。

scala> val a= sc.parallelize(List("fjg","wc","xwc"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:24 scala> val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:24scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[23] at keyBy at <console>:25scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[24] at keyBy at <console>:25scala> b.join(d).collect
res6: Array[(Int, (String, String))] = Array((2,(wc,wc)), (2,(wc,zq)), (3,(fjg,fig)), (3,(fjg,sbb)), (3,(fjg,xwc)), (3,(fjg,dcp)), (3,(xwc,fig)), (3,(xwc,sbb)), (3,(xwc,xwc)), (3,(xwc,dcp)))复制代码

Action算子

collect

把RDD中的元素以数组的形式返回。

  /*** Return an array that contains all of the elements in this RDD.** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.*/def collect(): Array[T] = withScope {val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)Array.concat(results: _*)}复制代码

scala> val a = sc.parallelize(List("a", "b", "c"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> a.collect
res7: Array[String] = Array(a, b, c)复制代码

reduce

使用一个带两个参数的函数把元素进行聚集，返回一个元素的结果。该函数中的二元操作应该满足交换律和结合律，这样才能在并行计算中得到正确的计算结果。

  /*** Reduces the elements of this RDD using the specified commutative and* associative binary operator.*/def reduce(f: (T, T) => T): T = withScope {val cleanF = sc.clean(f)val reducePartition: Iterator[T] => Option[T] = iter => {if (iter.hasNext) {Some(iter.reduceLeft(cleanF))} else {None}}var jobResult: Option[T] = Noneval mergeResult = (index: Int, taskResult: Option[T]) => {if (taskResult.isDefined) {jobResult = jobResult match {case Some(value) => Some(f(value, taskResult.get))case None => taskResult}}}sc.runJob(this, reducePartition, mergeResult)// Get the final result out of our Option, or throw an exception if the RDD was emptyjobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))}复制代码

scala> val a = sc.parallelize(1 to 10, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[29] at parallelize at <console>:24

scala> a.reduce((a, b) => a + b)
res8: Int = 55复制代码

take

take方法会从RDD中取出前n个元素。先扫描一个分区，之后从分区中得到结果，然后评估该分区的元素是否满足n，若果不满足则继续从其他分区中扫描获取。

/*** Take the first num elements of the RDD. It works by first scanning one partition, and use the* results from that partition to estimate the number of additional partitions needed to satisfy* the limit.** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @note Due to complications in the internal implementation, this method will raise* an exception if called on an RDD of `Nothing` or `Null`.*/def take(num: Int): Array[T] = withScope {val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)if (num == 0) {new Array[T](0)} else {val buf = new ArrayBuffer[T]val totalParts = this.partitions.lengthvar partsScanned = 0while (buf.size < num && partsScanned < totalParts) {// The number of partitions to try in this iteration. It is ok for this number to be// greater than totalParts because we actually cap it at totalParts in runJob.var numPartsToTry = 1Lval left = num - buf.sizeif (partsScanned > 0) {// If we didn't find any rows after the previous iteration, quadruple and retry.// Otherwise, interpolate the number of partitions we need to try, but overestimate// it by 50%. We also cap the estimation in the end.if (buf.isEmpty) {numPartsToTry = partsScanned * scaleUpFactor} else {// As left > 0, numPartsToTry is always >= 1numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toIntnumPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)}}val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)res.foreach(buf ++= _.take(num - buf.size))partsScanned += p.size}buf.toArray}}复制代码

scala> val a = sc.parallelize(1 to 10, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at parallelize at <console>:24

scala> a.take(5)
res9: Array[Int] = Array(1, 2, 3, 4, 5)复制代码

top

top会采用隐式排序转换来获取最大的前n个元素。

  /*** Returns the top k (largest) elements from this RDD as defined by the specified* implicit Ordering[T] and maintains the ordering. This does the opposite of* [[takeOrdered]]. For example:* {{{*   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)*   // returns Array(12)**   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)*   // returns Array(6, 5)* }}}** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @param num k, the number of top elements to return* @param ord the implicit ordering for T* @return an array of top elements*/def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {takeOrdered(num)(ord.reverse)}/*** Returns the first k (smallest) elements from this RDD as defined by the specified* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].* For example:* {{{*   sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)*   // returns Array(2)**   sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)*   // returns Array(2, 3)* }}}** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @param num k, the number of elements to return* @param ord the implicit ordering for T* @return an array of top elements*/def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {if (num == 0) {Array.empty} else {val mapRDDs = mapPartitions { items =>// Priority keeps the largest elements, so let's reverse the ordering.val queue = new BoundedPriorityQueue[T](num)(ord.reverse)queue ++= collectionUtils.takeOrdered(items, num)(ord)Iterator.single(queue)}if (mapRDDs.partitions.length == 0) {Array.empty} else {mapRDDs.reduce { (queue1, queue2) =>queue1 ++= queue2queue1}.toArray.sorted(ord)}}}复制代码

scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2)
c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24

scala> c.top(3)
res10: Array[Int] = Array(97, 32, 8)复制代码

count

count方法计算返回RDD中元素的个数。

  /*** Return the number of elements in the RDD.*/def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum复制代码

scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2)
c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> c.count
res11: Long = 9复制代码

takeSample

返回一个固定大小的数组形式的采样子集，此外还会把返回元素的顺序随机打乱。

 /*** Return a fixed-size sampled subset of this RDD in an array** @param withReplacement whether sampling is done with replacement* @param num size of the returned sample* @param seed seed for the random number generator* @return sample of specified size in an array** @note this method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.*/def takeSample(withReplacement: Boolean,num: Int,seed: Long = Utils.random.nextLong): Array[T] = withScope {val numStDev = 10.0require(num >= 0, "Negative number of elements requested")require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),"Cannot support a sample size > Int.MaxValue - " +s"$numStDev * math.sqrt(Int.MaxValue)")if (num == 0) {new Array[T](0)} else {val initialCount = this.count()if (initialCount == 0) {new Array[T](0)} else {val rand = new Random(seed)if (!withReplacement && num >= initialCount) {Utils.randomizeInPlace(this.collect(), rand)} else {val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,withReplacement)var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()// If the first sample didn't turn out large enough, keep trying to take samples;// this shouldn't happen often because we use a big multiplier for the initial sizevar numIters = 0while (samples.length < num) {logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()numIters += 1}Utils.randomizeInPlace(samples, rand).take(num)}}}}复制代码

scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2)
c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24

scala> c.takeSample(true,3, 1)
res14: Array[Int] = Array(1, 3, 7)复制代码

saveAsTextFile

将RDD存储为文本文件，一次存一行

countByKey

类似count，但是countByKey会根据Key计算对应的Value个数，返回Map类型的结果。

  /*** Count the number of elements for each key, collecting the results to a local Map.** @note This method should only be used if the resulting map is expected to be small, as* the whole thing is loaded into the driver's memory.* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which* returns an RDD[T, Long] instead of a map.*/def countByKey(): Map[K, Long] = self.withScope {self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap}复制代码

scala> val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24

scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[37] at keyBy at <console>:25

scala> d.countByKey
res15: scala.collection.Map[Int,Long] = Map(2 -> 2, 3 -> 4)复制代码

aggregate

  /*** Aggregate the elements of each partition, and then the results for all the partitions, using* given combine functions and a neutral "zero value". This function can return a different result* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are* allowed to modify and return their first argument instead of creating a new U to avoid memory* allocation.** @param zeroValue the initial value for the accumulated result of each partition for the*                  `seqOp` operator, and also the initial value for the combine results from*                  different partitions for the `combOp` operator - this will typically be the*                  neutral element (e.g. `Nil` for list concatenation or `0` for summation)* @param seqOp an operator used to accumulate results within a partition* @param combOp an associative operator used to combine results from different partitions*/def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {// Clone the zero value since we will also be serializing it as part of tasksvar jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())val cleanSeqOp = sc.clean(seqOp)val cleanCombOp = sc.clean(combOp)val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)sc.runJob(this, aggregatePartition, mergeResult)jobResult}复制代码

fold

/*** Aggregate the elements of each partition, and then the results for all the partitions, using a* given associative function and a neutral "zero value". The function* op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object* allocation; however, it should not modify t2.** This behaves somewhat differently from fold operations implemented for non-distributed* collections in functional languages like Scala. This fold operation may be applied to* partitions individually, and then fold those results into the final result, rather than* apply the fold to each element sequentially in some defined ordering. For functions* that are not commutative, the result may differ from that of a fold applied to a* non-distributed collection.** @param zeroValue the initial value for the accumulated result of each partition for the `op`*                  operator, and also the initial value for the combine results from different*                  partitions for the `op` operator - this will typically be the neutral*                  element (e.g. `Nil` for list concatenation or `0` for summation)* @param op an operator used to both accumulate results within a partition and combine results*                  from different partitions*/def fold(zeroValue: T)(op: (T, T) => T): T = withScope {// Clone the zero value since we will also be serializing it as part of tasksvar jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())val cleanOp = sc.clean(op)val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)sc.runJob(this, foldPartition, mergeResult)jobResult}复制代码

转载于:https://juejin.im/post/5cfdf178e51d454d1d6284e8

Spark学习之Spark RDD算子相关推荐

Spark性能调优-RDD算子调优篇
Spark性能调优-RDD算子调优篇 RDD算子调优 1. RDD复用在对RDD进行算子时,要避免相同的算子和计算逻辑之下对RDD进行重复的计算,如下图所示: 对上图中的RDD计算架构进行修改,得到 ...
Spark学习之Spark调优与调试（7）
Spark学习之Spark调优与调试(7) 1. 对Spark进行调优与调试通常需要修改Spark应用运行时配置的选项. 当创建一个SparkContext时就会创建一个SparkConf实例. 2. ...
Spark学习之Spark Streaming（9）
Spark学习之Spark Streaming(9) 1. Spark Streaming允许用户使用一套和批处理非常接近的API来编写流式计算应用,这就可以大量重用批处理应用的技术甚至代码. 2. ...
Spark学习笔记(7)——RDD行动算子
RDD方法又称RDD算子. 算子 : Operator(操作) RDD的方法和Scala集合对象的方法不一样,集合对象的方法都是在同一个节点的内存中完成的.RDD的方法可以将计算逻辑发送到Execut ...
Spark学习之Spark初识
一.什么是Spark Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎.Spark是UC Berkeley AMP lab (加州大学伯克利分校的AMP实验室)所开源的类Had ...
Spark学习之Spark Streaming
一.简介许多应用需要即时处理收到的数据,例如用来实时追踪页面访问统计的应用.训练机器学习模型的应用,还有自动检测异常的应用.Spark Streaming 是 Spark 为这些应用而设计的模型.它 ...
Spark学习之spark集群搭建
(推广一下自己的个人主页 zicesun.com) 本文讲介绍如何搭建spark集群. 搭建spark集群需要进行一下几件事情: 集群配置ssh无秘登录 java jdk1.8 scala-2.11. ...
Spark学习：spark读取HBase数据报异常java.io.NotSerializableException
1.准备工作,安装好HABSE之后,执行Hbase shell create '表名称', '列名称1','列名称2','列名称N' create '表名称','列族名称' 在hbase中列是可以动态 ...
Spark学习笔记 --- Spark Streaming 与 Stom 比较
对比点 Storm

Spark学习之Spark RDD算子