Definition

For the definition, see the RDD API documentation:

aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U and one operation for merging two U’s, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
zeroValue
the initial value for the accumulated result of each partition for the seqOp operator, and also the initial value for the combine results from different partitions for the combOp operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
seqOp
an operator used to accumulate results within a partition
combOp
an associative operator used to combine results from different partitions
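The two-stage behavior described above can be sketched on plain Scala collections, with no Spark required. This is only a local model under stated assumptions: aggregateLocal is a hypothetical helper (not part of the Spark API), each inner Seq stands in for one partition, and the partial results are merged in partition order, which Spark does not guarantee. The example uses a result type U = (Int, Int) that differs from the element type T = Int, accumulating a sum and a count in one pass.

```scala
// Local model of RDD.aggregate on plain Scala collections (no Spark needed).
// `aggregateLocal` is a hypothetical helper, not part of the Spark API.
def aggregateLocal[T, U](partitions: Seq[Seq[T]])(zeroValue: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  // Stage 1: fold each partition on its own, starting from zeroValue.
  val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
  // Stage 2: merge the per-partition results, again starting from zeroValue.
  perPartition.foldLeft(zeroValue)(combOp)
}

// Three stand-in partitions holding the numbers 1 to 10.
val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

// T = Int, U = (Int, Int): accumulate (sum, count) in a single pass.
val sumCount = aggregateLocal(parts)((0, 0))(
  (acc: (Int, Int), x: Int) => (acc._1 + x, acc._2 + 1),
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))

// sumCount == (55, 10), so the mean is 5.5
val mean = sumCount._1.toDouble / sumCount._2
```

Because (0, 0) is neutral for both operations, seeding every partition and the final merge with it does not change the result.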

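Experiment 1: basic usage

The API documentation is fairly clear: aggregate first aggregates the elements within each partition, then merges the per-partition results using the combine function and zeroValue. We supply two functions, seqOp and combOp.

Program

Open spark-shell and run Experiment 1 (when copying and pasting the code below, remove the comments first).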

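// Tags each element with the index of the partition it belongs to
def myfunc[T](index: Int, iter: Iterator[T]): Iterator[(Int, T)] = {
  var res = List[(Int, T)]()
  for (x <- iter)
    res ::= ((index, x)) // prepend, so each partition's elements come out reversed
  res.iterator
}
val data = sc.parallelize(1 to 10, 3)
// Show which partition each element landed in
data.mapPartitionsWithIndex(myfunc).collect
// seqOp keeps the maximum within a partition; combOp sums the per-partition results
data.aggregate(0)((a, b) => if (a > b) a else b, _ + _)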

Results

Analysis
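The computation in Experiment 1 can be traced by hand on plain Scala collections. The sketch below assumes that sc.parallelize(1 to 10, 3) places 1 to 3, 4 to 6 and 7 to 10 in the three partitions; the mapPartitionsWithIndex call in the program is the way to confirm the actual layout. It also merges the partial results in partition order, which is a simplification, although for addition the order does not matter.

```scala
// Assumed layout of sc.parallelize(1 to 10, 3); verify with mapPartitionsWithIndex.
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))
val zeroValue = 0

// seqOp from the experiment: keep the larger value, starting from zeroValue.
val partitionMaxima =
  partitions.map(_.foldLeft(zeroValue)((a, b) => if (a > b) a else b))
// partitionMaxima == Seq(3, 6, 10)

// combOp from the experiment: add up the partial results, again from zeroValue.
val result = partitionMaxima.foldLeft(zeroValue)(_ + _)
// result == 0 + 3 + 6 + 10 == 19
```

Under that assumed layout the answer is 19, the sum of the per-partition maxima, rather than the global maximum 10.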

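Experiment 2: zeroValue

According to the API documentation, zeroValue is the initial value for seqOp within each partition and also the initial value for combOp when the per-partition results are merged.

Program

Open spark-shell and run Experiment 2 (when copying and pasting the code below, remove the comments first).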

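// seqOp: keeps the larger of its two arguments and logs each call
def seqOp(arg1: Int, arg2: Int): Int = {
  var res: Int = arg2
  if (arg1 > arg2)
    res = arg1
  println("seqOp:" + arg1 + "," + arg2 + "=>" + res)
  res
}
// combOp: adds its two arguments and logs each call
def combOp(arg1: Int, arg2: Int): Int = {
  println("combOp:" + arg1 + "," + arg2 + "=>" + (arg1 + arg2))
  arg1 + arg2
}
// Tags each element with the index of the partition it belongs to
def myfunc[T](index: Int, iter: Iterator[T]): Iterator[(Int, T)] = {
  var res = List[(Int, T)]()
  for (x <- iter)
    res ::= ((index, x)) // prepend, so each partition's elements come out reversed
  res.iterator
}
val data = sc.parallelize(1 to 10, 3)
data.mapPartitionsWithIndex(myfunc).collect
// zeroValue 11 is larger than every element, so it dominates each partition
data.aggregate(11)(seqOp, combOp)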

Results

Analysis

Admittedly, the zeroValue used in this experiment is rather extreme; try replacing it with 5 or 6.
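To see how zeroValue changes the outcome, the same max-then-sum computation can be modeled locally for several starting values. This is a sketch under assumptions: the partition layout (1 to 3, 4 to 6, 7 to 10) is assumed rather than read from Spark, and maxThenSum is a hypothetical helper that seeds both the per-partition fold and the final merge with zeroValue.

```scala
// Assumed layout of sc.parallelize(1 to 10, 3).
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

// Hypothetical helper: max within each partition, then sum across partitions.
// zeroValue seeds every partition fold and the final merge.
def maxThenSum(zeroValue: Int): Int = {
  val maxima = partitions.map(_.foldLeft(zeroValue)((a, b) => math.max(a, b)))
  maxima.foldLeft(zeroValue)(_ + _)
}

val with11 = maxThenSum(11) // 11 + (11 + 11 + 11) = 44
val with5  = maxThenSum(5)  // 5 + (5 + 6 + 10)    = 26
val with6  = maxThenSum(6)  // 6 + (6 + 6 + 10)    = 28
val with0  = maxThenSum(0)  // 0 + (3 + 6 + 10)    = 19
```

With three partitions zeroValue enters the computation four times, once per partition and once in the final merge, which is why a value that is not neutral for both operators skews the result.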

Reference blogs:
[1]: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#aggregate
[2]: http://www.iteblog.com/archives/1268
