Spark Accumulators

What Is an Accumulator

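Accumulators aggregate information from tasks back to the driver.
1 When an operator runs, it does not change the value of variables defined in the driver. (Spark's mechanisms for sharing driver-side data with tasks, broadcast variables and accumulators, are called shared variables.)
2 What an operator actually works with is a copy of the driver's variable.
3 To change a driver variable, the data normally has to be collected to the driver first.
4 Besides collecting, the accumulators Spark provides can also update a variable on the driver.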

Why Are Accumulators Needed?

When an operator runs, it does not change the value of variables defined in the driver; each task only sees its own copy.

import org.apache.spark.{SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("count").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // sum lives on the driver
    var sum = 0
    // The operator runs inside executors on the workers
    arr.foreach(x => {
      // This sum is a copy shipped from the driver with initial value 0;
      // it is incremented on the worker, never on the driver's sum
      sum += x
    })
    // Prints the driver's own sum, so the result is 0
    println(sum) // 0
  }
}

Summing without an accumulator

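import org.apache.spark.{SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("count").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // sum lives on the driver
    var sum = 0
    // collect() brings every element to the driver, so this foreach is a
    // plain Scala loop running on the driver and updates the driver's sum
    arr.collect().foreach(x => {
      sum += x
    })
    println(sum) // 36
  }
}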

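Legacy accumulator

The legacy accumulator (the Spark 1.x API, deprecated since Spark 2.0) can handle operations such as summation.
SparkContext has an accumulator method.
Pass an initial value when calling it.
To accumulate, call the accumulator's add method.
To read the accumulated value, call the accumulator's value method.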

import org.apache.spark.{Accumulator, SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("count").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // Use the legacy accumulator, starting from the initial value 0
    val myacc: Accumulator[Int] = sc.accumulator(0)
    arr.foreach(x => {
      myacc.add(x)
    })
    println(myacc.value) // 36
  }
}

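Newer accumulator: AccumulatorV2

AccumulatorV2 itself is an abstract class.
Ready-made subclasses include CollectionAccumulator, DoubleAccumulator, and LongAccumulator.
To use one, create an instance of a subclass and register it with the SparkContext.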

import org.apache.spark.util.LongAccumulator
import org.apache.spark.{SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("sum").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // Create the accumulator object.
    // LongAccumulator: an accumulator for computing sum, count, and average of 64-bit integers.
    val myAcc = new LongAccumulator()
    // Register the accumulator with the context under the name "sum"
    sc.register(myAcc, "sum")
    arr.foreach(x => {
      myAcc.add(x)
    })
    println(myAcc.value) // 36
  }
}
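
As a shorter alternative to constructing and registering the accumulator by hand, SparkContext (Spark 2.0 and later) also has factory methods such as longAccumulator, doubleAccumulator, and collectionAccumulator that create and register the accumulator in one call. The following is a minimal sketch of that route, not part of the original example: the object name BuiltInAccumulatorDemo, the helper run, and the app name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; only the SparkContext and LongAccumulator calls are Spark API.
object BuiltInAccumulatorDemo {
  /** Returns (sum, count) as tracked by a built-in LongAccumulator. */
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("builtin-acc").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
      // Creates the accumulator and registers it under the name "sum" in one step
      val acc = sc.longAccumulator("sum")
      arr.foreach(x => acc.add(x))
      // LongAccumulator keeps both the running sum and the number of add calls
      (acc.sum, acc.count)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (sum, count) = run()
    println(s"sum=$sum count=$count") // sum=36 count=8
  }
}
```

Besides sum and count, LongAccumulator also exposes avg, which is why the class comment describes it as computing sum, count, and average.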

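Custom accumulators

1 Extend AccumulatorV2
2 Specify the type parameters: the first is the type of the input data, the second is the type of the output
3 Define a member variable that holds the accumulated state, and override isZero, copy, reset, add, merge, and value around it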

import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("sum").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // Create the custom accumulator object
    val myAcc = new SumAccumulator
    // Register the accumulator with the context under the name "sum"
    sc.register(myAcc, "sum")
    arr.foreach(x => {
      myAcc.add(x)
    })
    println(myAcc.value) // 36
  }
}

class SumAccumulator extends AccumulatorV2[Long, Long] {
  // Holds the accumulated result
  var sum: Long = 0

  // Whether the accumulator is empty; true means empty
  override def isZero: Boolean = {
    // A sum of 0 is treated as "nothing accumulated yet"
    sum == 0
  }

  // Creates a new accumulator object carrying the current value;
  // Spark uses copies when shipping the accumulator to tasks
  override def copy(): AccumulatorV2[Long, Long] = {
    val other = new SumAccumulator
    // Assign this accumulator's value to the new accumulator
    other.sum = this.sum
    other
  }

  // Resets the accumulator to its initial value
  override def reset(): Unit = {
    sum = 0
  }

  // Adds an incoming element to the accumulator's value
  override def add(v: Long): Unit = {
    sum += v
  }

  // Merges another accumulator's value into this one (pairwise merge)
  override def merge(other: AccumulatorV2[Long, Long]): Unit = {
    sum += other.value
  }

  override def value: Long = {
    sum
  }
}

Counting words with a custom accumulator

import org.apache.spark.rdd.RDD
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

// Count words with an accumulator
object _TestAcc {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("wordcount").setMaster("local")
    val sc = new SparkContext(conf)
    val words: RDD[String] = sc.parallelize(Array("hello", "word", "hello", "word", "kitty", "word"))
    val myAcc = new WordCountAccumulator
    sc.register(myAcc)
    words.foreach(myAcc.add)
    // Prints one (word, count) pair per line; HashMap iteration order is not fixed
    for (elem <- myAcc.value) {
      println(elem)
    }
  }
}

class WordCountAccumulator extends AccumulatorV2[String, mutable.HashMap[String, Int]] {
  // The member variable that holds the state
  var map = new mutable.HashMap[String, Int]()

  override def isZero: Boolean = {
    map.isEmpty
  }

  override def copy(): AccumulatorV2[String, mutable.HashMap[String, Int]] = {
    val newAcc = new WordCountAccumulator
    // Clone the map so the copy does not share mutable state with this
    // accumulator; with a shared reference, resetting the copy would also
    // clear the original
    newAcc.map = this.map.clone()
    newAcc
  }

  override def reset(): Unit = {
    map.clear()
  }

  override def add(v: String): Unit = {
    // Accumulation within a partition: if the word is not in the map yet, store 1;
    // otherwise take the current count and add 1.
    // map.get yields only two cases: None or Some(value)
    map.get(v) match {
      case None => map.put(v, 1)
      case Some(x) => map.put(v, x + 1)
    }
  }

  override def merge(other: AccumulatorV2[String, mutable.HashMap[String, Int]]): Unit = {
    // When merging two accumulators, add the counts of words present in both;
    // words present only in other are inserted with their count unchanged
    for (elem <- other.value) {
      // elem is one (word, count) pair from other;
      // check whether this map already contains the word
      map.get(elem._1) match {
        case Some(e) => map.put(elem._1, e + elem._2)
        case None => map.put(elem._1, elem._2)
      }
    }
  }

  override def value: mutable.HashMap[String, Int] = {
    map
  }
}

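Notes

1. Creating and using an accumulator:

1.1. Create an accumulator instance
1.2. Register it with sc.register()
1.3. Add data with <instance name>.add
1.4. Read the accumulated value with <instance name>.value

2. Prefer not to update accumulators inside transformations. Because of lineage, a transformation may be executed more than once (for example when a task or stage is re-run, or when an uncached RDD feeds several actions), and every execution applies the update again. Update accumulators inside actions, where Spark applies each task's update only once.

3. Because the results are ultimately reported back to the driver, keep an eye on the size of the accumulated data.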
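
The over-counting described in note 2 can be reproduced with a small sketch. It assumes a local master with no task retries; the object name TransformationAccumulatorDemo and the accumulator name "seen" are made up for illustration. The accumulator is updated inside map, and the resulting RDD is not cached, so each action recomputes the map and applies the updates again.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object showing an accumulator updated inside a transformation.
object TransformationAccumulatorDemo {
  /** Returns the accumulator value after the first and after the second action. */
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("acc-in-map").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val seen = sc.longAccumulator("seen")
      val mapped = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2).map { x =>
        seen.add(1L) // update inside a transformation
        x * 2
      }
      mapped.count() // first action: the map runs once per element
      val afterFirst = seen.sum
      mapped.count() // mapped is not cached, so the map runs again
      val afterSecond = seen.sum
      (afterFirst, afterSecond)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(run()) // (8,16): the second action counted every element again
  }
}
```

Calling mapped.cache() before the first action would normally avoid the recomputation in this sketch, but moving the update into an action such as foreach is the more dependable fix.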

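Uses:

1. Collecting exact statistics about the data, for example:
counting the records that match a userID, or how many purchases occurred within a given time window; an ETL job can clean the data while Accumulators gather the statistics

2. As a debugging tool for observing per-task information: named accumulators are shown in the Spark UI, where you can see how many records each task processed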
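
Both uses can be sketched in one small job. The example below is only an illustration under stated assumptions: the "userId,amount" record format, the sample data, the validity rule, and the names EtlCounterDemo, validRecords, and invalidRecords are all hypothetical. It counts well-formed and malformed records in a foreach action, and because the accumulators are named, they are the kind that the Spark UI lists while the application is running.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical ETL-style check: count valid and invalid records with named accumulators.
object EtlCounterDemo {
  /** Returns (valid record count, invalid record count). */
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("etl-counters").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Made-up raw records in "userId,amount" form; two of them are malformed
      val raw = sc.parallelize(Seq("u1,10", "u2,abc", "u1,5", "broken", "u3,7"), 2)
      val valid = sc.longAccumulator("validRecords")
      val invalid = sc.longAccumulator("invalidRecords")
      // foreach is an action, so each task's updates are applied once
      raw.foreach { line =>
        line.split(",") match {
          case Array(_, amount) if amount.nonEmpty && amount.forall(_.isDigit) =>
            valid.add(1L)
          case _ =>
            invalid.add(1L)
        }
      }
      (valid.sum, invalid.sum)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (ok, bad) = run()
    println(s"valid=$ok invalid=$bad") // valid=3 invalid=2
  }
}
```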