Transformations on DStreams: using updateStateByKey to accumulate state

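Related post: Transformations on DStreams: using transform for blacklist filtering
https://blog.csdn.net/qq_43688472/article/details/86616864
That post only processes the data of the current batch. This is the so-called stateless approach: each batch is processed once, as it arrives, independently of the others.
Stateful: the data of the current batch has to be "accumulated" with the data of earlier batches.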
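The difference can be sketched in plain Scala, without Spark. This is only an illustration: the batches are made-up sample data, and `countBatch` and `mergeCounts` are hypothetical helper names, not part of any Spark API.

```scala
object StatelessVsStateful {
  // Made-up sample data: three micro-batches of words.
  val batches: Seq[Seq[String]] = Seq(
    Seq("a", "b", "a"),
    Seq("b", "c"),
    Seq("a")
  )

  // Stateless: count the words of a single batch and ignore everything seen before.
  def countBatch(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

  // Stateful: merge a batch's counts into the totals carried over from earlier batches.
  def mergeCounts(totals: Map[String, Int], batch: Seq[String]): Map[String, Int] =
    countBatch(batch).foldLeft(totals) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // One result per batch, each independent of the others.
  val perBatch: Seq[Map[String, Int]] = batches.map(countBatch)

  // One result per batch, each including all earlier batches.
  val running: Seq[Map[String, Int]] =
    batches.scanLeft(Map.empty[String, Int])(mergeCounts).tail

  def main(args: Array[String]): Unit = {
    println("stateless: " + perBatch)
    println("stateful:  " + running)
  }
}
```

In the stateless result the last batch only knows about its own single "a"; in the stateful result the last entry holds the totals over all three batches.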

For example: how many times a given piece of data appears between one point in time and another today.

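1. Build on the stateless approach: attach a timestamp to each batch's result, store it somewhere, and accumulate afterwards.
2. Do it directly, using updateStateByKey.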
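The first approach (timestamp, store, accumulate later) could look roughly like the sketch below. It is plain Scala with an in-memory buffer standing in for the external store; the `Row` type, `saveBatch`, `totalsBetween` and the sample timestamps are all assumptions made for illustration, not something the original setup defines.

```scala
import scala.collection.mutable.ArrayBuffer

object TimestampedCounts {
  // One stored row per (batch time, word); stands in for a table in an external store.
  final case class Row(batchTimeMs: Long, word: String, count: Int)

  // In-memory stand-in for the external store (an assumption of this sketch).
  val store: ArrayBuffer[Row] = ArrayBuffer.empty[Row]

  // Step 1: save the stateless per-batch counts together with the batch timestamp.
  def saveBatch(batchTimeMs: Long, counts: Map[String, Int]): Unit =
    counts.foreach { case (word, n) => store += Row(batchTimeMs, word, n) }

  // Step 2: accumulate afterwards by summing the rows whose time is in [fromMs, untilMs).
  def totalsBetween(fromMs: Long, untilMs: Long): Map[String, Int] =
    store
      .filter(r => r.batchTimeMs >= fromMs && r.batchTimeMs < untilMs)
      .groupBy(_.word)
      .map { case (word, rows) => (word, rows.map(_.count).sum) }

  def main(args: Array[String]): Unit = {
    saveBatch(0L, Map("a" -> 2, "b" -> 1))
    saveBatch(10000L, Map("b" -> 1, "c" -> 1))
    saveBatch(20000L, Map("a" -> 1))
    println(totalsBetween(0L, 20000L))
  }
}
```

The streaming job itself stays stateless here; the accumulation happens at query time, which is what makes "from time X to time Y" questions easy to answer.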
Official documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
updateStateByKey(func)
Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
In other words, the new values are accumulated onto the old state to produce the updated state.
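To make the quoted description concrete, here is a small Spark-free sketch of what happens to the state from one batch to the next. `step` is a hypothetical helper that only mimics the behaviour described in the Spark documentation (the update function is applied per key, also to keys that received no new values, and a `None` result drops the key); it is not how Spark implements it.

```scala
object UpdateStateSketch {
  // Same shape as a function passed to updateStateByKey:
  // the new values of one key plus that key's previous state.
  def updateFunction(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(newValues.sum + previous.getOrElse(0))

  // Hypothetical helper that mimics one batch interval.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    // Group the new values of this batch by key.
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (key, pairs) => (key, pairs.map(_._2)) }

    // Apply the update function to every key that has old state or new values;
    // a key whose update returns None would disappear from the result.
    (state.keySet ++ grouped.keySet).toSeq.flatMap { key =>
      updateFunction(grouped.getOrElse(key, Seq.empty[Int]), state.get(key))
        .map(value => key -> value)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterFirst = step(Map.empty[String, Int], Seq("a" -> 1, "b" -> 1, "a" -> 1))
    val afterSecond = step(afterFirst, Seq("b" -> 1))
    println(afterFirst)
    println(afterSecond)
  }
}
```

Note that "a" keeps its count in the second step even though the second batch contains no "a": its update function is called with an empty sequence of new values.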

Running it in IDEA

package g5.learning

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyApp {

  def main(args: Array[String]): Unit = {
    // Setup
    val conf = new SparkConf().setMaster("local[2]").setAppName("UpdateStateByKeyApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A checkpoint directory is required here: the computation is stateful,
    // so the old state has to be stored somewhere before it can be accumulated.
    ssc.checkpoint("hdfs://hadoop001:8020/ss/logs")

    // Business logic: split each line on commas and count every word
    val lines = ssc.socketTextStream("hadoop001", 9999)
    val results = lines.flatMap(_.split(",")).map((_, 1))
    val state = results.updateStateByKey[Int](updateFunction _)
    state.print()

    // Start streaming
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }

  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum
    val pre = runningCount.getOrElse(0)
    // add the new values with the previous running count to get the new count
    Some(newCount + pre)
  }
}

Problem:

You will notice that this produces a lot of small files on HDFS, under the checkpoint directory.