
Window Operations in Spark Streaming (blog category: spark)

Spark Streaming's window operations can be understood as periodically processing the data that arrived within a recent span of time.

Forgive my clumsy wording; here is the schematic instead, since a picture is worth a thousand words:

Sliding windows are widely used for monitoring and statistics, for example counting, every 2 seconds, the requests or errors seen in the last 3 seconds, and taking action based on that count (a sketch of this scenario appears after the parameter list below).

As shown in the figure:

1. The red rectangle is a window; the window holds the slice of the data stream from a span of time.

2. Each "time" in the figure is one time unit. In the official example, the window size is 3 time units, and the window slides every 2 time units.

So a window-based operation needs two parameters:

  • window length - The duration of the window (3 in the figure).
  • slide interval - The interval at which the window-based operation is performed (2 in the figure).
1. Window length: think of it as the container for a span of data.
2. Slide interval: the schedule on which the window fires, roughly like a cron expression. - -!
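To make the monitoring scenario above concrete (every 2 seconds, count the requests from the last 3 seconds), here is a hedged sketch. The names ssc and requests are assumptions: an existing StreamingContext with a 1-second batch interval and a DStream of request events; the checkpoint path is illustrative.

    // Both window length and slide interval must be multiples of the
    // batch interval (here assumed to be 1 second).
    // Checkpointing is required because countByWindow is implemented
    // with an incremental (inverse) reduce under the hood.
    ssc.checkpoint("/tmp/streaming-checkpoint")  // illustrative path
    val recentRequestCounts = requests.countByWindow(Seconds(3), Seconds(2))
    recentRequestCounts.print()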
Now a fuller example, the classic word count: every 10 seconds, count the data that arrived over the last 30 seconds.
    // Reduce last 30 seconds of data, every 10 seconds
    val windowedWordCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

Here, pairs is a DStream of (word, 1) tuples, produced by a map over the words.

reduceByKeyAndWindow is the windowed counterpart of reduceByKey on RDDs: it applies the given function per key. Here it groups by key and aggregates the values by summing them across all batches in the window.
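Putting it together, here is a minimal self-contained sketch of the windowed word count. It is a sketch under assumptions, not the original post's full code: the socket source on localhost:9999, the master local[2], and the app name are all illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
        // Batch interval of 10 seconds; the window length (30s) and
        // slide interval (10s) below are both multiples of it.
        val ssc = new StreamingContext(conf, Seconds(10))

        // Illustrative source: lines of text from a socket
        val lines = ssc.socketTextStream("localhost", 9999)
        val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

        // Every 10 seconds, reduce the last 30 seconds of data
        val windowedWordCounts =
          pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
        windowedWordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }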

Below I paste the related API, for reference:
window(windowLength, slideInterval) Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval) Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval) Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: by default, this uses Spark's default number of parallel tasks (2 in local mode, 8 on a cluster) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves the window. An example would be "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
countByValueAndWindow(windowLength, slideInterval, [numTasks]) When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
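To illustrate the incremental variant, here is a hedged sketch building on pairs and ssc from the word count sketch above. The checkpoint directory is illustrative; checkpointing is required because the inverse-function form keeps per-key state across windows.

    // Required for the invFunc form: it maintains windowed state
    ssc.checkpoint("/tmp/streaming-checkpoint")  // illustrative path

    val incrementalCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,  // reduce: fold in new data entering the window
      (a: Int, b: Int) => a - b,  // inverse reduce: subtract old data leaving the window
      Seconds(30),                // window length
      Seconds(10)                 // slide interval
    )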
   

Output Operations

When an output operator is called, it triggers the computation of a stream. Currently the following output operators are defined:

print() Prints the first ten elements of every batch of data in a DStream on the driver.
foreachRDD(func) The fundamental output operator. Applies a function, func, to each RDD generated from the stream. This function should have side effects, such as printing output, saving the RDD to external files, or writing it over the network to an external system.
saveAsObjectFiles(prefix, [suffix]) Save this DStream's contents as SequenceFiles of serialized objects. The file name at each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsTextFiles(prefix, [suffix]) Save this DStream's contents as text files. The file name at each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]) Save this DStream's contents as Hadoop files. The file name at each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
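As a small sketch of foreachRDD, reusing windowedWordCounts from the earlier example (the take(10) sample size is an illustrative choice):

    // Runs on the driver for every batch; rdd.take pulls a small
    // sample of each batch back to the driver before printing it.
    windowedWordCounts.foreachRDD { (rdd, time) =>
      val sample = rdd.take(10)
      println(s"Batch at $time: ${rdd.count()} records, sample: ${sample.mkString(", ")}")
    }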

Original content; please credit the source when reposting: http://blog.csdn.net/oopsoom/article/details/23776477

Reposted from: https://my.oschina.net/xiaominmin/blog/1599578
