Original article: http://www.javacodegeeks.com/2015/02/streaming-big-data-storm-spark-samza.html

There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences.

Apache Storm

In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline.
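To make the spout and bolt roles concrete, here is a minimal sketch in plain Python. It does not use Storm's API: the function names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are hypothetical, and generators stand in for the streams that Storm would distribute across worker nodes.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical names throughout: this models the spout/bolt idea in a single
# process and is not Storm's API.


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: transforms each sentence tuple into one tuple per lowercase word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count per word and emits the updated count."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Wire spout -> split bolt -> count bolt and collect the latest counts."""
    final: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        final[word] = count
    return final
```

Calling `run_topology(["the cat", "The dog"])` walks each sentence through both bolts in turn. In a real cluster each stage would run as many parallel tasks, which this single-process sketch leaves out.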

[Figure: Storm architecture]

Apache Spark

Spark Streaming (an extension of the core Spark API) doesn’t process events one at a time like Storm. Instead, it slices the stream into small batches by time interval before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a sequence of micro-batches, each represented as an RDD (Resilient Distributed Dataset). RDDs are distributed collections that can be operated on in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations).
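The micro-batching idea can be sketched without Spark at all. The snippet below is an illustration in plain Python, not the Spark Streaming API: `discretize` and `windowed_counts` are hypothetical helpers, events are assumed to carry non-negative timestamps in seconds, and plain lists stand in for RDDs.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload); assumed non-negative


def discretize(events: Iterable[Event], batch_interval: float) -> List[List[str]]:
    """Group events into consecutive batches of `batch_interval` seconds.

    Batch i holds the events with i * interval <= timestamp < (i + 1) * interval.
    Intervals with no events produce an empty batch.
    """
    batches: Dict[int, List[str]] = {}
    for timestamp, payload in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(payload)
    if not batches:
        return []
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def windowed_counts(batches: List[List[str]], window: int, slide: int) -> List[int]:
    """Count events over a sliding window of `window` batches, moved `slide` batches at a time."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        results.append(sum(len(batch) for batch in batches[end - window:end]))
    return results
```

For example, with a one-second batch interval, events at 0.2 s and 0.7 s land in the same batch, and a window of two batches sliding by one batch counts each batch twice (once per window it falls in).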

[Figure: Spark architecture]

Apache Samza

Samza’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a DStream, but a message. Streams are divided into partitions, and each partition is an ordered sequence of read-only messages, each with a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza’s execution and streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka.
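The partition and offset model described above can be illustrated with a small in-memory sketch. This is not Samza's or Kafka's API: `PartitionedStream` is a hypothetical class, and choosing the partition by hashing a message key is an assumption made for the example.

```python
import zlib
from typing import List, Tuple


class PartitionedStream:
    """Toy stream: each partition is an append-only list, and a message's
    offset is its index within that partition."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, message: str) -> Tuple[int, int]:
        """Append a message and return its (partition, offset).

        Assumption: the partition is picked by a stable hash of the key, so
        messages with the same key stay ordered within one partition.
        """
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int, max_messages: int = 1) -> List[Tuple[int, str]]:
        """Read up to `max_messages` messages in order, starting at `offset`."""
        log = self.partitions[partition]
        end = min(offset + max_messages, len(log))
        return [(i, log[i]) for i in range(offset, end)]
```

Reading with `max_messages=1` corresponds to one-at-a-time processing, while a larger value models consuming several messages from the same partition in sequence. A consumer only needs to remember the last offset it processed per partition to resume where it left off.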

[Figure: Samza]

Common Ground

All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations.

The three frameworks use different vocabularies for similar concepts:

[Figure: Apache concepts]

Comparison Matrix

A few of the differences are summarized in the table below:

[Figure: comparison table]

There are three general categories of delivery patterns:

  1. At-most-once: messages may be lost. This is usually the least desirable outcome.
  2. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases.
  3. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases.
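The difference between at-least-once and exactly-once can be sketched as follows. This is a simplified simulation, not how any of the three frameworks implements its guarantees: `deliver_at_least_once` and `DedupingConsumer` are hypothetical names, and deduplicating by message ID is just one common way to get exactly-once effects on top of at-least-once delivery.

```python
from typing import List, Set, Tuple

Message = Tuple[int, str]  # (message id, payload)


def deliver_at_least_once(messages: List[Message], lost_acks: Set[int]) -> List[Message]:
    """Simulate a sender that redelivers any message whose ack was lost.

    Nothing is dropped, but messages whose id is in `lost_acks` arrive twice.
    """
    delivered: List[Message] = []
    for message in messages:
        delivered.append(message)
        if message[0] in lost_acks:
            delivered.append(message)  # ack never arrived, so the sender retries
    return delivered


class DedupingConsumer:
    """Applies each message id only once, so redeliveries have no extra effect."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()
        self.applied: List[str] = []

    def handle(self, message: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        message_id, payload = message
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        self.applied.append(payload)
        return True
```

Feeding the redelivered stream through `DedupingConsumer` leaves each payload applied once. In a real system the set of seen IDs would itself have to survive failures, which is part of why exactly-once is hard to guarantee in all cases.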

Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident.

Use Cases

All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines.

If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching.

A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel...

Speaking of micro-batching, if you must have stateful computations and exactly-once delivery, and don’t mind a higher latency, you could consider Spark Streaming, especially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real time.

[Figure: Spark stack]

A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu…

If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing you to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza’s fine-grained jobs would be particularly well-suited, since they can be added or removed with minimal ripple effects.

A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale…

Conclusion

We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.

Reposted from: https://www.cnblogs.com/davidwang456/p/4892213.html
