This lesson is divided into two parts:

1. Spark Streaming pulling from Flume: hands-on practice

2. Spark Streaming pulling from Flume: source code analysis

First, a brief introduction to Flume's two integration modes: push mode (Flume pushes to Spark Streaming) and pull mode (Spark Streaming pulls from Flume).

Push mode: Flume acts as a buffer that holds the data; it targets the configured host and port and, as long as the service there is reachable, pushes the data over. This is simple and loosely coupled, but if the Spark Streaming application is not running the Flume side throws errors, and Spark Streaming can end up unable to keep pace with the incoming data.

Pull mode: a custom sink (the Spark sink) is deployed on the Flume side, and Spark Streaming fetches the data from the channel itself, pulling at whatever rate its own resources allow, which gives better stability.
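The two modes correspond to two different FlumeUtils entry points. As a rough Scala sketch (the host name SparkMaster and port 9999 are simply the values used later in this lesson, and ssc is an already-created StreamingContext):

import org.apache.spark.streaming.flume.FlumeUtils

// Push mode: Spark Streaming runs an Avro receiver and Flume's avro sink pushes events to it.
val pushedStream = FlumeUtils.createStream(ssc, "SparkMaster", 9999)

// Pull mode: Flume runs org.apache.spark.streaming.flume.sink.SparkSink and
// Spark Streaming polls that sink for batches at its own pace.
val pulledStream = FlumeUtils.createPollingStream(ssc, "SparkMaster", 9999)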


Hands-on with Flume pull mode:

Step 1: Install Flume. The installation is not covered in this lesson; see Lesson 87 (Flume pushing data to Spark Streaming: case study and source code internals).

Step 2: Configure Flume. Following the official guide (http://spark.apache.org/docs/latest/streaming-flume-integration.html), either add the required dependencies or download the following three jars directly and place them in the lib directory of the Flume installation:

spark-streaming-flume-sink_2.10-1.6.0.jar, scala-library-2.10.5.jar, commons-lang3-3.3.2.jar
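If you prefer to let the build tool resolve these instead of copying jars by hand, the Maven coordinates are roughly as follows (a build.sbt sketch assuming the same Spark 1.6.0 / Scala 2.10 combination; the driver application additionally needs spark-streaming-flume for FlumeUtils, while the three jars above only need to go into Flume's lib directory):

// build.sbt sketch for the Spark Streaming application
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" % "provided",
  "org.apache.spark" % "spark-streaming-flume_2.10" % "1.6.0"
)
// Coordinates of the jars that go into Flume's lib directory:
//   org.apache.spark:spark-streaming-flume-sink_2.10:1.6.0
//   org.scala-lang:scala-library:2.10.5
//   org.apache.commons:commons-lang3:3.3.2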

Step 3: Configure the Flume agent: copy flume-conf.properties.template to flume-conf.properties and edit it as follows:

# Flume pull mode
agent0.sources = source1
agent0.channels = memoryChannel
agent0.sinks = sink1

# Configure source1 (spooling directory source)
agent0.sources.source1.type = spooldir
agent0.sources.source1.spoolDir = /home/hadoop/flume/tmp/TestDir
agent0.sources.source1.channels = memoryChannel
agent0.sources.source1.fileHeader = false
agent0.sources.source1.interceptors = il
agent0.sources.source1.interceptors.il.type = timestamp

# Configure sink1 (the Spark sink that Spark Streaming will poll)
agent0.sinks.sink1.type = org.apache.spark.streaming.flume.sink.SparkSink
agent0.sinks.sink1.hostname = SparkMaster
agent0.sinks.sink1.port = 9999
agent0.sinks.sink1.channel = memoryChannel

# Configure the channel
agent0.channels.memoryChannel.type = file
agent0.channels.memoryChannel.checkpointDir = /home/hadoop/flume/tmp/checkpoint
agent0.channels.memoryChannel.dataDirs = /home/hadoop/flume/tmp/dataDir
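Note that although the channel is named memoryChannel, its type is set to file, so events are buffered durably on disk under dataDirs with transaction checkpoints under checkpointDir; this is what lets the Spark Streaming side pull at its own pace without data loss when it temporarily falls behind.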

Command to start Flume:

root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console

Or: root@SparkMaster:~/flume/flume-1.6.0# flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console

Step 4: Write a simple application (Java version):

package com.dt.spark.SparkApps.sparkstreaming;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

import scala.Tuple2;

public class SparkStreamingPullDataFromFlume {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("spark://SparkMaster:7077");
        conf.setAppName("SparkStreamingPullDataFromFlume");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Pull data from the Spark sink running inside the Flume agent
        JavaReceiverInputDStream<SparkFlumeEvent> lines =
                FlumeUtils.createPollingStream(jsc, "SparkMaster", 9999);

        // Split each event body into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<SparkFlumeEvent, String>() {
            public Iterable<String> call(SparkFlumeEvent event) throws Exception {
                // getBody() returns a ByteBuffer, so decode it rather than calling toString() on it
                String line = StandardCharsets.UTF_8.decode(event.event().getBody()).toString();
                return Arrays.asList(line.split(" "));
            }
        });

        // Map each word to a (word, 1) pair
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Reduce by key to merge the counts of identical words
        JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        wordsCount.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

Package the program into a jar file and upload it to the Spark cluster.
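(For example, with an sbt project you can run sbt package, or mvn package for a Maven project, and then copy the resulting jar to /home/hadoop/spark/ on the machine you submit from, which is the path used in the spark-submit command below; the build tool itself is not prescribed by this lesson.)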

Step 5: Start HDFS, the Spark cluster, and Flume.

Start Flume: root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console

Step 6: Upload test files into the /home/hadoop/flume/tmp/TestDir directory and watch how Flume's log output changes.
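For example, copying any plain text file into the spooling directory is enough to trigger ingestion, e.g. cp /home/hadoop/test.txt /home/hadoop/flume/tmp/TestDir/ (the file name here is just a placeholder); once the spooldir source has consumed a file it renames it with a .COMPLETED suffix, which is a quick way to confirm that Flume picked it up.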

Step 7: Run the program with spark-submit:

./spark-submit --class com.dt.spark.SparkApps.sparkstreaming.SparkStreamingPullDataFromFlume --name SparkStreamingPullDataFromFlume /home/hadoop/spark/SparkStreamingPullDataFromFlume.jar

Then check the output every 30 seconds (the batch interval set in the code).

Part 2: Source code analysis

1. Creating the stream: createPollingStream (FlumeUtils.scala)

Note: the default storage level is MEMORY_AND_DISK_SER_2.

/**
 * Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
 * This stream will poll the sink for data and will pull events as they are available.
 * This stream will use a batch size of 1000 events and run 5 threads to pull data.
 * @param hostname Address of the host on which the Spark Sink is running
 * @param port Port of the host at which the Spark Sink is listening
 * @param storageLevel Storage level to use for storing the received objects
 */
def createPollingStream(
    ssc: StreamingContext,
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, Seq(new InetSocketAddress(hostname, port)), storageLevel)
}
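Because storageLevel has a default value, the caller only passes it to override the default; a minimal Scala sketch (host and port taken from the Flume configuration above, ssc assumed to exist):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Keep a single serialized replica instead of the default two (MEMORY_AND_DISK_SER_2)
val events = FlumeUtils.createPollingStream(ssc, "SparkMaster", 9999, StorageLevel.MEMORY_AND_DISK_SER)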

2. Parameter configuration: the default global parameters are declared private, so they cannot be modified from user code (the sketch after the next overload shows how to pass explicit values instead):

private val DEFAULT_POLLING_PARALLELISM = 5
private val DEFAULT_POLLING_BATCH_SIZE = 1000

/**
 * Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
 * This stream will poll the sink for data and will pull events as they are available.
 * This stream will use a batch size of 1000 events and run 5 threads to pull data.
 * @param addresses List of InetSocketAddresses representing the hosts to connect to.
 * @param storageLevel Storage level to use for storing the received objects
 */
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, addresses, storageLevel,
    DEFAULT_POLLING_BATCH_SIZE, DEFAULT_POLLING_PARALLELISM)
}

3. Creating the FlumePollingInputDStream object:

/**
 * Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
 * This stream will poll the sink for data and will pull events as they are available.
 * @param addresses List of InetSocketAddresses representing the hosts to connect to.
 * @param maxBatchSize Maximum number of events to be pulled from the Spark sink in a
 *                     single RPC call
 * @param parallelism Number of concurrent requests this stream should send to the sink. Note
 *                    that having a higher number of requests concurrently being pulled will
 *                    result in this stream using more threads
 * @param storageLevel Storage level to use for storing the received objects
 */
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel,
    maxBatchSize: Int,
    parallelism: Int
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  new FlumePollingInputDStream[SparkFlumeEvent](ssc, addresses, maxBatchSize,
    parallelism, storageLevel)
}
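Since DEFAULT_POLLING_BATCH_SIZE and DEFAULT_POLLING_PARALLELISM are private, the way to use different values is to call this overload and pass maxBatchSize and parallelism explicitly; a sketch with values chosen purely for illustration:

import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

val addresses = Seq(new InetSocketAddress("SparkMaster", 9999))
val events = FlumeUtils.createPollingStream(
  ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2,
  maxBatchSize = 500,   // pull at most 500 events per RPC call instead of 1000
  parallelism = 10)     // run 10 fetcher threads instead of 5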

4. FlumePollingInputDStream extends ReceiverInputDStream and overrides the getReceiver method to return a FlumePollingReceiver:

private[streaming] class FlumePollingInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    val addresses: Seq[InetSocketAddress],
    val maxBatchSize: Int,
    val parallelism: Int,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[SparkFlumeEvent](_ssc) {

  override def getReceiver(): Receiver[SparkFlumeEvent] = {
    new FlumePollingReceiver(addresses, maxBatchSize, parallelism, storageLevel)
  }
}

5. FlumePollingReceiver (the receiver returned above) builds a thread pool whose threads are daemon (background) threads, using lazy vals and a thread factory to create both the threads and the NioClientSocketChannelFactory (which is backed by Netty under the hood):

lazy val channelFactoryExecutor =
  Executors.newCachedThreadPool(new ThreadFactoryBuilder().setDaemon(true).
    setNameFormat("Flume Receiver Channel Thread - %d").build())

lazy val channelFactory =
  new NioClientSocketChannelFactory(channelFactoryExecutor, channelFactoryExecutor)

6. receiverExecutor is also backed by a thread pool; connections is a queue of FlumeConnection handles, one per Flume agent in the (possibly distributed) Flume deployment, and the worker threads take a handle from this queue to fetch data:

lazy val receiverExecutor = Executors.newFixedThreadPool(parallelism,
  new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Flume Receiver Thread - %d").build())

private lazy val connections = new LinkedBlockingQueue[FlumeConnection]()

7. On start-up the receiver creates a NettyTransceiver for each configured address, then loops over the degree of parallelism (5 by default) and submits one FlumeBatchFetcher per slot:

override def onStart(): Unit = {
  // Create the connections to each Flume agent.
  addresses.foreach(host => {
    val transceiver = new NettyTransceiver(host, channelFactory)
    val client = SpecificRequestor.getClient(classOf[SparkFlumeProtocol.Callback], transceiver)
    connections.add(new FlumeConnection(transceiver, client))
  })
  for (i <- 0 until parallelism) {
    logInfo("Starting Flume Polling Receiver worker threads..")
    // Threads that pull data from Flume.
    receiverExecutor.submit(new FlumeBatchFetcher(this))
  }
}
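Because onStart opens one connection per configured address, a single polling stream can fan out over several Spark sinks, with the FlumeBatchFetcher threads sharing those connections through the queue; a sketch of the caller side (the second host name is hypothetical):

import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// One FlumeConnection per address is created in onStart and placed in the connection queue.
val sinks = Seq(
  new InetSocketAddress("SparkMaster", 9999),
  new InetSocketAddress("SparkWorker1", 9999))  // hypothetical second Flume agent
val events = FlumeUtils.createPollingStream(ssc, sinks, StorageLevel.MEMORY_AND_DISK_SER_2)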

8. In FlumeBatchFetcher's run method, the fetcher takes a connection handle from the receiver and pulls a batch from the sink; the ack/nack calls are the message-acknowledgement mechanism: if Spark stores the batch successfully the fetcher sends an ack so the SparkSink can commit the underlying Flume transaction, otherwise it sends a nack so the transaction is rolled back and the batch can be redelivered:

def run(): Unit = {
  while (!receiver.isStopped()) {
    val connection = receiver.getConnections.poll()
    val client = connection.client
    var batchReceived = false
    var seq: CharSequence = null
    try {
      getBatch(client) match {
        case Some(eventBatch) =>
          batchReceived = true
          seq = eventBatch.getSequenceNumber
          val events = toSparkFlumeEvents(eventBatch.getEvents)
          if (store(events)) {
            sendAck(client, seq)
          } else {
            sendNack(batchReceived, client, seq)
          }
        case None =>
      }
    } catch {
      // ... (exception handling and the rest of the method elided in this excerpt)

9. The method that fetches data one batch at a time:

/**
 * Gets a batch of events from the specified client. This method does not handle any exceptions
 * which will be propogated to the caller.
 * @param client Client to get events from
 * @return [[Some]] which contains the event batch if Flume sent any events back, else [[None]]
 */
private def getBatch(client: SparkFlumeProtocol.Callback): Option[EventBatch] = {
  val eventBatch = client.getEventBatch(receiver.getMaxBatchSize)
  if (!SparkSinkUtils.isErrorBatch(eventBatch)) {
    // No error, proceed with processing data
    logDebug(s"Received batch of ${eventBatch.getEvents.size} events with sequence " +
      s"number: ${eventBatch.getSequenceNumber}")
    Some(eventBatch)
  } else {
    logWarning("Did not receive events from Flume agent due to error on the Flume agent: " +
      eventBatch.getErrorMsg)
    None
  }
}

Notes:

Source material: DT_大数据梦工厂 (DT Big Data Dream Factory).

For more exclusive content, follow the WeChat public account DT_Spark. If you are interested in big data and Spark, you can also join Wang Jialin's free Spark public lectures, held every evening at 20:00 in YY room 68917580.

Reposted from: https://blog.51cto.com/18610086859/1773079
