Flink's ProcessFunction API

1 ProcessFunction

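ProcessFunction is a low-level stream processing operation. It gives access to events (stream elements), state (fault-tolerant and consistent, only on keyed streams), and timers (event time and processing time, only on keyed streams). In other words, it can read event timestamps and watermarks, which ordinary transformation operators cannot.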

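A ProcessFunction can be thought of as a FlatMapFunction with access to keyed state and timers. It handles events by being invoked once for each event received on its input stream.
① Keyed state is accessed through the RuntimeContext.
② Timers let the application react to changes in processing time and event time. Every call to processElement(…) receives a Context object that gives access to the element's event-time timestamp and to the TimerService.
③ The TimerService can register callbacks for future event-time or processing-time instants. When a timer's time is reached, onTimer(…) is called. During that call, all state is again scoped to the key with which the timer was created, so the timer can manipulate keyed state.
In short, a ProcessFunction can read timestamps and watermarks, register timers, and emit specific events. Flink SQL is itself implemented using process functions.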

To access keyed state and timers, the function must be applied to a keyed stream:

stream.keyBy(...).process(new MyProcessFunction())

Flink provides eight process functions: ProcessFunction, KeyedProcessFunction, CoProcessFunction, ProcessJoinFunction, BroadcastProcessFunction, KeyedBroadcastProcessFunction, ProcessWindowFunction, and ProcessAllWindowFunction.

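All process functions implement the RichFunction interface, so they all have open(), close(), and getRuntimeContext(). ProcessFunction and KeyedProcessFunction additionally provide two methods, processElement and onTimer: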

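processElement: called once for every incoming element. Results are emitted through the Collector. The Context gives access to the element's timestamp, the element's key, and the TimerService. The Context can also emit results to other streams (side outputs).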

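onTimer: a callback invoked when a previously registered timer fires. The timestamp parameter is the time the timer was set to fire at. The Collector is used to emit results. Like the Context of processElement, the OnTimerContext provides contextual information, for example the time domain of the firing timer (event time or processing time).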
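The two methods can be sketched together as follows. This is a minimal illustration rather than code from the original text: the `Reading` type, the `ThresholdFunction` class, the `alerts` side-output tag, and the 10-second delay are all hypothetical names and values chosen for the example, and the Flink Scala DataStream API is assumed to be on the classpath.

```scala
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical event type, used only in this sketch.
case class Reading(sensorId: String, value: Double)

object ThresholdFunction {
  // Hypothetical side output that receives a message for each high reading.
  val alerts: OutputTag[String] = OutputTag[String]("alerts")
}

class ThresholdFunction(threshold: Double)
    extends KeyedProcessFunction[String, Reading, Reading] {

  override def processElement(
      reading: Reading,
      ctx: KeyedProcessFunction[String, Reading, Reading]#Context,
      out: Collector[Reading]): Unit = {
    // Main output: forward every element through the Collector.
    out.collect(reading)

    if (reading.value > threshold) {
      // The Context exposes the current key and the element's timestamp
      // (the timestamp is null if the stream carries no timestamps).
      val msg = s"key=${ctx.getCurrentKey} value=${reading.value} ts=${ctx.timestamp()}"
      // Side output: emit to a separate stream identified by the tag.
      ctx.output(ThresholdFunction.alerts, msg)
    }

    // Register a processing-time callback 10 seconds from now.
    val now = ctx.timerService().currentProcessingTime()
    ctx.timerService().registerProcessingTimeTimer(now + 10000L)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, Reading, Reading]#OnTimerContext,
      out: Collector[Reading]): Unit = {
    // `timestamp` is the time the timer was registered for; the OnTimerContext
    // reports the time domain (EVENT_TIME or PROCESSING_TIME) of the firing timer.
    val msg = s"timer for key=${ctx.getCurrentKey} fired at $timestamp (${ctx.timeDomain()})"
    ctx.output(ThresholdFunction.alerts, msg)
  }
}
```

Assuming a `DataStream[Reading]` named `readings`, the function would be applied with `readings.keyBy(_.sensorId).process(new ThresholdFunction(100.0))`, and the alert messages read from the result with `getSideOutput(ThresholdFunction.alerts)`.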

2 Low-level join

To perform low-level operations on two inputs, an application can use CoProcessFunction or KeyedCoProcessFunction.

CoProcessFunction is bound to two different input streams and calls processElement1(…) and processElement2(…) separately to process the records of each input.

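A low-level join typically follows this pattern: ① create a state object for one input (or both) ② update the state when an element arrives from its input ③ when an element arrives from the other input, probe the state and produce the joined result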
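The three steps above can be sketched with a CoProcessFunction over two keyed inputs. This is an illustrative sketch under stated assumptions, not the original author's code: the `Order` and `Payment` types, the `OrderPaymentJoin` class, and the state names are hypothetical, and each key is assumed to see at most one record per input.

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical record types, used only in this sketch.
case class Order(orderId: String, amount: Double)
case class Payment(orderId: String, method: String)

class OrderPaymentJoin
    extends CoProcessFunction[Order, Payment, (Order, Payment)] {

  // Step ①: one state object per input, scoped to the current key.
  lazy val orderState: ValueState[Order] = getRuntimeContext.getState(
    new ValueStateDescriptor[Order]("pending-order", classOf[Order]))
  lazy val paymentState: ValueState[Payment] = getRuntimeContext.getState(
    new ValueStateDescriptor[Payment]("pending-payment", classOf[Payment]))

  override def processElement1(
      order: Order,
      ctx: CoProcessFunction[Order, Payment, (Order, Payment)]#Context,
      out: Collector[(Order, Payment)]): Unit = {
    // Step ③: probe the other input's state.
    val payment = paymentState.value()
    if (payment != null) {
      out.collect((order, payment))
      paymentState.clear()
    } else {
      // Step ②: no match yet, remember this element.
      orderState.update(order)
    }
  }

  override def processElement2(
      payment: Payment,
      ctx: CoProcessFunction[Order, Payment, (Order, Payment)]#Context,
      out: Collector[(Order, Payment)]): Unit = {
    val order = orderState.value()
    if (order != null) {
      out.collect((order, payment))
      orderState.clear()
    } else {
      paymentState.update(payment)
    }
  }
}
```

Assuming streams named `orders` and `payments`, both inputs must be keyed by the same field for the keyed state to line up, for example `orders.keyBy(_.orderId).connect(payments.keyBy(_.orderId)).process(new OrderPaymentJoin)`. In this sketch an unmatched element stays in state indefinitely; a timer could be registered to clear it after a timeout.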

3 KeyedProcessFunction

KeyedProcessFunction, an extension of ProcessFunction, gives access to the key of a timer in its onTimer(…) method.

KeyedProcessFunction operates on a KeyedStream. It is called for every element of the stream and emits zero, one, or more elements.

override def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[OUT]): Unit = {
  val key = ctx.getCurrentKey
  // ...
}

4 Timers

Both processing-time and event-time timers are maintained internally by the TimerService and queued for execution. They are only available on keyed streams.

Flink keeps only one timer per key and timestamp. If multiple timers are registered for the same key and timestamp, onTimer() is called only once.

Flink synchronizes calls to onTimer() and processElement(), so users do not need to worry about concurrent modification of state.

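Fault tolerance: timers are fault tolerant and are checkpointed along with the application's state. When the application recovers from a failure or is started from a savepoint, the timers are restored. A large number of timers can increase checkpointing time, because timers are part of the checkpointed state.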

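Timer coalescing: because Flink keeps only one timer per key and timestamp, you can reduce the number of timers by lowering the timer resolution so that they coalesce. Event-time timers fire only when a watermark arrives.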

// For a timer resolution of 1 second (event or processing time), round the target time down to full seconds.
// Timers then fire at most 1 second earlier than requested, but never later. As a result, there is at most one timer per key and second.
val coalescedTime = ((ctx.timestamp + timeout) / 1000) * 1000
ctx.timerService.registerProcessingTimeTimer(coalescedTime)

// Event-time timers fire only when a watermark comes in, so you can also schedule them against the current watermark and coalesce them with the next one:
val coalescedTime = ctx.timerService.currentWatermark + 1
ctx.timerService.registerEventTimeTimer(coalescedTime)

// Stop a processing-time timer:
val timestampOfTimerToStop = ...
ctx.timerService.deleteProcessingTimeTimer(timestampOfTimerToStop)

// Stop an event-time timer:
val timestampOfTimerToStop = ...
ctx.timerService.deleteEventTimeTimer(timestampOfTimerToStop)
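The effect of the rounding can be checked without running Flink. The sketch below is a plain-Scala model of the arithmetic only: `TimerCoalescing`, `CoalescingDemo`, and the sample timestamps are hypothetical, and a `Set` stands in for Flink's rule of one timer per key and timestamp. It assumes non-negative timestamps, since integer division rounds toward zero.

```scala
object TimerCoalescing {
  /** Rounds a non-negative target time (ms) down to a multiple of the resolution. */
  def coalesce(targetMillis: Long, resolutionMillis: Long = 1000L): Long =
    (targetMillis / resolutionMillis) * resolutionMillis

  /** Distinct timer timestamps a single key would hold after registering all targets. */
  def distinctTimers(targets: Seq[Long], resolutionMillis: Long = 1000L): Set[Long] =
    targets.map(coalesce(_, resolutionMillis)).toSet
}

object CoalescingDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical element timestamps for one key, all within the same second.
    val timeout = 5000L
    val eventTimes = Seq(10100L, 10450L, 10900L)
    val targets = eventTimes.map(_ + timeout) // 15100, 15450, 15900

    // Without rounding, each element registers its own timer.
    println(targets.toSet.size) // prints 3

    // With 1-second resolution, all three collapse into one timer.
    println(TimerCoalescing.distinctTimers(targets)) // prints Set(15000)
  }
}
```

Each coalesced timer fires up to one resolution step earlier than the exact target, which is the trade-off for holding fewer timers in checkpointed state.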

5 Official example

In this example, a KeyedProcessFunction maintains a count per key and emits a key/count pair whenever one minute (in event time) passes without an update for that key:

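The count, key, and last-modification timestamp are stored in a ValueState, which is implicitly scoped by key.
For each record, the KeyedProcessFunction increments the counter and sets the last-modification timestamp.
The function also schedules a callback one minute into the future (in event time).
Upon each callback, it compares the callback's event-time timestamp with the last-modification time of the stored count and emits the key/count pair if they match (that is, no further update occurred during that minute).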

import org.apache.flink.api.common.state.ValueState
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// the source data stream
val stream: DataStream[Tuple2[String, String]] = ...

// apply the process function onto a keyed stream
// (keyBy(0) keys by tuple position, which is why the key type below is Tuple)
val result: DataStream[Tuple2[String, Long]] = stream
  .keyBy(0)
  .process(new CountWithTimeoutFunction())

/**
  * The data type stored in the state
  */
case class CountWithTimestamp(key: String, count: Long, lastModified: Long)

/**
  * The implementation of the ProcessFunction that maintains the count and timeouts
  */
class CountWithTimeoutFunction extends KeyedProcessFunction[Tuple, (String, String), (String, Long)] {

  /** The state that is maintained by this process function */
  lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext
    .getState(new ValueStateDescriptor[CountWithTimestamp]("myState", classOf[CountWithTimestamp]))

  override def processElement(
      value: (String, String),
      ctx: KeyedProcessFunction[Tuple, (String, String), (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {

    // initialize or retrieve/update the state
    val current: CountWithTimestamp = state.value match {
      case null =>
        CountWithTimestamp(value._1, 1, ctx.timestamp)
      case CountWithTimestamp(key, count, lastModified) =>
        CountWithTimestamp(key, count + 1, ctx.timestamp)
    }

    // write the state back
    state.update(current)

    // schedule the next timer 60 seconds from the current event time
    ctx.timerService.registerEventTimeTimer(current.lastModified + 60000)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[Tuple, (String, String), (String, Long)]#OnTimerContext,
      out: Collector[(String, Long)]): Unit = {

    state.value match {
      // emit only if no update arrived since this timer was scheduled
      case CountWithTimestamp(key, count, lastModified) if (timestamp == lastModified + 60000) =>
        out.collect((key, count))
      case _ =>
    }
  }
}