Contents

0. Preface

1. Introduction

2. Overview of BOCD

3. Possible extensions

4. References

5. Postscript


0. Preface

This post is an introduction to a supposedly very beginner-friendly blog post about Bayesian changepoint detection (which I still haven't fully understood, so evidently I don't even qualify as a beginner). The original link:

bayesian-online-change-point-detection-an-intuitive-understanding

That blog post introduces the algorithm from "[Ryan Prescott Adams2007]: Bayesian Online Changepoint Detection". I originally set out to read [Ryan Prescott Adams2007], chewed on it for a long while without understanding it, and then found this post, which is meant to help beginners understand the original paper better; as it turns out, I still haven't fully understood it either (at least as of publishing this entry). I had wanted to translate the post to share with everyone, but translating something I haven't yet understood would risk squandering a good thing. Still, I really do want to share it (and perhaps find fellow readers interested in the same topic to discuss with, which would be a delight), so I attach the original text below, interspersed here and there with my own notes and questions (being my own takes, they inevitably may contain misunderstandings)...

1. Introduction

        What is a changepoint? There can be abrupt variations in a data sequence, characterized by any combination of parameters — e.g., the mean, variance or periodicity of the data may abruptly change after a particular time. We might be interested in catching the earliest time at which such a variation occurs, and this is referred to as "changepoint" detection. Applications of changepoint detection span a variety of domains — financial data, sensor data, biometrics etc.

I had set out to solve this exact problem for my domain, and I came across a few notable algorithms in this regard —

  • Bayesian online changepoint detection (BOCD)
  • PELT (Pruned Exact Linear Time) changepoint detection
  • A few non-parametric algorithms

Selection criteria for the algorithm that I was looking for —

  1. Needs to be an online detector (i.e., it should function in real time on streaming data)
  2. Should provide for inputting all the domain-knowledge information and leverage it in coming up with the changepoints. [chenxy] That is, the algorithm should be able to make effective use of (domain-specific or other) prior knowledge, which is exactly where Bayesian methods have an advantage over other kinds of algorithms.
  3. Should be robust to noise, occasional false positives and outliers
  4. Should be elegant, and we should be able to understand the inner workings of the algorithm so that it can be extended to suit our needs (this was a very important criterion for me)
  5. Should be able to detect any change in the time series data — whether in mean, variance or anything else

In short, the algorithm should (1) work online, (2) be able to use prior information, (3) be robust, (4) be elegant and extensible, and (5) detect changepoints of any type, e.g., in mean, variance, and so on.

In this context, Bayesian online changepoint detection (BOCD) suited my basic needs. I further noticed that most of the content on the above algorithms was so esoteric and mathematical in nature that one might easily get lost in the equations before developing any intuitive understanding of how exactly the changepoint detection algorithm works. Here I want to discuss my intuitive understanding of BOCD in the simplest possible terms (and forgive me if, in this process, I use certain 'terms' or 'principles' loosely; some academicians might kill me for that!). Hopefully this intuitive understanding will help someone wanting to extend this algorithm or tune the hyperparameters to their needs. You can look at the code, in parallel, on the GitHub link in the References section below.

2. Overview of BOCD

I will present a quick overview of the BOCD algorithm from the original paper and will use the same example figures referenced in this paper with modified illustration and detail.

Fig 1(a) represents an example of a data sequence arriving in time steps (note — each data point arrival corresponds to a new time step), with time plotted on the x-axis and the data value plotted on the y-axis. As can be observed, the mean changes at the 5th time step and once again at the 11th time step. So essentially, we want to detect these changepoints at t=5 and t=11 when presented with the above data sequence. Here g1, g2, g3 can be thought of as three partitions of the data, separated by changepoints.

Fig 1(a) — Data arriving in time steps (time on x-axis and data values on y-axis)
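[chenxy] To make Fig 1(a) concrete, here is a minimal sketch (mine, not from the post) that generates such a sequence; the partition means, segment lengths and noise level are made-up values, chosen only so that the mean shifts land at t=5 and t=11:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical means for partitions g1, g2, g3 and a made-up noise level.
    means = [0.0, 5.0, 1.0]
    lengths = [4, 6, 5]   # mean shifts then occur at t=5 and t=11 (1-indexed)
    data = np.concatenate(
        [m + rng.normal(0.0, 0.5, n) for m, n in zip(means, lengths)]
    )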

Fig 1(b) — The run length representation for the above data

Now refer to Fig 1(b) — BOCD models changepoint detection in terms of run length. Having observed the previous data point(s), the run length simply indicates whether the new datum still belongs to the same partition. If it does, then the run length at the next time step grows by 1 (indicated by the dark line); otherwise it falls to zero (indicated by the dashed line). This process repeats at every time step. ([chenxy] When a changepoint occurs the run length resets to zero; otherwise it grows by 1. As shown above, x5 in Fig 1(a) is a changepoint, so the corresponding r5 becomes 0.)

A small detour to give an analogy — think of clustering. Clustering could be an alternative way to model the same concept, by determining how far the new datum is from the cluster centroid to infer whether the partition has to change. But in the Bayesian approach, we deal with probabilities to determine whether the new datum has changed partitions. And the beauty of Bayesian statistics, unlike distance thresholding, is that it has a lot of advantages (at least in changepoint detection), as it provides us a way to "input" our domain knowledge into the mix. For example, some ways in which we can input our domain knowledge in BOCD —

  • Input1: Let's say we have some information that a machine part is likely to fail after 4 years, and we want to input this assumption into the observed sensor data stream — this knowledge can be modeled in the algorithm using a hazard function (the hazard function is very simple to understand and there are many links out there that explain it very well, so I will not get into the details; see the References section below, where I have provided a link to a video in which it is explained very clearly). ([chenxy] Domain knowledge is of course also a kind of prior knowledge. For example, past statistics may tell us that the failure probability of a certain machine part becomes very high after 4 years of use; this prior knowledge can be modeled with a hazard function and supplied to the BOCD algorithm.)
  • Input2: Assume we want to model some prior information about the data — for example, we might have some knowledge about the underlying probability distribution, such as whether it is a Gaussian or a Poisson distribution (for instance, if we want to detect a change in the number of requests handled per day at a call center, then a Poisson distribution could plausibly be the underlying distribution). Even if we do not have enough information about the parameter values to apply to the underlying distribution, we can just initialize it with some likely values and the algorithm will be able to figure out the distribution parameters from the observed data. This information is useful in modelling the underlying probability distribution to come up with the posterior predictive probability (referred to as the pdf value later; we will come to this). ([chenxy] Prior probability information: we may know something about the probability distribution of the observed data; for example, the number of calls arriving per day at a call center typically follows a Poisson distribution, and this is called the prior probability distribution. Prior information covers not only the type of distribution but also its concrete parameters, i.e., the hyperparameters: ν and χ in Algorithm 1 of the original paper, or the μ, σ mentioned later in this post. But even if we know nothing about the prior distribution, it does not matter: pick a plausible initial value, and through iterated Bayesian inference the algorithm will gradually converge on the distributional characteristics that the data follow, which is called the posterior probability distribution. One advantage of Bayesian inference is that the more observed data there are, the smaller the influence of the prior distribution on the posterior predictions. In other words, with a long enough observation sequence, the prior distribution can be set almost arbitrarily.)

In short, there are two kinds of prior information: (1) the hazard function; (2) the probabilistic characteristics of the data (the underlying probability distribution). A code sketch follows.
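[chenxy] As a sketch of how these two inputs typically enter the code (the names constant_hazard, mu0, var0, sigma2 are my own, not necessarily those of the repo in the References): a constant hazard encodes the assumption that a changepoint is equally likely at every step, with rate 1/lam, and the prior is initialized with guessed hyperparameters:

    import numpy as np

    def constant_hazard(lam, r):
        """P(changepoint at run length r); a constant rate 1/lam for every r."""
        return np.full(np.shape(r), 1.0 / lam)

    # Input1: e.g. expect a changepoint roughly every 250 steps (made-up rate).
    hazard = lambda r: constant_hazard(250.0, r)

    # Input2: a guessed Gaussian prior over the mean, with known noise variance.
    mu0, var0, sigma2 = 0.0, 1.0, 1.0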

With this understanding, now let’s jump into the details and understand step by step —

Assume we are at t=2 (fig 2a). At this time step, we have already observed the dark point at t=1 and just observed the green point. So, what happens internally? Refer to figure 2b.

Fig 2a / Fig 2b

We will use the notation of R[r,t] to denote a particular point on this graph in Fig 2b. As per this notation —

  • r → represents the run length (row value; the y-axis coordinate)
  • t → represents the corresponding time step (column value; the x-axis coordinate)
  • R → represents a matrix of all possible run-lengths and timesteps
  • We initialize R[0,0] = 1 ([chenxy] Note: this value lives in a third dimension that the 2-D Fig 2b does not show. But what does R fundamentally represent? In other words, what is its physical meaning, and what does initializing it to 1 mean? Is it necessary, or could it, say, be initialized to 0 instead? Only after reading to the end and comparing with the open-source code and the original paper did I finally understand: the values stored in the R matrix are the run-length posterior distribution, i.e., Eq. (2) of the original [R.P.Adams2007] paper, P(r_t | x_{1:t}). Initializing to 1 then makes sense: at t=0 the run length is 0 with probability 1.)

As per the permission I have taken from you to use some terms loosely, I will sometimes refer to R[r,t] as a node, to indicate that it holds some state values. This makes things a little easier for me to explain, although that is not the way it is implemented in the algorithm. ([chenxy] In this post, R[r,t] sometimes denotes the node at coordinates (t,r) in Fig 1b, following the Cartesian convention of putting the horizontal coordinate first; the node holds the state values described below. At other times it denotes only the run-length posterior probability at that point, i.e., P(r_t | x_{1:t}).)

So, rewriting what I said in this notation: the algorithm has already learned something about the dark point in Fig 2a (which corresponds to R[0,1] in Fig 2b; [chenxy] that is, after receiving x1, the state moves from R[0,0] to R[0,1] based on what was learned from x1) and has just observed a new datum (the green point).

So what can we learn about the data value that was seen at R[0,1] ([chenxy] note: the data seen at R[r(t), t] is x_t)?

  • We know the data point value and
  • The prior probability distribution parameters that we initialized the algorithm with (refer to 'Input2' above). ([chenxy] I suspect the author slipped here. The distribution parameters (usually called hyperparameters, i.e., ν and χ in Algorithm 1 of the original paper) are initialized with prior values at the start and then updated after every new datum arrives, the updated values serving as the prior parameters for the next step. Here the datum observed at R[0,1] is x2, and by this point ν and χ have already been updated upon observing x1 at the previous step; they are no longer the ones initialized in 'Input2'.)

Using the above two inputs, we can now compute the probability density value (pdf value) for this new datum, i.e., the density of the new datum under the distribution whose parameters are held at R[0,1].

[chenxy] How exactly is the update done? Is it {R[0,0], x1} → R[0,1], and then {R[0,1], x2} → R[1,2]? Since what we track is the run length: after the first sample arrives, the current run starts and its length is still 0; then, if the second sample is judged to belong with the first, the run length grows by 1; otherwise, if x2 is judged not to belong with x1, the run restarts and its length resets to zero. Hence the update rule for R[r(t), t] can be written as {R[r(t−1), t−1], x_t} → R[r(t), t], where r(t) = r(t−1) + 1 or r(t) = 0. Of course, in the world of Bayesian inference the actual update is not black-and-white like this; everything appears in the form of probabilities.
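[chenxy] For the Gaussian known-variance model (the μ, σ example the post uses later), the "pdf value" is the posterior predictive density of the new datum under the parameters accumulated along that run length. A minimal sketch, under that assumption and with my own names:

    from scipy.stats import norm

    def predictive_pdf(x_new, mu_r, var_r, sigma2):
        """Posterior predictive density of x_new for one run length: a Normal
        whose variance adds parameter uncertainty (var_r) to noise (sigma2)."""
        return norm.pdf(x_new, loc=mu_r, scale=(var_r + sigma2) ** 0.5)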

So, to generalize, the node at R[r,t] holds the following 5 state values (refer to Fig 3 below) —

  1. pdf value: the predicted probability value of this new datum at R[r,t], computed through the pdf function using the distribution parameters at the previous node, i.e., at R[r-1,t-1]. This understanding is very crucial, and essentially what it translates to is this — if the new datum is very similar to the points that the run length has already seen along its path, then understandably this pdf value will be quite high for the new datum; otherwise it will be very low. ([chenxy] This notation feels a bit confusing. r (the run length) is itself a function of t, so R[r(t), t] instead of R[r,t] might be more appropriate. The state transition can then be written as {R[r(t−1), t−1], x_t} → R[r(t), t], where r(t) = r(t−1) + 1 or 0. The description above may suggest that we always go from R[r-1,t-1] to R[r,t], but that holds only when no changepoint occurs; when a changepoint occurs we go from R[r-1,t-1] to R[0,t]. This "pdf value" (a rather loose term) should correspond to the Predictive Probability of [R.P.Adams2007], Algorithm 1, Step 3: π_t^{(r)} = P(x_t | ν_t^{(r)}, χ_t^{(r)}).)
  2. New distribution parameter values: factoring the new datum into the underlying distribution, compute the new parameter values. For example, with a Gaussian as the underlying data distribution, calculate the new values μ′, σ′ at (r,t) from the μ, σ values at (r-1, t-1), factoring in the new datum. How the new μ′ and σ′ are obtained is straightforward math (er... I can feel my intelligence being condescended to...) that I omit here. [chenxy] (1) What exactly does "factoring the new datum into the underlying distribution" mean here? (2) "Distribution parameter values" should refer to the hyperparameters, i.e., ν and χ in the original paper (μ, σ in this Gaussian example).
  3. Hazard value, as a function of run length, r (See references for explanation of Hazard function)
  4. Growth probability: this is the probability that the run length grows to the next step (i.e., from 'r-1' to 'r') for the new datum ([chenxy] "to the next step" here feels redundant and is if anything misleading). Very loosely speaking, we can understand this as the multiplication of three probabilities (applying the multiplicative rule of probability):
    1. Probability that the run length has come till ‘r-1’ (this is nothing but the growth probability at Node R[r-1,t-1])
    2. Probability that a hazard didn't happen in arriving at this run length 'r' (= 1 − H, if H is the probability that a hazard happens)
    3. pdf value for the new datum given the underlying probability distribution parameters that the run length has seen along its path.
    4. All the above three have to happen to grow this run length by one from (r-1,t-1) → (r,t)… hence we get: growth_prob(r,t) = growth_prob(r−1, t−1) × (1 − H) × pdf.

      [chenxy] Rewritten from the original prose; this corresponds to [R.P.Adams2007], Algorithm 1, Step 4: P(r_t = r_{t−1}+1, x_{1:t}) = P(r_{t−1}, x_{1:t−1}) · π_t^{(r)} · (1 − H(r_{t−1})). In a place like this, prose instead of formulas does not make things easier to understand; it backfires.

  5. The changepoint probability value for this node at (r,t): the intuition is the same as for the growth probability, except that we consider a hazard to happen (while computing the growth probability we consider that no hazard happens), in which case the run length becomes zero at the next step. So this change probability is: change_prob(r,t) = growth_prob(r−1, t−1) × pdf × H.

[chenxy] Same as above; corresponds to [R.P.Adams2007], Algorithm 1, Step 5: P(r_t = 0, x_{1:t}) = Σ_{r_{t−1}} P(r_{t−1}, x_{1:t−1}) · π_t^{(r_{t−1})} · H(r_{t−1}).

(Note — this equation is valid only when the run length is not equal to zero for that node. When the node's run length is 0, the change probability for that node = the sum of the change probabilities of all the nodes at the previous time step. We will see why this is so shortly…)

[chenxy] (1) The change/reset probability at the current time step is computed differently depending on whether the run length at the previous time step is 0… this Note actually left me more confused, and there doesn't seem to be much further explanation later; (2) I was also thrown by "the next step" in items 4 and 5. The discussion above should be about observing a new datum at R(r(t−1), t−1) and updating to the state R(r(t), t). A code sketch of the full per-step update follows below.
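[chenxy] Putting the five state values together, one time step of the recursion looks roughly as follows. This is a sketch of Algorithm 1 of the original paper under the Gaussian known-variance model; the function and variable names are mine, and the linked repo's code may differ:

    import numpy as np
    from scipy.stats import norm

    def bocd_step(x, R_prev, mu, var, mu0, var0, sigma2, hazard):
        """One time step. R_prev[r] = P(run length r | data so far);
        mu[r], var[r] are the per-run-length posterior parameters of the mean."""
        r = np.arange(len(R_prev))
        # Step 3: pdf value of the new datum under each run length's parameters.
        pred = norm.pdf(x, loc=mu, scale=np.sqrt(var + sigma2))
        h = hazard(r)
        R_new = np.empty(len(R_prev) + 1)
        # Step 4: growth probabilities (run length r -> r+1, no hazard).
        R_new[1:] = R_prev * pred * (1.0 - h)
        # Step 5: changepoint probability (any run length -> 0, hazard happens).
        R_new[0] = np.sum(R_prev * pred * h)
        # Steps 6-7: normalize to obtain the run-length posterior.
        R_new /= R_new.sum()
        # Step 8: conjugate update of the Gaussian-mean parameters;
        # run length 0 restarts from the prior (mu0, var0).
        var_post = 1.0 / (1.0 / var + 1.0 / sigma2)
        mu_post = var_post * (mu / var + x / sigma2)
        mu = np.concatenate(([mu0], mu_post))
        var = np.concatenate(([var0], var_post))
        return R_new, mu, var

Starting from R = [1.0], mu = [mu0], var = [var0] (the R[0,0] = 1 initialization above) and calling bocd_step once per incoming datum builds the R matrix column by column.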

Fig3

Considering the above data sequence, the same growth story repeats for time steps 3 and 4 as well. Now consider a red point arriving at time step t = 5 (Fig 4a).

Fig 4a

See Fig 4b for the corresponding run length up to R[3,4]. The growth probability for each node along the run length, i.e., for R[0,1] → R[1,2] → R[2,3] → R[3,4], has been good enough for the run length to be incremented by 1 at each time step until t=4, because the dark point values were similar to each other (rather, they appear to come from the same underlying probability distribution, and hence the pdf value for each point along the path was higher, resulting in a higher value of the growth probability). When a very dissimilar point (like the red one) happens, the pdf value returned from the underlying probability distribution constituted by these previous 4 similar points at R[0,1], R[1,2], R[2,3] and R[3,4] respectively(?) becomes very low! So the computed growth probability for the node at R[0,5] in turn becomes very low!

[chenxy] It feels increasingly muddled. Shouldn't it be like this: in state R[3,4], upon receiving the new datum x5, we compute, say, the growth probability and the changepoint probability; combining the two (say, the former low enough and the latter high enough), we conclude that x5 does not belong with the previous x1–x4 (i.e., a changepoint occurred), or in other words the likelihood of transitioning to R(0,5) is higher than that of transitioning to R(4,5), and so we move to R(0,5). Alternatively, as in the Viterbi decoding algorithm, both the path from R(3,4) to R(0,5) and the path from R(3,4) to R(4,5) are retained, with the former simply carrying a higher probability.

Fig 4b

Now let's discuss how to compute the change probability for a node when the run length for that node is 0, which I omitted earlier. Before we talk about the changepoint, one thing that I haven't detailed enough so far is that we start off with a run length of 0 at every time step. See Fig 5 below for a visual understanding.

Now, if the computed growth probability of the next node for the new datum is big enough, then that slope grows; otherwise it falls back to zero on its next step, as indicated by the dashed lines in Fig 5 below, and this is true for all nodes at every time step. So, to compute the change probability for a node at run length 0 (regardless of the time step), we use the probability sum rule, i.e., we sum up the change probabilities of each node at the previous time step, because each node at the previous time step carries some probability of being vulnerable to a change when the new datum happens. [chenxy] This corresponds to the sum over r(t−1) in [R.P.Adams2007], Algorithm 1, Step 5.

Fig 5 ([chenxy] very similar to the trellis diagrams of forward error-correction codes such as Viterbi and Turbo in communication systems)

So, when the red data point happens, the change probability for the node at R[0,5] = the sum of {change_prob for the nodes at R[0,4], R[1,4], R[2,4] and so on…}. So the change probability at R[0,5] mostly dominates the growth probability of any node at R[:,5].

When another new data point arrives at time step 6 (assuming this data point is very similar to the one at time step 5 and dissimilar to the points at time steps 1, 2, 3 and 4), the growth probability at R[1,6] will become higher, whereas the growth probabilities for R[2,6], R[3,6] etc. will become much lower, as point 6 doesn't conform to the underlying probability distribution that the previous run lengths have seen… And this causes the run length from R[0,5] to grow, and the run lengths that started off from R[0,1], R[0,2], R[0,3] and R[0,4] to almost stop!

Congratulations if you have come this far! Now finally, the way we can identify the changepoints is —

  1. Prepare a 'maxes' array by taking, for each time step, the argmax over the run-length axis of the matrix R (growth probability value? [chenxy] Judging from the open-source reference code, the R matrix should store the run-length posterior distribution, i.e., Eq. (2) of the original R.P.Adams2007 paper; the sentence that follows says the same thing). This gives the most probable run length for that time step — you can look at this maxes array in the code referenced in the GitHub link below.
  2. Compare two adjacent values in this 'maxes' array. If there is a significant jump, greater than a fixed threshold, between any two adjacent values of the maxes array, then you have a corresponding changepoint there! (See the sketch below.)
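[chenxy] A minimal sketch of this detection rule (the threshold value is arbitrary here, and the exact logic in the linked repo may differ):

    import numpy as np

    def detect_changepoints(R, jump_threshold=5):
        """R: run-length posterior matrix, one column per time step. Flags a
        changepoint wherever the most probable run length drops sharply."""
        maxes = np.argmax(R, axis=0)        # most probable run length per step
        drops = maxes[:-1] - maxes[1:]      # positive where the run collapses
        return np.where(drops > jump_threshold)[0] + 1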

3. Possible extensions

  • You can model different distributions depending on the known prior parameters and knowledge about the underlying data
  • You can consider using MCMC as an alternative to the conjugate prior (I haven't introduced conjugate priors in this discussion, although everything I described above relies heavily on this assumption. The reason I omitted it is that it is slightly mathematical and could distract the reader)
  • You can further condition the changepoint trigger on top of the output obtained by this algorithm (for example, you may want to reduce false positives, or input more prior domain-knowledge information)
  • You may consider a moving window if the data sequence becomes too large (see the sketch below)
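[chenxy] For the last point, a common way to bound memory and compute is to truncate the run-length distribution at each step; a minimal sketch (the truncation strategy and cap are my own choices, not from the post):

    def prune(R_col, mu, var, max_len=500):
        """Keep only the first max_len run lengths and renormalize."""
        if len(R_col) > max_len:
            R_col, mu, var = R_col[:max_len], mu[:max_len], var[:max_len]
            R_col = R_col / R_col.sum()
        return R_col, mu, var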

4. References

  1. The original paper is here.
  2. If you are interested in the mathematical treatment, a very well written and detailed blog is here.
  3. A reference for understanding the hazard function is here.
  4. The original GitHub repo, forked and extended with Normal Known Precision, Poisson and Gaussian hazard functions, is here.

5. Postscript

My understanding of this blog post is still halfway at best, to say nothing of the original paper. I will come back and update this entry as my understanding deepens (and quite possibly as I discover that some of the notes and questions added above are off the mark).

2021-09-03

See also: Paper notes: Bayesian Online Changepoint Detection


2021-09-04: Revision 1: the values stored in the R matrix represent the growth probability value.

2021-09-05: Correction: the values stored in the R matrix represent the run-length posterior distribution.

2021-09-08: Revision:

The Bayesian Online Changepoint Detection post mentioned in the References above is a deeper and more comprehensive reading of BOCD, though for beginners like me it is of course that much harder to crack. I hope to find the courage to take it on later.
