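A Code Walkthrough of Linear-Chain Conditional Random Fields

A CRF layer is an essential part of most NER models, so I went through the theory behind CRFs and the CRF code in allennlp, and I am writing down my notes here.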

1. Introduction to linear-chain CRFs

1.1 General form

For a detailed introduction to linear-chain conditional random fields, see Li Hang's book Statistical Learning Methods (《统计学习方法》); only the general definitions are given here.
Let $P(Y|X)$ be a linear-chain conditional random field. Given that the random variable $X$ takes the value $x$, the conditional probability that the random variable $Y$ takes the value $y$ has the following form (note that both $x$ and $y$ are sequences):

$$P(y|x)=\frac{1}{Z(x)}\exp\left(\sum_{i,k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_l s_l(y_i,x,i)\right)\qquad(11.10)$$

where

$$Z(x)=\sum_{y}\exp\left(\sum_{i,k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_l s_l(y_i,x,i)\right)\qquad(11.11)$$

Here $t_k$ is a transition feature function, which depends on the current and the previous position; $s_l$ is a state feature function, which depends on the current position only; $\lambda_k$ and $\mu_l$ are the corresponding weights. $Z(x)$ is the normalization factor, and its sum runs over all possible output sequences (note that "all possible" does not mean arbitrary combinations: which sequences are possible depends on the value of $x$).

1.2 Simplified form

In (11.10) the same feature is defined at every position, so each feature can be summed over all positions, turning the local feature functions into global feature functions. The CRF can then be written as the inner product of a weight vector and a feature vector (covering both transition and state features). This is the simplified form of the CRF.
First, use a unified notation for the transition features, the state features and their weights. Suppose there are $K_1$ transition features and $K_2$ state features, with $K=K_1+K_2$, and define:

$$f_k(y_{i-1},y_i,x,i)=\begin{cases} t_k(y_{i-1},y_i,x,i), & k=1,2,\dots,K_1 \\ s_l(y_i,x,i), & k=K_1+l;\ l=1,2,\dots,K_2 \end{cases}\qquad(11.12)$$

Then sum the transition and state features over all positions $i$:

$$f_k(y,x)=\sum_{i=1}^{n}f_k(y_{i-1},y_i,x,i),\quad k=1,2,\dots,K\qquad(11.13)$$

Let $w_k$ denote the weight of the feature $f_k(y,x)$, that is:

$$w_k=\begin{cases} \lambda_k, & k=1,2,\dots,K_1 \\ \mu_l, & k=K_1+l;\ l=1,2,\dots,K_2 \end{cases}\qquad(11.14)$$

With this notation the CRF can be written as:

$$P(y|x)=\frac{1}{Z(x)}\exp\sum_{k=1}^{K}w_k f_k(y,x)\qquad(11.15)$$

$$Z(x)=\sum_{y}\exp\sum_{k=1}^{K}w_k f_k(y,x)\qquad(11.16)$$

Let $w$ denote the weight vector, $w=(w_1,w_2,\dots,w_K)^T$, and let $F(y,x)=(f_1(y,x),f_2(y,x),\dots,f_K(y,x))^T$ denote the global feature vector. The CRF can then be written as the inner product of $w$ and $F(y,x)$:

$$P_w(y|x)=\frac{1}{Z_w(x)}\exp\left(w\cdot F(y,x)\right)\qquad(11.19)$$

where

$$Z_w(x)=\sum_{y}\exp\left(w\cdot F(y,x)\right)\qquad(11.20)$$
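To make the simplified form concrete, here is a minimal sketch with one transition feature and one state feature. The sentence, the tags, the two feature functions `t1` and `s1`, and the weights are all made up for illustration; nothing here comes from a real model. It checks that the inner product $w\cdot F(y,x)$ gives the same exponent as summing the weighted local features position by position.

```python
# Hypothetical toy example: one transition feature (K_1 = 1) and one state feature (K_2 = 1).
x = ["EU", "rejects", "German"]
y = ["B", "O", "B"]
w = [0.5, 1.2]  # w_1 = lambda_1, w_2 = mu_1 (made-up weights)

def t1(prev_tag, tag, x, i):
    """Transition feature: fires on a B -> O transition."""
    return 1.0 if (prev_tag, tag) == ("B", "O") else 0.0

def s1(prev_tag, tag, x, i):
    """State feature: fires when a capitalized word is tagged B (prev_tag is ignored)."""
    return 1.0 if tag == "B" and x[i][0].isupper() else 0.0

local_features = [t1, s1]  # f_1 = t_1, f_2 = s_1, as in (11.12)

def global_features(y, x):
    """F(y, x): each local feature summed over all positions, as in (11.13)."""
    padded = ["start"] + list(y)  # y_0 = start
    return [sum(f(padded[i], padded[i + 1], x, i) for i in range(len(x)))
            for f in local_features]

def score_inner_product(y, x):
    """w . F(y, x), the exponent in (11.19)."""
    return sum(wk * fk for wk, fk in zip(w, global_features(y, x)))

def score_positionwise(y, x):
    """The exponent of (11.10): weighted local features summed position by position."""
    padded = ["start"] + list(y)
    return sum(w[0] * t1(padded[i], padded[i + 1], x, i)
               + w[1] * s1(padded[i], padded[i + 1], x, i)
               for i in range(len(x)))
```

Both functions return the same value for any tag sequence, which is all the simplified form claims: it is a regrouping of the same sum.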

1.3 Matrix form of the CRF

A CRF can also be expressed with matrices, and in an actual implementation we will certainly need matrix operations. Suppose $P_w(y|x)$ is the linear-chain CRF given by (11.15) and (11.16), i.e. the conditional probability of the label sequence $y$ given the observation sequence $x$. Introduce special start and stop labels $y_0=start$ and $y_{n+1}=stop$; then $P_w(y|x)$ can be written in matrix form.
For each position $i=1,2,\dots,n+1$ of the observation sequence $x$, define a matrix of order $m$ (where $m$ is the number of values the label $y_i$ can take):

$$M_i(x)=\left[M_i(y_{i-1},y_i|x)\right]\qquad(11.21)$$

$$M_i(y_{i-1},y_i|x)=\exp\left(W_i(y_{i-1},y_i|x)\right)\qquad(11.22)$$

$$W_i(y_{i-1},y_i|x)=\sum_{k=1}^{K}w_k f_k(y_{i-1},y_i,x,i)\qquad(11.23)$$

Given the observation sequence $x$, the unnormalized probability of a label sequence $y$ is the product of the appropriate entries of these $n+1$ matrices, $\prod_{i=1}^{n+1}M_i(y_{i-1},y_i|x)$. The conditional probability $P_w(y|x)$ is therefore:

$$P_w(y|x)=\frac{1}{Z_w(x)}\prod_{i=1}^{n+1}M_i(y_{i-1},y_i|x)\qquad(11.24)$$

where the normalization factor $Z_w(x)$ is the $(start,stop)$ entry of the product of the $n+1$ matrices:

$$Z_w(x)=\left(M_1(x)M_2(x)\cdots M_{n+1}(x)\right)_{start,stop}$$

Note that $y_0=start$ and $y_{n+1}=stop$ denote the start and stop states. The normalization factor $Z_w(x)$ is the sum of the unnormalized probabilities $\prod_{i=1}^{n+1}M_i(y_{i-1},y_i|x)$ over all paths $y_1y_2\dots y_n$ that begin at $start$ and end at $stop$. These paths are also tied to the value of $x$, because $x$ determines the score of every label at every position.
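The claim that $Z_w(x)$ is both an entry of a matrix product and a sum over paths can be checked numerically. The sketch below uses random toy scores in place of $W_i(y_{i-1},y_i|x)$, and it assumes, purely for convenience, that the start and stop labels reuse index 0 of the $m\times m$ matrices. It is an illustration of the identity, not how a real CRF library stores its parameters.

```python
import itertools
import math
import random

random.seed(0)
m, n = 3, 4          # hypothetical sizes: 3 labels, 4 positions
START, STOP = 0, 0   # assumption: start and stop reuse index 0 of the m x m matrices

# W[i - 1][a][b] plays the role of W_i(y_{i-1}=a, y_i=b | x); random toy scores.
W = [[[random.uniform(-1, 1) for _ in range(m)] for _ in range(m)] for _ in range(n + 1)]
# M_i(y_{i-1}, y_i | x) = exp(W_i(y_{i-1}, y_i | x)), as in (11.22).
M = [[[math.exp(score) for score in row] for row in W_i] for W_i in W]

def matmul(A, B):
    """Plain m x m matrix product."""
    return [[sum(A[r][k] * B[k][c] for k in range(m)) for c in range(m)] for r in range(m)]

def z_by_matrix_product():
    """(start, stop) entry of M_1 M_2 ... M_{n+1}."""
    product = M[0]
    for M_i in M[1:]:
        product = matmul(product, M_i)
    return product[START][STOP]

def z_by_enumeration():
    """Sum of unnormalized probabilities over every path y_1 ... y_n."""
    total = 0.0
    for y in itertools.product(range(m), repeat=n):
        path = (START,) + y + (STOP,)
        total += math.prod(M[i][path[i]][path[i + 1]] for i in range(n + 1))
    return total
```

The enumeration visits $m^n$ paths while the matrix product needs only $n$ multiplications of $m\times m$ matrices, which is why implementations use the latter (usually in log space).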

Next, we connect sections 1.1 and 1.3 and show that the two forms agree. Start from

$$P(y|x)=\frac{1}{Z(x)}\exp\left(\sum_{i,k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_l s_l(y_i,x,i)\right)$$

and consider only the unnormalized term:

$$\exp\left(\sum_{i,k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_l s_l(y_i,x,i)\right)$$

$$=\exp\left(\sum_{i}\left(\sum_{k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{l}\mu_l s_l(y_i,x,i)\right)\right)$$

$$=\prod_i\exp\left(\sum_{k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{l}\mu_l s_l(y_i,x,i)\right)$$

$$=\prod_i\exp\left(\sum_{k=1}^{K}w_k f_k(y_{i-1},y_i,x,i)\right)$$

$$=\prod_i\exp\left(W_i(y_{i-1},y_i|x)\right)=\prod_i M_i(y_{i-1},y_i|x)$$

The step from the second line to the third uses the fact that the exponential of a sum is a product of exponentials; the step from the third to the fourth uses the unified notation of section 1.2; the rest applies the definition of $M$ directly. The last expression is a product of matrix entries along the path $y$, not a product of the matrices themselves. The derivation shows that summing all the scores first and then taking exp gives the same answer as taking exp first and then multiplying, so an implementation can use either approach.

2. The forward-backward algorithm

A conditional random field (CRF) is completely determined by the feature functions $t_k,s_l$ and their weights $\lambda_k,\mu_l$. We use the forward-backward algorithm to compute the log-likelihood of the true label sequence given the input sequence, and then maximize this value to update the parameters of the features and weights above; that is how the model learns. Once the parameters are learned, for a given input sequence we can use the Viterbi algorithm to find the highest-scoring label sequence under the current parameters.
This section covers the forward-backward algorithm, which is central to learning.
For each index $i=0,1,\dots,n+1$ (including start and stop), define the forward vector $\alpha_i(x)$:

$$\alpha_0(y|x)=\begin{cases} 1, & y=start \\ 0, & \text{otherwise} \end{cases}\qquad(11.26)$$

The recurrence is:

$$\alpha_i^T(y_i|x)=\alpha_{i-1}^T(y_{i-1}|x)\left[M_i(y_{i-1},y_i|x)\right],\quad i=1,2,\dots,n+1\qquad(11.27)$$

which can also be written as:

$$\alpha_i^T(x)=\alpha_{i-1}^T(x)M_i(x)\qquad(11.28)$$

$\alpha_i(y_i|x)$ is the unnormalized probability that the label at position $i$ is $y_i$ together with the partial label sequence up to position $i$. Since $y_i$ can take $m$ values, $\alpha_i(x)$ is an $m$-dimensional column vector. To understand the recurrence better, we can unroll the first few $\alpha$, expanding $M$ accordingly. In the expressions below, each bracket denotes the $m\times m$ matrix whose $(y_{i-1},y_i)$ entry is the quantity inside it:

$$\alpha_1^T(x)=\alpha_0^T(x)M_1(x)=\alpha_0^T(x)\left[\exp\left(\sum_{k=1}^{K}w_k f_k(y_0,y_1,x,1)\right)\right]$$

$$\alpha_2^T(x)=\alpha_0^T(x)\left[\exp\left(\sum_{k=1}^{K}w_k f_k(y_0,y_1,x,1)\right)\right]\left[\exp\left(\sum_{k=1}^{K}w_k f_k(y_1,y_2,x,2)\right)\right]$$

$$\cdots$$

$$\alpha_i^T(x)=\alpha_0^T(x)\left[\exp\left(\sum_{k=1}^{K}w_k f_k(y_0,y_1,x,1)\right)\right]\cdots\left[\exp\left(\sum_{k=1}^{K}w_k f_k(y_{i-1},y_i,x,i)\right)\right]$$

Note that the product here is a product of exponentials; along any single path it is equivalent to summing the scores first and then taking exp.
Similarly, for each index $i=0,1,\dots,n+1$, define the backward vector $\beta_i(x)$:

$$\beta_{n+1}(y_{n+1}|x)=\begin{cases} 1, & y_{n+1}=stop \\ 0, & \text{otherwise} \end{cases}\qquad(11.29)$$

$$\beta_i(y_i|x)=\left[M_{i+1}(y_i,y_{i+1}|x)\right]\beta_{i+1}(y_{i+1}|x)$$

which can also be written as

$$\beta_i(x)=M_{i+1}(x)\beta_{i+1}(x)$$

$\beta_i(y_i|x)$ is the unnormalized probability that the label at position $i$ is $y_i$ together with the partial label sequence from position $i+1$ to $n$.
From the forward and backward definitions it is not hard to obtain:

$$Z(x)=\alpha_n^T(x)\cdot\mathbf{1}=\mathbf{1}^T\cdot\beta_1(x)$$

Here $\mathbf{1}$ is the $m$-dimensional column vector whose entries are all 1. Strictly speaking, with the start and stop matrices defined above, $Z(x)=\alpha_{n+1}(stop|x)=\beta_0(start|x)$; the textbook form shown here coincides with it when the start and stop transitions each contribute a factor of 1.
As you can see, the forward and backward algorithms are essentially the same and serve the same purpose; only the direction differs.
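The two recurrences can be sketched in a few lines of plain Python. As before, the matrices hold random toy values, and mapping both start and stop to index 0 is an assumption made only to keep every $M_i$ square. The sketch reads $Z(x)$ off as $\alpha_{n+1}(stop)$ and as $\beta_0(start)$.

```python
import math
import random

random.seed(1)
m, n = 3, 4          # hypothetical sizes: 3 labels, 4 positions
START, STOP = 0, 0   # assumption: start and stop reuse index 0

# M[i - 1] plays the role of M_i(x) for i = 1 .. n+1 (random positive toy entries).
M = [[[math.exp(random.uniform(-1, 1)) for _ in range(m)] for _ in range(m)]
     for _ in range(n + 1)]

def forward():
    """Return [alpha_0, ..., alpha_{n+1}] using alpha_i^T = alpha_{i-1}^T M_i (11.28)."""
    alpha = [1.0 if label == START else 0.0 for label in range(m)]  # (11.26)
    alphas = [alpha]
    for M_i in M:
        alpha = [sum(alpha[a] * M_i[a][b] for a in range(m)) for b in range(m)]
        alphas.append(alpha)
    return alphas

def backward():
    """Return [beta_0, ..., beta_{n+1}] using beta_i = M_{i+1} beta_{i+1}."""
    beta = [1.0 if label == STOP else 0.0 for label in range(m)]    # (11.29)
    betas = [beta]
    for M_next in reversed(M):
        beta = [sum(M_next[a][b] * beta[b] for b in range(m)) for a in range(m)]
        betas.append(beta)
    betas.reverse()
    return betas

alphas, betas = forward(), backward()
z_forward = alphas[n + 1][STOP]   # alpha_{n+1}(stop)
z_backward = betas[0][START]      # beta_0(start)
```

A useful sanity check is that $\sum_{y}\alpha_i(y|x)\beta_i(y|x)$ equals $Z(x)$ at every position $i$, since each term collects exactly the paths that pass through label $y$ at position $i$.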

3. The CRF optimization problem

3.1 Probability of the correct sequence

We take bi-LSTM + CRF as the example. Suppose the input is:

$$X=(x_1,x_2,\dots,x_n)$$

Let $P$ be the matrix of scores predicted by the bi-LSTM for every label at every position. Its size is $n\times k$, where $k$ is the number of distinct labels, and $P_{i,j}$ is the score of the $j$-th label for the $i$-th word of the sentence. Suppose the label sequence predicted for the sentence is:

$$y=(y_1,y_2,\dots,y_n)$$

We define its score as:

$$s(X,y)=\sum_{i=1}^{n}A_{y_{i-1},y_i}+\sum_{i=1}^{n}P_{i,y_i},\quad\text{where } y_0 \text{ is the start label}$$

You may wonder why this looks slightly different from (11.10) in section 1.1. The extra exp in the numerator of (11.10) is there because (11.10) is a softmax; as far as the numerator is concerned, the two are essentially the same. Here $A_{y_{i-1},y_i}$ corresponds to the transition feature, and there is only one transition feature (that is, $K_1=1$ in the notation of section 1.2); $P_{i,y_i}$ is the state feature, and there is only one of those as well ($K_2=1$). $A$ is the label transition score matrix, i.e. the score of moving from one label to another, and it is a parameter to be learned. We usually add two extra labels to $A$, $start$ and $end$ (also called $stop$), corresponding to $y_0$ and $y_{n+1}$.
Our goal is to make the overall score of the target label sequence as large as possible. Expressed with a softmax:

$$p(y|X)=\frac{e^{s(X,y)}}{\sum_{\breve{y}\in Y_X}e^{s(X,\breve{y})}}$$

This expression is equivalent to (11.10), where $Y_X$ denotes the set of all label sequences that could be predicted for the input $X$. During training we maximize the log-probability of the correct label sequence:

$$\log(p(y|X))=s(X,y)-\log\left(\sum_{\breve{y}\in Y_X}e^{s(X,\breve{y})}\right)=s(X,y)-\operatorname{logadd}_{\breve{y}\in Y_X}s(X,\breve{y})$$

So computing this log-likelihood takes two parts: the first corresponds to the numerator and the second to the denominator. We would like to compute both iteratively.

3.2 Computing the log-likelihood

The computation has two parts. The first is the numerator term, $s(X,y)$; the second is the denominator term, $\log\left(\sum_{\breve{y}\in Y_X}e^{s(X,\breve{y})}\right)$.

3.2.1 The numerator

First, recall how the score $s(X,y)$ is computed:

$$s(X,y)=\sum_{i=1}^{n}A_{y_{i-1},y_i}+\sum_{i=1}^{n}P_{i,y_i}$$

In the implementation we iterate along the positions of the sentence, in the same spirit as the forward algorithm. Writing $S_i$ for the partial score up to position $i$ (so that $S_n=s(X,y)$), the result of each step is:

$$S_1=\sum_{i=1}^{1}A_{y_{i-1},y_i}+\sum_{i=1}^{1}P_{i,y_i}=A_{y_0,y_1}+P_{1,y_1}$$

$$S_2=\sum_{i=1}^{2}A_{y_{i-1},y_i}+\sum_{i=1}^{2}P_{i,y_i}=A_{y_0,y_1}+A_{y_1,y_2}+P_{1,y_1}+P_{2,y_2}=S_1+A_{y_1,y_2}+P_{2,y_2}$$

$$\cdots$$

$$S_n=\sum_{i=1}^{n}A_{y_{i-1},y_i}+\sum_{i=1}^{n}P_{i,y_i}=S_{n-1}+A_{y_{n-1},y_n}+P_{n,y_n}$$

So when iterating over position $i$ of the sentence, we only need to keep track of the three quantities $S_{i-1}$, $A$ and $P$ to compute the numerator.
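The running sum can be sketched directly. The transition scores `A`, the start scores `A_start` and the emission scores `P` below are made-up numbers for a two-label, three-word toy case (positions are 0-based in the code); they are not taken from any trained model.

```python
# Hypothetical toy scores: k = 2 labels, n = 3 positions (0-based indices in code).
A = [[0.5, -0.2],
     [0.1, 0.3]]          # A[a][b]: transition score from label a to label b
A_start = [0.2, -0.1]     # transition score from the start label to each label
P = [[1.0, 0.0],
     [0.2, 0.7],
     [-0.3, 0.4]]         # P[i][t]: emission score of label t at position i

def path_score(y):
    """Return (S_n, [S_1, ..., S_n]) using S_i = S_{i-1} + A[y_{i-1}][y_i] + P[i][y_i]."""
    running = 0.0
    partial = []
    previous = None       # None stands for the start label y_0
    for i, label in enumerate(y):
        transition = A_start[label] if previous is None else A[previous][label]
        running = running + transition + P[i][label]
        partial.append(running)
        previous = label
    return running, partial

total, partial = path_score([0, 1, 1])
```

For the path `[0, 1, 1]` the partial sums are 1.2, 1.7 and 2.4: each step adds one transition score and one emission score to the previous total.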

3.2.2 The denominator

The denominator is somewhat harder to compute, and it also needs a per-step recurrence. Note that in this section $Z(X)$ denotes the log of the normalizer. It is defined as:

$$Z(X)=\log\left(\sum_{\breve{y}\in Y_X}e^{s(X,\breve{y})}\right)=\log\left(\sum_{\breve{y}\in Y_X}\exp\left(\sum_{i=1}^{n}A_{\breve{y}_{i-1},\breve{y}_i}+\sum_{i=1}^{n}P_{i,\breve{y}_i}\right)\right)$$
We again unroll it position by position. Because the transition score $A_{\breve{y}_{i-1},\breve{y}_i}$ couples neighboring positions, the running quantity has to be kept separately for each label of the current position. So $Z_i$ is a vector with one entry $Z_i(t)$ per label $t$; this is what the `alpha` tensor of shape `(batch_size, num_tags)` holds in the allennlp code of section 5. Here $tags$ denotes the set of all labels and $\breve{y}_i$ denotes an arbitrary label at position $i$.

At position 1 the previous label is fixed to $\breve{y}_0=start$, so for each $t\in tags$:

$$Z_1(t)=\log\exp\left(A_{\breve{y}_0,t}+P_{1,t}\right)=A_{\breve{y}_0,t}+P_{1,t}\qquad(3.1)$$

At position 2 we sum over every label $\breve{y}_1$ of position 1:

$$Z_2(t)=\log\left(\sum_{\breve{y}_1\in tags}\exp\left(A_{\breve{y}_0,\breve{y}_1}+A_{\breve{y}_1,t}+P_{1,\breve{y}_1}+P_{2,t}\right)\right)$$

$$=\log\left(\sum_{\breve{y}_1\in tags}\exp\left(A_{\breve{y}_0,\breve{y}_1}+P_{1,\breve{y}_1}\right)\exp\left(A_{\breve{y}_1,t}+P_{2,t}\right)\right)\qquad(3.2)$$

The sum cannot be split into a product of two independent sums, because $A_{\breve{y}_1,t}$ depends on both labels. What we can do is reuse the previous step. From (3.1),

$$\exp\left(Z_1(\breve{y}_1)\right)=\exp\left(A_{\breve{y}_0,\breve{y}_1}+P_{1,\breve{y}_1}\right)\qquad(3.4)$$

Substituting this into the first factor of (3.2) gives:

$$Z_2(t)=\log\left(\sum_{\breve{y}_1\in tags}\exp\left(Z_1(\breve{y}_1)\right)\exp\left(A_{\breve{y}_1,t}+P_{2,t}\right)\right)=\log\left(\sum_{\breve{y}_1\in tags}\exp\left(Z_1(\breve{y}_1)+A_{\breve{y}_1,t}+P_{2,t}\right)\right)$$

$Z_1$ can be moved inside the exponential because it has already been computed at this point and is a known value.
Generalizing:

$$Z_n(t)=\log\left(\sum_{\breve{y}_{n-1}\in tags}\exp\left(Z_{n-1}(\breve{y}_{n-1})+A_{\breve{y}_{n-1},t}+P_{n,t}\right)\right)$$

and the denominator is obtained by a final log-sum-exp over the label of the last position:

$$Z(X)=\log\left(\sum_{t\in tags}\exp\left(Z_n(t)\right)\right)$$

This gives us the recurrence: as we iterate over the positions of the sentence, we only need to keep track of $Z_{i-1}$, $A$ and $P$ to compute the denominator.
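The recurrence for the denominator can be checked against brute-force enumeration on a tiny case. The sketch below keeps one running value per label (the same role `alpha` plays in the allennlp code) and works in log space with a small `logsumexp` helper written here for the purpose. The scores are the same made-up toy numbers as before, and, following the score defined in section 3.1, there is no end transition.

```python
import itertools
import math

# Hypothetical toy scores: k = 2 labels, n = 3 positions (0-based indices in code).
A = [[0.5, -0.2],
     [0.1, 0.3]]          # A[a][b]: transition score from label a to label b
A_start = [0.2, -0.1]     # transition score from the start label to each label
P = [[1.0, 0.0],
     [0.2, 0.7],
     [-0.3, 0.4]]         # P[i][t]: emission score of label t at position i

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))

def log_partition_forward():
    """Forward recursion in log space, one running value per label."""
    k = len(A_start)
    z = [A_start[t] + P[0][t] for t in range(k)]            # Z_1(t)
    for i in range(1, len(P)):
        z = [logsumexp([z[prev] + A[prev][t] + P[i][t] for prev in range(k)])
             for t in range(k)]                             # Z_i(t) from Z_{i-1}
    return logsumexp(z)                                     # final sum over the last label

def log_partition_bruteforce():
    """Enumerate all k**n label sequences and log-sum-exp their scores."""
    k, n = len(A_start), len(P)
    scores = []
    for y in itertools.product(range(k), repeat=n):
        s = A_start[y[0]] + P[0][y[0]]
        for i in range(1, n):
            s += A[y[i - 1]][y[i]] + P[i][y[i]]
        scores.append(s)
    return logsumexp(scores)
```

The forward version costs on the order of $n\cdot k^2$ operations, while enumeration costs $k^n$, which is the whole reason for deriving the recurrence.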

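4. Learning algorithm for the CRF

Gradient descent is the usual choice, and frameworks such as tensorflow and pytorch provide support for it.

5. Reading the source code

Below is a walkthrough of the CRF source code provided in allennlp. The code is as follows: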

    from typing import Dict, List, Tuple

    import torch

    from allennlp.common.checks import ConfigurationError
    import allennlp.nn.util as util


    def allowed_transitions(constraint_type: str, tokens: Dict[int, str]) -> List[Tuple[int, int]]:
        """
        Given tokens and a constraint type, returns the allowed transitions. It will
        additionally include transitions for the start and end states, which are used
        by the conditional random field.

        Parameters
        ----------
        constraint_type : ``str``, required
            Indicates which constraint to apply. Current choices are "BIO" and "BIOUL".
        tokens : ``Dict[int, str]``, required
            A mapping {token_id -> token}. Most commonly this would be the value from
            Vocabulary.get_index_to_token_vocabulary()
            (Note: these should be the label tokens, i.e. an id -> tag mapping like idx2tag.)

        Returns
        -------
        ``List[Tuple[int, int]]``
            The allowed transitions (from_token_id, to_token_id).

        (Note: this function prepares all permitted tag-to-tag transitions in advance,
        which rules out many impossible cases.)
        """
        # Count the tags first; the start and end tags are numbered after them.
        n_tags = len(tokens)
        start_tag = n_tags
        end_tag = n_tags + 1

        allowed = []

        # begin, inside, other, unique, last
        if constraint_type == "BIOUL":
            for i, (from_bioul, *from_entity) in tokens.items():
                for j, (to_bioul, *to_entity) in tokens.items():
                    # Prepare the possible transitions in advance, presumably so that
                    # Viterbi does not have to consider every transition.
                    is_allowed = any([
                            # O can transition to O, B-* or U-*
                            # L-x can transition to O, B-*, or U-*
                            # U-x can transition to O, B-*, or U-*
                            from_bioul in ('O', 'L', 'U') and to_bioul in ('O', 'B', 'U'),
                            # B-x can only transition to I-x or L-x
                            # I-x can only transition to I-x or L-x
                            from_bioul in ('B', 'I') and to_bioul in ('I', 'L') and from_entity == to_entity
                    ])
                    if is_allowed:
                        allowed.append((i, j))

            # start transitions: start can move to other, begin, unique
            for i, (to_bioul, *to_entity) in tokens.items():
                if to_bioul in ('O', 'B', 'U'):
                    allowed.append((start_tag, i))

            # end transitions: other, unique, last can move to end
            for i, (from_bioul, *from_entity) in tokens.items():
                if from_bioul in ('O', 'L', 'U'):
                    allowed.append((i, end_tag))

        # begin, inside, other
        elif constraint_type == "BIO":
            for i, (from_bio, *from_entity) in tokens.items():
                for j, (to_bio, *to_entity) in tokens.items():
                    is_allowed = any([
                            # Can always transition to O or B-x
                            to_bio in ('O', 'B'),
                            # Can only transition to I-x from B-x or I-x
                            to_bio == 'I' and from_bio in ('B', 'I') and from_entity == to_entity
                    ])
                    if is_allowed:
                        allowed.append((i, j))

            # start transitions: tags that may follow the start tag
            for i, (to_bio, *to_entity) in tokens.items():
                if to_bio in ('O', 'B'):
                    allowed.append((start_tag, i))

            # end transitions: tags that may precede the end tag
            for i, (from_bio, *from_entity) in tokens.items():
                if from_bio in ('O', 'B', 'I'):
                    allowed.append((i, end_tag))

        else:
            raise ConfigurationError(f"Unknown constraint type: {constraint_type}")

        return allowed


    class ConditionalRandomField(torch.nn.Module):
        """
        This module uses the "forward-backward" algorithm to compute
        the log-likelihood of its inputs assuming a conditional random field model.

        See, e.g. http://www.cs.columbia.edu/~mcollins/fb.pdf

        Parameters
        ----------
        num_tags : int, required
            The number of tags.
        constraints : List[Tuple[int, int]], optional (default: None)
            An optional list of allowed transitions (from_tag_id, to_tag_id).
            These are applied to ``viterbi_tags()`` but do not affect ``forward()``.
            These should be derived from `allowed_transitions` so that the
            start and end transitions are handled correctly for your tag type.
        include_start_end_transitions : bool, optional (default: True)
            Whether to include the start and end transition parameters.
        """
        def __init__(self,
                     num_tags: int,
                     constraints: List[Tuple[int, int]] = None,
                     include_start_end_transitions: bool = True) -> None:
            super().__init__()
            self.num_tags = num_tags

            # transitions[i, j] is the logit for transitioning from state i to state j.
            self.transitions = torch.nn.Parameter(torch.Tensor(num_tags, num_tags))

            # _constraint_mask indicates valid transitions (based on supplied constraints).
            # Include special start of sequence (num_tags + 1) and end of sequence tags (num_tags + 2)
            if constraints is None:
                # All transitions are valid.
                constraint_mask = torch.Tensor(num_tags + 2, num_tags + 2).fill_(1.)
            else:
                constraint_mask = torch.Tensor(num_tags + 2, num_tags + 2).fill_(0.)
                for i, j in constraints:
                    constraint_mask[i, j] = 1.

            self._constraint_mask = torch.nn.Parameter(constraint_mask, requires_grad=False)

            # Also need logits for transitioning from "start" state and to "end" state.
            self.include_start_end_transitions = include_start_end_transitions
            if include_start_end_transitions:
                self.start_transitions = torch.nn.Parameter(torch.Tensor(num_tags))
                self.end_transitions = torch.nn.Parameter(torch.Tensor(num_tags))

            self.reset_parameters()

        def reset_parameters(self):
            torch.nn.init.xavier_normal_(self.transitions)
            if self.include_start_end_transitions:
                torch.nn.init.normal_(self.start_transitions)
                torch.nn.init.normal_(self.end_transitions)

        def _input_likelihood(self, logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            """
            Computes the (batch_size,) denominator term for the log-likelihood, which is the
            sum of the likelihoods across all possible state sequences.

            logits: [batch_size, sequence_length, num_tags], the per-position tag scores
            for each sequence in the batch.
            This method computes the denominator of the log-likelihood.
            """
            batch_size, sequence_length, num_tags = logits.size()

            # Transpose batch size and sequence dimensions.
            # Both the forward and the backward algorithm advance one word position at a time.
            mask = mask.float().transpose(0, 1).contiguous()
            logits = logits.transpose(0, 1).contiguous()

            # Initial alpha is the (batch_size, num_tags) tensor of likelihoods combining the
            # transitions to the initial states and the logits for the first timestep.
            # The forward algorithm uses alpha; at every word position alpha is a
            # num_tags-dimensional vector.
            if self.include_start_end_transitions:
                alpha = self.start_transitions.view(1, num_tags) + logits[0]
            else:
                alpha = logits[0]

            # For each i we compute logits for the transitions from timestep i-1 to timestep i.
            # We do so in a (batch_size, num_tags, num_tags) tensor where the axes are
            # (instance, current_tag, next_tag).
            # Advance the forward algorithm one word position at a time.
            for i in range(1, sequence_length):
                # Each score is broadcast along the axis it does not depend on.
                # The emit scores are for time i ("next_tag") so we broadcast along the current_tag axis.
                emit_scores = logits[i].view(batch_size, 1, num_tags)
                # Transition scores are (current_tag, next_tag) so we broadcast along the instance axis.
                transition_scores = self.transitions.view(1, num_tags, num_tags)
                # Alpha is for the current_tag, so we broadcast along the next_tag axis.
                broadcast_alpha = alpha.view(batch_size, num_tags, 1)

                # Add all the scores together and logexp over the current_tag axis.
                # The tensors are never explicitly expanded to (batch_size, num_tags, num_tags);
                # broadcasting gives the same result.
                inner = broadcast_alpha + emit_scores + transition_scores

                # In valid positions (mask == 1) we want to take the logsumexp over the current_tag dimension
                # of ``inner``. Otherwise (mask == 0) we want to retain the previous alpha.
                alpha = (util.logsumexp(inner, 1) * mask[i].view(batch_size, 1) +
                         alpha * (1 - mask[i]).view(batch_size, 1))

            # Every sequence needs to end with a transition to the stop_tag.
            # At the end tag there is only a transition score and no emission score,
            # because logits only cover the actual words.
            if self.include_start_end_transitions:
                stops = alpha + self.end_transitions.view(1, num_tags)
            else:
                stops = alpha

            # Finally we log_sum_exp along the num_tags dim, result is (batch_size,)
            return util.logsumexp(stops)

        def _joint_likelihood(self,
                              logits: torch.Tensor,
                              tags: torch.Tensor,
                              mask: torch.LongTensor) -> torch.Tensor:
            """
            Computes the numerator term for the log-likelihood, which is just score(inputs, tags)

            logits: [batch_size, seq_len, num_tags], the score of each tag for each word
                of each sequence in the batch
            tags: [batch_size, seq_len], the true tag sequence of each sequence in the batch
            mask: [batch_size, seq_len], presumably marks the actual sequence lengths
            """
            batch_size, sequence_length, num_tags = logits.data.shape

            # Transpose batch size and sequence dimensions:
            # we need to advance along the word positions.
            logits = logits.transpose(0, 1).contiguous()
            mask = mask.float().transpose(0, 1).contiguous()
            tags = tags.transpose(0, 1).contiguous()

            # Start with the transition scores from start_tag to the first tag in each input
            if self.include_start_end_transitions:
                score = self.start_transitions.index_select(0, tags[0])
            else:
                score = 0.0

            # Broadcast the transition scores to one per batch element
            # (the transition matrix is the same for every element of the batch).
            broadcast_transitions = self.transitions.view(1, num_tags, num_tags).expand(batch_size, num_tags, num_tags)

            # Add up the scores for the observed transitions and all the inputs but the last
            for i in range(sequence_length - 1):
                # Each is shape (batch_size,)
                current_tag, next_tag = tags[i], tags[i+1]

                # The scores for transitioning from current_tag to next_tag
                transition_score = (
                        broadcast_transitions
                        # Choose the current_tag-th row for each input
                        .gather(1, current_tag.view(batch_size, 1, 1).expand(batch_size, 1, num_tags))
                        # Squeeze down to (batch_size, num_tags)
                        .squeeze(1)
                        # Then choose the next_tag-th column for each of those
                        .gather(1, next_tag.view(batch_size, 1))
                        # And squeeze down to (batch_size,)
                        .squeeze(1)
                )

                # The score for using current_tag
                emit_score = logits[i].gather(1, current_tag.view(batch_size, 1)).squeeze(1)

                # Include transition score if next element is unmasked,
                # input_score if this element is unmasked.
                # Scores are only accumulated where the tag is valid.
                score = score + transition_score * mask[i + 1] + emit_score * mask[i]

            # Transition from last state to "stop" state. To start with, we need to find the last tag
            # for each instance.
            # The mask gives the last position of each sequence in the batch, from which
            # the last tag of each sequence is taken.
            last_tag_index = mask.sum(0).long() - 1
            last_tags = tags.gather(0, last_tag_index.view(1, batch_size).expand(sequence_length, batch_size))

            # Is (sequence_length, batch_size), but all the columns are the same, so take the first.
            last_tags = last_tags[0]

            # Compute score of transitioning to `stop_tag` from each "last tag".
            if self.include_start_end_transitions:
                last_transition_score = self.end_transitions.index_select(0, last_tags)
            else:
                last_transition_score = 0.0

            # Add the last input if it's not masked.
            last_inputs = logits[-1]                                         # (batch_size, num_tags)
            last_input_score = last_inputs.gather(1, last_tags.view(-1, 1))  # (batch_size, 1)
            last_input_score = last_input_score.squeeze()                    # (batch_size,)

            score = score + last_transition_score + last_input_score * mask[-1]

            return score

        def forward(self,
                    inputs: torch.Tensor,
                    tags: torch.Tensor,
                    mask: torch.ByteTensor = None) -> torch.Tensor:
            """
            Computes the log likelihood.
            """
            # pylint: disable=arguments-differ
            if mask is None:
                mask = torch.ones(*tags.size(), dtype=torch.long)

            log_denominator = self._input_likelihood(inputs, mask)
            log_numerator = self._joint_likelihood(inputs, tags, mask)

            # The log-likelihood is the difference of the two.
            return torch.sum(log_numerator - log_denominator)
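The "BIO" branch of `allowed_transitions` is easier to follow on plain strings. The sketch below restates only that rule as a standalone helper. The name `is_allowed_bio` and the tag list are made up for this illustration and are not part of allennlp. It relies on the same unpacking trick as the source: `'B-PER'` unpacks into `'B'` and the remaining characters, which serve as the entity part.

```python
def is_allowed_bio(from_tag: str, to_tag: str) -> bool:
    """Hypothetical helper mirroring the "BIO" rule on tag strings."""
    from_bio, *from_entity = from_tag   # 'B-PER' -> 'B', ['-', 'P', 'E', 'R']
    to_bio, *to_entity = to_tag
    return any([
        # Can always transition to O or B-x
        to_bio in ('O', 'B'),
        # Can only transition to I-x from B-x or I-x
        to_bio == 'I' and from_bio in ('B', 'I') and from_entity == to_entity,
    ])

tags = ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']  # made-up tag set
allowed = [(a, b) for a in tags for b in tags if is_allowed_bio(a, b)]
```

With this tag set, `('B-PER', 'I-PER')` is allowed, while `('O', 'I-PER')` and `('B-PER', 'I-LOC')` are not, since an inside tag must continue an entity of the same type.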

The key part of the implementation is computing the numerator and the denominator of the log-likelihood. In the code, the numerator is accumulated in the grouping $(A_{start,y_0}),(A_{y_0,y_1},P_0),(A_{y_1,y_2},P_1),\dots,(A_{y_{n-1},y_n},P_{n-1}),(A_{y_n,end},P_n)$, while the denominator is accumulated in the grouping $(A_{start,y_0},P_0),(A_{y_0,y_1},P_1),\dots,(A_{y_{n-1},y_n},P_n),(A_{y_n,end})$. The two are equivalent, because both simply add up the same terms.
As for the Viterbi algorithm used for prediction, I will write about it some other time; writing blog posts is a lot of work!
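The equivalence of the two groupings is easy to confirm on a toy path. In the sketch below all scores, including the end transition scores `A_end`, are made-up numbers. The first function pairs each transition with the emission of the tag it leaves, as the numerator code does, and the second pairs it with the emission of the tag it enters, as the denominator code does.

```python
import itertools

# Hypothetical toy scores: 2 tags, 3 positions (0-based indices).
A = [[0.5, -0.2],
     [0.1, 0.3]]          # A[a][b]: transition score from tag a to tag b
A_start = [0.2, -0.1]     # start -> tag
A_end = [0.4, -0.6]       # tag -> end
P = [[1.0, 0.0],
     [0.2, 0.7],
     [-0.3, 0.4]]         # P[i][t]: emission score of tag t at position i

def score_numerator_grouping(y):
    """(A_start), then (A[y_i][y_{i+1}], P[i]) pairs, then (A_end, P[last])."""
    s = A_start[y[0]]
    for i in range(len(y) - 1):
        s += A[y[i]][y[i + 1]] + P[i][y[i]]
    s += A_end[y[-1]] + P[len(y) - 1][y[-1]]
    return s

def score_denominator_grouping(y):
    """(A_start, P[0]), then (A[y_{i-1}][y_i], P[i]) pairs, then (A_end)."""
    s = A_start[y[0]] + P[0][y[0]]
    for i in range(1, len(y)):
        s += A[y[i - 1]][y[i]] + P[i][y[i]]
    s += A_end[y[-1]]
    return s

all_paths = list(itertools.product(range(2), repeat=3))
```

Both functions return the same score for every one of the eight paths; only the order of the additions differs.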