1. Vanishing gradients

The gradient of the RNN's error at time step t with respect to W is:

\(\frac{\partial E_t}{\partial W}=\sum_{k=1}^{t}\frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}\) (Equation 1),

where \(h\) is the output of the hidden nodes, \(y_t\) is the network's output at time t, \(W\) is the hidden-to-hidden weight matrix, and \(\frac{\partial h_t}{\partial h_k}\) is the chain-rule expansion of the derivative over the interval [k, t]. This interval can be very long, which is what causes vanishing or exploding gradients. Unrolling \(\frac{\partial h_t}{\partial h_k}\) over time:

\(\frac{\partial h_t}{\partial h_k}=\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}=\prod_{j=k+1}^{t}W^T \times diag [\frac{\partial\sigma(h_{j-1})}{\partial h_{j-1}}]\)

What is this diag matrix? An example makes it clear. Suppose we want \(\frac{\partial h_5}{\partial h_4}\). Recall how \(h_5\) is obtained in the forward pass: \(h_5=W\sigma(h_4)+W^{hx}x_5\), so \(\frac{\partial h_5}{\partial h_4}=W\frac{\partial \sigma(h_4)}{\partial h_4}\) (whether \(W\) or \(W^T\) appears depends only on the layout convention). Note that \(\sigma(h_4)\) and \(h_4\) are both vectors (of dimension D), so \(\frac{\partial \sigma(h_4)}{\partial h_4}\) is a Jacobian matrix:

\(\frac{\partial \sigma(h_4)}{\partial h_4}=\begin{bmatrix} \frac{\partial\sigma_1(h_{41})}{\partial h_{41}}&\cdots&\frac{\partial\sigma_1(h_{41})}{\partial h_{4D}} \\ \vdots&\ddots&\vdots \\ \frac{\partial\sigma_D(h_{4D})}{\partial h_{41}}&\cdots&\frac{\partial\sigma_D(h_{4D})}{\partial h_{4D}}\end{bmatrix}\)

Clearly, all off-diagonal entries are 0, because the sigmoid (logistic) function \(\sigma\) is applied element-wise, so only the diagonal entries survive, which is exactly the diag matrix above.
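To make the diag term concrete, here is a minimal numerical sketch (toy dimension D = 4, random weights, hypothetical variable names) that checks \(\frac{\partial h_5}{\partial h_4}=W\,diag[\sigma'(h_4)]\) against a finite-difference Jacobian, assuming the recurrence \(h_t=W\sigma(h_{t-1})+W^{hx}x_t\) from the example:

```python
import numpy as np

np.random.seed(0)
D = 4
W = np.random.randn(D, D) * 0.5      # hidden-to-hidden weights
W_hx = np.random.randn(D, D) * 0.5   # input-to-hidden weights
x5 = np.random.randn(D)              # input at step 5
h4 = np.random.randn(D)              # hidden state at step 4

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def step(h_prev):
    # One forward step: h_5 = W sigma(h_4) + W^{hx} x_5
    return W @ sigmoid(h_prev) + W_hx @ x5

# Analytic Jacobian: W times the diagonal matrix of element-wise sigmoid derivatives.
s = sigmoid(h4)
J_analytic = W @ np.diag(s * (1.0 - s))

# Finite-difference Jacobian for comparison.
eps = 1e-6
J_numeric = np.zeros((D, D))
for j in range(D):
    e = np.zeros(D); e[j] = eps
    J_numeric[:, j] = (step(h4 + e) - step(h4 - e)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))  # True
```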

The rest of the derivation of the vanishing/exploding gradient is straightforward, so I will not repeat it here; see Equation (14) onward in http://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf.

2. When weights are shared (tied), the gradient of the tied weight = the sum of the gradients of the individual weights

An example makes this clear: suppose the forward pass is \(y=F[W_1f(W_2x)]\), with the weights \(W_1\) and \(W_2\) tied, and we want the gradient \(\frac{\partial y}{\partial W}\).

Method 1:

First compute the gradient with respect to the outer weight: \(\frac{\partial y}{\partial W_1} = F'[]f() \)

Then compute the gradient with respect to the inner weight: \(\frac{\partial y}{\partial W_2} = F'[](W_1f'()x) \)

Adding the two gives \(F'[]f()+F'[](W_1f'()x)=F'[](f()+W_1f'()x)\)

Since the weights \(W_1\) and \(W_2\) are tied (both equal \(W\)), this becomes \(F'[](f()+Wf'()x) = \frac{\partial y}{\partial W} \)

Method 2:

Now let's take a different approach: under the assumption that the weights \(W_1\) and \(W_2\) are tied (write both as \(W\), so \(y=F[Wf(Wx)]\)), compute the gradient directly:

\(\frac{\partial y}{\partial W} = F'[]\,\frac{\partial (Wf(Wx))}{\partial W} = F'[](f()+Wf'()x) \), where the product rule gives the two terms (one from the outer \(W\), one from the inner \(W\)).

As you can see, the two methods give the same result. So when a weight is shared, the gradient with respect to the shared weight equals the sum of the gradients with respect to the individual weight copies.
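If you prefer to check this numerically, here is a scalar sketch (hypothetical choices \(F=\tanh\) and \(f=\) sigmoid, finite differences instead of symbolic gradients) comparing the tied-weight gradient with the sum of the two individual gradients:

```python
import numpy as np

# y = F(W1 * f(W2 * x)); all numbers below are arbitrary toy values.
f = lambda z: 1.0 / (1.0 + np.exp(-z))
F = np.tanh

x, W = 0.7, 1.3
eps = 1e-6

def y_untied(W1, W2):
    return F(W1 * f(W2 * x))

# Gradients w.r.t. each copy separately, evaluated at W1 = W2 = W.
g1 = (y_untied(W + eps, W) - y_untied(W - eps, W)) / (2 * eps)
g2 = (y_untied(W, W + eps) - y_untied(W, W - eps)) / (2 * eps)

# Gradient w.r.t. the single tied weight (perturb both copies together).
g_tied = (y_untied(W + eps, W + eps) - y_untied(W - eps, W - eps)) / (2 * eps)

print(np.isclose(g1 + g2, g_tied))  # True
```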

3. How do LSTMs & Gated Recurrent Units avoid vanishing gradients?

To understand this, you will have to go through some math. The most accessible article on recurrent gradient problems, IMHO, is Pascanu's ICML 2013 paper [1].

A summary: vanishing/exploding gradients come from the repeated application of the recurrent weight matrix [2]. A spectral radius of the recurrent weight matrix greater than 1 makes exploding gradients possible (it is a necessary condition), while a spectral radius smaller than 1 makes them vanish (a sufficient condition).
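A toy illustration of the spectral-radius point (linear case only, i.e. the diag(σ') factor is ignored; the matrix size and the radii 0.9 / 1.1 are arbitrary choices):

```python
import numpy as np

# The long-range Jacobian is a 100-fold product of the recurrent matrix, so its
# norm decays when the spectral radius is below 1 and blows up when it is above 1.
np.random.seed(0)
D = 8
W = np.random.randn(D, D)
W /= np.max(np.abs(np.linalg.eigvals(W)))   # rescale W to spectral radius 1

for rho in (0.9, 1.1):
    J = np.eye(D)
    for _ in range(100):
        J = (rho * W).T @ J                 # one factor of the product per time step
    print(rho, np.linalg.norm(J))           # 0.9 -> tiny (vanishing), 1.1 -> huge (exploding)
```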

Now, if gradients vanish, that does not mean that all gradients vanish; only some of them do, and gradient information that is local in time will still be present. That means you might still have a non-zero gradient, but it will not contain long-term information. That's because some gradient g + 0 is still g. (In Equation 1 above, the terms are summed over k, so some terms being 0 does not drive the whole sum to 0.)

If gradients explode, all of them do. That is because some gradient g + infinity is infinity. (In Equation 1 above, the terms are summed, so a single infinite term makes the whole sum infinite.)

That is the reason why the LSTM does not protect you from exploding gradients: the LSTM also applies a recurrent weight matrix (the hidden state \(h_t = o_t \circ \tanh(c_t)\) is fed back through the gate weight matrices at the next step), not only the internal state-to-state connection \(c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t\). Successful LSTM applications typically use gradient clipping.
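For reference, a minimal sketch of gradient-norm clipping written out by hand in NumPy (the function below is a hypothetical stand-in for library helpers such as torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale the gradient arrays in place so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        for g in grads:
            g *= scale
    return total_norm

# Pretend these gradients came out of BPTT and exploded.
grads = [np.random.randn(4, 4) * 1e3, np.random.randn(4) * 1e3]
print(clip_grad_norm(grads, max_norm=5.0))             # large norm before clipping
print(np.sqrt(sum(np.sum(g ** 2) for g in grads)))     # ~5.0 after clipping
```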

The LSTM does overcome the vanishing gradient problem, though. That is because if you look at the derivative of the internal (cell) state at time T with respect to the internal state at T-1, there is no repeated weight application: the derivative is simply the value of the forget gate. To keep it from becoming zero, the forget gate needs to be initialised properly at the beginning (e.g. with a positive bias so it starts close to 1).

That makes it clear why the cell states can act as "a wormhole through time": they can bridge long time lags and then (if the time is right) "re-inject" the information into the other parts of the net by opening the output gate.
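To make the forget-gate claim concrete, here is a small sketch: with the cell update \(c_t=f_t\circ c_{t-1}+i_t\circ\tilde{c}_t\) and the gates held fixed (in a real LSTM they also depend on \(h_{t-1}\) and \(x_t\), which adds further paths), the Jacobian \(\partial c_t/\partial c_{t-1}\) is exactly \(diag(f_t)\), with no recurrent weight matrix in between. All values below are made up:

```python
import numpy as np

np.random.seed(0)
D = 4
f_gate = np.random.rand(D)        # forget gate values in (0, 1)
i_gate = np.random.rand(D)        # input gate values
c_tilde = np.random.randn(D)      # candidate cell state
c_prev = np.random.randn(D)       # previous cell state

def cell_update(c_prev):
    # c_t = f_t * c_{t-1} + i_t * c~_t (element-wise)
    return f_gate * c_prev + i_gate * c_tilde

# Finite-difference Jacobian of c_t with respect to c_{t-1}.
eps = 1e-6
J = np.zeros((D, D))
for j in range(D):
    e = np.zeros(D); e[j] = eps
    J[:, j] = (cell_update(c_prev + e) - cell_update(c_prev - e)) / (2 * eps)

print(np.allclose(J, np.diag(f_gate)))   # True: the derivative is just the forget gate
```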

[1] Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." arXiv preprint arXiv:1211.5063 (2012).

[2] It might also "vanish" due to saturating nonlinearities, but that is something that can also happen in shallow nets and can be overcome with more careful weight initialisation.

ref: Recursive Deep Learning for Natural Language Processing and Computer Vision.pdf

CS224D-3-note bp.pdf

To be continued...

Reposted from: https://www.cnblogs.com/congliu/p/4546634.html
