1. Vanishing gradients

The gradient of the RNN's error at time step t with respect to W is:

\(\frac{\partial E_t}{\partial W}=\sum_{k=1}^{t}\frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}\) (Equation 1),

where \(h\) is the output of the hidden nodes, \(y_t\) is the network's output at time t, \(W\) is the hidden-to-hidden weight matrix, and \(\frac{\partial h_t}{\partial h_k}\) is the chain-rule expansion of the derivative over the interval [k, t]. This interval can be very long, which is what causes vanishing or exploding gradients. Unrolling \(\frac{\partial h_t}{\partial h_k}\) over time:

\(\frac{\partial h_t}{\partial h_k}=\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}=\prod_{j=k+1}^{t}W^T \times diag [\frac{\partial\sigma(h_{j-1})}{\partial h_{j-1}}]\)

What is this diag matrix? An example makes it clear. Suppose we want \(\frac{\partial h_5}{\partial h_4}\). Recall how \(h_5\) is obtained in the forward pass: \(h_5=W\sigma(h_4)+W^{hx}x_5\), so \(\frac{\partial h_5}{\partial h_4}=W\frac{\partial \sigma(h_4)}{\partial h_4}\) (whether \(W\) or \(W^T\) appears depends only on the layout convention). Note that \(\sigma(h_4)\) and \(h_4\) are both vectors (of dimension D), so \(\frac{\partial \sigma(h_4)}{\partial h_4}\) is a Jacobian matrix:

\(\frac{\partial \sigma(h_4)}{\partial h_4}=\begin{bmatrix} \frac{\partial\sigma_1(h_{41})}{\partial h_{41}}&\cdots&\frac{\partial\sigma_1(h_{41})}{\partial h_{4D}} \\ \vdots&\ddots&\vdots \\ \frac{\partial\sigma_D(h_{4D})}{\partial h_{41}}&\cdots&\frac{\partial\sigma_D(h_{4D})}{\partial h_{4D}}\end{bmatrix}\)

Clearly, all off-diagonal entries are 0, because the sigmoid (logistic) function \(\sigma\) is applied element-wise, so only the diagonal entries survive, which is exactly the diag matrix above.
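To make the diag term concrete, here is a minimal numerical sketch (toy dimension D = 4, random weights, hypothetical variable names) that checks \(\frac{\partial h_5}{\partial h_4}=W\,diag[\sigma'(h_4)]\) against a finite-difference Jacobian, assuming the recurrence \(h_t=W\sigma(h_{t-1})+W^{hx}x_t\) from the example:

```python
import numpy as np

np.random.seed(0)
D = 4
W = np.random.randn(D, D) * 0.5      # hidden-to-hidden weights
W_hx = np.random.randn(D, D) * 0.5   # input-to-hidden weights
x5 = np.random.randn(D)              # input at step 5
h4 = np.random.randn(D)              # hidden state at step 4

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def step(h_prev):
    # One forward step: h_5 = W sigma(h_4) + W^{hx} x_5
    return W @ sigmoid(h_prev) + W_hx @ x5

# Analytic Jacobian: W times the diagonal matrix of element-wise sigmoid derivatives.
s = sigmoid(h4)
J_analytic = W @ np.diag(s * (1.0 - s))

# Finite-difference Jacobian for comparison.
eps = 1e-6
J_numeric = np.zeros((D, D))
for j in range(D):
    e = np.zeros(D); e[j] = eps
    J_numeric[:, j] = (step(h4 + e) - step(h4 - e)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))  # True
```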

The rest of the derivation of the vanishing/exploding gradient is straightforward, so I will not repeat it here; see Equation (14) onward in http://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf.

2. When weights are shared (tied), the gradient of the tied weight = the sum of the gradients of the individual weights

An example makes this clear: suppose the forward pass is \(y=F[W_1f(W_2x)]\), with the weights \(W_1\) and \(W_2\) tied, and we want the gradient \(\frac{\partial y}{\partial W}\).

Method 1:

First compute the gradient with respect to the outer weight: \(\frac{\partial y}{\partial W_1} = F'[]f() \)

Then compute the gradient with respect to the inner weight: \(\frac{\partial y}{\partial W_2} = F'[](W_1f'()x) \)

Adding the two gives \(F'[]f()+F'[](W_1f'()x)=F'[](f()+W_1f'()x)\)

Since the weights \(W_1\) and \(W_2\) are tied (both equal \(W\)), this becomes \(F'[](f()+Wf'()x) = \frac{\partial y}{\partial W} \)

Method 2:

Now let's take a different approach: under the assumption that the weights \(W_1\) and \(W_2\) are tied (write both as \(W\), so \(y=F[Wf(Wx)]\)), compute the gradient directly:

\(\frac{\partial y}{\partial W} = F'[]\,\frac{\partial (Wf(Wx))}{\partial W} = F'[](f()+Wf'()x) \), where the product rule gives the two terms (one from the outer \(W\), one from the inner \(W\)).

As you can see, the two methods give the same result. So when a weight is shared, the gradient with respect to the shared weight equals the sum of the gradients with respect to the individual weight copies.
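If you prefer to check this numerically, here is a scalar sketch (hypothetical choices \(F=\tanh\) and \(f=\) sigmoid, finite differences instead of symbolic gradients) comparing the tied-weight gradient with the sum of the two individual gradients:

```python
import numpy as np

# y = F(W1 * f(W2 * x)); all numbers below are arbitrary toy values.
f = lambda z: 1.0 / (1.0 + np.exp(-z))
F = np.tanh

x, W = 0.7, 1.3
eps = 1e-6

def y_untied(W1, W2):
    return F(W1 * f(W2 * x))

# Gradients w.r.t. each copy separately, evaluated at W1 = W2 = W.
g1 = (y_untied(W + eps, W) - y_untied(W - eps, W)) / (2 * eps)
g2 = (y_untied(W, W + eps) - y_untied(W, W - eps)) / (2 * eps)

# Gradient w.r.t. the single tied weight (perturb both copies together).
g_tied = (y_untied(W + eps, W + eps) - y_untied(W - eps, W - eps)) / (2 * eps)

print(np.isclose(g1 + g2, g_tied))  # True
```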

3. How do LSTMs & Gated Recurrent Units avoid vanishing gradients?

To understand this, you will have to go through some math. The most accessible article on recurrent gradient problems, IMHO, is Pascanu's ICML 2013 paper [1].

A summary: vanishing/exploding gradients come from the repeated application of the recurrent weight matrix [2]. A spectral radius of the recurrent weight matrix greater than 1 makes exploding gradients possible (it is a necessary condition), while a spectral radius smaller than 1 makes them vanish (a sufficient condition).
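A toy illustration of the spectral-radius point (linear case only, i.e. the diag(σ') factor is ignored; the matrix size and the radii 0.9 / 1.1 are arbitrary choices):

```python
import numpy as np

# The long-range Jacobian is a 100-fold product of the recurrent matrix, so its
# norm decays when the spectral radius is below 1 and blows up when it is above 1.
np.random.seed(0)
D = 8
W = np.random.randn(D, D)
W /= np.max(np.abs(np.linalg.eigvals(W)))   # rescale W to spectral radius 1

for rho in (0.9, 1.1):
    J = np.eye(D)
    for _ in range(100):
        J = (rho * W).T @ J                 # one factor of the product per time step
    print(rho, np.linalg.norm(J))           # 0.9 -> tiny (vanishing), 1.1 -> huge (exploding)
```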

Now, if gradients vanish, that does not mean that all gradients vanish; only some of them do, and gradient information that is local in time will still be present. That means you might still have a non-zero gradient, but it will not contain long-term information. That's because some gradient g + 0 is still g. (In Equation 1 above, the terms are summed over k, so some terms being 0 does not drive the whole sum to 0.)

If gradients explode, all of them do. That is because some gradient g + infinity is infinity. (In Equation 1 above, the terms are summed, so a single infinite term makes the whole sum infinite.)

That is the reason why the LSTM does not protect you from exploding gradients: the LSTM also applies a recurrent weight matrix (the hidden state \(h_t = o_t \circ \tanh(c_t)\) is fed back through the gate weight matrices at the next step), not only the internal state-to-state connection \(c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t\). Successful LSTM applications typically use gradient clipping.
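For reference, a minimal sketch of gradient-norm clipping written out by hand in NumPy (the function below is a hypothetical stand-in for library helpers such as torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale the gradient arrays in place so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        for g in grads:
            g *= scale
    return total_norm

# Pretend these gradients came out of BPTT and exploded.
grads = [np.random.randn(4, 4) * 1e3, np.random.randn(4) * 1e3]
print(clip_grad_norm(grads, max_norm=5.0))             # large norm before clipping
print(np.sqrt(sum(np.sum(g ** 2) for g in grads)))     # ~5.0 after clipping
```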

The LSTM does overcome the vanishing gradient problem, though. That is because if you look at the derivative of the internal (cell) state at time T with respect to the internal state at T-1, there is no repeated weight application: the derivative is simply the value of the forget gate. To keep it from becoming zero, the forget gate needs to be initialised properly at the beginning (e.g. with a positive bias so it starts close to 1).

That makes it clear why the cell states can act as "a wormhole through time": they can bridge long time lags and then (if the time is right) "re-inject" the information into the other parts of the net by opening the output gate.
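To make the forget-gate claim concrete, here is a small sketch: with the cell update \(c_t=f_t\circ c_{t-1}+i_t\circ\tilde{c}_t\) and the gates held fixed (in a real LSTM they also depend on \(h_{t-1}\) and \(x_t\), which adds further paths), the Jacobian \(\partial c_t/\partial c_{t-1}\) is exactly \(diag(f_t)\), with no recurrent weight matrix in between. All values below are made up:

```python
import numpy as np

np.random.seed(0)
D = 4
f_gate = np.random.rand(D)        # forget gate values in (0, 1)
i_gate = np.random.rand(D)        # input gate values
c_tilde = np.random.randn(D)      # candidate cell state
c_prev = np.random.randn(D)       # previous cell state

def cell_update(c_prev):
    # c_t = f_t * c_{t-1} + i_t * c~_t (element-wise)
    return f_gate * c_prev + i_gate * c_tilde

# Finite-difference Jacobian of c_t with respect to c_{t-1}.
eps = 1e-6
J = np.zeros((D, D))
for j in range(D):
    e = np.zeros(D); e[j] = eps
    J[:, j] = (cell_update(c_prev + e) - cell_update(c_prev - e)) / (2 * eps)

print(np.allclose(J, np.diag(f_gate)))   # True: the derivative is just the forget gate
```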

[1] Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." arXiv preprint arXiv:1211.5063 (2012).

[2] It might also "vanish" due to saturating nonlinearities, but that is something that can also happen in shallow nets and can be overcome with more careful weight initialisation.

ref: Recursive Deep Learning for Natural Language Processing and Computer Vision.pdf

CS224D-3-note bp.pdf

To be continued...

Reposted from: https://www.cnblogs.com/congliu/p/4546634.html
