Bottom line first: whether on machine translation, music datasets, or speech signal modeling, GRU and LSTM perform about the same, but GRU usually trains faster and converges in fewer epochs than LSTM.

Original post: http://blog.csdn.net/meanme/article/details/48845793

1. Overview:

Traditional RNNs run into many difficulties when trained on long-term dependencies, the most common being the vanishing gradient problem. Many remedies have been published, falling roughly into two classes: one improves on or replaces plain SGD, such as the gradient clipping proposed by Bengio's group; the other designs more sophisticated recurrent units, such as LSTM and GRU. The focus of this paper is comparing the performance of LSTM and GRU. Since the two units have already been compared on machine translation (where the difference in results is small), the experiments here are run on music datasets and on speech signal modeling.
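As an illustration of the first class of fixes, here is a minimal sketch of global gradient-norm clipping in plain NumPy (this is not code from the original post; the function name and the threshold of 5.0 are placeholder choices):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # grads: list of per-parameter gradient arrays
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        # rescale every gradient so the global norm equals max_norm
        grads = [g * (max_norm / (total_norm + 1e-8)) for g in grads]
    return grads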

2. LSTM vs. GRU:

1) LSTM:
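The diagram from the original post is not reproduced here. For reference, the standard LSTM update (notation chosen to match the Keras code in section 4) is:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$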

2) GRU:
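Likewise, the GRU figure is missing; the update used in the Keras code below (update gate $z$, reset gate $r$) is:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$

(Cho et al. write the last line with the roles of $z_t$ and $1 - z_t$ swapped; the two forms are equivalent up to relabeling the gate.)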

3) In a nutshell, both LSTM and GRU use gates to decide what to keep, so important features are preserved even over long-term propagation. A point that is easy to miss is why this also helps backpropagation: the state is updated additively (the previous state is carried through, scaled by a gate) instead of being pushed through a fresh weight matrix and nonlinearity at every step, so the gradient is much less prone to vanishing.
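Concretely, taking the LSTM cell state above, the direct path through the additive update contributes

$$\frac{\partial c_t}{\partial c_{t-1}} \approx f_t$$

(ignoring the indirect dependence of the gates on earlier states), so as long as the forget gate stays close to 1 the gradient can flow back over many steps without shrinking geometrically; the GRU's $z_t \odot h_{t-1}$ term plays the same role.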

3. Experimental results:

The experiments compare three units: the traditional tanh unit, LSTM, and GRU (the result figures are in the original post).

The difference between LSTM and GRU turns out to be small, but both are clearly much better than the tanh unit, so the choice between LSTM and GRU still comes down to the specific task and data.

In terms of convergence time and the number of epochs required, however, GRU tends to come out ahead (see the convergence plots in the original post).
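One plausible contributor (my own back-of-the-envelope note, not a claim from the paper) is that a GRU layer has three weight blocks (z, r, h) where an LSTM has four (i, f, c, o), so for the same hidden size it carries roughly 25% fewer parameters. Using the weight shapes from the Keras code in section 4:

def lstm_params(input_dim, output_dim):
    # 4 blocks (i, f, c, o), each with W (input_dim x output_dim),
    # U (output_dim x output_dim) and a bias of size output_dim
    return 4 * (input_dim * output_dim + output_dim * output_dim + output_dim)

def gru_params(input_dim, output_dim):
    # 3 blocks (z, r, h) with the same shapes
    return 3 * (input_dim * output_dim + output_dim * output_dim + output_dim)

print(lstm_params(128, 128))  # 131584
print(gru_params(128, 128))   # 98688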

4. Code (Keras):

1) LSTM:

class LSTM(Recurrent):
    '''
    Acts as a spatiotemporal projection,
    turning a sequence of vectors into a single vector.

    Eats inputs with shape:
    (nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)

    and returns outputs with shape:
    if not return_sequences:
        (nb_samples, output_dim)
    if return_sequences:
        (nb_samples, max_sample_length, output_dim)

    For a step-by-step description of the algorithm, see:
    http://deeplearning.net/tutorial/lstm.html

    References:
        Long short-term memory (original 97 paper)
            http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
        Learning to forget: Continual prediction with LSTM
            http://www.mitpressjournals.org/doi/pdf/10.1162/089976600300015015
        Supervised sequence labelling with recurrent neural networks
            http://www.cs.toronto.edu/~graves/preprint.pdf
    '''
    # Excerpt from keras/layers/recurrent.py (Keras 0.x, Theano backend).
    # Module-level imports are not part of the snippet; roughly, it relies on
    # theano, theano.tensor as T, keras.activations / keras.initializations,
    # shared_zeros and alloc_zeros_matrix from keras.utils.theano_utils,
    # with the Recurrent base class defined in the same module.
    def __init__(self, input_dim, output_dim=128,
                 init='glorot_uniform', inner_init='orthogonal', forget_bias_init='one',
                 activation='tanh', inner_activation='hard_sigmoid',
                 weights=None, truncate_gradient=-1, return_sequences=False):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.truncate_gradient = truncate_gradient
        self.return_sequences = return_sequences

        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.forget_bias_init = initializations.get(forget_bias_init)
        self.activation = activations.get(activation)
        self.inner_activation = activations.get(inner_activation)
        self.input = T.tensor3()

        # Four blocks of weights: input gate (i), forget gate (f),
        # candidate cell (c) and output gate (o).
        self.W_i = self.init((self.input_dim, self.output_dim))
        self.U_i = self.inner_init((self.output_dim, self.output_dim))
        self.b_i = shared_zeros((self.output_dim))

        self.W_f = self.init((self.input_dim, self.output_dim))
        self.U_f = self.inner_init((self.output_dim, self.output_dim))
        self.b_f = self.forget_bias_init((self.output_dim))

        self.W_c = self.init((self.input_dim, self.output_dim))
        self.U_c = self.inner_init((self.output_dim, self.output_dim))
        self.b_c = shared_zeros((self.output_dim))

        self.W_o = self.init((self.input_dim, self.output_dim))
        self.U_o = self.inner_init((self.output_dim, self.output_dim))
        self.b_o = shared_zeros((self.output_dim))

        self.params = [
            self.W_i, self.U_i, self.b_i,
            self.W_c, self.U_c, self.b_c,
            self.W_f, self.U_f, self.b_f,
            self.W_o, self.U_o, self.b_o,
        ]

        if weights is not None:
            self.set_weights(weights)

    def _step(self,
              xi_t, xf_t, xo_t, xc_t, mask_tm1,
              h_tm1, c_tm1,
              u_i, u_f, u_o, u_c):
        # One time step of the recurrence; the input projections x*_t are precomputed.
        h_mask_tm1 = mask_tm1 * h_tm1
        c_mask_tm1 = mask_tm1 * c_tm1

        i_t = self.inner_activation(xi_t + T.dot(h_mask_tm1, u_i))
        f_t = self.inner_activation(xf_t + T.dot(h_mask_tm1, u_f))
        c_t = f_t * c_mask_tm1 + i_t * self.activation(xc_t + T.dot(h_mask_tm1, u_c))
        o_t = self.inner_activation(xo_t + T.dot(h_mask_tm1, u_o))
        h_t = o_t * self.activation(c_t)
        return h_t, c_t

    def get_output(self, train=False):
        X = self.get_input(train)
        padded_mask = self.get_padded_shuffled_mask(train, X, pad=1)
        X = X.dimshuffle((1, 0, 2))  # to (time, samples, dim) for scan

        # Precompute the input-to-hidden projections for all time steps at once.
        xi = T.dot(X, self.W_i) + self.b_i
        xf = T.dot(X, self.W_f) + self.b_f
        xc = T.dot(X, self.W_c) + self.b_c
        xo = T.dot(X, self.W_o) + self.b_o

        [outputs, memories], updates = theano.scan(
            self._step,
            sequences=[xi, xf, xo, xc, padded_mask],
            outputs_info=[
                T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1),
                T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1)
            ],
            non_sequences=[self.U_i, self.U_f, self.U_o, self.U_c],
            truncate_gradient=self.truncate_gradient)

        if self.return_sequences:
            return outputs.dimshuffle((1, 0, 2))
        return outputs[-1]

    def get_config(self):
        return {"name": self.__class__.__name__,
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
                "init": self.init.__name__,
                "inner_init": self.inner_init.__name__,
                "forget_bias_init": self.forget_bias_init.__name__,
                "activation": self.activation.__name__,
                "inner_activation": self.inner_activation.__name__,
                "truncate_gradient": self.truncate_gradient,
                "return_sequences": self.return_sequences}
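For completeness, a hedged usage sketch in the Keras 0.x Sequential style of the time (layer sizes and loss are placeholders, not taken from the post; in this old API, layers take input_dim and output_dim explicitly):

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM

model = Sequential()
# input: (nb_samples, timesteps, 64) -> output: (nb_samples, 128), since return_sequences=False
model.add(LSTM(64, 128, return_sequences=False))
model.add(Dense(128, 1))            # old-style Dense(input_dim, output_dim)
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')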

2) GRU:

class GRU(Recurrent):
    '''
    Gated Recurrent Unit - Cho et al. 2014

    Acts as a spatiotemporal projection,
    turning a sequence of vectors into a single vector.

    Eats inputs with shape:
    (nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)

    and returns outputs with shape:
    if not return_sequences:
        (nb_samples, output_dim)
    if return_sequences:
        (nb_samples, max_sample_length, output_dim)

    References:
        On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
            http://www.aclweb.org/anthology/W14-4012
        Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
            http://arxiv.org/pdf/1412.3555v1.pdf
    '''
    # Same source file and module-level imports as the LSTM class above.
    def __init__(self, input_dim, output_dim=128,
                 init='glorot_uniform', inner_init='orthogonal',
                 activation='sigmoid', inner_activation='hard_sigmoid',
                 weights=None, truncate_gradient=-1, return_sequences=False):
        super(GRU, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.truncate_gradient = truncate_gradient
        self.return_sequences = return_sequences

        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.activation = activations.get(activation)
        self.inner_activation = activations.get(inner_activation)
        self.input = T.tensor3()

        # Three blocks of weights: update gate (z), reset gate (r) and candidate (h).
        self.W_z = self.init((self.input_dim, self.output_dim))
        self.U_z = self.inner_init((self.output_dim, self.output_dim))
        self.b_z = shared_zeros((self.output_dim))

        self.W_r = self.init((self.input_dim, self.output_dim))
        self.U_r = self.inner_init((self.output_dim, self.output_dim))
        self.b_r = shared_zeros((self.output_dim))

        self.W_h = self.init((self.input_dim, self.output_dim))
        self.U_h = self.inner_init((self.output_dim, self.output_dim))
        self.b_h = shared_zeros((self.output_dim))

        self.params = [
            self.W_z, self.U_z, self.b_z,
            self.W_r, self.U_r, self.b_r,
            self.W_h, self.U_h, self.b_h,
        ]

        if weights is not None:
            self.set_weights(weights)

    def _step(self,
              xz_t, xr_t, xh_t, mask_tm1,
              h_tm1,
              u_z, u_r, u_h):
        # One time step: there is only a hidden state h, no separate cell state.
        h_mask_tm1 = mask_tm1 * h_tm1
        z = self.inner_activation(xz_t + T.dot(h_mask_tm1, u_z))
        r = self.inner_activation(xr_t + T.dot(h_mask_tm1, u_r))
        hh_t = self.activation(xh_t + T.dot(r * h_mask_tm1, u_h))
        h_t = z * h_mask_tm1 + (1 - z) * hh_t
        return h_t

    def get_output(self, train=False):
        X = self.get_input(train)
        padded_mask = self.get_padded_shuffled_mask(train, X, pad=1)
        X = X.dimshuffle((1, 0, 2))  # to (time, samples, dim) for scan

        # Precompute the input-to-hidden projections for all time steps at once.
        x_z = T.dot(X, self.W_z) + self.b_z
        x_r = T.dot(X, self.W_r) + self.b_r
        x_h = T.dot(X, self.W_h) + self.b_h

        outputs, updates = theano.scan(
            self._step,
            sequences=[x_z, x_r, x_h, padded_mask],
            outputs_info=T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1),
            non_sequences=[self.U_z, self.U_r, self.U_h],
            truncate_gradient=self.truncate_gradient)

        if self.return_sequences:
            return outputs.dimshuffle((1, 0, 2))
        return outputs[-1]

    def get_config(self):
        return {"name": self.__class__.__name__,
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
                "init": self.init.__name__,
                "inner_init": self.inner_init.__name__,
                "activation": self.activation.__name__,
                "inner_activation": self.inner_activation.__name__,
                "truncate_gradient": self.truncate_gradient,
                "return_sequences": self.return_sequences}
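Compared with the LSTM above, the GRU keeps a single state h_t (no separate memory cell), uses three parameter blocks instead of four, and, as noted in section 2, this implementation lets z gate the previous state and (1 - z) the candidate. The Sequential usage sketch shown after the LSTM code applies unchanged with LSTM swapped for GRU.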
