LSTM vs. GRU: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Bottom line first: whether on machine translation, music datasets, or speech signal modeling, GRU and LSTM perform about equally well, but GRU usually trains in less time and converges faster than LSTM.
Source: http://blog.csdn.net/meanme/article/details/48845793
1. Overview:
Traditional RNNs run into serious trouble when trained on long-term dependencies, most commonly the vanishing gradient problem. Many fixes have been published, falling roughly into two classes: one improves on or replaces plain SGD, such as the gradient clipping proposed by Bengio's group; the other designs more sophisticated recurrent units, such as the LSTM and the GRU. This paper's focus is comparing the performance of the LSTM and the GRU. Since the two units have already been evaluated against each other on machine translation (with no clear winner; the performance gap is small), the paper experiments on music datasets and speech signal modeling instead.
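As a concrete illustration of the first family of fixes, here is a minimal gradient-norm clipping sketch in plain NumPy (the function name and threshold are illustrative, not from the paper):

import numpy as np

def clip_gradient_norm(grads, threshold=5.0):
    # Rescale the gradients so their global L2 norm does not exceed `threshold`.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > threshold:
        grads = [g * (threshold / global_norm) for g in grads]
    return grads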
2. LSTM and GRU:
1) LSTM:
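The standard LSTM update, written with the same gate names (W_i, U_i, b_i, ...) as the Keras code in section 4, is:

$$
\begin{aligned}
i_t &= \sigma(x_t W_i + h_{t-1} U_i + b_i) \\
f_t &= \sigma(x_t W_f + h_{t-1} U_f + b_f) \\
\tilde{c}_t &= \tanh(x_t W_c + h_{t-1} U_c + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(x_t W_o + h_{t-1} U_o + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the (hard) sigmoid gate activation and $\odot$ is elementwise multiplication; the input, forget, and output gates $i_t, f_t, o_t$ decide what to write into, keep in, and read out of the memory cell $c_t$.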
2) GRU:
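Likewise the GRU update, matching the Keras code in section 4:

$$
\begin{aligned}
z_t &= \sigma(x_t W_z + h_{t-1} U_z + b_z) \\
r_t &= \sigma(x_t W_r + h_{t-1} U_r + b_r) \\
\tilde{h}_t &= \phi(x_t W_h + (r_t \odot h_{t-1}) U_h + b_h) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
$$

The paper takes the candidate activation $\phi = \tanh$ and writes the interpolation with $z_t$ and $1 - z_t$ swapped, which is equivalent up to relabeling the update gate; note that the Keras version below defaults $\phi$ to sigmoid instead.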
3) In short, both the LSTM and the GRU use gates to retain important features, ensuring they are not lost during long-term propagation. A second, less obvious benefit is that the gated, additive updates make gradients less prone to vanishing during backpropagation:
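One way to see the backpropagation point (a sketch, not from the original post): the LSTM cell state is updated additively, so treating the gates as constants,

$$\frac{\partial c_t}{\partial c_{t-1}} \approx f_t,$$

and whenever the forget gate stays near 1, the error signal flows back through many steps largely undiminished, instead of being repeatedly squashed through $\tanh$ and multiplied by the same recurrent weight matrix as in a vanilla RNN. The GRU's $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$ provides the same additive shortcut through $z_t$.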
3. Experimental results:
The experiments compare three units: the traditional tanh unit, the LSTM, and the GRU.
The LSTM and GRU results differ little, but both are clearly much better than the tanh unit, so the choice between LSTM and GRU still comes down to the specific task and data.
In convergence time and the number of epochs required, though, the GRU seems to have the edge.
4. Code (Keras):
1) LSTM (excerpted from Keras 0.x with the Theano backend; the enclosing module supplies theano, T = theano.tensor, activations, initializations, shared_zeros, alloc_zeros_matrix, and the Recurrent base class):
class LSTM(Recurrent):
    '''
    Acts as a spatiotemporal projection,
    turning a sequence of vectors into a single vector.

    Eats inputs with shape:
    (nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)

    and returns outputs with shape:
    if not return_sequences:
        (nb_samples, output_dim)
    if return_sequences:
        (nb_samples, max_sample_length, output_dim)

    For a step-by-step description of the algorithm, see:
    http://deeplearning.net/tutorial/lstm.html

    References:
        Long short-term memory (original 97 paper)
            http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
        Learning to forget: Continual prediction with LSTM
            http://www.mitpressjournals.org/doi/pdf/10.1162/089976600300015015
        Supervised sequence labelling with recurrent neural networks
            http://www.cs.toronto.edu/~graves/preprint.pdf
    '''
    def __init__(self, input_dim, output_dim=128,
                 init='glorot_uniform', inner_init='orthogonal', forget_bias_init='one',
                 activation='tanh', inner_activation='hard_sigmoid',
                 weights=None, truncate_gradient=-1, return_sequences=False):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.truncate_gradient = truncate_gradient
        self.return_sequences = return_sequences

        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.forget_bias_init = initializations.get(forget_bias_init)
        self.activation = activations.get(activation)
        self.inner_activation = activations.get(inner_activation)
        self.input = T.tensor3()

        # one (W, U, b) triplet per gate: input (i), forget (f), cell (c), output (o)
        self.W_i = self.init((self.input_dim, self.output_dim))
        self.U_i = self.inner_init((self.output_dim, self.output_dim))
        self.b_i = shared_zeros((self.output_dim))

        self.W_f = self.init((self.input_dim, self.output_dim))
        self.U_f = self.inner_init((self.output_dim, self.output_dim))
        self.b_f = self.forget_bias_init((self.output_dim))

        self.W_c = self.init((self.input_dim, self.output_dim))
        self.U_c = self.inner_init((self.output_dim, self.output_dim))
        self.b_c = shared_zeros((self.output_dim))

        self.W_o = self.init((self.input_dim, self.output_dim))
        self.U_o = self.inner_init((self.output_dim, self.output_dim))
        self.b_o = shared_zeros((self.output_dim))

        self.params = [
            self.W_i, self.U_i, self.b_i,
            self.W_c, self.U_c, self.b_c,
            self.W_f, self.U_f, self.b_f,
            self.W_o, self.U_o, self.b_o,
        ]

        if weights is not None:
            self.set_weights(weights)

    def _step(self,
              xi_t, xf_t, xo_t, xc_t, mask_tm1,
              h_tm1, c_tm1,
              u_i, u_f, u_o, u_c):
        h_mask_tm1 = mask_tm1 * h_tm1
        c_mask_tm1 = mask_tm1 * c_tm1

        i_t = self.inner_activation(xi_t + T.dot(h_mask_tm1, u_i))
        f_t = self.inner_activation(xf_t + T.dot(h_mask_tm1, u_f))
        c_t = f_t * c_mask_tm1 + i_t * self.activation(xc_t + T.dot(h_mask_tm1, u_c))
        o_t = self.inner_activation(xo_t + T.dot(h_mask_tm1, u_o))
        h_t = o_t * self.activation(c_t)
        return h_t, c_t

    def get_output(self, train=False):
        X = self.get_input(train)
        padded_mask = self.get_padded_shuffled_mask(train, X, pad=1)
        X = X.dimshuffle((1, 0, 2))

        # precompute the input projections for all timesteps at once
        xi = T.dot(X, self.W_i) + self.b_i
        xf = T.dot(X, self.W_f) + self.b_f
        xc = T.dot(X, self.W_c) + self.b_c
        xo = T.dot(X, self.W_o) + self.b_o

        [outputs, memories], updates = theano.scan(
            self._step,
            sequences=[xi, xf, xo, xc, padded_mask],
            outputs_info=[
                T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1),
                T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1)
            ],
            non_sequences=[self.U_i, self.U_f, self.U_o, self.U_c],
            truncate_gradient=self.truncate_gradient)

        if self.return_sequences:
            return outputs.dimshuffle((1, 0, 2))
        return outputs[-1]

    def get_config(self):
        return {"name": self.__class__.__name__,
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
                "init": self.init.__name__,
                "inner_init": self.inner_init.__name__,
                "forget_bias_init": self.forget_bias_init.__name__,
                "activation": self.activation.__name__,
                "inner_activation": self.inner_activation.__name__,
                "truncate_gradient": self.truncate_gradient,
                "return_sequences": self.return_sequences}
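For reference, a minimal usage sketch in the same-era Keras 0.x Sequential API (vocabulary size and layer dimensions are illustrative; modern Keras has since replaced this interface):

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

model = Sequential()
model.add(Embedding(20000, 128))   # 20000-word vocabulary, 128-dim embeddings
model.add(LSTM(128, 128))          # input_dim=128, output_dim=128; returns the final state
model.add(Dense(128, 1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')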
2) GRU (from the same Keras 0.x module):
class GRU(Recurrent):
    '''
    Gated Recurrent Unit - Cho et al. 2014

    Acts as a spatiotemporal projection,
    turning a sequence of vectors into a single vector.

    Eats inputs with shape:
    (nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)

    and returns outputs with shape:
    if not return_sequences:
        (nb_samples, output_dim)
    if return_sequences:
        (nb_samples, max_sample_length, output_dim)

    References:
        On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
            http://www.aclweb.org/anthology/W14-4012
        Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
            http://arxiv.org/pdf/1412.3555v1.pdf
    '''
    def __init__(self, input_dim, output_dim=128,
                 init='glorot_uniform', inner_init='orthogonal',
                 activation='sigmoid', inner_activation='hard_sigmoid',
                 weights=None, truncate_gradient=-1, return_sequences=False):
        super(GRU, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.truncate_gradient = truncate_gradient
        self.return_sequences = return_sequences

        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.activation = activations.get(activation)
        self.inner_activation = activations.get(inner_activation)
        self.input = T.tensor3()

        # one (W, U, b) triplet each for the update gate (z), reset gate (r),
        # and candidate state (h)
        self.W_z = self.init((self.input_dim, self.output_dim))
        self.U_z = self.inner_init((self.output_dim, self.output_dim))
        self.b_z = shared_zeros((self.output_dim))

        self.W_r = self.init((self.input_dim, self.output_dim))
        self.U_r = self.inner_init((self.output_dim, self.output_dim))
        self.b_r = shared_zeros((self.output_dim))

        self.W_h = self.init((self.input_dim, self.output_dim))
        self.U_h = self.inner_init((self.output_dim, self.output_dim))
        self.b_h = shared_zeros((self.output_dim))

        self.params = [
            self.W_z, self.U_z, self.b_z,
            self.W_r, self.U_r, self.b_r,
            self.W_h, self.U_h, self.b_h,
        ]

        if weights is not None:
            self.set_weights(weights)

    def _step(self,
              xz_t, xr_t, xh_t, mask_tm1,
              h_tm1,
              u_z, u_r, u_h):
        h_mask_tm1 = mask_tm1 * h_tm1
        z = self.inner_activation(xz_t + T.dot(h_mask_tm1, u_z))
        r = self.inner_activation(xr_t + T.dot(h_mask_tm1, u_r))
        hh_t = self.activation(xh_t + T.dot(r * h_mask_tm1, u_h))
        h_t = z * h_mask_tm1 + (1 - z) * hh_t
        return h_t

    def get_output(self, train=False):
        X = self.get_input(train)
        padded_mask = self.get_padded_shuffled_mask(train, X, pad=1)
        X = X.dimshuffle((1, 0, 2))

        x_z = T.dot(X, self.W_z) + self.b_z
        x_r = T.dot(X, self.W_r) + self.b_r
        x_h = T.dot(X, self.W_h) + self.b_h

        outputs, updates = theano.scan(
            self._step,
            sequences=[x_z, x_r, x_h, padded_mask],
            outputs_info=T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1),
            non_sequences=[self.U_z, self.U_r, self.U_h],
            truncate_gradient=self.truncate_gradient)

        if self.return_sequences:
            return outputs.dimshuffle((1, 0, 2))
        return outputs[-1]

    def get_config(self):
        return {"name": self.__class__.__name__,
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
                "init": self.init.__name__,
                "inner_init": self.inner_init.__name__,
                "activation": self.activation.__name__,
                "inner_activation": self.inner_activation.__name__,
                "truncate_gradient": self.truncate_gradient,
                "return_sequences": self.return_sequences}
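A quick check of why the GRU is the lighter unit: it has three (W, U, b) triplets (z, r, h) where the LSTM has four (i, f, c, o), so at equal dimensions a GRU layer carries 25% fewer recurrent-layer parameters. A standalone sketch (sizes chosen for illustration):

input_dim, output_dim = 64, 128

per_gate = input_dim * output_dim + output_dim * output_dim + output_dim  # W, U, b
lstm_params = 4 * per_gate  # i, f, o gates plus candidate c
gru_params = 3 * per_gate   # z, r gates plus candidate h

print(lstm_params, gru_params, gru_params / lstm_params)  # ratio is exactly 0.75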