
Deep Learning, Natural Language Processing

In my last article, I introduced Recurrent Neural Networks (RNNs) and the complications they carry. To combat these drawbacks, we use LSTMs and GRUs.

The Obstacle: Short-Term Memory

Recurrent Neural Networks are confined to short-term memory. If a long sequence is fed to the network, it has a hard time retaining information and may lose important information from the beginning of the sequence.

Besides, Recurrent Neural Networks face the vanishing gradient problem when backpropagation comes into play. As the gradient is propagated back through many time steps it shrinks, so the weight updates become too small to change the model and contribute little to learning.

Weight Update Rule

“When we perform backpropagation, we calculate the gradients of the loss with respect to the weights and biases of each node. If the gradient reaching one layer is small, the gradient passed back to the layer before it tends to be smaller still. This causes the gradients to diminish dramatically, leading to almost no change in the parameters of the model, so the model stops learning and stops improving.”
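The shrinking effect described above can be illustrated numerically. The following is a minimal sketch, not a real training run: it assumes sigmoid activations (whose derivative never exceeds 0.25), a hypothetical identity recurrent weight matrix, and randomly drawn activation values, and it tracks the norm of a gradient as it is pushed back through 50 time steps.

```python
import numpy as np

def backpropagated_norms(steps=50, size=8, seed=0):
    """Track the norm of a gradient as it is pushed back through `steps` time steps."""
    rng = np.random.default_rng(seed)
    W = np.eye(size)      # hypothetical recurrent weight matrix (assumption)
    grad = np.ones(size)  # gradient arriving at the last time step
    norms = []
    for _ in range(steps):
        s = rng.uniform(0.1, 0.9, size)    # hypothetical sigmoid activations at this step
        grad = W.T @ (grad * s * (1 - s))  # chain rule through one sigmoid step
        norms.append(float(np.linalg.norm(grad)))
    return norms

norms = backpropagated_norms()
print(f"after 1 step:   {norms[0]:.3e}")
print(f"after 50 steps: {norms[-1]:.3e}")
```

Because every step multiplies the gradient by a factor of at most 0.25, the norm decays geometrically, which is why the earliest time steps receive almost no learning signal.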

Why LSTMs and GRUs?

Say you are looking at online reviews of Schitt’s Creek to decide whether to watch it. The basic approach is to read each review and determine its sentiment.

When you read a review, your subconscious mind tries to remember the decisive keywords. You retain heavily weighted words like “Aaaaastonishing”, “timeless”, “incredible”, “eccentric”, and “capricious”, and you do not focus on regular words like “think”, “exactly”, “most”, etc.

The next time you are asked to recall the review, you will probably have a hard time, but I bet you will remember the sentiment and a few of the important, decisive words mentioned above.

And that is exactly how LSTMs and GRUs are intended to operate.

Learn and remember only the important information, and forget everything else.


LSTM (Long Short-Term Memory)

LSTMs are an improved variant of the vanilla RNN, introduced to combat its shortcomings. To implement the intuition above and manage the significant information within the finite-sized state vector of an RNN, we employ selective read, write, and forget gates.


The core concept revolves around the cell state and various gates. The cell state acts as a channel that carries relevant information along the sequence chain throughout the computation, which addresses the problem of short-term memory. As the process continues, information is added to or removed from the cell state via the gates. Gates are small neural network layers that learn during training which information is relevant.

Selective Write

Let us denote the current hidden state by sₜ, the previous hidden state by sₜ₋₁, the current input by xₜ, and the bias by b.

In a vanilla RNN, the previous state sₜ₋₁ accumulates all the earlier outputs, and the whole of it is used to compute the current state sₜ.

Vanilla RNN
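As a point of reference for the gated versions that follow, the vanilla RNN update can be sketched in a few lines of NumPy. This is a minimal sketch assuming the common form sₜ = σ(W sₜ₋₁ + U xₜ + b) with a sigmoid non-linearity; the names `W`, `U`, `rnn_step` and the sizes are illustrative assumptions, not taken from the original figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(s_prev, x_t, W, U, b):
    """One vanilla RNN step: the whole previous state feeds the new state."""
    return sigmoid(W @ s_prev + U @ x_t + b)

rng = np.random.default_rng(0)
state_size, input_size = 4, 3            # illustrative sizes (assumption)
W = rng.normal(size=(state_size, state_size))
U = rng.normal(size=(state_size, input_size))
b = np.zeros(state_size)

s = np.zeros(state_size)                 # initial state
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 inputs
    s = rnn_step(s, x_t, W, U, b)
```

Note that nothing in `rnn_step` decides which parts of `s_prev` deserve to be carried forward; that is the gap the gates fill.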

With selective write, we want to pass only the relevant information on to the next state sₜ. To implement this strategy, we can assign a value between 0 and 1 to each element of the state to determine how much of it is passed on to the next hidden state.

Selective Write in LSTM

We store the fraction of information to be passed on in a vector hₜ₋₁, computed by multiplying the previous state vector sₜ₋₁ element-wise by oₜ₋₁, a vector that holds a value between 0 and 1 for each element.

The next question is how to obtain oₜ₋₁.


We cannot set oₜ₋₁ by hand; it must be learned, and the only quantities we have control over are our parameters. So, to continue the computation, we need to express oₜ₋₁ in terms of parameters.

After learning Uo, Wo, and Bo using gradient descent, we can expect the output gate (oₜ₋₁) to control accurately how much information is passed on to the next step.
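Selective write can be sketched as follows. This is a minimal sketch under stated assumptions: the gate takes the usual sigmoid form oₜ₋₁ = σ(Wo hₜ₋₂ + Uo xₜ₋₁ + Bo), and the state is squashed with tanh before gating, as in the standard LSTM; the text above only describes the element-wise product. The function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_write(s_prev, h_prev2, x_prev, W_o, U_o, b_o):
    """Compute the output gate o_{t-1} and the gated state h_{t-1}.

    s_prev  : state s_{t-1}
    h_prev2 : gated state h_{t-2} from the step before
    x_prev  : input x_{t-1}
    """
    o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)  # each entry lies in (0, 1)
    h_prev = o_prev * np.tanh(s_prev)  # tanh squashing is an assumption (standard LSTM)
    return o_prev, h_prev
```

An entry of `o_prev` near 1 lets the corresponding element of the state through almost unchanged, while an entry near 0 blocks it.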

Selective Read

After passing on the relevant information from the previous step, we introduce a new hidden state vector Šₜ (marked in green).

Šₜ captures all the information from the previous state hₜ₋₁ and the current input xₜ.

Our goal, however, is to drop as much unimportant information as possible, so we read selectively from Šₜ to construct the new cell state.


Selective Read

To keep only the important pieces of content, we return to the 0–1 strategy: we assign a value between 0 and 1 to each element, defining the proportion that we would like to read.

The vector iₜ stores this proportion for each element and is later multiplied element-wise by Šₜ to control the information flowing in from the current input. This vector is called the input gate.


As before, iₜ must be learned, and the only quantities we have control over are our parameters. So, to continue the computation, we need to express iₜ in terms of parameters.

After learning Ui, Wi, and Bi using gradient descent, we can expect the input gate (iₜ) to control accurately how much new information is admitted into the state.

Summing up the equations and parameters introduced so far:
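One way to write down the selective write and selective read steps in the notation used above is shown here. This is a reconstruction under assumptions rather than the author's exact formulas: the gates use the sigmoid σ, the tanh non-linearities follow the standard LSTM, and s̃ₜ stands for the vector written Šₜ in the text.

```latex
\begin{align}
o_{t-1}     &= \sigma(W_o h_{t-2} + U_o x_{t-1} + b_o) \\
h_{t-1}     &= o_{t-1} \odot \tanh(s_{t-1}) \\
\tilde{s}_t &= \tanh(W h_{t-1} + U x_t + b) \\
i_t         &= \sigma(W_i h_{t-1} + U_i x_t + b_i)
\end{align}
```

The parameters learned up to this point are therefore W, U, b for the candidate state, Wo, Uo, Bo for the output gate, and Wi, Ui, Bi for the input gate.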


Selectively Forget

After selectively reading and writing information, we now aim to forget the irrelevant parts of the state, which helps cut the clutter.

To discard the information in sₜ₋₁ that is no longer needed, we use the forget gate fₜ.

Selective Forget

Following the same pattern, the forget gate fₜ holds a value between 0 and 1 for each element of sₜ₋₁, which determines how much of that element is retained.

As with the other gates, fₜ must be learned, and the only quantities we have control over are our parameters. So, to continue the computation, we need to express fₜ in terms of parameters.

After learning Uf, Wf, and Bf using gradient descent, we can expect the forget gate (fₜ) to control accurately how much information is discarded.

Adding the part of sₜ₋₁ retained by the forget gate to the part of Šₜ admitted by the input gate gives the current state sₜ.
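The forget step and the combination with the input gate can be sketched as below. This is a minimal sketch assuming the forget gate has the same sigmoid form as the other gates, fₜ = σ(Wf hₜ₋₁ + Uf xₜ + Bf), and that the new state is sₜ = fₜ ⊙ sₜ₋₁ + iₜ ⊙ Šₜ; the function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_forget(s_prev, s_cand, i_t, h_prev, x_t, W_f, U_f, b_f):
    """Compute the forget gate f_t and the new state s_t.

    s_prev : previous state s_{t-1}
    s_cand : candidate state (written Š_t in the text)
    i_t    : input gate, already computed
    """
    f_t = sigmoid(W_f @ h_prev + U_f @ x_t + b_f)  # how much of s_{t-1} to keep
    s_t = f_t * s_prev + i_t * s_cand              # keep old content, admit new content
    return f_t, s_t
```

With `f_t` close to 1 and `i_t` close to 0 the state is copied forward almost unchanged, which is how an LSTM can carry information across many time steps.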


Final Model

LSTM model

The full set of equations looks like:
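A reconstruction of the complete set in the notation of the previous sections follows. It assumes the standard LSTM form: sigmoid gates, tanh for the candidate state and for the squashing before the output gate. Those tanh choices and the symbol s̃ₜ (for Šₜ) are assumptions, so treat this as a sketch rather than the author's exact formulas.

```latex
\begin{align}
\text{Gates:}\quad
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o) \\
i_t &= \sigma(W_i h_{t-1} + U_i x_t + b_i) \\
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f) \\[4pt]
\text{States:}\quad
\tilde{s}_t &= \tanh(W h_{t-1} + U x_t + b) \\
s_t &= f_t \odot s_{t-1} + i_t \odot \tilde{s}_t \\
h_t &= o_t \odot \tanh(s_t)
\end{align}
```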


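An LSTM requires far more parameters than a vanilla RNN: each of the three gates has its own weights and bias in addition to those of the state update, roughly four times as many in total.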
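The difference is easy to quantify. The sketch below assumes a state of size n and an input of size m, with one n×n matrix, one n×m matrix, and one bias vector per block as in the equations of this article; the sizes 128 and 64 are arbitrary illustrative choices.

```python
def rnn_param_count(n, m):
    """Parameters of a vanilla RNN: W (n x n), U (n x m) and b (n)."""
    return n * n + n * m + n

def lstm_param_count(n, m):
    """An LSTM has one such block for the candidate state and one per gate."""
    blocks = ("candidate", "input gate", "forget gate", "output gate")
    return len(blocks) * rnn_param_count(n, m)

n, m = 128, 64  # illustrative sizes (assumption)
print(rnn_param_count(n, m))   # 24704
print(lstm_param_count(n, m))  # 98816
```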


Because the number of gates and their arrangement can vary, LSTMs come in many variants.


GRUs (Gated Recurrent Units)

As mentioned earlier, the LSTM has many variants, and the GRU is one of them. Unlike the LSTM, the GRU uses fewer gates, which lowers the computational cost.

In Gated Recurrent Units, we have an output gate that controls the proportion of information passed to the next hidden state, and an input gate that controls the information flowing in from the current input. Unlike the LSTM, we do not use a separate forget gate.


Gated Recurrent Units

To reduce computation time, we remove the forget gate and, to discard information, use the complement of the input gate vector, i.e. (1 - iₜ).

The equations implemented for GRU are:
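A set of equations consistent with the description above is sketched here. It is a reconstruction under assumptions: the gates are applied directly to sₜ₋₁, the candidate uses tanh, and s̃ₜ denotes the candidate state. In most of the GRU literature the two gates are called the reset gate and the update gate rather than the output and input gates.

```latex
\begin{align}
o_t         &= \sigma(W_o s_{t-1} + U_o x_t + b_o) \\
i_t         &= \sigma(W_i s_{t-1} + U_i x_t + b_i) \\
\tilde{s}_t &= \tanh\bigl(W (o_t \odot s_{t-1}) + U x_t + b\bigr) \\
s_t         &= (1 - i_t) \odot s_{t-1} + i_t \odot \tilde{s}_t
\end{align}
```

Compared with the LSTM, there is one gate fewer and no separate hₜ vector, so a GRU needs about three quarters of the parameters for the same state size.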


Key Points

  • LSTMs and GRUs were introduced to overcome the short-term memory of RNNs.
  • An LSTM forgets using the forget gate.
  • An LSTM remembers using the input gate.
  • An LSTM keeps long-term memory using the cell state.
  • GRUs are faster and computationally less expensive than LSTMs.
  • The gradients in an LSTM can still vanish during backpropagation, for example when the forget gate stays close to 0.
  • LSTMs do not solve the exploding gradient problem, so we use gradient clipping.
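Gradient clipping, mentioned in the last point, can be sketched as follows. This is a minimal NumPy sketch of clipping by global norm, one common variant; the function name and the threshold are illustrative assumptions, and deep learning frameworks ship their own utilities for this.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm <= max_norm:
        return [g.copy() for g in grads], total_norm  # already small enough
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping keeps the direction of the update and limits only its size, so a single exploding gradient cannot throw the parameters far away from a good region.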

Practical Use Cases

  • Sentiment analysis using RNN
  • AI music generation using LSTM

Conclusion

Hopefully, this article helps you understand Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) and assists you in using them in practice.


As always, thank you so much for reading, and please share this article if you found it useful!








