Attention Is All You Need

Paper link. This post only records my own learning process; if you find mistakes, corrections are welcome.

1. Background

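Over the past few years, recurrent neural networks such as LSTM and GRU have played a major role in NLP. These models have some drawbacks, however. First, memory constraints limit the batch size of training examples. Second, they are computationally expensive, because each step depends on the previous hidden state and so a sequence cannot be processed in parallel. Third, and inherent to this class of models, the relationship learned between two elements of a sequence depends on the distance between them. This is where the attention mechanism comes in: it lets the relationship between two elements no longer be determined by how far apart they are in the sequence.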

2. What the title means

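Attention Is All You Need: the attention mechanism is all you need. Until this paper, attention was mostly used in combination with recurrent neural networks. For example, in the model I built for the Global AI Challenge (全球AI挑战赛), attention was one layer of the network and played its own specific role. With the words "all you need", the authors stress that a model can be built on attention alone, without LSTM or GRU. This model is called the Transformer.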

3. Model architecture

The architecture figure (Figure 1) is taken directly from the paper.

Encoder and Decoder Stacks (quoted directly from the paper; the original is already so concise that it cannot be summarized any further)

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

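The left part of Figure 1 corresponds to one encoder layer. A layer consists of two sub-layers: the lower one is the attention sub-layer and the upper one is the feed-forward sub-layer. Both have residual connections, the same idea as ResNet (residual networks) in computer vision, whose purpose is to make the network easier to train. Layer normalization is added for the same reason.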
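To make LayerNorm(x + Sublayer(x)) concrete, here is a minimal NumPy sketch of the wrapper around one sub-layer. It is only an illustration, not the paper's implementation: the helper names (layer_norm, sublayer_connection, feed_forward) are made up for this post, the learnable gain and bias of layer normalization are left out, and the inner width d_ff = 2048 is the value reported in the paper rather than something stated in the passage above.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance.
    Assumption: the learnable gain and bias are omitted for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_connection(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: two linear maps with a ReLU in between."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 10, 512, 2048  # d_ff is taken from the paper (assumption here)

x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

# Wrap the feed-forward sub-layer; the attention sub-layer would be wrapped the same way.
out = sublayer_connection(x, lambda t: feed_forward(t, w1, b1, w2, b2))
print(out.shape)
```

Because every sub-layer keeps the width d_model, the addition x + Sublayer(x) is always well defined, which is why the paper fixes the output dimension of all sub-layers.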

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

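The right part of Figure 1 corresponds to one decoder layer. A layer consists of three sub-layers, two of which are similar to those in the encoder. The one to focus on is the masked attention sub-layer. Its job is to make the prediction for position i depend only on the positions before i: attention is restricted to the earlier positions, and the later positions are masked out.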
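The masking can be sketched in a few lines of NumPy. This is a hedged illustration rather than the paper's code: causal_mask and masked_softmax are names made up for this post. Each row i of the mask allows columns j <= i of the decoder input; since the decoder input is the output sequence shifted right by one position, this means the prediction for position i only sees outputs at positions less than i.

```python
import numpy as np

def causal_mask(size):
    """Boolean matrix: entry (i, j) is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((size, size), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, giving zero weight to masked-out positions."""
    scores = np.where(mask, scores, -np.inf)               # later positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)                                # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

# With all-equal scores, each position spreads its attention evenly
# over itself and the positions before it.
scores = np.zeros((4, 4))
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

Setting the masked scores to negative infinity before the softmax, instead of zeroing the weights afterwards, keeps each row summing to 1.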

4. Attention

We define functions of the following kind as attention. Here is a rough sketch in Python; it may not be fully rigorous.

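import numpy as np

def attention(query, dictionary, h=np.exp):
    """dictionary is a Python dict mapping key -> value (keys are tuples so they
    are hashable). Each element of query, each key, value, and output is a vector."""
    # h turns a query-key score into a weight; np.exp is only an assumed default.
    outputs = []
    for q in query:
        # Output for q: the sum of all values, each weighted by h(q . key).
        result = np.zeros_like(next(iter(dictionary.values())), dtype=float)
        for key, value in dictionary.items():
            result = result + h(np.dot(q, key)) * value
        outputs.append(result)
    return outputs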

Attention function 1

The figure here is quoted from the paper. The attention function above is written to follow the formula given there.
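The formula itself did not survive in this post. Assuming it is the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, a minimal NumPy sketch could look like the following. The function name and the toy shapes are chosen for illustration only; multi-head projections and masking are left out.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for 2-D arrays.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the outputs (n_queries, d_v) and the attention weights (n_queries, n_keys).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of every query to every key
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))    # 3 queries of dimension d_k = 8
K = rng.normal(size=(5, 8))    # 5 keys of the same dimension
V = rng.normal(size=(5, 16))   # 5 values of dimension d_v = 16

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Compared with the rough dict-based sketch, the weight function here is a softmax over scaled dot products, and all queries are handled at once with matrix products. Every query can look at every key directly, regardless of how far apart the two positions are in the sequence.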
