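Preface

In recent years the attention mechanism has driven important advances in computer vision and natural language processing, and it has repeatedly been shown to improve model performance. Let's take a look at how it works.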

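What Is the Attention Mechanism

The attention mechanism focuses on local information, for example a particular region of an image. The region that receives attention usually changes as the task changes.

Take a group photo like the one above. Viewed as a whole, it is just a crowd of heads; look closely at each person, though, and it turns out they are all brilliant scientists. Apart from the faces, the rest of the picture is essentially useless and supports no task. The attention mechanism aims to find the most useful information, and the simplest scenario to picture is detecting faces in a photo.
In essence, the attention mechanism locates the information of interest and suppresses the rest. The result is usually expressed as a probability map or a probability feature vector.
An example may make this clearer. Suppose we want to translate the phrase 打电脑游戏 into "play computer game".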

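Without attention, once the Encoder has produced the semantic encoding c, that encoding is passed through the Decoder and its content no longer depends on the Encoder. In practice, though, when translating 打 in 打电脑游戏 we care most about the mapping 打 -> play, so we want the Decoder to pay more attention to the features the Encoder extracted from 打. This is the idea behind attention: letting a neural network attend to the parts it needs to attend to while doing a specific job.
A neural network passes numbers around, and the features of every item are numbers too: the features of 打 are a set of numbers, and so are those of 电脑 and 游戏. The semantic encoding is a combination of all these features.
So how do we make the network attend to a particular item? We change the weights of the different items: when the network should attend to 打, we simply raise the weight of the features of 打.
Suppose a function f extracts features and a function g performs decoding. To make the network attend to 打, we give f(打) a larger weight than the other features before the weighted combination is passed to g.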
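The weighting idea can be sketched numerically. This is only a toy illustration under assumptions: the 4-dimensional features, the hand-picked scores, and the names `features`, `scores`, and `softmax` are hypothetical and do not come from the original code.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into positive weights that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical encoder features f(word), one row per source word.
features = np.array([
    [1.0, 0.0, 0.0, 0.0],   # f(打)
    [0.0, 1.0, 0.0, 0.0],   # f(电脑)
    [0.0, 0.0, 1.0, 0.0],   # f(游戏)
])

# Hand-picked scores for the decoding step that should produce "play":
# the first word receives the largest score.
scores = np.array([2.0, 0.5, 0.5])
weights = softmax(scores)

# Context vector handed to the decoder g: a weighted sum of the features.
context = weights @ features
```

In a trained model the scores are learned rather than chosen by hand, but the weighted sum is the same.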

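Spatial Attention

Spatial attention transforms the spatial information of the original image into another space while preserving the key information.
The authors of this approach argued that earlier pooling methods were too crude: merging information directly can make the key information unrecognizable. They therefore proposed a module called the spatial transformer, which applies a spatial transformation to the spatial-domain information of an image so that the key information can be extracted.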

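Consider this illustrative experiment figure:
Column (a) shows the original images: the first handwritten digit, 7, is untransformed; the second, 5, has been rotated; and the third, 6, has noise added.
The colored boxes in column (b) are the bounding boxes learned by the spatial transformer; each box corresponds to the spatial transformer learned for that image.
Column (c) shows the feature maps after the spatial transformer is applied: the key region of the 7 is selected, the 5 is rotated upright, and the noise around the 6 is left out.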
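The resampling step of a spatial transformer can be sketched in NumPy. This is a simplified illustration under assumptions: the affine matrix `theta` is fixed by hand here (in the real module a small localization network predicts it), nearest-neighbour sampling replaces bilinear interpolation, and the helper name `affine_sample` is hypothetical.

```python
import numpy as np

def affine_sample(image, theta):
    """Resample a 2-D image through a 2x3 affine matrix (nearest neighbour)."""
    h, w = image.shape
    # Target grid in normalized coordinates, [-1, 1] on both axes.
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])   # (3, h*w)
    src = theta @ grid                                           # (2, h*w): source x, y
    # Map normalized coordinates back to pixel indices and clip to the border.
    cols = np.clip(np.rint((src[0] + 1) * (w - 1) / 2), 0, w - 1).astype(int)
    rows = np.clip(np.rint((src[1] + 1) * (h - 1) / 2), 0, h - 1).astype(int)
    return image[rows, cols].reshape(h, w)

image = np.arange(25, dtype=float).reshape(5, 5)

identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
zoom = np.array([[0.5, 0.0, 0.0],     # sample only the central half of the image
                 [0.0, 0.5, 0.0]])

same = affine_sample(image, identity)    # unchanged copy of the input
zoomed = affine_sample(image, zoom)      # central region stretched to full size
```

A rotation matrix in place of `zoom` would play the role of straightening the rotated 5 in column (c).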

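Attention over Time Steps

P.S. (I believe this corresponds to what CV calls spatial attention. I am not sure I have understood it correctly, so message me if I got it wrong.)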

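1. Building the Dataset

Here we use an attention mechanism together with an LSTM for time-series prediction.
The complete code in section 4 uses N = 100000, INPUT_DIM = 2, and TIME_STEPS = 10. The generated data looks like this:

Suppose the input has TIME_STEPS = 10 and INPUT_DIM = 2. When the input at time step 2 is [0, 0], the output is 0; when the input at time step 2 is [1, 1], the output is 1. At every other time step (0, 1, 3, ..., 9), the input values are drawn from a normal distribution.
Code: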

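#-------------------------------------------#
#   Build the dataset
#   attention_column is the time step we want the model to attend to
#-------------------------------------------#
import numpy as np

def get_data_recurrent(n, time_steps, input_dim, attention_column=2):
    # Gaussian noise at every time step
    x = np.random.normal(loc=0, scale=10, size=(n, time_steps, input_dim))
    # Random binary labels
    y = np.random.randint(low=0, high=2, size=(n, 1))
    # Copy the label into every feature of the attended time step
    x[:, attention_column, :] = np.tile(y[:], (1, input_dim))
    return x, y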

When we train the attention LSTM on data like this, we expect the attention layer to focus mainly on time step 2 and to give little weight to the other time steps.

#-------------------------------------#
x = [[[14.05795148 10.6586937 ]
      [-5.17788409  3.0967234 ]
      [ 1.          1.        ]
      [-7.16327903  7.36591461]
      [ 3.07887461 18.46302035]
      [ 8.7123103  15.77254757]
      [-7.6266161  -4.56511326]
      [ 1.64038985  0.10782463]
      [ 3.62548177  3.22431191]
      [ 0.76630364 -3.95249622]]]
y = [[1]]
#-------------------------------------#
#-------------------------------------#
x = [[[ -4.22167643   1.98029051]
      [ -1.00985459  15.08588672]
      [  0.           0.        ]
      [ 13.48448467  -0.66743308]
      [ 31.3199347    3.0311851 ]
      [ -4.81579489   1.62016606]
      [  7.40993759   4.25739609]
      [ 13.37376609 -11.63055067]
      [ -6.46277603 -13.94173142]
      [-12.01871193  -9.53632924]]]
y = [[0]]
#-------------------------------------#
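A quick sanity check of the generator can confirm that only time step 2 carries the label. The function is restated so the snippet runs on its own; the seed and the sample size of 1000 are arbitrary choices made for this check, not values from the original post.

```python
import numpy as np

def get_data_recurrent(n, time_steps, input_dim, attention_column=2):
    x = np.random.normal(loc=0, scale=10, size=(n, time_steps, input_dim))
    y = np.random.randint(low=0, high=2, size=(n, 1))
    x[:, attention_column, :] = np.tile(y[:], (1, input_dim))
    return x, y

np.random.seed(0)  # seed chosen only to make the check repeatable
x, y = get_data_recurrent(n=1000, time_steps=10, input_dim=2)

# Time step 2 repeats the label in every feature; other steps are pure noise.
label_step_matches = np.array_equal(x[:, 2, :], np.tile(y, (1, 2)))
other_step_matches = np.array_equal(x[:, 3, :], np.tile(y, (1, 2)))
print(label_step_matches, other_step_matches)
```

Because the label is recoverable from a single time step, a model that learns sensible attention weights should concentrate them there.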

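2. Building the Attention Model

Feeding a time series into the LSTM yields an output of shape (batch_size, time_steps, lstm_units), which we can treat as the features at each time step.
A Permute layer swaps axes 2 and 1, changing the shape from (batch_size, time_steps, lstm_units) to (batch_size, lstm_units, time_steps).
A fully connected layer with Softmax follows, and the shape stays (batch_size, lstm_units, time_steps). In effect, the fully connected layer computes a weight for each time step.
A second Permute swaps axes 2 and 1 back, changing the shape from (batch_size, lstm_units, time_steps) to (batch_size, time_steps, lstm_units). The result holds the weight of each feature at each step.
Finally, this result is multiplied element-wise with the input of the attention block (the LSTM output), so that the weights of each step scale the features of that step.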
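The shape changes described above can be traced with plain NumPy. This is a sketch under assumptions: random arrays stand in for the LSTM output and for the weights of the fully connected layer, and the names `W`, `b`, and `attended` are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch_size, time_steps, lstm_units = 4, 10, 32
rng = np.random.default_rng(0)

lstm_out = rng.normal(size=(batch_size, time_steps, lstm_units))  # stand-in LSTM output
W = rng.normal(size=(time_steps, time_steps))                     # stand-in Dense kernel
b = np.zeros(time_steps)                                          # stand-in Dense bias

a = np.transpose(lstm_out, (0, 2, 1))    # Permute((2, 1)) -> (batch, lstm_units, time_steps)
a = softmax(a @ W + b, axis=-1)          # Dense + softmax over the time-step axis
a_probs = np.transpose(a, (0, 2, 1))     # Permute back -> (batch, time_steps, lstm_units)
attended = lstm_out * a_probs            # element-wise product with the LSTM output

print(a_probs.shape, attended.shape)
```

For every LSTM unit the weights sum to 1 across the time steps, which is why averaging them over the units still gives a distribution over time steps.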

3. Building the Full Network

We build a simple neural network with an attention mechanism and use it for prediction.
Code:

#-------------------------------------------#
#  Build the attention model
#  (imports, constants, and attention_3d_block
#   are given in the complete code in section 4)
#-------------------------------------------#
def get_attention_model():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    # (batch_size, time_steps, input_dim) -> (batch_size, time_steps, lstm_units)
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # (batch_size, time_steps, lstm_units) -> (batch_size, time_steps*lstm_units)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model


4. Complete Code

from keras.layers import Input, Dense, LSTM, Permute, Flatten, Multiply
from keras.models import Model
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

INPUT_DIM = 2
TIME_STEPS = 10

#-------------------------------------------#
#   Average the attention probabilities of
#   each step over the feature axis
#-------------------------------------------#
def get_activations(model, inputs, layer_name=None):
    # Sub-model that stops at the requested layer
    sub_model = Model(model.input, model.get_layer(layer_name).output)
    out = sub_model.predict(inputs)
    # out[0]: (time_steps, lstm_units) -> (time_steps,)
    out = np.mean(out[0], axis=-1)
    return out

#-------------------------------------------#
#   Build the dataset
#   attention_column is the time step we want the model to attend to
#-------------------------------------------#
def get_data_recurrent(n, time_steps, input_dim, attention_column=2):
    x = np.random.normal(loc=0, scale=10, size=(n, time_steps, input_dim))
    y = np.random.randint(low=0, high=2, size=(n, 1))
    x[:, attention_column, :] = np.tile(y[:], (1, input_dim))
    return x, y

#-------------------------------------------#
#   Attention block
#-------------------------------------------#
def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, lstm_units)
    # (batch_size, time_steps, lstm_units) -> (batch_size, lstm_units, time_steps)
    a = Permute((2, 1))(inputs)
    # Fully connected layer over the last axis
    # (batch_size, lstm_units, time_steps) -> (batch_size, lstm_units, time_steps)
    a = Dense(TIME_STEPS, activation='softmax')(a)
    # (batch_size, lstm_units, time_steps) -> (batch_size, time_steps, lstm_units)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    # Element-wise product: each feature at each step is scaled by its
    # weight across all steps (Multiply is the Keras 2+ form of merge(mode='mul'))
    output_attention_mul = Multiply(name='attention_mul')([inputs, a_probs])
    return output_attention_mul

#-------------------------------------------#
#  Build the attention model
#-------------------------------------------#
def get_attention_model():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    # (batch_size, time_steps, input_dim) -> (batch_size, time_steps, lstm_units)
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # (batch_size, time_steps, lstm_units) -> (batch_size, time_steps*lstm_units)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model

if __name__ == '__main__':
    N = 100000
    # Inputs are drawn from a Gaussian distribution
    X, Y = get_data_recurrent(N, TIME_STEPS, INPUT_DIM)

    model = get_attention_model()
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    model.fit(X, Y, epochs=1, batch_size=64, validation_split=0.1)

    attention_vectors = []
    for i in range(300):
        testing_X, testing_Y = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        attention_vector = get_activations(model, testing_X, layer_name='attention_vec')
        print('attention =', attention_vector)
        # The averaged weights should still sum to 1 over the time steps
        assert abs(np.sum(attention_vector) - 1.0) < 1e-5
        attention_vectors.append(attention_vector)

    attention_vector_final = np.mean(np.array(attention_vectors), axis=0)
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(
        kind='bar', title='Attention Mechanism as a function of input dimensions.')
    plt.show()

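5. Results

Channel Attention

In a convolutional network, one dimension is the spatial extent of the image (height and width) and the other is the channels, so channel-based attention is also a commonly used mechanism.
MobileNetV3 essentially uses a channel-based attention model: it models the importance of each feature channel and then enhances or suppresses channels depending on the task. The schematic is shown below.

A side branch is split off after the normal convolution. It first performs the Squeeze operation (Fsq(·) in the figure), which compresses the spatial dimensions so that each two-dimensional feature map becomes a single real number (MobileNetV3 uses global average pooling). This amounts to a pooling operation with a global receptive field, and the number of feature channels is unchanged.
Next comes the Excitation operation (Fex(·) in the figure), which uses parameters w to generate a weight for each feature channel; w is learned to explicitly model the correlations between feature channels. Here this is implemented with a two-layer bottleneck of fully connected layers (reduce the dimension first, then restore it) combined with relu6 and hard_swish.
Once the weight of each feature channel has been obtained, it is applied to the corresponding original channel, so the importance of different channels can be learned for a specific task.
Applied to several baseline models, this mechanism yields a clear performance gain for a small increase in computation. As a general design idea it can be added to existing networks, which makes it very practical.
The essence of channel attention is that it models the importance of each feature and can reallocate features according to the input for different tasks. It is simple and effective.
Code: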

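from keras import activations
from keras.layers import GlobalAveragePooling2D, Dense, Activation, Reshape, Multiply

# Assumed definitions: the original snippet does not show relu6 or hard_swish
def relu6(x):
    return activations.relu(x, max_value=6.0)

def hard_swish(x):
    return x * relu6(x + 3.0) / 6.0

def squeeze(inputs):
    # Attention unit
    input_channels = int(inputs.shape[-1])
    # Squeeze: (batch, H, W, C) -> (batch, C)
    x = GlobalAveragePooling2D()(inputs)
    # Excitation: reduce the channel dimension, then restore it
    x = Dense(input_channels // 4)(x)
    x = Activation(relu6)(x)
    x = Dense(input_channels)(x)
    x = Activation(hard_swish)(x)
    x = Reshape((1, 1, input_channels))(x)
    x = Multiply()([inputs, x])
    return x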
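The Squeeze and Excitation steps can also be traced in plain NumPy to see the shapes involved. This is a sketch under assumptions: random matrices `W1` and `W2` stand in for the two fully connected layers (biases omitted), the tensor sizes are arbitrary, and `relu6` and `hard_swish` use their usual definitions.

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hard_swish(x):
    return x * relu6(x + 3.0) / 6.0

rng = np.random.default_rng(0)
batch, height, width, channels = 2, 8, 8, 16
reduced = channels // 4

feature_map = rng.normal(size=(batch, height, width, channels))
W1 = rng.normal(size=(channels, reduced))   # stand-in for the reducing Dense layer
W2 = rng.normal(size=(reduced, channels))   # stand-in for the expanding Dense layer

squeezed = feature_map.mean(axis=(1, 2))             # Squeeze: (batch, channels)
excited = hard_swish(relu6(squeezed @ W1) @ W2)      # Excitation: (batch, channels)
scaled = feature_map * excited[:, None, None, :]     # one weight per channel, shared over H and W

print(squeezed.shape, excited.shape, scaled.shape)
```

Each channel is multiplied by a single scalar for the whole feature map, which is what makes this a channel-wise rather than a spatial form of attention.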
