
[Deep Application] · Implementing Self-Attention Text Classification in Keras (How Machines Read the Human Mind)

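In my post "[Deep Concepts] · Study Notes on the Attention Mechanism" I covered the concepts and technical details of the attention mechanism. This post is a companion piece: it implements Self-Attention text classification in Keras so that readers can understand the mechanism more concretely.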

1. Self-Attention in Detail

With the general principle of the model in mind, we can look at the Self-Attention structure in detail. Its basic structure is as follows.

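In self-attention, the three matrices Q (Query), K (Key), and V (Value) all come from the same input. We first compute the dot product of Q and K and then, to keep the result from growing too large, divide it by a scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of a query or key vector. A softmax then normalizes the result into a probability distribution, which is multiplied by the matrix V to give the weighted-sum representation. The operation can be written as $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$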
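To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name `scaled_dot_product_attention` and the random toy matrices are illustrative assumptions and are not part of this post's Keras code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays."""
    d_k = K.shape[-1]                        # dimension of one query/key vector
    scores = Q @ K.T / np.sqrt(d_k)          # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Toy input: 3 tokens with d_k = 4 (hypothetical values).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # (3, 4)
print(weights.sum(axis=-1))  # every row of weights sums to 1
```

Each row of `weights` is the probability distribution of one query over all keys, and each row of `output` is the corresponding weighted sum of value vectors.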

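This may seem abstract, so let's look at a concrete example (the figures come from https://jalammar.github.io/illustrated-transformer/; that blog explains the topic extremely clearly and is highly recommended). Suppose we want to translate the phrase Thinking Machines, where the input embedding vector of Thinking is denoted $x_1$ and the embedding vector of Machines is denoted $x_2$.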

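When we process the word Thinking, we need to compute the attention score between it and every word in the sentence. This is like using the current word as a search query and matching it against the keys of all words in the sentence (including the word itself) to see how relevant each one is. Let $q_1$ denote the query vector of Thinking, and let $k_1$ and $k_2$ denote the key vectors of Thinking and Machines respectively. To compute the attention scores for Thinking we take the dot products of $q_1$ with $k_1, k_2$; likewise, to compute the attention scores for Machines we take the dot products of $q_2$ with $k_1, k_2$. Having obtained the dot products of $q_1$ with $k_1, k_2$, we then apply the scaling and the softmax normalization.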
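As a numeric sketch of this step, suppose the two dot products for Thinking come out as 112 and 96 and the key dimension is $d_k = 64$. These numbers are illustrative assumptions chosen for the example; they are not produced by the code in this post.

```python
import math

# Assumed raw dot products for the word "Thinking" (illustrative values).
q1_dot_k1 = 112.0
q1_dot_k2 = 96.0
d_k = 64  # assumed dimension of the query/key vectors

# Step 1: scale by sqrt(d_k).
scaled = [q1_dot_k1 / math.sqrt(d_k), q1_dot_k2 / math.sqrt(d_k)]

# Step 2: softmax over the scaled scores.
exps = [math.exp(s - max(scaled)) for s in scaled]
weights = [e / sum(exps) for e in exps]

print(scaled)                          # [14.0, 12.0]
print([round(w, 2) for w in weights])  # [0.88, 0.12]
```

With these values, Thinking attends mostly to itself and gives a small share of its attention to Machines.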

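Clearly, the attention score of the current word with itself is usually the largest, and the other words receive scores according to how important they are to the current word. We then multiply these attention scores by the value vectors to obtain a weighted vector.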
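The weighted sum itself takes only a few lines. The attention weights and the value vectors `v1`, `v2` below are made-up numbers used only to show the arithmetic.

```python
# Assumed attention weights of "Thinking" over the two words (illustrative).
weights = [0.88, 0.12]

# Assumed value vectors for "Thinking" (v1) and "Machines" (v2).
v1 = [1.0, 0.0, 2.0]
v2 = [0.0, 4.0, 2.0]

# z1 = 0.88 * v1 + 0.12 * v2, computed component by component.
z1 = [weights[0] * a + weights[1] * b for a, b in zip(v1, v2)]
print(z1)
```

The result stays close to `v1` because the weight on Thinking dominates.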

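If we stack all the input vectors into a single matrix $X$, then all the query, key, and value vectors can also be written in matrix form: $Q = XW^Q$, $K = XW^K$, $V = XW^V$

Here $W^Q, W^K, W^V$ are parameters learned during training. The whole operation then reduces to the matrix form $softmax(\frac{QK^T}{\sqrt{d_k}})V$ given earlier.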
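A small NumPy sketch of the matrix form, assuming random stand-ins for the learned parameters $W^Q, W^K, W^V$ and for the two embedding vectors $x_1, x_2$. All sizes here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

d_model, d_k = 4, 3                     # assumed toy dimensions
X = rng.normal(size=(2, d_model))       # rows: x_1 (Thinking), x_2 (Machines)

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# One matrix product per projection yields all q_i, k_i, v_i at once.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)                    # (2, 2) score matrix
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)      # softmax over each row
Z = weights @ V                                    # (2, d_k) outputs

# Row 0 of Q is exactly q_1 = x_1 W^Q computed on its own.
print(np.allclose(Q[0], X[0] @ W_Q))
print(Z.shape)
```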

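2. Building the Self_Attention Model

I use Keras to build the Self_Attention model. Because the layer has a fairly large number of internal parameters, Self_Attention is built as a custom layer. For how to write a custom Keras layer, see the Keras documentation page "Writing your own Keras layers".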

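A custom Keras layer needs to implement the following three methods (note that input_shape includes the batch_size dimension):

build(input_shape): This is where you define the weights. This method must set self.built = True, which can be done by calling super([Layer], self).build().

call(x): This is where the layer's logic lives. You only need to care about the first argument passed to call, the input tensor, unless you want your layer to support masking.

compute_output_shape(input_shape): If your layer changes the shape of its input, define the shape transformation logic here, so that Keras can infer the shape of every layer automatically.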
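The division of labour between the three methods can be illustrated without Keras. The class `ToyDense` below is a hypothetical NumPy stand-in, not a Keras API: it only mimics the pattern of creating weights lazily in build, applying them in call, and reporting the resulting shape in compute_output_shape.

```python
import numpy as np

class ToyDense:
    """Hypothetical stand-in that mimics the three-method layer pattern."""

    def __init__(self, output_dim):
        self.output_dim = output_dim
        self.built = False

    def build(self, input_shape):
        # input_shape includes the batch dimension: (batch_size, features).
        rng = np.random.default_rng(0)
        self.kernel = rng.uniform(-0.05, 0.05,
                                  size=(input_shape[-1], self.output_dim))
        self.built = True

    def call(self, x):
        # The layer logic: a plain matrix product with the weights.
        return x @ self.kernel

    def compute_output_shape(self, input_shape):
        # Only the last dimension changes; the batch dimension is kept.
        return (input_shape[0], self.output_dim)

    def __call__(self, x):
        # Weights are created on first use, once the input shape is known.
        if not self.built:
            self.build(x.shape)
        return self.call(x)

layer = ToyDense(8)
y = layer(np.ones((5, 3)))
print(y.shape, layer.compute_output_shape((5, 3)))
```

The point of the pattern is that the weight shape depends on the input shape, so the weights cannot be created in `__init__`.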

The implementation is as follows:

from keras.preprocessing import sequence
from keras.datasets import imdb
from matplotlib import pyplot as plt
import pandas as pd

from keras import backend as K
from keras.engine.topology import Layer


class Self_Attention(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Self_Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create the trainable weights of this layer: three stacked projection
        # matrices, kernel[0] = W^Q, kernel[1] = W^K, kernel[2] = W^V.
        # inputs.shape = (batch_size, time_steps, embedding_dim)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(3, input_shape[2], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(Self_Attention, self).build(input_shape)  # must be called at the end

    def call(self, x):
        # Project the input into queries, keys, and values.
        WQ = K.dot(x, self.kernel[0])
        WK = K.dot(x, self.kernel[1])
        WV = K.dot(x, self.kernel[2])

        print("WQ.shape", WQ.shape)
        print("K.permute_dimensions(WK, [0, 2, 1]).shape", K.permute_dimensions(WK, [0, 2, 1]).shape)

        # Scores Q K^T per sample, then scaling and softmax.
        QK = K.batch_dot(WQ, K.permute_dimensions(WK, [0, 2, 1]))
        QK = QK / (64**0.5)
        QK = K.softmax(QK)

        print("QK.shape", QK.shape)

        # Weighted sum of the value vectors.
        V = K.batch_dot(QK, WV)
        return V

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.output_dim)
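To check the tensor shapes without running Keras, the same computation can be mirrored in NumPy. This is a sketch under assumptions: `self_attention_forward` is a hypothetical helper, the weights are random, and the sizes (a batch of 2, 64 time steps, 128 features) are picked to match the model used later in this post.

```python
import numpy as np

def self_attention_forward(x, kernel, scale=64 ** 0.5):
    """NumPy mirror of Self_Attention.call; kernel has shape (3, D, out)."""
    WQ = x @ kernel[0]                        # (batch, T, out)
    WK = x @ kernel[1]
    WV = x @ kernel[2]
    QK = WQ @ WK.transpose(0, 2, 1)           # (batch, T, T)
    QK = QK / scale
    QK = np.exp(QK - QK.max(axis=-1, keepdims=True))
    QK = QK / QK.sum(axis=-1, keepdims=True)  # softmax over the last axis
    return QK @ WV, QK                        # (batch, T, out), (batch, T, T)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64, 128))
kernel = rng.uniform(-0.05, 0.05, size=(3, 128, 128))

out, attn = self_attention_forward(x, kernel)
print(out.shape)   # (2, 64, 128)
print(attn.shape)  # (2, 64, 64)
```

Apart from the batch size, which Keras leaves undetermined, these are the shapes the layer's print statements report when the model is built.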

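The code can be read side by side with the concepts explained in Section 1.

If we stack all the input vectors into a matrix, then all the query, key, and value vectors can also be written in matrix form.

This corresponds to

WQ = K.dot(x, self.kernel[0])
WK = K.dot(x, self.kernel[1])
WV = K.dot(x, self.kernel[2])

Here $W^Q, W^K, W^V$ are parameters learned during training. The operation then reduces to matrix form.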

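This corresponds to the following (why batch_dot? Because the tensors carry a leading batch_size dimension, so the matrix product has to be taken separately for each sample):

QK = K.batch_dot(WQ, K.permute_dimensions(WK, [0, 2, 1]))
QK = QK / (64**0.5)
QK = K.softmax(QK)
print("QK.shape", QK.shape)
V = K.batch_dot(QK, WV)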
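The effect of a batched product can be sketched in NumPy, where `np.matmul` on 3-D arrays multiplies the matrices of each sample independently. The array names and sizes below are assumptions for illustration; the sketch shows the idea behind K.batch_dot rather than its exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
WQ = rng.normal(size=(4, 5, 3))   # (batch, time_steps, features)
WK = rng.normal(size=(4, 5, 3))

# Batched product: one (5, 3) x (3, 5) multiplication per sample.
batched = np.matmul(WQ, WK.transpose(0, 2, 1))      # (4, 5, 5)

# The same result computed sample by sample in an explicit loop.
looped = np.stack([WQ[i] @ WK[i].T for i in range(WQ.shape[0])])

print(batched.shape)
print(np.allclose(batched, looped))
```

Scores are never mixed across samples: every review in a batch gets its own time_steps x time_steps attention matrix.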

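Here QK = QK / (64**0.5) divides by a scaling factor. The value 64**0.5 is a constant I chose myself, and other articles may use a different one. Following the formula in Section 1, the factor would be $\sqrt{d_k}$ with $d_k$ equal to output_dim (128 in the model below), that is, self.output_dim**0.5.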
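The choice of divisor changes how peaked the softmax is. The sketch below compares the hard-coded 64**0.5 with 128**0.5 on a made-up row of raw scores; the scores are assumptions, and only the qualitative effect is the point.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [24.0, 16.0, 8.0]   # assumed unscaled dot products

for divisor in (64 ** 0.5, 128 ** 0.5):
    weights = softmax([s / divisor for s in raw_scores])
    print(round(divisor, 2), [round(w, 3) for w in weights])

# A smaller divisor leaves larger gaps between the scores, so the
# distribution concentrates more on the top-scoring position.
w_64 = softmax([s / 64 ** 0.5 for s in raw_scores])
w_128 = softmax([s / 128 ** 0.5 for s in raw_scores])
print(max(w_64) > max(w_128))
```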

3. Training the Network

The complete project code follows. It uses the IMDB movie review dataset that ships with Keras.

#%%
from keras.preprocessing import sequence
from keras.datasets import imdb
from matplotlib import pyplot as plt
import pandas as pd

from keras import backend as K
from keras.engine.topology import Layer


class Self_Attention(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Self_Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create the trainable weights of this layer: three stacked projection
        # matrices, kernel[0] = W^Q, kernel[1] = W^K, kernel[2] = W^V.
        # inputs.shape = (batch_size, time_steps, embedding_dim)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(3, input_shape[2], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(Self_Attention, self).build(input_shape)  # must be called at the end

    def call(self, x):
        WQ = K.dot(x, self.kernel[0])
        WK = K.dot(x, self.kernel[1])
        WV = K.dot(x, self.kernel[2])

        print("WQ.shape", WQ.shape)
        print("K.permute_dimensions(WK, [0, 2, 1]).shape", K.permute_dimensions(WK, [0, 2, 1]).shape)

        QK = K.batch_dot(WQ, K.permute_dimensions(WK, [0, 2, 1]))
        QK = QK / (64**0.5)
        QK = K.softmax(QK)

        print("QK.shape", QK.shape)

        V = K.batch_dot(QK, WV)
        return V

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.output_dim)


max_features = 20000

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Convert the labels to one-hot encoding
y_train, y_test = pd.get_dummies(y_train), pd.get_dummies(y_test)

print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

#%% Data normalization: pad or truncate every review to a fixed length
maxlen = 64

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

#%%
batch_size = 32

from keras.models import Model
from keras.optimizers import SGD, Adam
from keras.layers import *

# Embedding -> Self_Attention -> average over time steps -> softmax classifier
S_inputs = Input(shape=(maxlen,), dtype='int32')
embeddings = Embedding(max_features, 128)(S_inputs)
O_seq = Self_Attention(128)(embeddings)
O_seq = GlobalAveragePooling1D()(O_seq)
O_seq = Dropout(0.5)(O_seq)
outputs = Dense(2, activation='softmax')(O_seq)

model = Model(inputs=S_inputs, outputs=outputs)
print(model.summary())

# try using different optimizers and different optimizer configs
opt = Adam(lr=0.0002, decay=0.00001)
loss = 'categorical_crossentropy'
model.compile(loss=loss,
              optimizer=opt,
              metrics=['accuracy'])

#%%
print('Train...')
h = model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=5,
              validation_data=(x_test, y_test))

# Plot the loss and accuracy curves for training and validation
plt.plot(h.history["loss"], label="train_loss")
plt.plot(h.history["val_loss"], label="val_loss")
plt.plot(h.history["acc"], label="train_acc")
plt.plot(h.history["val_acc"], label="val_acc")
plt.legend()
plt.show()

#model.save("imdb.h5")
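The preprocessing steps in the script can be sketched in plain Python. The helpers `pad_pre` and `one_hot` are hypothetical names: `pad_pre` imitates the default behaviour of pad_sequences (padding with zeros at the front and truncating from the front), and `one_hot` plays the role of pd.get_dummies for the 0/1 labels.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

def one_hot(label, num_classes=2):
    """Turn an integer class label into a one-hot list."""
    return [1 if i == label else 0 for i in range(num_classes)]

print(pad_pre([5, 6, 7], maxlen=5))            # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))   # [3, 4, 5, 6]
print(one_hot(0), one_hot(1))                  # [1, 0] [0, 1]
```

With maxlen = 64, every review therefore contributes only its last 64 word indices to the model.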

4. Results

(TF_GPU) D:\Files\DATAs\prjs\python\tf_keras\transfromerdemo>C:/Files/APPs/RuanJian/Miniconda3/envs/TF_GPU/python.exe d:/Files/DATAs/prjs/python/tf_keras/transfromerdemo/train.1.py
Using TensorFlow backend.
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 64)
x_test shape: (25000, 64)
WQ.shape (?, 64, 128)
K.permute_dimensions(WK, [0, 2, 1]).shape (?, 128, 64)
QK.shape (?, 64, 64)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64)                0
_________________________________________________________________
embedding_1 (Embedding)      (None, 64, 128)           2560000
_________________________________________________________________
self__attention_1 (Self_Atte (None, 64, 128)           49152
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 2,609,410
Trainable params: 2,609,410
Non-trainable params: 0
_________________________________________________________________
None
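The parameter counts reported in the summary can be reproduced by hand from the sizes used in the script (a vocabulary of 20000, embedding size 128, attention output size 128, and 2 output classes). The variable names are illustrative.

```python
max_features, embed_dim, attn_dim, num_classes = 20000, 128, 128, 2

embedding_params = max_features * embed_dim           # one vector per word index
attention_params = 3 * embed_dim * attn_dim           # W^Q, W^K, W^V, no bias
dense_params = attn_dim * num_classes + num_classes   # weights plus bias

total = embedding_params + attention_params + dense_params
print(embedding_params, attention_params, dense_params, total)
# 2560000 49152 258 2609410
```

Almost all of the parameters sit in the embedding table; the Self_Attention layer itself adds only 49152.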

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 17s 693us/step - loss: 0.5244 - acc: 0.7514 - val_loss: 0.3834 - val_acc: 0.8278
Epoch 2/5
25000/25000 [==============================] - 15s 615us/step - loss: 0.3257 - acc: 0.8593 - val_loss: 0.3689 - val_acc: 0.8368
Epoch 3/5
25000/25000 [==============================] - 15s 614us/step - loss: 0.2602 - acc: 0.8942 - val_loss: 0.3909 - val_acc: 0.8303
Epoch 4/5
25000/25000 [==============================] - 15s 618us/step - loss: 0.2078 - acc: 0.9179 - val_loss: 0.4482 - val_acc: 0.8215
Epoch 5/5
25000/25000 [==============================] - 15s 619us/step - loss: 0.1639 - acc: 0.9368 - val_loss: 0.5313 - val_acc: 0.8106
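In this log the validation loss bottoms out early while the training loss keeps falling, which is usually read as a sign of overfitting. A short sketch that picks the best epoch from the printed numbers (the values are copied from the log; the variable names are illustrative):

```python
# Per-epoch metrics copied from the training log.
train_loss = [0.5244, 0.3257, 0.2602, 0.2078, 0.1639]
val_loss = [0.3834, 0.3689, 0.3909, 0.4482, 0.5313]
val_acc = [0.8278, 0.8368, 0.8303, 0.8215, 0.8106]

# Index of the epoch with the lowest validation loss.
best = min(range(len(val_loss)), key=val_loss.__getitem__)
print("best epoch:", best + 1)           # best epoch: 2
print("val_acc there:", val_acc[best])   # val_acc there: 0.8368

# Training loss falls every epoch; validation loss rises after the best one.
print(all(a > b for a, b in zip(train_loss, train_loss[1:])))
print(all(a < b for a, b in zip(val_loss[best:], val_loss[best + 1:])))
```

On this run, stopping after the second epoch would have given the highest validation accuracy of the five.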
