Task 3: Text Matching Based on the Attention Mechanism

Given two input sentences, determine the relationship between them. Follow ESIM (you may use only the LSTM part and ignore the Tree-LSTM), implemented with a bidirectional attention mechanism.

  1. References

    1. Neural Networks and Deep Learning (《神经网络与深度学习》), Chapter 7
    2. Reasoning about Entailment with Neural Attention https://arxiv.org/pdf/1509.06664v1.pdf
    3. Enhanced LSTM for Natural Language Inference https://arxiv.org/pdf/1609.06038v3.pdf
  2. Dataset: https://nlp.stanford.edu/projects/snli/
  3. Implementation requirement: PyTorch
  4. Key concepts:
    1. Attention mechanism
    2. token2token attention
  5. Time: two weeks

Tasks 3 and 4 essentially ask us to reproduce papers, and the process is somewhat rough going (the first step is always the hardest).
After eight hours of debugging, the whole thing finally ran end to end…
The moment it ran was genuinely satisfying, even though I had no idea how well it would perform.
In any case, training is extremely slow; a 1660 Ti really cannot keep up with several hundred thousand examples.

Without further ado, let's dive straight into ESIM.

0x0 Hybrid Neural Inference Models

We present here our natural language inference networks which are composed of the following major components: input encoding, local inference modeling, and inference composition.

(Figure: the overall architecture of the ESIM model)

0x1 Input Encoding

First, a bidirectional LSTM is used to encode the premise and the hypothesis.

Here BiLSTM learns to represent a word (e.g., a_i) and its context.

After BiLSTM encoding we obtain $\bar{a}$ and $\bar{b}$.
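
For reference, the ESIM paper writes this encoding step as:

$\bar{a}_i = \mathrm{BiLSTM}(a, i),\ \forall i \in [1, \ldots, l_a]$
$\bar{b}_j = \mathrm{BiLSTM}(b, j),\ \forall j \in [1, \ldots, l_b]$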

BiLSTM

import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda'  # global device used by the modules below


class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=128, dropout_rate=0.1, layer_num=1):
        super(BiLSTM, self).__init__()
        self.hidden_size = hidden_size
        if layer_num == 1:
            self.bilstm = nn.LSTM(input_size, hidden_size // 2, layer_num,
                                  batch_first=True, bidirectional=True)
        else:
            self.bilstm = nn.LSTM(input_size, hidden_size // 2, layer_num,
                                  batch_first=True, dropout=dropout_rate, bidirectional=True)
        self.init_weights()

    def init_weights(self):
        for p in self.bilstm.parameters():
            if p.dim() > 1:
                nn.init.normal_(p)
                p.data.mul_(0.01)
            else:
                p.data.zero_()
                # This is the range of indices for our forget gates for each LSTM cell
                p.data[self.hidden_size // 2: self.hidden_size] = 1

    def forward(self, x, lens):
        '''
        :param x: (batch, seq_len, input_size)
        :param lens: (batch, )
        :return: (batch, seq_len, hidden_size)
        '''
        # Sort by length, pack, run the LSTM, unpack, then restore the original order.
        ordered_lens, index = lens.sort(descending=True)
        ordered_x = x[index]
        packed_x = nn.utils.rnn.pack_padded_sequence(ordered_x, ordered_lens.cpu(), batch_first=True)
        packed_output, _ = self.bilstm(packed_x)
        output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
        recover_index = index.argsort()
        recover_output = output[recover_index]
        return recover_output
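
As a quick sanity check, here is a minimal CPU-only usage sketch for the wrapper above (all sizes are made up for illustration):

import torch

bilstm = BiLSTM(input_size=300, hidden_size=128)
x = torch.randn(4, 7, 300)          # (batch, seq_len, input_size)
lens = torch.tensor([7, 5, 3, 2])   # true (unpadded) length of each sequence
out = bilstm(x, lens)
print(out.shape)                    # torch.Size([4, 7, 128])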

Input_Encoding

class Input_Encoding(nn.Module):
    def __init__(self, num_features, hidden_size, embedding_size, vectors,
                 num_layers=1, batch_first=True, drop_out=0.5):
        super(Input_Encoding, self).__init__()
        self.num_features = num_features
        self.num_hidden = hidden_size
        self.embedding_size = embedding_size
        self.num_layers = num_layers
        self.dropout = nn.Dropout(drop_out)
        self.embedding = nn.Embedding.from_pretrained(vectors).cuda()
        self.bilstm = BiLSTM(embedding_size, hidden_size, drop_out, num_layers)

    def forward(self, x, lens):
        # x = torch.LongTensor(x)
        x = self.embedding(x)
        x = self.dropout(x)
        x = self.bilstm(x, lens)
        return x

0x2 Local Inference Modeling

Modeling local inference needs to employ some forms of hard or soft alignment to associate the relevant subcomponents between a premise and a hypothesis. This includes early methods motivated from the alignment in conventional automatic machine translation (MacCartney, 2009). In neural network models, this is often achieved with soft attention.

Locality of inference

Next, a soft attention mechanism is used, with dot-product scoring:

Local inference collected over sequences

As is well known, the attention mechanism is essentially just a weighted representation.

Attention is computed between every token of sentence a and every token of sentence b, producing the matrix e: $e_{ij}$ is the dot product between the $i$-th token of a and the $j$-th token of b.
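
In the ESIM paper's notation, the attention weights and the resulting soft alignments are:

$e_{ij} = \bar{a}_i^{\top} \bar{b}_j$
$\tilde{a}_i = \sum_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b} \exp(e_{ik})} \bar{b}_j,\ \forall i \in [1, \ldots, l_a]$
$\tilde{b}_j = \sum_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a} \exp(e_{kj})} \bar{a}_i,\ \forall j \in [1, \ldots, l_b]$

That is, $\tilde{a}_i$ re-expresses the $i$-th token of the premise as a weighted sum of the hypothesis tokens, and $\tilde{b}_j$ does the reverse.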

Enhancement of local inference information

In our models, we further enhance the local inference information collected. We compute the difference and the element-wise product for the tuple <ā, ã> as well as for <b̄, b̃>.

A somewhat heuristic-looking concatenation is then used:
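
Concretely, the paper defines the enhanced local inference representations as:

$m_a = [\bar{a}; \tilde{a}; \bar{a} - \tilde{a}; \bar{a} \odot \tilde{a}]$
$m_b = [\bar{b}; \tilde{b}; \bar{b} - \tilde{b}; \bar{b} \odot \tilde{b}]$

so each token representation grows from hidden_size to 4 * hidden_size, which is exactly the concatenation along the last dimension in the code below.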

Local_Inference_Modeling

class Local_Inference_Modeling(nn.Module):
    def __init__(self):
        super(Local_Inference_Modeling, self).__init__()
        self.softmax_1 = nn.Softmax(dim=1).to(device)
        self.softmax_2 = nn.Softmax(dim=2).to(device)

    def forward(self, a_, b_):
        '''
        :param a_: (batch, seq_len_A, hidden_size)
        :param b_: (batch, seq_len_B, hidden_size)
        :return: (batch, seq_len_A, hidden_size * 4), (batch, seq_len_B, hidden_size * 4)
        '''
        # Token-to-token attention scores: (batch, seq_len_A, seq_len_B)
        e = torch.matmul(a_, b_.transpose(1, 2)).to(device)

        # Softmax over dim 2 normalizes over b's tokens; over dim 1, over a's tokens.
        a_tilde = (self.softmax_2(e)).bmm(b_)
        b_tilde = (self.softmax_1(e).transpose(1, 2)).bmm(a_)

        m_a = torch.cat([a_, a_tilde, a_ - a_tilde, a_ * a_tilde], dim=-1)
        m_b = torch.cat([b_, b_tilde, b_ - b_tilde, b_ * b_tilde], dim=-1)

        return m_a, m_b
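
For intuition, here is a minimal, self-contained sketch of the same soft-alignment step on made-up tensors (shapes are arbitrary, CPU only):

import torch
import torch.nn.functional as F

a_bar = torch.randn(2, 5, 8)                    # (batch, seq_len_A, hidden)
b_bar = torch.randn(2, 6, 8)                    # (batch, seq_len_B, hidden)

e = torch.matmul(a_bar, b_bar.transpose(1, 2))  # (2, 5, 6) token-to-token scores
a_tilde = F.softmax(e, dim=2).bmm(b_bar)        # each a-token as a weighted sum of b-tokens
b_tilde = F.softmax(e, dim=1).transpose(1, 2).bmm(a_bar)

m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)
print(m_a.shape)                                # torch.Size([2, 5, 32])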

0x3 Inference Composition

To determine the overall inference relationship between a premise and hypothesis, we explore a composition layer to compose the enhanced local inference information m_a and m_b. We perform the composition sequentially or in its parse context using BiLSTM and tree-LSTM, respectively.

The composition layer

The encoding is the same as in the first part:

$v_{a,i} = \mathrm{BiLSTM}(m_a, i),\ \forall i \in [1, \ldots, l_a]$
$v_{b,j} = \mathrm{BiLSTM}(m_b, j),\ \forall j \in [1, \ldots, l_b]$

Then:

class Inference_Composition(nn.Module):
    def __init__(self, num_features, m_size, hidden_size, num_layers, embedding_size,
                 batch_first=True, drop_out=0.5):
        super(Inference_Composition, self).__init__()
        self.linear = nn.Linear(4 * hidden_size, hidden_size).to(device)
        self.bilstm = BiLSTM(hidden_size, hidden_size, drop_out, num_layers).to(device)
        self.drop_out = nn.Dropout(drop_out).to(device)

    def forward(self, x, lens):
        """
        :param x: (batch, seq_len, hidden_size * 4)
        :param lens: (batch, )
        :return: (batch, seq_len, hidden_size)
        """
        x = self.linear(x)  # project 4 * hidden_size back down to hidden_size
        x = self.drop_out(x)
        x = self.bilstm(x, lens)
        return x

0x4 Final Prediction

We then put v into a final multilayer perceptron (MLP) classifier. The MLP has a hidden layer with tanh activation and softmax output layer in our experiments. The entire model (all three components described above) is trained end-to-end. For training, we use multi-class cross-entropy loss.

I did not write out the formulas for this part; it is just a two-layer fully connected classifier applied to pooled sentence vectors:

The pooling here summarizes each sentence over its tokens: average and max pooling along the sequence dimension reduce (batch, seq_len, hidden_size) to (batch, hidden_size), so the concatenated vector fed to the MLP has size 4 * hidden_size.
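
For completeness, the paper defines the pooled vector fed to the classifier as:

$v_{a,\mathrm{ave}} = \sum_{i=1}^{l_a} \frac{v_{a,i}}{l_a}, \quad v_{a,\max} = \max_{i=1}^{l_a} v_{a,i}$
$v = [v_{a,\mathrm{ave}}; v_{a,\max}; v_{b,\mathrm{ave}}; v_{b,\max}]$

with $v_{b,\mathrm{ave}}$ and $v_{b,\max}$ defined analogously for the hypothesis.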

class Prediction(nn.Module):
    def __init__(self, v_size, mid_size, num_classes=4, drop_out=0.5):
        super(Prediction, self).__init__()
        self.mlp = nn.Sequential(nn.Linear(v_size, mid_size), nn.Tanh(),
                                 nn.Linear(mid_size, num_classes)).to(device)

    def forward(self, a, b):
        """
        :param a: (batch, seq_len_A, hidden_size)
        :param b: (batch, seq_len_B, hidden_size)
        :return: (batch, num_classes)
        """
        # Pool over the sequence (token) dimension so each sentence becomes a fixed
        # (batch, hidden_size) vector and the concatenation matches v_size = 4 * hidden_size.
        v_a_avg = F.avg_pool1d(a.transpose(1, 2), a.size(1)).squeeze(-1)
        v_a_max = F.max_pool1d(a.transpose(1, 2), a.size(1)).squeeze(-1)
        v_b_avg = F.avg_pool1d(b.transpose(1, 2), b.size(1)).squeeze(-1)
        v_b_max = F.max_pool1d(b.transpose(1, 2), b.size(1)).squeeze(-1)

        out_put = torch.cat((v_a_avg, v_a_max, v_b_avg, v_b_max), dim=-1)
        return self.mlp(out_put)

0x5 Overall inference models

Our model can be based only on the sequential networks by removing all tree components and we call it Enhanced Sequential Inference Model (ESIM) (see the left part of Figure 1). We will show that ESIM outperforms all previous results. We will also encode parse information with tree LSTMs in multiple layers as described (see the right side of Figure 1). We train this model and incorporate it into ESIM by averaging the predicted probabilities to get the final label for a premise-hypothesis pair. We will show that parsing information complements very well with ESIM and further improves the performance, and we call the final model Hybrid Inference Model (HIM).

class ESIM(nn.Module):
    def __init__(self, num_features, hidden_size, embedding_size, num_classes=4, vectors=None,
                 num_layers=1, batch_first=True, drop_out=0.5, freeze=False):
        super(ESIM, self).__init__()
        self.embedding_size = embedding_size
        self.input_encoding = Input_Encoding(num_features, hidden_size, embedding_size, vectors,
                                             num_layers=num_layers, batch_first=batch_first,
                                             drop_out=drop_out)
        self.local_inference = Local_Inference_Modeling()
        self.inference_composition = Inference_Composition(num_features, 4 * hidden_size, hidden_size,
                                                           num_layers, embedding_size=embedding_size,
                                                           batch_first=batch_first, drop_out=drop_out)
        self.prediction = Prediction(4 * hidden_size, hidden_size, num_classes, drop_out)

    def forward(self, a, len_a, b, len_b):
        a_bar = self.input_encoding(a, len_a)
        b_bar = self.input_encoding(b, len_b)

        m_a, m_b = self.local_inference(a_bar, b_bar)

        v_a = self.inference_composition(m_a, len_a)
        v_b = self.inference_composition(m_b, len_b)

        out_put = self.prediction(v_a, v_b)
        return out_put
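
A hypothetical smoke test for the assembled model, assuming a CUDA device is available (Input_Encoding moves its embedding with .cuda()) and using made-up vocabulary size, dimensions, and random pretrained vectors:

import torch

vocab_size, emb_dim, hidden = 1000, 50, 64
fake_vectors = torch.randn(vocab_size, emb_dim)          # stand-in for GloVe vectors
model = ESIM(num_features=vocab_size, hidden_size=hidden, embedding_size=emb_dim,
             num_classes=4, vectors=fake_vectors).to(device)

a = torch.randint(0, vocab_size, (2, 7)).to(device)      # premise token ids
b = torch.randint(0, vocab_size, (2, 5)).to(device)      # hypothesis token ids
len_a = torch.tensor([7, 4]).to(device)
len_b = torch.tensor([5, 5]).to(device)

print(model(a, len_a, b, len_b).shape)                   # torch.Size([2, 4])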

0x6 Experimental Setup

Data

from torchtext.legacy.data import Iterator, BucketIterator
from torchtext.legacy import data
import torch


def data_loader(batch_size=32, device="cuda", data_path='data', vectors=None):
    TEXT = data.Field(batch_first=True, include_lengths=True, lower=True)
    LABEL = data.LabelField(batch_first=True)
    TREE = None

    fields = {'sentence1': ('premise', TEXT),
              'sentence2': ('hypothesis', TEXT),
              'gold_label': ('label', LABEL)}

    train_data, dev_data, test_data = data.TabularDataset.splits(
        path=data_path,
        train='snli_1.0_train.jsonl',
        validation='snli_1.0_dev.jsonl',
        test='snli_1.0_test.jsonl',
        format='json',
        fields=fields,
        filter_pred=lambda ex: ex.label != '-')

    TEXT.build_vocab(train_data, vectors=vectors, unk_init=torch.Tensor.normal_)
    LABEL.build_vocab(dev_data)

    train_iter, dev_iter = BucketIterator.splits(
        (train_data, dev_data),
        batch_sizes=(batch_size, batch_size),
        device=device,
        sort_key=lambda x: len(x.premise) + len(x.hypothesis),
        sort_within_batch=True,
        repeat=False,
        shuffle=True)

    test_iter = Iterator(
        test_data,
        batch_size=batch_size,
        device=device,
        sort=False,
        sort_within_batch=False,
        repeat=False,
        shuffle=False)

    return train_iter, dev_iter, test_iter, TEXT, LABEL

Training

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.vocab import Vectors
from models import ESIM
from tqdm import tqdm
from utils import data_loader
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
batch_size = 36
hidden_size = 600  # each direction of the BiLSTM (forward and backward) has hidden size hidden_size // 2
epochs = 10
drop_out = 0.5
num_layers = 1
learning_rate = 4e-4
patience = 5
clip = 10
embedding_size = 300
device = 'cuda'
vectors = Vectors('glove.6B.300d.txt', 'C:/Users/Mechrevo/Desktop/nlp-beginner/embedding')
data_path = 'data'


def train(train_iter, dev_iter, loss_func, optimizer, epochs, patience=5, clip=5):
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        n = 0
        train_correct = 0
        train_error = 0
        for batch in tqdm(train_iter):
            premise, premise_lens = batch.premise
            hypothesis, hypothesis_lens = batch.hypothesis
            labels = batch.label

            model.zero_grad()
            output = model(premise, premise_lens, hypothesis, hypothesis_lens).to(device)
            loss = loss_func(output, labels)

            train_correct += (output.argmax(1) == labels).sum().item()
            train_error += (output.argmax(1) != labels).sum().item()
            total_loss += loss.item()
            n += batch_size

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            if n % 3600 == 0:
                print('epoch : {} step : {}, loss : {}, acc : {}'.format(
                    epoch + 1, int(n / 3600), total_loss / n,
                    train_correct / (train_error + train_correct)))

        tqdm.write("Epoch: {}, Train Average Loss: {}, Train Acc: {}".format(
            epoch + 1, total_loss / n, train_correct / (train_error + train_correct)))

        model.eval()
        with torch.no_grad():
            n2 = 0
            val_total_loss = 0
            dev_correct = 0
            dev_error = 0
            for batch in tqdm(dev_iter):
                premise, premise_lens = batch.premise
                hypothesis, hypothesis_lens = batch.hypothesis
                labels = batch.label

                output = model(premise, premise_lens, hypothesis, hypothesis_lens).to(device)
                loss = loss_func(output, labels)

                dev_correct += (output.argmax(1) == labels).sum().item()
                dev_error += (output.argmax(1) != labels).sum().item()
                val_total_loss += loss.item()
                n2 += batch_size

            tqdm.write("Epoch: {}, Validation Average Loss: {}, Validation Acc: {}".format(
                epoch + 1, val_total_loss / n2, dev_correct / (dev_error + dev_correct)))


def test(test_iter, loss_func):
    model.eval()
    with torch.no_grad():
        n = 0
        test_total_loss = 0
        test_correct = 0
        test_error = 0
        for batch in tqdm(test_iter):
            premise, premise_lens = batch.premise
            hypothesis, hypothesis_lens = batch.hypothesis
            labels = batch.label

            output = model(premise, premise_lens, hypothesis, hypothesis_lens).to(device)
            loss = loss_func(output, labels)

            test_correct += (output.argmax(1) == labels).sum().item()
            test_error += (output.argmax(1) != labels).sum().item()
            test_total_loss += loss.item()
            n += batch_size

        print('Test Acc: {}, loss : {}'.format(
            test_correct / (test_error + test_correct), test_total_loss / n))


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


if __name__ == '__main__':
    train_iter, dev_iter, test_iter, TEXT, LABEL = data_loader(batch_size, device, data_path, vectors)

    model = ESIM(num_features=len(TEXT.vocab), hidden_size=hidden_size, embedding_size=embedding_size,
                 num_classes=4, vectors=TEXT.vocab.vectors, num_layers=num_layers,
                 batch_first=True, drop_out=0.5, freeze=False).to(device)
    print(f'The model has {count_parameters(model):,} trainable parameters')

    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_func = nn.CrossEntropyLoss()

    train(train_iter, dev_iter, loss_func, optimizer, epochs, patience, clip)
    test(test_iter, loss_func)
