Task 3: Text Matching Based on the Attention Mechanism

Given two input sentences, determine the relationship between them. Follow ESIM (you may use only the LSTM part and ignore the Tree-LSTM), implemented with a bidirectional attention mechanism.

  1. References

    1. Neural Networks and Deep Learning (《神经网络与深度学习》), Chapter 7
    2. Reasoning about Entailment with Neural Attention https://arxiv.org/pdf/1509.06664v1.pdf
    3. Enhanced LSTM for Natural Language Inference https://arxiv.org/pdf/1609.06038v3.pdf
  2. Dataset: https://nlp.stanford.edu/projects/snli/
  3. Implementation requirement: PyTorch
  4. Key concepts:
    1. Attention mechanism
    2. token2token attention
  5. Time: two weeks

Tasks 3 and 4 essentially ask us to reproduce papers, and the process is somewhat rough going (the first step is always the hardest).
After eight hours of debugging, the whole thing finally ran end to end…
The moment it ran was genuinely satisfying, even though I had no idea how well it would perform.
In any case, training is extremely slow; a 1660 Ti really cannot keep up with several hundred thousand examples.

Without further ado, let's dive straight into ESIM.

0x0 Hybrid Neural Inference Models

We present here our natural language inference networks which are composed of the following major components: input encoding, local inference modeling, and inference composition.

(Figure: the overall architecture of the ESIM model)

0x1 Input Encoding

First, a bidirectional LSTM is used to encode the premise and the hypothesis.

Here BiLSTM learns to represent a word (e.g., a_i) and its context.

After BiLSTM encoding we obtain $\bar{a}$ and $\bar{b}$.
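
For reference, the ESIM paper writes this encoding step as:

$\bar{a}_i = \mathrm{BiLSTM}(a, i),\ \forall i \in [1, \ldots, l_a]$
$\bar{b}_j = \mathrm{BiLSTM}(b, j),\ \forall j \in [1, \ldots, l_b]$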

BiLSTM

import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda'  # global device used by the modules below


class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=128, dropout_rate=0.1, layer_num=1):
        super(BiLSTM, self).__init__()
        self.hidden_size = hidden_size
        if layer_num == 1:
            self.bilstm = nn.LSTM(input_size, hidden_size // 2, layer_num,
                                  batch_first=True, bidirectional=True)
        else:
            self.bilstm = nn.LSTM(input_size, hidden_size // 2, layer_num,
                                  batch_first=True, dropout=dropout_rate, bidirectional=True)
        self.init_weights()

    def init_weights(self):
        for p in self.bilstm.parameters():
            if p.dim() > 1:
                nn.init.normal_(p)
                p.data.mul_(0.01)
            else:
                p.data.zero_()
                # This is the range of indices for our forget gates for each LSTM cell
                p.data[self.hidden_size // 2: self.hidden_size] = 1

    def forward(self, x, lens):
        '''
        :param x: (batch, seq_len, input_size)
        :param lens: (batch, )
        :return: (batch, seq_len, hidden_size)
        '''
        # Sort by length, pack, run the LSTM, unpack, then restore the original order.
        ordered_lens, index = lens.sort(descending=True)
        ordered_x = x[index]
        packed_x = nn.utils.rnn.pack_padded_sequence(ordered_x, ordered_lens.cpu(), batch_first=True)
        packed_output, _ = self.bilstm(packed_x)
        output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
        recover_index = index.argsort()
        recover_output = output[recover_index]
        return recover_output
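
As a quick sanity check, here is a minimal CPU-only usage sketch for the wrapper above (all sizes are made up for illustration):

import torch

bilstm = BiLSTM(input_size=300, hidden_size=128)
x = torch.randn(4, 7, 300)          # (batch, seq_len, input_size)
lens = torch.tensor([7, 5, 3, 2])   # true (unpadded) length of each sequence
out = bilstm(x, lens)
print(out.shape)                    # torch.Size([4, 7, 128])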

Input_Encoding

class Input_Encoding(nn.Module):
    def __init__(self, num_features, hidden_size, embedding_size, vectors,
                 num_layers=1, batch_first=True, drop_out=0.5):
        super(Input_Encoding, self).__init__()
        self.num_features = num_features
        self.num_hidden = hidden_size
        self.embedding_size = embedding_size
        self.num_layers = num_layers
        self.dropout = nn.Dropout(drop_out)
        self.embedding = nn.Embedding.from_pretrained(vectors).cuda()
        self.bilstm = BiLSTM(embedding_size, hidden_size, drop_out, num_layers)

    def forward(self, x, lens):
        # x = torch.LongTensor(x)
        x = self.embedding(x)
        x = self.dropout(x)
        x = self.bilstm(x, lens)
        return x

0x2 Local Inference Modeling

Modeling local inference needs to employ some forms of hard or soft alignment to associate the relevant subcomponents between a premise and a hypothesis. This includes early methods motivated from the alignment in conventional automatic machine translation (MacCartney, 2009). In neural network models, this is often achieved with soft attention.

Locality of inference

Next, a soft attention mechanism is used, with dot-product scoring:

Local inference collected over sequences

As is well known, the attention mechanism is essentially just a weighted representation.

Attention is computed between every token of sentence a and every token of sentence b, producing the matrix e: $e_{ij}$ is the dot product between the $i$-th token of a and the $j$-th token of b.
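
In the ESIM paper's notation, the attention weights and the resulting soft alignments are:

$e_{ij} = \bar{a}_i^{\top} \bar{b}_j$
$\tilde{a}_i = \sum_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b} \exp(e_{ik})} \bar{b}_j,\ \forall i \in [1, \ldots, l_a]$
$\tilde{b}_j = \sum_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a} \exp(e_{kj})} \bar{a}_i,\ \forall j \in [1, \ldots, l_b]$

That is, $\tilde{a}_i$ re-expresses the $i$-th token of the premise as a weighted sum of the hypothesis tokens, and $\tilde{b}_j$ does the reverse.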

Enhancement of local inference information

In our models, we further enhance the local inference information collected. We compute the difference and the element-wise product for the tuple <ā, ã> as well as for <b̄, b̃>.

A somewhat heuristic-looking concatenation is then used:
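
Concretely, the paper defines the enhanced local inference representations as:

$m_a = [\bar{a}; \tilde{a}; \bar{a} - \tilde{a}; \bar{a} \odot \tilde{a}]$
$m_b = [\bar{b}; \tilde{b}; \bar{b} - \tilde{b}; \bar{b} \odot \tilde{b}]$

so each token representation grows from hidden_size to 4 * hidden_size, which is exactly the concatenation along the last dimension in the code below.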

Local_Inference_Modeling

class Local_Inference_Modeling(nn.Module):
    def __init__(self):
        super(Local_Inference_Modeling, self).__init__()
        self.softmax_1 = nn.Softmax(dim=1).to(device)
        self.softmax_2 = nn.Softmax(dim=2).to(device)

    def forward(self, a_, b_):
        '''
        :param a_: (batch, seq_len_A, hidden_size)
        :param b_: (batch, seq_len_B, hidden_size)
        :return: (batch, seq_len_A, hidden_size * 4), (batch, seq_len_B, hidden_size * 4)
        '''
        # Token-to-token attention scores: (batch, seq_len_A, seq_len_B)
        e = torch.matmul(a_, b_.transpose(1, 2)).to(device)

        # Softmax over dim 2 normalizes over b's tokens; over dim 1, over a's tokens.
        a_tilde = (self.softmax_2(e)).bmm(b_)
        b_tilde = (self.softmax_1(e).transpose(1, 2)).bmm(a_)

        m_a = torch.cat([a_, a_tilde, a_ - a_tilde, a_ * a_tilde], dim=-1)
        m_b = torch.cat([b_, b_tilde, b_ - b_tilde, b_ * b_tilde], dim=-1)

        return m_a, m_b
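
For intuition, here is a minimal, self-contained sketch of the same soft-alignment step on made-up tensors (shapes are arbitrary, CPU only):

import torch
import torch.nn.functional as F

a_bar = torch.randn(2, 5, 8)                    # (batch, seq_len_A, hidden)
b_bar = torch.randn(2, 6, 8)                    # (batch, seq_len_B, hidden)

e = torch.matmul(a_bar, b_bar.transpose(1, 2))  # (2, 5, 6) token-to-token scores
a_tilde = F.softmax(e, dim=2).bmm(b_bar)        # each a-token as a weighted sum of b-tokens
b_tilde = F.softmax(e, dim=1).transpose(1, 2).bmm(a_bar)

m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)
print(m_a.shape)                                # torch.Size([2, 5, 32])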

0x3 Inference Composition

To determine the overall inference relationship between a premise and hypothesis, we explore a composition layer to compose the enhanced local inference information m_a and m_b. We perform the composition sequentially or in its parse context using BiLSTM and tree-LSTM, respectively.

The composition layer

The encoding is the same as in the first part:

$v_{a,i} = \mathrm{BiLSTM}(m_a, i),\ \forall i \in [1, \ldots, l_a]$
$v_{b,j} = \mathrm{BiLSTM}(m_b, j),\ \forall j \in [1, \ldots, l_b]$

Then:

class Inference_Composition(nn.Module):
    def __init__(self, num_features, m_size, hidden_size, num_layers, embedding_size,
                 batch_first=True, drop_out=0.5):
        super(Inference_Composition, self).__init__()
        self.linear = nn.Linear(4 * hidden_size, hidden_size).to(device)
        self.bilstm = BiLSTM(hidden_size, hidden_size, drop_out, num_layers).to(device)
        self.drop_out = nn.Dropout(drop_out).to(device)

    def forward(self, x, lens):
        """
        :param x: (batch, seq_len, hidden_size * 4)
        :param lens: (batch, )
        :return: (batch, seq_len, hidden_size)
        """
        x = self.linear(x)  # project 4 * hidden_size back down to hidden_size
        x = self.drop_out(x)
        x = self.bilstm(x, lens)
        return x

0x4 Final Prediction

We then put v into a final multilayer perceptron (MLP) classifier. The MLP has a hidden layer with tanh activation and softmax output layer in our experiments. The entire model (all three components described above) is trained end-to-end. For training, we use multi-class cross-entropy loss.

I did not write out the formulas for this part; it is just a two-layer fully connected classifier applied to pooled sentence vectors:

The pooling here summarizes each sentence over its tokens: average and max pooling along the sequence dimension reduce (batch, seq_len, hidden_size) to (batch, hidden_size), so the concatenated vector fed to the MLP has size 4 * hidden_size.
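
For completeness, the paper defines the pooled vector fed to the classifier as:

$v_{a,\mathrm{ave}} = \sum_{i=1}^{l_a} \frac{v_{a,i}}{l_a}, \quad v_{a,\max} = \max_{i=1}^{l_a} v_{a,i}$
$v = [v_{a,\mathrm{ave}}; v_{a,\max}; v_{b,\mathrm{ave}}; v_{b,\max}]$

with $v_{b,\mathrm{ave}}$ and $v_{b,\max}$ defined analogously for the hypothesis.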

class Prediction(nn.Module):
    def __init__(self, v_size, mid_size, num_classes=4, drop_out=0.5):
        super(Prediction, self).__init__()
        self.mlp = nn.Sequential(nn.Linear(v_size, mid_size), nn.Tanh(),
                                 nn.Linear(mid_size, num_classes)).to(device)

    def forward(self, a, b):
        """
        :param a: (batch, seq_len_A, hidden_size)
        :param b: (batch, seq_len_B, hidden_size)
        :return: (batch, num_classes)
        """
        # Pool over the sequence (token) dimension so each sentence becomes a fixed
        # (batch, hidden_size) vector and the concatenation matches v_size = 4 * hidden_size.
        v_a_avg = F.avg_pool1d(a.transpose(1, 2), a.size(1)).squeeze(-1)
        v_a_max = F.max_pool1d(a.transpose(1, 2), a.size(1)).squeeze(-1)
        v_b_avg = F.avg_pool1d(b.transpose(1, 2), b.size(1)).squeeze(-1)
        v_b_max = F.max_pool1d(b.transpose(1, 2), b.size(1)).squeeze(-1)

        out_put = torch.cat((v_a_avg, v_a_max, v_b_avg, v_b_max), dim=-1)
        return self.mlp(out_put)

0x5 Overall inference models

Our model can be based only on the sequential networks by removing all tree components and we call it Enhanced Sequential Inference Model (ESIM) (see the left part of Figure 1). We will show that ESIM outperforms all previous results. We will also encode parse information with tree LSTMs in multiple layers as described (see the right side of Figure 1). We train this model and incorporate it into ESIM by averaging the predicted probabilities to get the final label for a premise-hypothesis pair. We will show that parsing information complements very well with ESIM and further improves the performance, and we call the final model Hybrid Inference Model (HIM).

class ESIM(nn.Module):
    def __init__(self, num_features, hidden_size, embedding_size, num_classes=4, vectors=None,
                 num_layers=1, batch_first=True, drop_out=0.5, freeze=False):
        super(ESIM, self).__init__()
        self.embedding_size = embedding_size
        self.input_encoding = Input_Encoding(num_features, hidden_size, embedding_size, vectors,
                                             num_layers=num_layers, batch_first=batch_first,
                                             drop_out=drop_out)
        self.local_inference = Local_Inference_Modeling()
        self.inference_composition = Inference_Composition(num_features, 4 * hidden_size, hidden_size,
                                                           num_layers, embedding_size=embedding_size,
                                                           batch_first=batch_first, drop_out=drop_out)
        self.prediction = Prediction(4 * hidden_size, hidden_size, num_classes, drop_out)

    def forward(self, a, len_a, b, len_b):
        a_bar = self.input_encoding(a, len_a)
        b_bar = self.input_encoding(b, len_b)

        m_a, m_b = self.local_inference(a_bar, b_bar)

        v_a = self.inference_composition(m_a, len_a)
        v_b = self.inference_composition(m_b, len_b)

        out_put = self.prediction(v_a, v_b)
        return out_put
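
A hypothetical smoke test for the assembled model, assuming a CUDA device is available (Input_Encoding moves its embedding with .cuda()) and using made-up vocabulary size, dimensions, and random pretrained vectors:

import torch

vocab_size, emb_dim, hidden = 1000, 50, 64
fake_vectors = torch.randn(vocab_size, emb_dim)          # stand-in for GloVe vectors
model = ESIM(num_features=vocab_size, hidden_size=hidden, embedding_size=emb_dim,
             num_classes=4, vectors=fake_vectors).to(device)

a = torch.randint(0, vocab_size, (2, 7)).to(device)      # premise token ids
b = torch.randint(0, vocab_size, (2, 5)).to(device)      # hypothesis token ids
len_a = torch.tensor([7, 4]).to(device)
len_b = torch.tensor([5, 5]).to(device)

print(model(a, len_a, b, len_b).shape)                   # torch.Size([2, 4])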

0x6 Experimental Setup

Data

from torchtext.legacy.data import Iterator, BucketIterator
from torchtext.legacy import data
import torch


def data_loader(batch_size=32, device="cuda", data_path='data', vectors=None):
    TEXT = data.Field(batch_first=True, include_lengths=True, lower=True)
    LABEL = data.LabelField(batch_first=True)
    TREE = None

    fields = {'sentence1': ('premise', TEXT),
              'sentence2': ('hypothesis', TEXT),
              'gold_label': ('label', LABEL)}

    train_data, dev_data, test_data = data.TabularDataset.splits(
        path=data_path,
        train='snli_1.0_train.jsonl',
        validation='snli_1.0_dev.jsonl',
        test='snli_1.0_test.jsonl',
        format='json',
        fields=fields,
        filter_pred=lambda ex: ex.label != '-')

    TEXT.build_vocab(train_data, vectors=vectors, unk_init=torch.Tensor.normal_)
    LABEL.build_vocab(dev_data)

    train_iter, dev_iter = BucketIterator.splits(
        (train_data, dev_data),
        batch_sizes=(batch_size, batch_size),
        device=device,
        sort_key=lambda x: len(x.premise) + len(x.hypothesis),
        sort_within_batch=True,
        repeat=False,
        shuffle=True)

    test_iter = Iterator(
        test_data,
        batch_size=batch_size,
        device=device,
        sort=False,
        sort_within_batch=False,
        repeat=False,
        shuffle=False)

    return train_iter, dev_iter, test_iter, TEXT, LABEL

Training

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.vocab import Vectors
from models import ESIM
from tqdm import tqdm
from utils import data_loader
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
batch_size = 36
hidden_size = 600  # each direction of the BiLSTM (forward and backward) has hidden size hidden_size // 2
epochs = 10
drop_out = 0.5
num_layers = 1
learning_rate = 4e-4
patience = 5
clip = 10
embedding_size = 300
device = 'cuda'
vectors = Vectors('glove.6B.300d.txt', 'C:/Users/Mechrevo/Desktop/nlp-beginner/embedding')
data_path = 'data'


def train(train_iter, dev_iter, loss_func, optimizer, epochs, patience=5, clip=5):
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        n = 0
        train_correct = 0
        train_error = 0
        for batch in tqdm(train_iter):
            premise, premise_lens = batch.premise
            hypothesis, hypothesis_lens = batch.hypothesis
            labels = batch.label

            model.zero_grad()
            output = model(premise, premise_lens, hypothesis, hypothesis_lens).to(device)
            loss = loss_func(output, labels)

            train_correct += (output.argmax(1) == labels).sum().item()
            train_error += (output.argmax(1) != labels).sum().item()
            total_loss += loss.item()
            n += batch_size

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            if n % 3600 == 0:
                print('epoch : {} step : {}, loss : {}, acc : {}'.format(
                    epoch + 1, int(n / 3600), total_loss / n,
                    train_correct / (train_error + train_correct)))

        tqdm.write("Epoch: {}, Train Average Loss: {}, Train Acc: {}".format(
            epoch + 1, total_loss / n, train_correct / (train_error + train_correct)))

        model.eval()
        with torch.no_grad():
            n2 = 0
            val_total_loss = 0
            dev_correct = 0
            dev_error = 0
            for batch in tqdm(dev_iter):
                premise, premise_lens = batch.premise
                hypothesis, hypothesis_lens = batch.hypothesis
                labels = batch.label

                output = model(premise, premise_lens, hypothesis, hypothesis_lens).to(device)
                loss = loss_func(output, labels)

                dev_correct += (output.argmax(1) == labels).sum().item()
                dev_error += (output.argmax(1) != labels).sum().item()
                val_total_loss += loss.item()
                n2 += batch_size

            tqdm.write("Epoch: {}, Validation Average Loss: {}, Validation Acc: {}".format(
                epoch + 1, val_total_loss / n2, dev_correct / (dev_error + dev_correct)))


def test(test_iter, loss_func):
    model.eval()
    with torch.no_grad():
        n = 0
        test_total_loss = 0
        test_correct = 0
        test_error = 0
        for batch in tqdm(test_iter):
            premise, premise_lens = batch.premise
            hypothesis, hypothesis_lens = batch.hypothesis
            labels = batch.label

            output = model(premise, premise_lens, hypothesis, hypothesis_lens).to(device)
            loss = loss_func(output, labels)

            test_correct += (output.argmax(1) == labels).sum().item()
            test_error += (output.argmax(1) != labels).sum().item()
            test_total_loss += loss.item()
            n += batch_size

        print('Test Acc: {}, loss : {}'.format(
            test_correct / (test_error + test_correct), test_total_loss / n))


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


if __name__ == '__main__':
    train_iter, dev_iter, test_iter, TEXT, LABEL = data_loader(batch_size, device, data_path, vectors)

    model = ESIM(num_features=len(TEXT.vocab), hidden_size=hidden_size, embedding_size=embedding_size,
                 num_classes=4, vectors=TEXT.vocab.vectors, num_layers=num_layers,
                 batch_first=True, drop_out=0.5, freeze=False).to(device)
    print(f'The model has {count_parameters(model):,} trainable parameters')

    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_func = nn.CrossEntropyLoss()

    train(train_iter, dev_iter, loss_func, optimizer, epochs, patience, clip)
    test(test_iter, loss_func)
