Table of Contents

  • han_attention (bidirectional GRU + attention)
  • 1. File structure
  • 2. Corpus
  • 3. Data processing (IMDB_Data_Loader.py)
  • 4. Model (HAN_Model.py)
  • 5. Training and testing
  • Experimental results

han_attention (bidirectional GRU + attention)

Word encoding:

Word-level attention mechanism:

Sentence encoding:

Sentence-level attention mechanism:
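
The four stages above follow the standard HAN formulation of Yang et al. (2016); the equations below are a sketch of that formulation, where the context (query) vectors u_w and u_s correspond to word_context and sentence_context in the model code further down.

h_{it} = [\overrightarrow{\mathrm{GRU}}(x_{it});\ \overleftarrow{\mathrm{GRU}}(x_{it})], \quad x_{it} = W_e w_{it}

u_{it} = \tanh(W_w h_{it} + b_w), \quad
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_t \exp(u_{it}^{\top} u_w)}, \quad
s_i = \sum_t \alpha_{it} h_{it}

h_i = [\overrightarrow{\mathrm{GRU}}(s_i);\ \overleftarrow{\mathrm{GRU}}(s_i)]

u_i = \tanh(W_s h_i + b_s), \quad
\alpha_i = \frac{\exp(u_i^{\top} u_s)}{\sum_i \exp(u_i^{\top} u_s)}, \quad
v = \sum_i \alpha_i h_i, \quad
p = \mathrm{softmax}(W_c v + b_c)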

1. File structure
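
A sketch of the project layout, inferred from the paths and imports used in the code below; the exact file names data.py, model.py, and main.py are assumptions based on the from data import IMDB_Data and from model import HAN_Model statements.

.
├── data/
│   ├── imdb/
│   │   ├── imdb-train.txt.ss
│   │   └── imdb-test.txt.ss
│   └── imdb.model          # pretrained word2vec vectors (binary format)
├── data.py                 # IMDB_Data loader (section 3)
├── model.py                # HAN_Model (section 4)
├── config.py               # hyperparameters / argument parsing (sketched in section 5)
└── main.py                 # training and evaluation loop (section 5)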

2. Corpus

Dataset: http://ir.hit.edu.cn/~dytang/paper/emnlp2015/emnlp-2015-data.7z

Each line of the *.txt.ss files contains one review whose sentences are separated by the <sssss> marker, together with its rating label.

3. Data processing (IMDB_Data_Loader.py)

1. Load the dataset (sort by length, split into sentences)
2. Read the labels and the review text
3. Build word2id
   3.1 Count word frequencies
   3.2 Add <pad>:0, <unk>:1 and build word2id
4. Convert the data to ids (pad every sentence to max_sentence_length; pad every document in a batch to the same number of sentences)
5. Load pretrained word2vec embeddings for the vocabulary (get_word2vec)
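
As a rough sketch (the field layout is an assumption inferred from the split("\t\t") calls and the <sssss> sentence separator used in the loader below), one raw line and its processed form look roughly like this:

# one raw line of imdb-train.txt.ss (fields separated by double tabs):
#   user_id \t\t product_id \t\t rating \t\t sentence one <sssss> sentence two <sssss> ...
#
# after processing, a document becomes a list of fixed-length sentences of word ids:
#   [[w11, w12, ..., <pad>, <pad>],   # sentence 1, cut/padded to max_sentence_length
#    [w21, w22, ..., <pad>, <pad>],   # sentence 2
#    ...]                             # plus all-<pad> sentences so every document in
#                                     # a batch has the same number of sentences
# and the label is rating - 1.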

from gensim.models import KeyedVectors
from torch.utils import data
import os
import torch
import numpy as np
class IMDB_Data(data.Dataset):
    def __init__(self, data_name, min_count, word2id=None, max_sentence_length=100, batch_size=64, is_pretrain=True):
        self.path = os.path.abspath(".")
        if "data" not in self.path:
            self.path += "/data"
        self.data_name = "/imdb/" + data_name
        self.min_count = min_count
        self.word2id = word2id
        self.max_sentence_length = max_sentence_length
        self.batch_size = batch_size
        self.datas, self.labels = self.load_data()
        if is_pretrain:
            self.get_word2vec()
        else:
            self.weight = None
        for i in range(len(self.datas)):
            self.datas[i] = np.array(self.datas[i])

    # Load the dataset
    def load_data(self):
        datas = open(self.path + self.data_name, encoding="utf-8").read().splitlines()
        # fields are separated by double tabs; keep the tokenized review text plus its rating label
        datas = [data.split("\t\t")[-1].split() + [data.split("\t\t")[2]] for data in datas]
        # sort documents by length
        datas = sorted(datas, key=lambda x: len(x), reverse=True)
        labels = [int(data[-1]) - 1 for data in datas]
        datas = [data[0:-1] for data in datas]
        if self.word2id == None:
            self.get_word2id(datas)
        # split each document into sentences on the <sssss> marker
        for i, data in enumerate(datas):
            datas[i] = " ".join(data).split("<sssss>")
            for j, sentence in enumerate(datas[i]):
                datas[i][j] = sentence.split()
        datas = self.convert_data2id(datas)
        return datas, labels

    # Build word2id
    def get_word2id(self, datas):
        word_freq = {}
        for data in datas:
            for word in data:
                word_freq[word] = word_freq.get(word, 0) + 1
        word2id = {"<pad>": 0, "<unk>": 1}
        for word in word_freq:
            if word_freq[word] < self.min_count:
                continue
            else:
                word2id[word] = len(word2id)
        self.word2id = word2id

    # Convert the data to ids: every sentence gets the same length,
    # and every document within a batch gets the same number of sentences
    def convert_data2id(self, datas):
        for i, document in enumerate(datas):
            if i % 10000 == 0:
                print(i, len(datas))
            for j, sentence in enumerate(document):
                for k, word in enumerate(sentence):
                    datas[i][j][k] = self.word2id.get(word, self.word2id["<unk>"])
                datas[i][j] = datas[i][j][0:self.max_sentence_length] + \
                              [self.word2id["<pad>"]] * (self.max_sentence_length - len(datas[i][j]))
        for i in range(0, len(datas), self.batch_size):
            max_data_length = max([len(x) for x in datas[i:i + self.batch_size]])
            for j in range(i, min(i + self.batch_size, len(datas))):
                datas[j] = datas[j] + [[self.word2id["<pad>"]] * self.max_sentence_length] * (max_data_length - len(datas[j]))
        return datas

    # Load pretrained word2vec embeddings and build the embedding weight matrix
    def get_word2vec(self):
        print("Reading word2vec Embedding...")
        wvmodel = KeyedVectors.load_word2vec_format(self.path + "/imdb.model", binary=True)
        tmp = []
        for word, index in self.word2id.items():
            try:
                tmp.append(wvmodel.get_vector(word))
            except:
                pass
        mean = np.mean(np.array(tmp))
        std = np.std(np.array(tmp))
        print(mean, std)
        vocab_size = len(self.word2id)
        embed_size = 200
        np.random.seed(2)
        # initialize from a normal distribution, then overwrite rows that have a pretrained vector
        embedding_weights = np.random.normal(mean, std, [vocab_size, embed_size])
        for word, index in self.word2id.items():
            try:
                embedding_weights[index, :] = wvmodel.get_vector(word)
            except:
                pass
        self.weight = torch.from_numpy(embedding_weights).float()

    def __getitem__(self, idx):
        return self.datas[idx], self.labels[idx]

    def __len__(self):
        return len(self.labels)
if __name__=="__main__":imdb_data = IMDB_Data(data_name="imdb-train.txt.ss",min_count=5,is_pretrain=True)training_iter = torch.utils.data.DataLoader(dataset=imdb_data,batch_size=64,shuffle=False,num_workers=0)for data, label in training_iter:print (np.array(data).shape)

4. Model (HAN_Model.py)

import torch
import torch.nn as nn
import numpy as np
from torch.nn import functional as F
from torch.autograd import Variable
class HAN_Model(nn.Module):
    def __init__(self, vocab_size, embedding_size, gru_size, class_num, is_pretrain=False, weights=None):
        super(HAN_Model, self).__init__()
        if is_pretrain:
            self.embedding = nn.Embedding.from_pretrained(weights, freeze=False)
        else:
            self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.word_gru = nn.GRU(input_size=embedding_size, hidden_size=gru_size, num_layers=1,
                               bidirectional=True, batch_first=True)
        # word-level query vector u_w; nn.Parameter turns a plain tensor into a trainable parameter
        self.word_context = nn.Parameter(torch.Tensor(2 * gru_size, 1), requires_grad=True)
        self.word_dense = nn.Linear(2 * gru_size, 2 * gru_size)
        self.sentence_gru = nn.GRU(input_size=2 * gru_size, hidden_size=gru_size, num_layers=1,
                                   bidirectional=True, batch_first=True)
        # sentence-level query vector u_s
        self.sentence_context = nn.Parameter(torch.Tensor(2 * gru_size, 1), requires_grad=True)
        self.sentence_dense = nn.Linear(2 * gru_size, 2 * gru_size)
        self.fc = nn.Linear(2 * gru_size, class_num)

    def forward(self, x, gpu=False):
        sentence_num = x.shape[1]
        sentence_length = x.shape[2]
        x = x.view([-1, sentence_length])  # bs x sentence_num x sentence_length -> (bs*sentence_num) x sentence_length
        x_embedding = self.embedding(x)  # (bs*sentence_num) x sentence_length x embedding_size
        word_outputs, word_hidden = self.word_gru(x_embedding)  # (bs*sentence_num) x sentence_length x 2gru_size
        word_outputs_attention = torch.tanh(self.word_dense(word_outputs))  # (bs*sentence_num) x sentence_length x 2gru_size
        weights = torch.matmul(word_outputs_attention, self.word_context)  # (bs*sentence_num) x sentence_length x 1
        weights = F.softmax(weights, dim=1)  # (bs*sentence_num) x sentence_length x 1
        x = x.unsqueeze(2)  # (bs*sentence_num) x sentence_length x 1
        # mask: keep weights at real tokens, zero them out at <pad> positions
        if gpu:
            weights = torch.where(x != 0, weights, torch.full_like(x, 0, dtype=torch.float).cuda())
        else:
            weights = torch.where(x != 0, weights, torch.full_like(x, 0, dtype=torch.float))
        # renormalize so the weights of each sentence sum to 1
        weights = weights / (torch.sum(weights, dim=1).unsqueeze(1) + 1e-4)
        # weighted sum of word states -> sentence vectors: bs x sentence_num x 2gru_size
        sentence_vector = torch.sum(word_outputs * weights, dim=1).view([-1, sentence_num, word_outputs.shape[-1]])
        sentence_outputs, sentence_hidden = self.sentence_gru(sentence_vector)  # bs x sentence_num x 2gru_size
        attention_sentence_outputs = torch.tanh(self.sentence_dense(sentence_outputs))  # bs x sentence_num x 2gru_size
        weights = torch.matmul(attention_sentence_outputs, self.sentence_context)  # bs x sentence_num x 1
        weights = F.softmax(weights, dim=1)  # bs x sentence_num x 1
        x = x.view(-1, sentence_num, x.shape[1])  # bs x sentence_num x sentence_length
        x = torch.sum(x, dim=2).unsqueeze(2)  # bs x sentence_num x 1
        # mask: zero out the weights of all-<pad> sentences
        if gpu:
            weights = torch.where(x != 0, weights, torch.full_like(x, 0, dtype=torch.float).cuda())
        else:
            weights = torch.where(x != 0, weights, torch.full_like(x, 0, dtype=torch.float))
        weights = weights / (torch.sum(weights, dim=1).unsqueeze(1) + 1e-4)  # bs x sentence_num x 1
        # weighted sum of sentence states -> document vector: bs x 2gru_size
        document_vector = torch.sum(sentence_outputs * weights, dim=1)
        output = self.fc(document_vector)  # bs x class_num
        return output
han_model = HAN_Model(vocab_size=30000,embedding_size=200,gru_size=50,class_num=4)
x = torch.Tensor(np.zeros([64,50,100])).long()
x[0][0][0:10] = 1
output = han_model(x)
print (output.shape)
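
If everything is wired up correctly, this smoke test should print torch.Size([64, 4]): one score per class for each of the 64 dummy documents (50 sentences of 100 words each).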

5. Training and testing

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import HAN_Model
from data import IMDB_Data
import numpy as np
from tqdm import tqdm
import config as argumentparser
config = argumentparser.ArgumentParser()
torch.manual_seed(config.seed)
if config.cuda and torch.cuda.is_available():  # whether to use the GPU
    torch.cuda.set_device(config.gpu)

def get_test_result(data_iter, data_set):
    # evaluate on a dataset and return the accuracy
    model.eval()
    true_sample_num = 0
    for data, label in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        if config.cuda and torch.cuda.is_available():
            out = model(data, gpu=True)
        else:
            out = model(data)
        true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy())
    acc = true_sample_num / data_set.__len__()
    return acc
# Load the training set
training_set = IMDB_Data("imdb-train.txt.ss",min_count=config.min_count,max_sentence_length = config.max_sentence_length,batch_size=config.batch_size,is_pretrain=True)
training_iter = torch.utils.data.DataLoader(dataset=training_set,batch_size=config.batch_size,shuffle=False,num_workers=0)
# Load the test set (reusing the training vocabulary)
test_set = IMDB_Data("imdb-test.txt.ss",min_count=config.min_count,word2id=training_set.word2id,max_sentence_length = config.max_sentence_length,batch_size=config.batch_size)
test_iter = torch.utils.data.DataLoader(dataset=test_set,batch_size=config.batch_size,shuffle=False,num_workers=0)
if config.cuda and torch.cuda.is_available():
    training_set.weight = training_set.weight.cuda()
model = HAN_Model(vocab_size=len(training_set.word2id),
                  embedding_size=config.embedding_size,
                  gru_size=config.gru_size,
                  class_num=config.class_num,
                  weights=training_set.weight,
                  is_pretrain=False)
if config.cuda and torch.cuda.is_available():  # if using the GPU, move the model onto it
    model.cuda()
criterion = nn.CrossEntropyLoss()  # applies log-softmax internally
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
loss = -1
for epoch in range(config.epoch):
    model.train()
    process_bar = tqdm(training_iter)
    for data, label in process_bar:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
            label = torch.autograd.Variable(label).squeeze()
        if config.cuda and torch.cuda.is_available():
            out = model(data, gpu=True)
        else:
            out = model(data)
        loss_now = criterion(out, autograd.Variable(label.long()))
        if loss == -1:
            loss = loss_now.data.item()
        else:
            # exponential moving average of the loss, for display only
            loss = 0.95 * loss + 0.05 * loss_now.data.item()
        process_bar.set_postfix(loss=loss_now.data.item())
        process_bar.update()
        optimizer.zero_grad()
        loss_now.backward()
        optimizer.step()
    test_acc = get_test_result(test_iter, test_set)
    print("The test acc is: %.5f" % test_acc)
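
The training script imports a config module (import config as argumentparser) that is not listed in the post. Below is a minimal sketch of what it might look like, assuming argparse and using only the attribute names referenced above (seed, cuda, gpu, min_count, max_sentence_length, batch_size, embedding_size, gru_size, class_num, learning_rate, epoch); the default values are illustrative, not the author's.

# config.py -- a hypothetical sketch; defaults are illustrative only
import argparse

def ArgumentParser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=1, help="random seed")
    parser.add_argument("--cuda", action="store_true", help="use the GPU if available")
    parser.add_argument("--gpu", type=int, default=0, help="GPU device id")
    parser.add_argument("--min_count", type=int, default=5, help="discard words rarer than this")
    parser.add_argument("--max_sentence_length", type=int, default=100, help="max words per sentence")
    parser.add_argument("--batch_size", type=int, default=64, help="batch size")
    parser.add_argument("--embedding_size", type=int, default=200, help="word embedding size")
    parser.add_argument("--gru_size", type=int, default=50, help="GRU hidden size")
    parser.add_argument("--class_num", type=int, default=10, help="number of document classes")
    parser.add_argument("--learning_rate", type=float, default=0.001, help="learning rate")
    parser.add_argument("--epoch", type=int, default=20, help="number of training epochs")
    return parser.parse_args()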

Experimental results

After every epoch the script evaluates the model on the test set with get_test_result and prints the test accuracy.
