Overall Structure of the Paper

Historical significance of this paper:

1. It constructs several text classification datasets, advancing research on text classification.

2. It proposes the CharTextCNN method; because it uses only character-level information, it can be applied to many languages.

1. Abstract (an empirical study of the effectiveness of character-level convolutional networks for text classification; the model achieves strong results)

The abstract lays out what the paper does, in three parts: it empirically investigates the effectiveness of character-level convolutional networks, it constructs several large-scale text classification datasets, and it compares the model against baseline models.

2. Introduction (character-level features can be effectively extracted from raw signals such as images and speech; character-level information is also commonly used in natural language tasks, which this paper explores)

The introduction develops the background from two angles: convolutional neural networks and character-level features. It first covers text classification, the effectiveness of CNNs, and the case for using a character-level CNN for text classification, then the applications of text classification within NLP, and finally the use of character-level information.

3. Character-level Convolutional Networks (the character-level CNN model and a data augmentation method):

This section presents the convolution formula (restated below), quantizes the input with a character alphabet of size 70, and gives the network structure shown in the paper's figure: 6 convolutional layers and 3 fully connected layers, followed by the concrete layer parameters.
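As a reminder (restated here from the paper's Section 2.1; treat the exact notation as a paraphrase), the 1-D convolution between a kernel function $f(x)$ of width $k$ and an input function $g(x)$, read with stride $d$, is

$$h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c), \qquad c = k - d + 1,$$

where $c$ is an offset constant.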

The data augmentation part uses synonym replacement to enlarge the data: the synonyms of each word are ranked by semantic similarity, and replacements are sampled according to a probability distribution (see the sketch below).
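A minimal sketch of this augmentation, assuming a hypothetical SYNONYMS table (the paper builds its thesaurus from WordNet; both the number of words to replace and the synonym index are drawn from geometric distributions, so replacement is sparse and closer synonyms are favored):

import random

# Hypothetical synonym table: each word maps to synonyms ranked by semantic similarity.
SYNONYMS = {
    "good": ["great", "fine", "decent"],
    "film": ["movie", "picture"],
}

def augment(words, p=0.5):
    # Draw r with P(r) ~ p^r: the number of words to replace.
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    random.shuffle(candidates)
    r = 0
    while random.random() < p and r < len(candidates):
        r += 1
    out = list(words)
    for i in candidates[:r]:
        syns = SYNONYMS[words[i]]
        # Draw the synonym index s the same way, so closer synonyms are picked more often.
        s = 0
        while random.random() < p and s < len(syns) - 1:
            s += 1
        out[i] = syns[s]
    return out

print(augment("this film is good".split()))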

Strengths and weaknesses of the CharTextCNN model:

Weaknesses:

1. Character-level sequences are very long, which makes the model ill-suited to classifying long documents.

2. Because only character-level information is used, the model captures relatively little semantic information.

3. It performs poorly on small corpora.

Strengths:

1. The model structure is simple and works very well on large corpora (6 convolutional layers and 3 fully connected layers).

2. It can be applied to any language, with no need for word segmentation.

3. It performs well on noisy text, because out-of-vocabulary (OOV) words are essentially a non-issue (see the sketch below).
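To see why OOV is essentially a non-issue at the character level, consider this toy encoding (an illustration, not the paper's code): every string maps onto the fixed alphabet, and unknown characters simply yield empty positions.

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "  # toy alphabet; the paper's has 70 characters

def encode(text):
    # Each character maps to its alphabet index, or -1 (an all-zero row) if unknown;
    # misspelled or noisy words still produce usable features.
    return [alphabet.find(c) for c in text.lower()]

print(encode("gr8 moviee!!"))  # '!' falls outside the toy alphabet and maps to -1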

4. Comparison Models:

This section introduces the baseline classification models, including bag-of-words models and deep-learning-based models.
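For reference, the simplest of these baselines — bag-of-words features with a linear classifier — can be sketched with scikit-learn (a generic sketch, not the paper's exact pipeline or hyperparameters):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Two toy AG News-style samples (class ids made up for illustration)
texts = ["wall st. bears claw back into the black", "oil prices soar to all-time record"]
labels = [2, 0]

bow = make_pipeline(TfidfVectorizer(max_features=50000), LogisticRegression(max_iter=1000))
bow.fit(texts, labels)
print(bow.predict(["stocks outlook clouded by oil worries"]))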

5. Datasets and Results:

This section presents the text classification datasets and the experimental results: it first lists the size of each dataset, then describes each one in turn.

6. Discussion:

This section discusses the experimental results and some of the parameter settings.

The discussion centers on the comparison between models, chiefly as a function of dataset size and of whether the dataset's task is semantic or syntactic in nature.

7. Conclusion and Outlook

A summary of the paper and an outlook on future work.

Key points of the paper:

1. Convolutional neural networks can effectively extract key features.

2. Character-level features are effective for natural language processing.

3. The CharTextCNN model.

Innovations:

1. Proposes a new text classification model, CharTextCNN.

2. Introduces several large-scale text classification datasets.

3. Achieves the best, or highly competitive, results on several text classification datasets.

Takeaways:

1. CNN-based text classification requires no knowledge of a language's syntactic or semantic structure.

2. The experimental results show that no single machine learning model performs best across all datasets.

3. The paper analyzes, from an experimental standpoint, the applicability of character-level CNNs to text classification.

8. Code Implementation

""" 数据预处理  """# encoding = 'utf-8'import os
import torch
import json
import csvf = open("./data/AG/train.csv")
datas = csv.reader(f,delimiter=',',quotechar='"')
datas = list(datas)label,data,lowercase = [],[],Truefor row in datas:label.append(int(row[0])-1)text = " ".join(row[1:])if lowercase:text = text.lower()data.append(text)print(label[0:5])
print(data[0:5])[2, 2, 2, 2, 2]
["wall st. bears claw back into the black (reuters) reuters - short-sellers, wall street's dwindling\\band of ultra-cynics, are seeing green again.", 'carlyle looks toward commercial aerospace (reuters) reuters - private investment firm carlyle group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.', "oil and economy cloud stocks' outlook (reuters) reuters - soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.", 'iraq halts oil exports from main southern pipeline (reuters) reuters - authorities have halted oil export\\flows from the main pipeline in southern iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on saturday.', 'oil prices soar to all-time record, posing new menace to us economy (afp) afp - tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the us presidential elections.']with open("./data/alphabet.json") as f:alphabet = "".join(json.load(f))def char2Index(char):return alphabet.find(char)l0 = 1014
def ontHotEncode(idx):X = torch.zeros(l0,len(alphabet))for index_char,char in enumerate(data[idx]):if char2Index(char)!=-1:X[index_char][char2Index(char)] = 1.0return X
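A quick shape check of the encoder (hypothetical usage, assuming alphabet.json holds the paper's 70-character alphabet):

X = oneHotEncode(0)
print(X.shape)  # torch.Size([1014, 70]): one row per character position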

Model details:

""" 模型代码 """import torch
import torch.nn as nn
import numpy as npclass CharTextCNN(nn.Module):def __init__(self,config):super(CharTextCNN,self).__init__()in_features = [config.char_num] + config.features[0:-1]out_features = config.featureskernel_sizes = config.kernel_sizesself.convs = []self.conv1 = nn.Sequential(nn.Conv1d(in_features[0], out_features[0], kernel_size=kernel_sizes[0], stride=1), # 一维卷积nn.BatchNorm1d(out_features[0]), # bn层nn.ReLU(), # relu激活函数层nn.MaxPool1d(kernel_size=3, stride=3) #一维池化层) # 卷积+bn+relu+pooling模块self.conv2  = nn.Sequential(nn.Conv1d(in_features[1], out_features[1], kernel_size=kernel_sizes[1], stride=1),nn.BatchNorm1d(out_features[1]),nn.ReLU(),nn.MaxPool1d(kernel_size=3, stride=3))self.conv3 = nn.Sequential(nn.Conv1d(in_features[2], out_features[2], kernel_size=kernel_sizes[2], stride=1),nn.BatchNorm1d(out_features[2]),nn.ReLU())self.conv4 = nn.Sequential(nn.Conv1d(in_features[3], out_features[3], kernel_size=kernel_sizes[3], stride=1),nn.BatchNorm1d(out_features[3]),nn.ReLU())self.conv5 = nn.Sequential(nn.Conv1d(in_features[4], out_features[4], kernel_size=kernel_sizes[4], stride=1),nn.BatchNorm1d(out_features[4]),nn.ReLU())self.conv6 = nn.Sequential(nn.Conv1d(in_features[5], out_features[5], kernel_size=kernel_sizes[5], stride=1),nn.BatchNorm1d(out_features[5]),nn.ReLU(),nn.MaxPool1d(kernel_size=3, stride=3))self.fc1 = nn.Sequential(nn.Linear(8704, 1024), # 全连接层 #((l0-96)/27)*256nn.ReLU(),nn.Dropout(p=config.dropout) # dropout层) # 全连接+relu+dropout模块self.fc2 = nn.Sequential(nn.Linear(1024, 1024),nn.ReLU(),nn.Dropout(p=config.dropout))self.fc3 = nn.Linear(1024, config.num_classes)def forward(self, x):x = self.conv1(x)x = self.conv2(x)x = self.conv3(x)x = self.conv4(x)x = self.conv5(x)x = self.conv6(x)x = x.view(x.size(0), -1) x = self.fc1(x)x = self.fc2(x)x = self.fc3(x)return xclass config:def __init__(self):self.char_num = 70  # 字符的个数self.features = [256,256,256,256,256,256] # 每一层特征个数self.kernel_sizes = [7,7,3,3,3,3] # 每一层的卷积核尺寸self.dropout = 0.5 # dropout大小self.num_classes = 4 # 数据的类别个数config = config()
chartextcnn = CharTextCNN(config)
test = torch.zeros([64,70,1014])
out = chartextcnn(test)

from torchsummary import summary
summary(chartextcnn, input_size=(70, 1014))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv1d-1            [-1, 256, 1008]         125,696
       BatchNorm1d-2            [-1, 256, 1008]             512
              ReLU-3            [-1, 256, 1008]               0
         MaxPool1d-4             [-1, 256, 336]               0
            Conv1d-5             [-1, 256, 330]         459,008
       BatchNorm1d-6             [-1, 256, 330]             512
              ReLU-7             [-1, 256, 330]               0
         MaxPool1d-8             [-1, 256, 110]               0
            Conv1d-9             [-1, 256, 108]         196,864
      BatchNorm1d-10             [-1, 256, 108]             512
             ReLU-11             [-1, 256, 108]               0
           Conv1d-12             [-1, 256, 106]         196,864
      BatchNorm1d-13             [-1, 256, 106]             512
             ReLU-14             [-1, 256, 106]               0
           Conv1d-15             [-1, 256, 104]         196,864
      BatchNorm1d-16             [-1, 256, 104]             512
             ReLU-17             [-1, 256, 104]               0
           Conv1d-18             [-1, 256, 102]         196,864
      BatchNorm1d-19             [-1, 256, 102]             512
             ReLU-20             [-1, 256, 102]               0
        MaxPool1d-21              [-1, 256, 34]               0
           Linear-22                 [-1, 1024]       8,913,920
             ReLU-23                 [-1, 1024]               0
          Dropout-24                 [-1, 1024]               0
           Linear-25                 [-1, 1024]       1,049,600
             ReLU-26                 [-1, 1024]               0
          Dropout-27                 [-1, 1024]               0
           Linear-28                    [-1, 4]           4,100
================================================================
Total params: 11,342,852
Trainable params: 11,342,852
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.27
Forward/backward pass size (MB): 11.29
Params size (MB): 43.27
Estimated Total Size (MB): 54.83
----------------------------------------------------------------
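The 8704 input size of fc1 can be checked against the summary: each convolution (kernel sizes 7, 7, 3, 3, 3, 3, stride 1) shortens the sequence by k − 1, and each of the three max-pooling layers divides it by 3, giving 1014 → 1008 → 336 → 330 → 110 → 108 → 106 → 104 → 102 → 34. Flattening the final [256, 34] feature map yields 256 × 34 = 8704, which is exactly the ((l0 − 96) / 27) × 256 noted in the code.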
""" 训练模型部分 """import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import CharTextCNN
from data import AG_Data
from tqdm import tqdm
import numpy as np
import config as argumentparserconfig = argumentparser.ArgumentParser() # 读入参数设置
config.features = list(map(int,config.features.split(","))) # 将features用,分割,并且转成int
config.kernel_sizes = list(map(int,config.kernel_sizes.split(","))) # kernel_sizes,分割,并且转成int# 导入训练集
training_set = AG_Data(data_path="/AG/train.csv",l0=config.l0)
training_iter = torch.utils.data.DataLoader(dataset=training_set,batch_size=config.batch_size,shuffle=True,num_workers=0)# 导入测试集
test_set = AG_Data(data_path="/AG/test.csv",l0=config.l0)test_iter = torch.utils.data.DataLoader(dataset=test_set,batch_size=config.batch_size,shuffle=False,num_workers=0)model = CharTextCNN(config) # 初始化模型
criterion = nn.CrossEntropyLoss() # 构建loss结构
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate) #构建优化器
loss  = -1def get_test_result(data_iter,data_set):# 生成测试结果model.eval()data_loss = 0true_sample_num = 0for data, label in data_iter:if config.cuda and torch.cuda.is_available():data = data.cuda()label = label.cuda()else:data = torch.autograd.Variable(data).float()out = model(data)true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy()) # 得到一个batch的预测正确的样本个数acc = true_sample_num / data_set.__len__()return accfor epoch in range(config.epoch):model.train()process_bar = tqdm(training_iter)for data, label in process_bar:if config.cuda and torch.cuda.is_available():data = data.cuda()  # 如果使用gpu,将数据送进goulabel = label.cuda()else:data = torch.autograd.Variable(data).float()label = torch.autograd.Variable(label).squeeze()out = model(data)loss_now = criterion(out, autograd.Variable(label.long()))if loss == -1:loss = loss_now.data.item()else:loss = 0.95*loss+0.05*loss_now.data.item()  # 平滑操作process_bar.set_postfix(loss=loss_now.data.item()) # 输出loss,实时监测loss的大小process_bar.update()optimizer.zero_grad() # 梯度更新loss_now.backward()optimizer.step()
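Note that the script defines get_test_result but never calls it; a minimal way to track test accuracy (a sketch, not part of the original post) is to add at the end of each epoch:

    # inside the epoch loop, after the batch loop finishes:
    acc = get_test_result(test_iter, test_set)
    print("epoch %d, test acc: %.4f" % (epoch, acc))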
