[NLP-CNN] Convolutional Neural Networks for Sentence Classification -2014-EMNLP

1. Overview

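This paper applies CNNs to sentence classification.

(1) Static vectors + CNN already achieve very good results; => this suggests that pre-trained vectors are universal feature extractors that can be used across a variety of classification tasks.

(2) Vectors fine-tuned for the specific task + CNN achieve even better results.

(3) The model architecture is modified so that both task-specific and static vectors can be used.

(4) The model achieves SOTA results on 4 of the 7 tasks.

Thoughts: The core idea of a convolutional neural network is to capture local features. Images have local correlation, so a CNN is a well-suited feature extractor in the image domain. In NLP, an n-gram of text can be viewed as a segment with related features, i.e. a window, so the hope is that a CNN can capture the local features inside this sliding window. The advantage of a CNN is that it can combine and filter such n-gram features to obtain semantic information at different levels of abstraction.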

2. Model

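Three points about this model deserve attention:

1. How the CNN is applied, i.e. how a CNN is used on text

2. How static and fine-tuned vectors are combined in one architecture

3. The regularization strategy

The paper's approach is fairly simple.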

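2.1 Applying the CNN

<1> Obtaining the feature map

Each word vector is k-dimensional and the sentence length is n (padded). The sentence is represented as the simple concatenation of its word vectors, $x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$, which forms the leftmost rectangle in Figure 1.

A convolution kernel is applied to a window of h words. The words inside a window of size h form an h * k representation, so a kernel w of the same h * k dimension is convolved with it.

Adding a bias b and applying a nonlinearity f then gives the extracted feature $c_i = f(w \cdot x_{i:i+h-1} + b)$.

This $c_i$ is the feature produced by one kernel on one window. For a sentence of length n, the window can slide to n - h + 1 positions, which yields a feature map $c = [c_1, c_2, \dots, c_{n-h+1}]$.

So $c$ has dimension n - h + 1. This is the feature map for a single kernel; to extract multiple kinds of features, several kernels can be used, and their sizes, i.e. h, can differ.

This process corresponds to the two leftmost parts of Figure 1.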
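The feature-map computation above can be sketched in plain Python. This is only an illustrative sketch: the function names, the random toy values, and the choice of ReLU as the nonlinearity f are assumptions made here, not code from the paper.

```python
import random

def relu(v):
    return max(0.0, v)

def feature_map(sentence, kernel, bias, h):
    """Slide a window of h words over the sentence and return [c_1, ..., c_{n-h+1}].

    sentence: list of n word vectors, each a list of k floats
    kernel:   list of h rows of k floats (same shape as one window)
    """
    n = len(sentence)
    c = []
    for i in range(n - h + 1):
        window = sentence[i:i + h]
        # Dot product between the h*k window and the h*k kernel
        s = sum(w_val * x_val
                for w_row, x_row in zip(kernel, window)
                for w_val, x_val in zip(w_row, x_row))
        c.append(relu(s + bias))
    return c

random.seed(0)
n, k, h = 7, 4, 3  # toy sizes
sentence = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
kernel = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(h)]
c = feature_map(sentence, kernel, bias=0.1, h=h)
print(len(c))  # 5, i.e. n - h + 1
```

Each kernel yields one such list; using several kernels with different h gives feature maps of different lengths for the same sentence.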

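<2> max pooling

The max pooling here is called max-over-time pooling. The "over time" part works as follows: as shown in the figure, the maximum value of each feature map is selected and placed into the pooled vector. Each feature map is produced by sliding the window along the text sequence, so the pooled value for a kernel is the response of whichever window (a sub-sequence of the sentence) activates that kernel most strongly. The sentence is ordered, and so are the sliding windows, hence "over time".

This is written as $\hat{c} = \max\{c\}$, the maximum value for the corresponding filter.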
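A minimal sketch of max-over-time pooling, with made-up feature-map values, shows why the pooled vector has a fixed size regardless of sentence length or window size. The function name is an assumption for illustration.

```python
def max_over_time(feature_maps):
    """Keep the largest activation of each feature map, wherever it occurs."""
    return [max(c) for c in feature_maps]

# Hypothetical feature maps from three filters (h = 3, 4, 5) on a 7-word sentence
feature_maps = [
    [0.1, 0.9, 0.0, 0.3, 0.2],  # h = 3 -> 5 values
    [0.0, 0.4, 0.7, 0.1],       # h = 4 -> 4 values
    [0.5, 0.2, 0.0],            # h = 5 -> 3 values
]
pooled = max_over_time(feature_maps)
print(pooled)  # [0.9, 0.7, 0.5]
```

The feature maps have different lengths, but pooling leaves exactly one number per filter.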

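<3> Fully connected layer

A fully connected softmax layer then maps the information extracted by the previous layers to the classification categories, giving a probability distribution over the labels.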
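As a rough sketch of this last step, the pooled features can be passed through a linear layer followed by softmax. The weights, biases, and feature values below are invented for illustration only.

```python
import math

def fully_connected(z, weights, biases):
    """weights holds one row of len(z) floats per class."""
    return [sum(w * x for w, x in zip(row, z)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.9, 0.7, 0.5]            # pooled features (hypothetical values)
weights = [[1.0, -0.5, 0.2],   # class 0
           [-0.3, 0.8, 0.1]]   # class 1
biases = [0.0, 0.1]
probs = softmax(fully_connected(z, weights, biases))
print(probs)  # two probabilities that sum to 1
```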

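2.2 Combining static and fine-tuned vectors

In the paper, the sentence is represented with static and fine-tuned vectors as two channels. As the leftmost part of Figure 1 shows, there are two matrices, which are the sentence representations built from the static and the fine-tuned vectors respectively. For example, the first row of the front matrix is the static vector of the word "wait", and the first row of the back matrix is the fine-tuned vector of "wait".

How is the information from the two combined?

The paper's strategy is again simple: the same kernel is applied to both channels, and the two resulting values are added together to form the entry of the feature map. The feature map in Figure 1 is therefore a combination of the features extracted from both kinds of vectors.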
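The two-channel combination can be sketched as follows. The helper names and toy vectors are assumptions for illustration, and the sketch assumes the two channel responses are summed before the bias and nonlinearity. Note that a standard two-input-channel convolution layer in a deep learning framework usually learns separate weights per channel, which is not the same as sharing one kernel.

```python
def window_dot(window, kernel):
    return sum(w * x
               for w_row, x_row in zip(kernel, window)
               for w, x in zip(w_row, x_row))

def multichannel_feature_map(static_sent, tuned_sent, kernel, bias, h):
    """Apply one shared kernel to both channels and add the two responses."""
    n = len(static_sent)
    c = []
    for i in range(n - h + 1):
        s = (window_dot(static_sent[i:i + h], kernel)
             + window_dot(tuned_sent[i:i + h], kernel))
        c.append(max(0.0, s + bias))  # ReLU
    return c

# Toy 3-word sentence with k = 2; the second word differs between the channels
static_sent = [[1, 0], [0, 1], [1, 1]]
tuned_sent = [[1, 0], [0, 2], [1, 1]]
kernel = [[1, 1], [1, 1]]  # h = 2
c = multichannel_feature_map(static_sent, tuned_sent, kernel, bias=0.0, h=2)
print(c)  # [5.0, 7.0]
```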

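2.3 Regularization strategy

Hinton et al. proposed dropout to prevent co-adaptation of hidden units. This paper applies the same regularization to the penultimate layer, i.e. the output of max pooling.

Suppose there are m feature maps, and write $z = [\hat{c}_1, \dots, \hat{c}_m]$.

Without dropout, the linear mapping gives:

$y = w \cdot z + b$

With dropout, it becomes:

$y = w \cdot (z \circ r) + b$

Here $\circ$ is element-wise multiplication and $r$ is an m-dimensional mask vector whose entries are 0 or 1; each entry is a Bernoulli random variable that equals 1 with probability p.

During backpropagation, gradients flow only through the units whose mask is 1.

At test time, the weight vector w is scaled by p, i.e. $\hat{w} = pw$, and $\hat{w}$ is used without dropout to score unseen data.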

In addition, an $l_2$-norm constraint can be placed on $w$: whenever $||w||_2 > s$ after a gradient descent step, $w$ is rescaled so that $||w||_2 = s$.
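A minimal sketch of this norm constraint, assuming it is applied to one weight vector at a time after each update (the function name is illustrative):

```python
import math

def clip_l2_norm(w, s):
    """Rescale w to have l2 norm s if its norm exceeds s; otherwise leave it unchanged."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm > s:
        scale = s / norm
        return [v * scale for v in w]
    return list(w)

w = [3.0, 4.0]                    # l2 norm is 5
clipped = clip_l2_norm(w, s=3.0)  # direction is kept, norm becomes 3
small = clip_l2_norm([1.0, 2.0], s=3.0)  # norm below s, so unchanged
```

Unlike an $l_2$ penalty added to the loss, this only takes effect once the norm crosses the threshold s.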

3. Datasets and Experimental Setup

3.1 Datasets:

1. MR: Movie reviews with one sentence per review. The task is to classify reviews as positive or negative (Pang and Lee, 2005).

2. SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).

3. SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

4. Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004)

5. TREC: TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002)

6. CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004)

7. MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).

3.2 Hyperparameters and Training

Activation function: ReLU

window (h): 3, 4, 5, with 100 feature maps each

dropout p = 0.5

l2(s) = 3

mini-batch size = 50

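The hyperparameters above were chosen by grid search on the SST-2 dev set.

Training uses stochastic gradient descent over shuffled mini-batches with the Adadelta update rule.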
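For reference, the Adadelta update keeps running averages of squared gradients and squared updates and uses their ratio as a per-parameter step size. The sketch below applies it to a single scalar parameter on a toy quadratic; the decay rate rho = 0.95 and eps = 1e-6 are common defaults assumed here, not values taken from these notes.

```python
import math

def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update for a single scalar parameter x."""
    # Running average of squared gradients
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad * grad
    # Step scaled by RMS of past updates over RMS of gradients
    delta = -math.sqrt(state["Edx2"] + eps) / math.sqrt(state["Eg2"] + eps) * grad
    # Running average of squared updates
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta * delta
    return x + delta

# Toy problem: minimize f(x) = (x - 3)^2 starting from x = 0
x = 0.0
state = {"Eg2": 0.0, "Edx2": 0.0}
start_loss = (x - 3) ** 2
for _ in range(2000):
    grad = 2 * (x - 3)
    x = adadelta_step(x, grad, state)
end_loss = (x - 3) ** 2
print(start_loss, end_loss)  # the loss after the updates is lower than at the start
```

No global learning rate appears in the update, which is one reason the rule is convenient when little tuning is done.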

For datasets without a standard dev set, 10% of the training data is randomly selected as the dev set.
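A random 10% hold-out of this kind can be sketched with the standard library. The function name, the seed, and the toy examples are assumptions for illustration.

```python
import random

def split_train_dev(examples, dev_fraction=0.1, seed=0):
    """Shuffle indices once and hold out dev_fraction of the examples as a dev set."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_dev = int(len(examples) * dev_fraction)
    dev = [examples[i] for i in indices[:n_dev]]
    train = [examples[i] for i in indices[n_dev:]]
    return train, dev

# Hypothetical (sentence, label) pairs
examples = [(f"sentence {i}", i % 2) for i in range(1000)]
train, dev = split_train_dev(examples)
print(len(train), len(dev))  # 900 100
```

Fixing the seed keeps the split reproducible across runs, so different model variants are compared on the same dev set.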

3.3 Pre-trained Word Vectors

word2vec vectors that were trained on 100 billion words from Google News

3.4 Model Variations

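The model variants in the paper mainly test how the initialization of the word vectors affects performance.

CNN-rand: all word vectors are randomly initialized and learned during training

CNN-static: initialized with pre-trained word2vec vectors, which are kept fixed

CNN-non-static: initialized with pre-trained vectors, which are fine-tuned for the specific task

CNN-multichannel: combines static and fine-tuned vectors, each as one channel

Results: the latter three improve over the fully random variant on all 7 datasets.

Moreover, this simple CNN performs very close to more complex models that rely on parse trees and the like, and it achieves SOTA on several datasets, including SST-2 and CR.

The multichannel method was proposed in the hope of improving results by preventing overfitting, but the experiments show no clear overall advantage; on some datasets it performs worse than the other variants.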

4. Code

1. Theano (the paper's own implementation): yoonkim/CNN_sentence: https://github.com/yoonkim/CNN_sentence

2. Tensorflow: dennybritz/cnn-text-classification-tf: https://github.com/dennybritz/cnn-text-classification-tf

3. Keras: alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras

4. Pytorch: Shawn1993/cnn-text-classification-pytorch: https://github.com/Shawn1993/cnn-text-classification-pytorch

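I tried it on MR: the best eval accuracy was 73%, lower than the 77.5% reported on GitHub and the 76.1% in the paper;

I tried it on SST: the best eval accuracy was 37%, lower than the 37.2% reported on GitHub and the 45.0% in the paper.

The code of model.py is shown here: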

import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN_Text(nn.Module):

    def __init__(self, args):
        super(CNN_Text, self).__init__()
        self.args = args

        V = args.embed_num      # vocabulary size
        D = args.embed_dim      # embedding dimension k
        C = args.class_num      # number of classes
        Ci = 1                  # input channels
        Co = args.kernel_num    # feature maps per window size
        Ks = args.kernel_sizes  # window sizes h

        self.embed = nn.Embedding(V, D)
        # nn.ModuleList (not a plain list) so the conv parameters are registered
        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])
        # Equivalent explicit form for Ks = [3, 4, 5]:
        # self.conv13 = nn.Conv2d(Ci, Co, (3, D))
        # self.conv14 = nn.Conv2d(Ci, Co, (4, D))
        # self.conv15 = nn.Conv2d(Ci, Co, (5, D))
        self.dropout = nn.Dropout(args.dropout)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)  # (N, Co, W-K+1)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)  # (N, Co)
        return x

    def forward(self, x):
        x = self.embed(x)  # (N, W, D)

        if self.args.static:
            # Detach so that no gradient reaches the embedding weights
            x = x.detach()

        x = x.unsqueeze(1)  # (N, Ci, W, D)

        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N, Co, W-K+1), ...]*len(Ks)

        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N, Co), ...]*len(Ks)

        x = torch.cat(x, 1)

        # Equivalent explicit form using conv_and_pool:
        # x1 = self.conv_and_pool(x, self.conv13)  # (N, Co)
        # x2 = self.conv_and_pool(x, self.conv14)  # (N, Co)
        # x3 = self.conv_and_pool(x, self.conv15)  # (N, Co)
        # x = torch.cat((x1, x2, x3), 1)  # (N, len(Ks)*Co)
        x = self.dropout(x)  # (N, len(Ks)*Co)
        logit = self.fc1(x)  # (N, C)
        return logit

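5. Pytorch: prakashpandey9/Text-Classification-Pytorch: https://github.com/prakashpandey9/Text-Classification-Pytorch

Note that the CNN in this repository's models is a simple implementation of the paper, but its main.py needs to be modified.

It uses the IMDB dataset, whose labels are 1 and 2, whereas pytorch requires targets in the range 0 <= t < n_classes when computing the loss. The labels (1, 2) therefore have to be converted to (0, 1); otherwise an error is raised: "Assertion `t >= 0 && t < n_classes` failed."

This can be done by changing label 2 to 0: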

target = (target != 2)
target = target.long()

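Because the loss function used in this code is cross_entropy, the target must be converted to long type.

For convenience, the complete modified main.py is shown here; the hyperparameters in it can be changed as needed.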

import load_data
import torch
import torch.nn.functional as F
import numpy as np
from models.LSTM import LSTMClassifier
from models.CNN import CNN

TEXT, vocab_size, word_embeddings, train_iter, valid_iter, test_iter = load_data.load_dataset()

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def clip_gradient(model, clip_value):
    # Clamp every gradient element to [-clip_value, clip_value]
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)

def train_model(model, train_iter, epoch):
    total_epoch_loss = 0
    total_epoch_acc = 0

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate)
    steps = 0
    model.train()
    for idx, batch in enumerate(train_iter):
        text = batch.text[0]
        target = batch.label
        # Remap label 2 to 0 to avoid "Assertion `t >= 0 && t < n_classes` failed."
        target = (target != 2)
        target = target.long()

        text = text.to(device)
        target = target.to(device)

        if text.size()[0] != batch_size:  # One of the batches returned by BucketIterator has a different length.
            continue
        optimizer.zero_grad()
        prediction = model(text)
        loss = loss_fn(prediction, target)

        num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
        acc = 100.0 * num_corrects / len(batch)

        loss.backward()
        clip_gradient(model, 1e-1)
        optimizer.step()
        steps += 1

        if steps % 100 == 0:
            print(f'Epoch: {epoch+1}, Idx: {idx+1}, Training Loss: {loss.item():.4f}, Training Accuracy: {acc.item(): .2f}%')

        total_epoch_loss += loss.item()
        total_epoch_acc += acc.item()

    return total_epoch_loss / len(train_iter), total_epoch_acc / len(train_iter)

def eval_model(model, val_iter):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for idx, batch in enumerate(val_iter):
            text = batch.text[0]
            if text.size()[0] != batch_size:
                continue
            target = batch.label
            # Same label remapping as in training
            target = (target != 2)
            target = target.long()

            text = text.to(device)
            target = target.to(device)

            prediction = model(text)
            loss = loss_fn(prediction, target)
            num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
            acc = 100.0 * num_corrects / len(batch)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss / len(val_iter), total_epoch_acc / len(val_iter)


# Hyperparameters (the original LSTM settings were learning_rate = 2e-5, hidden_size = 256)
learning_rate = 1e-3
batch_size = 32
output_size = 2
embedding_length = 300

# model = LSTMClassifier(batch_size, output_size, hidden_size, vocab_size, embedding_length, word_embeddings)

model = CNN(batch_size=batch_size, output_size=output_size, in_channels=1, out_channels=100, kernel_heights=[3, 4, 5], stride=1, padding=0, keep_probab=0.5, vocab_size=vocab_size, embedding_length=embedding_length, weights=word_embeddings)

model.to(device)

loss_fn = F.cross_entropy

for epoch in range(1):
    train_loss, train_acc = train_model(model, train_iter, epoch)
    val_loss, val_acc = eval_model(model, valid_iter)

    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:.3f}, Val. Acc: {val_acc:.2f}%')

test_loss, test_acc = eval_model(model, test_iter)
print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.2f}%')

# Let us now predict the sentiment on a single sentence just for testing purposes.
test_sen1 = "This is one of the best creation of Nolan. I can say, it's his magnum opus. Loved the soundtrack and especially those creative dialogues."
test_sen2 = "Ohh, such a ridiculous movie. Not gonna recommend it to anyone. Complete waste of time and money."

test_sen1 = TEXT.preprocess(test_sen1)
test_sen1 = [[TEXT.vocab.stoi[x] for x in test_sen1]]

test_sen2 = TEXT.preprocess(test_sen2)
test_sen2 = [[TEXT.vocab.stoi[x] for x in test_sen2]]

test_sen = np.asarray(test_sen2)
test_tensor = torch.LongTensor(test_sen).to(device)

model.eval()
with torch.no_grad():
    output = model(test_tensor, 1)
out = F.softmax(output, 1)

if torch.argmax(out[0]) == 0:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")



Reposted from: https://www.cnblogs.com/shiyublog/p/11210504.html