ZEN: A Chinese Encoder Based on N-grams

ZEN

N-Gram

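1. N-gram extraction

N-gram extraction takes two steps. First, an N-gram lexicon is built from the existing corpus based on frequency. Note that these N-grams may contain one another: for example, the lexicon can hold both 粤港澳 (Guangdong-Hong Kong-Macao) and 港澳 (Hong Kong-Macao). Second, the lexicon is used to generate the N-gram matrix for the training data, as shown in the figure below.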
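The first step can be sketched in a few lines of Python. This is only an illustration of frequency-based lexicon building, not the ZEN preprocessing script: the function name `build_ngram_lexicon`, the parameters `max_n` and `min_freq`, the choice to start at 2-grams, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def build_ngram_lexicon(corpus, max_n=4, min_freq=2):
    """Count character n-grams (2 <= n <= max_n) and keep the frequent ones.

    Hypothetical helper for illustration; the real lexicon may be built
    with different length limits and thresholds.
    """
    counts = Counter()
    for sentence in corpus:
        for n in range(2, max_n + 1):
            for start in range(len(sentence) - n + 1):
                counts[sentence[start:start + n]] += 1
    # Drop every n-gram whose frequency is below the threshold.
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


# Toy corpus: three short "sentences".
corpus = ["粤港澳大湾区", "粤港澳合作", "港澳居民"]
lexicon = build_ngram_lexicon(corpus, max_n=3, min_freq=2)
print(lexicon)
```

On this toy corpus the surviving entries include both 粤港澳 and 港澳, which shows how overlapping N-grams end up in the same lexicon.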

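The N-gram matrix $M$ is a $k_c \times k_n$ matrix, where $k_c$ is the number of characters in the sentence and $k_n$ is the number of N-grams that can be extracted from it. Each entry $m_{ij}$ indicates whether the $i$-th character belongs to the $j$-th N-gram.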

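The N-gram matrix is generated in a very straightforward way. The code lives in examples.utils_sequence_level_tasks, in the function convert_examples_to_features. This function tokenizes the input batch, converts the tokens to word ids, processes the labels, and encodes the N-grams. We skip the other steps here and focus on the logic for the N-gram matrix.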

# ----------- code for ngram BEGIN-----------

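Note that ngram_dict is generated in advance. For each sentence, we first enumerate every possible span to find all candidate N-grams, recording the length and start position of each. ngram_positions_matrix is the N-gram matrix we need: a max_seq_length × max_ngram_in_seq matrix, where max_seq_length is the length of the input token sequence and max_ngram_in_seq is the maximum number of N-grams kept per sentence (128 by default). The matrix is then filled in entry by entry. Note that when a word is masked, its N-grams are no longer considered either.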
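The logic described above can be sketched as follows. This is a simplified stand-in, not the code from convert_examples_to_features: the function name `build_ngram_position_matrix`, the small default sizes, and the toy `ngram_dict` are assumptions, and details such as special tokens, masking, and how surplus N-grams are dropped are left out.

```python
def build_ngram_position_matrix(tokens, ngram_dict, max_seq_length=8,
                                max_ngram_in_seq=4, max_n=4):
    """Return (ngram_ids, matrix) for one tokenized sentence.

    matrix[i][j] == 1 means token i is covered by the j-th matched n-gram.
    Hypothetical helper; plain truncation is used when there are too many
    matches, which may differ from the original code.
    """
    tokens = tokens[:max_seq_length]
    # Enumerate every span and record (ngram_id, start, length) for hits.
    matches = []
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            gram = "".join(tokens[start:start + n])
            if gram in ngram_dict:
                matches.append((ngram_dict[gram], start, n))
    matches = matches[:max_ngram_in_seq]

    # Fill the max_seq_length x max_ngram_in_seq matrix entry by entry.
    matrix = [[0] * max_ngram_in_seq for _ in range(max_seq_length)]
    for col, (_, start, length) in enumerate(matches):
        for row in range(start, start + length):
            matrix[row][col] = 1
    ngram_ids = [ngram_id for ngram_id, _, _ in matches]
    return ngram_ids, matrix


tokens = list("粤港澳大湾区")
ngram_dict = {"粤港澳": 1, "港澳": 2, "大湾区": 3}  # toy lexicon, ids assumed
ngram_ids, matrix = build_ngram_position_matrix(tokens, ngram_dict)
for row in matrix:
    print(row)
```

Rows for 港 and 澳 contain two ones each, because both characters belong to 港澳 and to 粤港澳; padding rows and unused columns stay zero.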

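2. N-gram encoding

The structure of the N-gram encoder is shown in the figure below. The paper encodes N-grams with a multi-layer Transformer. Because the order of the N-grams does not need to be considered, position encoding is omitted. The N-gram encoder contributes substantially to model performance: it can learn the important phrases in a sentence, which in turn improves the model. The N-gram embedding fed into it can be thought of as a word embedding.

In the code, N-gram embeddings are built in much the same way as word embeddings. Below are ZEN's word embedding and N-gram embedding modules, respectively.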

    import torch
    from torch import nn

    # BertLayerNorm is defined elsewhere in the ZEN codebase.


    class BertEmbeddings(nn.Module):
        """Construct the embeddings from word, position and token_type embeddings."""

        def __init__(self, config):
            super(BertEmbeddings, self).__init__()
            self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
            self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
            self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

            # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
            # any TensorFlow checkpoint file
            self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
            self.dropout = nn.Dropout(config.hidden_dropout_prob)

        def forward(self, input_ids, token_type_ids=None):
            seq_length = input_ids.size(1)
            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
            position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
            if token_type_ids is None:
                token_type_ids = torch.zeros_like(input_ids)

            words_embeddings = self.word_embeddings(input_ids)
            position_embeddings = self.position_embeddings(position_ids)
            token_type_embeddings = self.token_type_embeddings(token_type_ids)

            embeddings = words_embeddings + position_embeddings + token_type_embeddings
            embeddings = self.LayerNorm(embeddings)
            embeddings = self.dropout(embeddings)
            return embeddings


    class BertWordEmbeddings(nn.Module):
        """Construct the embeddings from ngram, position and token_type embeddings."""

        def __init__(self, config):
            super(BertWordEmbeddings, self).__init__()
            # Note: unlike BertEmbeddings, no position embedding table is created here.
            self.word_embeddings = nn.Embedding(config.word_size, config.hidden_size, padding_idx=0)
            self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

            # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
            # any TensorFlow checkpoint file
            self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
            self.dropout = nn.Dropout(config.hidden_dropout_prob)

        def forward(self, input_ids, token_type_ids=None):
            if token_type_ids is None:
                token_type_ids = torch.zeros_like(input_ids)

            words_embeddings = self.word_embeddings(input_ids)
            token_type_embeddings = self.token_type_embeddings(token_type_ids)

            embeddings = words_embeddings + token_type_embeddings
            embeddings = self.LayerNorm(embeddings)
            embeddings = self.dropout(embeddings)
            return embeddings

3. Pre-training with N-grams

The model architecture is shown below.

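ZEN encodes each character together with its associated N-grams. How are the two combined? Simply by adding their representations:

$$\upsilon_i^{(l)*} = \upsilon_i^{(l)} + \sum_k \mu_{i,k}^{(l)}$$

$\upsilon_i^{(l)}$ is the hidden output of the $i$-th character at layer $l$ of the character encoder.
$\mu_{i,k}^{(l)}$ is the $k$-th N-gram associated with the $i$-th character at layer $l$. Note that a character can be contained in several N-grams, for example 粤港澳大湾区 and 港澳.

For the whole layer-$l$ encoder, this enhancement can then be written as

$$V^{(l)*} = V^{(l)} + M \times U^{(l)}$$

$V^{(l)}$ is the embedding matrix of this layer.
$U^{(l)}$ is the matrix of N-gram representations associated with the characters.
$M$ is the N-gram matrix.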

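Note that if a character is masked, its N-grams are not added.

The code for the ZEN Encoder is as follows; hidden_states has the N-gram representations, after they pass through their own attention layers, added to it.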
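A tiny worked example may make the addition concrete. The sketch below uses plain Python lists and made-up numbers. The function name `enhance` and the values are assumptions for illustration only: V holds the character hidden states, U the N-gram hidden states, and M the N-gram matrix, so the result is V plus M times U.

```python
def enhance(V, M, U):
    """Return V + M x U, where V is k_c x d, M is k_c x k_n and U is k_n x d."""
    hidden = len(V[0])
    return [
        [V[i][d] + sum(M[i][j] * U[j][d] for j in range(len(U)))
         for d in range(hidden)]
        for i in range(len(V))
    ]


# Three characters, hidden size 2 (toy values).
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
# Two N-grams, hidden size 2 (toy values).
U = [[10.0, 20.0],
     [1.0, 2.0]]
# Character 0 is in N-gram 0, character 1 is in both N-grams,
# character 2 has an all-zero row (e.g. it is masked), so nothing is added.
M = [[1, 0],
     [1, 1],
     [0, 0]]

print(enhance(V, M, U))
```

The last row comes back unchanged, which matches the rule that a masked character receives no N-gram contribution.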

    import copy

    import torch
    from torch import nn

    # BertLayer is defined elsewhere in the ZEN codebase.


    class ZenEncoder(nn.Module):
        def __init__(self, config, output_attentions=False, keep_multihead_output=False):
            super(ZenEncoder, self).__init__()
            self.output_attentions = output_attentions
            layer = BertLayer(config, output_attentions=output_attentions,
                              keep_multihead_output=keep_multihead_output)
            # Character encoder layers and N-gram encoder layers
            self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])
            self.word_layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_word_layers)])
            self.num_hidden_word_layers = config.num_hidden_word_layers

        def forward(self, hidden_states, ngram_hidden_states, ngram_position_matrix, attention_mask,
                    ngram_attention_mask,
                    output_all_encoded_layers=True, head_mask=None):
            # Need to check what is the attention masking doing here
            all_encoder_layers = []
            all_attentions = []
            num_hidden_ngram_layers = self.num_hidden_word_layers
            for i, layer_module in enumerate(self.layer):
                hidden_states = layer_module(hidden_states, attention_mask, head_mask[i])
                if i < num_hidden_ngram_layers:
                    ngram_hidden_states = self.word_layers[i](ngram_hidden_states, ngram_attention_mask, head_mask[i])
                    if self.output_attentions:
                        ngram_attentions, ngram_hidden_states = ngram_hidden_states
                if self.output_attentions:
                    attentions, hidden_states = hidden_states
                    all_attentions.append(attentions)
                # Add the N-gram representations to the characters they cover
                hidden_states += torch.bmm(ngram_position_matrix.float(), ngram_hidden_states.float())
                if output_all_encoded_layers:
                    all_encoder_layers.append(hidden_states)
            if not output_all_encoded_layers:
                all_encoder_layers.append(hidden_states)
            if self.output_attentions:
                return all_attentions, all_encoder_layers
            return all_encoder_layers

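Experimental results

1. Experimental setup

The paper uses Chinese Wikipedia as the corpus. Data cleaning removes punctuation, converts the text to simplified Chinese, and lowercases English letters.

The N-gram lexicon is built from the training corpus: N-grams are sorted by frequency and a threshold is set, and N-grams whose frequency falls below the threshold are removed. The final lexicon contains between roughly 64,000 and 179,000 N-grams. N-gram embeddings are randomly initialized. The model architecture is the same as BERT: 12 layers, 12 attention heads per multi-head attention layer, and a hidden size of 768. Pre-training also follows BERT, using the MLM and NSP tasks.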

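2. Results

The results are shown in the figure below, where R means the model parameters are randomly initialized, P means they are initialized from Google's BERT model, B denotes BERT Base, and L denotes BERT Large. ZEN achieves comparatively strong results across these settings.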

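1. Pre-training on a small corpus

Most current pre-trained models are trained on large datasets, but in some domains large-scale data is hard to collect. The paper therefore samples 1/10 of the Wikipedia corpus for pre-training, with randomly initialized parameters. ZEN performs slightly better than BERT on this small dataset, probably because the N-grams enhance the embeddings. This suggests that ZEN has an edge over BERT in small-data scenarios.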

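2. Convergence speed

The figure below shows ZEN's performance on the CWS (Chinese word segmentation) and SA (sentiment analysis) tasks at different training epochs. At the same epoch ZEN outperforms BERT, and ZEN also converges faster.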

3. N-gram Threshold

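The paper analyzes the frequency threshold used for N-gram extraction and finds that a threshold between 10 and 20 works best. It also analyzes the maximum number of N-grams used and finds that performance improves somewhat as the number of N-grams increases.

4. Heatmap analysis

The paper also presents a heatmap analysis of the N-grams in the encoder. The figure below shows the weight of each N-gram in layers 1 to 7 for two sentences. "Meaningful" N-grams receive higher weights than "meaningless" ones: for example, 提高 (improve) and 波士顿 (Boston) are weighted higher than 会提高 (will improve) and 士顿 (a fragment of "Boston"). This indicates that ZEN attends to semantics among the N-grams and selects the more appropriate phrases. Longer phrases also receive larger weights in the higher layers, which suggests that they matter more for the model's understanding of the sentence.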