BERT Source Code (2): The Model

  • Model Training, Evaluation, and Prediction Flow
  • BERT Model
    • Transformer Model
    • BERT Model
  • BERT Model Code Walkthrough
  • References
Fine-tuning is launched with run_classifier.py, for example on MRPC:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

Model Training, Evaluation, and Prediction Flow

The flow is similar to the pre-training flow described in the previous article.

  1. One of the four processors (ColaProcessor, MnliProcessor, MrpcProcessor, XnliProcessor) parses its task's input format, extracting the sentence text_a and its paired sentence text_b, which are then tokenized with FullTokenizer. The resulting examples are written to a tfrecord file with TFRecordWriter.
  2. Define the input function via file_based_input_fn_builder.
  3. Define the RunConfig configuration run_config.
  4. Build model_fn: it initializes the BERT model from the given init_checkpoint, creates the AdamWeightDecayOptimizer, and returns a TPUEstimatorSpec.
  5. Combine model_fn, run_config, and the input functions in a TPUEstimator to run training, evaluation, and prediction (condensed in the sketch below).
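
The sketch below condenses how run_classifier.py wires these five steps together. Argument lists are abbreviated and values such as train_file and num_train_steps are placeholders computed earlier in the script, so read it as an outline of the flow rather than a complete configuration.

run_config = tf.contrib.tpu.RunConfig(
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=FLAGS.use_tpu,
    use_one_hot_embeddings=FLAGS.use_tpu)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size,
    predict_batch_size=FLAGS.predict_batch_size)

# input_fn reading the tfrecord file produced in step 1
train_input_fn = file_based_input_fn_builder(
    input_file=train_file,
    seq_length=FLAGS.max_seq_length,
    is_training=True,
    drop_remainder=True)

estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
# estimator.evaluate(...) and estimator.predict(...) follow the same pattern,
# each with its own input_fn from file_based_input_fn_builder.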

BERT Model

Let's start with a few architecture diagrams:

Transformer Model

[Figure: the Transformer encoder-decoder architecture, from Attention Is All You Need]
The figure above shows the Transformer architecture. The Transformer is a seq2seq model made up of an encoder and a decoder. The encoder stacks Nx identical blocks, each consisting of a multi-head self-attention layer followed by a feed-forward layer. The decoder also stacks Nx blocks, each consisting of a masked multi-head self-attention over the output side, an encoder-decoder multi-head attention, and a feed-forward layer in series. Every sub-layer in both the encoder and the decoder is wrapped in a residual (shortcut) connection plus layer normalization. On the input and output sides, the raw Inputs and Outputs first pass through an embedding layer, to which a position embedding is added. Relative to an RNN, every pair of positions is effectively at distance 1 from each other, which avoids the RNN long-range dependency problem, but it also discards word order, so position embeddings are added to compensate for the lost position information.
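
For reference, the original Transformer uses fixed sinusoidal position encodings; BERT instead learns its position embeddings, as the code later in this post shows. A minimal numpy sketch of the sinusoidal scheme (not part of the BERT code, just to make the idea concrete):

import numpy as np

def sinusoidal_position_encoding(seq_length, d_model):
  # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
  # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
  positions = np.arange(seq_length)[:, None]        # [seq_length, 1]
  dims = np.arange(d_model)[None, :]                # [1, d_model]
  angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
  angles = positions * angle_rates                  # [seq_length, d_model]
  pe = np.zeros((seq_length, d_model), dtype=np.float32)
  pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions
  pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions
  return pe                                         # added to the token embeddings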

BERT Model

BERT's input embedding is the sum of three parts: a token embedding, a segment embedding, and a position embedding (a toy illustration follows below). It introduces the deep bidirectional masked LM and next sentence prediction objectives, both of which were covered in detail with code in the previous article.
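
A toy numpy illustration of that summation, with made-up sizes and ids; the real implementation is embedding_lookup followed by embedding_postprocessor, shown below:

import numpy as np

vocab_size, type_vocab_size, max_pos, hidden = 30522, 2, 512, 768
rng = np.random.RandomState(0)
word_table = rng.randn(vocab_size, hidden) * 0.02          # token embeddings
segment_table = rng.randn(type_vocab_size, hidden) * 0.02  # segment embeddings
position_table = rng.randn(max_pos, hidden) * 0.02         # learned position embeddings

input_ids = np.array([[101, 7592, 102]])   # [batch, seq_len], illustrative ids
segment_ids = np.array([[0, 0, 0]])        # all from sentence A

seq_len = input_ids.shape[1]
embeddings = (word_table[input_ids]                      # token
              + segment_table[segment_ids]               # + segment
              + position_table[np.arange(seq_len)])      # + position (broadcast over batch)
# BERT then applies layer normalization and dropout to this sum.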

BERT Model Code Walkthrough

Let's walk through the code step by step:

with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # Input embedding.
    # Perform embedding lookup on the word ids.
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)

    # Add positional embeddings and token type embeddings, then layer
    # normalize and perform dropout.
    # 1. Add the position embedding and the token type (segment) embedding.
    self.embedding_output = embedding_postprocessor(
        input_tensor=self.embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        token_type_embedding_name="token_type_embeddings",
        use_position_embeddings=True,
        position_embedding_name="position_embeddings",
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)

  with tf.variable_scope("encoder"):
    # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
    # mask of shape [batch_size, seq_length, seq_length] which is used
    # for the attention scores.
    attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

    # Run the stacked transformer.
    # `sequence_output` shape = [batch_size, seq_length, hidden_size].
    # 2. Build the Transformer encoder.
    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        intermediate_act_fn=get_activation(config.hidden_act),
        hidden_dropout_prob=config.hidden_dropout_prob,
        attention_probs_dropout_prob=config.attention_probs_dropout_prob,
        initializer_range=config.initializer_range,
        do_return_all_layers=True)

  # `sequence_output` (the final hidden layer) is what the masked LM head uses.
  self.sequence_output = self.all_encoder_layers[-1]
  # The "pooler" converts the encoded sequence tensor of shape
  # [batch_size, seq_length, hidden_size] to a tensor of shape
  # [batch_size, hidden_size]. This is necessary for segment-level
  # (or segment-pair-level) classification tasks where we need a fixed
  # dimensional representation of the segment.
  # `pooled_output` is what sentence-level classification uses: the embedding
  # of the first token ([CLS]) of the final hidden layer.
  with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained.
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,
        activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))
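
As a usage sketch, adapted from the example in modeling.py's docstring (num_attention_heads is chosen here so that it divides hidden_size; the tiny constant inputs stand in for the real tfrecord pipeline):

input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                             num_hidden_layers=8, num_attention_heads=8,
                             intermediate_size=1024)

model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=token_type_ids)

sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size]
pooled_output = model.get_pooled_output()      # [batch_size, hidden_size], the [CLS] vector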

Walkthrough of embedding_postprocessor, which adds the extra embeddings to the input:

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Add the token type (segment) embedding.
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Add the position embedding.
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  # Apply layer normalization and dropout.
  output = layer_norm_and_dropout(output, dropout_prob)
  return output

Walkthrough of transformer_model:

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  # Number of stacked blocks Nx: num_hidden_layers.
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      # The attention block.
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        # Dropout, then the residual (shortcut) connection and layer norm.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      # Project up to intermediate_size.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      # Project back to hidden_size, with another residual connection and layer norm.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      # Reshape back to input_shape.
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
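
The default intermediate_act_fn here is gelu. For completeness, a self-contained sketch of the tanh approximation of the GELU activation (BERT's gelu is equivalent to this approximation; imports of numpy as np and tensorflow as tf are assumed):

def gelu(x):
  # Gaussian Error Linear Unit, tanh approximation:
  # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
  cdf = 0.5 * (1.0 + tf.tanh(
      np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
  return x * cdf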

The attention layer computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q$ is the query, $K$ the key, and $V$ the value, with $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, and $d_k$ the dimension of the queries and keys. The product $QK^{T}$ measures the pairwise affinity between positions in the sequence, and dividing by $\sqrt{d_k}$ keeps the dot products from growing too large (which would push the softmax into a region of tiny gradients).
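
Before the full multi-head implementation, here is a minimal single-head numpy sketch of the formula (no mask, no dropout), just to make the shapes concrete:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
  # Q: [n, d_k], K: [m, d_k], V: [m, d_v]
  d_k = Q.shape[-1]
  scores = Q @ K.T / np.sqrt(d_k)                          # [n, m] pairwise affinities
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
  return weights @ V                                       # [n, d_v]

Q = np.random.randn(4, 8)    # n=4 query positions, d_k=8
K = np.random.randn(6, 8)    # m=6 key positions
V = np.random.randn(6, 16)   # d_v=16
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 16)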

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length (the input-side tensor)
  #   T = `to_tensor` sequence length (the output-side tensor)
  #   N = `num_attention_heads`, the number of attention heads
  #   H = `size_per_head`, the size of each attention head

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  # Q: project from_tensor to num_attention_heads * size_per_head.
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  # K: project to_tensor to num_attention_heads * size_per_head.
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  # V: project to_tensor to num_attention_heads * size_per_head.
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  # Rearrange the dimensions.
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  # Compute QK^T / sqrt(d_k).
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    # Positions with attention_mask == 1 get 0 added and stay unchanged;
    # positions with attention_mask == 0 get -10000 added, so their softmax
    # output is effectively 0.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  # Compute the softmax.
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  # Apply dropout.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  # Multiply the softmax output by V: [B, N, F, T] x [B, N, T, H] = [B, N, F, H].
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  # Rearrange the dimensions.
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    # Concatenate the num_attention_heads heads together.
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
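
To see why adding -10000.0 works as a mask, a tiny numpy check with illustrative values: after the softmax, the masked position gets essentially zero weight and the remaining positions renormalize among themselves.

import numpy as np

scores = np.array([2.0, 1.0, 0.5, 3.0])   # raw attention scores for one query position
mask = np.array([1.0, 1.0, 0.0, 1.0])     # 1 = attend, 0 = masked (e.g. padding)
adder = (1.0 - mask) * -10000.0           # 0 where we attend, -10000 where masked
masked_scores = scores + adder

probs = np.exp(masked_scores - masked_scores.max())
probs /= probs.sum()
print(probs)   # the third entry is ~0; the others sum to 1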

References

  1. Vaswani et al. (2017). Attention Is All You Need.
  2. Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
