BERT Source Code (2): The Model

  • Model Training, Evaluation, and Prediction Flow
  • BERT Model
    • Transformer Model
    • BERT Model
  • BERT Model Code Walkthrough
  • References
Fine-tuning is launched with run_classifier.py, for example on MRPC:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

Model Training, Evaluation, and Prediction Flow

The flow is similar to the pre-training flow described in the previous article.

  1. One of the four processors (ColaProcessor, MnliProcessor, MrpcProcessor, XnliProcessor) parses its task's input format, extracting the sentence text_a and its paired sentence text_b, which are then tokenized with FullTokenizer. The resulting examples are written to a tfrecord file with TFRecordWriter.
  2. Define the input function via file_based_input_fn_builder.
  3. Define the RunConfig configuration run_config.
  4. Build model_fn: it initializes the BERT model from the given init_checkpoint, creates the AdamWeightDecayOptimizer, and returns a TPUEstimatorSpec.
  5. Combine model_fn, run_config, and the input functions in a TPUEstimator to run training, evaluation, and prediction (condensed in the sketch below).
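
The sketch below condenses how run_classifier.py wires these five steps together. Argument lists are abbreviated and values such as train_file and num_train_steps are placeholders computed earlier in the script, so read it as an outline of the flow rather than a complete configuration.

run_config = tf.contrib.tpu.RunConfig(
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)

model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=FLAGS.use_tpu,
    use_one_hot_embeddings=FLAGS.use_tpu)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size,
    predict_batch_size=FLAGS.predict_batch_size)

# input_fn reading the tfrecord file produced in step 1
train_input_fn = file_based_input_fn_builder(
    input_file=train_file,
    seq_length=FLAGS.max_seq_length,
    is_training=True,
    drop_remainder=True)

estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
# estimator.evaluate(...) and estimator.predict(...) follow the same pattern,
# each with its own input_fn from file_based_input_fn_builder.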

BERT Model

Let's start with a few architecture diagrams:

Transformer Model

[Figure: the Transformer encoder-decoder architecture, from Attention Is All You Need]
The figure above shows the Transformer architecture. The Transformer is a seq2seq model made up of an encoder and a decoder. The encoder stacks Nx identical blocks, each consisting of a multi-head self-attention layer followed by a feed-forward layer. The decoder also stacks Nx blocks, each consisting of a masked multi-head self-attention over the output side, an encoder-decoder multi-head attention, and a feed-forward layer in series. Every sub-layer in both the encoder and the decoder is wrapped in a residual (shortcut) connection plus layer normalization. On the input and output sides, the raw Inputs and Outputs first pass through an embedding layer, to which a position embedding is added. Relative to an RNN, every pair of positions is effectively at distance 1 from each other, which avoids the RNN long-range dependency problem, but it also discards word order, so position embeddings are added to compensate for the lost position information.
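
For reference, the original Transformer uses fixed sinusoidal position encodings; BERT instead learns its position embeddings, as the code later in this post shows. A minimal numpy sketch of the sinusoidal scheme (not part of the BERT code, just to make the idea concrete):

import numpy as np

def sinusoidal_position_encoding(seq_length, d_model):
  # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
  # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
  positions = np.arange(seq_length)[:, None]        # [seq_length, 1]
  dims = np.arange(d_model)[None, :]                # [1, d_model]
  angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
  angles = positions * angle_rates                  # [seq_length, d_model]
  pe = np.zeros((seq_length, d_model), dtype=np.float32)
  pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions
  pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions
  return pe                                         # added to the token embeddings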

BERT Model

BERT's input embedding is the sum of three parts: a token embedding, a segment embedding, and a position embedding (a toy illustration follows below). It introduces the deep bidirectional masked LM and next sentence prediction objectives, both of which were covered in detail with code in the previous article.
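
A toy numpy illustration of that summation, with made-up sizes and ids; the real implementation is embedding_lookup followed by embedding_postprocessor, shown below:

import numpy as np

vocab_size, type_vocab_size, max_pos, hidden = 30522, 2, 512, 768
rng = np.random.RandomState(0)
word_table = rng.randn(vocab_size, hidden) * 0.02          # token embeddings
segment_table = rng.randn(type_vocab_size, hidden) * 0.02  # segment embeddings
position_table = rng.randn(max_pos, hidden) * 0.02         # learned position embeddings

input_ids = np.array([[101, 7592, 102]])   # [batch, seq_len], illustrative ids
segment_ids = np.array([[0, 0, 0]])        # all from sentence A

seq_len = input_ids.shape[1]
embeddings = (word_table[input_ids]                      # token
              + segment_table[segment_ids]               # + segment
              + position_table[np.arange(seq_len)])      # + position (broadcast over batch)
# BERT then applies layer normalization and dropout to this sum.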

BERT Model Code Walkthrough

Let's walk through the code step by step:

with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # Input embedding.
    # Perform embedding lookup on the word ids.
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)

    # Add positional embeddings and token type embeddings, then layer
    # normalize and perform dropout.
    # 1. Add the position embedding and the token type (segment) embedding.
    self.embedding_output = embedding_postprocessor(
        input_tensor=self.embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        token_type_embedding_name="token_type_embeddings",
        use_position_embeddings=True,
        position_embedding_name="position_embeddings",
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)

  with tf.variable_scope("encoder"):
    # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
    # mask of shape [batch_size, seq_length, seq_length] which is used
    # for the attention scores.
    attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

    # Run the stacked transformer.
    # `sequence_output` shape = [batch_size, seq_length, hidden_size].
    # 2. Build the Transformer encoder.
    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        intermediate_act_fn=get_activation(config.hidden_act),
        hidden_dropout_prob=config.hidden_dropout_prob,
        attention_probs_dropout_prob=config.attention_probs_dropout_prob,
        initializer_range=config.initializer_range,
        do_return_all_layers=True)

  # `sequence_output` (the final hidden layer) is what the masked LM head uses.
  self.sequence_output = self.all_encoder_layers[-1]
  # The "pooler" converts the encoded sequence tensor of shape
  # [batch_size, seq_length, hidden_size] to a tensor of shape
  # [batch_size, hidden_size]. This is necessary for segment-level
  # (or segment-pair-level) classification tasks where we need a fixed
  # dimensional representation of the segment.
  # `pooled_output` is what sentence-level classification uses: the embedding
  # of the first token ([CLS]) of the final hidden layer.
  with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained.
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,
        activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))
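
As a usage sketch, adapted from the example in modeling.py's docstring (num_attention_heads is chosen here so that it divides hidden_size; the tiny constant inputs stand in for the real tfrecord pipeline):

input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                             num_hidden_layers=8, num_attention_heads=8,
                             intermediate_size=1024)

model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=token_type_ids)

sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size]
pooled_output = model.get_pooled_output()      # [batch_size, hidden_size], the [CLS] vector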

Walkthrough of embedding_postprocessor, which adds the extra embeddings to the input:

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Add the token type (segment) embedding.
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Add the position embedding.
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  # Apply layer normalization and dropout.
  output = layer_norm_and_dropout(output, dropout_prob)
  return output

Walkthrough of transformer_model:

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  # Number of stacked blocks Nx: num_hidden_layers.
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      # The attention block.
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        # Dropout, then the residual (shortcut) connection and layer norm.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      # Project up to intermediate_size.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      # Project back to hidden_size, with another residual connection and layer norm.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      # Reshape back to input_shape.
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
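
The default intermediate_act_fn here is gelu. For completeness, a self-contained sketch of the tanh approximation of the GELU activation (BERT's gelu is equivalent to this approximation; imports of numpy as np and tensorflow as tf are assumed):

def gelu(x):
  # Gaussian Error Linear Unit, tanh approximation:
  # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
  cdf = 0.5 * (1.0 + tf.tanh(
      np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
  return x * cdf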

The attention layer computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q$ is the query, $K$ the key, and $V$ the value, with $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, and $d_k$ the dimension of the queries and keys. The product $QK^{T}$ measures the pairwise affinity between positions in the sequence, and dividing by $\sqrt{d_k}$ keeps the dot products from growing too large (which would push the softmax into a region of tiny gradients).
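
Before the full multi-head implementation, here is a minimal single-head numpy sketch of the formula (no mask, no dropout), just to make the shapes concrete:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
  # Q: [n, d_k], K: [m, d_k], V: [m, d_v]
  d_k = Q.shape[-1]
  scores = Q @ K.T / np.sqrt(d_k)                          # [n, m] pairwise affinities
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
  return weights @ V                                       # [n, d_v]

Q = np.random.randn(4, 8)    # n=4 query positions, d_k=8
K = np.random.randn(6, 8)    # m=6 key positions
V = np.random.randn(6, 16)   # d_v=16
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 16)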

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length (the input-side tensor)
  #   T = `to_tensor` sequence length (the output-side tensor)
  #   N = `num_attention_heads`, the number of attention heads
  #   H = `size_per_head`, the size of each attention head

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  # Q: project from_tensor to num_attention_heads * size_per_head.
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  # K: project to_tensor to num_attention_heads * size_per_head.
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  # V: project to_tensor to num_attention_heads * size_per_head.
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  # Rearrange the dimensions.
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  # Compute QK^T / sqrt(d_k).
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    # Positions with attention_mask == 1 get 0 added and stay unchanged;
    # positions with attention_mask == 0 get -10000 added, so their softmax
    # output is effectively 0.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  # Compute the softmax.
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  # Apply dropout.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  # Multiply the softmax output by V: [B, N, F, T] x [B, N, T, H] = [B, N, F, H].
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  # Rearrange the dimensions.
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    # Concatenate the num_attention_heads heads together.
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
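
To see why adding -10000.0 works as a mask, a tiny numpy check with illustrative values: after the softmax, the masked position gets essentially zero weight and the remaining positions renormalize among themselves.

import numpy as np

scores = np.array([2.0, 1.0, 0.5, 3.0])   # raw attention scores for one query position
mask = np.array([1.0, 1.0, 0.0, 1.0])     # 1 = attend, 0 = masked (e.g. padding)
adder = (1.0 - mask) * -10000.0           # 0 where we attend, -10000 where masked
masked_scores = scores + adder

probs = np.exp(masked_scores - masked_scores.max())
probs /= probs.sum()
print(probs)   # the third entry is ~0; the others sum to 1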

References

  1. Vaswani et al. (2017). Attention Is All You Need.
  2. Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
