Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.

本文链接:https://blog.csdn.net/weixin_39470744/article/details/84401339

Table of Contents

- Preface
- Source code walkthrough
  - Model configuration parameters
  - BertModel
  - word embedding
  - embedding_postprocessor
  - Transformer
  - self_attention
- Using the model

Preface

The BERT model is built on the Transformer architecture (paper: Attention Is All You Need). It discards recurrent structures such as RNNs and handles sequence-to-sequence problems purely with attention, a nice example of simplicity winning out. There are plenty of write-ups on this model online, but most of them say the same thing; I recommend a Zhihu walkthrough of "Attention Is All You Need", which I think explains the Transformer very well.
The most confusing part of the model is keeping track of tensor shapes; once the shapes are clear, the model is easy to understand, so in the source code below I annotate the shape of the tensor after every operation.
Let's look at how the BERT model in modeling.py is built. I have always found reading the code and its comments to be the fastest way to understand a model, so wherever the official comments are hard to follow, pay attention to the added shape annotations.

Source code walkthrough

Model configuration parameters

```
"attention_probs_dropout_prob": 0.1,   # dropout applied to the attention probabilities after the softmax
"hidden_act": "gelu",                  # activation function
"hidden_dropout_prob": 0.1,            # dropout probability for the hidden layers
"hidden_size": 768,                    # hidden size
"initializer_range": 0.02,             # initializer range (stddev of the truncated normal)
"intermediate_size": 3072,             # feed-forward "up-projection" size
"max_position_embeddings": 512,        # must be at least seq_length; used to build the position embeddings
"num_attention_heads": 12,             # number of attention heads in each hidden layer
"num_hidden_layers": 12,               # number of hidden (encoder) layers
"type_vocab_size": 2,                  # number of segment id types: [0, 1]
"vocab_size": 30522                    # number of words in the vocabulary
```

The model's input arguments input_ids, input_mask, and token_type_ids correspond to the input_ids, input_mask, and segment_ids produced in the previous post (training-data generation).
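As a hedged sketch, these parameters are usually loaded through `BertConfig.from_json_file` from modeling.py; the checkpoint path below is hypothetical and TF 1.x is assumed:

```python
import modeling  # modeling.py from the google-research/bert repo (assumed to be on the path)

# Hypothetical path to the config file shipped with a pretrained checkpoint.
config = modeling.BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```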

BertModel

This part is the overall flow. The whole modeling.py script is over 900 lines, so we will walk through it step by step. The flow is: first, input_ids and token_type_ids are embedded; the embedding output is fed into the Transformer encoder; the encoder output is the final encoding.

```python
def __init__(self,
             config,
             is_training,
             input_ids,
             input_mask=None,
             token_type_ids=None,
             use_one_hot_embeddings=True,
             scope=None):
  """Constructor for BertModel.

  Args:
    config: `BertConfig` instance.
    is_training: bool. True for training model, false for eval model. Controls
      whether dropout will be applied.
    input_ids: int32 Tensor of shape [batch_size, seq_length].
    input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
    use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
      embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
      it is much faster if this is True, on the CPU or GPU, it is faster if
      this is False.
    scope: (optional) variable scope. Defaults to "bert".

  Raises:
    ValueError: The config is invalid or one of the input tensor shapes
      is invalid.
  """
  config = copy.deepcopy(config)
  if not is_training:
    config.hidden_dropout_prob = 0.0
    config.attention_probs_dropout_prob = 0.0

  input_shape = get_shape_list(input_ids, expected_rank=2)
  batch_size = input_shape[0]
  seq_length = input_shape[1]

  if input_mask is None:
    input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

  if token_type_ids is None:
    token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

  with tf.variable_scope(scope, default_name="bert"):
    with tf.variable_scope("embeddings"):
      # Perform embedding lookup on the word ids.
      # embedding_output: [batch_size, seq_length, embedding_size]
      # embedding_table:  [vocab_size, embedding_size]
      (self.embedding_output, self.embedding_table) = embedding_lookup(   # word embedding
          input_ids=input_ids,                        # [batch_size, seq_length]
          vocab_size=config.vocab_size,
          embedding_size=config.hidden_size,
          initializer_range=config.initializer_range,
          word_embedding_name="word_embeddings",
          use_one_hot_embeddings=use_one_hot_embeddings)

      # Add positional embeddings and token type embeddings, then layer
      # normalize and perform dropout.
      # token_type_embedding + position_embedding -> [batch_size, seq_length, embedding_size]
      self.embedding_output = embedding_postprocessor(
          input_tensor=self.embedding_output,
          use_token_type=True,
          token_type_ids=token_type_ids,
          token_type_vocab_size=config.type_vocab_size,
          token_type_embedding_name="token_type_embeddings",
          use_position_embeddings=True,
          position_embedding_name="position_embeddings",
          initializer_range=config.initializer_range,
          max_position_embeddings=config.max_position_embeddings,
          dropout_prob=config.hidden_dropout_prob)

    with tf.variable_scope("encoder"):
      # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
      # mask of shape [batch_size, seq_length, seq_length] which is used
      # for the attention scores.
      attention_mask = create_attention_mask_from_input_mask(
          input_ids, input_mask)

      # Run the stacked transformer.
      # `sequence_output` shape = [batch_size, seq_length, hidden_size].
      self.all_encoder_layers = transformer_model(   # list of [batch_size, seq_length, embedding_size]
          input_tensor=self.embedding_output,
          attention_mask=attention_mask,
          hidden_size=config.hidden_size,
          num_hidden_layers=config.num_hidden_layers,
          num_attention_heads=config.num_attention_heads,
          intermediate_size=config.intermediate_size,
          intermediate_act_fn=get_activation(config.hidden_act),
          hidden_dropout_prob=config.hidden_dropout_prob,
          attention_probs_dropout_prob=config.attention_probs_dropout_prob,
          initializer_range=config.initializer_range,
          do_return_all_layers=True)

    self.sequence_output = self.all_encoder_layers[-1]   # output of the last encoder layer

    # The "pooler" converts the encoded sequence tensor of shape
    # [batch_size, seq_length, hidden_size] to a tensor of shape
    # [batch_size, hidden_size]. This is necessary for segment-level
    # (or segment-pair-level) classification tasks where we need a fixed
    # dimensional representation of the segment.
    with tf.variable_scope("pooler"):
      # We "pool" the model by simply taking the hidden state corresponding
      # to the first token ([CLS]); it carries information about the whole
      # example. We assume that this has been pre-trained.
      first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)   # [batch_size, hidden_size]
      self.pooled_output = tf.layers.dense(            # one dense layer on top -> [batch_size, hidden_size]
          first_token_tensor,
          config.hidden_size,
          activation=tf.tanh,
          kernel_initializer=create_initializer(config.initializer_range))
```
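,</span> h">
A hedged usage sketch, roughly mirroring the example in the repo's README (TF 1.x assumed; the toy WordPiece ids are made up):

```python
import tensorflow as tf
import modeling  # modeling.py from the BERT repo

# Toy inputs that have already been converted to WordPiece ids (values made up).
input_ids      = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask     = tf.constant([[1, 1, 1],   [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1],   [0, 0, 0]])

config = modeling.BertConfig(vocab_size=32000, hidden_size=768,
                             num_hidden_layers=12, num_attention_heads=12,
                             intermediate_size=3072)

model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=token_type_ids)

pooled = model.get_pooled_output()     # [batch_size, hidden_size]
seq    = model.get_sequence_output()   # [batch_size, seq_length, hidden_size]
```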


word embedding

First look at the word-embedding part. It takes input_ids and, optionally using a one-hot matrix multiplication as an intermediate step, returns the embedding result.

```python
def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])     # expand the last dim -> [batch_size, seq_length, 1]

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])                       # [batch_size*seq_length]
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)   # [batch_size*seq_length, vocab_size]
    output = tf.matmul(one_hot_input_ids, embedding_table)             # [batch_size*seq_length, embedding_size]
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])   # [batch_size, seq_length, embedding_size]
  return (output, embedding_table)
```
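The one-hot branch and `tf.nn.embedding_lookup` compute exactly the same lookup; a small numpy sketch (shapes and values made up) makes the equivalence concrete:

```python
import numpy as np

vocab_size, embedding_size = 6, 4
rng = np.random.RandomState(0)
embedding_table = rng.randn(vocab_size, embedding_size)

input_ids = np.array([[1, 3, 5], [0, 2, 2]])           # [batch_size=2, seq_length=3]
flat_ids = input_ids.reshape(-1)                        # [batch_size*seq_length]

# One-hot route: [6, vocab_size] @ [vocab_size, embedding_size]
one_hot = np.eye(vocab_size)[flat_ids]
out_one_hot = one_hot @ embedding_table                 # [6, embedding_size]

# Direct gather route (what tf.nn.embedding_lookup does)
out_gather = embedding_table[flat_ids]

assert np.allclose(out_one_hot, out_gather)
print(out_gather.reshape(2, 3, embedding_size).shape)   # (2, 3, 4)
```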


embedding_postprocessor

Next, embedding_postprocessor. It adds the token_type_embedding and the position_embedding, i.e. the Segment Embeddings and Position Embeddings from the BERT paper's figure.

Note, however, that the Position Embeddings here differ from those of the original Transformer: in this code the position embeddings are learned, whereas the original Transformer uses fixed sinusoidal values (see the sketch below).
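For comparison, a minimal numpy sketch of the fixed encoding from "Attention is All You Need", where PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). This is not what BERT uses; it is the baseline that the learned table replaces:

```python
import numpy as np

def sinusoidal_position_encoding(seq_length, width):
    """Fixed position encoding from the original Transformer (not used by BERT)."""
    positions = np.arange(seq_length)[:, None]        # [seq_length, 1]
    dims = np.arange(width)[None, :]                  # [1, width]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(width))
    angles = positions * angle_rates                  # [seq_length, width]
    pe = np.zeros((seq_length, width), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions: sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions: cos
    return pe                                          # [seq_length, width]

print(sinusoidal_position_encoding(128, 768).shape)    # (128, 768)
```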

```python
def embedding_postprocessor(input_tensor,            # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,     # [batch_size, seq_length]
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:                                   # Segment Embeddings
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])                       # [batch_size*seq_length]
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)   # [batch_size*seq_length, 2]; token_type is only 0 or 1
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)             # [batch_size*seq_length, embedding_size]
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])          # [batch_size, seq_length, width=embedding_size]
    output += token_type_embeddings                                              # [batch_size, seq_length, embedding_size]

  if use_position_embeddings:                          # Position Embeddings
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)        # make sure seq_length <= max_position_embeddings
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ..., seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])                           # [seq_length, embedding_size]
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])                       # [1, seq_length, embedding_size]
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)                 # [1, seq_length, embedding_size]
      output += position_embeddings   # [batch_size, seq_length, embedding_size] + [1, seq_length, embedding_size]
      # The position embedding at a given position is the same for every example
      # in the batch, so the [1, seq_length, embedding_size] tensor simply
      # broadcasts over the batch dimension when added to `output`.

  output = layer_norm_and_dropout(output, dropout_prob)
  return output
```


Transformer

After the embeddings, an attention_mask is constructed. It expands the original input_mask from [batch_size, seq_length] to [batch_size, from_seq_length, to_seq_length], so that every from position gets its own copy of the input mask. The mask and the embeddings are then passed into the Transformer; a small sketch of the mask expansion follows.
The overall Transformer architecture follows the figure from the original paper (multi-head attention → add & norm → feed-forward → add & norm, stacked num_hidden_layers times).
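Conceptually the expansion is just a broadcast of the 2D mask along a new from_seq_length axis; here is a small numpy sketch of the idea (not the repo's create_attention_mask_from_input_mask itself, although that function does essentially this with TF ops):

```python
import numpy as np

input_mask = np.array([[1, 1, 1, 0, 0],
                       [1, 1, 1, 1, 1]], dtype=np.float32)   # [batch_size, seq_length]

batch_size, seq_length = input_mask.shape
# [batch_size, from_seq_length, 1] broadcast against [batch_size, 1, to_seq_length]
attention_mask = np.ones((batch_size, seq_length, 1), dtype=np.float32) * input_mask[:, None, :]
print(attention_mask.shape)   # (2, 5, 5) -> [batch_size, from_seq_length, to_seq_length]
print(attention_mask[0, 0])   # [1. 1. 1. 0. 0.]  every query position sees the same key mask
```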

Now let's walk through transformer_model. It first applies multi-head attention to the embeddings, then a residual connection plus layer norm; the result goes through the feed-forward network, followed by another residual connection plus layer norm.
One point where this code looks different from the paper: after multi-head attention it first applies a fully connected projection layer and only then does the residual and layer norm, whereas the paper does not seem to draw that extra dense layer explicitly (the paper's multi-head attention does define an output projection W^O, so this may amount to the same thing). The code is below, with comments on the key parts.

```python
def transformer_model(input_tensor,
                      attention_mask=None,           # [batch_size, from_seq_length, to_seq_length]
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)   # flatten to 2D up front and restore to 3D at the end: [batch_size*seq_length, hidden_size]

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(            # self-attention, i.e. multi-head attention
              from_tensor=layer_input,                 # [batch_size*seq_length, hidden_size]
              to_tensor=layer_input,                   # [batch_size*seq_length, hidden_size]
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(          # dense projection on top of the attention output
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)   # residual + layer norm

      # Feed-forward: first project up to intermediate_size, then back down.
      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(         # up-projection
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):                # down-projection
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)   # residual + layer norm
        prev_output = layer_output                     # this layer's output feeds the next layer
        all_layer_outputs.append(layer_output)         # keep every layer's output

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
```
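The `intermediate_act_fn` defaults to `gelu`, the `hidden_act` named in the config. As a hedged reference, GELU(x) = x·Φ(x) (Φ is the standard normal CDF); a minimal sketch using the common tanh approximation is below — the repo's own implementation may use tf.erf instead, so treat this as illustrative only:

```python
import numpy as np

def gelu(x):
    """GELU via the common tanh approximation of x * Phi(x)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))))

x = np.linspace(-3, 3, 7)
print(np.round(gelu(x), 3))   # smooth curve that, unlike relu, lets small negative values through
```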


self_attention

Next, the self-attention mechanism. It uses (scaled) dot-product attention, attending from the sequence to itself, so every token gathers global semantic information. It also uses multi-head attention: hidden_size is split evenly into several parts (heads), each head performs its own self-attention, and different heads learn semantics in different sub-spaces.

The code follows, with comments on the key parts. The inputs are first projected into query, key and value and reshaped to [batch_size, num_heads, seq_length, size_per_head]; dot-product attention is computed per head; after the softmax the weights are multiplied by the values; finally a tensor of shape [batch_size*seq_length, hidden_size] is returned. A shape-only sketch comes first, then the full code.
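A shape-only numpy sketch of one pass with made-up sizes (B=2 sequences, F=T=4 tokens, N=3 heads, H=5 dims per head); it mirrors the reshape/transpose/matmul pattern used in the function below:

```python
import numpy as np

B, F, N, H = 2, 4, 3, 5            # batch, seq_length, num_heads, size_per_head
rng = np.random.RandomState(0)

def split_heads(x):                 # [B*F, N*H] -> [B, N, F, H]
    return x.reshape(B, F, N, H).transpose(0, 2, 1, 3)

q = split_heads(rng.randn(B * F, N * H))
k = split_heads(rng.randn(B * F, N * H))
v = split_heads(rng.randn(B * F, N * H))

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(H)                  # [B, N, F, F]
probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)     # softmax over the key axis
context = probs @ v                                                # [B, N, F, H]
context = context.transpose(0, 2, 1, 3).reshape(B * F, N * H)      # back to [B*F, N*H]
print(context.shape)                                               # (8, 15)
```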

```python
def attention_layer(from_tensor,                   # from_tensor and to_tensor are both the input embeddings: [batch_size*seq_length, hidden_size]
                    to_tensor,
                    attention_mask=None,           # [batch_size, from_seq_length, to_seq_length]
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the same, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-width vector.

  This function first projects `from_tensor` into a "query" tensor and
  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
  of tensors of length `num_attention_heads`, where each tensor is of shape
  [batch_size, seq_length, size_per_head].

  Then, the query and key tensors are dot-producted and scaled. These are
  softmaxed to obtain attention probabilities. The value tensors are then
  interpolated by these probabilities, then concatenated back to a single
  tensor and returned.

  In practice, the multi-headed attention are done with transposes and
  reshapes rather than actual separate tensors.

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length, from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
    float Tensor of shape [batch_size, from_seq_length,
      num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
      true, this will be of shape [batch_size * from_seq_length,
      num_attention_heads * size_per_head]).

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)   # [batch_size*seq_length, hidden_size]
  to_tensor_2d = reshape_to_matrix(to_tensor)       # [batch_size*seq_length, hidden_size]

  # The inputs are first fed through dense layers, but with activation=None,
  # i.e. purely linear projections (the author isn't sure why no activation is
  # used here).

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,      # None
      name="query",
      kernel_initializer=create_initializer(initializer_range))   # [batch_size*seq_length, hidden_size]; hidden_size == num_attention_heads*size_per_head

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,        # None
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,      # None
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # Reshape to 4D for the per-head attention matmuls.

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,   # move num_attention_heads to the second dim: each batch has N heads, each head has F tokens, each token is an H-dim vector; different heads learn different sub-spaces
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores (scaled dot-product attention).
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # This turns the padded positions at the end of each example into a large
    # negative number and leaves the real-token positions at 0.
    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely: after the addition,
    # real-token positions are unchanged and padded positions carry a very
    # large negative value.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # Multiply the attention probabilities by the values.
  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # Return a 2D result: `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
```


Using the model

How do you use the model? The BertModel class exposes two getters. get_pooled_output() returns the encoding of the first token ([CLS]) of each example; BERT treats this token as a summary of the whole sequence, so it suits sentence-level classification tasks. get_sequence_output() returns the final encoder output of shape [batch_size, seq_length, hidden_size], i.e. a representation of every token in each example, which suits sequence-to-sequence and token-level tasks.

```python
def get_pooled_output(self):
  return self.pooled_output           # [batch_size, hidden_size]

def get_sequence_output(self):
  """Gets final hidden layer of encoder.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
    to the final hidden of the transformer encoder.
  """
  return self.sequence_output
```
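As a hedged sketch of how get_pooled_output() is typically consumed for sentence-level classification (mirroring the pattern in run_classifier.py; `model` is assumed to be a constructed BertModel and `num_labels` is task specific):

```python
import tensorflow as tf

num_labels = 2                                             # e.g. a binary task (illustrative)
pooled_output = model.get_pooled_output()                  # [batch_size, hidden_size]; `model` built as above
hidden_size = pooled_output.shape[-1].value

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)             # [batch_size, num_labels]
```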

The next post covers the training process. Two things suddenly came up recently, so it may be delayed for a few days.

