The crux of the problem: past_key_value

  • The overall structure of the model (from the outside in)
  • Reading the greedy_search call to the model in the outermost generation_utils.py
  • Reading the t5Stack model
  • Reading the t5block layer
  • Reading t5layerselfattention
  • Reading the t5attention code inside the encoder-side t5layerselfattention
    • The first run of t5attention
    • The second call of the encoder t5attention
  • Reading the t5layerselfattention code inside the decoder's t5layerselfattention + t5layercrossattention block
    • The first call of the decoder-side t5layerselfattention
    • The second call of the decoder-side t5layerselfattention (here "second" means the 6 encoder t5layerselfattention layers and all 6 decoder blocks, each with t5layerselfattention and t5layercrossattention, have already run once)
  • Reading the t5layercrossattention code inside the decoder's t5layerselfattention + t5layercrossattention block
    • The first call of t5layercrossattention
    • The second call of t5layercrossattention

From the earlier reading of the code, the crux turns out to be how the past_key_value argument changes from step to step: once it is filled in, the model no longer needs the full, expensive inputs at every decoding step.

The overall structure of the model (from the outside in)

The overall structure of the model determines the direction in which the data flows.
(Figure: overall framework diagram of the model)

Reading the greedy_search call to the model in the outermost generation_utils.py

while True:
    if synced_gpus:
        # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
        # The following logic allows an early break if all peers finished generating their sequence
        this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
        # send 0.0 if we finished, 1.0 otherwise
        dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
        # did all peers finish? the reduced sum will be 0.0 then
        if this_peer_finished_flag.item() == 0.0:
            break

    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

    # forward pass to get next token
    outputs = self(
        **model_inputs,
        return_dict=True,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )

    if synced_gpus and this_peer_finished:
        cur_len = cur_len + 1
        continue  # don't waste resources running the code we don't need

    next_token_logits = outputs.logits[:, -1, :]

    # Store scores, attentions and hidden_states when required
    if return_dict_in_generate:
        if output_scores:
            scores += (next_token_logits,)
        if output_attentions:
            decoder_attentions += (
                (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
            )
            if self.config.is_encoder_decoder:
                cross_attentions += (outputs.cross_attentions,)
        if output_hidden_states:
            decoder_hidden_states += (
                (outputs.decoder_hidden_states,)
                if self.config.is_encoder_decoder
                else (outputs.hidden_states,)
            )

    # pre-process distribution
    next_tokens_scores = logits_processor(input_ids, next_token_logits)

    # argmax
    next_tokens = torch.argmax(next_tokens_scores, dim=-1)

    # finished sentences should have their next token be a padding token
    if eos_token_id is not None:
        if pad_token_id is None:
            raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
        next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)

    # update generated ids, model inputs, and length for next step
    input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
    )
    cur_len = cur_len + 1

    # if eos_token was found in one sentence, set sentence to finished
    if eos_token_id is not None:
        unfinished_sequences = unfinished_sequences.mul((next_tokens != eos_token_id).long())

    # stop when each sentence is finished, or if we exceed the maximum length
    if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
        if not synced_gpus:
            break
        else:
            this_peer_finished = True

The input part here:

input_ids = torch.cat([input_ids,next_tokens[:,None]],dim=-1)

which gives input_ids = [0, 644].
Then

model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

The arguments here are, as far as one can tell, the cache carried over from the previous step (initially only [0][0] = (1, 8, 1, 64) was inspected; the first block's four cache entries are):

model_inputs['past_key_value'][0][0] = torch.Size([1, 8, 1, 64])
model_inputs['past_key_value'][0][1] = torch.Size([1, 8, 1, 64])
model_inputs['past_key_value'][0][2] = torch.Size([1, 8, 11, 64])
model_inputs['past_key_value'][0][3] = torch.Size([1, 8, 11, 64])
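To see these shapes for yourself, the small sketch below can be used (an illustration, not part of the library code; it assumes the public t5-small checkpoint, whose d_model = 512, 8 heads and d_kv = 64 match the shapes in this post). It runs a single decoder step with use_cache=True and prints the per-block cache:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

# any source sentence; its token count plays the role of the "11" in this post
enc = tokenizer("translate English to German: I love you.", return_tensors="pt")
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])  # shape (1, 1)

with torch.no_grad():
    out = model(**enc, decoder_input_ids=decoder_input_ids, use_cache=True)

# out.past_key_values: one 4-tuple per decoder block:
# (self-attn key, self-attn value, cross-attn key, cross-attn value)
for i, layer_cache in enumerate(out.past_key_values):
    print(i, [tuple(t.shape) for t in layer_cache])
# e.g. block 0 -> (1, 8, 1, 64), (1, 8, 1, 64), (1, 8, L_enc, 64), (1, 8, L_enc, 64)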

Reading the t5Stack model

The definition of t5stack's forward:

def forward(
    self,
    input_ids=None,
    attention_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    inputs_embeds=None,
    head_mask=None,
    cross_attn_head_mask=None,
    past_key_values=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):

Looking inside the t5stack class:

for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)):
    ............
    else:
        layer_outputs = layer_module(
            hidden_states,
            attention_mask=extended_attention_mask,
            position_bias=position_bias,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_extended_attention_mask,
            encoder_decoder_position_bias=encoder_decoder_position_bias,
            layer_head_mask=layer_head_mask,
            cross_attn_layer_head_mask=cross_attn_layer_head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )

Initially layer_module iterates over the model's blocks and past_key_values holds six Nones; later on, because the arguments coming from greedy_search one level up have changed, a different past_key_values is passed in.
past_key_values stores the six per-block cache entries (all None the first time):

past_key_value[0][0] = (1,8,1,64)
past_key_value[0][1] = (1,8,1,64)
past_key_value[0][2] = (1,8,11,64)
past_key_value[0][3] = (1,8,11,64)
............
............
past_key_value[5][0] = (1,8,1,64)
past_key_value[5][1] = (1,8,1,64)
past_key_value[5][2] = (1,8,11,64)
past_key_value[5][3] = (1,8,11,64)

What the previous t5stack pass leaves behind is used by the current t5stack pass, and always by the same layer.
Note that inside t5stack, past_key_values is [None, None, None, None, None, None] the first time; on every later pass it is whatever the previous pass left behind.
In other words, what is fed into each t5block is the output of the same t5block from the previous pass; for example, the second call of the second t5block receives what the second t5block produced during the first call.
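A minimal sketch of that pairing (a standalone illustration, not the library code; the block names are placeholders):

num_blocks = 6
past_key_values = None  # what the decoder stack receives on the very first pass

if past_key_values is None:
    past_key_values = [None] * num_blocks  # what T5Stack effectively iterates over

blocks = [f"t5block_{i}" for i in range(num_blocks)]  # stand-ins for self.block

for i, (layer_module, past_key_value) in enumerate(zip(blocks, past_key_values)):
    # block i only ever sees the cache that block i itself produced on the previous pass
    print(layer_module, "receives", past_key_value)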

Reading the t5block layer

Inside t5block:

hidden_states,present_key_value_state = self_attention_outputs[:2]

What is passed around here is what the t5layerselfattention layer produced for the previous prediction (the same layer in the previous pass), which also explains why, right at the start of the block, we have

self_attn_past_key_value = past_key_value[:2]
......
self_attention_outputs = self.layer[0](
    ......
    past_key_value=self_attn_past_key_value,
    ......
)

The incoming past_key_value for this block has four entries; the first two feed self-attention and the last two feed cross-attention:

past_key_value[0] = (1, 8, 1, 64)
past_key_value[1] = (1, 8, 1, 64)
past_key_value[2] = (1, 8, 11, 64)
past_key_value[3] = (1, 8, 11, 64)

After this pass through self-attention, a new present_key_value_state is produced:

hidden_states,present_key_value_state = self_attention_outputs[:2]

The content of present_key_value_state is

present_key_value_state[0] = torch.Size([1, 8, 1, 64])
present_key_value_state[1] = torch.Size([1, 8, 1, 64])

Next, after the cross-attention layer (self.layer[1]) runs, present_key_value_state is extended:

cross_attention_outputs = self.layer[1](
    hidden_states,
    key_value_states=encoder_hidden_states,
    attention_mask=encoder_attention_mask,
    position_bias=encoder_decoder_position_bias,
    layer_head_mask=cross_attn_layer_head_mask,
    past_key_value=cross_attn_past_key_value,
    query_length=query_length,
    use_cache=use_cache,
    output_attentions=output_attentions,
)

The new present_key_value_state is obtained by combining the two caches:

# Combine self attn and cross attn key value states
if present_key_value_state is not None:
    present_key_value_state = present_key_value_state + cross_attention_outputs[1]

which gives a present_key_value_state of

present_key_value_state =
torch.Size([1, 8, 1, 64])
torch.Size([1, 8, 1, 64])
torch.Size([1, 8, 11, 64])
torch.Size([1, 8, 11, 64])

The two position-bias outputs are also kept at the end:

# Keep cross-attention outputs and relative position weights
attention_outputs = attention_outputs + cross_attention_outputs[2:]

giving position-bias entries of

attention_outputs =
torch.Size([1, 8, 1, 1])
torch.Size([1, 8, 1, 11])
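Putting the bookkeeping of this block together, a rough sketch with placeholder tensors (random values; only the shapes and the tuple concatenation mirror the code above):

import torch

# stand-ins for what the two attention layers of one decoder block return
self_k = torch.randn(1, 8, 1, 64)    # self-attn key for the current position
self_v = torch.randn(1, 8, 1, 64)    # self-attn value
cross_k = torch.randn(1, 8, 11, 64)  # cross-attn key, built from the 11-token encoder output
cross_v = torch.randn(1, 8, 11, 64)  # cross-attn value

present_key_value_state = (self_k, self_v)  # from self_attention_outputs[1]
cross_present = (cross_k, cross_v)          # cross_attention_outputs[1]

# same combination as in t5block: self-attn cache first, cross-attn cache appended
present_key_value_state = present_key_value_state + cross_present
print([tuple(t.shape) for t in present_key_value_state])
# [(1, 8, 1, 64), (1, 8, 1, 64), (1, 8, 11, 64), (1, 8, 11, 64)]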

Reading t5layerselfattention

t5block comes in two flavors: one that contains only t5layerselfattention (encoder blocks) and one that contains t5layerselfattention + t5layercrossattention (decoder blocks). Here we look at t5layerselfattention itself.
The past_key_value injected into it is either

None, or a pair of cached tensors:
(1, 8, 1, 64)
(1, 8, 1, 64)

Reading the t5attention code inside the encoder-side t5layerselfattention

t5block comes in two flavors, as above; here we follow the t5attention code as it is called from the encoder's t5layerselfattention.
t5layerselfattention goes straight into t5attention.

The first run of t5attention

On the first run:

batch_size = 1, seq_length = 11, key_length = 11

Then we step into the call:

query_states = shape(self.q(hidden_states))

which gives

query_states = (1, 8, 11, 64)

(query_states is always computed this way; it is never taken from the cache.)
Next come the operations on key_states and value_states:

key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

Stepping into the project function:

def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:
        # cross-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(key_value_states))

    if past_key_value is not None:
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, key_length, dim_per_head)
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        else:
            # cross-attn
            hidden_states = past_key_value
    return hidden_states

Here key_value_states is None and past_key_value is None, so neither the elif branch nor the later if block runs; the projection layer is applied directly:

hidden_states = shape(proj_layer(hidden_states))

giving

hidden_states = torch.Size([1, 8, 11, 64])

The two project calls above therefore return:

key_states = torch.Size([1, 8, 11, 64])
value_states = torch.Size([1, 8, 11, 64])

Then the attention scores are computed:

# compute scores
scores = torch.matmul(
    query_states, key_states.transpose(3, 2)
)  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9

giving

scores = (1, 8, 11, 11)
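A quick shape check of this matmul with dummy tensors (random values; only the shapes matter):

import torch

query_states = torch.randn(1, 8, 11, 64)
key_states = torch.randn(1, 8, 11, 64)

scores = torch.matmul(query_states, key_states.transpose(3, 2))
print(scores.shape)  # torch.Size([1, 8, 11, 11]) -> (batch, heads, query_len, key_len)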

Next, position_bias is computed:

if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
        if self.gradient_checkpointing and self.training:
            position_bias.requires_grad = True
    else:
        position_bias = self.compute_bias(real_seq_length, key_length)

    # if key and values are already calculated
    # we want only the last query position bias
    if past_key_value is not None:
        position_bias = position_bias[:, :, -hidden_states.size(1) :, :]

    if mask is not None:
        position_bias = position_bias + mask  # (batch_size, n_heads, seq_length, key_length)

The branch that actually runs here is

position_bias = self.compute_bias(real_seq_length, key_length)

which gives a position_bias of shape

position_bias = (1, 8, 11, 11)

The operations that follow:

scores += position_bias
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.dropout, training=self.training
)  # (batch_size, n_heads, seq_length, key_length)
# Mask heads if we want to
if layer_head_mask is not None:
    attn_weights = attn_weights * layer_head_mask

Here attn_weights = (1, 8, 11, 11).
Then the output is produced:

attn_output = unshape(torch.matmul(attn_weights, value_states))
attn_output = self.o(attn_output)

attn_weights, key_states, value_states and position_bias are all intermediate values; only attn_output is the final result.
Finally everything is packed into a tuple and returned:

present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)
if output_attentions:
    outputs = outputs + (attn_weights,)
return outputs

position_bias arrives as None the first time; once computed it is passed forward to the later layers, which saves runtime. The position_bias is the same across the selflayerattention of the 6 encoder blocks, the same across the selflayerattention of the 6 decoder blocks, and the same across the crossattention of the 6 decoder blocks; the selflayerattention and crossattention position_bias, however, are different from each other.
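One way to confirm this sharing (a small probe, assuming the t5-small checkpoint): only the first block of each stack owns a relative-attention-bias table, cross-attention never does, and every later block simply reuses the position_bias handed to it.

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# self-attention: only block 0 of the encoder / decoder owns the bias embedding
print([b.layer[0].SelfAttention.has_relative_attention_bias for b in model.encoder.block])
# [True, False, False, False, False, False]
print([b.layer[0].SelfAttention.has_relative_attention_bias for b in model.decoder.block])
# [True, False, False, False, False, False]

# cross-attention: never has a relative bias, hence the all-zero position_bias
print([b.layer[1].EncDecAttention.has_relative_attention_bias for b in model.decoder.block])
# [False, False, False, False, False, False]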

The second call of the encoder t5attention

There is no second call: during inference the encoder's six t5attention layers are run exactly once. After the encoder has run, only the decoder is called again and again, until it emits the stop (end-of-sequence) token.
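The sketch below makes this concrete (an illustration of the idea, not the library's greedy_search; it assumes the t5-small checkpoint): the encoder is run once, then the decoder is fed one token at a time together with the growing past_key_values.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

enc = tokenizer("translate English to German: I love you.", return_tensors="pt")
encoder_outputs = model.get_encoder()(**enc)  # the encoder is called exactly once

decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
past_key_values = None

with torch.no_grad():
    for step in range(5):  # a few greedy steps
        out = model(
            encoder_outputs=encoder_outputs,
            attention_mask=enc["attention_mask"],
            # once a cache exists, only the newest token needs to be fed in
            decoder_input_ids=decoder_input_ids if past_key_values is None else decoder_input_ids[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,
        )
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        # self-attn cache grows with the step; cross-attn cache stays at the encoder length
        print(step, tuple(past_key_values[0][0].shape), tuple(past_key_values[0][2].shape))
        if next_token.item() == model.config.eos_token_id:
            break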

Reading the t5layerselfattention code inside the decoder's t5layerselfattention + t5layercrossattention block

On the first call there is no cache from a previous t5layerselfattention run, and decoder_input_ids = (1, 1).
decoder_input_ids is an input that is initialized up front; it has nothing to do with the earlier encoder_outputs.
Seen from the t5forconditionalgeneration class:

decoder_outputs = self.decoder(
    input_ids=decoder_input_ids,
    attention_mask=decoder_attention_mask,
    inputs_embeds=decoder_inputs_embeds,
    past_key_values=past_key_values,
    encoder_hidden_states=hidden_states,
    encoder_attention_mask=attention_mask,
    head_mask=decoder_head_mask,
    cross_attn_head_mask=cross_attn_head_mask,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

The only part that comes from the encoder is

encoder_hidden_states=hidden_states

which uses the encoder output hidden_states = (1, 11, 512); all the other arguments are unrelated to the encoder.
Now into the t5block class:

self_attention_outputs = self.layer[0](
    hidden_states,
    attention_mask=attention_mask,
    position_bias=position_bias,
    layer_head_mask=layer_head_mask,
    past_key_value=self_attn_past_key_value,
    use_cache=use_cache,
    output_attentions=output_attentions,
)

In other words, the decoder's selflayerattention never depends on the encoder output.
The cross-attention part of t5block, on the other hand:

cross_attention_outputs = self.layer[1](
    hidden_states,
    key_value_states=encoder_hidden_states,
    attention_mask=encoder_attention_mask,
    position_bias=encoder_decoder_position_bias,
    layer_head_mask=cross_attn_layer_head_mask,
    past_key_value=cross_attn_past_key_value,
    query_length=query_length,
    use_cache=use_cache,
    output_attentions=output_attentions,
)

The decoder's crosslayerattention does use the encoder output:

key_value_states = encoder_hidden_states

Let us first look at the first decoder-side call.

The first call of the decoder-side t5layerselfattention

The parameters at the start:

batch_size,seq_length = hidden_states.shape[:2]
real_seq_length = seq_length

which gives

batch_size = 1, seq_length = 1, real_seq_length = 1

The query projection is unchanged:

query_states = shape(self.q(hidden_states))

giving

query_states = torch.Size([1, 8, 1, 64])

Then the following is called:

key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

The branch taken is the same as before:

if key_value_states is None:
    # self-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(hidden_states))

giving

key_states = torch.Size([1, 8, 1, 64])
value_states = torch.Size([1, 8, 1, 64])

Then position_bias is computed:

else:
    position_bias = self.compute_bias(real_seq_length, key_length)

Here position_bias = (1, 8, 1, 1).
After that the scores are computed and position_bias is added:

scores = torch.matmul(
    query_states, key_states.transpose(3, 2)
)  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9
............
scores += position_bias

Finally the outputs are assembled:

outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)

The second call of the decoder-side t5layerselfattention (here "second" means the 6 encoder t5layerselfattention layers and all 6 decoder blocks, each with t5layerselfattention and t5layercrossattention, have already run once)

This "second" call happens after the first token has been predicted and we have moved on to the next position. The past_key_value[0] used here is the key_states that the same layer emitted at the previous position, and past_key_value[1] the corresponding value_states (e.g. if this is the second pass and we are in the selflayerattention of the 4th decoder block, after the 6 encoder blocks and the first 3 decoder blocks, then the cache is exactly what the 4th decoder block's selflayerattention produced during the first pass).
We then reach:

key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
    else:
        # cross-attn
        hidden_states = past_key_value

For t5layerselfattention the first inner branch (the torch.cat) runs; for cross-attention the else branch runs.
In the t5layerselfattention case, project executes:

if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
    ............
return hidden_states

which on this second pass yields:

key_states.shape = torch.Size([1, 8, 2, 64])
value_states.shape = torch.Size([1, 8, 2, 64])
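The growth of the self-attention cache is nothing more than a torch.cat along the sequence dimension; with placeholder tensors:

import torch

past_key = torch.randn(1, 8, 1, 64)  # key_states cached at the previous position
new_key = torch.randn(1, 8, 1, 64)   # projection of the current position only

key_states = torch.cat([past_key, new_key], dim=2)
print(key_states.shape)  # torch.Size([1, 8, 2, 64]); (1, 8, 3, 64) at the next step, and so on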

Next, the scores:

# compute scores
scores = torch.matmul(
    query_states, key_states.transpose(3, 2)
)  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9

giving

scores = torch.Size([1, 8, 1, 2])

Next, look at how position_bias is computed:

if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
        if self.gradient_checkpointing and self.training:
            position_bias.requires_grad = True
    else:
        position_bias = self.compute_bias(real_seq_length, key_length)

The next step is explained by the inline comment:

# if key and values are already calculated
# we want only the last query position bias

and the corresponding code is

if past_key_value is not None:
    position_bias = position_bias[:, :, -hidden_states.size(1) :, :]

Here compute_bias(real_seq_length=2, key_length=2) first gives

position_bias = torch.Size([1, 8, 2, 2])

and the slice then keeps only the bias for the last query position (the last entry along dim 2), so that

position_bias = (1, 8, 1, 2)

This extends the earlier position_bias. For example, the position_bias at the previous step was

position_bias =
tensor([[[[ 3.5000]],
         [[ 0.4531]],
         [[ 3.1875]],
         [[ 0.9727]],
         [[-5.4688]],
         [[ 5.1875]],
         [[ 2.1562]],
         [[ 0.5391]]]])

and now it is

position_bias =
tensor([[[[ 3.9844,  3.5000]],
         [[ 1.2266,  0.4531]],
         [[ 4.3438,  3.1875]],
         [[ 2.0312,  0.9727]],
         [[ 0.7969, -5.4688]],
         [[ 4.9375,  5.1875]],
         [[ 4.7500,  2.1562]],
         [[ 4.5000,  0.5391]]]])
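The slicing can be reproduced with placeholder tensors (random values standing in for the compute_bias output):

import torch

position_bias = torch.randn(1, 8, 2, 2)  # what compute_bias(real_seq_length=2, key_length=2) returns
hidden_states = torch.randn(1, 1, 512)   # the decoder input at this step: one new position

# keep only the bias for the last hidden_states.size(1) (= 1) query position(s)
position_bias = position_bias[:, :, -hidden_states.size(1):, :]
print(position_bias.shape)  # torch.Size([1, 8, 1, 2])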

Then:

scores += position_bias
# scores = (1, 8, 1, 2)
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.dropout, training=self.training
)  # (batch_size, n_heads, seq_length, key_length)
# Mask heads if we want to
if layer_head_mask is not None:
    attn_weights = attn_weights * layer_head_mask

Up to this point scores (and attn_weights) have shape (1, 8, 1, 2).
Next comes

attn_output = unshape(torch.matmul(attn_weights, value_states))

with attn_weights = (1, 8, 1, 2) and value_states = (1, 8, 2, 64); the product has shape (1, 8, 1, 64).
After unshape, the output is produced:

attn_output = unshape(torch.matmul(attn_weights, value_states))
# attn_output = (1, 1, 512)
attn_output = self.o(attn_output)

giving

attn_output = (1, 1, 512)
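For reference, unshape is just the inverse of shape: it merges the head dimension back into the model dimension. A sketch with placeholder values (batch_size = 1, n_heads = 8, dim_per_head = 64, so inner_dim = 512):

import torch

batch_size, n_heads, dim_per_head = 1, 8, 64
attn = torch.randn(batch_size, n_heads, 1, dim_per_head)  # (1, 8, 1, 64): weighted sum of the values

# unshape: (batch, heads, seq, d_head) -> (batch, seq, heads * d_head)
attn_output = attn.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * dim_per_head)
print(attn_output.shape)  # torch.Size([1, 1, 512]); self.o then maps 512 -> 512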

Reading the t5layercrossattention code inside the decoder's t5layerselfattention + t5layercrossattention block

t5block comes in two flavors, as above; here we read the t5layercrossattention code inside the t5layerselfattention + t5layercrossattention (decoder) block.

The first call of t5layercrossattention

The leading parameters are much like in selflayerattention:

batch_size = 1, seq_length = 1, real_seq_length = 1

Next comes

key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]

Since key_value_states is not None, this gives

key_length = 11

Here key_value_states = (1, 11, 512), i.e. the output of the 6 encoder blocks (it is the same for all 6 t5layercrossattention layers).
Next, the project mapping:

def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:
        # cross-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(key_value_states))

    if past_key_value is not None:
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, key_length, dim_per_head)
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        else:
            # cross-attn
            hidden_states = past_key_value
    return hidden_states

On the first layercrossattention pass the branch that runs is

elif past_key_value is None:
    hidden_states = shape(proj_layer(key_value_states))

key_value_states here is the encoder output (1, 11, 512); after the projection and shape it becomes (1, 8, 11, 64), so hidden_states = (1, 8, 11, 64).
Since past_key_value is None, the later if block is not executed.
Next:

query_states = shape(self.q(hidden_states))

query_states = (1, 8, 1, 64)
Then the two remaining project calls:

key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

Since key_value_states is the encoder output (projected and shaped to (1, 8, 11, 64) as above),

hidden_states = (1, 8, 11, 64)

the branch of project that runs is

elif past_key_value is None:
    # cross-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(key_value_states))

giving

key_states = (1,8,11,64)
value_states = (1,8,11,64)

Then the scores:

scores = torch.matmul(query_states, key_states.transpose(3, 2))

giving

scores = (1, 8, 1, 64) × (1, 8, 64, 11) = (1, 8, 1, 11)

Then the crosslayerattention position_bias is handled.
Note that the position_bias of crosslayerattention is always zero (cross-attention has no relative attention bias):

if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )

The resulting position_bias is

position_bias =
tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]])

Then the usual follow-up:

scores += position_bias
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.dropout, training=self.training
)  # (batch_size, n_heads, seq_length, key_length)
# Mask heads if we want to
if layer_head_mask is not None:
    attn_weights = attn_weights * layer_head_mask

attn_weights = (1, 8, 1, 11)
Finally multiply and return:

attn_output = unshape(torch.matmul(attn_weights, value_states))
attn_output = self.o(attn_output)

giving

attn_output = (1, 8, 1, 11) × (1, 8, 11, 64) = (1, 8, 1, 64), unshaped to (1, 1, 512)
after the output linear layer self.o, attn_output is still (1, 1, 512)

Finally everything is packed together and returned:

present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)
if output_attentions:
    outputs = outputs + (attn_weights,)

The second call of t5layercrossattention

The initial parameters are the same:

batch_size,seq_length = hidden_states.shape[:2]
real_seq_length = seq_length

Here batch_size = 1, seq_length = 1, real_seq_length = 1.
Then

key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]

which gives

key_length = 11

The only difference from the first call is how key_states and value_states are obtained:

key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

The past_key_value[0] and past_key_value[1] passed in here are what the same layer produced on the previous pass: past_key_value[0] is that layer's cached key_states and past_key_value[1] its cached value_states (the same pattern already described for the decoder selflayerattention above).
Stepping into project:

def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:
        # cross-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(key_value_states))

    if past_key_value is not None:
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, key_length, dim_per_head)
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        else:
            # cross-attn
            hidden_states = past_key_value
    return hidden_states

This time the final else branch runs directly:

hidden_states = past_key_value

giving hidden_states = torch.Size([1, 8, 11, 64]).
Next position_bias; remember that the crosslayerattention position_bias is always zero:

if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )

The resulting position_bias is

position_bias =
tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]])

To summarize project(): the first if projects the current hidden states for self-attention (encoder and decoder alike; on the 2nd~nth decoder calls its result is then concatenated with the cache); the elif projects the encoder output for the first cross-attention call; inside the second if, the torch.cat branch covers the 2nd~nth self-attention calls, and the final else simply reuses the cache for the 2nd~nth cross-attention calls. (A standalone sketch of these four cases is given at the end of this section.)
The remaining operations are much the same as before:

(1, 8, 1, 64) × (1, 8, 64, 11) = (1, 8, 1, 11)
(1, 8, 1, 11) × (1, 8, 11, 64) = (1, 8, 1, 64)
if position_bias is None: ...    if mask is not None: ...

Here the mask is not None; that is worth a closer look.
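To tie the project() discussion together, here is a standalone re-implementation of its four branches with dummy tensors (the shape helper, the linear layer and all shapes are placeholders chosen to match this walkthrough; it is an illustration, not the library code):

import torch

n_heads, d_head, d_model = 8, 64, 512

def shape(states):
    # (batch, seq, d_model) -> (batch, heads, seq, d_head)
    return states.view(states.size(0), -1, n_heads, d_head).transpose(1, 2)

def project(hidden_states, proj_layer, key_value_states, past_key_value):
    if key_value_states is None:                       # self-attn: project the current positions
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:                       # cross-attn, first call: project the encoder output
        hidden_states = shape(proj_layer(key_value_states))
    if past_key_value is not None:
        if key_value_states is None:                   # self-attn, later calls: append to the cache
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        else:                                          # cross-attn, later calls: just reuse the cache
            hidden_states = past_key_value
    return hidden_states

proj = torch.nn.Linear(d_model, n_heads * d_head, bias=False)
dec_step = torch.randn(1, 1, d_model)  # current decoder position
enc_out = torch.randn(1, 11, d_model)  # encoder output

print(project(dec_step, proj, None, None).shape)                              # (1, 8, 1, 64)  self-attn, 1st call
print(project(dec_step, proj, None, torch.randn(1, 8, 1, d_head)).shape)      # (1, 8, 2, 64)  self-attn, 2nd call
print(project(dec_step, proj, enc_out, None).shape)                           # (1, 8, 11, 64) cross-attn, 1st call
print(project(dec_step, proj, enc_out, torch.randn(1, 8, 11, d_head)).shape)  # (1, 8, 11, 64) cross-attn, later calls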
