
1 Demo


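import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# Load the vocabulary by shortcut name (downloaded and cached on first use):
# bert_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# ...or from a local vocabulary file:
bert_tokenizer = BertTokenizer.from_pretrained(r'C:\XXX\bert-base-chinese-vocab.txt')

a = "张三和李四都住在村头"
a_token = bert_tokenizer.tokenize(a)
print(a_token)
a_seq_ids = bert_tokenizer.convert_tokens_to_ids(a_token)
print(a_seq_ids)

# Load the model by shortcut name, or from a local directory:
# bert_model = BertModel.from_pretrained("bert-base-chinese")
bert_model = BertModel.from_pretrained(r'C:\XXX\bert-base-chinese')
bert_model.eval()  # disable dropout for inference

# Shape [1, sequence_length]: a batch containing one sentence
batch_data = torch.tensor(a_seq_ids, dtype=torch.long).view(1, -1)
with torch.no_grad():
    out, _ = bert_model(batch_data)
# out is a list with one tensor per encoder layer; inspect the first
print(out[0].shape)

Output:
['张', '三', '和', '李', '四', '都', '住', '在', '村', '头']
[2476, 676, 1469, 3330, 1724, 6963, 857, 1762, 3333, 1928]
torch.Size([1, 10, 768])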
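
To make the two tokenizer calls in the demo less opaque, here is a minimal pure-Python sketch of what they amount to for this particular sentence. It assumes every character is a Chinese character that becomes its own token, which matches the token list printed above. The toy vocabulary `TOY_VOCAB` (a hypothetical name) holds only the ten ids from the demo output, and the `UNK_ID` value of 100 is an assumption for illustration. The real `BertTokenizer` also handles punctuation, casing, and WordPiece sub-word splitting, none of which is modeled here.

```python
# Toy vocabulary: only the ten (token, id) pairs printed by the demo.
TOY_VOCAB = {
    "张": 2476, "三": 676, "和": 1469, "李": 3330, "四": 1724,
    "都": 6963, "住": 857, "在": 1762, "村": 3333, "头": 1928,
}
UNK_ID = 100  # assumed id for the unknown token, for illustration only


def toy_tokenize(text):
    """Split text into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def toy_convert_tokens_to_ids(tokens, vocab=TOY_VOCAB):
    """Look each token up in the vocabulary, falling back to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]


tokens = toy_tokenize("张三和李四都住在村头")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```

The `UNK_ID` fallback is a simplification: it only shows that characters missing from the vocabulary need some substitute id.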

2 Source code

2.1 Loading the model and vocabulary


2.1.1 Loading the pretrained vocabulary:
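
As a rough illustration of what loading a vocabulary involves: a BERT vocabulary file holds one token per line, and a token's id is its zero-based line number. The sketch below mirrors that idea in plain Python. The function `load_vocab_sketch` and the five-line sample file are hypothetical and are not the library's own code.

```python
import collections
import os
import tempfile


def load_vocab_sketch(vocab_file):
    """Map each token (one per line) to its zero-based line number."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny sample vocabulary so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-vocab.txt")
    with open(sample_path, "w", encoding="utf-8") as writer:
        writer.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "张"]) + "\n")
    vocab = load_vocab_sketch(sample_path)

print(vocab["张"])
```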


2.1.2 Loading the pretrained model:


pretrained_model_name_or_path: either:
    - a str with the name of a pre-trained model to load, selected from the list:
        . `bert-base-uncased`
        . `bert-large-uncased`
        . `bert-base-cased`
        . `bert-large-cased`
        . `bert-base-multilingual-uncased`
        . `bert-base-multilingual-cased`
        . `bert-base-chinese`
    - a path or url to a pretrained model archive containing:
        . `bert_config.json` a configuration file for the model
        . `pytorch_model.bin` a PyTorch dump of a BertForPreTraining instance
    - a path or url to a pretrained model archive containing:
        . `bert_config.json` a configuration file for the model
        . `model.chkpt` a TensorFlow checkpoint
from_tf: whether to load the weights from a locally saved TensorFlow checkpoint
cache_dir: an optional path to a folder in which the pre-trained models will be cached
state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of the Google pre-trained models
*inputs, **kwargs: additional inputs for the specific Bert class (e.g. num_labels for BertForSequenceClassification)
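
The first parameter accepts either a shortcut name or a location. Below is a minimal sketch of that dispatch, assuming a name-to-URL map in the spirit of `PRETRAINED_MODEL_ARCHIVE_MAP`. The `example.invalid` URL and the function `resolve_archive` are placeholders invented for this illustration; they are not real download locations or library functions.

```python
# Hypothetical stand-in for the library's name-to-URL map.
ARCHIVE_MAP_SKETCH = {
    "bert-base-chinese": "https://example.invalid/bert-base-chinese.tar.gz",
}


def resolve_archive(pretrained_model_name_or_path, archive_map=ARCHIVE_MAP_SKETCH):
    """Return (location, is_shortcut_name).

    A known shortcut name resolves to its mapped URL; anything else is
    passed through unchanged and treated as a path or URL.
    """
    if pretrained_model_name_or_path in archive_map:
        return archive_map[pretrained_model_name_or_path], True
    return pretrained_model_name_or_path, False


print(resolve_archive("bert-base-chinese"))
print(resolve_archive(r"C:\XXX\bert-base-chinese"))
```

This is why the demo can pass either `"bert-base-chinese"` or a local directory to `from_pretrained`.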

2.2 Forward pass


Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the token preprocessing logic in the library's example scripts).
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` token and type 1 corresponds to a `sentence B` token (see the BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It is a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It is the mask typically used for attention when a batch has sentences of varying length.
`output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
Outputs: Tuple of (encoded_layers, pooled_output)
`encoded_layers`: controlled by the `output_all_encoded_layers` argument:
    - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded hidden states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large); each encoded hidden state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
    - `output_all_encoded_layers=False`: outputs only the full sequence of hidden states corresponding to the last attention block, of shape [batch_size, sequence_length, hidden_size].
`pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pretrained on top of the hidden state associated with the first token of the input (`CLS`) to train on the Next-Sentence task (see the BERT paper).
Example usage:
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
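
The `input_ids` and `input_mask` values in the example usage can be derived from variable-length id lists by padding. The helper below (`pad_batch`, a hypothetical name) is a minimal pure-Python sketch that assumes a padding id of 0. The resulting nested lists would be wrapped in `torch.LongTensor(...)` before being passed to the model.

```python
def pad_batch(sequences, pad_id=0):
    """Pad id lists to the longest length and build the matching mask.

    Returns (input_ids, input_mask): mask entries are 1 for real tokens
    and 0 for padding positions.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, input_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        input_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, input_mask


input_ids, input_mask = pad_batch([[31, 51, 99], [15, 5]])
print(input_ids)   # [[31, 51, 99], [15, 5, 0]]
print(input_mask)  # [[1, 1, 1], [1, 1, 0]]
```

The second sequence has only two real tokens, so its third position is padding and is masked out with 0.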

2.3 Package initialization

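pytorch_pretrained_bert/__init__.py is the package entry point. It brings together several models: BERT, GPT, GPT-2, and Transformer-XL. The constant PYTORCH_PRETRAINED_BERT_CACHE sets the cache directory used for downloaded BERT models.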
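
A minimal sketch of how such a cache directory setting can be resolved: an environment variable, if set, overrides a default folder under the user's home directory. The environment variable name reuses the constant's name; that name, the default folder `.pytorch_pretrained_bert`, and the helper `resolve_cache_dir` are assumptions made for this illustration rather than a copy of the library's code.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the cache directory: env override first, else a home-dir default."""
    env = os.environ if env is None else env
    default = Path.home() / ".pytorch_pretrained_bert"  # assumed default
    return Path(env.get("PYTORCH_PRETRAINED_BERT_CACHE", str(default)))


print(resolve_cache_dir({}))
print(resolve_cache_dir({"PYTORCH_PRETRAINED_BERT_CACHE": "/tmp/bert_cache"}))
```

For a one-off override, the `cache_dir` argument of `from_pretrained` serves the same purpose per call.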


2.4 Downloading pretrained resources:

PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "",
    'bert-large-uncased': "",
    'bert-base-cased': "",
    'bert-large-cased': "",
    'bert-base-multilingual-uncased': "",
    'bert-base-multilingual-cased': "",
    'bert-base-chinese': "",
}
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    'bert-base-uncased': "",
    'bert-large-uncased': "",
    'bert-base-cased': "",
    'bert-large-cased': "",
    'bert-base-multilingual-uncased': "",
    'bert-base-multilingual-cased': "",
    'bert-base-chinese': "",
}
PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP = {
    'bert-base-uncased': 512,
    'bert-large-uncased': 512,
    'bert-base-cased': 512,
    'bert-large-cased': 512,
    'bert-base-multilingual-uncased': 512,
    'bert-base-multilingual-cased': 512,
    'bert-base-chinese': 512,
}
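
The positional-embedding size map caps how many tokens a model can consume: 512 for every model listed. Below is a minimal sketch of enforcing that cap before calling the model, assuming two positions are reserved for the `[CLS]` and `[SEP]` special tokens. `MAX_POSITIONS` and `truncate_tokens` are hypothetical names introduced for this example.

```python
# Hypothetical copy of one entry from the positional-embedding size map.
MAX_POSITIONS = {"bert-base-chinese": 512}


def truncate_tokens(tokens, model_name="bert-base-chinese", num_special=2):
    """Trim tokens so that tokens plus special tokens fit the position limit."""
    limit = MAX_POSITIONS[model_name] - num_special
    return tokens[:limit]


long_input = ["头"] * 600
kept = truncate_tokens(long_input)
print(len(kept))  # 510
```

Simple head truncation is only one option; which part of a long text to keep depends on the task.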


