BERT Whole Word Masking (WWM)

Notes on BERT's WWM implementation and on WWM for Chinese.

The code: how English BERT-WWM training data is created

    import collections

    # Same definition as in BERT's create_pretraining_data.py.
    MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                              ["index", "label"])

    # FLAGS.do_whole_word_mask is a command-line flag defined by the
    # surrounding script; it is not defined in this excerpt.


    def create_masked_lm_predictions(tokens, masked_lm_prob,
                                     max_predictions_per_seq, vocab_words, rng):
        """Creates the predictions for the masked LM objective.

        params:
            tokens: list of WordPiece token strings (not ids)
            masked_lm_prob: fraction of tokens to mask (default 0.15)
            max_predictions_per_seq: upper bound on masked positions per sequence
            vocab_words: list of vocabulary words
            rng: random number generator (e.g. a random.Random instance)
        """
        cand_indexes = []
        for (i, token) in enumerate(tokens):
            if token == "[CLS]" or token == "[SEP]":
                continue
            # Whole Word Masking means that we mask all of the wordpieces
            # corresponding to an original word. When a word has been split into
            # WordPieces, the first token does not have any marker and any
            # subsequent tokens are prefixed with ##. So whenever we see the ##
            # token, we append it to the previous set of word indexes.
            #
            # Note that Whole Word Masking does *not* change the training code
            # at all -- we still predict each WordPiece independently, softmaxed
            # over the entire vocabulary.
            if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
                    token.startswith("##")):
                # With WWM, collect all sub-tokens of a word in one group so
                # that the whole word is masked as a unit.
                cand_indexes[-1].append(i)
            else:
                cand_indexes.append([i])

        # Shuffle, then take candidates from the front until about 15% of the
        # tokens are covered and apply the 80/10/10 rule to them. This amounts
        # to picking about 15% of the tokens at random for masking.
        rng.shuffle(cand_indexes)

        output_tokens = list(tokens)

        num_to_predict = min(max_predictions_per_seq,
                             max(1, int(round(len(tokens) * masked_lm_prob))))

        masked_lms = []
        covered_indexes = set()
        for index_set in cand_indexes:  # a whole word (WWM) or a single token
            if len(masked_lms) >= num_to_predict:
                break
            # If adding a whole-word mask would exceed the maximum number of
            # predictions, then just skip this candidate.
            if len(masked_lms) + len(index_set) > num_to_predict:
                continue
            is_any_index_covered = False
            for index in index_set:
                if index in covered_indexes:
                    is_any_index_covered = True
                    break
            if is_any_index_covered:
                continue
            for index in index_set:
                covered_indexes.add(index)

                masked_token = None
                # 80% of the time, replace with [MASK] (the 80/10/10 strategy)
                if rng.random() < 0.8:
                    masked_token = "[MASK]"
                else:
                    # 10% of the time, keep original
                    if rng.random() < 0.5:
                        masked_token = tokens[index]
                    # 10% of the time, replace with random word
                    else:
                        masked_token = vocab_words[
                            rng.randint(0, len(vocab_words) - 1)]

                output_tokens[index] = masked_token

                # Record the label (the original token at this position).
                masked_lms.append(
                    MaskedLmInstance(index=index, label=tokens[index]))
        assert len(masked_lms) <= num_to_predict
        masked_lms = sorted(masked_lms, key=lambda x: x.index)

        masked_lm_positions = []
        masked_lm_labels = []
        for p in masked_lms:
            masked_lm_positions.append(p.index)
            masked_lm_labels.append(p.label)

        return (output_tokens, masked_lm_positions, masked_lm_labels)
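To make the grouping step in the function above concrete, here is a minimal sketch that isolates it. The helper name `group_whole_words` and the sample sentence are made up for illustration and are not part of the BERT code; the sketch assumes whole word masking is always on.

```python
def group_whole_words(tokens):
    """Group WordPiece indexes so that each group covers one original word.

    Hypothetical helper: mirrors the cand_indexes loop with WWM enabled.
    """
    cand_indexes = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masking candidates
        if cand_indexes and token.startswith("##"):
            cand_indexes[-1].append(i)  # continuation piece joins previous word
        else:
            cand_indexes.append([i])  # first piece starts a new word group
    return cand_indexes


tokens = ["[CLS]", "the", "phil", "##am", "##mon", "sang", "[SEP]"]
groups = group_whole_words(tokens)
print(groups)  # [[1], [2, 3, 4], [5]]

# Same budget formula as in the function, with max_predictions_per_seq=20.
num_to_predict = min(20, max(1, int(round(len(tokens) * 0.15))))
print(num_to_predict)  # 1
```

With only 7 tokens the budget is a single prediction, so the three-piece word at indexes 2 to 4 would always be skipped by the `len(masked_lms) + len(index_set) > num_to_predict` check. Multi-piece words only become eligible once the sequence is long enough for the budget to cover all of their pieces.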

How Chinese BERT-WWM pre-training data is created (from ymcui)
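The listing for this section is not included in these notes. As an illustration only, and not ymcui's actual code: Chinese BERT tokenizes text character by character, so there are no `##` markers to group on. One common approach is to run a word segmenter first and prefix the non-initial characters of each word with `##`, so that the grouping logic of the English function applies unchanged. The sketch below assumes the text has already been segmented; `mark_chinese_whole_words` and the sample segmentation are hypothetical.

```python
def mark_chinese_whole_words(words):
    """Split pre-segmented words into characters, marking non-initial ones.

    Hypothetical helper: `words` is assumed to be the output of some
    Chinese word segmenter, e.g. ["使用", "语言", "模型"].
    """
    tokens = []
    for word in words:
        for j, ch in enumerate(word):
            # The first character stays as-is; later ones get the "##"
            # prefix so they are grouped with the first character.
            tokens.append(ch if j == 0 else "##" + ch)
    return tokens


# Assumed segmentation of the sentence "使用语言模型".
words = ["使用", "语言", "模型"]
tokens = ["[CLS]"] + mark_chinese_whole_words(words) + ["[SEP]"]
print(tokens)
# ['[CLS]', '使', '##用', '语', '##言', '模', '##型', '[SEP]']
```

In this sketch the `##` prefix is only a grouping marker. It would presumably need to be stripped again after masking, before tokens and labels are mapped to vocabulary ids.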
