NLP Tutorial Notes: BERT, a Bidirectional Language Model


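What it is

BERT, GPT and ELMo are the same kind of thing. Each exists to serve as a pretrained model that provides sentence-level understanding for NLP. ELMo uses a bidirectional LSTM as its sentence feature extractor and can also express the different meanings a word takes in different sentences. GPT is a unidirectional language model that likewise uses attention to extract richer semantic information. BERT belongs to the same family as GPT: both are derived from the Transformer. So how do BERT and GPT differ?

The biggest difference is that BERT asks: if we read a sentence in only one direction, aren't we missing the information from the other direction? So BERT, like ELMo, is a bidirectional language model. That bidirectionality is exactly the Encoder part of the Transformer, left untouched.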

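How it is trained

BERT is simply a Transformer Encoder; only the training procedure differs. This tutorial therefore does not go through the Encoder structure in detail.

To make BERT understand semantics, its training is much trickier than GPT's. GPT's training scheme is simple because we train it like an RNN: predict the following text from the preceding text, with a mask hiding the following text. No information leaks between past and future, which is one reason a unidirectional language model is easy to train. But if we want to use the context on both sides (without masking out the following text) and still train easily, things get awkward: when predicting word X, the model would be looking at X to predict X, which is meaningless.

Fortunately the BERT authors came up with a workable idea: hide X in the sentence so the model cannot see it, then predict X from the context on both sides. This is the core concept of BERT training.

But this causes another problem. We humans understand a cloze test and know that the blank (mask) means nothing is there. The model does not: it treats the blank as a word, because what we feed in is the embedding of a token called mask, and the model assumes it should use that embedding to predict something. To avoid this, the researchers added another trick: besides using mask to mark the word to predict, the selected word is sometimes replaced with a random word and sometimes left unchanged. Concretely there are three cases:

Randomly select 15% of the words and change them as follows:

80% of the time, replace the word with [MASK]
10% of the time, replace the word with any other word
10% of the time, leave the word unchanged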

For example:

Input:  The man went to [MASK] store with [MASK] dog
Target:                 the               his
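The selection recipe above can be sketched per token in a few lines of plain Python. This is only an illustration under assumptions, not the tutorial's code: mask_tokens, its parameters and the toy vocabulary are hypothetical names, and the tutorial's own random_mask_or_replace() (shown later) picks one of the three actions per batch rather than per token.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_rate=0.15, seed=None):
    """Apply the 15% / 80-10-10 recipe to one tokenized sentence (hypothetical helper).

    Returns the corrupted tokens and a dict {position: original token}
    holding the prediction targets.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_rate:
            continue  # not selected: left untouched and not predicted
        targets[i] = tok
        p = rng.random()
        if p < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif p < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word
    return corrupted, targets

sentence = "the man went to the store with his dog".split()
vocab = sorted(set(sentence))  # toy vocabulary, an assumption for this sketch
corrupted, targets = mask_tokens(sentence, vocab, seed=0)
print(corrupted)
print(targets)
```

Every selected position becomes a target whichever of the three actions is applied, so the model cannot tell from the input alone which visible words it will be asked to predict.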

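Predicting [MASK] is BERT's main task. What else can we do in unsupervised learning to give the model more tasks to train on? We can also exploit context: let the model judge whether two adjacent sentences really follow one another.

For example, I split a two-sentence paragraph into its two sentences, feed both into the model, and have it output True/False for whether the second follows the first. I can also randomly pair up sentences that are not consecutive, so the model learns what a non-consecutive pair looks like.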

Input : the man went to the store [SEP] he bought a gallon of milk [SEP]
Is next : True

Input : the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Is next : False
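A minimal sketch of how such pairs could be built from an ordered list of sentences. The helper name make_nsp_pairs and the 50/50 split between true and random pairs are assumptions of this sketch; the tutorial's own data loader, shown in the full code, takes its labels from MRPC's is_same field instead.

```python
import random

def make_nsp_pairs(sentences, seed=None):
    """Build (first, second, is_next) examples from consecutive sentences (hypothetical helper)."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to draw a negative example")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # positive example: the true next sentence
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # negative example: any sentence that is neither the first one nor its true successor
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], False))
    return pairs

doc = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
]
for first, second, is_next in make_nsp_pairs(doc, seed=0):
    print("Input : %s [SEP] %s [SEP]" % (first, second))
    print("Is next :", is_next)
```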

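With these two tasks, [MASK] prediction and next-sentence prediction, we can generate a large amount of data for supervised training. In effect we turn unlabeled data into two supervised tasks; the model is still trained with supervision.

Note: my BERT code differs from the paper in one place. I don't think the model needs a [CLS] input telling it which task it is currently doing, because what I want is a language understander. For different tasks you can fine-tune different heads; who knows whether the downstream task is one you trained on? So I see no need for a dedicated task input. I care about training a language model, not a language-task model.

Code

We use the same data (MRPC) as for ELMo and GPT, so the results can be compared side by side.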

def train(model, data, step=10000):
    for t in range(step):
        seqs, segs, seqs_, loss_mask, xlen, nsp_labels = random_mask_or_replace(data, ...)
        loss, pred = model.step(seqs, segs, seqs_, loss_mask, nsp_labels)

d = utils.MRPCData("./MRPC", 2000)
m = BERT()
train(m, d, step=10000)

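Note that random_mask_or_replace() applies one round of MASK and replace operations to the data in every iteration, so that BERT always has positions to predict.

If we print the results during training, we see that BERT does converge, but much more slowly than GPT. The GPT we trained last time reached a fairly good point in only 5000 steps, while this BERT still has not converged well after 10000 steps. This is a real weakness of BERT training.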

step:  0 | time: 0.64 | loss: 9.655
| tgt:  <GO> <quote> we can 't change the past , sour we can schering-plough a lot about the future , <quote> sheehan said gamecocks a news conference wednesday afternoon . prevents <quote> we heads 't change the past leg but goldman can do a lot about analogous future , <quote> sheehan said hours after arriving in phoenix
| prd:  tennis subject bar condition adviser down higher ko larned sleep charing arrest shipments alone corp. forging lord rucker humans requiring peaks assignment communion parking locked jeb novels aboard civilians sciences moroccan offer juvenile non-discriminatory reactors <NUM>-to-49-year-old slashed touch-screen underperformed aches trenton north partway odds tito websites company-sponsored orthopedic behind mother-of-two breaking campaigning cooperate down denver marched
| tgt word:  ['but', 'do', 'at', '<SEP>', 'can', ',', 'we', 'the']
| prd word:  ['sleep', 'shipments', 'communion', 'sciences', 'juvenile', 'touch-screen', 'aches', 'websites']

step:  100 | time: 14.04 | loss: 8.924
| tgt:  <GO> this year , local health departments hired part-time water samplers and purchased testing equipment with a $ <NUM> grant from the environmental protection agency . <SEP> this year , peninsula health officials got the money to hire part-time water samplers and purchase testing equipment thanks to a $ <NUM> grant from the environmental protection agency
| prd:  <GO> harrison operated <GO> <GO> , <GO> the the <SEP> <SEP> <GO> the the <SEP> manila stuck <SEP> the <GO> medics <SEP> <GO> <SEP> the sherry offend daschle cronan , washington-area , membership , the <NUM> , , the the the <NUM> the the the the , stricter <NUM> the , , the <NUM> the the
| tgt word:  ['testing', 'with', '<NUM>', 'protection', ',', 'health', 'water', 'protection']
| prd word:  ['the', 'manila', 'the', '<SEP>', ',', ',', 'the', 'the']

...

step:  9800 | time: 14.16 | loss: 2.888
| tgt:  <GO> in <NUM> , the building 's owners , the port authority of new york and new jersey , issued guidelines to upgrade the fireproofing to a thickness of <NUM> { inches . <SEP> the nist discovered that in <NUM> the port authority issued guidelines to upgrade the fireproofing to a thickness of <NUM> 1 / <NUM> inches
| prd:  <GO> in <NUM> , the new 's the , the , , of new new and new <NUM> , , to to to the fireproofing to a of of <NUM> fireproofing <NUM> the <SEP> the nist the that in <NUM> the to to , <NUM> to <NUM> the , to a , of <NUM> to , <NUM> to
| tgt word:  ['authority', 'inches', 'that', 'in', 'authority', 'a', '1', '/']
| prd word:  [',', '<NUM>', 'that', 'in', 'to', 'a', 'to', ',']

step:  9900 | time: 14.06 | loss: 3.090
| tgt:  <MASK> <MASK> stock closed friday at $ <MASK> , down $ <NUM> , or <NUM> percent , on the nasdaq stock market . <MASK> shares of brocade closed at $ <NUM> , down $ <NUM> , or <MASK> percent
| prd:  <GO> the <SEP> was friday at $ <NUM> , down $ <NUM> , or <NUM> percent , on the <GO> <SEP> market the <SEP> the of <GO> was at $ <NUM> , down $ <NUM> , or <NUM> percent
| tgt word:  ['<GO>', 'the', '<NUM>', '<SEP>', '<NUM>']
| prd word:  ['<GO>', 'the', '<NUM>', '<SEP>', '<NUM>']

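How should random_mask_or_replace() be designed? In short, it replaces selected positions of the original sentence with [MASK], replaces them with other words, or does nothing. I also use a trick to compute the loss only at the masked or replaced positions: after the forward pass the model computes an error for every position, but when computing the actual loss we keep only the masked/replaced positions and ignore all the others. So I also generate a loss_mask that restricts the loss to the positions that matter.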

def _get_loss_mask(...):
    # Restricts the loss to MASKed or replaced positions; all other positions are ignored
    return loss_mask, rand_id

def do_mask(seq, len_arange, pad_id, mask_id):
    # 80%: put [MASK] at the selected positions
    loss_mask, rand_id = _get_loss_mask(len_arange, seq, pad_id)
    seq[rand_id] = mask_id
    return loss_mask

def do_replace(seq, len_arange, pad_id, word_ids):
    # 10%: replace the selected positions with other words
    loss_mask, rand_id = _get_loss_mask(len_arange, seq, pad_id)
    seq[rand_id] = np.random.choice(word_ids, size=len(rand_id))
    return loss_mask

def do_nothing(seq, len_arange, pad_id):
    # 10%: leave the selected positions unchanged
    loss_mask, _ = _get_loss_mask(len_arange, seq, pad_id)
    return loss_mask

def random_mask_or_replace(data, arange, batch_size):
    seqs, segs, xlen, nsp_labels = data.sample(batch_size)
    seqs_ = seqs.copy()  # uncorrupted ids, used as prediction targets
    p = np.random.random()
    if p < 0.7:     # I adjusted the ratios slightly in my code
        # mask
        loss_mask = np.concatenate([do_mask(...) for i in range(len(seqs))], axis=0)
    elif p < 0.85:  # I adjusted the ratios slightly in my code
        # do nothing
        loss_mask = np.concatenate([do_nothing(...) for i in range(len(seqs))], axis=0)
    else:
        # replace
        loss_mask = np.concatenate([do_replace(...) for i in range(len(seqs))], axis=0)
    return seqs, segs, seqs_, loss_mask, xlen, nsp_labels

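Because BERT's main architecture is the Transformer Encoder, and the GPT we wrote earlier uses that same encoder, we only need to take the GPT structure and change how the loss is computed and how the mask works (making it bidirectional).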

from GPT import GPT

# This BERT inherits directly from my GPT
class BERT(GPT):
    def __init__(self, ...):
        super().__init__(...)

    def call(self, seqs, segs, training=False):
        # GPT's forward pass, unchanged
        embed = self.input_emb(seqs, segs)  # [n, step, dim]
        z = self.encoder(embed, training=training, mask=self.mask(seqs))     # [n, step, dim]
        mlm_logits = self.task_mlm(z)  # [n, step, n_vocab]
        nsp_logits = self.task_nsp(tf.reshape(z, [z.shape[0], -1]))  # [n, n_cls]
        return mlm_logits, nsp_logits

    def step(self, seqs, segs, seqs_, loss_mask, nsp_labels):
        with tf.GradientTape() as tape:
            # logits for the two tasks
            mlm_logits, nsp_logits = self.call(seqs, segs, training=True)
            # trick: for the language-model prediction, mask out the positions whose loss can be ignored
            mlm_loss_batch = tf.boolean_mask(self.cross_entropy(seqs_, mlm_logits), loss_mask)
            mlm_loss = tf.reduce_mean(mlm_loss_batch)
            # next-sentence prediction
            nsp_loss = tf.reduce_mean(self.cross_entropy(nsp_labels, nsp_logits))
            loss = mlm_loss + 0.2 * nsp_loss
        grads = tape.gradient(loss, self.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.trainable_variables))
        return loss, mlm_logits

    def mask(self, seqs):
        # Overrides GPT's encoder mask method; only the pad mask is used here
        mask = tf.cast(tf.math.equal(seqs, self.padding_idx), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]  # [n, 1, 1, step]

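So this BERT is quite compatible with my GPT; only step() and mask() change slightly. With this modification we keep BERT's bidirectional attention and compute the loss only where it is needed.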
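To make the difference between the two masks concrete, here is a NumPy sketch. It follows the convention of the mask() method above, where 1 marks a key position that attention must not use. The function causal_and_pad_mask is only an assumption about what a GPT-style mask looks like, not a copy of the tutorial's GPT code, and the token ids are made up.

```python
import numpy as np

def pad_mask(seqs, padding_idx=0):
    """BERT-style mask: 1.0 where the key position is padding; shape [n, 1, 1, step]."""
    m = (seqs == padding_idx).astype(np.float32)
    return m[:, None, None, :]

def causal_and_pad_mask(seqs, padding_idx=0):
    """GPT-style mask (assumed form): 1.0 for future positions or padding; shape [n, 1, step, step]."""
    step = seqs.shape[1]
    # 1 above the diagonal: query i may not look at key j when j > i
    future = np.triu(np.ones((step, step), dtype=np.float32), k=1)
    # broadcast [1, 1, step, step] against [n, 1, 1, step]
    return np.maximum(future[None, None, :, :], pad_mask(seqs, padding_idx))

seqs = np.array([[5, 6, 7, 0]])  # three real tokens followed by one <PAD> (id 0)
print(pad_mask(seqs)[0, 0])             # one row, shared by every query position
print(causal_and_pad_mask(seqs)[0, 0])  # one row per query position
```

With the pad-only mask every query position shares the same row, so the first token can attend to the second and third; with the causal mask it cannot, which is what makes the GPT-style model unidirectional.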

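After training for a while, the learned attention looks as follows. First, the attention displayed as matrices.

Displayed as lines, it is easier to see how the model attends to each word. We also notice that the model is not well trained yet: it often seems to give up attending, with much of the attention landing on <GO> or other uninformative tokens. With more training and more data this improves considerably (GPT's training behaves similarly).

Earlier we noted that this BERT does not converge well in 10000 steps, while GPT converges fairly well in 5000. Why? The main reason is that each BERT training step is inefficient. Every step feeds in the full input, but only 15% of the words are predicted, whereas GPT predicts 100% of them, so a single BERT step carries far fewer useful labels.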
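A rough back-of-the-envelope count of the labels per training step makes the gap visible. The batch size of 16 and the 15% rate come from the full code below; the sequence length of 40 is purely an illustrative assumption, and labels_per_step is a hypothetical helper.

```python
def labels_per_step(batch_size, seq_len, predict_rate):
    """Number of positions that contribute to the loss in one training step."""
    # same floor of 2 positions per sentence as _get_loss_mask in the full code
    per_sentence = max(2, int(predict_rate * seq_len))
    return batch_size * per_sentence

batch_size, seq_len = 16, 40  # seq_len = 40 is an illustrative assumption
bert_labels = labels_per_step(batch_size, seq_len, 0.15)
gpt_labels = labels_per_step(batch_size, seq_len, 1.0)
print("BERT labels per step:", bert_labels)
print("GPT labels per step: ", gpt_labels)
print("ratio: %.1f" % (gpt_labels / bert_labels))
```

Under these assumptions a GPT step supervises several times as many positions as a BERT step, which is consistent with the slower convergence observed above.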

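Summary

BERT neatly realizes the idea of a bidirectional language model. As I understand it, a bidirectional model captures more information than a unidirectional one (GPT), so in principle it should perform better. But training a bidirectional language model involves many tricks, and we need to study them carefully to make training faster and more efficient.

Full code

Visualization code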

import pickle
import matplotlib.pyplot as plt


def self_attention_matrix(bert_or_gpt="bert", case=0):
    with open("./visual/tmp/"+bert_or_gpt+"_attention_matrix.pkl", "rb") as f:
        data = pickle.load(f)
    src = data["src"]
    attentions = data["attentions"]
    encoder_atten = attentions["encoder"]
    # put the x-axis tick labels on top of each matrix
    plt.rcParams['xtick.bottom'] = plt.rcParams['xtick.labelbottom'] = False
    plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True

    # length of the first sentence: count tokens up to the first <SEP>
    s_len = 0
    for s in src[case]:
        if s == "<SEP>":
            break
        s_len += 1

    plt.figure(0, (7, 28))
    for j in range(4):  # one subplot per attention head of the last encoder layer
        plt.subplot(4, 1, j + 1)
        img = encoder_atten[-1][case, j][:s_len-1, :s_len-1]
        plt.imshow(img, vmax=img.max(), vmin=0, cmap="rainbow")
        plt.xticks(range(s_len-1), src[case][:s_len-1], rotation=90, fontsize=9)
        plt.yticks(range(s_len-1), src[case][1:s_len], fontsize=9)
        plt.xlabel("head %i" % (j+1))
    plt.subplots_adjust(top=0.9)
    plt.tight_layout()
    plt.savefig("./visual/results/"+bert_or_gpt+"%d_self_attention.png" % case, dpi=500)
    # plt.show()

utils.MRPCData()

class MRPCData:
    num_seg = 3
    pad_id = PAD_ID

    def __init__(self, data_dir="./MRPC/", rows=None, proxy=None):
        maybe_download_mrpc(save_dir=data_dir, proxy=proxy)
        data, self.v2i, self.i2v = _process_mrpc(data_dir, rows)
        self.max_len = max(
            [len(s1) + len(s2) + 3 for s1, s2 in zip(
                data["train"]["s1id"] + data["test"]["s1id"],
                data["train"]["s2id"] + data["test"]["s2id"])])

        self.xlen = np.array([
            [len(data["train"]["s1id"][i]), len(data["train"]["s2id"][i])]
            for i in range(len(data["train"]["s1id"]))], dtype=int)
        x = [
            [self.v2i["<GO>"]] + data["train"]["s1id"][i] + [self.v2i["<SEP>"]] + data["train"]["s2id"][i] + [self.v2i["<SEP>"]]
            for i in range(len(self.xlen))
        ]
        self.x = pad_zero(x, max_len=self.max_len)
        self.nsp_y = data["train"]["is_same"][:, None]

        self.seg = np.full(self.x.shape, self.num_seg-1, np.int32)
        for i in range(len(x)):
            si = self.xlen[i][0] + 2
            self.seg[i, :si] = 0
            si_ = si + self.xlen[i][1] + 1
            self.seg[i, si:si_] = 1

        self.word_ids = np.array(list(set(self.i2v.keys()).difference(
            [self.v2i[v] for v in ["<PAD>", "<MASK>", "<SEP>"]])))

    def sample(self, n):
        bi = np.random.randint(0, self.x.shape[0], size=n)
        bx, bs, bl, by = self.x[bi], self.seg[bi], self.xlen[bi], self.nsp_y[bi]
        return bx, bs, bl, by

    @property
    def num_word(self):
        return len(self.v2i)

    @property
    def mask_id(self):
        return self.v2i["<MASK>"]

import numpy as np
import tensorflow as tf
import utils
import time
from GPT import GPT
import os
import pickle


class BERT(GPT):
    def __init__(self, model_dim, max_len, n_layer, n_head, n_vocab, lr, max_seg=3, drop_rate=0.1, padding_idx=0):
        super().__init__(model_dim, max_len, n_layer, n_head, n_vocab, lr, max_seg, drop_rate, padding_idx)
        # I think task emb is not necessary for pretraining,
        # because the aim of all tasks is to train a universal sentence embedding
        # the body encoder is the same across all tasks,
        # and different output layer defines different task just like transfer learning.
        # finetuning replaces output layer and leaves the body encoder unchanged.

        # self.task_emb = keras.layers.Embedding(
        #     input_dim=n_task, output_dim=model_dim,  # [n_task, dim]
        #     embeddings_initializer=tf.initializers.RandomNormal(0., 0.01),
        # )

    def step(self, seqs, segs, seqs_, loss_mask, nsp_labels):
        with tf.GradientTape() as tape:
            mlm_logits, nsp_logits = self.call(seqs, segs, training=True)
            mlm_loss_batch = tf.boolean_mask(self.cross_entropy(seqs_, mlm_logits), loss_mask)
            mlm_loss = tf.reduce_mean(mlm_loss_batch)
            nsp_loss = tf.reduce_mean(self.cross_entropy(nsp_labels, nsp_logits))
            loss = mlm_loss + 0.2 * nsp_loss
        grads = tape.gradient(loss, self.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.trainable_variables))
        return loss, mlm_logits

    def mask(self, seqs):
        mask = tf.cast(tf.math.equal(seqs, self.padding_idx), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]  # [n, 1, 1, step]


def _get_loss_mask(len_arange, seq, pad_id):
    # pick MASK_RATE of the candidate positions (at least 2) without repetition
    rand_id = np.random.choice(len_arange, size=max(2, int(MASK_RATE * len(len_arange))), replace=False)
    loss_mask = np.full_like(seq, pad_id, dtype=bool)
    loss_mask[rand_id] = True
    return loss_mask[None, :], rand_id


def do_mask(seq, len_arange, pad_id, mask_id):
    loss_mask, rand_id = _get_loss_mask(len_arange, seq, pad_id)
    seq[rand_id] = mask_id
    return loss_mask


def do_replace(seq, len_arange, pad_id, word_ids):
    loss_mask, rand_id = _get_loss_mask(len_arange, seq, pad_id)
    seq[rand_id] = np.random.choice(word_ids, size=len(rand_id))
    return loss_mask


def do_nothing(seq, len_arange, pad_id):
    loss_mask, _ = _get_loss_mask(len_arange, seq, pad_id)
    return loss_mask


def random_mask_or_replace(data, arange, batch_size):
    seqs, segs, xlen, nsp_labels = data.sample(batch_size)
    seqs_ = seqs.copy()  # uncorrupted ids, used as prediction targets
    p = np.random.random()
    if p < 0.7:
        # mask
        loss_mask = np.concatenate(
            [do_mask(
                seqs[i],
                np.concatenate((arange[:xlen[i, 0]], arange[xlen[i, 0] + 1:xlen[i].sum() + 1])),
                data.pad_id,
                data.v2i["<MASK>"]) for i in range(len(seqs))], axis=0)
    elif p < 0.85:
        # do nothing
        loss_mask = np.concatenate(
            [do_nothing(
                seqs[i],
                np.concatenate((arange[:xlen[i, 0]], arange[xlen[i, 0] + 1:xlen[i].sum() + 1])),
                data.pad_id) for i in range(len(seqs))], axis=0)
    else:
        # replace
        loss_mask = np.concatenate(
            [do_replace(
                seqs[i],
                np.concatenate((arange[:xlen[i, 0]], arange[xlen[i, 0] + 1:xlen[i].sum() + 1])),
                data.pad_id,
                data.word_ids) for i in range(len(seqs))], axis=0)
    return seqs, segs, seqs_, loss_mask, xlen, nsp_labels


def train(model, data, step=10000, name="bert"):
    t0 = time.time()
    arange = np.arange(0, data.max_len)
    for t in range(step):
        seqs, segs, seqs_, loss_mask, xlen, nsp_labels = random_mask_or_replace(data, arange, 16)
        loss, pred = model.step(seqs, segs, seqs_, loss_mask, nsp_labels)
        if t % 100 == 0:
            pred = pred[0].numpy().argmax(axis=1)
            t1 = time.time()
            print(
                "\n\nstep: ", t,
                "| time: %.2f" % (t1 - t0),
                "| loss: %.3f" % loss.numpy(),
                "\n| tgt: ", " ".join([data.i2v[i] for i in seqs[0][:xlen[0].sum()+1]]),
                "\n| prd: ", " ".join([data.i2v[i] for i in pred[:xlen[0].sum()+1]]),
                "\n| tgt word: ", [data.i2v[i] for i in seqs_[0]*loss_mask[0] if i != data.v2i["<PAD>"]],
                "\n| prd word: ", [data.i2v[i] for i in pred*loss_mask[0] if i != data.v2i["<PAD>"]],
                )
            t0 = t1
    os.makedirs("./visual/models/%s" % name, exist_ok=True)
    model.save_weights("./visual/models/%s/model.ckpt" % name)


def export_attention(model, data, name="bert"):
    model.load_weights("./visual/models/%s/model.ckpt" % name)

    # save attention matrix for visualization
    seqs, segs, xlen, nsp_labels = data.sample(32)
    model.call(seqs, segs, False)
    data = {"src": [[data.i2v[i] for i in seqs[j]] for j in range(len(seqs))], "attentions": model.attentions}
    path = "./visual/tmp/%s_attention_matrix.pkl" % name
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(data, f)


if __name__ == "__main__":
    utils.set_soft_gpu(True)
    MODEL_DIM = 256
    N_LAYER = 4
    LEARNING_RATE = 1e-4
    MASK_RATE = 0.15

    d = utils.MRPCData("./MRPC", 2000)
    print("num word: ", d.num_word)
    m = BERT(
        model_dim=MODEL_DIM, max_len=d.max_len, n_layer=N_LAYER, n_head=4, n_vocab=d.num_word,
        lr=LEARNING_RATE, max_seg=d.num_seg, drop_rate=0.2, padding_idx=d.v2i["<PAD>"])
    train(m, d, step=10000, name="bert")
    export_attention(m, d, "bert")
