Paper:《Pre-Trained Models: Past, Present and Future大规模预训练模型的发展历史、最新现状和未来发展三个方向》翻译与解读

导读:DNN把手工提取特征转为自动学习,但参数量大、训练数据不足时泛化差→手动构建高质量data但成本高→如何用有限的data高效训练DNN→迁移学习(基于已有经验知识,少样本解决新问题)→解决目标任务转为预训练+微调→预训练模型(PTM)的第一波浪潮→2016年CV首先受益→NLP自监督预训练利用大量未标记文本提供通用的语言知识→2017年Transformer/2018年GPT/2019年BERT→NLP让大模型成为趋势,更适合少样本/零样本学习场景→四大方向:设计有效的架构、利用丰富的上下文/多源数据、提高计算效率、进行解释和理论分析

目录

Paper:《Pre-Trained Models: Past, Present and Future大规模预训练模型的发展历史、最新现状和未来发展三个方向》翻译与解读

Abstract

1 Introduction简介

2 Background背景

2.1 Transfer Learning and Supervised Pre-Training 迁移学习和有监督预训练

2.2 Self-Supervised Learning and Self-Supervised Pre-Training 自监督学习和自监督预训练

3 Transformer and Representative PTMs Transformer和代表性的预训练模型

3.1 Transformer

3.2 GPT

3.3 BERT

3.4 After GPT and BERT

4 Designing Effective Architecture设计有效的架构

4.1 Unified Sequence Modeling统一序列建模

4.2 Cognitive-Inspired Architectures认知启发式架构

4.3 More Variants of Existing PTMs现有PTMs的更多变体

5 Utilizing Multi-Source Data利用多源数据

5.1 Multilingual Pre-Training多语言预训练

5.2 Multimodal Pre-Training多模态预训练

5.3 Knowledge-Enhanced Pre-Training增强知识的预训练

6 Improving Computational Efficiency提高计算效率

6.1 System-Level Optimization系统级优化

6.2 Efficient Pre-Training 高效的预训练

6.3 Model Compression模型压缩

7 Interpretation and Theoretical Analysis解释与理论分析

7.1 Knowledge of PTMs知识

7.2 Robustness of PTMs鲁棒性

7.3 Structural Sparsity of PTMs结构稀疏性

7.4 Theoretical Analysis of PTMs理论分析

8 Future Direction未来方向

8.1 Architectures and Pre-Training Methods架构和预训练方法

8.2 Multilingual and Multimodal Pre-Training多语言、多模态的预训练

8.3 Computational Efficiency计算效率

8.4 Theoretical Foundation理论基础

8.5 Modeledge Learning

8.6 Cognitive and Knowledgeable Learning认知和知识学习

8.7 Applications应用

9 Conclusion

Note and Contribution


Paper:《Pre-Trained Models: Past, Present and Future大规模预训练模型的发展历史、最新现状和未来发展三个方向》翻译与解读

作者:清华唐杰团队等
发布时间:2021年9月
文章地址

http://keg.cs.tsinghua.edu.cn/jietang/publications/AIOPEN21-Han-et-al-Pre-Trained%20Models-%20Past,%20Present%20and%20Future.pdf

Xu Han1*, Zhengyan Zhang1*, Ning Ding1*, Yuxian Gu1*, Xiao Liu1*, Yuqi Huo2*, Jiezhong Qiu1, Liang Zhang2, Wentao Han1†, Minlie Huang1†, Qin Jin2†, Yanyan Lan4†, Yang Liu1,4†, Zhiyuan Liu1†, Zhiwu Lu3†, Xipeng Qiu5†, Ruihua Song3†, Jie Tang1†, Ji-Rong Wen3†, Jinhui Yuan6†, Wayne Xin Zhao3†, Jun Zhu1†
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 School of Information, Renmin University of China, Beijing, China
3 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
4 Institute for AI Industry Research, Tsinghua University, Beijing, China
5 School of Computer Science, Fudan University, Shanghai, China
6 OneFlow Inc., Beijing, China
{hanxu17,zy-z19,dingn18,gu-yx17,liuxiao17,qiujz16}@mails.tsinghua.edu.cn,
{hanwentao,aihuang,lanyanyan,liuyang2011,liuzy,jietang,dcszj}@tsinghua.edu.cn,
{bnhony,zhangliang00,qjin,luzhiwu,jrwen,batmanfly}@ruc.edu.cn,
xpqiu@fudan.edu.cn, songruihua_bloon@outlook.com, yuanjinhui@oneflow.org

Abstract

Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental verification and empirical analysis. It is now the consensus of the AI community to adopt PTMs as backbone for downstream tasks rather than learning models from scratch. In this paper, we take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum. Further, we comprehensively review the latest breakthroughs of PTMs. These breakthroughs are driven by the surge of computational power and the increasing availability of data, towards four important directions: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. Finally, we discuss a series of open problems and research directions of PTMs, and hope our view can inspire and advance the future study of PTMs.

近年来,BERT和GPT等大型预训练模型(PTM)取得了巨大的成功,成为人工智能(AI)领域的一个里程碑。由于复杂的预训练目标和庞大的模型参数,大规模PTMs能够有效地从大量有标签和无标签的数据中捕获知识。通过将知识存储到巨大的参数中,并对特定的任务进行微调,隐含在巨大参数中的丰富知识可以使各种下游任务受益,这已通过实验验证和经验分析得到广泛证明。现在AI社区的共识是采用PTMs作为下游任务的骨干,而不是从零开始学习模型。在本文中,我们深入研究了预训练的历史,特别是它与迁移学习、自监督学习的特殊关系,以揭示PTMs在人工智能发展谱系中的关键地位。此外,我们全面回顾了PTMs的最新突破。这些突破是由计算能力的激增和数据可用性的增加驱动的,朝着四个重要方向发展:设计有效的架构、利用丰富的上下文、提高计算效率,以及进行解释和理论分析。最后,我们讨论了PTMs的一系列有待解决的问题和研究方向,希望我们的观点能对PTMs的未来研究有所启发和推动。

1 Introduction简介

Deep neural networks, such as convolutional neural networks (CNNs) (Krizhevsky et al., 2012; Kim, 2014; Kalchbrenner et al., 2014; He et al., 2016), recurrent neural networks (RNNs) (Sutskever et al., 2014; Donahue et al., 2015; Liu et al., 2016; Wu et al., 2016), graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Schlichtkrull et al., 2018), and attention neural networks (Jaderberg et al., 2015; Wang et al., 2017), have been widely applied for various artificial intelligence (AI) tasks in recent years. Different from previous non-neural models that largely relied on hand-crafted features and statistical methods, neural models can automatically learn low-dimensional continuous vectors (a.k.a., distributed representations) from data as task-specific features, thereby getting rid of complex feature engineering. Despite the success of deep neural networks, a number of studies have found that one of their critical challenges is data hungry. Since deep neural networks usually have a large number of parameters, they are thus easy to overfit and have poor generalization ability (Belkin et al., 2019; Xu et al., 2021) without sufficient training data.

Considering this issue, over the same period of developing deep neural networks, massive efforts have been devoted to manually constructing high-quality datasets for AI tasks (Deng et al., 2009; Lin et al., 2014; Bojar et al., 2014), making it possible to learn effective neural models for specific tasks that are superior to conventional non-neural models. However, it is expensive and time-consuming to manually annotate large-scale data. For example, utilizing crowdsourcing to segment images costs about $6.4 per image (Liu et al., 2020b). Some complex tasks that require expert annotations may charge much more to build their datasets. Several tasks such as visual recognition (Deng et al., 2009) and machine translation (Bojar et al., 2014) have datasets containing millions of samples, yet it is impossible to build such large-scale datasets for all AI tasks. More generally, the dataset of a specific AI task usually has a limited size. Hence, for a long time until now, it has been a key research issue: how to train effective deep neural models for specific tasks with limited human-annotated data.

深度神经网络,如卷积神经网络(CNNs) (Krizhevsky et al., 2012; Kim, 2014; Kalchbrenner et al., 2014; He et al., 2016)、循环神经网络(RNNs) (Sutskever et al., 2014; Donahue et al., 2015; Liu et al., 2016; Wu et al., 2016)、图神经网络(GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Schlichtkrull et al., 2018)和注意力神经网络 (Jaderberg et al., 2015; Wang et al., 2017),近年来已广泛应用于各种人工智能 (AI) 任务。与以前主要依赖手工特征和统计方法的非神经模型不同,神经模型可以从数据中自动学习低维连续向量(又称分布式表示)作为任务特定的特征,从而摆脱复杂的特征工程。尽管深度神经网络取得了成功,但许多研究发现,它们面临的关键挑战之一是数据匮乏。由于深度神经网络通常具有大量参数,因此在没有足够训练数据的情况下,它们很容易过度拟合并且泛化能力较差(Belkin 等人,2019;Xu 等人,2021)。

考虑到这个问题,在开发深度神经网络的同一时期,大量努力致力于为人工智能任务手动构建高质量的数据集(Deng et al., 2009; Lin et al., 2014; Bojar et al., 2014),从而有可能为特定任务学习有效的神经模型,这些模型优于传统的非神经模型。但是,手动注释大规模数据既昂贵又耗时。例如,使用众包分割图像的成本约为每张图像 6.4 美元(Liu et al., 2020b)。一些需要专家注释的复杂任务可能需要花费更多的钱来构建它们的数据集。视觉识别 (Deng et al., 2009) 和机器翻译 (Bojar et al., 2014) 等一些任务的数据集包含数百万个样本,但不可能为所有 AI 任务构建如此大规模的数据集。更普遍地说,特定 AI 任务的数据集通常规模有限。因此,长期以来直到现在,如何在有限的人工标注数据下为特定任务训练有效的深度神经模型,一直是一个关键的研究问题。

Figure 1: The two figures show the significant improvement on performance of both language understanding and language generation after using large-scale PTMs.

图1:这两幅图显示了在使用大规模的PTMs后,语言理解和语言生成性能显著提高

One milestone for this issue is the introduction of transfer learning (Thrun and Pratt, 1998; Pan and Yang, 2009). Instead of training a model from scratch with large amounts of data, human beings can learn to solve new problems with very few samples. This amazing learning process is motivated by the fact that human beings can use previously learned knowledge to handle new problems. Inspired by this, transfer learning formalizes a two-phase learning framework: a pre-training phase to capture knowledge from one or more source tasks, and a fine-tuning stage to transfer the captured knowledge to target tasks. Owing to the wealth of knowledge obtained in the pre-training phase, the fine-tuning phase can enable models to well handle target tasks with limited samples.

Transfer learning provides a feasible method for alleviating the challenge of data hungry, and it has soon been widely applied to the field of computer vision (CV). A series of CNNs (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016) are pre-trained on the human-annotated visual recognition dataset ImageNet (Deng et al., 2009). Benefiting from the strong visual knowledge distributed in ImageNet, fine-tuning these pre-trained CNNs with a small amount of task-specific data can perform well on downstream tasks. This triggers the first wave of exploring pre-trained models (PTMs) in the era of deep learning. In this wave, PTMs are used for almost all CV tasks such as image classification (He et al., 2016), object detection (Sermanet et al., 2014; Ren et al., 2016), image segmentation (Long et al., 2015), and image captioning (Vinyals et al., 2015).

The natural language processing (NLP) community was also aware of the potential of PTMs and started to develop PTMs for NLP tasks (Qiu et al., 2020). To take full advantage of large-scale unlabeled corpora to provide versatile linguistic knowledge for NLP tasks, the NLP community adopts self-supervised learning (Liu et al., 2020b) to develop PTMs. The motivation of self-supervised learning is to leverage intrinsic correlations in the text as supervision signals instead of human supervision. For example, given the sentence “Beijing is the capital of China”, we mask the last word in the sentence, and then require models to predict the masked position with the word “China”. Through self-supervised learning, tremendous amounts of unlabeled textual data can be utilized to capture versatile linguistic knowledge without labor-intensive workload. This self-supervised setting in essence follows the well-known language model learning (Bengio et al., 2003).

这个问题的一个里程碑是迁移学习的引入(Thrun 和 Pratt,1998;Pan 和 Yang,2009)。人类可以学习用很少的样本来解决新问题,而不是用大量的数据从头开始训练模型。这一惊人的学习过程的动机是人类可以使用以前学到的知识来处理新问题。受此启发,迁移学习形式化了一个两阶段的学习框架:从一个或多个源任务中获取知识的预训练阶段,以及将获取的知识转移到目标任务的微调阶段。由于在预训练阶段获得了丰富的知识,因此微调阶段可以使模型在样本有限的情况下很好地处理目标任务。

迁移学习为缓解数据饥饿的挑战提供了一种可行的方法,并很快被广泛应用于计算机视觉(CV)领域。一系列CNN(Krizhevsky 等人,2012;Simonyan 和 Zisserman,2015;Szegedy 等人,2015;He 等人,2016)在人类标注的视觉识别数据集ImageNet上进行预训练(Deng等人,2009)。受益于 ImageNet 中分布的强大视觉知识,使用少量特定于任务的数据对这些预训练的 CNN 进行微调可以在下游任务上表现良好。这引发了深度学习时代探索预训练模型(PTM)的第一波浪潮。在这一浪潮中,PTMs被用于几乎所有的CV任务,如图像分类(He et al., 2016)、目标检测(Sermanet et al., 2014;Ren et al., 2016)、图像分割(Long et al., 2015)和图像字幕(Vinyals et al., 2015)。

自然语言处理(NLP)社区也意识到了PTMs的潜力,并开始为NLP任务开发PTMs (Qiu et al., 2020)。为了充分利用大规模未标记语料库为 NLP 任务提供通用的语言知识,NLP社区采用了自监督学习(Liu et al., 2020b)来开发PTMs。自监督学习的动机是利用文本中的内在相关性作为监督信号而不是人工监督。例如,给定“Beijing is the capital of China”这句话,我们将最后一个单词屏蔽掉,然后要求模型用“China”这个词来预测被屏蔽的位置。通过自监督学习,可以利用大量的未标记文本数据来获取通用的语言知识,而无需耗费大量的劳动密集型工作量。这种自监督的设置在本质上遵循了著名的语言模型学习(Bengio et al., 2003)。

Figure 2: Figure 2(a) shows the number of publications with the keyword “language model” as well as their citations in different years. Figure 2(b) shows the parameter size of large-scale PTMs for NLP tasks and the pre-training data size are increasing by 10 times per year. From these figures, we can find that, after 2018, when large-scale NLP PTMs begin to be explored, more and more efforts are devoted to this field, and the model size and data size used by the PTMs are also getting larger.

图2:图2(a)显示了关键词为“language model”的出版物数量及其在不同年份的被引次数。图2(b)显示了针对NLP任务的大规模PTMs的参数规模和预训练数据规模都在以每年10倍的速度增长。从这些数据可以看出,在2018年以后,当大规模的NLP PTMs开始被探索的时候,越来越多的精力投入到这个领域,PTMs所使用的模型大小和数据大小也越来越大。

For a long time, the problem of vanishing or exploding gradients (Bengio et al., 1994) is the pain point of using deep neural networks for NLP tasks. Therefore, when the CV community advances the research of deep PTMs, the early exploration of the NLP community focuses on pre-training shallow networks to capture semantic meanings of words, like Word2Vec (Mikolov et al., 2013b,a,c) and GloVe (Pennington et al., 2014). Although these pre-trained word embeddings play an important role in various NLP tasks, they still face a major limitation to represent polysemous words in different contexts, as each word is represented by only one dense vector. A famous example in NLP is that the word “bank” has entirely different meanings in the sentences “open a bank account” and “on a bank of the river”. This motivates pre-training RNNs to provide contextualized word embeddings (Melamud et al., 2016; Peters et al., 2018; Howard and Ruder, 2018), yet the performance of these models is still limited by their model size and depth.

With the development of deep neural networks in the NLP community, the introduction of Transformers (Vaswani et al., 2017) makes it feasible to train very deep neural models for NLP tasks. With Transformers as architectures and language model learning as objectives, deep PTMs GPT (Radford and Narasimhan, 2018) and BERT (Devlin et al., 2019) are proposed for NLP tasks in 2018. From GPT and BERT, we can find that when the size of PTMs becomes larger, large-scale PTMs with hundreds of millions of parameters can capture polysemous disambiguation, lexical and syntactic structures, as well as factual knowledge from the text. By fine-tuning large-scale PTMs with quite a few samples, rich linguistic knowledge of PTMs brings awesome performance on downstream NLP tasks. As shown in Figure 1(a) and Figure 1(b), large-scale PTMs well perform on both language understanding and language generation tasks in the past several years and even achieve better results than human performance. As shown in Figure 2(a), all these efforts and achievements in the NLP community let large-scale PTMs become the focus of AI research, after the last wave that PTMs allow for huge advances in the CV community.

长期以来,梯度消失或爆炸的问题(Bengio et al., 1994)是使用深度神经网络进行NLP任务的痛点。因此,在CV社区推进深度PTMs的研究时,NLP社区的早期探索主要集中在对浅层网络进行预训练以捕获单词的语义,如Word2Vec (Mikolov et al., 2013b,a,c)和GloVe (Pennington et al., 2014)。尽管这些预训练好的词嵌入在各种NLP任务中发挥着重要作用,但由于每个词仅由一个密集向量表示,因此在不同语境下对多义词的表示仍存在很大的局限性。NLP中一个著名的例子是“bank”这个词在“open a bank account”和“on a bank of the river”这两个句子中有着完全不同的含义。这促使预训练 RNN 提供上下文化的词嵌入(Melamud 等人,2016;Peters 等人,2018;Howard 和 Ruder,2018),但这些模型的性能仍然受到模型大小和深度的限制。

随着深度神经网络在NLP领域的发展,Transformers (Vaswani et al., 2017)的引入,使得为NLP任务训练非常深的神经模型成为可能。以Transformer为架构,以语言模型学习为目标,2018 年针对 NLP 任务提出了深度 PTM GPT (Radford and Narasimhan, 2018) 和 BERT (Devlin et al., 2019)。从GPT和BERT中我们可以发现,当PTM 的规模变得更大时,具有数亿个参数的大规模 PTM 可以从文本中捕获多义消歧、词汇和句法结构以及事实知识。通过使用相当多的样本微调大规模 PTM,PTM 丰富的语言知识为下游 NLP 任务带来了出色的性能。如图1(a)和图1(b)所示,在过去的几年中,大规模的PTMs在语言理解和语言生成任务上都表现良好,甚至达到了比人类性能更好的结果。如图 2(a) 所示,在上一波 PTM 为 CV 社区带来巨大进步之后,NLP 社区的所有这些努力和成就让大规模 PTM 成为 AI 研究的重点。

Up to now, various efforts have been devoted to exploring large-scale PTMs, either for NLP (Radford et al., 2019; Liu et al., 2020d; Raffel et al., 2020; Lewis et al., 2020a), or for CV (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019). Fine-tuning large-scale PTMs for specific AI tasks instead of learning models from scratch has also become a consensus (Qiu et al., 2020). As shown in Figure 2(b), with the increasing computational power boosted by the wide use of distributed computing devices and strategies, we can further advance the parameter scale of PTMs from million-level to billion-level (Brown et al., 2020; Lepikhin et al., 2021; Zeng et al., 2021; Zhang et al., 2020c, 2021a) and even trillion-level (Fedus et al., 2021). And the emergence of GPT-3 (Brown et al., 2020), which has hundreds of billions of parameters, enables us to take a glimpse of the latent power distributed in massive model parameters, especially the great abilities of few-shot learning like human beings (shown in Figure 3).

The existing large-scale PTMs have improved the model performance on various AI tasks and even subverted our current perception of the performance of deep learning models. However, several fundamental issues about PTMs still remain: it is still not clear for us the nature hidden in huge amounts of model parameters, and huge computational cost of training these behemoths also prevents us from further exploration. At this moment, these PTMs have pushed our AI researchers to a crossroad, with a number of open directions to go. “Rome wasn’t built in a day”— PTMs also experience a long development before achieving the latest success. To this end, we try to trace the development history of PTMs and draw their positions in the AI spectrum, which can give us a clear understanding of the core research issues of PTMs. Then, we introduce the details of various latest PTMs, following four important lines that are currently being advanced, including designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. By integrating the current development of PTMs into the context of the historical spectrum, we discuss several open problems and conclude promising future directions for PTMs. We hope our efforts in this paper can advance further development of PTMs. In what follows, we will introduce the background of pre-training in Section 2 and Section 3, the model architectures of PTMs in Section 4, using multi-source heterogeneous data for PTMs in Section 5, the computational efficiency optimization of PTMs in Section 6, and the theoretical analysis of PTMs in Section 7. Finally, we will briefly discuss a series of open problems and promising directions towards better PTMs in the future.

到目前为止,人们已经做出了各种努力来探索大规模的PTMs,无论是针对NLP (Radford et al., 2019;Liu et al., 2020d;Raffel et al., 2020;Lewis et al., 2020a),还是针对CV (Lu et al., 2019;Li et al., 2019;Tan and Bansal, 2019)。针对特定的人工智能任务对大规模PTMs进行微调,而不是从零开始学习模型,也已成为共识(Qiu等人,2020年)。如图2(b)所示,随着分布式计算设备和策略的广泛应用,计算能力不断增强,我们可以进一步将PTMs的参数规模从百万级提升到十亿级(Brown et al., 2020; Lepikhin et al., 2021; Zeng et al., 2021; Zhang et al., 2020c, 2021a),甚至是万亿级(Fedus et al., 2021)。而具有数千亿参数的GPT-3 (Brown et al., 2020)的出现,让我们得以一窥分布在海量模型参数中的潜在能力,特别是像人类一样的少样本学习能力(如图3所示)。

现有的大规模PTMs提高了模型在各种人工智能任务上的性能,甚至颠覆了我们目前对深度学习模型性能的认知。然而,关于PTMs的几个基本问题仍然存在:我们仍然不清楚隐藏在大量模型参数中的本质,训练这些庞然大物的巨大计算成本也阻碍了我们进一步的探索。目前,这些 PTM 已将我们的 AI 研究人员推到了一个十字路口,有许多开放的方向可走。“罗马不是一天建成的”——PTMs在取得最新的成功之前也经历了一个漫长的发展过程。为此,我们试图追溯PTMs的发展历史,并绘制出它们在AI谱系中的位置,这可以让我们清楚地了解PTMs的核心研究问题。然后,我们介绍了各种最新 PTM 的细节,遵循目前正在推进的四个重要方向,包括设计有效的架构、利用丰富的上下文、提高计算效率,以及进行解释和理论分析。通过将 PTM 的当前发展整合到历史谱系的背景下,我们讨论了几个未解决的问题,并总结了 PTM 有前景的未来方向。我们希望本文的工作能够推动PTMs的进一步发展。接下来,我们将在第 2 节和第 3 节介绍预训练的背景,第 4 节介绍 PTM 的模型架构,第 5 节介绍 PTM 的多源异构数据,第 6 节介绍 PTM 的计算效率优化,以及第 7 节中对 PTM 的理论分析。最后,我们将简要讨论一系列未解决的问题和通向更好的PTMs的有前景的方向。

Figure 3: GPT-3, with 175 billion parameters, uses 560 GB data and 10,000 GPUs for its training. It has shown the abilities of learning world knowledge, common sense, and logical reasoning.

图3:具有 1750 亿个参数的 GPT-3 使用 560 GB 数据和 10,000 个 GPU 进行训练。 它展示了学习世界知识、常识和逻辑推理的能力。

2 Background背景

Although effective PTMs have recently gained the attention of researchers, pre-training is not a novel machine learning tool. In fact, pre-training has been developed for decades, as a typical machine learning paradigm. In this section, we introduce the development of pre-training in the AI spectrum, from early supervised pre-training to current self-supervised pre-training, which can lead to a brief understanding of the background of PTMs.

尽管有效的PTMs最近引起了研究人员的关注,但预训练并不是一种新的机器学习工具。事实上,作为一种典型的机器学习范式,预训练已经发展了几十年。在本节中,我们将介绍人工智能领域中预训练的发展,从早期的有监督的预训练到目前的自监督的预训练,从而可以简要了解 PTM 的背景。

2.1 Transfer Learning and Supervised Pre-Training 迁移学习和监督预训练

The early efforts of pre-training are mainly involved in transfer learning (Thrun and Pratt, 1998). The study of transfer learning is heavily motivated by the fact that people can rely on previously learned knowledge to solve new problems and even achieve better results. More formally, transfer learning aims to capture important knowledge from multiple source tasks and then apply the knowledge to a target task.

In transfer learning, source tasks and target tasks may have completely different data domains and task settings, yet the knowledge required to handle these tasks is consistent (Pan and Yang, 2009). It is thus important to select a feasible method to transfer knowledge from source tasks to target tasks. To this end, various pre-training methods have been proposed to work as the bridge between source and target tasks. Specifically, these methods first pre-train models on the data of multiple source tasks to pre-encode knowledge and then transfer the pre-encoded knowledge to train models for target tasks.

Generally, two pre-training approaches are widely explored in transfer learning: feature transfer and parameter transfer. Feature transfer methods pre-train effective feature representations to pre-encode knowledge across domains and tasks (Johnson and Zhang, 2005; Evgeniou and Pontil, 2007; Dai et al., 2007; Raina et al., 2007). By injecting these pre-trained representations into target tasks, model performance of target tasks can be significantly improved. Parameter transfer methods follow an intuitive assumption that source tasks and target tasks can share model parameters or prior distributions of hyper-parameters. Therefore, these methods pre-encode knowledge into shared model parameters (Lawrence and Platt, 2004; Evgeniou and Pontil, 2004; Williams et al., 2007; Gao et al., 2008), and then transfer the knowledge by fine-tuning pre-trained parameters with the data of target tasks.

早期的预训练工作主要涉及迁移学习(Thrun和Pratt, 1998)。迁移学习的研究很大程度上是基于人们可以依靠以前学到的知识来解决新问题,甚至取得更好的结果。更正式地说,迁移学习的目的是从多个源任务中获取重要的知识,然后将这些知识应用到目标任务中

在迁移学习中,源任务和目标任务可能具有完全不同的数据域和任务设置,但处理这些任务所需的知识是一致的(Pan和Yang, 2009)。因此,选择一种可行的方法将知识从源任务转移到目标任务是非常重要的。为此,提出了各种预训练方法,作为源任务和目标任务之间的桥梁。具体来说,这些方法首先对多个源任务的数据进行预训练,对知识进行预编码,然后将预编码的知识转移到目标任务的训练模型中。

一般来说,迁移学习的两种预训练方法是特征迁移参数迁移特征迁移方法对有效的特征表示进行预训练,以对跨领域和任务的知识进行预编码(Johnson和Zhang, 2005;Evgeniou和Pontil, 2007;Dai等人,2007;Raina等人,2007)。通过将这些预先训练的表征注入到目标任务中,可以显著提高目标任务的模型性能。参数迁移方法遵循一个直观的假设,即源任务和目标任务可以共享模型参数或超参数的先验分布。因此,这些方法将知识预编码为共享的模型参数(Lawrence和Platt, 2004;Ev-geniou和Pontil, 2004年;Williams等人,2007;Gao et al., 2008),然后利用目标任务的数据,通过微调预训练参数传递知识

Figure 4: The spectrum of pre-training methods from transfer learning, self-supervised learning to the latest pre-training neural models.

图4:从迁移学习、自监督学习到最新的预训练神经模型的预训练方法的范围。

To some extent, both representation transfer and parameter transfer lay the foundation of PTMs. Word embeddings, widely used as the input of NLP tasks, are built on the framework of feature transfer. Inspired by parameter transfer, pre-trained CNNs are applied as the backbone of most state-of-the-art CV models. Some recent well-known PTMs are also based on representation transfer and parameter transfer, e.g., ELMo (Peters et al., 2018) and BERT apply representation transfer and parameter transfer respectively.

Since AlexNet (Krizhevsky et al., 2012), a series of deep neural networks have been developed for AI tasks. As compared with those conventional machine learning models, deep neural models have more parameters and show better capabilities of fitting complex data. Therefore, from AlexNet to later VGG (Simonyan and Zisserman, 2015) and GoogleNet (Szegedy et al., 2015), the architecture of these neural networks becomes deeper and deeper, and their performance accordingly becomes better and better. Although the network depth is important, training a deep network is not easy, as stacking more network layers inevitably brings the problem of vanishing or exploding gradients (Bengio et al., 1994). Besides the gradient issues, model performance may soon meet a ceiling and then degrade rapidly with continually increasing network depths.

在一定程度上,表征迁移和参数迁移都是PTMs的基础。词嵌入被广泛用作NLP任务的输入,它正是建立在特征迁移的框架之上。受参数迁移的启发,预训练的CNN被用作最先进的CV模型的主干。最近一些著名的PTMs也是基于表征迁移和参数迁移的,如ELMo (Peters et al., 2018)和BERT分别采用了表征迁移和参数迁移。

AlexNet (Krizhevsky et al., 2012)以来,一系列用于人工智能任务的深度神经网络被开发出来。与传统的机器学习模型相比,深度神经模型具有更多的参数,显示出更好的拟合复杂数据的能力。因此,从AlexNet到后来的VGG (Simonyan and Zisserman, 2015)和GoogleNet (Szegedy et al., 2015),这些神经网络的架构越来越深性能也越来越好。虽然网络深度很重要,但训练一个深度网络并,因为堆叠更多的网络不可避免地会带来梯度消失或爆炸的问题(Ben-gio et al., 1994)。除了梯度问题外,模型性能可能很快就会遇到一个上限,然后随着网络深度的不断增加而迅速下降。

By adding normalization to parameter initialization (LeCun et al., 2012; Saxe et al., 2013) and hidden states (Ioffe and Szegedy, 2015), and introducing shortcut connections with residual layers, ResNet (He et al., 2016) effectively tackles these problems. As we mentioned before, deep neural networks require large amounts of data for training. To provide sufficient data to train deep models, some large-scale supervised datasets have also been built (Russakovsky et al., 2015; Lin et al., 2014; Krishna et al., 2017; Chen et al., 2015; Cordts et al., 2016), and the most representative one is ImageNet. ImageNet contains millions of images divided into thousands of categories, representing a wide variety of everyday objects. Based on the combination of effective model ResNet, informative dataset ImageNet, as well as mature knowledge transfer methods, a wave of pre-training models on labeled data emerges.

The CV community benefits a lot from this wave. By applying ResNet pre-trained on ImageNet as the backbone, various CV tasks have been quickly advanced, like image classification (He et al., 2016; Lee et al., 2015), object detection (Ren et al., 2016; Sermanet et al., 2014; Gidaris and Komodakis, 2015), image segmentation (Long et al., 2015; Zheng et al., 2015), image caption (Vinyals et al., 2015; Johnson et al., 2016), visual question answering (Antol et al., 2015; Gao et al., 2015; Xiong et al., 2016), etc. Utilizing PTMs like ResNet50 has proven to be a crucial step to obtain highly accurate results on most CV tasks. Inspired by the success of PTMs for CV tasks, some NLP researchers also explore supervised pre-training, and the most representative work is CoVE (McCann et al., 2017). CoVE adopts machine translation as its pre-training objective. After pre-training, the encoder of source languages can work as a powerful backbone for downstream NLP tasks.

通过对参数初始化 (LeCun et al., 2012; Saxe et al., 2013) 和隐藏状态 (Ioffe and Szegedy, 2015) 添加归一化,并引入带有残差层的快捷连接,ResNet (He et al., 2016) 有效地解决了这些问题。正如我们之前提到的,深度神经网络需要大量的数据进行训练。为了提供足够的数据来训练深度模型,还建立了一些大规模的监督数据集(Russakovsky et al., 2015;Lin等人,2014;Krishna等人,2017年;Chen et al., 2015;Cordts et al., 2016),其中最具代表性的是ImageNet。ImageNet 包含数百万张图像,分为数千个类别,代表各种各样的日常对象。基于有效模型 ResNet、信息丰富的数据集 ImageNet 以及成熟的知识迁移方法的组合,出现了一波在标记数据上预训练模型的浪潮。

CV社区从这波浪潮中获益良多。通过将在ImageNet上预训练的ResNet作为骨干,各种CV任务得到了快速的推进,比如图像分类(He et al., 2016;Lee et al., 2015),目标检测(Ren et al., 2016;Sermanet et al., 2014;Gidaris和Komodakis, 2015),图像分割(Long et al., 2015;Zheng et al., 2015),图像描述(Vinyals等人,2015;Johnson等人,2016),视觉问答(Antol等人,2015;Gao et al., 2015;Xiong et al., 2016)等。事实证明,利用ResNet50等PTM是在大多数CV任务上获得高度准确结果的关键步骤。受PTMs在CV任务上成功的启发,一些NLP研究者也探索了有监督的预训练,其中最具代表性的工作是CoVE (McCann et al., 2017)。CoVE采用机器翻译作为其预训练目标。经过预训练后,源语言的编码器可以作为下游NLP任务的强大主干。

2.2 Self-Supervised Learning and Self-Supervised Pre-Training 自监督学习和自监督预训练

As shown in Figure 4, transfer learning can be categorized under four sub-settings, inductive transfer learning (Lawrence and Platt, 2004; Mihalkova et al., 2007; Evgeniou and Pontil, 2007), transductive transfer learning (Shimodaira, 2000; Zadrozny, 2004; Daume III and Marcu, 2006), self-taught learning (Raina et al., 2007; Dai et al., 2008), and unsupervised transfer learning (Wang et al., 2008).

Among these four settings, the inductive and transductive settings are the core of research, as these two settings aim to transfer knowledge from supervised source tasks to target tasks. Although supervised learning is always one of the core issues of machine learning research, the scale of unlabeled data is much larger than that of manually labeled data. Recently, more and more researchers have noticed the importance of large-scale unlabeled data and are committed to extracting information from unlabeled data. Self-supervised learning has been proposed to extract knowledge from large-scale unlabeled data by leveraging input data itself as supervision.

Self-supervised learning and unsupervised learning have many similarities in their settings. To a certain extent, self-supervised learning can be regarded as a branch of unsupervised learning because they both apply unlabeled data. However, unsupervised learning mainly focuses on detecting data patterns (e.g., clustering, community discovery, and anomaly detection), while self-supervised learning is still in the paradigm of supervised settings (e.g., classification and generation) (Liu et al., 2020b).

如图4所示,迁移学习可以分为四个子设置,即归纳式迁移学习(Lawrence and Platt, 2004;Mihalkova等人,2007年;Evgeniou和Pontil, 2007),转导式迁移学习(Shimodaira, 2000;Zadrozny, 2004;Daume III和Marcu, 2006),自学学习(Raina等人,2007;Dai等人,2008)和无监督式迁移学习(Wang等人,2008)。

在这四种设置中,归纳转导设置是研究的核心,因为这两种设置旨在将知识从有监督的源任务转移到目标任务。虽然监督学习一直是机器学习研究的核心问题之一,但未标注数据的规模远大于人工标注的数据。最近,越来越多的研究人员注意到了大规模未标记数据的重要性,并致力于从未标记数据中提取信息自监督学习一种利用输入数据本身作为监督,来从大规模的未标记数据中提取知识的方法

自监督学习无监督学习在它们的设置上有很多相似之处。在一定程度上,自监督学习可以看作无监督学习的一个分支,因为它们都应用了未标记的数据。然而,无监督学习主要侧重于检测数据模式(如聚类、社区发现和异常检测),而自监督学习仍处于监督设置的范式(如分类和生成)(Liu et al., 2020b)。

The development of self-supervised learning makes it possible to perform pre-training on large-scale unsupervised data. Compared to supervised pre-training working as the cornerstone of CV in the deep learning era, self-supervised pre-training allows for huge advances in the field of NLP. Although some supervised pre-training methods like CoVE have achieved promising results on NLP tasks, it is nearly impossible to annotate a textual dataset as large as ImageNet, considering annotating textual data is far more complex than annotating images. Hence, applying self-supervised learning to utilize unlabeled data becomes the best choice to pre-train models for NLP tasks. The recent stunning breakthroughs in PTMs are mainly towards NLP tasks, more specifically pre-trained language models.

The early PTMs for NLP tasks exist in the form of well-known word embeddings (Collobert and Weston, 2008; Mikolov et al., 2013b; Pennington et al., 2014), which apply self-supervised methods to transform words into distributed representations. As these pre-trained word representations capture syntactic and semantic information in the text, they are often used as input embeddings and initialization parameters for NLP models and offer significant improvements over random initialization parameters (Turian et al., 2010). Since these word-level models often suffer from the word polysemy, Peters et al. (2018) further adopt a sequence-level neural model to capture complex word features across different linguistic contexts and generates context-aware word embeddings. Using word embeddings as the input of neural models has almost become the common mode for NLP tasks.

自监督学习的发展使得大规模非监督数据进行预训练成为可能。与监督的预训练作为深度学习时代CV的基石相比,自监督预训练NLP 领域取得了巨大进步。尽管一些监督预训练方法(如CoVE)在NLP任务上取得了很好的效果,但要注释像 ImageNet 这样大的文本数据集几乎是不可能的,因为注释文本数据要比注释图像复杂得多。因此,应用自监督学习来利用未标记数据成为NLP任务预训练模型的最佳选择。PTMs 最近的惊人突破主要是针对 NLP 任务,更具体地说是预训练的语言模型

NLP任务早期的PTMs以众所周知的词嵌入形式存在(Collobert 和 Weston,2008;Mikolov 等人,2013b;Pennington 等人,2014),该方法应用自监督方法单词转换为分布式表示。由于这些预训练的词表示捕获文本中的句法和语义信息,它们经常被用作NLP模型的输入嵌入初始化参数,并比随机初始化参数提供了显著的改进(Turian et al., 2010)。由于这些词级模型经常遭受一词多义现象的困扰,Peters等人(2018)进一步采用序列级神经模型来捕获不同语言上下文中的复杂单词特征,并生成上下文感知词嵌入。使用词嵌入作为神经模型的输入几乎已成为自然语言处理任务的常用模式

After Vaswani et al. (2017) propose Transformers to deal with sequential data, PTMs for NLP tasks have entered a new stage, because it is possible to train deeper language models compared to conventional CNNs and RNNs. Different from those word-level PTMs used as input features, the Transformer-based PTMs such as GPT and BERT can be used as the model backbone of various specific tasks. After pre-training these Transformer-based PTMs on large-scale textual corpora, both the architecture and parameters of PTMs can serve as a starting point for specific NLP tasks, i.e., just fine-tuning the parameters of PTMs for specific NLP tasks can achieve competitive performance. So far, these Transformer-based PTMs have achieved state-of-the-art results on almost all NLP tasks. Inspired by GPT and BERT, many more effective PTMs for NLP tasks have also been proposed, like XLNET (Yang et al., 2019), RoBERTa (Liu et al., 2020d), BART (Lewis et al., 2020a), and T5 (Raffel et al., 2020).

With the recent advance of PTMs for NLP tasks, applying Transformer-based PTMs as the backbone of NLP tasks has become a standard procedure. Motivated by the success of self-supervised learning and Transformers in NLP, some researchers explore self-supervised learning (Wu et al., 2018; Chen et al., 2020c; Chen and He, 2020; He et al., 2020) and Transformers (Carion et al., 2020; Liu et al., 2021c) for CV tasks. These preliminary efforts have shown that self-supervised learning and Transformers can outperform conventional supervised CNNs. Furthermore, Transformer-based multimodal PTMs (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019) have also been proposed and shown promising results. After the last wave of supervised pre-training, self-supervised pre-training has become the focus of current AI research.

Looking back at the pre-training in the AI spectrum, it is not difficult to find that pre-training has been developed for decades, focusing on how to acquire versatile knowledge for various downstream tasks. Next, we will comprehensively introduce the latest breakthroughs of PTMs in this wave of self-supervised pre-training. Considering that almost all the latest PTMs are related to pre-trained language models, “PTMs” in the following sections refers to pre-trained language models or multimodal models. For those conventional PTMs based on supervised pre-training, we refer to the papers of He et al. (2019) and Zoph et al. (2020).

在Vaswani et al.(2017)提出Transformers处理序列数据后,用于 NLP 任务的 PTM 进入了一个新阶段,因为与传统的 CNN 和 RNN 相比,它可以训练更深的语言模型。与那些用作输入特征的单词级PTMs不同,基于Transformer的PTMs(如GPT和BERT)可以用作各种特定任务的模型主干。在大规模文本语料库上对这些基于Transformer的PTM进行预训练后,PTM的结构和参数都可以作为特定NLP任务的起点,即仅针对特定的NLP任务对PTM的参数进行微调就可以获得竞争性能。到目前为止,这些基于Transformer的PTMs在几乎所有的NLP任务上都取得了最先进的结果。在GPT和BERT的启发下,许多针对NLP任务的更有效的PTMs也被提出,如XLNET (Yang等人,2019)、RoBERTa (Liu等人,2020d)、BART (Lewis等人,2020a)和T5 (Raffel等人,2020)。

随着近年来NLP任务的PTMs的进步,应用基于Transformer的PTMs作为NLP任务的主干已经成为一种标准流程。在NLP中自监督学习和Transformers成功的推动下,一些研究者探索了将自监督学习(Wu et al., 2018;Chen et al., 2020c;Chen and He, 2020;He et al., 2020)和Transformers(Carion et al., 2020;Liu et al., 2021c)用于CV任务。这些初步的努力已经表明,自监督学习和Transformers可以超越传统的有监督CNN。此外,基于Transformer的多模态PTMs (Lu et al., 2019;Li et al., 2019;Tan和Bansal, 2019年)也被提出并显示出可喜的结果。在上一波有监督的预训练之后,自监督的预训练成为当前人工智能研究的焦点。

回顾人工智能领域的预训练,不难发现预训练已经发展了几十年专注于如何获取下游各种任务的通用知识。接下来,我们将全面介绍PTMs在这一波自监督式预训练中的最新突破。考虑到几乎所有最新的PTMs都与预训练的语言模型有关,以下章节中的“PTMs”是指预训练的语言模型或多模态模型。对于那些基于监督预训练的传统PTMs,我们参考He et al.(2019)和Zoph et al.(2020)的论文。

Figure 5: An illustration of the self-attention mechanism of Transformer. The figure shows the self-attention results when encoding the word “he”, where the darker the color of the square is, the larger the corresponding attention score is.

图5:Transformer的自注意力机制示意图。如图所示为编码单词“he”时的自注意力结果,正方形的颜色越深,对应的注意力分数越大。

3 Transformer and Representative PTMs Transformer和代表性的预训练模型

As we mentioned before, the key to the success of recent PTMs is an integration of self-supervised learning and Transformers. Hence, this section begins with the dominant basic neural architecture, Transformer. Then, we will introduce two landmark Transformer-based PTMs, GPT and BERT, which respectively use autoregressive language modeling and autoencoding language modeling as the pre-training objective. All subsequent PTMs are variants of these two models. The final part of this section gives a brief review of typical variants after GPT and BERT to reveal the recent development of PTMs.

正如我们前面提到的,最近PTMs成功的关键是自监督学习Transformers集成。因此,本节从占主导地位的基本神经结构Transformer开始。然后,我们将引入两种具有里程碑意义的基于Transformer的PTMs, GPTBERT,它们分别使用自回归语言建模和自编码语言建模作为预训练目标。所有后续的PTMs都是这两个模型的变体。本节的最后一部分简要回顾了GPTBERT之后的典型变体,揭示了PTMs的最新发展。

Figure 6: The difference between GPT and BERT in their self-attention mechanisms and pre-training objectives.

图6:GPTBERT在自注意力机制和预训练目标方面的差异。

3.1 Transformer

Before Transformer, RNNs have been typical neural networks for processing sequential data (especially for natural languages) for a long time. As RNNs are equipped with sequential nature, they read a word at each time step in order and refer to the hidden states of the previous words to process it. Such a mechanism is considered to be difficult to take advantage of the parallel capabilities of high-performance computing devices such as GPUs and TPUs.

As compared to RNNs, Transformer is an encoder-decoder structure that applies a self-attention mechanism, which can model correlations between all words of the input sequence in parallel. Hence, owing to the parallel computation of the self-attention mechanism, Transformer could fully take advantage of advanced computing devices to train large-scale models. In both the encoding and decoding phases of Transformer, the self-attention mechanism of Transformer computes representations for all input words. Next, we dive into the self-attention mechanism more specifically.

在Transformer之前,RNN长期以来一直是处理顺序数据(特别是自然语言)的典型神经网络。由于RNN具有顺序性,它们在每个时间步按顺序读取一个单词,并参考之前单词的隐藏状态来处理它。这种机制被认为很难利用高性能计算设备(如GPU和TPU)的并行能力。

与RNNs相比,Transformer是一种编码器-解码器结构,它应用了一种自注意力机制,可以并行地对输入序列中所有单词之间的相关性进行建模。因此,由于自注意力机制的并行计算Transformer可以充分利用先进的计算设备来训练大规模模型。在Transformer的编码和解码阶段,Transformer的自注意力机制计算所有输入单词的表示。接下来,我们将更具体地探讨自注意力机制

In the encoding phase, for a given word, Transformer computes an attention score by comparing it with each other word in the input sequence. And such attention scores indicate how much each of the other words should contribute to the next representation of the given word. Then, the attention scores are utilized as weights to compute a weighted average of the representations of all the words. We give an example in Figure 5, where the self-attention mechanism accurately captures the referential relationships between “Jack” and “he”, generating the highest attention score. By feeding the weighted average of all word representations into a fully connected network, we obtain the representation of the given word. Such a procedure is essentially an aggregation of the information of the whole input sequence, and it will be applied to all the words to generate representations in parallel. In the decoding phase, the attention mechanism is similar to the encoding, except that it only decodes one representation from left to right at one time. And each step of the decoding phase consults the previously decoded results. For more details of Transformer, please refer to its original paper (Vaswani et al., 2017) and the survey paper (Lin et al., 2021).

Due to the prominent nature, Transformer gradually becomes a standard neural structure for natural language understanding and generation. Moreover, it also serves as the backbone neural structure for the subsequently derived PTMs. Next, we introduce two landmarks that completely open the door towards the era of large-scale self-supervised PTMs, GPT and BERT. In general, GPT is good at natural language generation, while BERT focuses more on natural language understanding.

编码阶段,对于给定的单词Transformer通过与输入序列中的其他单词进行比较计算出一个注意力分数。这样的注意力分数表明了其他每个单词在对给定单词的下一次表征中应该起到多大作用。然后,注意力分数被用作权重来计算所有单词表示的加权平均值。我们在图5中给出了一个示例,其中自注意力机制准确地捕捉了“Jack”和“he”之间的引用关系,从而产生了最高的注意力分数。过将所有单词表示的加权平均值输入到一个完全连接的网络中,我们获得了给定单词的表示。这一过程本质上是整个输入序列信息的集合,它将应用于所有的单词以并行生成表示。在解码阶段,注意力机制与编码类似,不同的是它一次只从左到右解码一种表示。并且,解码阶段的每一步都参考先前解码的结果。关于Transformer的更多细节,请参考其原始论文(Vaswani et al., 2017)和调查论文(Lin et al., 2021)。

由于其突出的特性,Transformer逐渐成为一种用于自然语言理解和生成的标准神经结构。此外,它还作为随后衍生的PTMs的骨干神经结构。接下来,我们将介绍两个完全开启大规模自监督PTMs时代大门的里程碑:GPT和BERT。一般来说,GPT擅长于自然语言生成,而BERT更侧重于自然语言理解。

3.2 GPT

As introduced in Section 2, PTMs typically consist of two phases, the pre-training phase and the fine-tuning phase. Equipped by the Transformer decoder as the backbone, GPT applies a generative pre-training and a discriminative fine-tuning. Theoretically, compared to precedents of PTMs, GPT is the first model that combines the modern Transformer architecture and the self-supervised pre-training objective. Empirically, GPT achieves significant success on almost all NLP tasks, including natural language inference, question answering, commonsense reasoning, semantic similarity and classification.

如第2节所介绍的,PTMs 通常由两个阶段组成:预训练阶段和微调阶段。GPT以Transformer解码器为骨干,应用了生成式预训练和判别式微调。从理论上讲,与以往的PTMs先例相比,GPT是第一个将现代Transformer架构和自监督的预训练目标结合起来的模型。从经验上看,GPT在几乎所有的NLP任务中都取得了显著的成功,包括自然语言推理、问答、常识推理、语义相似度和分类。

Given large-scale corpora without labels, GPT optimizes a standard autoregressive language modeling, that is, maximizing the conditional probabilities of all the words given their corresponding previous words as contexts. In the pre-training phase of GPT, the conditional probability of each word is modeled by Transformer. As shown in Figure 6, for each word, GPT computes its probability distributions by applying multi-head self-attention operations over its previous words followed by position-wise feed-forward layers.

The adaptation procedure of GPT to specific tasks is fine-tuning, by using the pre-trained parameters of GPT as a start point of downstream tasks. In the fine-tuning phase, passing the input sequence through GPT, we can obtain the representations of the final layer of the GPT Transformer. By using the representations of the final layer and task-specific labels, GPT optimizes standard objectives of downstream tasks with simple extra output layers. As GPT has hundreds of millions of parameters, it is trained for 1 month on 8 GPUs, which is fairly the first “large-scale” PTM in the history of NLP. And undoubtedly, the success of GPT pave the way for the subsequent rise of a series of large-scale PTMs. In the next part, we introduce another most representative model BERT.

预训练阶段:对于给定的无标签的大型语料库,GPT优化了一种标准的自回归语言建模,也就是说,在给定其对应的前一个词作为上下文的情况下,最大化所有单词的条件概率。在GPT的预训练阶段,Transformer对每个单词的条件概率进行建模。如图6所示,对于每个单词,GPT通过对其前面的单词应用多头自注意力操作,然后再加上位置前馈层,来计算其概率分布。
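For reference, the autoregressive objective described above can be written in its standard textbook form (a common formulation consistent with the description, not copied from the paper), where $w_1, \ldots, w_n$ is a training sequence and $\theta$ denotes the Transformer parameters:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \log P_{\theta}\left(w_i \mid w_1, w_2, \ldots, w_{i-1}\right)$$

Maximizing this sum is exactly "maximizing the conditional probabilities of all the words given their corresponding previous words as contexts".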

微调阶段:GPT特定任务的适应过程是通过使用GPT预训练的参数作为下游任务的起点进行微调。在微调阶段,通过GPT传递输入序列,我们可以得到GPT Transformer最后一层的表示。通过使用最后一层表示特定于任务的标签GPT使用简单的额外输出层优化下游任务的标准目标。由于GPT有数亿个参数,因此在8 个 GPU 上训练了1 个月,这是NLP历史上第一个“大规模”的PTM。毫无疑问,GPT的成功为随后一系列大型PTMs的兴起铺平了道路。在接下来的部分中,我们将介绍另一种最具代表性的BERT模型。

Figure 7: The pre-training and fine-tuning phases for BERT.

图7:BERT的预训练和微调阶段。

3.3 BERT

The emergence of BERT has also greatly promoted the development of the PTM field. Theoretically, compared with GPT, BERT uses a bidirectional deep Transformer as the main structure. There are also two separate stages to adapt BERT for specific tasks, pre-training and fine-tuning (see Figure 7).

In the pre-training phase, BERT applies autoencoding language modeling rather than autoregressive language modeling used in GPT. More specifically, inspired by cloze (Taylor, 1953), the objective masked language modeling (MLM) is designed. As shown in Figure 6, in the procedure of MLM, tokens are randomly masked with a special token [MASK], the objective is to predict words at the masked positions with contexts. Compared with standard unidirectional autoregressive language modeling, MLM can lead to a deep bidirectional representation of all tokens.

Besides MLM, the objective of next sentence prediction (NSP) is adopted to capture discourse relationships between sentences for some downstream tasks with multiple sentences, such as natural language inference and question answering. For this task, a binary classifier is used to predict whether two sentences are coherent. In the pre-training phase, MLM and NSP work together to optimize the parameters of BERT.

BERT的出现也极大地促进了PTM领域的发展。理论上,与GPT相比,BERT采用了双向深度Transformer结构作为主体结构。还有两个单独的阶段可以使BERT适应特定的任务,即预训练微调(参见图7)。

在预训练阶段,BERT使用的是自编码语言建模,而不是GPT中使用的自回归语言建模。更具体地说,受完形填空(Taylor, 1953)的启发,设计了掩码语言建模(MLM)这一目标。如图6所示,在MLM过程中,token 被一个特殊的 token [MASK] 随机掩码,目标是根据上下文预测被掩码位置上的单词。与标准的单向自回归语言建模相比,MLM可以得到所有token的深度双向表示。
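A minimal PyTorch sketch of the MLM corruption step described above: a random subset of token positions is replaced with the [MASK] id, and the loss is later computed only at those positions. The masking ratio, function name, and the use of -100 as an ignore label are illustrative choices (the original BERT recipe masks about 15% of tokens and sometimes keeps or randomly replaces them instead of always using [MASK]); this is not the authors' code.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Corrupt inputs for masked language modeling (simplified).

    Returns (corrupted_inputs, labels); non-masked positions in labels are
    set to -100 so a cross-entropy loss only scores the masked positions.
    """
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob  # choose positions to mask
    labels[~masked] = -100                            # ignore non-masked positions in the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id                 # replace selected tokens with the [MASK] id
    return corrupted, labels

# Toy example with a hypothetical vocabulary where id 103 stands for [MASK].
ids = torch.randint(1000, 2000, (1, 12))
inputs, labels = mask_tokens(ids, mask_token_id=103)
```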

除了 MLM,还采用下一句预测(NSP)的目标来捕获句子之间的话语关系,用于一些具有多个句子的下游任务,例如自然语言推理和问答。对于此任务,使用二元分类器预测两个句子是否连贯。在预训练阶段,MLM和NSP协同工作以优化BERT的参数。

After pre-training, BERT can obtain robust parameters for downstream tasks. By modifying inputs and outputs with the data of downstream tasks, BERT could be fine-tuned for any NLP tasks. BERT could effectively handle those applications with the input of a single sentence or sentence pairs. For the input, its schema is two sentences concatenated with the special token [SEP], which could represent:

(1) sentence pairs in paraphrase,

(2) hypothesis-premise pairs in entailment,

(3) question-passage pairs in question answering, and

(4) a single sentence for text classification or sequence tagging.

For the output, BERT will produce a token-level representation for each token, which can be used to handle sequence tagging or question answering, and the special token [CLS] can be fed into an extra layer for classification. After GPT, BERT has further achieved significant improvements on 17 different NLP tasks, including SQuAD (better than human performance), GLUE (7.7% point absolute improvements), MNLI (4.6% point absolute improvements), etc.

通过预训练,BERT可以获得稳健的下游任务参数。通过使用下游任务的数据修改输入和输出,BERT可以对任何NLP任务进行微调BERT能够有效地处理那些只输入一个句子或句子对的应用程序。对于输入,其模式是两个与特殊标记 [SEP] 连接的句子,可以表示:

(1)释义(paraphrase)任务中的句子对,

(2)蕴含中的假设-前提对,

(3)问答中的问题-段落对,

(4)用于文本分类或序列标注的单句。

对于输出,BERT将为每个token生成一个token级别的表示,它可以用于处理序列标记或问题回答,并且特殊的token[CLS]可以被输入到额外的层进行分类。在GPT之后,BERT17个不同的NLP任务上进一步取得了显著的改进,包括SQuAD(优于人类表现)、GLUE(7.7%的绝对改进)、MNLI(4.6%的绝对改进)等。
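The input schema above can be illustrated with a short, hedged usage sketch based on the HuggingFace transformers library (a common implementation, not part of the paper): the tokenizer inserts [CLS] and [SEP] automatically for a sentence pair, and a classification layer on top of [CLS] is then fine-tuned with task-specific labels. The checkpoint name and the example sentences are illustrative.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A sentence pair becomes "[CLS] sentence A [SEP] sentence B [SEP]".
inputs = tokenizer("A man is playing a guitar.",   # e.g., premise
                   "Someone is making music.",     # e.g., hypothesis
                   return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))    # shows the [CLS]/[SEP] placement

# The [CLS] representation feeds an extra classification layer; fine-tuning would
# update this head together with the pre-trained BERT parameters on labeled pairs.
with torch.no_grad():
    logits = model(**inputs).logits                # shape (1, 2); the head is untrained here
```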

3.4 After GPT and BERT

After GPT and BERT, some of their improvements have been proposed, such as RoBERTa and ALBERT. RoBERTa (Liu et al., 2020d) is one of the success variants of BERT, which mainly has four simple and effective changes:

(1) Removing the NSP task;

(2) More training steps, with bigger batch size and more data;

(3) Longer training sentences;

(4) Dynamically changing the [MASK] pattern.

RoBERTa achieves impressive empirical results on the basis of BERT. Moreover, RoBERTa has pointed out that the NSP task is relatively useless for the training of BERT. ALBERT (Lan et al., 2019) is another important variant of BERT, which provides several interesting observations on reducing parameters. First, it factorizes the input word embedding matrix into two smaller ones. Second, it enforces parameter-sharing between all Transformer layers to significantly reduce parameters. Third, it proposes the sentence order prediction (SOP) task to substitute BERT’s NSP task. As a sacrifice to its space efficiency, ALBERT has a slower fine-tuning and inference speed.

GPTBERT之后,也有人提出了一些改进方案,如RoBERTaALBERTRoBERTa(Liu et al., 2020d)是BERT成功变体之一,它主要有四个简单而有效的改变:

(1)去除NSP任务;

(2)更多的训练步骤,更大的batch size和更多的数据;

(3)较长的训练句;

(4)动态改变[MASK]模式。

RoBERTaBERT的基础上取得了令人印象深刻的实证结果。此外,RoBERTa还指出,NSP任务对于BERT的训练相对来说用处不大ALBERT(Lan et al., 2019)是BERT的另一个重要变体,它提供了一些关于减少参数的有趣观察。首先,它将输入的词嵌入矩阵分解为两个较小的矩阵。其次,它强制所有Transformer层之间的参数共享,以显著减少参数。第三,提出了句子顺序预测(SOP)任务来替代BERT的NSP任务。作为对其空间效率的牺牲ALBERT具有较慢的微调推理速度

As shown in Figure 8, besides RoBERTa and ALBERT, there are various PTMs being proposed in recent years towards better capturing knowledge from unlabeled data. Some work improves the model architectures and explores novel pre-training tasks, such as XLNet (Yang et al., 2019), UniLM (Dong et al., 2019), MASS (Song et al., 2019), SpanBERT (Joshi et al., 2020) and ELECTRA (Clark et al., 2020). Besides, incorporating rich data sources is also an important direction, such as utilizing multilingual corpora, knowledge graphs, and images. Since the model scale is a crucial success factor of PTMs, researchers also explore to build larger models to reach over hundreds of billions of parameters, such as the series of GPT (Radford et al., 2019; Brown et al., 2020), Switch Transformer (Fedus et al., 2021), and meanwhile conduct computational efficiency optimization for training PTMs (Shoeybi et al., 2019; Rajbhandari et al., 2020; Ren et al., 2021). In the following sections, we will further introduce all these efforts for PTMs in detail.

如图8所示,除了RoBERTa和ALBERT之外,最近几年还提出了各种各样的PTMs,以更好地从未标记的数据中捕获知识。一些工作改进了模型架构并探索了新的预训练任务,如XLNet (Yang et al., 2019)、UniLM (Dong et al., 2019)、MASS (Song et al., 2019)、SpanBERT (Joshi et al., 2020)和ELECTRA (Clark et al., 2020)。此外,整合丰富的数据源也是一个重要的方向,如利用多语言语料库、知识图谱和图像。由于模型规模是PTMs成功的一个关键因素,研究人员还探索构建更大的模型,以达到数千亿参数,如GPT系列(Radford等人,2019;Brown et al., 2020)、Switch Transformer (Fedus et al., 2021),同时对训练PTMs进行计算效率优化(Shoeybi et al., 2019;Rajbhandari等人,2020年;Ren等,2021)。在下面的小节中,我们将进一步详细介绍所有这些针对PTMs的努力。

4 Designing Effective Architecture设计有效的架构

In this section, we dive into the after-BERT PTMs deeper. The success of Transformer-based PTMs has stimulated a stream of novel architectures for modeling sequences for natural language and beyond. Generally, all the after-BERT Transformer architectures for language pre-training could be categorized according to two motivations: toward unified sequence modeling and cognitive-inspired architectures. Besides, we also take a glimpse over other important BERT variants in the third subsection, which mostly focus on improving natural language understanding.

在本节中,我们将更深入地探讨BERT之后的PTMs。基于Transformer的PTMs的成功激发了一系列为自然语言及其他领域的序列建模的新颖架构。一般来说,所有用于语言预训练的BERT之后的Transformer架构都可以根据两个动机进行分类:统一序列建模和认知启发式架构。此外,我们还在第三小节中对其他重要的BERT变体进行了简要介绍,这些BERT变体主要侧重于提高自然语言的理解能力。

4.1 Unified Sequence Modeling统一序列建模

Why is NLP so challenging? One of the fundamental reasons is that it has versatile downstream tasks and applications, which could be generally categorized into three genres:

(1)、Natural language understanding: includes grammatical analysis, syntactic analysis, word/sentence/paragraph classification, question answering, factual/commonsense knowledge inference, etc.

(2)、Open-ended language generation: includes dialog generation, story generation, data-to-text generation, etc.

(3)、Non-open-ended language generation: includes machine translation, abstract summarizing, blank filling, etc.

为什么NLP如此具有挑战性?一个根本的原因是它有多样的下游任务和应用,可以大致分为三种类型:

(1)、自然语言理解:包括语法分析、句法分析、单词/句子/段落分类、问答、事实/常识知识推理等。

(2)、开放式语言生成:包括对话生成、故事生成、数据转文本生成等。

(3)、非开放式语言生成:包括机器翻译、摘要总结、填空等。

Nevertheless, the differences between them are not so significant. As Feynman’s saying goes, “What I cannot create, I do not understand”. On one hand, a model that can not understand must not fluently generate; on the other hand, we can easily turn understanding tasks into generation tasks (Schick and Schütze, 2020). Recent studies also show that GPTs can achieve similar and even better performance on understanding benchmarks than BERTs (Liu et al., 2021b). The boundary between understanding and generation is vague.

Based on the observation, a bunch of novel architectures has been seeking for unifying different types of language tasks with one PTM. We will take a look over its development and discuss the inspirations they bring towards a unified foundation of natural language processing.

Combining Autoregressive and Autoencoding Modeling. The pioneer work to unify GPT-style unidirectional generation and BERT-style bidirectional understanding is XLNet (Yang et al., 2019), which proposes the permutated language modeling. The masked-recover strategy in BERT naturally contradicts with its downstream application, where there is no [MASK] in input sentences. XLNet solves the problem by permutating tokens’ order in the pre-training and then applying the autoregressive prediction paradigm, which endows XLNet with the ability for both understanding and generation. An important follower of permutation language modeling is MPNet (Song et al., 2020), which amends the XLNet’s discrepancy that in pre-training XLNet does not know the sentence’s length while in downstream it knows.

Besides permutated language modeling, another stream would be multi-task training. UniLM (Dong et al., 2019) proposes to jointly train different language modeling objectives together, including unidirectional, bidirectional, and sequence-to-sequence (seq2seq) objectives. This can be achieved by changing the attention masks in Transformers. UniLM performs quite well in generative question answering and abstract summarization.

然而,它们之间的差异并不那么显著。正如费曼所说:“我不能创造的,我就不能理解。”一方面,不能理解的模型不能流畅地生成;另一方面,我们可以很容易地将理解任务转化为生成任务(Schick和Schütze, 2020)。最近的研究还表明,与BERTs相比,GPTs可以在理解基准上取得类似甚至更好的性能(Liu等人,2021b)。理解和生成之间的界限是模糊的

基于这种观察,一堆新颖的架构一直在寻求用一个 PTM统一不同类型的语言任务。我们将回顾它的发展,并讨论它们给自然语言处理的统一基础带来的启示。

结合自回归和自编码建模。将GPT风格的单向生成和BERT风格的双向理解统一起来的先锋工作是XLNet (Yang等人,2019),它提出了置换语言建模。BERT中的masked-recover策略自然与其下游应用相矛盾,后者在输入句中没有[MASK]。XLNet通过在预训练中打乱token的顺序,然后应用自回归预测范式,解决了这一问题,使XLNet具备了既理解又生成的能力。置换语言建模的一个重要追随者是MPNet (Song et al., 2020),它修正了XLNet的不一致之处:XLNet在预训练时不知道句子的长度,而在下游任务中却知道。

除了置换语言建模,另一个流派是多任务训练。UniLM (Dong et al., 2019)提出联合训练不同的语言建模目标,包括单向、双向和序列到序列(seq2seq)目标。这可以通过改变Transformer中的注意力掩码来实现。UniLM在生成式问答和抽象式摘要方面表现得很好。
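The remark that these objectives "can be achieved by changing the attention masks" can be made concrete with a small sketch: the same Transformer behaves as a bidirectional encoder, a unidirectional decoder, or a seq2seq model depending only on which positions each token may attend to. The mask construction below is an illustrative simplification of UniLM's scheme, not its actual implementation.

```python
import torch

def attention_mask(seq_len, mode, src_len=None):
    """Return a (seq_len, seq_len) boolean mask; True means 'may attend to'."""
    if mode == "bidirectional":          # BERT-style: every token sees every token
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    if mode == "unidirectional":         # GPT-style: each token sees only its left context
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if mode == "seq2seq":                # source tokens see the whole source;
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
        mask[:src_len, :src_len] = True  # target tokens see the source plus their left context
        return mask
    raise ValueError(f"unknown mode: {mode}")

print(attention_mask(5, "seq2seq", src_len=2).int())
```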

Recently, GLM (Du et al., 2021) proposes a more elegant approach for combining autoregressive and autoencoding. Given a variable-length masked span, instead of providing the number of [MASK] to model as BERT and SpanBERT (Joshi et al., 2020) do, GLM asks Transformer blocks to autoregressively generate the masked tokens. And to preserve the information of [MASK]s’ number, GLM proposes a 2D positional encoding strategy. GLM is the first model to achieve the best performance on all types of tasks including natural language understanding, conditional generation, and unconditional generation at the same time.

Applying Generalized Encoder-Decoder. Before GLM, both encoder structure (e.g., BERT) or decoder structure (e.g., GPT) can not solve an important problem: to fill in blanks with variable lengths (Du et al., 2021; Shen et al., 2020b). The decoder-based models can not make it because they can only generate at the end of the sequence and neither the encoder-based models because the number of [MASK]s will leak information. A natural idea is to turn to encoder-decoder architecture originally designed for machine translation, which would produce variable lengths of target sequences conditioned on the sources.

最近,GLM (Du et al., 2021)提出了一种更优雅的方法来结合自回归和自编码。给定一个可变长度的掩码跨度,GLM 要求Transformer块自回归地生成被掩码的token,而不是像BERT和SpanBERT (Joshi et al., 2020)那样向模型提供[MASK]的数量。为了保留[MASK]数量的信息,GLM 提出了一种2D位置编码策略。GLM是第一个同时在所有类型的任务(包括自然语言理解、条件生成和无条件生成)上都能达到最佳性能的模型。

应用广义编码器-解码器(Encoder-Decoder)。在GLM之前,无论是编码器结构(如BERT),还是解码器结构(如GPT)都不能解决一个重要问题:用可变长度填充空白(Du等人,2021;沈等,2020b)。基于解码器的模型不能做到这一点,因为它们只能在序列的末尾生成,而基于编码器的模型也不能做到这一点,因为[MASK]的数量会泄露信息。一个自然的想法是转向最初为机器翻译设计的编码器-解码器架构,它将根据源产生可变长度的目标序列

The pioneer of this genre is MASS (Song et al., 2019), which introduces the masked-prediction strategy into the encoder-decoder structure. However, MASS does not touch the problem of filling variable-length blanks. T5 (Raffel et al., 2020) solves the problem by masking a variable-length of span in text with only one mask token and asks the decoder to recover the whole masked sequence. BART (Lewis et al., 2020a) introduces the interesting idea of corrupting the source sequence with multiple operations such as truncation, deletion, replacement, shuffling, and masking, instead of mere masking. There are following works that specify in typical seq2seq tasks, such as PEGASUS (Zhang et al., 2020a) and PALM (Bi et al., 2020).

However, several challenges lie in front of encoder-decoder architectures. First, the encoder-decoder introduces much more parameters compared to a single encoder/decoder. Although this problem could be alleviated by parameter-sharing of the encoder and decoder, its parameter-efficiency is still doubtful. Second, encoder-decoder structures generally do not perform very well on natural language understanding. Despite reported improvements over similar-sized vanilla BERT, well-trained RoBERTa or GLM encoder performs much better than them.

这一流派的先驱是MASS (Song et al., 2019),它将掩码预测策略引入到编码器-解码器结构中。然而,MASS 并没有涉及填充可变长度空白的问题。T5 (Raffel et al., 2020)通过在文本中仅用一个掩码token来掩码一个可变长度的跨度,并要求解码器恢复整个被掩码的序列,从而解决了这个问题。BART (Lewis等人,2020a)引入了一种有趣的思想,即通过截断、删除、替换、打乱和掩码等多种操作来破坏源序列,而不仅仅是掩码。还有一些后续工作专门针对典型的seq2seq任务,如PEGASUS (Zhang et al., 2020a)和PALM (Bi et al., 2020)。
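To illustrate how T5 masks a variable-length span with a single sentinel token and lets the decoder recover it, here is a toy sketch; the sentinel naming and the helper function are illustrative, not T5's actual preprocessing code (which, among other details, also appends a final sentinel to the target).

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with one sentinel token; build the seq2seq target."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans))]
    source, target, prev = [], [], 0
    for sent, (start, end) in zip(sentinels, spans):
        source += tokens[prev:start] + [sent]  # one sentinel per masked span, whatever its length
        target += [sent] + tokens[start:end]   # the decoder must regenerate the hidden span
        prev = end
    source += tokens[prev:]
    return source, target

toks = "pre trained models capture rich knowledge from text".split()
src, tgt = span_corrupt(toks, [(1, 3), (5, 6)])
print(src)  # ['pre', '<extra_id_0>', 'capture', 'rich', '<extra_id_1>', 'from', 'text']
print(tgt)  # ['<extra_id_0>', 'trained', 'models', '<extra_id_1>', 'knowledge']
```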

然而,编码器-解码器架构面前仍存在几个挑战。首先,与单个编码器/解码器相比,编码器-解码器引入了更多的参数。虽然这个问题可以通过编码器和解码器的参数共享来缓解,但其参数效率仍然值得怀疑。第二,编码器-解码器结构通常在自然语言理解方面表现得不太好。尽管有报道称相比同等规模的原始BERT有所改进,但训练充分的RoBERTa或GLM编码器的性能要比它们好得多。
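
As an illustration of how T5 fills a variable-length blank with a single mask token, here is a minimal sketch of span corruption; whitespace tokenization and the single sentinel "<extra_id_0>" are simplifying assumptions, not T5's real preprocessing.

```python
# A minimal sketch of T5-style span corruption on a whitespace-tokenized sentence.
import random

def span_corrupt(tokens, span_len_range=(2, 5), sentinel="<extra_id_0>"):
    """Replace one random variable-length span with a sentinel.
    Returns (encoder_input, decoder_target)."""
    span_len = random.randint(*span_len_range)
    span_len = min(span_len, len(tokens) - 1)
    start = random.randrange(0, len(tokens) - span_len + 1)
    encoder_input = tokens[:start] + [sentinel] + tokens[start + span_len:]
    decoder_target = [sentinel] + tokens[start:start + span_len]
    return encoder_input, decoder_target

tokens = "pre trained models capture knowledge from large corpora".split()
enc, dec = span_corrupt(tokens)
print(enc)   # e.g. ['pre', 'trained', '<extra_id_0>', 'from', 'large', 'corpora']
print(dec)   # e.g. ['<extra_id_0>', 'models', 'capture', 'knowledge']
```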

Table 1: Three fundamental types of framework and their suitable downstream tasks. “NLU” refers to natural language understanding. “Cond. Gen.” and “Uncond. Gen.” refer to conditional and unconditional text generation, respectively. “✓” means “is good at”, “—” means “could be adapted to”, and “✗” means “cannot be directly applied to”. We define unconditional generation as the task of generating text without further training as in a standard language model, while conditional generation refers to seq2seq tasks such as text summarization. Taken from (Du et al., 2021).

表1:三种基本的框架类型和它们适合的下游任务。"NLU"指的是自然语言理解。"Cond. Gen."和"Uncond. Gen."分别指有条件和无条件的文本生成。"✓"表示"擅长","—"表示"可以适应","✗"表示"不能直接适用"。我们将无条件生成定义为像标准语言模型那样无需进一步训练即可生成文本的任务,而条件生成则指文本摘要等seq2seq任务。引自(Du et al., 2021)。

4.2 Cognitive-Inspired Architectures架构

Figure 8: The family of recent typical PTMs, including both pre-trained language models and multimodal models.

图8:最近典型的PTMs家族,包括预训练的语言模型多模态模型

Is the current Transformer a good enough imple-mentation of human beings’ cognitive system? Of course not. Attention mechanism, the core module in Transformer architecture, is inspired by the micro and atom operation of the human’s cogni-tive system and only responsible for the perceptive function. However, human-level intelligence is far more complex than the mere understanding of the association between different things.

In pursuit for human-level intelligence, under-standing the macro architecture of our cogni-tive functions including decision making, logical reasoning, counterfactual reasoning and working memory (Baddeley, 1992) is crucial. In this subsec-tion, we will take a look over the novel attempts in-spired by advances of cognitive science, especially on maintainable working memory and sustainable long-term memory.

现在的Transformer是对人类认知系统足够好的实现吗?当然不是。Transformer架构的核心模块——注意力机制,其灵感来源于人类认知系统的微观和原子操作,只负责感知功能。然而,人类水平的智能远比仅仅理解不同事物之间的关联复杂得多。

在追求人类水平的智能时,了解我们的认知功能的宏观架构,包括决策、逻辑推理、反事实推理和工作记忆(Baddeley, 1992)是至关重要的。在这个小节中,我们将回顾认知科学的进步启发的新尝试,特别是在可维持工作记忆可持续长期记忆方面。

Maintainable Working Memory. A natural problem of Transformer is its fixed window size and quadratic space complexity, which significantly hinders its applications in long document under-standing.

Despite the bunch of modifications on approx-imate computing of the quadratic growing point-wise attention (Tay et al., 2020), a question is that we humans do not present such a long-range at-tention mechanism. As an alternative, cognitive scientists have revealed that humans could main-tain a working memory (Baddeley, 1992; Brown, 1958; Barrouillet et al., 2004; Wharton et al., 1994),which not only memorizes and organizes but also forgets. The conventional long-short term memory network is an exemplar practice for such a philoso-phy.

For Transformer-based architecture, the Transformer-XL (Dai et al., 2019) is the first to introduce segment-level recurrence and relative positional encoding to fulfill this goal. How-ever, the recurrence only implicitly models the working memory. As a more explicit solution, CogQA (Ding et al., 2019) proposes to maintain a cognitive graph in the multi-hop reading. It is composed of two systems: the System 1 based on PTMs and the System 2 based on GNNs to model the cognitive graph for multi-hop understanding.

A limitation of CogQA is that its use of the Sys-tem 1 is still based on fixed window size. To endow working memory with the ability to understand long documents, CogLTX (Ding et al., 2020) lever-ages a MemRecall language model to select sen-tences that should be maintained in the working memory and another model for answering or clas-sification.

可维持工作记忆。Transformer的一个固有问题是其固定的窗口大小和二次空间复杂度,这严重阻碍了其在长文档理解中的应用。

尽管已有大量工作对二次增长的逐点注意力进行近似计算方面的修改(Tay等人,2020),但问题在于,我们人类并没有表现出这种长程的注意力机制。作为一种替代,认知科学家已经揭示,人类能够维持一种工作记忆(Baddeley, 1992; Brown, 1958; Barrouillet et al., 2004; Wharton et al., 1994),它不仅能记忆和组织信息,还会遗忘。传统的长短期记忆网络就是这种理念的一个范例。

对于基于Transformer的架构,Transformer-XL (Dai等人,2019)是第一个引入分段级递归和相对位置编码来实现这一目标的工作。然而,这种递归只是隐式地模拟了工作记忆。作为一种更显式的解决方案,CogQA (Ding et al., 2019)提出在多跳阅读中维护认知图。它由两个系统组成:基于PTMs的系统1和基于GNN的系统2,后者对认知图进行建模以实现多跳理解。

CogQA的一个限制是它对系统1的使用仍然基于固定的窗口大小。为了赋予工作记忆理解长文档的能力,CogLTX (Ding et al., 2020)利用MemRecall语言模型来选择应该保存在工作记忆中的句子,并使用另一个模型来回答或分类。
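
The segment-level recurrence mentioned above can be sketched as caching the previous segment's hidden states and attending over them together with the current segment; the single-head attention and shapes below are simplifications for illustration, not the original Transformer-XL implementation.

```python
# A minimal sketch of segment-level recurrence in the spirit of Transformer-XL.
import torch
import torch.nn.functional as F

def attend_with_memory(h, memory, w_q, w_k, w_v):
    """h: (seg_len, d) current segment; memory: (mem_len, d) cached states."""
    context = torch.cat([memory, h], dim=0)        # keys/values span memory + current segment
    q, k, v = h @ w_q, context @ w_k, context @ w_v
    scores = q @ k.t() / k.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v           # (seg_len, d)

d, seg_len = 16, 8
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
memory = torch.zeros(0, d)                         # empty memory for the first segment
for segment in torch.randn(3, seg_len, d):         # three consecutive segments
    out = attend_with_memory(segment, memory, w_q, w_k, w_v)
    memory = segment.detach()                      # cache the segment's states, no gradient
```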

Sustainable Long-Term Memory. The success of GPT-3 (Brown et al., 2020) and recent studies on language models’ ability in recalling factual knowledge (Petroni et al., 2019; Wang et al., 2020a; Liu et al., 2021b) have revealed the fact that Transformers can memorize. But how do Transformers achieve this?

In Lample et al. (2019), the authors provide some inspiring evidence on how Transformers memorize. They replace the feed-forward networks in a Transformer layer with large key-value memory networks, and find it to work pretty well. This somehow proves that the feed-forward networks in Transformers are equivalent to memory networks.

Nevertheless, the memory capacity in Transformers is quite limited. For human intelligence, besides working memory for deciding and reasoning, the long-term memory also plays a key role in recall-ing facts and experiences. REALM (Guu et al., 2020) is a pioneer to explore how to construct a sustainable external memory for Transformers. The authors tensorize the whole Wikipedia sentence by sentence, and retrieve relevant sentences as context for masked pre-training. The tensorized Wikipedia is asynchronously updated for a given number of training steps. RAG (Lewis et al., 2020b) extends the masked pre-training to autoregressive generation, which could be better than extractive question answering.

Besides tensorizing the text corpora, (Verga et al., 2020; Févry et al., 2020) propose to tensorize entities and triples in existing knowledge bases. When entities appear in contexts, they replace en-tity tokens’ embedding in an internal Transformer layer with the embedding from outer memory net-works. (Dhingra et al., 2020; Sun et al., 2021) maintain a virtual knowledge from scratch, and propose a differentiable reasoning training objective over it. All of these methods achieve promising improvement on many open-domain question answering benchmarks.

可持续的长期记忆。GPT-3的成功(Brown et al., 2020)以及最近关于语言模型回忆事实知识能力的研究(Petroni et al., 2019; Wang et al., 2020a; Liu等人,2021b)揭示了Transformer能够记忆的事实。但是Transformer是怎么做到的呢?

在Lample等人(2019)中,作者就Transformer如何记忆提供了一些富有启发性的证据。他们用大型键值记忆网络替换了Transformer层中的前馈网络,并且发现它工作得非常好。这在某种程度上证明了Transformer中的前馈网络与记忆网络是等价的。

然而,Transformer的记忆容量是相当有限的。对于人类智能而言,除了用于决策和推理的工作记忆外,长期记忆在回忆事实和经验方面也起着关键作用。REALM (Guu等人,2020)是探索如何为Transformer构建可持续外部记忆的先驱。作者将整个维基百科逐句张量化(tensorize),并检索相关的句子作为掩码预训练的上下文。张量化的维基百科每隔给定的训练步数就会异步更新。RAG (Lewis et al., 2020b)将掩码预训练扩展到自回归生成,这可能比抽取式问答更好。

除了对文本语料库进行张量化外,(Verga等人,2020;Févry等人,2020)提出对现有知识库中的实体和三元组进行张量化。当实体出现在上下文中时,它们会用来自外部记忆网络的嵌入替换内部Transformer层中实体token的嵌入。(Dhingra et al., 2020; Sun et al., 2021)从零开始维护一个虚拟知识,并在此基础上提出一个可微分的推理训练目标。所有这些方法在许多开放域问答基准上都取得了可观的改进。

4.3 More Variants of Existing PTMs现有PTMs的更多变体

Besides the practice to unify sequence modeling and construct cognitive-inspired architectures, most current studies focus on optimizing BERT’s architecture to boost language models’ performance on natural language understanding.

A stream of work aims at improving the masking strategy, which could be regarded as a certain kind of data augmentation (Gu et al., 2020). SpanBERT (Joshi et al., 2020) shows that masking a continuous random-length span of tokens with a span boundary objective (SBO) could improve BERT’s performance. Similar ideas have also been explored in ERNIE (Sun et al., 2019b,c) (where a whole entity is masked), NEZHA (Wei et al., 2019), and Whole Word Masking (Cui et al., 2019).

除了统一序列建模和构建认知启发架构的实践外,目前大多数研究集中在优化BERT的架构,以提高语言模型在自然语言理解方面的性能。

一系列工作旨在改进掩码策略,这可以被视为一种数据增强(Gu et al., 2020)。SpanBERT (Joshi et al., 2020)表明,用跨度边界目标(SBO)掩码一段连续的随机长度的token跨度可以提高BERT的性能。ERNIE (Sun et al., 2019b,c)(其中整个实体被掩码)、NEZHA (Wei et al., 2019)和Whole Word Masking (Cui et al., 2019)也探索了类似的想法。

Another interesting practice is to change the masked-prediction objective to a harder one. ELECTRA (Clark et al., 2020) transforms MLM into a replaced token detection (RTD) objective, in which a generator replaces tokens in original sequences and a discriminator predicts whether a token has been replaced.

另一个有趣的实践是将掩码预测目标更改为一个更难的目标。ELECTRA (Clark et al., 2020)将MLM转换为替换token检测(RTD)目标:生成器会替换原始序列中的token,而判别器则预测某个token是否被替换过。
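
A minimal sketch of the replaced token detection objective is given below; `generator` and `discriminator` are assumed user-provided modules with hypothetical signatures, and greedy filling is used instead of sampling for brevity.

```python
# A minimal sketch of an ELECTRA-style replaced token detection loss.
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, masked_ids, mask_positions):
    # 1) The generator predicts tokens at the masked positions (its MLM loss is omitted here).
    gen_logits = generator(masked_ids)                     # (batch, seq, vocab), assumed shape
    sampled = gen_logits.argmax(dim=-1)                    # greedy fill-in for brevity
    corrupted = input_ids.clone()
    corrupted[mask_positions] = sampled[mask_positions]    # mask_positions: boolean (batch, seq)

    # 2) The discriminator labels every token: was it replaced by the generator?
    labels = (corrupted != input_ids).float()              # 1 = replaced, 0 = original
    disc_logits = discriminator(corrupted).squeeze(-1)     # (batch, seq), assumed shape
    return F.binary_cross_entropy_with_logits(disc_logits, labels)
```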

5 Utilizing Multi-Source Data利用多源数据

In this section, we introduce some typical PTMs that take advantage of multi-source heterogeneous data, including multilingual PTMs, multimodal PTMs, and knowledge-enhanced PTMs.

在本节中,我们将介绍一些利用多源异构数据的典型PTM,包括多语言PTM多模态PTM知识增强的PTM

5.1 Multilingual Pre-Training多语言预训练

Language models trained on large-scale English corpora have achieved great success in many bench-marks. However, we live in a multilingual world, and training a large language model for each lan-guage is not an elegant solution because of the cost and the amount of data required. In fact, although people from all over the world use different languages, they can express the same meaning. This may indicate that semantics is independent of symbol systems. Additionally, some researchers found that they could get even better performance on benchmarks when training one model with several languages comparing with training several mono-lingual models (Lample and Conneau, 2019; Huang et al., 2020b). Hence, training one model to learn multilingual representations rather than monolin-gual representations may be a better way.

Before BERT, some researchers have explored multilingual representations. There are mainly two ways to learn multilingual representations. One way is to learn through parameter sharing. For ex-ample, training multilingual LSTMs with several language pairs together achieves multilingual trans-lation. Another way is to learn language-agnostic constraints, such as decoupling language representations into language-specific and language-agnostic representations utilizing the WGAN (Ar-jovsky et al., 2017) framework. Both of these two ways enable models to be applied to multilingual scenarios, but only for specific tasks. The model in each of them is trained with one specific task from beginning to end, and cross-lingual knowledge can-not be generalized to other tasks. Hence, for any other multilingual tasks, training new models from scratch is still required. Learning new models from scratch needs a large volume of task-specific data.

基于大规模英语语料库训练的语言模型在许多基准测试中取得了巨大的成功。然而,我们生活在一个多语言的世界,由于成本和所需数据量的原因,为每种语言训练一个大型语言模型并不是一个优雅的解决方案。事实上,尽管来自世界各地的人们使用不同的语言,但他们可以表达相同的意思。这可能表明语义是独立于符号系统的。此外,一些研究人员发现,与训练多个单语模型相比,用几种语言训练一个模型可以在基准测试中获得更好的性能(Lample and Conneau, 2019; Huang et al., 2020b)。因此,训练一个模型来学习多语言表征而不是单语表征,可能是一个更好的方法。

BERT之前,一些研究人员已经探索了多语言表征。学习多语言表征的方法主要有两种:一种方法是通过参数共享来学习。例如,用几个语言对一起训练多语言LSTM,就可以实现多语言翻译。另一种方法是学习语言不可知的约束,例如使用WGAN (Ar-jovsky et al., 2017)框架将语言表示解耦为特定的语言和与语言无关的表示。这两种方法都可以将模型应用于多语言场景,但仅适用于特定的任务。它们中的模型从头到尾都是用一个特定的任务来训练的,跨语言知识不能推广到其他任务中。因此,对于任何其他多语言任务,仍然需要从头开始训练新的模型。从头开始学习新模型需要大量特定于任务的数据。

The appearance of BERT shows that the frame-work of pre-training with general self-supervised tasks and then fine-tuning on specific downstream tasks is feasible. This motivates researchers to design tasks to pre-train versatile multilingual mod-els. Multilingual tasks could be divided into un-derstanding tasks and generation tasks according to task objectives. Understanding tasks focus on sentence-level or word-level classification, and are of help for downstream classification tasks such as natural language inference (Conneau et al., 2018b). Generation tasks focus on sentence generation, and are crucial in downstream generation tasks such as machine translation.

Some understanding tasks are first used to pre-train multilingual PTMs on non-parallel multilin-gual corpora. For example, multilingual BERT (mBERT) released by Devlin et al. (2019) is pre-trained with the multilingual masked language modeling (MMLM) task using non-parallel multi-lingual Wikipedia corpora in 104 languages. The research conducted by Pires et al. (2019) shows that mBERT has the ability to generalize cross-lingual knowledge in zero-shot scenarios. This in-dicates that even with the same structure of BERT, using multilingual data can enable the model to learn cross-lingual representations. XLM-R (Con-neau et al., 2020) builds a non-parallel multilingual dataset called CC-100, which supports 100 lan-guages. The scale of CC-100 is much larger than the Wikipedia corpora used by mBERT, especially for those low-resource languages. XLM-R is pre-trained with MMLM as the only task on CC-100 and gets better performance on several benchmarks than mBERT, which indicates that a larger scale of multilingual corpora can bring better performance.

BERT的出现表明,先用通用的自监督任务进行预训练、再在特定下游任务上进行微调的框架是可行的。这促使研究人员设计任务来预训练通用的多语言模型。根据任务目标,多语言任务可以分为理解任务和生成任务。理解任务侧重于句子级别或单词级别的分类,有助于下游的分类任务,如自然语言推理(Conneau等,2018b)。生成任务专注于句子生成,对于机器翻译等下游生成任务至关重要。

一些理解任务首先被用于非并行多语言语料库上的多语PTMs的预训练。例如,Devlin等人(2019)发布的多语言BERT(mBERT)使用104种语言的非并行多语言维基百科语料库,通过多语言掩码语言建模(MMLM)任务进行预训练。Pires等人(2019)的研究表明,mBERT具有在零样本场景下概括跨语言知识的能力。这表明,即使使用相同的BERT结构,使用多语言数据也可以使模型学习跨语言表示。XLM-R (Con-neau等人,2020)构建了一个称为CC-100的非并行多语言数据集,它支持100种语言。CC-100的规模比mBERT使用的Wikipedia语料库要大得多,特别是对于那些资源较低的语言。XLM-R在CC-100上以MMLM作为唯一任务进行预训练,在多个基准测试上的性能都优于mBERT,说明多语言语料规模越大性能越好

However, the MMLM task cannot well utilize parallel corpora. In fact, parallel corpora are quite important for some NLP tasks such as machine translation. Intuitively, parallel corpora are very helpful to directly learn cross-lingual representa-tions for those sentences in different languages with the same meanings. From this point, XLM (Lample and Conneau, 2019) leverages bilingual sentence pairs to perform the translation language modeling (TLM) task. Similar to MLM in BERT, TLM combines two semantically matched sentences into one and randomly masks tokens in both parts. Compared with MLM, TLM requires models to predict the masked tokens depending on the bilingual con-texts. This encourages models to align the repre-sentations of two languages together.

Besides TLM, there are some other effective methods to learn multilingual representations from parallel corpora. Unicoder (Huang et al., 2019a) provides two novel pre-training tasks based on par-allel corpora: cross-lingual word recovery (CLWR) and cross-lingual paraphrase classification (CLPC). CLWR uses target language embeddings to repre-sent source language embeddings by leveraging at-tention mechanisms, and its objective is to recover the source language embeddings. This task enables models to learn word-level alignments between dif-ferent languages. CLPC treats aligned sentences as positive pairs and samples misaligned sentences as negative pairs to perform sentence-level classi-fication, letting models predict whether the input pair is aligned or not. With CLPC, models can learn sentence-level alignments between different languages. ALM (Yang et al., 2020) automatically generates code-switched sequences from parallel sentences and performs MLM on it, which forces models to make predictions based only on contexts of other languages. InfoXLM (Chi et al., 2020b) analyzes MMLM and TLM from the perspective of information theory, and encourages models to distinguish aligned sentence pairs with misaligned negative examples under the framework of con-trastive learning. HICTL (Wei et al., 2021) extends the idea of using contrastive learning to learn both sentence-level and word-level cross-lingual repre-sentations. ERNIE-M (Ouyang et al., 2020) pro-poses back-translation masked language modeling (BTMLM), and expands the scale of parallel cor-pora through back-translation mechanisms. These works show that leveraging parallel corpora can bring much help towards learning cross-lingual rep-resentations.

然而,MMLM任务不能很好地利用平行语料库。事实上,平行语料库对于机器翻译等NLP任务非常重要。从直观上看,平行语料库对于直接学习不同语言中具有相同含义的句子的跨语言表征非常有帮助。从这一点出发,XLM (Lample and Conneau, 2019)利用双语句子对来执行翻译语言建模(TLM)任务。与BERT中的MLM相似,TLM将两个语义匹配的句子拼接成一个句子,并随机掩码两部分中的token。与MLM相比,TLM要求模型根据双语语境预测被掩码的token,这鼓励模型将两种语言的表征对齐。
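
The following sketch shows how a TLM training instance might be built from a bilingual pair, under simplified tokenization and special tokens; it is an illustration of the objective rather than XLM's actual preprocessing.

```python
# A minimal sketch of building a TLM example: concatenate a parallel sentence pair
# and mask tokens on both sides, so recovering a masked token can use the other
# language's context.
import random

def build_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, mask_token="[MASK]"):
    pair = ["[CLS]"] + src_tokens + ["[SEP]"] + tgt_tokens + ["[SEP]"]
    inputs, labels = [], []
    for tok in pair:
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)        # the model must predict the original token here
        else:
            inputs.append(tok)
            labels.append(None)       # None marks positions that incur no loss
    return inputs, labels

en = "the cat sleeps".split()
fr = "le chat dort".split()
print(build_tlm_example(en, fr))
```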

除了TLM外,还有一些从平行语料库学习多语言表征的有效方法。Unicoder (Huang et al., 2019a)提供了两个基于平行语料库的新型预训练任务:跨语言单词恢复(CLWR)和跨语言意译分类(CLPC)。CLWR利用目标语言嵌入机制来表示源语言嵌入,其目标是恢复源语言嵌入。这个任务使模型能够学习不同语言之间的单词级别对齐CLPC将对齐的句子作为正对,样本未对齐的句子作为负对,以执行句子级别的分类,让模型预测输入对是否对齐。使用CLPC,模型可以学习不同语言之间的句子级别对齐ALM (Yang et al., 2020)自动从平行句中生成代码转换序列,并对其进行MLM,这迫使模型仅基于其他语言的上下文进行预测。InfoXLM (Chi et al., 2020b)从信息论的角度分析MMLM和TLM,鼓励模型在对比学习框架下区分对齐的句子对和未对齐的反例。HICTL (Wei et al., 2021)扩展了使用对比学习来学习句子级和单词级跨语言表示的想法。ERNIE-M (Ouyang et al., 2020)提出了反向翻译掩码语言模型(BTMLM),并通过反向翻译机制扩展了平行语料库的规模。这些研究表明,利用平行语料库对学习跨语言表征有很大帮助。

Researches have also widely explored generative models for multilingual PTMs. Normally, a gener-ative model consists of a Transformer encoder and a Transformer decoder. For example, MASS (Song et al., 2019) extends MLM to language generation. It randomly masks a span of tokens in the input sentence and predicts the masked tokens in an autoregressive manner. Denoising autoencoding (DAE) is a typical generation task, which applies noise functions to the input sentence and then re-stores the original sentence with the decoder. The noise functions of DAE usually contain two operations: replacing a span of tokens with a mask token as well as permuting the order of tokens. mBART (Liu et al., 2020c) extends DAE to support multiple languages by adding special symbols. It adds a lan-guage symbol both to the end of the encoder input and the beginning of the decoder input. This en-ables models to know the languages to be encoded and generated.

Although DAE in mBART (Liu et al., 2020c) is trained with multiple languages, the encoding input and the decoding output are always in the same language. This leads models to capture spurious correlations between language symbols and generated sentences. In other words, models may ignore the given language symbols and directly generate sentences in the same language of the input. To address this issue, XNLG (Chi et al., 2020a) proposes the cross-lingual autoencoding (XAE) task. Different from DAE, the encoding input and the decoding output of XAE are in different languages, which is similar to machine translation. In addition, XNLG optimizes parameters in a two-stage manner. It trains the encoder with the MLM and TLM tasks in the first stage. Then, it fixes the encoder and trains the decoder with the DAE and XAE tasks in the second stage. All parameters are well pre-trained by this way, and the gap between pre-training with MLM and fine-tuning with autoregressive decoding is also filled.

研究人员也广泛探索了多语言PTMs的生成模型。通常,生成模型由Transformer编码器和Transformer解码器组成。例如,MASS (Song et al., 2019)将MLM扩展到语言生成:它在输入句中随机掩码一段token跨度,并以自回归的方式预测被掩码的token。去噪自编码(DAE)是一种典型的生成任务,它对输入句子施加噪声函数,然后用解码器恢复原始句子。DAE的噪声函数通常包含两种操作:用一个掩码token替换一段token跨度,以及打乱token的顺序。mBART (Liu et al., 2020c)通过添加特殊符号扩展了DAE以支持多种语言:它在编码器输入的末尾和解码器输入的开头都添加了语言符号,这使得模型能够知道要编码和生成的语言。

虽然mBART中的DAE (Liu et al., 2020c)使用多种语言进行训练,但编码输入和解码输出总是同一种语言。这导致模型捕捉到语言符号和生成句子之间的虚假相关性。换句话说,模型可能会忽略给定的语言符号,直接以与输入相同的语言生成句子。为了解决这个问题,XNLG (Chi et al., 2020a)提出了跨语言自编码(XAE)任务。与DAE不同的是,XAE的编码输入和解码输出是不同的语言,这类似于机器翻译。此外,XNLG以两阶段的方式优化参数:在第一阶段,利用MLM和TLM任务训练编码器;然后在第二阶段固定编码器,并使用DAE和XAE任务训练解码器。通过这种方式,所有参数都得到了充分的预训练,同时也弥合了使用MLM进行预训练和使用自回归解码进行微调之间的差距。

5.2 Multimodal Pre-Training多模态预训练

Large-scale pre-training and its downstream applications have cascaded impactful research and development with diverse real-world modalities. We see objects, hear sounds and speak languages. Modalities, such as audio, video, image and text, refer to how something happens or is experienced. Tasks involving multiple modalities are developing at a fast pace. More recently, large-scale PTMs have enhanced research interests in the intersection of multiple modalities, such as the intersection of image and text, or the intersection of video and text. Specifically, this kind of modalities can all be classified as vision and language (V&L), considering that images and videos belong to vision while text and speech (audio) belong to language. V&L tasks can be further divided into image-text-based tasks, video-text-based tasks, and video-audio-based tasks according to their specific modalities being used.

We now present a detailed overview of the previous trends in pre-training on V&L modalities. First, for image-text-based PTMs, the most current solu-tions are to adopt visual-linguistic BERT. The main difficulty relies upon integrating non-text informa-tion into the framework of BERT. ViLBERT (Lu et al., 2019) is a model to learn task-agnostic joint representations of images and languages. It extends the BERT architecture to a multimodal model that supports two streams of input, by preprocessing textual and visual information separately. After two encoders, it uses Transformer layers to ob-tain united attention results for both textual and visual information. ViLBERT first provides a new mind for learning the relationship between vision and language, which is no longer limited to learn a specific task but takes the relationship between vision and language as a pre-trainable and transfer-able ability of models. It uses three pre-training tasks: MLM, sentence-image alignment (SIA) and masked region classification (MRC). It is evalu-ated on five downstream tasks: visual question an-swering (VQA), visual commonsense reasoning (VCR), grounding referring expressions (GRE), image-text retrieval (ITIR) and zero-shot image-text retrieval (ZSIR). LXMERT (Tan and Bansal, 2019) has similar architecture compared to Vil-BERT but uses more pre-training tasks: MLM, SIA, MRC, masked region feature regression (MRFR) and VQA. LXMERT is tested on three downstream tasks: VQA, graph question answering (GQA) and natural language for visual reasoning (NLVR2).

大规模的预训练及其下游应用已经将具有影响力的研究和开发与各种现实世界的模态串联起来。我们看到物体,听到声音,说着语言。模态,如音频、视频、图像和文本,指的是某件事如何发生或如何被体验。涉及多种模态的任务正在快速发展。最近,大规模的PTM增强了人们对多种模态交叉的研究兴趣,如图像和文本的交叉,或视频和文本的交叉。具体来说,考虑到图像和视频属于视觉,而文本和语音(音频)属于语言,这类模态都可以归为视觉和语言(V&L)。V&L任务根据其使用的具体模态又可分为基于图像-文本的任务、基于视频-文本的任务和基于视频-音频的任务。

现在,我们将详细概述V&L模态的预训练趋势。首先,对于基于图像文本的PTMs,当前的解决方案是采用视觉语言BERT。主要的困难在于将非文本信息整合到BERT框架中。ViLBERT (Lu et al., 2019)是一个学习与任务无关的图像和语言联合表示的模型。通过分别对文本和视觉信息进行预处理,将BERT架构扩展为支持两种输入流的多模态模型。经过两个编码器后,使用Transformer层对文本和视觉信息获得统一的注意力结果ViLBERT 首先为学习视觉与语言之间的关系提供了一种新的思路不再局限于学习特定的任务,而是将视觉与语言之间的关系看作是模型的一种可预训练和可迁移的能力。该算法采用三个预训练任务:MLM句子-图像对齐(SIA)和掩码区域分类(MRC)。它在5个下游任务上进行评估:视觉问题回答(VQA)、视觉常识推理(VCR)、基础参考表达式(GRE)、图像文本检索(ITIR)和零样本图像文本检索(ZSIR)。与ViLBERT 相比,LXMERT (Tan和Bansal, 2019)具有类似的架构,但使用了更多的预训练任务MLMSIAMRC掩码区域特征回归(MRFR)和VQALXMERT在三个下游任务上进行测试:VQA图形问题回答(GQA)和用于视觉推理的自然语言(NLVR2)。

VisualBERT (Li et al., 2019), on the other side, extends the BERT architecture at the minimum. It can be regarded as a simple and effective base-line for V&L pre-training. The Transformer lay-ers of VisualBERT implicitly align elements in the input text and image regions. It uses two pre-training tasks: MLM and IA, and is tested on four downstream tasks: VQA, VCR, NLVR2, and ITIR. Unicoder-VL (Li et al., 2020a) moves the offsite visual detector in VisualBERT into an end-to-end version. It designs the image token for Transformers as the sum of the bounding box and object label features. It uses MLM, SIA and masked object classification (MOC) as its pre-training tasks, as well as uses IR, ZSIR and VCR as its downstream tasks. VL-BERT(Su et al., 2020) also uses a similar architecture to VisualBERT. For VL-BERT, each input element is either a token from the input sen-tence or a region-of-interest (RoI) from the input image. It uses MLM and MOC as the pre-training tasks and finds that adding SIA will decrease model performance. It is evaluated on three downstream tasks: VQA, VCR and GRE.

Some multimodal PTMs are designed to solve specific tasks such as VQA. B2T2(Alberti et al., 2019) is the model that mainly focuses on VQA. It designs a model for early fusion of the co-reference between textual tokens and visual object features, and then uses MLM and SIA as the pre-training tasks. VLP (Zhou et al., 2020a) focuses on VQA and image captioning. It uses a shared multi-layer Transformer for both encoding and decod-ing, different from many existing methods whose encoder and decoder are implemented using sep-arate models. It is pre-trained on bidirectional masked language prediction (BMLP) and sequence to sequence masked language prediction (s2sMLP). Furthermore, UNITER (Chen et al., 2020e) learns unified representations between the two modali-ties. UNITER tries many pre-trained tasks, such as MLM, SIA, MRC and MRFR. UNITER is also tested on various downstream tasks: VQA, IR, VCR, NLVR2, referring expression comprehension (REC), and visual entailment (VE).

另一方面,VisualBERT (Li et al., 2019)以最小的改动扩展了BERT架构,可以看作是一个简单而有效的V&L预训练基线。VisualBERT的Transformer层隐式地对齐输入文本和图像区域中的元素。它使用两个预训练任务:MLM和SIA,并在四个下游任务上测试:VQA、VCR、NLVR2和ITIR。Unicoder-VL (Li et al., 2020a)将VisualBERT中的场外视觉检测器改为端到端版本。它将输入给Transformer的图像token设计为包围盒特征和对象标签特征之和。它以MLM、SIA和掩码对象分类(MOC)作为预训练任务,并使用IR、ZSIR和VCR作为其下游任务。VL-BERT (Su等人,2020)也使用了与VisualBERT类似的架构。对于VL-BERT,每个输入元素要么是来自输入句子的token,要么是来自输入图像的感兴趣区域(RoI)。它以MLM和MOC作为预训练任务,并发现添加SIA会降低模型性能。它在三个下游任务上进行评估:VQA、VCR和GRE。

一些多模态PTMs被设计用来解决特定的任务,比如VQAB2T2(Alberti et al., 2019)是主要关注于VQA的模型。该算法设计了一个早期融合文本标记视觉对象特征之间的协同引用的融合模型,并以MLMSIA作为预训练任务。VLP (Zhou et al., 2020a)专注于VQA和图像字幕。它使用一个共享的多层Transformers进行编码和解码,这与许多现有方法不同,后者的编码器和解码器是使用单独的模型实现的。它在双向掩码语言预测 (BMLP) 和序列到序列掩码语言预测 (s2sMLP) 上进行了预训练。此外,UNITER (Chen等人,2020e)学习了两种模态之间的统一表示UNITER尝试许多预训练的任务,如MLM,SIA, MRCMRFRUNITER 还在各种下游任务上进行了测试:VQAIRVCRNLVR2参照表达理解(REC)和视觉蕴涵(VE)。

ImageBERT (Qi et al., 2020) is the same as Unicoder-VL. It designs a novel weakly super-vised approach to collect large-scale image-text data from the website, whose volume and quality are essential to V&L pre-train tasks. The collect-ing steps include web-page collection, image filter-ing, sentence detection, sentence cleaning, image-text semantic scoring, and image-text aggregation. The resulting dataset contains ten million images and their descriptions with an average length of 13 words, which shows benefits to pre-training multi-modal PTMs. The pre-training tasks include MLM, SIA, MOC and MRFR, while only being tested on one downstream task: ITIR. Lu et al. (2020) investi-gate relationships between nearly all V&L tasks by developing a large-scale, multi-task training regime. It classifies the common tasks into four groups: VQA, caption-based image retrieval, grounding referring expressions, and multimodal verification. It adopts two pre-training tasks by masking mul-timodal modeling only for aligned image-caption pairs and masking overlapped image regions, while performing well on five downstream tasks: VQA, GQA, IR, RE and NLVR2.

X-GPT (Xia et al., 2020) finds that while pre-vious BERT-based multimodal PTMs produce ex-cellent results on downstream understanding tasks,they cannot be applied to generation tasks directly. It is then proposed to pre-train text-to-image cap-tion generators through three novel generation tasks, including image-conditioned masked lan-guage modeling (IMLM), image-conditioned de-noising autoencoding (IDA), and text-conditioned image feature generation (TIFG). For downstream tasks, it focuses only on image captioning (IC). Oscar (Li et al., 2020e) uses object tags detected in images as anchor points to ease the learning of alignments significantly. It is motivated by the ob-servation that the salient objects in an image can be accurately detected and often mentioned in the paired text. It performs well on six downstream tasks: ITIR, IC, novel object captioning (NOC), VQA, GCQ and NLVR2.

ImageBERT (Qi et al., 2020)与Unicoder-VL是一样的。它设计了一种新颖的弱监督方法来从网站上收集大规模的图像文本数据,其数量和质量对于V&L预训练任务至关重要。收集步骤包括网页收集、图像过滤、句子检测、句子清洗、图像文本语义评分、图像文本聚合。得到的数据集包含1000万幅图像和它们的描述,平均长度为13个单词,这显示出对预训练的多模态PTMs的好处。预训练的任务包括MLM, SIA, MOCMRFR,而只测试一个下游任务ITIR。Lu等人(2020)通过开发大规模的多任务训练机制,调查了几乎所有V&L任务之间的关系。它将常见的任务分为四组:VQA基于标题的图像检索grounding引用表达式多模态验证。该算法采用两种预训练任务,即仅针对对齐的图像标题对进行多模态建模,对重叠的图像区域进行掩码,同时在5个下游任务上表现良好:VQA、GQA、IR、RE NLVR2

X-GPT (Xia et al., 2020)发现,尽管之前基于BERT的多模态PTMs在下游理解任务上产生了很好的结果,但它们不能直接应用于生成任务。因此,它提出通过图像条件掩码语言建模(IMLM)、图像条件去噪自编码(IDA)和文本条件图像特征生成(TIFG)这3种新的生成任务对文本到图像的标题生成器进行预训练。在下游任务上,它只关注图像字幕(IC)。Oscar (Li et al., 2020e)使用图像中检测到的对象标签作为锚点,以显著简化对齐的学习。它的动机是观察到图像中的显著物体可以被准确地检测到,并经常在配对的文本中被提及。它在ITIR、IC、新对象字幕(NOC)、VQA、GCQ和NLVR2这6个下游任务上表现良好。

A bigger step towards conditional zero-shot image generation is taken by DALLE (Ramesh et al., 2021) from OpenAI and CogView (Ding et al., 2021) from Tsinghua and BAAI. DALLE is the very first transformer-based text-to-image zero-shot pre-trained model with around 10 billion parameters. It shows the potential of multi-modal pre-trained models to bridge the gap between text descriptions and image generation, especially the excellent ability in combining different objects, such as “an armchair in the shape of an avocado”. CogView improves the numerical precision and training stability by introducing the sandwich transformer and sparse attention mechanism, and thus surpasses DALLE in FID. It is also the first text-to-image model in Chinese.

Recently, CLIP (Radford et al., 2021) and WenLan (Huo et al., 2021) explore enlarging web-scale data for V&L pre-training with big success. Comparing to previous works, they face a large-scale distributed pre-training challenge. We will introduce how to handle the large-scale distributed pre-training challenge in the next section.

OpenAI的DALLE (Ramesh et al., 2021)以及清华大学和BAAI的CogView (Ding et al., 2021)向着有条件的零样本图像生成迈出了更大的一步。DALLE是第一个基于Transformer的文本到图像零样本预训练模型,具有大约100亿个参数。它展示了多模态预训练模型在弥合文本描述与图像生成之间差距方面的潜力,特别是在组合不同对象方面的出色能力,例如"一个鳄梨形状的扶手椅"。CogView通过引入三明治Transformer和稀疏注意力机制,提高了数值精度和训练稳定性,从而在FID上超过了DALLE。它也是第一个中文文本到图像模型。

最近,CLIP (Radford et al., 2021)和WenLan (Huo et al., 2021)探索了扩大网络规模的V&L预训练数据,取得了巨大成功。与以往的研究相比,它们面临着大规模分布式预训练的挑战。我们将在下一节介绍如何应对大规模分布式预训练的挑战。

5.3 Knowledge-Enhanced Pre-Training增强知识的预训练

Figure 9: An illustration of ZeRO-Offload and ZeRO-Offload with delayed parameter update

图9:带有延迟参数更新的ZeRO-Offload和ZeRO-Offload示意图

PTMs can extract plenty of statistical information from large amounts of data. Besides, external knowledge, such as knowledge graphs, domain-specific data and extra annotations of pre-training data, is the outcome of human wisdom which can be a good prior to the modeling of statistics. In this subsection, we classify external knowledge accord-ing to the knowledge format and introduce several methods attempting to combine knowledge with PTMs.

The typical form of structured knowledge is knowledge graphs. Many works try to enhance PTMs by integrating entity and relation embed-dings (Zhang et al., 2019b; Liu et al., 2020a; Pe-ters et al., 2019; Sun et al., 2020; Rosset et al., 2020; Qin et al., 2021) or their alignments with the text (Xiong et al., 2019; Sun et al., 2019b). How-ever, real-world knowledge graphs like Wikidata contain more information than entities and rela-tions. Wang et al. (2021) pre-train models based on the descriptions of Wikidata entities, by incorporating a language model loss and a knowledge em-bedding loss together to get knowledge-enhanced representations. Some works regard the paths and even sub-graphs in knowledge graphs as a whole, and directly model them and the aligned text to re-tain more structural information. Since aligning en-tities and relations to raw text is often troublesome and can introduce noise in data pre-processing, an-other line of works (Bosselut et al., 2019; Guan et al., 2020; Chen et al., 2020d) can directly con-vert structural knowledge into the serialized text and let models learn knowledge-text alignments by themselves. An interesting attempt is OAG-BERT (Liu et al., 2021a), which integrates hetero-geneous structural knowledge in the open academic graph (OAG) (Zhang et al., 2019a), which covers 0.7 billion heterogeneous entities and 2 billion relations.

PTMs可以从大量的数据中提取丰富的统计信息。此外,外部知识,如知识图谱、特定领域的数据和对预训练数据的额外标注等,都是人类智慧的产物,可以作为统计建模的良好先验。在这个小节中,我们将根据知识的格式对外部知识进行分类,并介绍几种尝试将知识与PTMs相结合的方法。

结构化知识的典型形式是知识图谱。许多工作试图通过整合实体和关系嵌入(Zhang et al., 2019b; Liu et al., 2020a; Peters et al., 2019; Sun et al., 2020; Rosset et al., 2020; Qin et al., 2021)或它们与文本的对齐(Xiong et al., 2019; Sun et al., 2019b)来增强PTMs。然而,像Wikidata这样的现实世界知识图谱包含的信息不止实体和关系。Wang等人(2021)基于Wikidata实体的描述来预训练模型,将语言模型损失和知识嵌入损失结合在一起,以获得知识增强的表示。有些工作将知识图谱中的路径甚至子图视为一个整体,直接对它们和对齐的文本进行建模,以保留更多的结构信息。由于将实体和关系与原始文本对齐通常很麻烦,并且可能在数据预处理中引入噪声,另一类工作(Bosselut等人,2019年;Guan等人,2020年;Chen et al., 2020d)直接将结构化知识转换为序列化文本,让模型自行学习知识与文本的对齐。一个有趣的尝试是OAG-BERT (Liu et al., 2021a),它整合了开放学术图(OAG) (Zhang et al., 2019a)中的异构结构知识,该图覆盖了7亿个异构实体和20亿个关系。

Compared to structured knowledge, unstructured knowledge is more intact but also noisier. How to effectively model this kind of knowledge from the data is also worth being explored. The data of a specific domain or task can be considered as a kind of unstructured knowledge. Many works (Beltagy et al., 2019; Lee et al., 2020) further pre-train the general PTMs on this data to get better domain-specific or task-specific models. Since there are some domain-specific and task-specific human an-notations, Ke et al. (2020) incorporate these ex-tra annotations to get better domain-specific and task-specific language representations. For all the above-mentioned works, knowledge is implic-itly stored in their model parameters. To model external knowledge in a more interpretable way, some works (Lewis et al., 2020b; Guu et al., 2020) design retrieval-based methods to use structured knowledge on downstream tasks. Another kind of works (Wang et al., 2020b) can use adapters trained on different knowledge sources with extra annotations to distinguish where the knowledge is from.

与结构化知识相比,非结构化知识更完整,但也更嘈杂。如何从数据中有效地为这类知识建模也是值得探索的。特定领域或任务的数据可以看作是一种非结构化知识。许多工作(Beltagy等人,2019年;Lee等人,2020)进一步根据这些数据对通用的PTMs进行预训练,以获得更好的领域特定模型任务特定模型。由于存在一些特定于领域和特定于任务的人工注释,Ke等人(2020)合并了这些额外的注释,以获得更好的特定于领域和特定于任务的语言表示。对于所有上述工作,知识隐含地存储在它们的模型参数中。为了以一种更易于解释的方式外部知识建模,一些研究(Lewis等人,2020b;Guu等人,2020)设计了基于检索的方法,在下游任务上使用结构化知识。另一种工作(Wang et al., 2020b)可以使用在不同的知识源上训练过的适配器,并添加额外的注释区分知识的来源

6 Improving Computational Efficiency提高计算效率

As introduced in Section 1, a major trend of PTMs is that the number of parameters is getting larger and larger. Increasing the size of a neural network typically improves accuracy, but it also increases the memory and computational requirements for training the model. In this section, we will in-troduce how to improve computational efficiency from the following three aspects: system-level opti-mization, efficient learning algorithms, and model compression strategies.

正如第1节所介绍的,PTMs的一个主要趋势是参数的数量越来越大。增大神经网络的规模通常会提高精度,但也会增加训练模型所需的内存和计算量。在本节中,我们将从以下三个方面介绍如何提高计算效率:系统级优化、高效的学习算法和模型压缩策略。

6.1 System-Level Optimization系统级优化

Figure 10: An illustration of the data parallelism and model parallelism with 16 nodes.

图10:16个节点的数据并行性和模型并行性的图解

An effective and practical way to reduce compu-tational requirements is system-level optimization towards computational efficiency and memory usage. System-level optimization methods are often model-agnostic and do not change underlying learn-ing algorithms. Therefore, they are widely used in training large-scale PTMs. Generally, these meth-ods can be divided into single-device optimization methods and multi-device optimization ones.

Single-Device Optimization. Current large-scale PTMs usually cost a lot of memory for pre-training. This is mainly due to the redundant representation of floating-point numbers. Modern deep learning systems are mainly based on a single-precision floating-point format (FP32). However, the weights of models usually fall in a limited range, and us-ing a half-precision floating-point format (FP16) can accomplish most of the computation with little precision loss (Gupta et al., 2015).

However, in some cases, training models in FP16 may fail because of the floating-point trunca-tion and overflow. To tackle this problem, mixed-precision training methods (Micikevicius et al., 2018) have been proposed, which preserve some critical weights in FP32 to avoid the floating-point overflow and use dynamic loss scaling operations to get rid of the floating-point truncation. Sufficient experiments have shown that mixed-precision train-ing methods are more stable than directly training models in FP16. Although mixed-precision train-ing methods can significantly reduce the training time and memory usage, they still face some chal-lenges. When model parameters are not initialized well, mixed-precision methods may still cause un-stable training. All these challenges still require to be further explored.

减少计算需求的一种有效且实用的方法是针对计算效率和内存使用的系统级优化系统级优化方法通常是与模型无关,且不会改变底层的学习算法。因此,它们被广泛用于训练大规模 PTMs。这些方法一般可分为单设备优化方法和多设备优化方法。

单设备优化。当前大型PTMs的预训练需要花费大量的内存。这主要是由于浮点数的冗余表示。现代的深度学习系统主要基于单精度浮点格式(FP32)。然而,模型的权值通常落在一个有限的范围内,我们使用半精度浮点格式(FP16)完成大部分的计算,而几乎没有精度损失(Gupta等人,2015)。

但是,在某些情况下,用FP16训练模型可能会因为浮点截断和溢出而失败。为了解决这一问题,研究者提出了混合精度训练方法(Micikevicius et al., 2018),该方法在FP32中保留一些关键权重以避免浮点溢出,并使用动态损失缩放(loss scaling)操作来避免浮点截断。大量实验表明,混合精度训练方法比直接用FP16训练模型更稳定。混合精度训练方法虽然能显著减少训练时间和内存的使用,但仍面临一些挑战:在模型参数初始化不好的情况下,混合精度方法仍然会导致训练不稳定。所有这些挑战仍需进一步探索。
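
A minimal sketch of mixed-precision training with dynamic loss scaling, using PyTorch's torch.cuda.amp; `model`, `inputs`, `targets`, `loss_fn`, and `optimizer` are assumed to be provided elsewhere.

```python
# A minimal sketch of FP16/FP32 mixed-precision training with dynamic loss scaling.
import torch

scaler = torch.cuda.amp.GradScaler()            # maintains and adapts the loss scale

def train_step(model, inputs, targets, loss_fn, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # run the forward pass in FP16 where safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()               # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                      # unscale gradients; skip the step on overflow
    scaler.update()                             # adjust the scale factor dynamically
    return loss.item()
```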

Besides the redundant representation of floating-point numbers, the activation states saved for com-puting gradients are also redundant. For exam-ple, in Transformer-based models, apart from the weights of attention layers and linear layers, com-putational devices also store the hidden states of each layer for the efficiency of the chain rule used in the gradient back-propagation. As compared with model parameters, these hidden states can consume even much more memory. To handle re-dundant activation states, gradient checkpointing methods (Rasley et al., 2020) have been used to save memory by storing only a part of the activation states after forward pass. The discarded activation states are recomputed during the backward steps if necessary.

When pre-training recent large-scale PTMs, the memory consumption can be too large to fit in a single GPU. Therefore, some works (Huang et al., 2020a) attempt to store model parameters and acti-vation states with the CPU memory rather than the GPU memory, since the CPU memory is usually much larger. As shown in Figure 9, some works such as ZeRO-Offload (Ren et al., 2021) design delicate strategies to schedule the swap between the CPU memory and the GPU memory so that memory swap and device computation can be over-lapped as much as possible.

除了浮点数的冗余表示之外,为计算梯度保存的激活状态也是冗余的。例如,在基于Transformer的模型中,除了注意力层和线性层的权值外,计算设备还存储了每一层的隐藏状态,以提高梯度反向传播中链式法则的效率。与模型参数相比,这些隐藏状态甚至会消耗更多的内存。为了处理冗余的激活状态,使用了梯度检查点方法(Rasley et al., 2020),通过只存储前向传递后的部分激活状态来节省内存。如果需要,将在向后执行步骤期间重新计算丢弃的激活状态。
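
The following sketch shows gradient checkpointing with torch.utils.checkpoint on an illustrative stack of layers: activations inside each wrapped layer are discarded after the forward pass and recomputed during backward, trading compute for memory.

```python
# A minimal sketch of gradient checkpointing over a stack of illustrative layers.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layers = nn.ModuleList([nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(12)])

def forward_with_checkpointing(x):
    for layer in layers:
        # Only the inputs of each checkpointed layer are kept; its intermediate
        # activations are recomputed when gradients are needed.
        x = checkpoint(layer, x)
    return x

x = torch.randn(8, 512, requires_grad=True)
forward_with_checkpointing(x).sum().backward()
```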

在对最近的大规模 PTM 进行预训练时,内存消耗可能太大而无法容纳单个 GPU 中。因此,有些工作(Huang et al., 2020a)尝试用CPU内存而不是GPU内存来存储模型参数和激活状态,因为CPU内存通常要大得多。如图9所示,ZeRO-Offload (Ren et al., 2021)等一些工作设计了精巧的策略来调度CPU内存和GPU内存之间的交换,以便内存交换和设备计算尽可能多地重叠。

Figure 11: An illustration of the pipeline parallelism with 4 nodes and 4 micro batches.

图11:具有 4 个节点和 4 个微批次的管道并行示意图

Multi-Device Optimization. Recently, distributed training is commonly used in pre-training, where multiple GPUs distributed in many computational nodes are used together to train a single model. Data parallelism (Li et al., 2020d) is a simple and effective approach to accelerate training a model.

As shown in Figure 10, when we use data paral-lelism, a large batch is partitioned to different nodes and thus forward pass can be parallelized. At back-ward pass, the gradients on different nodes should be aggregated with all-reduce operations to ensure the consistency of parameter optimization, which may introduce additional communication overhead.

When pre-training models with billions to tril-lions of parameters, traditional data parallelism brings challenges of fitting whole model parame-ters into a single GPU, even with half-precision or mixed-precision training. Although this problem can be solved by using a GPU with larger mem-ory, the expenses can be hard to afford, limiting the use of PTM by ordinary researchers. Model parallelism is an effective way to tackle this prob-lem (Shazeer et al., 2018). As shown in Figure 10, when conducting model parallelism, model parame-ters can be distributed to multiple nodes. The com-munication operations between these nodes like reduce-scatter and all-gather guarantee the correct-ness of forward pass and backward pass. Megatron-LM (Shoeybi et al., 2019) adopts model parallelism to Transformer-based PTMs. It splits self-attention heads as well as feed-forward layers into differ-ent GPUs, reducing the memory burden of a sin-gle GPU. Mesh-Tensorflow (Shazeer et al., 2018) also enables users to split tensors along any tensor dimensions, which can bring more customized options for model parallelism.

多设备优化。目前,分布式训练是一种常用的预训练方法,即利用分布在多个计算节点上的多个GPU一起用于训练单个模型。数据并行(Li et al., 2020d)是一种简单而有效的加速模型训练的方法。

如图10所示,当我们使用数据并行方式时,一个大的批处理被划分到不同的节点,因此可以并行化前向传递。在后向传递时,不同节点上的梯度应该通过 all-reduce 操作进行聚合,以确保参数优化的一致性,这可能会引入额外的通信开销

在对具有数十亿到数万亿个参数的模型进行预训练时,传统的数据并行性带来了将整个模型参数拟合到单个 GPU 中的挑战,即使是半精度或混合精度训练也是如此。虽然这一问题可以通过使用更大内存的GPU来解决,但费用可能难以承受,限制了普通研究人员对 PTM的使用。模型并行性是解决这一问题的有效方法(Shazeer et al., 2018)。如图10所示,在进行模型并行时,可以将模型参数分布到多个节点上。这些节点之间的通信操作,如reduce-scatterall-gather,保证了前向传播和后向传播的正确性Megatron-LM (Shoeybi et al., 2019)对基于Transformer的PTMs采用了模型并行性。它将自注意力头前反馈层拆分到不同的GPU中,从而减少了单个GPU的内存负担。Mesh-Tensorflow (Shazeer等人,2018)还允许用户沿任何张量维度拆分张量,这可以为模型并行性带来更多自定义选项

Although model parallelism enables different computational nodes to store different parts of model parameters, it has to insert collective communication primitives during both forward pass and backward pass, which can not be overlapped by device computation. On the contrary, the all-reduce collective communication operation in data parallelism usually can be overlapped by the backward computation. As a result, data parallelism is preferred as long as it can conquer the excessive requirement of memory capacity. In the standard implementation of data parallelism, optimizer states are usually copied along different nodes to guarantee synchronized optimization across data parallelism units. This redundancy leads to the additional overhead of GPU memory, especially when models are trained in a mixed-precision manner because the optimizer needs to store 32-bit master states of these models to ensure accuracy. To eliminate the redundancy brought by optimizer states and parameters, ZeRO optimizer (Rajbhandari et al., 2020) methods equally partition and distribute optimizer states to each node of data parallelism, such that each node only updates the optimizer states corresponding to its partition. At the end of a training step, all optimizer states are gathered across data parallelism nodes.

虽然模型并行性使得不同的计算节点可以存储模型参数的不同部分,但是在前向传递后向传递时都需要插入集体通信原语,这些原语不能被设备计算重叠。与此相反,数据并行中的all-reduce集体通信操作通常会被反向计算重叠。因此,只要能够克服内存容量的过度需求,数据并行就会成为首选。在数据并行性的标准实现中,优化器状态通常沿着不同的节点复制,以保证数据并行性单元之间的同步优化。这种冗余导致GPU内存的额外开销,特别是当模型以混合精度的方式训练时,因为优化器需要存储这些模型的32位主状态,以确保准确性。为了消除优化器状态和参数带来的冗余,ZeRO 优化器(Rajbhandari et al., 2020)方法将优化器状态平均划分并分配到数据并行等位的每个节点,这样每个节点只更新其分区对应的优化器状态。在训练步骤的最后,所有的优化器状态都被收集到数据并行节点上。

The above-mentioned model parallelism techniques mainly focus on partitioning and parallelizing matrix operations across different nodes. As shown in Figure 11, another effective method for model parallelism is pipeline parallelism, which partitions a deep neural network into multiple layers and then puts different layers onto different nodes. After the computation of each node, the output is sent to the next node where the next layer computation takes place. Since pipeline parallelism only needs to communicate the intermediate activation states between nodes performing adjacent stages of the pipeline, the communication cost is relatively small. Existing pipeline methods include GPipe (Huang et al., 2019b) which can send smaller parts of samples within a mini-batch to different nodes, and TeraPipe (Li et al., 2021) which can apply token-level pipeline mechanisms for Transformer-based models to make each token in a sequence be processed by different nodes. Both of these pipeline methods speed up the large-scale PTMs. However, they should be stopped at the end of each batch until the gradient back-propagation is complete, which can lead to pipeline bubbles.

上述模型并行技术主要集中在对不同节点的矩阵运算进行划分和并行化。如图11所示,另一种有效的模型并行方法是管道并行,它将一个深度神经网络划分为多层,然后将不同的层放到不同的节点上。每个节点计算完后,输出被发送到下一个节点,在那里进行下一层计算

由于管道并行只需要在执行流水线相邻阶段的节点之间传递中间激活状态,因此通信成本相对较小。现有的管道并行方法包括GPipe (Huang et al., 2019b),它可以将小批量中更小的样本子集发送到不同的节点;以及TeraPipe (Li et al., 2021),它可以为基于Transformer的模型应用token级的流水线机制,使序列中的每个token由不同的节点处理。这两种管道方法都加快了大规模PTMs的训练速度。但是,它们都必须在每个批次结束时等待梯度反向传播完成,这可能会导致管道气泡(pipeline bubbles)。

6.2 Efficient Pre-Training 高效的预训练

Besides some system-level optimization methods, various efforts have been devoted to exploring more efficient pre-training methods, so that we can pre-train large-scale PTMs with a lower cost solution.

Efficient Training Methods. Conventional pre-training tasks can be sample-inefficient. For exam-ple, for MLM which is widely used to pre-train re-cent PTMs, models are required to predict masked tokens according to contexts. The masked tokens are usually a subset (typically 15%) of input tokens,i.e., models can only learn from a small set of input tokens. To tackle this problem, ELECTRA (Clark et al., 2020) applies the replaced token detection task. This task forces models to distinguish whether an input token is replaced by a generator. This task can leverage more supervision information from each sample since all input tokens need to be distinguished. ELECTRA takes much fewer pre-training steps when it reaches similar performance to those MLM models. Furthermore, traditional MLM randomly masks tokens in a document to predict. Since the difficulty of predicting different tokens varies a lot, the random masking strategy makes the training process aimless and inefficient. Therefore, some works selectively mask tokens based on their importance (Gu et al., 2020) or gra-dients (Chen et al., 2020b) in back-propagation to speed up model training.

Apart from the pre-training tasks, the current pre-training dynamics are also sub-optimal. Re-cent large-scale PTMs usually require a large batch size. But in an early work (Goyal et al., 2017), researchers find that naively increasing the batch size may cause difficulty in optimization. There-fore, they propose a warmup strategy that linearly increases the learning rate at the beginning of train-ing. This strategy is commonly used in recent large-scale PTMs. Another feature of recent PTMs is that they are usually composed of multiple stacks of a base structure like Transformers. The con-ventional training paradigm optimizes each layer simultaneously using the same hyper-parameters. However, some recent works study Transformer-based models and claim that different layers can share similar self-attention patterns. Therefore, a shallow model can firstly be trained and then duplicated to construct a deep model (Gong et al., 2019). Some layers can also be dropped during training to reduce the complexity of back-propagation and weight update (Zhang and He, 2020). In addition, You et al. (2017) and You et al. (2020) find that adaptively using different learning rates at differ-ent layers can also speed up convergence when the batch size is large.

除了一些系统级的优化方法外,各种努力都在探索更有效的预训练方法,以便我们能够以更低的成本解决方案对大规模的PTMs进行预训练

有效的训练方法传统的预训练任务可能是样本效率低的。例如,对于广泛用于预训练近期 PTM 的 MLM,需要模型根据上下文预测掩码token。

掩码token通常是输入token的一个子集(通常是15%),即模型只能从一小部分输入token中学习。为了解决这个问题,ELECTRA (Clark et al., 2020)应用了替换token检测任务。此任务强制模型区分每个输入token是否被生成器替换过。由于需要区分所有输入token,此任务可以利用来自每个样本的更多监督信息。当ELECTRA达到与MLM模型相似的性能时,所需的预训练步骤要少得多。此外,传统的MLM随机掩码文档中的token进行预测。由于预测不同token的难度差异很大,随机掩码策略使训练过程变得漫无目的且效率低下。因此,有些工作根据token在反向传播中的重要性(Gu et al., 2020)或梯度(Chen et al., 2020b)选择性地掩码token,以加速模型训练。

除预训练任务外,当前的预训练动态也是次优的。最近的大型PTMs通常需要很大的批量。但在早期的研究中(Goyal et al., 2017),研究人员发现,简单地增大批量可能会导致优化困难。因此,他们提出了一种预热策略,在训练开始时线性增加学习率。这种策略在最近的大型PTMs中很常用。最近的PTMs的另一个特点是,它们通常由多层像Transformer这样的基础结构堆叠而成。传统的训练范式使用相同的超参数同时优化每一层。然而,最近一些针对基于Transformer的模型的研究表明,不同的层可以共享类似的自注意力模式。因此,可以先训练一个浅层模型,然后将其复制来构建一个深层模型(Gong et al., 2019)。在训练过程中还可以丢弃一些层,以降低反向传播和权重更新的复杂性(Zhang and He, 2020)。此外,You et al. (2017)和You et al. (2020)发现,当批量较大时,在不同的层自适应地使用不同的学习率也可以加快收敛速度。
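
A minimal sketch of the linear warmup schedule described above, using torch.optim.lr_scheduler.LambdaLR; the warmup and total step counts are illustrative values, not the settings of any specific PTM.

```python
# A minimal sketch of linear learning-rate warmup followed by linear decay.
import torch

def make_warmup_scheduler(optimizer, warmup_steps=10_000, total_steps=1_000_000):
    def lr_lambda(step):
        if step < warmup_steps:                       # linearly increase the LR at the start
            return step / max(1, warmup_steps)
        # then decay linearly towards zero for the rest of training
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = make_warmup_scheduler(optimizer)          # call scheduler.step() once per training step
```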

Efficient Model Architectures. Besides efficient pre-training methods, more variants of model ar-chitectures can also reduce the computational com-plexity to improve the efficiency of training PTMs. For most Transformer-based PTMs, as their input sequence goes longer, their efficiency is limited by the computation of attention weights due to its quadratic time and space complexity of the se-quence length. Therefore, many works attempt to reduce the complexity of Transformers. Some works (Peng et al., 2021; Choromanski et al., 2021; Wang et al., 2020c; Katharopoulos et al., 2020) de-sign low-rank kernels to theoretically approximate the original attention weights and result in linear complexity. Some works (Child et al., 2019) intro-duce sparsity into attention mechanisms by limiting the view of each token to a fixed size and separating tokens into several chunks so that the computation of attention weights takes place in every single chunk rather than a complete sequence. Compared to predefined chunks, some works (Roy et al., 2021; Kitaev et al., 2020) find that using learnable param-eters to assign tokens into chunks results in bet-ter performance. Another kind of methods (Guo et al., 2019; Lee et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020) combine global and local attention mechanisms, and then use global nodes to gather tokens in a sequence. In this way, the long sequence is compressed into a small number of elements so that we can reduce the complexity.

高效的模型架构。除了高效的预训练方法外,模型架构的更多变体也可以降低计算复杂度,从而提高训练PTMs的效率。对于大多数基于Transformer的PTMs,随着输入序列变长,由于注意力权重的计算在序列长度上具有二次的时间和空间复杂度,它们的效率受到限制。因此,很多工作都试图降低Transformer的复杂度。一些工作(Peng et al., 2021; Choromanski等人,2021年; Wang et al., 2020c; Katharopoulos等人,2020)设计了低秩核,从理论上近似原始的注意力权重,从而达到线性复杂度。有些工作(Child et al., 2019)通过将每个token的视野限制为固定大小并将token分成若干块,使注意力权重的计算在每个块内进行而不是在完整序列上进行,从而在注意力机制中引入稀疏性。与预定义的块相比,有些工作(Roy等人,2021; Kitaev等人,2020年)发现,使用可学习的参数将token分配到块中可以获得更好的性能。另一类方法(Guo et al., 2019; Lee等人,2019年; Beltagy等人,2020年; Ainslie等人,2020年; Zaheer等人,2020)结合全局和局部注意力机制,然后用全局节点来聚合序列中的token。通过这种方式,长序列被压缩为少量元素,从而降低了复杂度。

Keeping the same theoretical computation com-plexity as the original Transformer, more variants of the model structure can also accelerate the model convergence. Mix-of-experts (MoE) has been proved early (Shazeer et al., 2017) to increase the parameters of deep neural models while keep-ing the computational overhead nearly unchanged. Recently, Switch Transformers (Fedus et al., 2021) employ this technique in pre-training. They add multiple experts to each layer of Transformers. Dur-ing each forward and backward step, they select only one expert for computation, and thus the training and inference time remain similar to the ordi-nary Transformers without experts. Some experi-mental results show that MoE-based models con-verge faster than the ordinary ones due to the signif-icantly larger model capacity brought by multiple experts. Some efficient open-source toolkits (He et al., 2021) are also developed to train large-scale MoE-based models.

在保持与原始Transformer相同的理论计算复杂度的前提下,模型结构的更多变体也可以加速模型收敛。专家混合(Mix-of-experts, MoE)很早就被证明可以在保持计算开销几乎不变的情况下增加深度神经模型的参数量(Shazeer等人,2017)。最近,Switch Transformers (Fedus et al., 2021)在预训练中采用了这种技术。他们为Transformer的每一层添加了多个专家。在每次前向和后向传播中,只选择一个专家进行计算,因此训练和推理时间与没有专家的普通Transformer相似。一些实验结果表明,由于多个专家带来的模型容量显著增大,基于MoE的模型比普通模型收敛得更快。还有一些高效的开源工具包(He et al., 2021)被开发出来,用于训练基于MoE的大规模模型。
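
A minimal sketch of Switch-style top-1 routing: each token is sent to a single expert, so the parameter count grows with the number of experts while per-token compute stays roughly constant; the load-balancing loss and capacity limits of the real system are omitted.

```python
# A minimal sketch of a mixture-of-experts layer with top-1 (Switch-style) routing.
import torch
from torch import nn

class Top1MoE(nn.Module):
    def __init__(self, d_model=64, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)
        expert_idx = gate.argmax(dim=-1)        # pick a single expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            chosen = expert_idx == i
            if chosen.any():                    # route only the tokens assigned to expert i
                out[chosen] = expert(x[chosen]) * gate[chosen, i].unsqueeze(-1)
        return out

print(Top1MoE()(torch.randn(10, 64)).shape)     # torch.Size([10, 64])
```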

6.3 Model Compression模型压缩

Another important approach to improve the effi-ciency of PTMs is model compression. In this setting, large models are compressed to small ones to meet the demand for faster inference and deploy-ment on resource-constrained devices.

Parameter Sharing. PTMs can be compressed with sharing parameters across similar units. ALBERT (Lan et al., 2019) uses factorized embedding parameterization and cross-layer parameter sharing to reduce the parameters of PTMs. Using the same weights across all Transformer layers, ALBERT achieves a significant parameter reduction based on the BERT model, and meanwhile has the same or even better performance. This indicates that PTMs can be extremely over-parameterized.

另一种提高PTMs效率的重要方法是模型压缩。在这种情况下,大型模型被压缩为小型模型,以满足在资源受限设备上进行更快推理和部署的需求。

参数共享。PTM 可以通过在相似单元之间共享参数进行压缩。ALBERT(Lan et al., 2019)使用分解嵌入参数化和跨层参数共享减少PTMs的参数。在所有Transformer层中使用相同的权值ALBERTBERT模型的基础上实现了显著参数缩减,同时具有相同甚至更好的性能。这表明PTMs可能会被过度参数化
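
A minimal sketch of cross-layer parameter sharing in the spirit of ALBERT: one Transformer layer is instantiated and reused at every depth, so depth no longer multiplies the parameter count; the sizes are illustrative and factorized embeddings are omitted.

```python
# A minimal sketch of an encoder that reuses a single Transformer layer at every depth.
import torch
from torch import nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=12):
        super().__init__()
        # One set of layer weights, applied `num_layers` times.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):                       # x: (batch, seq, d_model)
        for _ in range(self.num_layers):
            x = self.shared_layer(x)
        return x

model = SharedLayerEncoder()
print(model(torch.randn(2, 8, 256)).shape)      # torch.Size([2, 8, 256])
# The parameter count is roughly 1/12 of an unshared 12-layer stack of the same sizes.
print(sum(p.numel() for p in model.parameters()))
```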

Model Pruning. To take more advantage of the over-parameterized feature of current PTMs, an-other method to reduce model parameters is model pruning, which cuts off some useless parts in PTMs to achieve accelerating while maintaining the per-formance. In (Fan et al., 2019), Transformer layers are selectively dropped during training, resulting in a more shallow model during inference. In (Michel et al., 2019), (Voita et al., 2019) and (Zhang et al., 2021b), researchers study the redundancy of the attention heads in Transformers and find that only a small part of them is enough for good perfor-mance. Most of these heads can be removed with little impact on the accuracy. Other trials such as CompressingBERT (Gordon et al., 2020) try to prune the weights of attention layers and linear lay-ers to reduce the number of parameters in PTMs, while maintaining the comparable performance to the original model.

Knowledge Distillation. Although ALBERT saves the memory usage of PTMs, its inference time is not significantly decreased since features still need to go through its layers with the same number as the original model. Knowledge distillation aims at training a small model to reproduce the behavior of a large teacher model. The memory usage and the time overhead are both decreased when using a small distilled model for inference. There are some typical works employing knowledge distillation for PTMs, such as DistillBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2019), BERT-PKD (Sun et al., 2019a) and MiniLM (Wang et al., 2020d). In these works, a small student model is trained to mimic the output probability, the hidden states, and the attention matrices of a large teacher model during both the pre-training and fine-tuning stages. With knowledge distillation, the modeledge in the teacher model is transferred into the student model, which can lead to increased performance compared to training a student model alone. However, the knowledge distillation methods mentioned above require the data used for pre-training the teacher model, which is usually not released in consideration of the data copyright and privacy. Moreover, the teacher model needs to forward over the entire pre-training data to produce logits or intermediate representations for knowledge distillation, causing an even longer training time.

模型修剪。为了更好地利用当前 PTM 的过度参数化特性,另一种减少模型参数的方法是模型剪枝,即剪掉 PTM 中一些无用的部分,以在保持性能的同时实现加速。在(Fan et al., 2019)中,Transformer层在训练过程中被选择性地丢弃,从而在推理过程中产生更浅的模型。在(Michel et al., 2019)、(Voita et al., 2019)和(Zhang et al., 2021b)中,研究人员研究了Transformers注意力头的冗余性,并发现只有一小部分注意力头就足以获得良好的性能。大多数头部可以被移除,而对精度影响很小。CompressingBERT (Gordon et al., 2020)等其他试验,试图修剪注意力层和线性层的权重,以减少PTMs中的参数量,同时保持与原始模型相当的性能。

知识蒸馏。虽然ALBERT节省了PTMs的内存使用量,但它的推理时间并没有显著减少,因为特征仍然需要经过与原始模型相同数量的层。知识蒸馏旨在训练一个小模型来复现大型教师模型的行为。当使用小型蒸馏模型进行推理时,内存使用量和时间开销都减少了。有一些典型的工作将知识蒸馏用于PTMs,如DistillBERT (Sanh等人,2019)、TinyBERT (Jiao等人,2019)、BERT-PKD (Sun等人,2019a)和MiniLM (Wang等人,2020d)。在这些工作中,一个小的学生模型被训练来模仿大的教师模型在预训练和微调阶段的输出概率、隐藏状态和注意力矩阵。通过知识蒸馏,教师模型中的模型知识(modeledge)被迁移到学生模型中,与单独训练学生模型相比可以提升性能。但是,上述知识蒸馏方法需要用于预训练教师模型的数据,考虑到数据版权和隐私,这些数据通常不会发布。此外,教师模型需要对整个预训练数据做一遍前向计算,以生成用于知识蒸馏的logits或中间表示,从而导致更长的训练时间。
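
A minimal sketch of the soft-label distillation loss commonly used to train a small student on a large teacher's output probabilities; the temperature and mixing weight are illustrative hyper-parameters rather than the settings of any specific method above.

```python
# A minimal sketch of a temperature-scaled knowledge distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions and match them with KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Keep a standard cross-entropy term on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```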

Model Quantization. To get a more compressed model, model quantization is also a useful tech-nique, which has been widely explored in some CNN-based models (Stock et al., 2020; Polino et al., 2018). Model quantization refers to the com-pression of higher-precision floating-point parame-ters to lower-precision floating-point ones. Conven-tional PTMs are usually represented in 32 bits or 16 bits, while models after quantization can be in 8 bits or even 1 or 2 bits. For recent Transformer-based models, 8-bit quantization has been proved to be ef-fective for model compression in Q8BERT (Zafrir et al., 2019), with little impact on the model per-formance. Despite this, training 1 or 2 Bits models remains challenging due to the significant decrease in model capacity. To alleviate the performance degradation, other methods to preserve the accu-racy can also be employed. Q-BERT (Shen et al., 2020a) uses mixed-bits quantization in which the parameters with higher Hessian spectrum require higher precision while those parameters with lower Hessian spectrum need lower precision. Ternary-BERT (Zhang et al., 2020b) applies knowledge distillation in quantization, forcing low-bit models to imitate full-precision models. Both Q-BERT and TernaryBERT result in ultra low-bit models. How-ever, low-bit representation is a highly hardware-related technique, which means quantization often requires specific hardware and can not generalize to other devices.

模型量化。为了得到更压缩的模型,模型量化也是一种有用的技术,这在一些基于CNN的模型中得到了广泛的探索(Stock et al., 2020;Polino et al., 2018)。模型量化是指将高精度浮点参数压缩为低精度参数。传统的PTMs通常用32位或16位表示,而量化后的模型可以用8位甚至1位或2位表示。对于最近的基于Transformer的模型,Q8BERT (Zafrir等人,2019)证明了8位量化对模型压缩是有效的,且对模型性能的影响很小。尽管如此,由于模型容量的显著下降,训练1位或2位模型仍然具有挑战性。为了减轻性能的下降,还可以采用其他方法来保持精度。Q-BERT (Shen et al., 2020a)采用混合比特量化,其中Hessian谱较高的参数要求较高的精度,而Hessian谱较低的参数只需较低的精度。TernaryBERT (Zhang et al., 2020b)将知识蒸馏应用于量化,迫使低位(bit)模型模拟全精度模型。Q-BERT和TernaryBERT都得到了超低位模型。然而,低位表示是一种与硬件高度相关的技术,这意味着量化通常需要特定的硬件,不能推广到其他设备。
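To make the idea more tangible, here is a hedged sketch of post-training dynamic quantization using PyTorch's `torch.quantization.quantize_dynamic`. The two-layer feed-forward model is only a stand-in for the linear layers of a real PTM, and the exact quantization workflow (and API location) may differ across PyTorch versions and hardware backends.

```python
import torch
import torch.nn as nn

# A stand-in for the feed-forward part of a Transformer-based PTM, kept tiny for brevity.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
model.eval()

# Post-training dynamic quantization: weights of nn.Linear are stored in int8,
# while activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # same output shape, smaller weight storage
```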

7 Interpretation and Theoretical Analysis解释与理论分析

Beyond the superior performance of PTMs on various NLP tasks, researchers also explore to interpret the behaviors of PTMs, including understanding how PTMs work and uncovering the patterns that PTMs capture. These works cover several important properties of PTMs: knowledge, robustness, and structural sparsity/modularity. Moreover, there are some pioneering works on building the theoretical analysis for PTMs.

除了PTM在各种NLP任务上的卓越表现外,研究人员还探索解释PTM的行为,包括理解PTM如何工作和揭示PTM捕获的模式。这些工作涵盖了PTMs的几个重要特性:知识鲁棒性结构稀疏性/模块化。此外,在建立PTMs理论分析方面也有一些开创性的工作。

7.1 Knowledge of PTMs知识

The implicit knowledge captured by PTMs can be roughly divided into two categories: linguistic knowledge and world knowledge.

Linguistic Knowledge. The linguistic knowledge of PTMs attracts the most attention among all topics of PTMs' interpretation. Compared to conventional neural models such as CNNs and RNNs, which have fewer layers and parameters, large-scale PTMs can learn rich linguistic knowledge from massive pre-training data. In order to study PTMs' linguistic knowledge, researchers design several approaches:

(1) Representation Probing: Fix the parameters of PTMs and train a new linear layer on the hidden representations of PTMs for a specific probing task. It is the most popular approach because it can be easily adapted to any probing task without particular design.

(2) Representation Analysis: Use the hidden representations of PTMs to compute some statistics such as distances or similarities. According to these statistics, we can construct the relation between different words, phrases, or sentences.

(3) Attention Analysis: Similar to representation analysis, attention analysis computes statistics about attention matrices and is more suitable to discover the hierarchical structure of texts.

(4) Generation Analysis: Use language models to directly estimate the probabilities of different sequences or words. The target texts could be correct or incorrect with respect to some linguistic phenomena.

PTMs捕获的隐性知识大致可以分为两类:语言知识世界知识

语言知识。PTM的语言知识在PTM解释的所有主题中吸引了最多的关注。与层数和参数较少的传统神经模型(如CNN和RNN)相比,大规模的PTM可以从海量的预训练数据中学习到丰富的语言知识。为了研究PTMs的语言知识,研究者设计了几种方法:

(1)表征探测固定PTM的参数并在 PTM 的隐藏表示上训练一个新的线性层,用于特定的探测任务。这是最流行的方法,因为它可以很容易地适应任何探测任务而无需特殊设计。

(2)表征分析:利用PTMs的隐藏表示计算距离或相似度等统计量。根据这些统计数据,我们可以构建不同单词、短语或句子之间的关系

(3)注意力分析:与表征分析类似,注意力分析计算注意力矩阵的统计量,更适合于发现文本的层次结构。

(4)生成分析:使用语言模型直接估计不同序列或单词的概率。在某些语言现象中,目标文本可能是正确的或不正确的。

Representation probing has been widely applied to analyze NLP neural models from word embeddings to PTMs (Köhn, 2015; Ettinger et al., 2016; Shi et al., 2016; Adi et al., 2017; Conneau et al., 2018a; Hewitt and Manning, 2019; Glavaš and Vulić, 2021). Liu et al. (2019) conduct comprehensive probing experiments on 11 linguistic tasks and find that the representations given by large-scale PTMs are competitive compared to previous task-specific models, which indicates that the models have already learned knowledge about tokens, chunks, and pairwise relations. To further investigate how PTMs represent sentence structures in terms of syntactic, semantic, local, and long-range information, Tenney et al. (2019b) design a new edge probing task, examine PTMs on a broad suite of sub-sentence tasks, and show that PTMs have a strong ability to encode syntactic information while they bring little improvement on semantic tasks. Similarly, several works also reveal the strong syntax encoding of PTMs (Vilares et al., 2020; Warstadt and Bowman, 2020; Hewitt and Manning, 2019). To analyze the function of different layers, Jawahar et al. (2019a) and Tenney et al. (2019a) show that PTMs encode linguistic information with phrase features at the bottom, syntactic features in the middle and semantic features at the top. Compared to non-contextual representations (e.g., word2vec), PTMs' representations are better at encoding sentence-level properties (Miaschi and Dell'Orletta, 2020). Furthermore, Manning et al. (2020) explore to reconstruct the sentence tree structures given by linguists using a linear transformation of PTMs' embeddings and achieve promising results.

表征探测已广泛应用于分析从词嵌入到PTMs的NLP神经模型(Köhn, 2015; Ettinger et al., 2016; Shi et al., 2016; Adi et al., 2017; Conneau et al., 2018a; Hewitt and Manning, 2019; Glavaš and Vulić, 2021)。Liu et al. (2019)对11个语言任务进行了全面的探测实验,发现与之前的特定任务模型相比,大规模PTM给出的表示具有竞争力,这表明模型已经学习了有关tokens、chunk块和成对关系的知识。为了进一步研究PTMs如何表示关于句法、语义、局部和远程信息的句子结构,Tenney等人(2019b)设计了一种新的边缘探测任务,并在一组广泛的子句任务上研究了PTMs,结果表明PTMs具有很强的编码句法信息的能力,而对语义任务的改进很小。同样,一些研究也揭示了PTMs的强语法编码能力(Vilares et al., 2020; Warstadt and Bowman, 2020; Hewitt and Manning, 2019)。为了分析不同层次的功能,Jawahar等人(2019a)和Tenney等人(2019a)发现,PTMs编码的语言信息,底部是短语特征,中间是句法特征,顶部是语义特征。与非上下文表示(如word2vec)相比,PTMs的表示在编码句子级属性方面更好(Miaschi和Dell'Orletta, 2020)。此外,Manning等人(2020)探索利用PTMs嵌入的线性变换重构语言学家给出的句子树结构,并取得了很好的结果。
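The representation-probing setup described above is easy to sketch in code. The snippet below is only a toy illustration: the "frozen" hidden states are random tensors standing in for the output of a frozen PTM encoder, the probing labels are synthetic, and only the linear probe is trained.

```python
import torch
import torch.nn as nn

# Stand-in for frozen PTM hidden states: in practice these would come from, e.g.,
# a frozen BERT encoder; random vectors keep the sketch self-contained.
hidden_size, num_labels, num_examples = 768, 5, 256
features = torch.randn(num_examples, hidden_size)       # frozen representations
labels = torch.randint(0, num_labels, (num_examples,))  # probing-task labels (e.g., POS tags)

probe = nn.Linear(hidden_size, num_labels)              # the only trainable parameters
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()

accuracy = (probe(features).argmax(-1) == labels).float().mean()
print(f"probe accuracy on the (toy) probing task: {accuracy:.2f}")
```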

Besides representation probing, researchers try to uncover the structure and relation among different representations. Kim et al. (2020) propose to leverage the concept of Syntactic Distance to construct the constituency trees of sentences from word representations. Rosa and Mareček (2019) analyze how the deletion of one word in a sentence changes the representations of other words, to reveal the influence of one word on other words.

There are also several works on interpreting PTMs via attention matrices. Lin et al. (2019) quantitatively evaluate attention matrices for subject-verb agreement and anaphor-antecedent dependencies, and show that PTMs tend to encode positional information in lower layers and capture hierarchical information in higher layers. To better characterize the behaviors of PTMs' attention matrices, Htut et al. (2019) propose to take the maximum attention weight and compute the maximum spanning tree as two statistics. Based on the experimental results, they find that fine-tuning has little impact on the self-attention patterns.

除了表征探测之外,研究者还试图揭示不同表征之间的结构和关系。Kim等人(2020)提出利用句法距离的概念,从单词的表征构造句子的成分句法树。Rosa和Mareček (2019)分析了删除句子中的一个单词会如何改变其他单词的表征,以揭示一个单词对其他单词的影响。

也有一些通过注意力矩阵来解释PTMs的工作。Lin等人(2019)定量评估了主谓一致和指代-先行依赖关系的注意力矩阵,并表明PTMs倾向于在较低层编码位置信息,在较高层捕获层次信息。为了更好地刻画PTMs注意力矩阵的行为,Htut等人(2019)提出将最大注意力权重和最大生成树作为两个统计量。根据实验结果,他们发现微调对自注意力模式的影响很小。

Since PTMs can be directly used to generate tokens or estimate the probabilities of different sentences, it is intuitive to construct analysis tasks based on generation (Goldberg, 2019). Perturbed Masking (Wu et al., 2020) recovers syntactic trees from PTMs without any extra parameter, and the structures given by PTMs are competitive with a human-designed dependency schema in some downstream tasks. To analyze the gain of pre-training on estimating the probabilities of ungrammatical words, Schijndel et al. (2019) show that expanding the training corpus yields diminishing returns and that the training corpus would need to be unrealistically large to make PTMs match human performance.

World Knowledge. In addition to linguistic knowledge, PTMs also learn rich world knowledge from pre-training, mainly including commonsense knowledge and factual knowledge (Zhou et al., 2020b; Bouraoui et al., 2020).

For commonsense knowledge, Ettinger (2020) first evaluates PTMs' knowledge from the perspective of psycholinguistics and finds that the models perform well in the situation of shared category or role reversal but fail with challenging inferences and role-based events. Then, to extract commonsense from PTMs, Davison et al. (2019) propose to first transform relational triples into masked sentences and then rank these sentences according to the mutual information given by PTMs. In the experiments, the PTM-based extraction method without further training even generalizes better than current supervised approaches. Similarly, Da and Kasai (2019) also find that PTMs have learned various commonsense features in their representation space based on a series of probing tasks. In addition to the commonsense features/attributes, the implicit relations between different attributes are important, and Forbes et al. (2019) show that current PTMs' representations cannot model the implicit relations well, which requires further exploration.

由于PTMs可以直接用于生成tokens或估计不同句子的概率,因此基于生成构造分析任务是很直观的(Goldberg, 2019)。Perturbed Masking (Wu et al., 2020)在没有任何额外参数的情况下从PTMs中恢复语法树,并且PTMs给出的结构在一些下游任务中可以与人类设计的依赖模式相媲美。为了分析预训练在估计不合语法词的概率方面的收益,Schijndel等人(2019)表明,扩大训练语料库会产生递减的回报,而训练语料库需要大得不现实才能使PTMs匹配人类的表现。

世界知识。除了语言知识外,PTMs还通过预训练学习丰富的世界知识,主要包括常识知识事实知识(Zhou et al., 2020b;Bouraoui et al., 2020)。

对于常识知识,Ettinger (2020)首先从心理语言学的角度评估PTMs的知识,发现模型在共享类别或角色转换的情况下表现良好,但在具有挑战性的推理和基于角色的事件上表现不佳。然后,为了从PTMs中提取常识,Davison et al.(2019)提出首先将关系三元组转换成掩码句子,然后根据PTMs给出的互信息对这些句子进行排序。在实验中,在没有进一步训练的情况下,基于PTM的提取方法甚至比现有的监督方法具有更好的泛化效果。同样,Da和Kasai(2019)基于一系列探测任务也发现,PTMs在其表征空间中学习了各种常识性特征。除了常识性的特征/属性外,不同属性之间的隐式关系也很重要,Forbes等(2019)表明,当前的PTMs表征不能很好地建模隐式关系,这需要进一步的探索。

For factual knowledge, Petroni et al. (2019) propose to formulate relational knowledge generation as the completion of fill-in-the-blank statements. According to the experimental results, they find that PTMs significantly outperform previous supervised baselines on this task without any fine-tuning. However, the construction of these fill-in-the-blank statements is non-trivial. To extract more factual knowledge from PTMs, LPAQA (Jiang et al., 2020b) has been proposed to automatically search better statements/prompts through mining-based and paraphrasing-based methods. AutoPrompt (Shin et al., 2020) proposes to train discrete prompts for knowledge probing. In P-tuning (Liu et al., 2021b), the authors discover that better prompts lie in continuous embedding space rather than discrete space. P-tuning boosts the P@1 performance on LAMA to 64%, which is 20% higher than AutoPrompt. Moreover, Roberts et al. (2020) fine-tune PTMs for the task of open-domain question answering and find that fine-tuning can further benefit the knowledge generation of PTMs. However, Pörner et al. (2020) find that the success of knowledge generation may rely on learning neural stereotypical associations, i.e., a person with an Italian-sounding name will be predicted to be Italian by PTMs. For understanding numbers in texts, Wallace et al. (2019c) find that ELMo, a character-based model, captures numeracy the best among all pre-trained methods, while BERT, which uses sub-word units, is less exact. Wang et al. (2020a) investigate the knowledge stored in Transformer's feed-forward attention matrices and propose a framework to construct open knowledge graphs using PTMs.

对于事实知识,Petroni et al.(2019)提出将关系知识生成表述为填空语句的补全。根据实验结果,他们发现在没有任何微调的情况下,PTMs在这项任务上的表现明显优于以前的监督基线。然而,这些填空语句的构造并不简单。为了从PTMs中提取更多的事实知识,LPAQA (Jiang et al., 2020b)被提出通过基于挖掘和基于释义的方法自动搜索更好的语句/提示。AutoPrompt (Shin et al., 2020)提出训练离散提示以进行知识探测。在P-tuning (Liu et al., 2021b)中,作者发现连续嵌入空间中的提示比离散空间中的提示更好。P-tuning将LAMA上的P@1性能提高到64%,比AutoPrompt高20%。此外,Roberts等人(2020)对PTMs进行开放域问答任务的微调,发现微调可以进一步有利于PTMs的知识生成。然而,Pörner等人(2020)发现,知识生成的成功可能依赖于学习神经刻板印象联想,即,一个名字听起来像意大利人的人会被PTMs预测为意大利人。为了理解文本中的数字,Wallace等人(2019c)发现,基于字符的ELMo在所有预训练方法中捕捉数字能力最好,而使用子词单元的BERT则不那么精确。Wang et al. (2020a)研究了存储在Transformer前馈注意力矩阵中的知识,并提出了一个使用PTMs构建开放知识图谱的框架。
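A LAMA-style cloze query of the kind described above can be reproduced in a few lines with the HuggingFace fill-mask pipeline. This is only an illustration: the checkpoint name and the example sentence are arbitrary choices (the call downloads the model on first use), not the evaluation protocol of Petroni et al.

```python
from transformers import pipeline

# Cloze-style factual probing: the PTM fills the blank with its top predictions.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("Dante was born in [MASK]."):
    print(f"{candidate['token_str']:>12s}  {candidate['score']:.3f}")
```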

7.2 Robustness of PTMs鲁棒性

Recent works have identified severe robustness problems in PTMs using adversarial examples. Adversarial attacks aim to generate new samples that are mis-classified by models through small perturbations of the original inputs. For example, PTMs can be easily fooled by synonym replacement (Jin et al., 2020; Zang et al., 2020). Meanwhile, irrelevant artifacts such as form words can mislead PTMs into making wrong predictions (Niven and Kao, 2019; Wallace et al., 2019a). Current works mainly utilize the model predictions, prediction probabilities, and gradients of the models to search for adversarial examples. However, it is difficult to maintain the quality of adversarial examples generated by machines. Recently, human-in-the-loop methods (Wallace et al., 2019b; Nie et al., 2020) have been applied to generate more natural, valid, and diverse adversarial examples, which brings larger challenges and exposes more properties and problems of PTMs. In conclusion, the robustness of PTMs has become a serious security threat when people deploy PTMs for real-world applications.

最近的工作已经通过对抗样本发现了PTMs中严重的鲁棒性问题。对抗攻击的目的是通过对原始输入的微小扰动来生成被模型错误分类的新样本。例如,PTMs很容易被同义词替换所迷惑(Jin等人,2020;Zang等人,2020)。与此同时,形式词等不相关的线索(artifacts)可能会误导PTM做出错误的预测(Niven和Kao, 2019;Wallace等人,2019a)。目前的研究主要是利用模型预测、预测概率和模型梯度来搜索对抗样本。然而,很难保证由机器生成的对抗样本的质量。最近,人在回路方法(Wallace等人,2019b;Nie et al., 2020)已经被用于生成更自然、有效和多样化的对抗样本,这带来了更大的挑战,也暴露了PTMs更多的属性和问题。总之,当人们在实际应用中部署PTMs时,PTMs的鲁棒性已经成为一个严重的安全威胁。

7.3 Structural Sparsity of PTMs结构稀疏性

Following BERT, most PTMs adopt Transformer as the architecture backbone. Although people can easily train a deep Transformer and achieve significant improvements over previous works using CNNs and RNNs, Transformer meets the problem of over-parameterization. Researchers have shown that the multi-head attention structures are redundant in the tasks of machine translation (Michel et al., 2019), abstractive summarization (Baan et al., 2019), and language understanding (Kovaleva et al., 2019), i.e., when removing part of the attention heads, we can achieve better performance. This phenomenon is consistent with the observation in (Clark et al., 2019), where they find that most heads in the same layer have similar self-attention patterns. Furthermore, Kovaleva et al. (2019) conduct a qualitative and quantitative analysis of the information encoded by PTMs' heads. Their findings suggest that the attention behaviors of different heads can be categorized into a limited set of patterns. Besides the multi-head attention, several other works explore the sparsity of parameters. Gordon et al. (2020) show that low levels of pruning (30-40%) do not affect pre-training loss or the performance on downstream tasks at all. Targeting the sparsity during fine-tuning, Prasanna et al. (2020) validate the lottery ticket hypothesis on PTMs and find that it is possible to find sub-networks achieving performance that is comparable with that of the full model. Surprisingly, Kao et al. (2020) show that we can improve the performance by simply duplicating some hidden layers to increase the model capacity, which suggests that the redundant parameters may benefit fine-tuning.

BERT之后,大多数PTMs都采用Transformer作为架构主干。尽管人们可以轻松地训练一个深度Transformer,并比以前使用 CNN 和 RNN 的工作取得显着改进,但Transformer遇到了过度参数化的问题。研究表明,在机器翻译(Michel et al., 2019)、抽象摘要(Baan et al., 2019)和语言理解(Kovaleva et al., 2019)任务中,多头注意力结构是冗余的,即,当除部分注意力头时,我们可以获得更好的性能。这一现象与(Clark et al., 2019)的观察结果一致,他们发现同一层中的大多数头部都有相似的自注意力模式。此外,Kovaleva等人(2019)对PTMs头部编码的信息进行了定性和定量分析。他们的发现表明,不同头部的注意力行为可以被归类为一组有限的模式。除了多头注意力之外,其他的一些工作也探讨了参数的稀疏性识别。Gordon等人(2020)表明,低水平的修剪(30-40%)不会影响预训练的损失或下游任务的表现。在微调期间针对稀疏性,Prasanna等人(2020)验证了PTMs上的彩票假设(没有万能的,只有合适的),并发现有可能找到性能可与完整模型相媲美的子网络。令人惊讶的是,Kao等人(2020)表明,我们可以通过简单地复制一些隐藏层来提高模型容量,从而提高性能,这表明冗余参数可能有利于微调

7.4 Theoretical Analysis of PTMs理论分析

Since pre-training has achieved great success in deep learning, researchers try to investigate how pre-training works, especially unsupervised pre-training. In the early days of deep learning, people found that it is effective to train a deep belief network by greedy layer-wise unsupervised pre-training followed by supervised fine-tuning (Hinton et al., 2006). Recently, pre-training based on contrastive learning, including language modeling, has become the mainstream approach. In this section, we will introduce some theoretical explanatory hypotheses or frameworks for pre-training.

Erhan et al. (2010) propose two hypotheses to explain the effect of pre-training:

(1) better optimization and

(2) better regularization.

In the aspect of better optimization, the network with pre-training is closer to the global minimum compared to the models randomly initialized.

In the aspect of better regularization, the training error of PTMs is not necessarily better than that of random models while the test error of PTMs is better, which means better generalization ability.

Then, the experimental results lean towards the second hypothesis. They find that the PTM doesn’t achieve lower training error. Moreover, compared to other regularization approaches such as L1/L2, the unsupervised pre-training regularization is much better.

由于预训练在深度学习中取得了巨大的成功,研究者们开始尝试研究预训练的工作原理,特别是无监督预训练是如何起作用的。在深度学习的早期,人们发现通过贪婪的逐层无监督的预训练监督的微调来训练深度信念网络是有效的(Hin-ton et al., 2006)。近年来,包括语言建模在内的基于对比学习的预训练已成为主流方法。在本节中,我们将介绍一些用于预训练的理论解释性假设或框架

Erhan et al.(2010)提出了两个假设来解释预训练的效果:

(1)更好的优化

(2)更好的正则化

在优化方面,与随机初始化的模型相比,经过预训练的网络更接近全局最小值

在更好的正则化方面,PTMs的训练误差并不一定优于随机模型,而PTMs的测试误差则更好,这意味着PTMs具有更好的泛化能力

然后,实验结果倾向于第二种假设。他们发现PTM并没有达到较低的训练误差。此外,与L1/L2等其他正则化方法相比,无监督预训练正则化的效果要好得多

Towards the recent development of pre-training objectives, Saunshi et al. (2019) conduct a theoretical analysis of contrastive unsupervised representation learning. Contrastive learning treats pairs of texts/images appearing in the same context as semantically similar pairs and randomly sampled pairs as semantically dissimilar pairs. Then, the distance between a similar pair should be close and the distance between a dissimilar pair should be distant. In the prediction process of language modeling, the context and the target word are the similar pair and the other words are negative samples (Kong et al., 2020). Saunshi et al. (2019) first provide a new conceptual framework to bridge the gap between pre-training and fine-tuning. Specifically, they introduce the concept of latent classes, and the semantically similar pairs are from the same latent class. For example, the latent class can be "happy", covering all texts expressing happy sentiments. The latent classes cover all possible classes, and the classes defined by downstream tasks are from the set of latent classes. Then, they prove that the loss of contrastive learning is an upper bound of the downstream loss. Hence, when optimizing the pre-training loss, we can expect a lower loss in downstream tasks.

针对最近发展的预训练目标,Saunshi等人(2019)对对比无监督表征学习进行了理论分析。对比学习将出现在同一语境中的文本/图像对视为语义相似的对,将随机抽样的对视为语义不相似的对。那么,相似对之间的距离应该近,不相似对之间的距离应该远。在语言建模的预测过程中,语境(上下文)和目标词是相似的对,其他词是负样本(Kong et al., 2020)。Saunshi等人(2019)首先提供了一个新的概念框架,以弥补预训练和微调之间的差距。具体来说,他们引入了潜在类的概念,语义相似的对来自同一个潜在类。例如,潜在类可以是"happy",包含所有表达快乐情绪的文本。潜在类别涵盖所有可能的类别,下游任务定义的类别来自于潜在类别的集合。然后,他们证明了对比学习的损失是下游损失的上界。因此,在优化预训练损失时,我们可以预期下游任务的损失更低。
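The contrastive objective analyzed above is commonly instantiated as an InfoNCE-style loss, where in-batch positives sit on the diagonal of a similarity matrix and everything else acts as negatives. The sketch below is a generic illustration; the batch construction, temperature and dimensions are assumptions, not the setting of Saunshi et al.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """Each anchor is pulled toward its positive and pushed away from the other
    positives in the batch, which serve as the randomly sampled negatives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))         # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

# Toy usage: 16 pairs of 128-d representations (e.g., a context and its target word).
anchors, positives = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(anchors, positives).item())
```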

8 Future Direction未来方向

So far, we have comprehensively reviewed the past and present of PTMs. In the future, on the basis of existing works, PTMs can be further developed from the following aspects: architectures and pre-training methods (section 8.1), multilingual and multimodal pre-training (section 8.2), computational efficiency (section 8.3), theoretical foundation (section 8.4), modeledge learning (section 8.5), cognitive learning (section 8.6), and novel applications (section 8.7).

In fact, researchers have made lots of efforts in the above directions, and we have also introduced the latest breakthroughs in the previous sections. However, there are still some open problems in these directions that need to be further addressed. We mainly focus on discussing these open problems in this section.

到目前为止,我们已经全面回顾了PTMs的过去和现在。未来,在现有工作的基础上,PTMs可以从以下几个方面进一步发展:架构和预训练方法(第8.1节)、多语言和多模态预训练(第8.2节)、计算效率(第8.3节)、理论基础(第8.4节)、modeledge学习(第8.5节)、认知学习(第8.6节)以及新的应用(第8.7节)。

事实上,研究人员在上述方向做了很多努力,我们在前几节中也介绍了最新的突破。然而,在这些方向上仍有一些悬而未决的问题需要进一步解决。 在本节中,我们主要集中讨论这些未解决的问题。

8.1 Architectures and Pre-Training Methods架构和预训练方法

From the aspect of architectures and pre-training methods, we believe the following problems worth further exploring in the future:

New Architectures. Transformers have been proved to be an effective architecture for pre-training. However, the main limitation of Transformers is their computational complexity. Limited by the memory of GPUs, most current PTMs cannot deal with sequences containing more than 512 tokens. Therefore, it is important to search for more efficient model architectures to capture longer-range contextual information. However, the design of deep architectures is challenging, and we may seek help from some automatic methods, such as neural architecture search (NAS). Besides, although larger PTMs can usually lead to better performance, a practical problem is how to leverage these huge PTMs in some special scenarios, such as low-capacity devices and low-latency applications, where the efficiency of PTMs is a key factor. Moreover, different downstream tasks prefer different architectures. For example, the Transformer encoder is suitable for natural language understanding tasks while the Transformer decoder is suitable for natural language generation tasks. Therefore, we may need to carefully design task-specific architectures according to the type of downstream tasks.

从架构和预训练方法方面,我们认为以下问题值得今后进一步探讨:

新架构Transformers已经被证明是一个有效的预训练架构。然而,Transformer主要限制是其计算复杂度。受GPU内存的限制,目前大多数PTMs都不能处理包含超过512个以上token的序列。因此,寻找更高效的模型架构捕获长期的上下文信息是很重要的。然而,深度架构的设计具有挑战性,我们可以寻求一些自动方法的帮助,如神经架构搜索(NAS)。此外,尽管更大的PTMs通常可以带来更好的性能,但一个实际的问题是如何在一些特殊的场景中利用这些巨大的PTMs,比如低容量设备和低延迟应用程序,在这些场景中,PTMs的效率是一个关键因素。此外,不同的下游任务偏好不同的架构。例如,Transformer编码器适用于自然语言理解任务,而Transformer解码器适用于自然语言生成任务。因此,我们可能需要根据下游任务的类型仔细设计特定于任务的架构

New Pre-Training Tasks. The general-purpose PTMs are always our pursuit for learning the intrinsic universal knowledge of languages (even world knowledge). However, such PTMs usually need deeper architectures, larger corpora and challenging pre-training tasks. All these requirements further result in higher training costs. Moreover, training huge models is also a challenging problem, which needs sophisticated and efficient training techniques such as distributed training, mixed-precision training, etc. Therefore, a more practical direction is to design more efficient self-supervised pre-training tasks and training methods according to the capabilities of existing hardware and software. ELECTRA (Clark et al., 2020) is a good attempt towards this direction.

新的预训练任务。学习语言内在通用知识(甚至是世界知识)的通用PTMs一直是我们的追求。然而,这种PTMs通常需要更深的架构、更大的语料库和具有挑战性的预训练任务。所有这些要求进一步导致更高的训练成本。此外,训练大型模型也是一个具有挑战性的问题,需要复杂而高效的训练技术,如分布式训练、混合精度训练等。因此,根据现有硬件和软件的能力,设计更高效的自监督预训练任务和训练方法是一个更实际的方向。ELECTRA (Clark et al., 2020)是朝着这个方向的一个很好的尝试。

Beyond Fine-Tuning. Currently, fine-tuning is the dominant method to transfer the knowledge of PTMs to downstream tasks, but one deficiency is its parameter inefficiency: every downstream task has its own fine-tuned parameters. An improved solution is to fix the original parameters of PTMs and add small fine-tunable adaption modules for specific tasks. Thus, we can use a shared PTM to serve multiple downstream tasks. Recently, with the emergence of GPT-3, a novel genre of model tuning, namely prompt tuning, is getting more and more attention. By designing, generating and searching discrete (Petroni et al., 2019; Gao et al., 2021) or continuous (Liu et al., 2021b; Han et al., 2021; Lester et al., 2021) prompts and using MLM for specific downstream tasks, these models could

(1) bridge the gap between pre-training and fine-tuning, and thereby perform better on downstream tasks;

(2) reduce the computational cost on fine-tuning the tremendous amounts of parameters.

To sum up, prompt tuning is a promising way to stimulate the linguistic and world knowledge distributed in PTMs.

Reliability. The reliability of PTMs is also becoming an issue of great concern with the extensive use of PTMs in production systems. The studies of adversarial attacks (Li et al., 2020b,c; Zhang et al., 2021c) against PTMs help us understand their capabilities by fully exposing their vulnerabilities. Adversarial defenses (Si et al., 2020; Yao et al., 2021; Li and Qiu, 2021) for PTMs are also promising, which can improve the robustness of PTMs and make them immune against adversarial attacks. Overall, as a key component in many NLP applications, the interpretability and reliability of PTMs remain to be further explored, which will help us understand how PTMs work and provide guidance for better use and further improvement of PTMs.

超越微调。目前,微调是将PTMs的知识迁移到下游任务的主要方法,但其不足之处是参数效率低下:每个下游任务都有自己的一套微调参数。一种改进的解决方案是固定PTMs的原始参数,并针对特定任务添加小型可微调的适配模块。这样,我们就可以使用一个共享的PTM来服务多个下游任务。近年来,随着GPT-3的出现,一种新颖的模型调优方式,即提示调优(prompt tuning),受到了越来越多的关注。通过设计、生成和搜索离散(Petroni et al., 2019; Gao et al., 2021)或连续(Liu et al., 2021b; Han et al., 2021; Lester et al., 2021)提示,并将MLM用于特定的下游任务,这些模型可以

(1)弥补预训练和微调之间的差距,从而在下游任务上表现更好;

(2)降低了对海量参数进行微调的计算成本

综上所述,提示调优是一种很有前途的方法,可以激发分布在PTMs中的语言知识和世界知识。
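As a concrete, hedged sketch of the "fix the PTM and add small fine-tunable adaption modules" idea above, the bottleneck adapter below keeps all PTM parameters frozen and trains only a tiny residual MLP per task. The hidden and bottleneck sizes, and the commented-out freezing loop, are illustrative assumptions rather than any specific paper's configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small residual MLP inserted into a frozen PTM layer,
    so only these few parameters are updated per downstream task."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

# Hypothetical usage: freeze the PTM, train only the adapters (and a task head).
# for p in pretrained_model.parameters():
#     p.requires_grad = False
adapter = Adapter()
h = torch.randn(2, 16, 768)        # hidden states from one frozen Transformer layer
print(adapter(h).shape)            # (2, 16, 768), ready to feed into the next layer
```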

可靠性。随着PTMs在生产系统中的广泛使用,PTMs的可靠性也成为一个备受关注的问题。关于PTMs的对抗性攻击的研究(Li et al., 2020b,c;Zhang等人,2021c)通过充分暴露PTMs的弱点,帮助我们了解它们的能力。对抗性防御 (Si et al., 2020; Yao et al., 2021; Li and Qiu, 2021)对PTMs的研究也很有前景,这可以提高PTMs的鲁棒性,并使其免受对抗性攻击。总的来说,作为许多NLP应用的关键组件,PTM的可解释性可靠性仍有待进一步探索,这将有助于我们理解PTM的工作原理,并为更好地使用和进一步改进PTM提供指导。

8.2 Multilingual and Multimodal Pre-Training多语言、多模态的预训练

Although multimodal and multilingual PTMs have witnessed numerous advances in the last two years, they still have the following ongoing research lines:

More Modalities. In addition to image and text, video and audio can also be exploited for multimodal pre-training. The main challenge thus lies in how to model the temporal contexts involved in these two modalities. In particular, for large-scale pre-training over video-text pairs, the conventional self-supervised learning methods are not suitable due to their high computational costs. To handle this problem, it is important to develop more effective and efficient self-supervised learning methods for more complex modalities.

More Insightful Interpretation. It is still unknown why bridging vision and language works. For example, regardless of the advantages brought by multimodal pre-training, does it lead to any harm to the single modality (image or text)? If the answer is yes, can we overcome this drawback during multimodal pre-training? Along this research line, the latest visualization tools for deep learning can be exploited for the interpretation of multimodal pre-training.

尽管多模态和多语言的PTMs在过去两年中取得了许多进展,但它们仍有以下正在进行的研究方向:

更多的模态。除了图像和文本,还可以利用视频和音频进行多模态预训练。因此,主要的挑战在于如何为这两种模态所涉及的时间上下文建模。特别是对于视频-文本对的大规模预训练,传统的自监督学习方法由于计算量大而不适用。要处理这个问题,重要的是为更复杂的模态开发更有效、更高效的自监督学习方法。

更深刻的解释。现在还不知道为什么视觉和语言之间的桥梁能起作用。例如,多模态预训练虽然有优势,但是否会对单一模态(图像或文本)造成伤害?如果答案是肯定的,我们能否在多模态预训练中克服这个缺点?在这一研究方向上,可以利用最新的深度学习可视化工具来解释多模态预训练。

More Downstream Applications. It is well-known that multimodal pre-training can be applied to image-text retrieval, image-to-text generation, text-to-image generation and other downstream tasks. However, it is still challenging to find a "true" real-world application scenario for multimodal pre-training, since many effective engineering tricks can be leveraged instead (even with less cost). A closer collaboration with the industry is thus needed.

Transfer Learning. Currently, to make multimodal multilingual models handle different languages, data for each language is required during pre-training. It is not flexible to add unseen languages during pre-training. Therefore, a new pre-training framework should be explored to easily adapt to those unseen languages. Besides, current multimodal multilingual models are not able to process audio data. For example, to translate English audio to Chinese audio, we need to first transfer English audio to English text by an extra speech recognition system. After translation with a cross-lingual model, we need to further transfer Chinese text to Chinese audio by an extra text-to-speech tool. How to directly transfer the source language audio to the target language text or target language audio by multimodal multilingual PTMs is also worth exploring.

更多的下游应用。众所周知,多模态预训练可以应用于图文检索、图像到文本生成、文本到图像生成等下游任务。然而,要为多模态预训练找到一个"真实"的实际应用场景仍然很有挑战性,因为往往可以用许多有效的工程技巧来替代(甚至成本更低)。因此,需要与产业界进行更密切的合作。

迁移学习。目前,多模态多语言模型要处理不同的语言,在预训练时需要每种语言的数据,而在预训练中加入未见过的语言并不灵活。因此,应该探索一种新的预训练框架,以便轻松适应那些未见过的语言。此外,当前的多模态多语言模型还不能处理音频数据。例如,要将英语音频翻译成汉语音频,我们需要首先通过一个额外的语音识别系统将英语音频转换成英语文本。在采用跨语言模型进行翻译后,我们还需要通过额外的文本转语音工具将中文文本进一步转换为中文音频。如何通过多模态多语言PTMs将源语言音频直接转换为目标语言文本或目标语言音频也值得探索。

8.3 Computational Efficiency计算效率

Deep learning models have become increasingly complicated and large (Devlin et al., 2019; Brown et al., 2020; Kaplan et al., 2020; Fedus et al., 2021) in recent years. The novel requirements of large-scale deep learning models bring severe challenges to the existing deep learning frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019), which were designed in the early days without foreseeing the emerging requirements such as model/pipeline parallelism of large models (Brown et al., 2020; Huang et al., 2019b; Wang et al., 2019). To develop more efficient frameworks, the following directions are helpful.

近年来,深度学习模型已经变得越来越复杂和庞大(Devlin et al., 2019; Brown et al., 2020; Kaplan et al., 2020; Fedus et al., 2021)。大规模深度学习模型的新需求给现有的深度学习框架带来了严峻的挑战,如TensorFlow (Abadi等人,2016)和PyTorch (Paszke等人,2019),这些框架设计于早期,当时并没有预见到后来出现的需求,例如大型模型的模型/流水线并行 (Brown et al., 2020; Huang et al., 2019b; Wang et al., 2019)。为了开发更高效的框架,以下方向很有帮助。

Data Movement. Developing an efficient distributed deep learning framework faces various challenges. One has to carefully manage the data movement between devices, which may otherwise become the performance bottleneck (Narayanan et al., 2019; Jiang et al., 2020a). A well-defined parallelism strategy is needed to place and schedule computational tasks on inter-connected devices, by minimizing the communication cost, maximizing the computational and memory resources, and optimizing the computation-communication overlap. In the best case, this efficient parallelism strategy can be generated automatically.

Parallelism Strategies. Particular to the choice of parallelism strategy, data parallelism, model parallelism, pipeline parallelism, and various hybrid parallelism approaches can find their best usage depending on the structure of neural networks and the hardware configuration (Ben-Nun and Hoefler, 2019). Data parallelism is especially suitable for deep learning models with a relatively small set of parameters (usually less than tens of millions of parameters), where near-linear speed-up can be achieved when the back-propagation maximally overlaps with the gradient/parameter communication (Hashemi et al., 2019; Peng et al., 2019; Jiang et al., 2020a). Model parallelism and pipeline parallelism are for models with a more significant number of parameters, which probably cannot fit into a single device. In current practice, a user must thoroughly consider the network structure of a given deep learning model and the inter-device communication bandwidth to decide the most appropriate parallelism strategies or switch between different strategies (Shazeer et al., 2018).

数据移动。开发一个高效的分布式深度学习框架面临着各种挑战。开发者必须仔细管理设备之间的数据移动,否则这可能成为性能瓶颈(Narayanan et al., 2019; Jiang et al., 2020a)。需要一个定义良好的并行策略,在互连设备上放置和调度计算任务,以最小化通信成本、最大化计算和内存资源的利用,并优化计算与通信的重叠。在最好的情况下,这种高效的并行策略可以自动生成。

并行策略。具体到并行策略的选择,数据并行、模型并行、管道并行和各种混合并行方法可以根据神经网络的结构和硬件配置找到它们的最佳用途(Ben-Nun和Hoefler, 2019)。数据并行尤其适用于参数集相对较小(通常少于数千万个参数)的深度学习模型,当反向传播与梯度/参数通信最大限度重叠时,可以实现近线性加速(Hashemi等人,2019;Peng等,2019;Jiang等,2020a)。模型并行和管道并行适用于参数量更大的模型,这些模型可能无法放入单个设备中。在当前的实践中,用户必须充分考虑给定深度学习模型的网络结构和设备间的通信带宽,以决定最合适的并行策略,或在不同策略之间切换(Shazeer et al., 2018)。
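For the simplest of these strategies, data parallelism, a minimal PyTorch DistributedDataParallel sketch looks roughly like the following. It is a hedged illustration only: the model, data and launch command are placeholders, and a real setup would use the "nccl" backend on GPUs and a proper dataset/sampler.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    # One process per device; gradients are all-reduced across processes so that
    # back-propagation can overlap with gradient communication.
    dist.init_process_group(backend="gloo")   # "nccl" on GPU clusters
    model = DDP(nn.Linear(128, 10))           # toy model standing in for a PTM
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(10):
        x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                       # gradient all-reduce happens here
        optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    # Typically launched with `torchrun --nproc_per_node=N this_script.py`,
    # which sets RANK / WORLD_SIZE / MASTER_ADDR in the environment.
    train()
```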

Large-Scale Training. Given the poor support for model parallelism and pipeline parallelism by existing deep learning frameworks, some emerging open-source projects develop dedicated frameworks for large-scale training. For example, HugeCTR (Oldridge et al., 2020) is used for large-scale click-through rate estimation. Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) and DeepSpeed (Rajbhandari et al., 2021, 2020) target training large-scale NLP PTMs. InsightFace (ins, 2021) trains large-scale face recognition models. However, these frameworks are restricted to limited application cases and cannot serve as a general solution. Further, these approaches cannot work together to constitute a complete solution due to compatibility issues.

Wrappers and Plugins. Without a mechanism to support model parallelism and pipeline parallelism, one has to develop various libraries dedicated to some particular algorithms via inserting the data routing operations by hand between computing operations on top of existing frameworks. Further, communication and computation need to be manually overlapped to maximize the system throughput. Manually programming communication operations is prohibitively complicated and can only solve problems case by case, leading to a significant obstacle in applying parallelism strategies to new deep learning models. If communication operations can be automatically managed transparently to users by deep learning frameworks, more models and applications can benefit from distributed training.

大规模训练。由于现有的深度学习框架对模型并行和管道并行的支持不足,一些新兴的开源项目开发了用于大规模训练的专用框架。例如,HugeCTR (Oldridge et al., 2020)被用于大规模的点击率估计。Megatron-LM (Shoeybi et al., 2019;Narayanan et al., 2021)和DeepSpeed (Rajbhandari et al., 2021, 2020)的目标是训练大规模的NLP PTMs。InsightFace (ins, 2021)训练大规模的人脸识别模型。然而,这些框架仅限于有限的应用案例,不能作为通用的解决方案。此外,由于兼容性问题,这些方法不能一起工作以组成一个完整的解决方案。

包装器和插件。如果没有支持模型并行和管道并行的机制,则必须在现有框架之上,通过在计算操作之间手动插入数据路由操作,来开发专用于某些特定算法的各种库。此外,通信和计算需要手动重叠,以最大化系统吞吐量。手动编写通信操作非常复杂,只能逐个解决问题,这给将并行策略应用于新的深度学习模型带来了重大障碍。如果深度学习框架能够以对用户透明的方式自动管理通信操作,那么更多的模型和应用程序就可以从分布式训练中受益。

To support more complicated parallelism strategies, many schemes are used as wrappers or plugins based on some mainstream deep learning frameworks such as TensorFlow and PyTorch. Mesh-TensorFlow (Shazeer et al., 2018), FlexFlow (Jia et al., 2019), OneFlow (one, 2021), MindSpore (min, 2021) and GShard (Lepikhin et al., 2021) provide APIs for developers to express a wide range of parallel computation patterns for different components of deep neural models. The SBP configuration in OneFlow could still be too complex for users to set. However, directly programming with communication primitives for a different kind of parallelism is more complicated. OneFlow transforms the manual programming into just setting SBP signatures. Moreover, in OneFlow, the user could just set the SBP signatures of a subset of operations instead of the whole set, and leave the rest of the SBP to be inferred with heuristic approaches like GShard (Lepikhin et al., 2021), in which users provide some initial annotations or use default annotations as seeds, and then the algorithm propagates the sharding information to the un-annotated tensors. The approach in FlexFlow (Jia et al., 2019) can also be used here. The automatic scheduling of parallelism strategies is the trend of distributed training in the future.

为了支持更复杂的并行策略,基于TensorFlow和PyTorch等主流深度学习框架,许多方案被用作包装器或插件。Mesh-TensorFlow (Shazeer et al., 2018)、FlexFlow (Jia et al., 2019)、OneFlow (one, 2021)、MindSpore (min, 2021)和GShard (Lepikhin et al., 2021)为开发人员提供API,以表达深度神经模型不同组件的广泛并行计算模式。OneFlow中的SBP配置对于用户来说可能仍然过于复杂。然而,直接使用通信原语对不同类型的并行进行编程则更为复杂。OneFlow将手动编程转换为只需设置SBP签名。此外,在OneFlow中,用户可以只设置一部分操作的SBP签名,而不是整个集合,其余的SBP由类似GShard (Lepikhin等人,2021)的启发式方法推断:用户提供一些初始注释或使用默认注释作为种子,然后算法将分片信息传播到未注释的张量。FlexFlow (Jia et al., 2019)中的方法也可以在这里使用。并行策略的自动调度是未来分布式训练的发展趋势。

8.4 Theoretical Foundation理论基础

In this subsection, we analyze the future directions in a more fundamental way. In the aspect of theoretical foundation, we discuss the following research problems.

Uncertainty. One under-addressed issue with PTMs (as well as other deep neural networks) is that they are often over-confident in predictions, i.e., these models do not know what they do not know. For instance, GPT-3 can be used to answer questions with promising performance on benchmark datasets. However, if you ask a simple question like "How many eyes does my foot have?", GPT-3 would certainly produce an answer like "Your foot has two eyes", which looks counter-intuitive.4 Of course, the above question is not often asked by normal human beings. It is generally a challenging task to deal with such out-of-distribution (OOD) data in machine learning.

To address the above challenge, one promising direction is to adopt Bayesian methods that explore probabilistic tools to capture the uncertainty of both data and model (also known as aleatoric uncertainty and epistemic uncertainty respectively) (Der Kiureghian and Ditlevsen, 2009) or derive some testing statistics. Such uncertainty or statistics is helpful to detect outliers (Wang et al., 2020f). Recently, much work has been done on the theory, algorithms and programming libraries of Bayesian deep learning, which conjoins Bayesian methods and deep networks (e.g., see (Shi et al., 2017) for more details). Such progress can be further extended to large-scale PTMs to properly characterize uncertainty and avoid over-confident outputs. Of course, improving the computational efficiency of Bayesian deep learning is a key factor to address the above challenge.

在本小节中,我们将以更基本的方式分析未来的发展方向。在理论基础方面,我们讨论了以下几个研究问题。

不确定性。PTMs(以及其他深度神经网络)的一个尚未充分解决的问题是,它们在预测时经常过于自信,即,这些模型不知道自己不知道什么。例如,GPT-3可用于回答问题,并在基准数据集上取得不错的表现。然而,如果你问一个简单的问题,比如"我的脚有几只眼睛?",GPT-3肯定会给出像"你的脚有两只眼睛"这样的答案,这看起来违反直觉4。当然,上述问题一般人不会经常问。在机器学习中,如何处理这些分布外(OOD)数据是一项具有挑战性的任务。

为了解决上述挑战,一个有希望的方向是采用贝叶斯方法,探索概率工具来捕获数据和模型的不确定性(也分别称为偶然不确定性和认知不确定性)(Der Kiureghian和Ditlevsen, 2009),或导出一些检验统计量。这种不确定性或统计量有助于检测离群值(Wang et al., 2020f)。近年来,在贝叶斯深度学习的理论、算法和编程库方面已有大量工作,贝叶斯深度学习将贝叶斯方法和深度网络结合在一起(详见(Shi et al., 2017))。这些进展可以进一步扩展到大规模的PTMs,以恰当地刻画不确定性并避免过度自信的输出。当然,提高贝叶斯深度学习的计算效率是解决上述挑战的关键因素。
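A lightweight, commonly used approximation of such epistemic uncertainty is Monte Carlo dropout: keep dropout active at inference and read the spread of repeated predictions as an uncertainty signal. The sketch below is illustrative only; the classifier and inputs are toy stand-ins rather than a real PTM head, and the number of samples is an arbitrary assumption.

```python
import torch
import torch.nn as nn

# A tiny classifier with dropout; keeping dropout active at test time (MC dropout)
# gives a crude approximation of Bayesian predictive uncertainty.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.1), nn.Linear(256, 2))

def mc_dropout_predict(model, x, num_samples=30):
    model.train()                 # keep dropout on, unlike standard inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(num_samples)])
    return probs.mean(0), probs.std(0)   # predictive mean and a simple uncertainty score

x = torch.randn(4, 768)           # e.g., pooled PTM representations of 4 inputs
mean, std = mc_dropout_predict(model, x)
print(mean, std)                  # a high std can flag out-of-distribution inputs
```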

Generalization and Robustness. Another important issue with PTMs is generalization. As an important advancement of deep learning, PTMs inherit the advantages as well as the challenges of deep neural networks. It has been observed that classical learning theory is not sufficient to understand the behavior of deep networks (Zhang et al., 2017), thereby calling for new tools in learning theory. As for PTMs, besides the theoretical understanding of the neural models themselves (e.g., Transformer and BERT), new questions arise. For example, it is important to theoretically understand the roles of pre-training in improving the generalization of downstream tasks. The recent work (Saunshi et al., 2019) provides a fruitful attempt at understanding contrastive learning under particular assumptions. However, it is still largely open to analyze PTMs under more realistic settings.

4 More examples of the Turing test of GPT-3 can be found at "Giving GPT-3 a Turing Test".

As we mentioned before, the adversarial robustness also raises new questions. In previous work, it was shown that a higher sample complexity is needed in order to achieve adversarial robustness for neural networks (Schmidt et al., 2018). Such analysis has inspired further improvements (e.g., (Pang et al., 2020)). However, it is generally unknown how large-scale PTMs can help in this aspect. Are there effective ways to explore PTMs as extra data resources to improve the robustness of downstream tasks? Also, the robustness of PTMs themselves is an unsolved issue, as mentioned before.

泛化性和鲁棒性。PTMs的另一个重要问题是泛化。作为深度学习的一个重要进展,PTMs既继承了深度神经网络的优点,也继承了其挑战。已有研究发现,经典学习理论不足以理解深度网络的行为(Zhang et al., 2017),因此需要在学习理论中引入新的工具。对于PTMs,除了对神经模型本身(如Transformer和BERT)的理论理解之外,还出现了新的问题。例如,从理论上理解预训练在提高下游任务泛化方面的作用是很重要的。最近的研究(Saunshi et al., 2019)在理解特定假设下的对比学习方面进行了卓有成效的尝试。然而,在更现实的设定下分析PTMs在很大程度上仍是开放问题。

4 更多GPT-3图灵测试的例子可参见"Giving GPT-3 a Turing Test"。

正如我们之前提到的,对抗鲁棒性也提出了新的问题。在之前的工作中,研究表明,为了实现神经网络的对抗鲁棒性,需要更高的样本复杂度(Schmidt et al., 2018)。这样的分析激发了进一步的改进(例如,(Pang et al., 2020))。然而,大规模PTMs能在这方面提供多大帮助,目前基本未知。是否有有效的方法将PTMs作为额外的数据资源来提高下游任务的鲁棒性?此外,正如前面提到的,PTMs本身的鲁棒性也是一个尚未解决的问题。

8.5 Modeledge Learning

As introduced in section 7, PTMs can achieve a surge of improvements for a wide range of NLP tasks because they learn versatile knowledge from large unlabeled corpora. As opposed to the knowledge represented by discrete symbols, which is interpretable to human beings, the knowledge stored in PTMs is represented as real-valued vectors. For example, given a triple ⟨h, r, t⟩ in a knowledge graph, it is easy to know that the head entity h has a relation r to the tail entity t. In contrast, it is hard to know what a representation produced by a PTM means. Therefore, we can refer to the knowledge stored in PTMs as "modeledge", which is distinguished from the discrete symbolic knowledge formalized by human beings.

正如第7节所介绍的那样,PTMs可以在大量的NLP任务中实现大幅度的改进,因为它们从大型未标记语料库中学习各种各样的知识。与人类可以理解的离散符号表示的知识相反,存储在PTM中的知识表示为实值向量。例如,给定知识图谱中的三元组⟨h, r, t⟩,很容易知道头实体h与尾实体t之间有关系r;相反,我们很难知道PTM产生的一个表示意味着什么。因此,我们可以将存储在PTMs中的知识称为"modeledge",以区别于人类形式化的离散符号知识。

Knowledge-Aware Tasks. While the use of symbolic knowledge is effective, it is time-consuming and labor-intensive to manually organize this discrete knowledge, such as building various knowledge bases. With the rapid advance of research on PTMs, there emerge various PTMs such as GPT, BERT and BART. More and more researchers have probed into what knowledge PTMs learn from the data, and why they perform so well on downstream tasks (Jawahar et al., 2019b; Ethayarajh, 2019). Petroni et al. (2019) state that PTMs can be seen as knowledge bases and study how to apply PTMs to the knowledge completion task. Ethayarajh (2019) also claims that PTMs would be open knowledge graphs and proposes an unsupervised method to build knowledge graphs based on PTMs. From all these knowledge-aware tasks, we can find that a wealth of human knowledge is captured by PTMs and stored in the form of modeledge. How to stimulate the modeledge of PTMs is worth further exploring in the future.

知识感知任务。虽然符号知识的使用是有效的,但手工组织这些离散的知识(例如建立各种知识库)既费时又费力。随着PTM研究的迅速推进,出现了GPT、BERT和BART等多种PTM,越来越多的研究者开始探究PTM从数据中学习到了什么知识,以及它们为什么在下游任务中表现如此出色(Jawahar et al., 2019b;Ethayarajh, 2019)。Petroni等(2019)指出,PTMs可以被视为知识库,并研究如何将PTM应用于知识补全任务。Ethayarajh (2019)也提出PTMs可以作为开放的知识图谱,并提出了一种基于PTMs构建知识图谱的无监督方法。从所有这些知识感知任务中,我们可以发现,PTMs捕获了大量的人类知识,并以modeledge的形式存储。如何激发PTMs的modeledge是值得进一步探讨的问题。

Modeledge Storage and Management. As existing PTMs are built on varying architectures and may be trained with different corpora, they contain diverse modeledge. As a result, how to store and manage various continuous modeledge in PTMs becomes a new challenge. There are two kinds of straightforward ideas.

The first is to pre-train a huge model on extra-large scale data. Then, PTMs will have the extraordinary ability to cover almost all modeledge in existing PTMs. This method is simple and effective while it requires extremely high computational power and storage resources. For example, GPT-3 uses about 175 billion parameters.

The second is to combine multiple models into one large model based on the mixture of experts (MoE) (Jacobs et al., 1991). For example, Fedus et al. (2021) improve MoE to propose Switch Transformers. This method makes it easy to incorporate new models, but the requirement of memory grows as the number of models increases.

Considering that there are both similarities and differences among existing PTMs, we have an important question that needs to be answered: is it possible to build a universal continuous knowledge base (UCKB) that stores modeledge from various PTMs? The UCKB can not only store continuous modeledge imported from existing PTMs but also blend different modeledge and then export the fused modeledge to a model to make it more powerful. Chen et al. (2020a) first propose the concept of UCKB and make some preliminary explorations. They regard neural networks as parameterized functions and use knowledge distillation (Hinton et al., 2014) to import and export modeledge. UCKB overcomes the redundancy of model storage and stores the modeledge of various models into a common continuous knowledge base. However, how to design more effective architectures for the storage and interface of UCKB still remains challenging.

Modeledge存储和管理。由于现有的PTM构建在不同的架构上,并且可能使用不同的语料库进行训练,因此它们包含不同的modeledge。因此,如何在PTMs中存储和管理各种连续modeledge成为一个新的挑战。有两种简单的想法。

第一种是在超大规模的数据上预训练一个巨大的模型。这样,PTM将具有覆盖现有PTM中几乎所有modeledge的非凡能力。这种方法简单有效,但需要极高的计算能力和存储资源。例如,GPT-3使用了大约1750亿个参数。

第二种是将多个模型结合成一个基于专家混合的大模型(MoE) (Jacobs et al., 1991)。例如,Fe-dus等人(2021)改进了MoE,提出了Switch Transformers。该方法易于包含新的模型,但随着模型数量的增加,对内存的需求也会增加。
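The mixture-of-experts idea above can be illustrated with a tiny Switch-style layer that routes each token to a single expert feed-forward network. This is only a didactic sketch: the dimensions, the top-1 routing and the absence of load-balancing losses are simplifying assumptions, not the Switch Transformer implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Switch-style layer: a router picks one expert FFN per token, so capacity
    grows with the number of experts while per-token compute stays roughly flat."""
    def __init__(self, d_model=64, d_ff=128, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)
        top_gate, top_idx = gates.max(dim=-1)   # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                 # tokens routed to expert i
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)          # (10, 64)
```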

考虑到现有的PTM既有相似之处,也有不同之处,我们有一个重要的问题需要回答:是否有可能构建一个通用连续知识库(UCKB)来存储来自各种PTM的modeledge?UCKB不仅可以存储从现有PTM导入的连续modeledge,还可以混合不同的modeledge,然后将融合后的modeledge导出到一个模型中,使其功能更加强大。Chen et al. (2020a)首先提出了UCKB的概念,并做了一些初步的探索。他们将神经网络视为参数化函数,利用知识蒸馏(Hinton et al., 2014)来导入和导出modeledge。UCKB克服了模型存储的冗余性,将各种模型的modeledge存储到一个通用的连续知识库中。然而,如何为UCKB的存储和接口设计更有效的架构仍然是一个挑战。

8.6 Cognitive and Knowledgeable Learning认知和知识学习

Making PTMs more knowledgeable is an important topic for the future of PTMs. We divide the future development of knowledgeable PTMs into the following three approaches:

(1) Knowledge Augmentation. For an input text, there is rich related external knowledge, which can be used to augment the input. Considering that the formats of knowledge and plain text are very different, it is important to bridge the gap between text representations and knowledge representations (including symbols or vectors) and use their information uniformly as input. The solution to this problem requires both unified model architectures and knowledge-guided pre-training objectives.

(2) Knowledge Support. Current model architectures are manually designed and usually very regular. With prior knowledge about the input, we can train different sub-modules to process different kinds of input, which may accelerate the process of training and inference and benefit the model efficiency. This process is similar to human behavior, where different brain regions correspond to different activity functions.

(3) Knowledge Supervision. Knowledge bases store large amounts of structured data, which can be used as a complementary source during pre-training. By learning from both knowledge bases and large-scale corpora, PTMs can have better language understanding and generation abilities compared to only using plain text.

Through these three directions, we hope the future PTMs can easily understand the meanings beyond words and achieve better performance on various downstream tasks.

让PTMs知识更加渊博是 PTM 未来的一个重要主题。我们将知识型PTMs的未来发展分为以下三种方式:

(1)、知识扩充。对于一个输入文本,存在丰富的相关外部知识,可以用来扩充输入。考虑到知识和纯文本的格式差异很大,重要的是要弥合文本表示和知识表示(包括符号或向量)之间的差距,并统一使用它们的信息作为输入。解决这一问题需要统一的模型架构和知识引导的预训练目标。

(2)、知识支撑。当前的模型架构是手工设计的,并且通常非常规整。有了对输入的先验知识,我们可以训练不同的子模块来处理不同类型的输入,这样可以加快训练和推理的过程,提高模型的效率。这一过程类似于人类的行为:不同的大脑区域对应不同的活动功能。

(3)、知识监督。知识库存储了大量的结构化数据,可以在预训练期间作为补充来源使用。通过同时从知识库和大规模语料库中学习,与仅使用纯文本相比,PTM可以具有更好的语言理解和生成能力。

通过这三个方向,我们希望未来的PTMs能够更容易地理解文字之外的含义,并在各种下游任务上取得更好的表现。

In terms of cognitive PTMs, we believe the following approaches would be helpful:

Cognitive Architecture. Since neural networks are inspired by the micro structure of the human neural system, it is expected to see how the macro function and organization of the human cognitive system can enlighten the design of the next generation of intelligence systems, such as the Global Workspace Theory (GWT). The success of CogQA and CogLTX may provide some thoughts on this challenge.

Explicit and Controllable Reasoning. While deep learning has achieved success in many perceptive tasks, how to conduct complex decision making and efficient multi-step reasoning is still unsolved, which may require machines to automatically plan the decision-making process into a cognitive graph and do explicit reasoning over the factors in the graph as humans do. Methods such as Inverse Prompting (Zou et al., 2021), which shows a strong ability in controlling theme-related text generation, may provide some thoughts.

Interactions of Knowledge. Though our PTMs are getting bigger and more general, what knowledge they have learned from pre-training is largely unexplored. Moreover, since our brains work with the collaboration of different function zones, it is important to see if our PTMs have shaped different inner function modules and how they would interact with each other.

认知型PTMs方面,我们相信以下方法会有所帮助:

(1)、认知架构。由于神经网络的灵感来自于人类神经系统的微观结构,因此人们希望看到人类认知系统的宏观功能和组织方式能够启发下一代智能系统的设计,如全局工作空间理论(GWT)。CogQA和CogLTX的成功可能为这一挑战提供一些思路。

(2)、显式和可控推理。虽然深度学习在许多感知任务中取得了成功,但如何进行复杂的决策和高效的多步推理仍然没有解决,这可能需要机器自动将决策过程规划成认知图谱,并像人类一样对图中的因素进行显式推理。比如反向提示(Inverse Prompting)(Zou et al., 2021)这样在控制主题相关文本生成方面表现出色的方法,可以提供一些思路。

(3)、知识的相互作用。虽然我们的PTMs越来越大、越来越通用,但它们从预训练中学到了什么知识,在很大程度上还没有被探索。此外,由于我们的大脑是通过不同功能区域的协作来工作的,所以重要的是要看看我们的PTMs是否形成了不同的内部功能模块,以及它们会如何相互作用。

8.7 Applications应用

PTMs have been successfully applied in a wide variety of domains and tasks. In this section, we will highlight some of these applications.

Natural Language Generation. Many natural language generation tasks have been dominated by PTMs, such as GPT-2, BART, T5, UniLM and many more. These tasks include machine translation, summarization, dialog generation, story generation, poetry generation and other long text generation. With the prevalent trend of PTMs, the backbone models have moved from CNNs/RNNs to Transformers or Transformer-based PTMs. PTMs have also been successfully applied to multimodal generation. Trained on text-image parallel data, these models have been shown to be strong in applications such as visual question answering, image-to-text generation and text-to-image generation. As large-scale PTMs have been trained on such large-scale data, they have innate advantages for natural language generation, particularly low-resource natural language generation.

Dialog Systems. Many recent open-domain dialog systems are built upon large-scale Transformer structures. Examples include Meena (Adiwardana et al., 2020), Blender (Roller et al., 2021), CDial-GPT (Wang et al., 2020e), Plato (Bao et al., 2020) and Plato-2 (Bao et al., 2021), which are trained on large-scale conversation data, commonly with the seq2seq framework. These models have shown capabilities of delivering natural and engaging conversations, some of which have been reported to be close to human-level performance (Adiwardana et al., 2020). However, dialog-specific pre-training tasks are yet to be explored, compared to pre-training tasks for other applications.

PTMs已经成功地应用于各种领域和任务中。在本节中,我们将重点介绍其中的一些应用程序。

(1)、自然语言生成。许多自然语言生成任务由PTMs主导,如GPT-2、BART、T5、UniLM等。这些任务包括机器翻译、摘要、对话生成、故事生成、诗歌生成和其他长文本生成。随着PTMs的流行,骨干模型已经从CNN/RNN转移到Transformer或基于Transformer的PTMs。PTMs也已成功地应用于多模态生成。这些模型在文本-图像平行数据上训练,在视觉问答、图像到文本生成和文本到图像生成等应用中表现出很强的能力。由于大规模的PTM已经在如此大规模的数据上进行了训练,因此它们对于自然语言生成具有先天的优势,特别是低资源场景下的自然语言生成。

(2)、对话系统。许多最近的开放域对话系统都是建立在大规模Transformer结构上的。这些例子包括Meena (Adiwardana et al., 2020)、Blender (Roller et al., 2021)、CDial-GPT (Wang et al., 2020e)、Plato (Bao et al., 2020)和Plato-2 (Bao et al., 2021),它们通常使用seq2seq框架在大规模对话数据上训练。这些模型已经显示出了进行自然且引人入胜的对话的能力,其中一些已被报道接近人类水平的表现(Adiwardana等人,2020年)。然而,与其他应用的预训练任务相比,对话特定的预训练任务还有待探索。

Domain-Specific PTMs. When large-scale domain-specific corpora are cheaply available, we can train domain-specific PTMs on such data. Some notable works include BioBERT (Lee et al., 2020) and SciBERT (Beltagy et al., 2019), which are trained respectively on the biological and scientific literature text. These models are expected and verified to learn more domain-specific knowledge and language use than those trained on the general text. Such domain expertise is usually regarded as important for solving many domain-specific problems.

Domain Adaptation and Task Adaptation. Large-scale PTMs learn general knowledge from large-scale general text, providing a good initial point to further learn domain-specific knowledge by fine-tuning or other techniques. Although PTMs are becoming larger and larger, the domain-specific data are always limited. Therefore, domain adaptation is becoming crucial for domain-specific applications. It has been evident that the simple fine-tuning of large-scale PTMs is not sufficient for domain-specific applications (Gururangan et al., 2020; Ke et al., 2020). The most essential reason for this is the distribution shift: the data distribution in a specific domain may be substantially different from that in the general pre-training text. Another important issue for the success of domain-specific applications is task adaptation. Most often, domain applications have a small set of labeled data, which can empower supervised learning to learn domain expertise more efficiently. However, for super-large PTMs, simply fine-tuning on labeled data seems to be neither efficient in computation nor effective in performance. Thus, how to bridge the gap between pre-training and task-specific fine-tuning becomes crucial. Moreover, efficient and effective task-specific fine-tuning is also an important research direction for the future application of PTMs.

(3)、特定领域的PTMs。当大规模的特定领域的语料库可以廉价获得时,我们可以在这样的数据上训练特定于领域的PTMs。一些著名的工作包括BioBERT (Lee et al., 2020)和SciBERT (Beltagy et al., 2019),它们分别针对生物和科学文献文本进行训练。与在一般文本上训练的模型相比,这些模型被期望和验证能够学习更多特定领域的知识语言使用。这种领域专业知识通常被认为对于解决许多特定领域的问题很重要。

(4)、领域适配和任务适配。大规模PTM从大规模通用文本中学习通用知识,为通过微调或其他技术进一步学习特定领域的知识提供了一个很好的起点。尽管PTMs变得越来越大,但特定领域的数据总是有限的。因此,领域适配对于特定领域的应用变得至关重要。很明显,对于特定领域的应用来说,对大规模PTMs进行简单微调是不够的(Gururangan等人,2020;Ke等,2020)。最根本的原因是分布偏移:特定领域的数据分布可能与一般的预训练文本的数据分布有很大的不同。特定领域应用成功的另一个重要问题是任务适配。通常情况下,领域应用只有一小部分标注数据,这可以使监督学习更有效地学习领域专业知识。然而,对于超大的PTMs,简单地在标注数据上进行微调,在计算上似乎效率低下,在性能上也不够有效。因此,如何在预训练和特定任务的微调之间架起桥梁变得至关重要。此外,高效且有效的特定任务微调也是PTM未来应用的重要研究方向。

9 Conclusion

In this paper, we take a look into the history of pre-training to indicate the core issue of PTMs, and meanwhile reveal the crucial position of PTMs in the AI development spectrum. Furthermore, we comprehensively review the latest efforts towards better PTMs, including designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. All these works contribute to the recent wave of developing PTMs. Although existing PTMs have achieved promising results, especially those large-scale PTMs showing amazing abilities in zero/few-shot learning scenarios, how to develop PTMs next is still an open question. The knowledge stored in PTMs is represented as real-valued vectors, which is quite different from the discrete symbolic knowledge formalized by human beings. We name this continuous and machine-friendly knowledge "modeledge" and believe that it is promising to capture the modeledge in a more effective and efficient way and stimulate the modeledge for specific tasks. We hope our view could inspire more efforts in this field and advance the development of PTMs.

在本文中,我们回顾了预训练的历史,指出了PTMs的核心问题,同时揭示了PTMs在AI发展图谱中的关键地位。此外,我们全面回顾了为实现更好的PTMs所做的最新努力,包括设计有效的架构、利用丰富的上下文、提高计算效率,以及进行解释和理论分析。所有这些工作都促成了最近一波开发PTM的浪潮。虽然现有的PTMs已经取得了很好的效果,尤其是那些在零样本/少样本学习场景中表现出惊人能力的大规模PTMs,但接下来如何开发PTMs仍然是一个悬而未决的问题。PTM中存储的知识以实值向量表示,与人类形式化的离散符号知识有很大不同。我们将这种连续的、对机器友好的知识命名为"modeledge",并相信以更有效、更高效的方式捕获modeledge,并针对特定任务激发modeledge,是很有前景的。我们希望我们的观点能够激发该领域的更多努力,并推动PTM的发展。

Note and Contribution

This paper originates from a 3-day closed-door workshop initiated by Jie Tang, Ji-Rong Wen and Minlie Huang held in Beijing WTown from January 1 to January 3, 2021, supported by the China Computer Federation (CCF). All authors of this paper organized or participated in this workshop, and this paper can be regarded as a summary and extension of the discussion in the workshop.

本文源于由唐杰、温继荣、黄民烈于2021年1月1日至3日在北京WTown举办的为期3天的闭门研讨会,该研讨会得到了中国计算机联合会(CCF)的支持。本文作者均组织或参与了本次研讨会,本文可视为本次研讨会讨论的总结和延伸。

The contributions of all authors are listed as follows: Zhiyuan Liu and Xu Han designed the structure of this paper, Xu Han drafted the abstract, Section 1 and Section 2, Ning Ding drafted Section 3, Xiao Liu and Jiezhong Qiu drafted Section 4, Yuqi Huo and Liang Zhang drafted Section 5, Yuxian Gu drafted Section 6, Zhengyan Zhang drafted Section 7. All faculty authors drafted various topics in Section 8, including Xipeng Qiu for Section 8.1, Ji-Rong Wen, Ruihua Song and Yang Liu for Section 8.2, Jinhui Yuan and Wentao Han for Section 8.3, Jun Zhu and Yanyan Lan for Section 8.4, Yang Liu for Section 8.5, Jie Tang and Zhiyuan Liu for Section 8.6, Minlie Huang and Jie Tang for Section 8.7. Wayne Xin Zhao and Xipeng Qiu provided comments to the manuscript, and Xu Han, Ning Ding and Zhengyan Zhang proofread the whole paper.

所有作者的贡献如下:Zhiyuan Liu和Xu Han设计了本文的结构,Xu Han起草了摘要、第1节和第2节,Ning Ding起草了第3节,Xiao Liu和Jiezhong Qiu起草了第4节,Yuqi Huo和Liang Zhang起草了第5节,Yuxian Gu起草了第6节,Zhengyan Zhang起草了第7节。各位教师作者起草了第8节的各个主题:Xipeng Qiu负责8.1节,Ji-Rong Wen、Ruihua Song和Yang Liu负责8.2节,Jinhui Yuan和Wentao Han负责8.3节,Jun Zhu和Yanyan Lan负责8.4节,Yang Liu负责8.5节,Jie Tang和Zhiyuan Liu负责8.6节,Minlie Huang和Jie Tang负责8.7节。Wayne Xin Zhao和Xipeng Qiu对稿件提出了意见,Xu Han、Ning Ding和Zhengyan Zhang校对了全文。
