How BERT Is Applied to Downstream Tasks

Google’s Bidirectional Encoder Representations from Transformers (BERT) is a large-scale pre-trained autoencoding language model developed in 2018. Its release has been described as the NLP community’s “ImageNet moment”, largely because of how adept BERT is at downstream NLP language-understanding tasks with very little additional fine-tuning (usually only 2–4 epochs).


[Figure] Source: Devlin et al. (2019)

For context, traditional word embeddings (e.g. word2vec and GloVe) are non-contextual. They represent each word token with a single static vector, and learn from word co-occurrence rather than the sequential context of words. This can be problematic when words are polysemous (i.e. where the same word has multiple different meanings), which is very common in law.


[Figure] Source: Google Developers
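To make the static-embedding point concrete, here is a minimal sketch, assuming the gensim library and its downloadable GloVe vectors (neither is mentioned in the original article): whichever sentence “lemon” appears in, the lookup returns the same fixed vector.

```python
# Minimal sketch: non-contextual embeddings assign one static vector per word.
# Assumes gensim is installed and can download the "glove-wiki-gigaword-100" vectors.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")      # pre-trained, non-contextual vectors

vec = glove["lemon"]                             # the same vector, regardless of context
print(vec.shape)                                 # (100,)
print(glove.most_similar("lemon", topn=5))       # neighbours reflect co-occurrence statistics only
```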

For instance, a single static, context-free vector will be used to represent the word “lemon” in both the sentences “the lemon is sour” and “that car is a lemon” (the latter being a nod to lemon laws, which protect consumers when new vehicles turn out to be defective).


[Figure] Source: Ethayarajh (2019)

In contrast, BERT is a contextual model which generates context-specific representations based on the words surrounding each token. Interestingly, Ethayarajh (2019) shows that instead of creating one representation of “lemon” per word sense (left option), BERT will create infinitely many representations of “lemon”, each highly specific to its context (right option). This means that instead of generating a single dense vector to represent the token “lemon”, the token is dynamically represented by “the ______ is sour” and “that car is a _____”.

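As a rough illustration (my own sketch, using the Hugging Face transformers library rather than anything prescribed in the article), the snippet below extracts the last-layer vector for “lemon” from each sentence and compares them; the two contextual vectors are not identical, unlike the static case above.

```python
# Minimal sketch: BERT produces a different vector for "lemon" in each sentence.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def lemon_vector(sentence):
    """Return the last-layer hidden state of the token 'lemon' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("lemon")]                         # "lemon" is a single WordPiece token

v1 = lemon_vector("the lemon is sour")
v2 = lemon_vector("that car is a lemon")
print(torch.cosine_similarity(v1, v2, dim=0).item())             # below 1.0: context-specific vectors
```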

BERT’s ability to capture linguistic polysemy is particularly relevant in the legal domain, where polysemy abounds. For instance, the word “consideration” in law represents the idea of reciprocity in contractual relationships, and has a different meaning from the word’s usual connotation of being considerate. Furthermore, the same term can have multiple definitions in statute and case law. For instance, the term “worker” has four different definitions in EU law, and can be used to connote different kinds of workers, even in the same document. The development of contextual language models is hence significant for the legal AI landscape.


The technicalities of the BERT architecture are beyond the scope of this article (and have been written about extensively). Nevertheless, it is worth noting that one of BERT’s key innovations is its bidirectionality, i.e. it overcomes the traditional problems of sequential text parsing. Notably, bidirectionality in this context does not mean bidirectional in a sequential sense (i.e. parsing from both left-to-right and right-to-left at the same time) but bidirectional in a simultaneous sense (i.e. it learns information from both the left and right contexts of a token in all layers at the same time). This is done by using methods like Masked Language Models (MLMs) and Next Sentence Prediction (NSP) to achieve better contextual word embedding results.

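To give a feel for the MLM objective, here is a small, hedged example using the transformers fill-mask pipeline (an illustrative toolchain choice, not part of the original post): BERT predicts the masked token from its left and right context simultaneously.

```python
# Minimal sketch: masked language modelling with a pre-trained BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The prediction for [MASK] is conditioned on both sides of the token at once.
for prediction in fill_mask("The [MASK] is sour.", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```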

Advantages of Domain-Specific BERTs

[Figure] Source: Encyclopedia Britannica

While BERT has been very effective at general language representation tasks, one problem is that, being trained on unlabelled generic Wikipedia and open-source articles, it lacks domain-specific knowledge. This is an important caveat, as a language model can only be as good as its corpus. As an analogy, applying a vanilla BERT model to legal domain-specific issues is akin to asking a liberal arts student to approach a legal problem, rather than a law student trained on years of legal material. This is problematic because there is a considerable disconnect between the language found in general open-source corpora (e.g. Wikipedia and news articles) and legal language, which can be esoteric and heavily Latinate.


While no legal domain-specific BERT model had been developed at the time of writing, examples of other domain-specific BERTs include BioBERT (biomedical sciences), SciBERT (scientific publications), FinBERT (financial communications), and ClinicalBERT (clinical notes). Although they are all domain-specific BERTs, their architectures exhibit many key differences.


Training Legal Domain-Specific BERTs

I will explore some of the approaches to training domain-specific BERTs:


Completely Pre-Training BERT

This involves completely re-doing BERT’s pre-training process with large-scale unlabelled legal corpora (e.g. statutes, precedents). In this sense, the domain-specific knowledge would be injected during the pre-training process. This was the approach taken for the SCIVOCAB SciBERT.

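A rough sketch of what this route involves is shown below, assuming the Hugging Face tokenizers and transformers libraries; the corpus file, vocabulary size, and model dimensions are illustrative placeholders rather than anything SciBERT actually used.

```python
# Hypothetical sketch: build a domain vocabulary and a randomly initialised BERT to pre-train from scratch.
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig, BertForMaskedLM

# 1. Learn a legal-domain WordPiece vocabulary (analogous in spirit to SciBERT's SCIVOCAB).
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["legal_corpus.txt"], vocab_size=30522)    # placeholder corpus file
tokenizer.save_model(".", "legal-vocab")

# 2. Create an untrained BERT-Base-sized model; all of its weights must then be learned from scratch.
config = BertConfig(vocab_size=30522, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters to train from scratch")
```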

While this does significantly enhance BERT’s performance on domain-specific tasks, the key problem is that the time, cost, and quantity of data required for complete re-training might simply be too massive to be feasible. BERT-Base itself is a neural network with 12 stacked layers and approximately 110 million weights/parameters, and its pre-training corpora comprised 3.3 billion tokens. Re-training a domain-specific BERT would require a similar amount of training data (e.g. SciBERT used 3.17 billion tokens). It would also take a week or more to train the model (e.g. the SCIVOCAB SciBERTs took 1 week to train on a v3 TPU) and be extremely costly, which is simply not feasible for the average enterprise.


Further Pre-Training BERT

As such, a compromise might be not to completely re-train BERT, but instead to initialise weights from BERT and pre-train it further on legal domain-specific data. This has been shown to enhance BERT’s performance: Sun et al. (2019) report that models that were further pre-trained outperformed the original BERT-Base model across all seven datasets tested.


This seems to be the most popular method, and was used for BioBERT, the BASEVOCAB SciBERT, and FinBERT. Instead of completely re-training from scratch, researchers initialised the new BERT model with the learned weights from BERT-Base, then trained it further on domain-specific texts (e.g. PubMed abstracts and PMC full-text articles for BioBERT). Lee et al. (2020) reported that BioBERT achieved higher F1, precision, and recall scores than BERT on all datasets. Similarly, Beltagy et al. (2019) reported that SciBERT outperformed BERT-Base on biomedical, computer science, and multi-domain tasks.

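Below is a hedged sketch of this further pre-training recipe using the Hugging Face datasets and transformers libraries; the corpus file, sequence length, and hyperparameters are placeholders, not values taken from the BioBERT or SciBERT papers.

```python
# Hypothetical sketch: continue BERT-Base's MLM training on an unlabelled legal corpus.
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")      # start from learned weights

dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-bert-further-pretrained",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```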

[Figure] Source: Lee et al. (2019)

Fine-Tuning BERT

Another, even simpler, option is to use the pre-trained BERT but fine-tune it with legal domain-specific corpora on downstream NLP tasks. The concept of fine-tuning comes from the domain of transfer learning, which, in simple terms, means taking a model that was built to solve task x and repurposing it to solve task y. In practice, it usually means taking the pre-trained BERT model and adding an additional output layer of untrained neurons at the end, which is then trained using labelled legal corpora (fine-tuning is usually a supervised task). After being fine-tuned on domain-specific data, the resulting model will have updated weights that more closely resemble the characteristics and vocabulary of the target domain.

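For concreteness, here is a minimal sketch of that fine-tuning step with the transformers library; the two example sentences, the toy binary labels, and the learning rate are invented purely for illustration.

```python
# Minimal sketch: fine-tune BERT with an untrained classification head on labelled examples.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# num_labels attaches a randomly initialised output layer; the encoder keeps its pre-trained weights.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["The consideration for the contract was adequate.",      # toy legal sentence
         "The weather was pleasant that afternoon."]              # toy non-legal sentence
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                       # fine-tuning typically needs only 2-4 epochs
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {outputs.loss.item():.4f}")
```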

The advantage of this approach is that it requires much less data and is significantly faster and cheaper to train. Given that BERT-Base’s pre-training has already encoded knowledge of most words in the general domain, only a few tweaks are needed to adapt it to a legal context. Instead of a timespan of weeks, fine-tuning usually requires just 2–4 epochs (a matter of minutes). This approach also requires less data (although it usually requires labelled data), which makes it more feasible.


Furthermore, if desired, fine-tuning can be done in conjunction with domain-specific pre-training, i.e. they are not mutually exclusive methods. For instance, BioBERT was first pre-trained with biomedical domain-specific corpora and then fine-tuned on biomedical text mining tasks like named entity recognition, relation extraction, and QA.


Conclusion

With the recent hype around mega-scale models like OpenAI’s GPT-3, the initial excitement surrounding BERT seems to have somewhat waned. Nevertheless, it bears remembering that BERT is still an incredibly powerful and agile model that arguably sees more practical application. As seen from the experience of domain-specific BERTs like SciBERT and BioBERT, fine-tuning and/or pre-training BERT on domain-specific corpora is likely to improve performance with regard to legal NLP tasks. To this end, depending on the researcher’s time and monetary resources, there are various ways to improve and tailor BERT’s performance.


Translated from: https://towardsdatascience.com/lawbert-towards-a-legal-domain-specific-bert-716886522b49
