Deep Contextualized Word Representations


M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, et al., Deep Contextualized Word Representations, NAACL (2018)


Abstract

Deep contextualized word representations model:

(1) complex characteristics of word use, such as syntax and semantics;

(2) how these uses vary across linguistic contexts, i.e., polysemy.

The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM) pretrained on a large corpus.

Evaluation tasks include question answering, textual entailment, and sentiment analysis.

1 Introduction

Ideally, pre-trained word representations should model:

(1) complex characteristics of word use, such as syntax and semantics;

(2) how these uses vary across linguistic contexts, i.e., polysemy.

This paper proposes a deep contextualized word representation in which:

(1) each token is assigned a representation that is a function of the entire input sentence;

(2) the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus.

The method is called ELMo (Embeddings from Language Models). The representations are deep in the sense that they are a function of all of the internal layers of the biLM; moreover, for each end task, a separate linear combination of the vectors stacked above each input word is learned.

Higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax.

■ In these notes, biLM refers to the ELMo language model and biRNN refers to the downstream task model. ■

2 Related Work

Because they capture syntactic and semantic information of words from large-scale unlabeled text, pretrained word vectors have become a standard component of most state-of-the-art NLP architectures, e.g., for question answering, textual entailment, and semantic role labeling. However, these earlier approaches only allow a single context-independent representation for each word.

To overcome this limitation, methods based on subword information and methods learning separate vectors for each word sense have been proposed. This paper benefits from subword units through the use of character convolutions, and it seamlessly incorporates multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.

Context-dependent representations: context2vec uses a bidirectional LSTM to encode the context around a pivot word.

3 ELMo: Embeddings from Language Models

ELMo word representations are functions of the entire input sentence:

(1) they are computed on top of a two-layer biLM with character convolutions;

(2) they are a linear function of the internal network states.

This setting allows semi-supervised learning, where the biLM is pretrained at a large scale, and the representations are easily incorporated into a wide range of existing neural NLP architectures.

3.1 Bidirectional language models

Given a sequence of $N$ tokens $(t_1, t_2, \dots, t_N)$, a forward language model models the probability of the sequence by predicting token $t_k$ given the history $(t_1, \dots, t_{k-1})$:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$

A backward language model instead predicts each token given its future context:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)$$

The biLM jointly maximizes the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big)$$

The parameters of the token representation ($\Theta_x$) and of the Softmax layer ($\Theta_s$) are tied across the forward and backward directions, while separate parameters are maintained for the LSTMs in each direction.
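
A minimal sketch of this setup (PyTorch) is below. Module and function names are my own, and the dimensions are simplified rather than those of the released ELMo model; it only illustrates the shared embedding/softmax with separate directional LSTMs and the summed two-direction objective.

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)            # token representation, Theta_x (shared)
        self.fwd_lstm = nn.LSTM(dim, dim, batch_first=True)   # forward-direction LSTM parameters
        self.bwd_lstm = nn.LSTM(dim, dim, batch_first=True)   # backward-direction LSTM parameters
        self.softmax = nn.Linear(dim, vocab_size)             # Softmax layer, Theta_s (shared)

    def forward(self, tokens):                                # tokens: (batch, N)
        x = self.embed(tokens)
        h_fwd, _ = self.fwd_lstm(x)                           # reads the sequence left to right
        h_bwd, _ = self.bwd_lstm(torch.flip(x, dims=[1]))     # reads the sequence right to left
        h_bwd = torch.flip(h_bwd, dims=[1])                   # realign to the original token order
        return self.softmax(h_fwd), self.softmax(h_bwd)

def joint_negative_log_likelihood(model, tokens):
    """Negated sum of the forward and backward log likelihoods, for minimization."""
    logits_fwd, logits_bwd = model(tokens)
    ce = nn.CrossEntropyLoss()
    # forward LM: the state after reading t_1..t_k predicts t_{k+1}
    loss_fwd = ce(logits_fwd[:, :-1].reshape(-1, logits_fwd.size(-1)), tokens[:, 1:].reshape(-1))
    # backward LM: the state after reading t_N..t_{k+1} predicts t_k
    loss_bwd = ce(logits_bwd[:, 1:].reshape(-1, logits_bwd.size(-1)), tokens[:, :-1].reshape(-1))
    return loss_fwd + loss_bwd
```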

3.2 ELMo

ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L + 1$ representations:

$$R_k = \{ \mathbf{x}_k^{LM}, \overrightarrow{\mathbf{h}}_{k,j}^{LM}, \overleftarrow{\mathbf{h}}_{k,j}^{LM} \mid j = 1, \dots, L \} = \{ \mathbf{h}_{k,j}^{LM} \mid j = 0, \dots, L \}$$

where $\mathbf{h}_{k,0}^{LM}$ is the token-layer representation and $\mathbf{h}_{k,j}^{LM} = [\overrightarrow{\mathbf{h}}_{k,j}^{LM}; \overleftarrow{\mathbf{h}}_{k,j}^{LM}]$ is the representation of each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in $R_k$ into a single vector $\mathrm{ELMo}_k = E(R_k; \Theta_e)$, computing a task-specific weighting of all biLM layers:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}^{LM} \tag{1}$$

where $\mathbf{s}^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector.
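
A minimal sketch (PyTorch) of Eq. (1): collapse the $L + 1$ layer representations of each token into one task-specific vector via softmax-normalized weights and a scaling parameter. The class name `ScalarMix` and tensor layout are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # pre-softmax layer weights s^task
        self.gamma = nn.Parameter(torch.ones(1))         # scaling parameter gamma^task

    def forward(self, layer_reps):
        # layer_reps: list of (batch, seq_len, dim) tensors, one per biLM layer (j = 0..L)
        weights = torch.softmax(self.s, dim=0)            # softmax-normalized weights
        stacked = torch.stack(layer_reps, dim=0)          # (L+1, batch, seq_len, dim)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return self.gamma * mixed                         # ELMo_k^task for every token
```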

3.3 Using biLMs for supervised NLP tasks

Given a pretrained biLM and a supervised architecture for a target NLP task:

(1) run the biLM and record all of the layer representations for each word;

(2) let the end task model learn a linear combination of these representations (see the sketch below).
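
A minimal sketch of these two steps for a sequence-tagging task, reusing `ScalarMix` from above. The `bilm` object is assumed to return a list of frozen layer representations (token layer plus each biLSTM layer); this interface and the class name `ELMoEnhancedTagger` are illustrative. The ELMo vector is concatenated with the task model's own context-independent token embedding before its biRNN.

```python
import torch
import torch.nn as nn

class ELMoEnhancedTagger(nn.Module):
    def __init__(self, bilm, token_dim, elmo_dim, hidden_dim, num_tags):
        super().__init__()
        self.bilm = bilm                             # pretrained biLM, weights kept frozen
        self.scalar_mix = ScalarMix(num_layers=3)    # token layer + L = 2 biLSTM layers
        self.task_rnn = nn.LSTM(token_dim + elmo_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, tokens, token_embeddings):
        with torch.no_grad():                        # (1) record all biLM layer representations
            layer_reps = self.bilm(tokens)
        elmo = self.scalar_mix(layer_reps)           # (2) learned linear combination of layers
        x = torch.cat([token_embeddings, elmo], dim=-1)
        h, _ = self.task_rnn(x)
        return self.classifier(h)
```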

3.4 Pre-trained bidirectional language model architecture

The architecture supports joint training of both directions and adds a residual connection between LSTM layers.

$L = 2$ biLSTM layers with 4096 units and 512-dimensional projections; the two LSTM layers are connected by a residual connection.

The context-insensitive type representation uses 2048 character $n$-gram convolutional filters followed by two highway layers and a linear projection down to a 512-dimensional representation.

After 10 training epochs on the 1B Word Benchmark, the average forward and backward perplexity is 39.7.
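
A minimal sketch (PyTorch) of the encoder stack described above: character $n$-gram convolutions, two highway layers, a linear projection to 512 dimensions, and two LSTM layers (4096 units, 512-dimensional projections) joined by a residual connection. Filter widths, character vocabulary size, and class names are illustrative, not the exact configuration of the released model.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1 - g) * x

class CharEncoder(nn.Module):
    """Context-insensitive type representation from character n-gram convolutions."""
    def __init__(self, num_chars=262, char_dim=16, out_dim=512):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        # a few filter widths, 512 filters each, 2048 filters in total
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, 512, kernel_size=w) for w in (1, 2, 3, 4)])
        self.highway = nn.Sequential(Highway(2048), Highway(2048))
        self.proj = nn.Linear(2048, out_dim)          # linear projection down to 512

    def forward(self, chars):                         # chars: (num_tokens, max_chars)
        x = self.char_embed(chars).transpose(1, 2)    # (num_tokens, char_dim, max_chars)
        feats = [conv(x).max(dim=-1).values for conv in self.convs]   # max over positions
        return self.proj(self.highway(torch.cat(feats, dim=-1)))

class LMEncoder(nn.Module):
    """Two LSTM layers per direction, 4096 units projected to 512, with a residual connection."""
    def __init__(self, dim=512, hidden=4096):
        super().__init__()
        self.lstm1 = nn.LSTM(dim, hidden, batch_first=True, proj_size=dim)
        self.lstm2 = nn.LSTM(dim, hidden, batch_first=True, proj_size=dim)

    def forward(self, x):                             # x: (batch, seq_len, dim)
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(h1)
        return h1, h1 + h2                            # residual connection between LSTM layers
```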

4 Evaluation


Question answering

SQuAD (Stanford Question Answering Dataset)

Textual entailment

Textual entailment is the task of determining whether a "hypothesis" is true, given a "premise".

SNLI (Stanford Natural Language Inference)

Semantic role labeling

A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering "Who did what to whom".

Coreference resolution

Coreference resolution is the task of clustering mentions in text that refer to the same underlying real-world entities.

Named entity extraction

CoNLL 2003 NER

Sentiment analysis

SST-5 (Stanford Sentiment Treebank), five-way classification

5 Analysis

In the biLM, syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

5.1 Alternate layer weighting schemes

Regularization parameter $\lambda$:

(1) $\lambda = 1$ reduces the weighting function to a simple average over the layers;

(2) smaller values, e.g., $\lambda = 0.001$, allow the layer weights to vary (see the sketch below).
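
A minimal sketch of how this regularizer can be attached to the task loss, reusing the `ScalarMix` module from Section 3.2; the function name is illustrative. Penalizing the pre-softmax layer weights pulls them toward zero, so a large $\lambda$ (e.g., 1) makes the softmax nearly uniform, i.e., a simple average over layers, while a small $\lambda$ (e.g., 0.001) lets the weights vary.

```python
def task_loss_with_weight_regularization(task_loss, scalar_mix, lam=0.001):
    # lambda * ||w||_2^2 penalty on the layer weights of ScalarMix
    reg = lam * (scalar_mix.s ** 2).sum()
    return task_loss + reg
```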

5.2 Where to include ELMo?

ELMo can be included at both the input and the output of the task biRNN, as in the sketch below.
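
A sketch of this variant, written as a replacement `forward` for the `ELMoEnhancedTagger` from Section 3.3; `scalar_mix_in` and `scalar_mix_out` are assumed to be two separate `ScalarMix` modules, and the classifier's input size must grow to `2 * hidden_dim + elmo_dim`. Names are illustrative.

```python
def forward(self, tokens, token_embeddings):
    with torch.no_grad():
        layer_reps = self.bilm(tokens)
    # ELMo concatenated with the token embedding at the biRNN input
    x = torch.cat([token_embeddings, self.scalar_mix_in(layer_reps)], dim=-1)
    h, _ = self.task_rnn(x)
    # a separately weighted ELMo concatenated again at the biRNN output
    h = torch.cat([h, self.scalar_mix_out(layer_reps)], dim=-1)
    return self.classifier(h)
```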

5.3 What information is captured by the biLM's representations?

The biLM must be disambiguating the meaning of words using their context.


Word sense disambiguation

POS tagging

Implications for supervised tasks

5.4 Sample efficiency

Adding ELMo to a model considerably increases sample efficiency, both in the number of parameter updates needed to reach state-of-the-art performance and in the overall training set size required.

5.5 Visualization of learned weights

6 Conclusion
