Deep Contextualized Word Representations


M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, et al., Deep Contextualized Word Representations, NAACL (2018)


Abstract

Deep contextualized word representations model:

(1) complex characteristics of word use, such as syntax and semantics;

(2) how these uses vary across linguistic contexts, i.e., polysemy.

The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM) pretrained on a large corpus.

Evaluation tasks include question answering, textual entailment, and sentiment analysis.

1 Introduction

Ideally, pre-trained word representations should model:

(1) complex characteristics of word use, such as syntax and semantics;

(2) how these uses vary across linguistic contexts, i.e., polysemy.

This paper proposes a deep contextualized word representation in which:

(1) each token is assigned a representation that is a function of the entire input sentence;

(2) the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus.

The method is called ELMo (Embeddings from Language Models). The representations are deep in the sense that they are a function of all of the internal layers of the biLM; moreover, for each end task, a separate linear combination of the vectors stacked above each input word is learned.

Higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax.

■ In these notes, biLM refers to the ELMo language model and biRNN refers to the downstream task model. ■

2 Related Work

Because they capture syntactic and semantic information of words from large-scale unlabeled text, pretrained word vectors have become a standard component of most state-of-the-art NLP architectures, e.g., for question answering, textual entailment, and semantic role labeling. However, these earlier approaches only allow a single context-independent representation for each word.

To overcome this limitation, methods based on subword information and methods learning separate vectors for each word sense have been proposed. This paper benefits from subword units through the use of character convolutions, and it seamlessly incorporates multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.

Context-dependent representations: context2vec uses a bidirectional LSTM to encode the context around a pivot word.

3 ELMo: Embeddings from Language Models

ELMo word representations are functions of the entire input sentence:

(1) they are computed on top of a two-layer biLM with character convolutions;

(2) they are a linear function of the internal network states.

This setting allows semi-supervised learning, where the biLM is pretrained at a large scale, and the representations are easily incorporated into a wide range of existing neural NLP architectures.

3.1 Bidirectional language models

Given a sequence of $N$ tokens $(t_1, t_2, \dots, t_N)$, a forward language model models the probability of the sequence by predicting token $t_k$ given the history $(t_1, \dots, t_{k-1})$:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$

A backward language model instead predicts each token given its future context:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)$$

The biLM jointly maximizes the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big)$$

The parameters of the token representation ($\Theta_x$) and of the Softmax layer ($\Theta_s$) are tied across the forward and backward directions, while separate parameters are maintained for the LSTMs in each direction.
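
A minimal sketch of this setup (PyTorch) is below. Module and function names are my own, and the dimensions are simplified rather than those of the released ELMo model; it only illustrates the shared embedding/softmax with separate directional LSTMs and the summed two-direction objective.

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)            # token representation, Theta_x (shared)
        self.fwd_lstm = nn.LSTM(dim, dim, batch_first=True)   # forward-direction LSTM parameters
        self.bwd_lstm = nn.LSTM(dim, dim, batch_first=True)   # backward-direction LSTM parameters
        self.softmax = nn.Linear(dim, vocab_size)             # Softmax layer, Theta_s (shared)

    def forward(self, tokens):                                # tokens: (batch, N)
        x = self.embed(tokens)
        h_fwd, _ = self.fwd_lstm(x)                           # reads the sequence left to right
        h_bwd, _ = self.bwd_lstm(torch.flip(x, dims=[1]))     # reads the sequence right to left
        h_bwd = torch.flip(h_bwd, dims=[1])                   # realign to the original token order
        return self.softmax(h_fwd), self.softmax(h_bwd)

def joint_negative_log_likelihood(model, tokens):
    """Negated sum of the forward and backward log likelihoods, for minimization."""
    logits_fwd, logits_bwd = model(tokens)
    ce = nn.CrossEntropyLoss()
    # forward LM: the state after reading t_1..t_k predicts t_{k+1}
    loss_fwd = ce(logits_fwd[:, :-1].reshape(-1, logits_fwd.size(-1)), tokens[:, 1:].reshape(-1))
    # backward LM: the state after reading t_N..t_{k+1} predicts t_k
    loss_bwd = ce(logits_bwd[:, 1:].reshape(-1, logits_bwd.size(-1)), tokens[:, :-1].reshape(-1))
    return loss_fwd + loss_bwd
```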

3.2 ELMo

ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L + 1$ representations:

$$R_k = \{ \mathbf{x}_k^{LM}, \overrightarrow{\mathbf{h}}_{k,j}^{LM}, \overleftarrow{\mathbf{h}}_{k,j}^{LM} \mid j = 1, \dots, L \} = \{ \mathbf{h}_{k,j}^{LM} \mid j = 0, \dots, L \}$$

where $\mathbf{h}_{k,0}^{LM}$ is the token-layer representation and $\mathbf{h}_{k,j}^{LM} = [\overrightarrow{\mathbf{h}}_{k,j}^{LM}; \overleftarrow{\mathbf{h}}_{k,j}^{LM}]$ is the representation of each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in $R_k$ into a single vector $\mathrm{ELMo}_k = E(R_k; \Theta_e)$, computing a task-specific weighting of all biLM layers:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}^{LM} \tag{1}$$

where $\mathbf{s}^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector.
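
A minimal sketch (PyTorch) of Eq. (1): collapse the $L + 1$ layer representations of each token into one task-specific vector via softmax-normalized weights and a scaling parameter. The class name `ScalarMix` and tensor layout are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # pre-softmax layer weights s^task
        self.gamma = nn.Parameter(torch.ones(1))         # scaling parameter gamma^task

    def forward(self, layer_reps):
        # layer_reps: list of (batch, seq_len, dim) tensors, one per biLM layer (j = 0..L)
        weights = torch.softmax(self.s, dim=0)            # softmax-normalized weights
        stacked = torch.stack(layer_reps, dim=0)          # (L+1, batch, seq_len, dim)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return self.gamma * mixed                         # ELMo_k^task for every token
```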

3.3 Using biLMs for supervised NLP tasks

Given a pretrained biLM and a supervised architecture for a target NLP task:

(1) run the biLM and record all of the layer representations for each word;

(2) let the end task model learn a linear combination of these representations (see the sketch below).
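
A minimal sketch of these two steps for a sequence-tagging task, reusing `ScalarMix` from above. The `bilm` object is assumed to return a list of frozen layer representations (token layer plus each biLSTM layer); this interface and the class name `ELMoEnhancedTagger` are illustrative. The ELMo vector is concatenated with the task model's own context-independent token embedding before its biRNN.

```python
import torch
import torch.nn as nn

class ELMoEnhancedTagger(nn.Module):
    def __init__(self, bilm, token_dim, elmo_dim, hidden_dim, num_tags):
        super().__init__()
        self.bilm = bilm                             # pretrained biLM, weights kept frozen
        self.scalar_mix = ScalarMix(num_layers=3)    # token layer + L = 2 biLSTM layers
        self.task_rnn = nn.LSTM(token_dim + elmo_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, tokens, token_embeddings):
        with torch.no_grad():                        # (1) record all biLM layer representations
            layer_reps = self.bilm(tokens)
        elmo = self.scalar_mix(layer_reps)           # (2) learned linear combination of layers
        x = torch.cat([token_embeddings, elmo], dim=-1)
        h, _ = self.task_rnn(x)
        return self.classifier(h)
```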

3.4 Pre-trained bidirectional language model architecture

The architecture supports joint training of both directions and adds a residual connection between LSTM layers.

$L = 2$ biLSTM layers with 4096 units and 512-dimensional projections; the two LSTM layers are connected by a residual connection.

The context-insensitive type representation uses 2048 character $n$-gram convolutional filters followed by two highway layers and a linear projection down to a 512-dimensional representation.

After 10 training epochs on the 1B Word Benchmark, the average forward and backward perplexity is 39.7.
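
A minimal sketch (PyTorch) of the encoder stack described above: character $n$-gram convolutions, two highway layers, a linear projection to 512 dimensions, and two LSTM layers (4096 units, 512-dimensional projections) joined by a residual connection. Filter widths, character vocabulary size, and class names are illustrative, not the exact configuration of the released model.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1 - g) * x

class CharEncoder(nn.Module):
    """Context-insensitive type representation from character n-gram convolutions."""
    def __init__(self, num_chars=262, char_dim=16, out_dim=512):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        # a few filter widths, 512 filters each, 2048 filters in total
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, 512, kernel_size=w) for w in (1, 2, 3, 4)])
        self.highway = nn.Sequential(Highway(2048), Highway(2048))
        self.proj = nn.Linear(2048, out_dim)          # linear projection down to 512

    def forward(self, chars):                         # chars: (num_tokens, max_chars)
        x = self.char_embed(chars).transpose(1, 2)    # (num_tokens, char_dim, max_chars)
        feats = [conv(x).max(dim=-1).values for conv in self.convs]   # max over positions
        return self.proj(self.highway(torch.cat(feats, dim=-1)))

class LMEncoder(nn.Module):
    """Two LSTM layers per direction, 4096 units projected to 512, with a residual connection."""
    def __init__(self, dim=512, hidden=4096):
        super().__init__()
        self.lstm1 = nn.LSTM(dim, hidden, batch_first=True, proj_size=dim)
        self.lstm2 = nn.LSTM(dim, hidden, batch_first=True, proj_size=dim)

    def forward(self, x):                             # x: (batch, seq_len, dim)
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(h1)
        return h1, h1 + h2                            # residual connection between LSTM layers
```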

4 Evaluation


Question answering

SQuAD (Stanford Question Answering Dataset)

Textual entailment

Textual entailment is the task of determining whether a "hypothesis" is true, given a "premise".

SNLI (Stanford Natural Language Inference)

Semantic role labeling

A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering "Who did what to whom".

Coreference resolution

Coreference resolution is the task of clustering mentions in text that refer to the same underlying real-world entities.

Named entity extraction

CoNLL 2003 NER

Sentiment analysis

SST-5 (Stanford Sentiment Treebank), five-way classification

5 Analysis

In the biLM, syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

5.1 Alternate layer weighting schemes

Regularization parameter $\lambda$:

(1) $\lambda = 1$ reduces the weighting function to a simple average over the layers;

(2) smaller values, e.g., $\lambda = 0.001$, allow the layer weights to vary (see the sketch below).
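
A minimal sketch of how this regularizer can be attached to the task loss, reusing the `ScalarMix` module from Section 3.2; the function name is illustrative. Penalizing the pre-softmax layer weights pulls them toward zero, so a large $\lambda$ (e.g., 1) makes the softmax nearly uniform, i.e., a simple average over layers, while a small $\lambda$ (e.g., 0.001) lets the weights vary.

```python
def task_loss_with_weight_regularization(task_loss, scalar_mix, lam=0.001):
    # lambda * ||w||_2^2 penalty on the layer weights of ScalarMix
    reg = lam * (scalar_mix.s ** 2).sum()
    return task_loss + reg
```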

5.2 Where to include ELMo?

ELMo can be included at both the input and the output of the task biRNN, as in the sketch below.
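
A sketch of this variant, written as a replacement `forward` for the `ELMoEnhancedTagger` from Section 3.3; `scalar_mix_in` and `scalar_mix_out` are assumed to be two separate `ScalarMix` modules, and the classifier's input size must grow to `2 * hidden_dim + elmo_dim`. Names are illustrative.

```python
def forward(self, tokens, token_embeddings):
    with torch.no_grad():
        layer_reps = self.bilm(tokens)
    # ELMo concatenated with the token embedding at the biRNN input
    x = torch.cat([token_embeddings, self.scalar_mix_in(layer_reps)], dim=-1)
    h, _ = self.task_rnn(x)
    # a separately weighted ELMo concatenated again at the biRNN output
    h = torch.cat([h, self.scalar_mix_out(layer_reps)], dim=-1)
    return self.classifier(h)
```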

5.3 What information is captured by the biLM's representations?

The biLM must be disambiguating the meaning of words using their context.


Word sense disambiguation

POS tagging

Implications for supervised tasks

5.4 Sample efficiency

Adding ELMo to a model considerably increases sample efficiency, both in the number of parameter updates needed to reach state-of-the-art performance and in the overall training set size required.

5.5 Visualization of learned weights

6 Conclusion
