Data Augmentation in NLP: UDA and EDA

Contents

NLP Data Augmentation
- 1. UDA (Unsupervised Data Augmentation) [Recommended]
- 2. EDA (Easy Data Augmentation)

NLP Data Augmentation

1. UDA (Unsupervised Data Augmentation) [Recommended]

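UDA is a semi-supervised learning method that reduces the need for labeled data and makes better use of unlabeled data. It was introduced in the paper "Unsupervised Data Augmentation for Consistency Training". A common approach in semi-supervised learning is to apply consistency training to a large amount of unlabeled data, constraining the model's predictions to be invariant to input noise. The paper offers a new perspective on how to noise unlabeled examples effectively, and argues that the quality of the noise, especially noise produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.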

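By replacing simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, UDA achieves substantial improvements on six language tasks and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, UDA reaches an error rate of 4.20%, outperforming the state-of-the-art model trained on 25,000 labeled examples. On the standard semi-supervised benchmark CIFAR-10, UDA outperforms all previous approaches and reaches an error rate of 5.43% with only 250 labeled examples. UDA also combines well with transfer learning, for example when fine-tuning from BERT, and yields improvements in high-data regimes such as ImageNet, whether only 10% of the labeled data is used or the full labeled set is used together with 1.3 million extra unlabeled examples.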

The text augmentation technique used by UDA is back-translation: it generates diverse sentence forms while preserving the meaning.
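A minimal sketch of the back-translation idea, assuming two translation callables are available. The names `back_translate`, `to_pivot`, and `from_pivot` are hypothetical, and the toy word dictionaries below only stand in for real machine translation models; they are not part of UDA.

```python
from typing import Callable


def back_translate(sentence: str,
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> str:
    """Translate into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(sentence))


# Toy stand-ins for real MT systems (hypothetical, for illustration only).
EN_TO_FR = {"the": "le", "movie": "film", "was": "fut", "great": "formidable"}
FR_TO_EN = {"le": "the", "film": "film", "fut": "was", "formidable": "wonderful"}


def toy_en_to_fr(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(EN_TO_FR.get(w, w) for w in text.split())


def toy_fr_to_en(text: str) -> str:
    return " ".join(FR_TO_EN.get(w, w) for w in text.split())


paraphrase = back_translate("the movie was great", toy_en_to_fr, toy_fr_to_en)
```

The round trip changes the wording ("movie" becomes "film", "great" becomes "wonderful") while the label of the sentence, here a positive review, would stay the same. With a real translation system the callables would wrap model inference instead of dictionary lookups.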

The key question UDA addresses: given only a small amount of labeled data, how can unlabeled data be put to better use?

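For the labeled data, a model $M = p_{\theta}(y \mid x)$ can be learned with standard supervised learning. For the unlabeled data, semi-supervised learning is applied: noise is added to an unlabeled example $x$ to obtain $\hat{x}$, and the same model gives the prediction $p_{\theta}(y \mid \hat{x})$. Consistency training requires the predictions on the original example and on the noised example to agree, i.e. it minimizes the KL divergence between the two predictive distributions: $\min_{\theta} D_{KL}\left(p_{\theta}(y \mid x) \,\|\, p_{\theta}(y \mid \hat{x})\right)$. Here $\hat{x} = q(x, \epsilon)$ is the augmented example obtained by adding noise to the unlabeled example. So how should the noise $\epsilon$ be added to obtain the augmented data $\hat{x}$? According to the paper, advanced data augmentation is a good source of noise because it offers the following: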
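The consistency term can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the function names are made up for this post, the model outputs are represented as raw logit lists, and no gradients are computed. In actual training this term is computed on unlabeled batches and added to the supervised cross-entropy loss with a weighting factor.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def consistency_loss(logits_orig, logits_aug):
    """KL divergence between predictions on x and on its augmented version."""
    p_orig = softmax(logits_orig)  # prediction on the original example x
    p_aug = softmax(logits_aug)    # prediction on the augmented example x_hat
    return kl_divergence(p_orig, p_aug)


# Identical predictions give zero loss; disagreeing predictions are penalized.
same = consistency_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = consistency_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```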

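valid noise: augmented examples remain valid, so predictions on an original unlabeled example and on its augmented version should be consistent.
diverse noise: the input can be modified substantially without changing its label, which increases sample diversity, instead of only making local changes as Gaussian noise does.
targeted inductive biases: different tasks require different inductive biases.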

The UDA paper runs experiments on image classification and text classification, each with its own augmentation strategies:

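Image Classification: RandAugment, a data augmentation method inspired by AutoAugment (Cubuk et al., 2018). AutoAugment uses a search method to combine all the image transformations in the Python Image Library (PIL) and find a good augmentation policy. RandAugment does not search; it samples uniformly from the same set of PIL augmentation transformations. In other words, RandAugment is simpler and requires no labeled data, because there is no optimal policy to search for.
Text Classification: Back-translation. It preserves the meaning by translating a sentence into another language and back with a machine translation system, which increases sentence diversity.
Text Classification: Word replacing with TF-IDF. Back-translation keeps the overall meaning but cannot control which individual words are retained. For topic classification, certain keywords carry more information about the topic than others. UDA therefore adds another augmentation method: replace uninformative words that have low TF-IDF scores while keeping words that have high TF-IDF values.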
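The TF-IDF word replacement idea can be sketched roughly as below. This is a simplified illustration, not the paper's exact procedure: the function names, the `candidates` pool of replacement words, and the way the replacement probability is normalized are assumptions made for this example. The one property it is meant to show is that the lower a token's TF-IDF score, the more likely it is to be replaced, while the highest-scoring token is never touched.

```python
import math
import random
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency for every word in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def tfidf_replace(doc, idf, candidates, p=0.3, rng=random):
    """Replace low-TF-IDF tokens of `doc` with words drawn from `candidates`."""
    if not doc:
        return []
    tf = Counter(doc)
    scores = [tf[w] / len(doc) * idf.get(w, 0.0) for w in doc]
    top = max(scores)
    # Mean gap to the top score, used to normalize replacement probabilities.
    z = sum(top - s for s in scores) / len(doc)
    if z == 0:  # all tokens equally informative: leave the document unchanged
        return list(doc)
    out = []
    for word, s in zip(doc, scores):
        prob = min(p * (top - s) / z, 1.0)  # zero for the top-scoring token
        out.append(rng.choice(candidates) if rng.random() < prob else word)
    return out


corpus = [
    ["the", "football", "match"],
    ["the", "stock", "market"],
    ["the", "football", "market"],
]
idf = idf_table(corpus)
augmented = tfidf_replace(corpus[0], idf, candidates=["a", "of"],
                          p=0.9, rng=random.Random(0))
```

In this toy corpus "the" appears in every document, so its IDF is zero and it is the first token to be replaced, whereas "match" has the highest score in the first document and is always kept.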

2. EDA (Easy Data Augmentation)

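EDA comes from the paper "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks". It is a set of simple data augmentation techniques for improving performance on text classification. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On the five text classification tasks in the paper's experiments, EDA improves the performance of both convolutional and recurrent neural networks. EDA is particularly effective on smaller datasets: on average across the five datasets, training with EDA on only 50% of the available training set reached the same accuracy as normal training on all available data.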

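The four EDA operations:

Synonym Replacement (SR): randomly choose n words from the sentence that are not stop words, and replace each of them with a randomly chosen synonym.
Random Insertion (RI): pick a random word in the sentence that is not a stop word, find a random synonym of it, and insert that synonym at a random position in the sentence. Repeat n times.
Random Swap (RS): randomly choose two words in the sentence and swap their positions. Repeat n times.
Random Deletion (RD): remove each word in the sentence independently with probability p.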

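A caveat when using EDA: keep the number of augmented samples under control. EDA is best suited to small-data settings, and the data should not be expanded too much, because applying the operations too aggressively can change the meaning of a sentence and hurt model performance.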
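One way to keep the amount of change under control is to tie the number of edits n to the sentence length through a small ratio, which, as far as I can tell, is how the reference implementation [5] derives n for SR, RI, and RS. The helper below is a sketch of that idea; the name `num_ops` and the parameter name `alpha` are chosen for this example.

```python
def num_ops(words, alpha):
    """Number of edits for one sentence: a fraction alpha of its length, at least 1."""
    return max(1, int(alpha * len(words)))


sentence = "the quick brown fox jumps over the lazy dog near the river bank today"
words = sentence.split()

# A small alpha changes only a few words, so the meaning is likely preserved.
n_light = num_ops(words, alpha=0.1)
# A large alpha changes many words and risks altering the label.
n_heavy = num_ops(words, alpha=0.5)
```

Random deletion does not need n, since it uses the probability p directly; keeping p small plays the same role as keeping alpha small.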

On the topic of EDA: in an interview for an NLP algorithm engineer position, I was once asked to write out these four functions.

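Synonym Replacement (SR):

########################################################################
# Synonym replacement
# Replace n words in the sentence with synonyms from wordnet
########################################################################

# The first time you use wordnet, download it:
# import nltk
# nltk.download('wordnet')
import random
from nltk.corpus import wordnet

# Assumption: placeholder only. The reference implementation [5] defines a
# full list of English stop words here.
stop_words = set()

def get_synonyms(word):
    # Helper assumed by the functions in this post; a minimal version modeled
    # on the reference implementation [5]. Returns WordNet synonyms of `word`.
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name().replace("_", " ").replace("-", " ").lower())
    synonyms.discard(word)
    return list(synonyms)

def synonym_replacement(words, n):
    new_words = words.copy()
    # Candidate words: unique, non-stop words, visited in random order
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            # Replace every occurrence of the chosen word
            new_words = [synonym if word == random_word else word for word in new_words]
            # print("replaced", random_word, "with", synonym)
            num_replaced += 1
        if num_replaced >= n:  # only replace up to n words
            break

    # A synonym may consist of several words, so re-split into single tokens
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')

    return new_words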

Random Deletion (RD):

########################################################################
# Random deletion
# Randomly delete words from the sentence with probability p
########################################################################
import random

def random_deletion(words, p):
    # Obviously, if there's only one word, don't delete it
    if len(words) == 1:
        return words

    # Randomly delete words with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    # If you end up deleting all words, just return a random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words) - 1)
        return [words[rand_int]]

    return new_words

Random Swap (RS):

########################################################################
# Random swap
# Randomly swap two words in the sentence n times
########################################################################
import random

def random_swap(words, n):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)
    return new_words

def swap_word(new_words):
    # Nothing to swap in a sentence with fewer than two words
    if len(new_words) < 2:
        return new_words
    random_idx_1 = random.randint(0, len(new_words) - 1)
    random_idx_2 = random_idx_1
    counter = 0
    # Draw a second index that differs from the first; give up after a few tries
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words) - 1)
        counter += 1
        if counter > 3:
            return new_words
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
    return new_words

Random Insertion (RI):

########################################################################
# Random insertion
# Randomly insert n words into the sentence
########################################################################
import random

# Relies on get_synonyms(word), the WordNet lookup helper that is also used
# for synonym replacement.

def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)  # inserts in place
    return new_words

def add_word(new_words):
    synonyms = []
    counter = 0
    # Keep drawing random words until one with at least one synonym is found
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words) - 1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:  # give up after 10 attempts
            return
    random_synonym = synonyms[0]
    random_idx = random.randint(0, len(new_words) - 1)
    new_words.insert(random_idx, random_synonym)

References:
[1]: https://github.com/google-research/uda “Unsupervised Data Augmentation”
[2]: https://arxiv.org/abs/1904.12848 “Unsupervised Data Augmentation for Consistency Training”
[3]: https://github.com/zhanlaoban/EDA_NLP_for_Chinese “EDA_NLP_for_Chinese”
[4]: https://arxiv.org/abs/1901.11196 “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
[5]: https://github.com/jasonwei20/eda_nlp “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”
