
Natural Language Processing

Data augmentation is the process of increasing the size of the training data without actually collecting new data. But why do we need more data? The answer is simple: in general, the more training data we have, the better the model tends to perform.

Image data augmentation steps such as flipping, cropping, rotation, blurring, and zooming have helped tremendously in computer vision. Augmented images are also relatively easy to create, but the same is not true of Natural Language Processing (NLP) because of the complexity inherent in language. (For example, we cannot replace every word with a synonym, and even where we can, the meaning of the sentence might change completely.) In my own reading, I haven’t seen as much research on text data augmentation as on image augmentation.

In this article, we will go through two libraries, TextAttack and Googletrans, that I came across recently while trying out augmentation for text data. So, let’s get started.

We will apply the augmentation techniques covered below to these two quotes by Simon Sinek.

“Leadership requires two things: a vision of the world that does not yet exist and the ability to communicate it.”


“The role of a leader is not to come up with all the great ideas. The role of a leader is to create an environment in which great ideas can happen”

- Simon Sinek


TextAttack

TextAttack is a Python framework for adversarial attacks, adversarial training, and data augmentation in NLP. In this article, we will focus only on data augmentation.

Installation

!pip install textattack

Usage

The textattack.augmentation module provides four augmenter classes for data augmentation.


  1. WordNetAugmenter: Augments text by replacing words with synonyms from the WordNet thesaurus.

  2. EmbeddingAugmenter: Augments text by replacing words with neighbors in the embedding space, with a constraint to ensure that the cosine similarity between the original and the replacement is at least 0.8.

  3. CharSwapAugmenter: Augments text by substituting, deleting, inserting, and swapping adjacent characters.

  4. EasyDataAugmenter: Augments text with a combination of WordNet synonym replacement, word deletion, word order swaps, and synonym insertion. These four operations are applied randomly, so we get different results each time we run the code. Unlike the other three methods, it returns 4 augmented results by default.

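Each of these four augmenters accepts pct_words_to_swap (the fraction of words to change) and transformations_per_example (the number of augmented versions to return). By default pct_words_to_swap=0.1, while transformations_per_example defaults to 4 for EasyDataAugmenter and to 1 for the other three. We can modify these default values as needed.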
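
To make the two parameters concrete, here is a minimal pure-Python sketch of a character-swap style augmentation. It is not TextAttack's implementation: the function name char_swap_variants, the seed argument, and the rounding rule are assumptions made for illustration, and it only swaps neighbouring characters, whereas CharSwapAugmenter also substitutes, deletes, and inserts.

```python
import random


def char_swap_variants(text, pct_words_to_swap=0.1,
                       transformations_per_example=4, seed=0):
    """Return several copies of `text`, each with a few words perturbed by
    swapping two neighbouring characters (illustrative sketch only)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same output
    words = text.split()
    # Words shorter than two characters have no neighbouring pair to swap.
    candidates = [i for i, word in enumerate(words) if len(word) >= 2]
    # Perturb roughly pct_words_to_swap of the words, but at least one
    # (this rounding rule is an assumption made for the sketch).
    num_to_swap = min(len(candidates),
                      max(1, int(pct_words_to_swap * len(words))))
    variants = []
    for _ in range(transformations_per_example):
        new_words = list(words)
        for i in rng.sample(candidates, num_to_swap):
            word = new_words[i]
            pos = rng.randrange(len(word) - 1)
            # Swap the characters at positions pos and pos + 1.
            new_words[i] = (word[:pos] + word[pos + 1] + word[pos]
                            + word[pos + 2:])
        variants.append(" ".join(new_words))
    return variants


quote = ("Leadership requires two things: a vision of the world that does "
         "not yet exist and the ability to communicate it.")
for variant in char_swap_variants(quote):
    print(variant)
```

Raising pct_words_to_swap perturbs more words per variant, and raising transformations_per_example returns more variants per input sentence. A swap of two identical letters leaves a word unchanged, so a variant can occasionally differ less than expected.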

We can apply these methods to real-world data to increase the size of the dataset. The sample code is given below. Here the original train dataframe is copied to the train_aug dataframe, and augmentation is then applied to train_aug. Finally, train_aug is appended to the original train dataframe.

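import pandas as pd
from textattack.augmentation import EmbeddingAugmenter

train_aug = train.copy()
aug = EmbeddingAugmenter()
# augment() returns a list of strings; keep the first augmented version.
train_aug['text'] = train_aug['text'].apply(lambda x: aug.augment(x)[0])
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
train = pd.concat([train, train_aug], ignore_index=True)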

Googletrans

Googletrans is built on top of the Google Translate API. It uses the Google Translate Ajax API for language detection and translation.

Installation

!pip install googletrans

Usage

The key parameters of the translate() method are:


src: source language. Optional, since googletrans detects it automatically when it is not given.

dest: destination language. Defaults to English ('en') when omitted, so it should be set explicitly for any other target language.

text: the text to be translated from the source language to the destination language. Mandatory parameter.

In back translation, the given text is first translated from English to Italian and then translated back to English. The back-translated text usually differs slightly from the original, but the overall meaning of the sentence is retained.

We can apply this technique to real-world data. The sample code is given below. Here the original train dataframe is copied to the train_aug dataframe, and back translation is then applied to train_aug. Finally, train_aug is appended to the original train dataframe. Note that we are translating the original text from English to Italian and then from Italian back to English.

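import pandas as pd
from googletrans import Translator

train_aug = train.copy()
translator = Translator()
# English -> Italian -> English round trip for every row.
train_aug['text'] = train_aug['text'].apply(lambda x: translator.translate(translator.translate(x, dest='it').text, dest='en').text)
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
train = pd.concat([train, train_aug], ignore_index=True)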
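
Both sample snippets append every augmented row, even when augmentation returns the text unchanged, which would add exact duplicates to the training data. A minimal sketch of one way to skip those rows is shown here. It works on plain (text, label) pairs rather than a dataframe, and the function name extend_with_augmented and the label field are assumptions made for illustration.

```python
def extend_with_augmented(rows, augment_fn):
    """Return the original (text, label) rows plus augmented copies,
    skipping copies whose text came back unchanged."""
    augmented = []
    for text, label in rows:
        new_text = augment_fn(text)
        if new_text != text:
            # The label is carried over unchanged, on the assumption that
            # the augmentation preserves the meaning of the text.
            augmented.append((new_text, label))
    return rows + augmented


# Toy augmentation function standing in for a real augmenter.
rows = [("great ideas", 1), ("a vision", 0)]
result = extend_with_augmented(rows, lambda t: t.replace("great", "big"))
print(result)
```

Here only the first row changes, so the result contains three rows instead of four.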

Conclusion

Now you know how to use the TextAttack and Googletrans libraries for text data augmentation in your data science projects.


Translated from: https://towardsdatascience.com/text-data-augmentation-f4143571ecd2




