NLP Preprocessing Methods (BPE Byte Pair Encoding, Normalisation, Lemmatisation, Stemming…)

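This article was inspired by the BPE algorithm and normalisation methods in natural language processing, which I recently came across and found interesting. It summarises some common preprocessing methods.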

1. Byte pair encoding (BPE)

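BPE began as a data compression algorithm. In NLP it is used to split words into subword units, which addresses the problem of out-of-vocabulary (OOV) words: words that do not appear in the training corpus but do appear at test time. The algorithm repeatedly merges the most frequent pair of adjacent symbols into a new symbol, so an unseen word can still be represented as a sequence of known subword units.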
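The merge procedure can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (no end-of-word marker, a tiny word list), not a production tokenizer; the function names `learn_bpe`, `merge_pair`, and `segment` are made up for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters, counted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(segment("lowly", merges))  # ['low', 'l', 'y']
```

The word "lowly" never occurs in the toy training list, yet it is still represented with the learned unit "low" plus single characters, which is how BPE sidesteps the OOV problem.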

2. Normalisation

Normalisation is often overlooked but useful. Its purpose is to convert text into a canonical form, for example mapping ‘soooo’ to ‘so’. This matters most for noisy text such as social media posts, because it ensures that spelling variants of the same word share one feature instead of being treated as unrelated tokens. There is no standard approach to normalisation, however; the rules have to be defined for each scenario.
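As one possible rule, character elongation can be collapsed with a regular expression. The sketch below is only an assumption about what such a rule might look like; the function name `collapse_elongation` is invented for this example.

```python
import re

def collapse_elongation(text):
    """Collapse any character repeated three or more times in a row to one copy."""
    # (.) captures a character; \1{2,} matches two or more further copies of it.
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(collapse_elongation("soooo good"))  # so good
```

Legitimate double letters such as the "oo" in "good" are left alone because only runs of three or more are touched. The rule is still crude: "coooool" would become "col", so a real pipeline would likely need additional rules or a dictionary check.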

3.1 Lemmatisation

The purpose of lemmatisation is to reduce a word to its base (dictionary) form, the lemma, which still carries the full meaning. For example, ‘good’, ‘better’, and ‘best’ are all reduced to ‘good’.
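At its simplest, lemmatisation of irregular forms is a dictionary lookup. The sketch below uses a tiny hand-written table, `LEMMA_TABLE`, which is purely hypothetical; real lemmatisers rely on a full lexicon and usually on part-of-speech information as well.

```python
# Hypothetical lookup table for illustration only.
LEMMA_TABLE = {"better": "good", "best": "good"}

def lemmatise(token):
    """Return the lemma of `token`, falling back to the lowercased token itself."""
    key = token.lower()
    return LEMMA_TABLE.get(key, key)

print([lemmatise(w) for w in ["good", "better", "best"]])  # ['good', 'good', 'good']
```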

3.2 Stemming

Stemming has much in common with lemmatisation: it removes affixes to obtain the stem of a word. For example, the stem ‘cat’ can be extracted from ‘cats’, ‘catlike’, and ‘catty’. Unlike a lemma, a stem need not be a valid word.
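A toy suffix stripper is enough to reproduce the ‘cat’ example. This is not the Porter stemmer or any other published algorithm; the suffix list `SUFFIXES` and the minimum stem length are assumptions chosen to fit the example.

```python
# Hypothetical suffix list, checked in order; chosen only to fit the example.
SUFFIXES = ("like", "ty", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["cats", "catlike", "catty"]])  # ['cat', 'cat', 'cat']
```

Rules this crude also over-stem: `naive_stem("party")` returns `"par"`, which has nothing to do with the original word.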

Both methods aim to reduce the effect of inflectional variation, such as singular versus plural forms and verb tense, on the analysis. They can also hurt training, however: once a word has been lemmatised or stemmed, it may end up with a different part of speech (POS) or meaning.

4. Stopword removal

Stopwords are the most commonly used words in a language, such as ‘a’, ‘the’, and ‘is’ in English. Removing them helps the analysis focus on more informative words.
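Filtering against a stopword set is a one-line list comprehension. The set `STOPWORDS` below is a small illustrative sample, not a standard list; in practice the list would come from a library or be tuned to the task.

```python
# Small illustrative stopword set; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("The cat is on a mat".split()))  # ['cat', 'on', 'mat']
```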

5. Punctuation removal

Removing punctuation can reduce noise from structural features and make the model more effective.
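For ASCII punctuation, Python's standard library is sufficient. The sketch below uses `str.translate` with `string.punctuation`; the function name `strip_punctuation` is made up for this example, and non-ASCII punctuation (such as curly quotes) would need to be handled separately.

```python
import string

def strip_punctuation(text):
    """Delete every ASCII punctuation character from `text`."""
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("Hello, world! It's fine."))  # Hello world Its fine
```

Note that the apostrophe is removed too, turning "It's" into "Its", so whether this step helps depends on the task.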

6. Lowercasing

For English text, lowercasing everything is one of the simplest and most effective methods. It applies to most natural language processing problems and helps make the expected output more consistent. Note, however, that some capitalised words carry a special meaning that their lowercase forms do not, for example ‘US’ (the country) versus ‘us’ (the pronoun).
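One way to handle such exceptions is to lowercase everything except an explicit allow-list. The sketch below assumes a hypothetical `keep` parameter for that list; it is only one of several possible designs.

```python
def lowercase_tokens(tokens, keep=frozenset()):
    """Lowercase every token except those listed in `keep` (hypothetical parameter)."""
    return [t if t in keep else t.lower() for t in tokens]

tokens = ["The", "US", "helped", "us"]
print(lowercase_tokens(tokens))               # ['the', 'us', 'helped', 'us']
print(lowercase_tokens(tokens, keep={"US"}))  # ['the', 'US', 'helped', 'us']
```

Without the allow-list, "US" and "us" collapse into the same token; with it, the distinction survives.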