Chapter 2: How to Clean Text Manually and with NLTK

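You cannot train a machine learning or deep learning model directly on raw text. The text must first be converted into tensors that can be fed to the model, which means processing the raw text: splitting it into words and handling punctuation and case. In practice you may need a whole set of text-preparation methods, and which ones you choose depends on your natural language task. Below I describe how to transform text.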

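2.1 Overview

This chapter is divided into the following parts:

Metamorphosis by Franz Kafka
Text preparation is task specific
Manual tokenization
Tokenization and cleaning with NLTK
Additional text cleaning considerations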

2.2 Metamorphosis by Franz Kafka

First we need a dataset. I use the text of Metamorphosis by Franz Kafka.
For the location of the data, see my resources.
The story opens: One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.

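2.3 Text Preparation Is Task Specific

Once you have the text, how you prepare it depends on your goal or task. It also helps to get a rough idea of the characteristics of the text. This text has the following features:

It is plain text with no markup
It is entirely in English
It has no spelling errors
It has punctuation
It has hyphenated compounds (for example, armour-like)
It uses dashes to join consecutive sentences
It has names
It has section markers: Ⅰ, Ⅱ

Later I will walk through general text-cleaning steps. Before that, consider what the goal is. For example, if you are interested in building a Kafkaesque language model, you may want to keep the original case and all the punctuation; if you are classifying documents as Kafka or Not Kafka, you may want to normalize case, remove punctuation, and perhaps trim words back to their stems.
Choose how to prepare the text data according to the task.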

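2.4 Manual Tokenization

Text cleaning is hard, but the text chosen here is fairly clean. For messier text we can write Python code to clean it by hand; tools such as regular expressions and string splitting come up again and again in that work.

2.4.1 Load the Data

Let's load the text data.

with open('Metamorphosis.txt', 'r', encoding='utf-8') as f:
    text = f.read()  # read the whole file into one string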

2.4.2 Split by Whitespace

with open('Metamorphosis.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# split on any run of whitespace
words = text.split()
print(words[:100])

The output of print(words[:100]) is:

['<Metamorphosis>', 'One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', "&quot;What's", 'happened', 'to', 'me?&quot;', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper']

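This example splits the document into a long list of words and prints the first 100. Tokens such as "&quot;What's" and 'me?&quot;' contain the leftover HTML entity &quot; where a quotation mark should be, which shows that the text contains noise and needs cleaning first.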
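
The &quot; fragments in the output above look like HTML character references. Assuming that is what they are, one possible way to decode them before tokenizing is the standard library's html.unescape. This step is not part of the original walkthrough, only a sketch:

```python
import html

# Two noisy tokens copied from the output above.
noisy = ["&quot;What's", "me?&quot;"]

# html.unescape converts named and numeric character references to characters.
decoded = [html.unescape(token) for token in noisy]
print(decoded)
```

In a full script the call would more naturally be applied once to the whole text before splitting.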

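2.4.3 Select Words

Another approach is to use a regular expression to split the document into words.

import re
with open('Metamorphosis.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# split on every run of non-word characters
words = re.split(r'\W+', text)
print(words[:100])

The output of print(words[:100]) is: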

['', 'Metamorphosis', 'One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'quot', 'What', 's', 'happened', 'to', 'me', 'quot', 'he', 'thought', 'It', 'wasn', 't']

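Even though the document contains irregular characters, the regular expression filters out the punctuation. Meaningless tokens such as 'quot' still remain, and the list now starts with an empty string because the text begins with a non-word character. Another difference from the previous result is that hyphenated words such as 'armour-like' are no longer kept together: they are split into 'armour' and 'like', just as "wasn't" is split into 'wasn' and 't'.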
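
The empty string comes from how re.split treats text that starts or ends with a separator. A small sketch on a made-up sample sentence shows this, next to re.findall, which returns only the matched words and is one possible way to avoid the empty entries:

```python
import re

sample = "<Metamorphosis> He lay on his armour-like back."

# re.split yields an empty string wherever the text starts or ends
# with a run of non-word characters.
split_tokens = re.split(r'\W+', sample)
print(split_tokens)

# re.findall returns only the runs of word characters, so no empty strings appear.
found_tokens = re.findall(r'\w+', sample)
print(found_tokens)
```

Both calls still break 'armour-like' in two, so neither is suitable when hyphenated words must stay whole.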

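2.4.4 Split by Whitespace and Remove Punctuation

Here we want a sequence of words without punctuation, while keeping hyphenated words and contractions together as single tokens. One way is to split the document into words by whitespace and then replace every punctuation character in each word (the code below substitutes a space for each one). Python provides a constant named string.punctuation that gives a handy list of punctuation characters.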

import string
print(string.punctuation)

The result is:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

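We can use a regular expression to select the punctuation characters and the sub() function to replace each of them with a space.

# character class matching any single punctuation character
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punc.sub(' ', w) for w in words]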

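Putting the code together: load the text, split it into words by whitespace, and replace the punctuation in each word.

import re
import string

# load the text and split it on whitespace
with open('Metamorphosis.txt', 'r', encoding='utf-8') as f:
    text = f.read()
words = text.split()
print(words[:100])
# replace every punctuation character in each token with a space
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punc.sub(' ', w) for w in words]
print('*' * 100)
print(stripped[:100])

Comparing the results (original tokens first, then the processed tokens). Because each punctuation character is replaced with a space rather than removed, tokens such as 'morning ' and 'armour like' still contain spaces: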

['<Metamorphosis>', 'One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', "&quot;What's", 'happened', 'to', 'me?&quot;', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper']
****************************************************************************************************
[' Metamorphosis ', 'One', 'morning ', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams ', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin ', 'He', 'lay', 'on', 'his', 'armour like', 'back ', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly ', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections ', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment ', 'His', 'many', 'legs ', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him ', 'waved', 'about', 'helplessly', 'as', 'he', 'looked ', ' quot What s', 'happened', 'to', 'me  quot ', 'he', 'thought ', 'It', 'wasn t', 'a', 'dream ', 'His', 'room ', 'a', 'proper']
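
If the goal is tokens with no stray spaces, the punctuation can be deleted instead of replaced. The sketch below does this with str.maketrans and str.translate on a few tokens taken from the output above; it is an alternative to the regular-expression version, not the code used in this section:

```python
import string

tokens = ['morning,', 'armour-like', "wasn't", 'vermin.']

# Map every punctuation character to None so that translate() deletes it.
table = str.maketrans('', '', string.punctuation)
stripped_tokens = [token.translate(table) for token in tokens]
print(stripped_tokens)
```

Hyphenated words and contractions now stay together as single tokens ('armourlike', 'wasnt'), at the cost of losing the hyphen and the apostrophe.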

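Sometimes text data contains non-printable characters. We can use a similar approach to filter them out by matching the inverse of the string.printable constant. Note that string.printable covers only ASCII characters, so this pattern also matches non-ASCII characters such as accented letters.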

re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub(' ', w) for w in words]

The complete code:

import re
import string

with open('Metamorphosis.txt', 'r', encoding='utf-8') as f:
    text = f.read()
words = text.split()
print(words[:100])
print('*' * 100)
# replace every character outside string.printable with a space
re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub(' ', w) for w in words]
print(result[:100])

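The result is shown below. The two lists are identical because the first 100 tokens contain only printable ASCII characters: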

['<Metamorphosis>', 'One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', "&quot;What's", 'happened', 'to', 'me?&quot;', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper']
****************************************************************************************************
['<Metamorphosis>', 'One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', "&quot;What's", 'happened', 'to', 'me?&quot;', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper']
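
Since the excerpt above has nothing to filter, the effect is easier to see on made-up tokens. The sketch below uses three invented tokens and deletes the matched characters (an empty replacement string, which is an assumption that differs from the space used in the code above):

```python
import re
import string

# Made-up tokens: one with an accented letter, one with a NUL character, one clean.
tokens = ['caf\u00e9', 'tab\x00le', 'plain']

# Match any character that is NOT in string.printable.
re_print = re.compile('[^%s]' % re.escape(string.printable))
filtered = [re_print.sub('', token) for token in tokens]
print(filtered)
```

The accented letter is dropped along with the NUL character, which shows why this filter is only appropriate for text that is expected to be plain ASCII.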

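2.4.5 Normalizing Case

It is common to convert all words to one case. This shrinks the vocabulary, but it also loses some information (for example, the distinction between Apple the company and apple the fruit). We can convert every word to lowercase by calling the lower() method on each word. For example:

with open('Metamorphosis.txt', 'r', encoding='utf-8') as f:
    text = f.read()
words = text.split()
print(words[:100])
# convert every token to lowercase
words = [word.lower() for word in words]
print('*' * 100)
print(words[:100])

Before and after applying word.lower():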

['<Metamorphosis>', 'One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', "&quot;What's", 'happened', 'to', 'me?&quot;', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper']
****************************************************************************************************
['<metamorphosis>', 'one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', "&quot;what's", 'happened', 'to', 'me?&quot;', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper']
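
To make the effect on vocabulary size concrete, the sketch below counts distinct tokens before and after lowercasing. The token list is invented for illustration and is not taken from the book:

```python
tokens = ['The', 'bedding', 'the', 'His', 'his', 'Apple', 'apple']

# The vocabulary is the set of distinct tokens.
vocab_before = set(tokens)
vocab_after = {token.lower() for token in tokens}

print(len(vocab_before), len(vocab_after))
```

The vocabulary shrinks, and 'Apple' and 'apple' can no longer be told apart, which is exactly the information loss described above.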

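2.4.6 Notes on Cleaning Text

How you clean text depends on the specific problem and on the characteristics of the text; let the actual requirements decide which steps to apply. At the same time, simpler is better: simpler text data, simpler models, smaller vocabularies. The next section covers some of the tools in the NLTK library.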
