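Natural Language Processing | (4) English Text Processing with NLTK

This post introduces basic English text processing with NLTK. Later posts will cover more advanced models and methods, but these basics are worth mastering first: they are the preprocessing steps whose output becomes the input to those more advanced models and tools.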

1. Introduction to NLTK

2. English Tokenization

3. Stop Words

4. Part-of-Speech Tagging

5. Chunking

6. Named Entity Recognition

7. Stemming and Lemmatizing

8. WordNet and Word Senses

Complete code

1. Introduction to NLTK

2. English Tokenization

import nltk
from nltk import word_tokenize, sent_tokenize
import matplotlib
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline
matplotlib.use('Agg')

# Load the data
# Read the whole text file into a single string
with open('./data/text.txt', 'r') as f:
    corpus = f.read()
# Check the type
print("Type of corpus:", type(corpus))

# Split the text into sentences; returns a list
# nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
sentences = sent_tokenize(corpus)
print(sentences)

# Split the text into word tokens; returns a list
words = word_tokenize(corpus)
print(words[:20])
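To give a rough idea of what word tokenization involves, the sketch below compares two rule-based tokenizers that ship with NLTK and need no downloaded data: `wordpunct_tokenize` and `TreebankWordTokenizer`. The sample sentence is made up for illustration, and the output of `word_tokenize` on real text may differ in details.

```python
from nltk.tokenize import TreebankWordTokenizer, wordpunct_tokenize

sample = "Don't stop."  # made-up sample sentence

# Splits on every run of punctuation, so the contraction is broken into three pieces
punct_tokens = wordpunct_tokenize(sample)
print(punct_tokens)     # ['Don', "'", 't', 'stop', '.']

# Penn Treebank conventions keep "n't" together as one token
treebank_tokens = TreebankWordTokenizer().tokenize(sample)
print(treebank_tokens)  # ['Do', "n't", 'stop', '.']
```

Which convention is preferable depends on the downstream step; a POS tagger trained on Treebank-style tokens generally expects the second form.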

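3. Stop Words

For how stop words are produced and collected in machine learning, see the Zhihu discussion "How to collect stop words in machine learning" (机器学习中如何收集停用词).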

# Import NLTK's built-in stop words
from nltk.corpus import stopwords
# nltk.download('stopwords')  # must be downloaded locally first
stop_words = stopwords.words('english')  # all built-in English stop words
print(stop_words[:10])  # show the first 10

# Remove the stop words with a list comprehension
filter_corpus = [w for w in words if w not in stop_words]
print(filter_corpus[:20])

print("Total number of stop words removed:", len(words) - len(filter_corpus))
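The filter above compares each token to the stop-word list exactly as written. NLTK's English stop words are all lowercase, so capitalized tokens such as "The" and punctuation tokens pass through. A minimal sketch of a stricter filter follows; the tiny `stop_words` set and `words` list are hypothetical stand-ins for `stopwords.words('english')` and the tokenized corpus.

```python
# Hypothetical stand-ins for stopwords.words('english') and the tokenized corpus
stop_words = {"the", "is", "a", "of", "and"}
words = ["The", "cat", "is", "on", "a", "mat", ".", "the", "end"]

# Case-sensitive check (as above): "The" survives because the list is lowercase
case_sensitive = [w for w in words if w not in stop_words]

# Lowercase before the lookup and keep alphabetic tokens only
normalized = [w for w in words if w.isalpha() and w.lower() not in stop_words]

print(case_sensitive)  # ['The', 'cat', 'on', 'mat', '.', 'end']
print(normalized)      # ['cat', 'on', 'mat', 'end']
```

Storing the stop words in a set rather than a list also makes each membership test a constant-time lookup, which helps on larger corpora.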

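4. Part-of-Speech Tagging

# POS tagging
from nltk import pos_tag
# nltk.download('averaged_perceptron_tagger')  # must be downloaded locally first
# Note: the tagger uses neighboring words as context, so tagging the original
# token list (words) rather than the stop-word-filtered one is usually more accurate
tags = pos_tag(filter_corpus)
print(tags[:20])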
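`pos_tag` returns a list of `(word, tag)` tuples, which makes simple summaries easy. The sketch below counts how often each tag occurs and attaches a description to it. The tagged pairs are hand-written for illustration rather than real tagger output, and `tag_meanings` covers only a small subset of the Penn Treebank tag set.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape pos_tag returns; not real tagger output
tags = [("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
        ("lazy", "JJ"), ("dog", "NN"), ("quickly", "RB")]

# A small subset of Penn Treebank tag meanings
tag_meanings = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "RB": "adverb",
    "VBZ": "verb, 3rd person singular present",
}

# Count the tags and print each one with its description
tag_counts = Counter(tag for _, tag in tags)
for tag, count in tag_counts.most_common():
    print(tag, count, tag_meanings.get(tag, "unknown"))
```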

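The POS tag codes and their meanings are listed in the following correspondence table (the tags follow the Penn Treebank tag set, for example NN for a singular noun and JJ for an adjective):

5. Chunking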

import nltk
from nltk.chunk import RegexpParser

# Pattern that matches noun phrases (NP):
# JJ adjectives + NN nouns, or JJ adjectives + NN noun + CC conjunction + NN nouns
# Each rule must sit on its own line inside the grammar string
pattern = """
NP: {<JJ>*<NN>+}
    {<JJ>*<NN><CC>*<NN>+}
"""
# Define the chunk parser
chunker = RegexpParser(pattern)
# A piece of text (a string)
text = """
The National Wrestling Association was an early professional wrestling sanctioning body created in 1930 by
the National Boxing Association (NBA) (now the World Boxing Association, WBA) as an attempt to create
a governing body for professional wrestling in the United States. The group created a number of "World" level
championships as an attempt to clear up the professional wrestling rankings which at the time saw a number of
different championships promoted as the "true world championship". The National Wrestling Association's NWA
World Heavyweight Championship was later considered part of the historical lineage of the National Wrestling
Alliance's NWA World Heavyweight Championship when then National Wrestling Association champion Lou Thesz
won the National Wrestling Alliance championship, folding the original championship into one title in 1949."""
# Split into sentences; returns a list
tokenized_sentence = nltk.sent_tokenize(text)
# Tokenize each sentence; returns a nested list
tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]
# POS-tag each sentence
tagged_words = [nltk.pos_tag(word) for word in tokenized_words]
# Identify the NP chunks defined above
word_tree = [chunker.parse(word) for word in tagged_words]
word_tree[0].draw()  # opens a pop-up window showing the parse tree of the first sentence
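`draw()` needs a graphical display, so it is not always convenient. As a minimal sketch, the chunk tree can also be printed as text and the NP chunks collected into a list. The sentence below is tagged by hand so that no tokenizer or tagger data is required, and the grammar is a simplified variant (with an optional determiner) rather than the pattern used above.

```python
from nltk.chunk import RegexpParser

# Hand-tagged sentence, so no tokenizer or tagger data is needed
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Optional determiner, any number of adjectives, then one or more nouns
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>+}")
tree = chunker.parse(tagged)

# Print the tree as text instead of opening a window
print(tree)

# Collect the words of every NP subtree
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)  # ['the little dog', 'the cat']
```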

6. Named Entity Recognition

from nltk import ne_chunk, pos_tag, word_tokenize
# nltk.download('maxent_ne_chunker')  # must be downloaded locally first
# nltk.download('words')
sentence = 'CoreJT studies at Stanford University.'
# Tokenize, POS-tag, and run named entity recognition on the sentence/text in turn
print(ne_chunk(pos_tag(word_tokenize(sentence))))
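`ne_chunk` returns an `nltk.tree.Tree` in which each recognized entity is a labeled subtree and every other token stays a plain `(word, tag)` tuple. The sketch below shows one way to pull the entities out of such a tree. The tree here is built by hand so the example runs without the NER models; its labels are assumptions for illustration, and the real output for this sentence may differ.

```python
from nltk.tree import Tree

# Hand-built tree in the shape ne_chunk returns; labels are assumed, not real output
ne_tree = Tree("S", [
    Tree("PERSON", [("CoreJT", "NNP")]),
    ("studies", "VBZ"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Stanford", "NNP"), ("University", "NNP")]),
    (".", "."),
])

# Entity chunks are subtrees; ordinary tokens are plain (word, tag) tuples
entities = [(" ".join(word for word, tag in node.leaves()), node.label())
            for node in ne_tree if isinstance(node, Tree)]
print(entities)
```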

For named entity recognition, we also strongly recommend using the Stanford CoreNLP modules as the NER backend for NLTK; in general they are faster and more accurate.

7. Stemming and Lemmatizing

# One option is PorterStemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('running'))
print(stemmer.stem('makes'))
print(stemmer.stem('tagged'))

# Another option is SnowballStemmer
from nltk.stem import SnowballStemmer
stemmer1 = SnowballStemmer('english')  # select the English stemmer
print(stemmer1.stem('growing'))

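# Lemmatization is similar to stemming, but it also takes information such as word meaning into account
# Stemming is faster because it only applies a fixed set of rules
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # must be downloaded locally first
lemmatizer = WordNetLemmatizer()
# lemmatize() treats the word as a noun by default; pass pos='v' for verbs
print(lemmatizer.lemmatize('makes'))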
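Because stemmers are purely rule-based, they run without any downloaded corpus, which makes them easy to compare side by side. The sketch below runs the Porter and Snowball stemmers over a small made-up word list; the variable names are chosen here for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Made-up word list for illustration
sample_words = ["running", "makes", "tagged", "growing"]

# Map each word to its (Porter, Snowball) stems
stems = {w: (porter.stem(w), snowball.stem(w)) for w in sample_words}
for word, (p, s) in stems.items():
    print(f"{word:10} porter={p:8} snowball={s}")
```

On simple inflected forms like these the two stemmers agree; they can diverge on other words, and neither guarantees that the stem is a dictionary word.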

8. WordNet and Word Senses

from nltk.corpus import wordnet as wn
print(wn.synsets('man'))  # list the senses (synsets) of the word "man"
print(wn.synsets('man')[0].definition())  # definition of the first sense
print(wn.synsets('man')[1].definition())  # definition of the second sense

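print(wn.synsets('dog'))  # list the senses (synsets) of the word "dog"
print(wn.synsets('dog')[0].definition())  # definition of the first sense
# Get an example sentence for the first sense
dog = wn.synsets('dog')[0]
# Alternatively: dog = wn.synset('dog.n.01')
print(dog.examples()[0])

# Look up the hypernyms of dog
print(dog.hypernyms())  # canine; domestic animal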
