Python Data Analysis, Part 6: Text Data Analysis

1. NLTK, a Python Text Analysis Toolkit

NLTK (Natural Language Toolkit)

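NLTK is one of the most widely used Python libraries in NLP (natural language processing). It is an open-source project, ships with tokenization and classification tools, and has strong community support.

1.1 Installing NLTK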

pip install nltk

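The corpora are installed separately by running the following from a Python prompt. If the download fails, the data can be downloaded offline instead.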

import nltk
nltk.download()

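1.2 Text Preprocessing

1.2.1 Tokenization

Tokenization splits a sentence into linguistically meaningful words. English words can be separated on spaces, but Chinese has no such delimiter and is harder to segment, so a Chinese segmentation tool such as Jieba ("结巴分词") is used. Special characters can be handled with regular expressions.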

import nltk
from nltk.corpus import brown
# Requires the brown corpus to be downloaded
# Load the Brown University corpus

# List the categories in the corpus
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

# Inspect the brown corpus
print('{} sentences in total'.format(len(brown.sents())))
print('{} words in total'.format(len(brown.words())))

57340 sentences in total
1161192 words in total

sentence = "Python is a widely used high-level programming language for general-purpose programming."
tokens = nltk.word_tokenize(sentence)  # requires the punkt tokenizer model
print(tokens)

['Python', 'is', 'a', 'widely', 'used', 'high-level', 'programming', 'language', 'for', 'general-purpose', 'programming', '.']
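
The regular-expression handling of special characters mentioned earlier can be sketched with the standard library alone. The pattern and the helper name `simple_tokenize` are illustrative assumptions, not part of NLTK, and the result is cruder than `nltk.word_tokenize` (punctuation is discarded rather than kept as separate tokens).

```python
import re

# Hypothetical pattern: runs of letters/digits, optionally joined by
# internal hyphens or apostrophes; every other character is dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def simple_tokenize(text):
    """Return the word-like tokens found in text, discarding punctuation."""
    return TOKEN_PATTERN.findall(text)


sentence = "Python is a widely used high-level programming language!!! #nlp"
tokens = simple_tokenize(sentence)
print(tokens)
```

A pattern like this only suits text written with ASCII letters; Chinese text still needs a dedicated segmenter.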

Jieba segmentation

# Install with: pip install jieba
import jieba

seg_list = jieba.cut("欢迎进入大学", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("欢迎进入大学", cut_all=False)
print("Accurate mode: " + "/ ".join(seg_list))  # accurate mode

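1.2.2 Word Form Normalization

An English word such as "look" appears in different inflected forms ("looked", "look", "looking") depending on context. Counting these forms as separate words hurts the accuracy of the corpus statistics, so word forms need to be normalized. The two usual techniques are stemming and lemmatization.

Stemming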

# PorterStemmer
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))

look
look

# SnowballStemmer
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('looked'))
print(snowball_stemmer.stem('looking'))

look
look

# LancasterStemmer
from nltk.stem.lancaster import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('looked'))
print(lancaster_stemmer.stem('looking'))

look
look
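
To make the idea of stemming concrete, here is a toy suffix stripper. It is only a sketch under simple assumptions (three hard-coded suffixes, a minimum stem length of three characters) and is far cruder than the Porter, Snowball, or Lancaster algorithms; `toy_stem` is a made-up name.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["looked", "looking", "looks", "look", "went"]:
    print(w, "->", toy_stem(w))
```

Rule-based stripping cannot relate an irregular form such as "went" to "go", which is the gap that dictionary-based lemmatization fills.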

Lemmatization

from nltk.stem import WordNetLemmatizer  # requires the wordnet corpus

wordnet_lematizer = WordNetLemmatizer()
print(wordnet_lematizer.lemmatize('cats'))
print(wordnet_lematizer.lemmatize('boxes'))
print(wordnet_lematizer.lemmatize('are'))
print(wordnet_lematizer.lemmatize('went'))

cat
box
are
went

# Specifying the part of speech gives a more accurate lemma
# lemmatize treats the word as a noun by default
print(wordnet_lematizer.lemmatize('are', pos='v'))
print(wordnet_lematizer.lemmatize('went', pos='v'))

be
go

1.2.3 Part-of-Speech Tagging

import nltk

words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words))  # requires averaged_perceptron_tagger

[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('widely', 'RB'), ('used', 'VBN'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]

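1.2.4 Stop Words

To save storage space and improve computational efficiency, NLP pipelines often filter out certain characters or words automatically. These are called stop words. They are mainly function words such as "the" and "is", and sometimes common lexical words such as "want".

Removing stop words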

from nltk.corpus import stopwords  # requires the stopwords corpus

filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Original words:', words)
print('After removing stop words:', filtered_words)

Original words: ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
After removing stop words: ['Python', 'widely', 'used', 'programming', 'language', '.']
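
Two details of the filter above are worth a sketch. NLTK's English stop word list is lowercase, so a capitalized word such as "The" passes through a case-sensitive check, and the comprehension calls `stopwords.words('english')` again for every word. The version below assumes a tiny hand-written stop word set (`STOP_WORDS` is hypothetical, not NLTK's list) so that it runs without any corpus download; it builds the set once and compares in lowercase.

```python
# Hypothetical, deliberately tiny stop word set; NLTK's English list is much longer.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "you", "what"})


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word whose lowercase form is in stop_words."""
    return [word for word in words if word.lower() not in stop_words]


words = ["You", "never", "know", "what", "The", "box", "is"]
filtered = remove_stop_words(words)
print(filtered)
```

With NLTK installed, the same function could be called with `frozenset(stopwords.words('english'))` as the second argument.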

1.2.5 A Typical Text Preprocessing Pipeline

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'

# Tokenize
raw_words = nltk.word_tokenize(raw_text)

# Normalize word forms
wordnet_lematizer = WordNetLemmatizer()
words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]

# Remove stop words
filtered_words = [word for word in words if word not in stopwords.words('english')]

print('Raw text:', raw_text)
print('Preprocessed result:', filtered_words)

Raw text: Life is like a box of chocolates. You never know what you're gonna get.
Preprocessed result: ['Life', 'like', 'box', 'chocolate', '.', 'You', 'never', 'know', "'re", 'gon', 'na', 'get', '.']

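2. Sentiment Analysis

How natural language processing works: natural language is converted into a form that programs can handle more easily, by vectorizing the strings produced by preprocessing.

Sentiment analysis is done in one of two ways: the brute-force route of building a sentiment dictionary, or training a machine learning classifier.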
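
The dictionary approach can be sketched in a few lines. The lexicon below is a hypothetical five-word example with hand-assigned scores, and `lexicon_score` and `lexicon_label` are made-up names; a usable sentiment dictionary would be far larger and would also need to handle negation.

```python
# Hypothetical mini-lexicon: positive words score +1, negative words -1.
SENTIMENT_LEXICON = {"like": 1, "good": 1, "great": 1, "bad": -1, "terrible": -1}


def lexicon_score(text, lexicon=SENTIMENT_LEXICON):
    """Sum the lexicon scores of the lowercase, punctuation-stripped words."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return sum(lexicon.get(w, 0) for w in words)


def lexicon_label(text):
    """Return 1 when the total score is positive, 0 otherwise."""
    return 1 if lexicon_score(text) > 0 else 0


for text in ["That is a good movie.", "This is a terrible movie.", "That is a bad one."]:
    print(text, "->", lexicon_label(text))
```

Texts with no lexicon words score zero and fall into class 0 here, which is an arbitrary choice made for this sketch.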

# A simple example
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier

text1 = 'I like the movie so much!'
text2 = 'That is a good movie.'
text3 = 'This is a great one.'
text4 = 'That is a really bad movie.'
text5 = 'This is a terrible movie.'

def proc_text(text):
    """Preprocess the text."""
    # Tokenize
    raw_words = nltk.word_tokenize(text)
    # Normalize word forms
    wordnet_lematizer = WordNetLemmatizer()
    words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]
    # Remove stop words
    filtered_words = [word for word in words if word not in stopwords.words('english')]
    # True marks that the word occurs in the text; NLTK classifiers expect this feature format
    return {word: True for word in filtered_words}

# Build the training samples (1 = positive, 0 = negative)
train_data = [[proc_text(text1), 1],
              [proc_text(text2), 1],
              [proc_text(text3), 1],
              [proc_text(text4), 0],
              [proc_text(text5), 0]]

# Train the model
nb_model = NaiveBayesClassifier.train(train_data)

# Test the model
text6 = 'That is a bad one.'
print(nb_model.classify(proc_text(text6)))

The final line prints the label predicted for text6 (1 for positive, 0 for negative).

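3. Text Similarity

Text similarity measures how alike two texts are. Word frequency serves as the text feature: each text is represented as a vector of how often words occur in it, and the similarity between two vectors can be computed with the cosine formula.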

import nltk
from nltk import FreqDist

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

text = text1 + text2 + text3 + text4 + text5
words = nltk.word_tokenize(text)
freq_dist = FreqDist(words)
print(freq_dist['is'])

# Take the n = 5 most common words
n = 5

# Build the "common word list"
most_common_words = freq_dist.most_common(n)
print(most_common_words)

[('movie', 4), ('is', 4), ('a', 4), ('That', 2), ('This', 2)]

def lookup_pos(most_common_words):
    """Map each common word to its position in the list."""
    result = {}
    pos = 0
    for word in most_common_words:
        result[word[0]] = pos
        pos += 1
    return result

# Record the positions
std_pos_dict = lookup_pos(most_common_words)
print(std_pos_dict)

{'movie': 0, 'is': 1, 'a': 2, 'That': 3, 'This': 4}

# New text
new_text = 'That one is a good movie. This is so good!'

# Initialize the vector
freq_vec = [0] * n

# Tokenize
new_words = nltk.word_tokenize(new_text)

# Count word frequencies over the "common word list"
for new_word in new_words:
    if new_word in std_pos_dict:
        freq_vec[std_pos_dict[new_word]] += 1

print(freq_vec)

[1, 2, 1, 1, 1]
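
The cosine formula is cos(a, b) = (a · b) / (|a| |b|). As a minimal sketch, the function below applies it to two frequency vectors over the same five words (movie, is, a, That, This). The first vector is the one printed above; the second was counted by hand for 'That is a good movie' and is included only for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


new_vec = [1, 2, 1, 1, 1]    # frequency vector of the new text
text2_vec = [1, 1, 1, 1, 0]  # hand-counted vector for 'That is a good movie'
similarity = cosine_similarity(new_vec, text2_vec)
print(similarity)
```

A value close to 1 means the two texts use the common words in similar proportions; a value of 0 means they share none of them.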

4. Text Classification

This section introduces TF-IDF (term frequency-inverse document frequency).

from nltk.text import TextCollection

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

# Build the TextCollection object
# Note: the texts are passed as raw strings rather than token lists, so the
# term frequency is the substring count divided by the number of characters.
tc = TextCollection([text1, text2, text3, text4, text5])
new_text = 'That one is a good movie. This is so good!'
word = 'That'
tf_idf_val = tc.tf_idf(word, new_text)
print('TF-IDF value of {}: {}'.format(word, tf_idf_val))

TF-IDF value of That: 0.02181644599700369
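
To show what the score is made of, here is a minimal pure-Python sketch that assumes the common definition tf = (count of the term in the text) / (number of tokens) and idf = ln(number of texts / number of texts containing the term). It tokenizes on whitespace, so its values need not match the one printed above; the function names are illustrative.

```python
import math


def tf(term, tokens):
    """Share of the tokens in one text that equal term."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term, corpus):
    """Natural log of (number of texts / number of texts containing term)."""
    matches = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / matches) if matches else 0.0


def tf_idf(term, tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, tokens) * idf(term, corpus)


texts = [
    'I like the movie so much',
    'That is a good movie',
    'This is a great one',
    'That is a really bad movie',
    'This is a terrible movie',
]
corpus = [text.split() for text in texts]
new_tokens = 'That one is a good movie This is so good'.split()
score = tf_idf('That', new_tokens, corpus)
print(score)
```

A word that appears in every text gets an idf of ln(1) = 0, so very common words contribute nothing to the score.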

5. As a next step, a Bayesian classifier can be used to classify text data.
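
As a rough sketch of what such a classifier does, the class below implements multinomial naive Bayes with add-one smoothing in plain Python. It is an illustration under simple assumptions (whitespace tokenization, lowercase words, unseen words ignored), not NLTK's `NaiveBayesClassifier`, and `TinyNaiveBayes` and `tokenize` are made-up names.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in stripped if w]


train = [
    ('I like the movie so much!', 1),
    ('That is a good movie.', 1),
    ('This is a great one.', 1),
    ('That is a really bad movie.', 0),
    ('This is a terrible movie.', 0),
]
model = TinyNaiveBayes().fit([tokenize(t) for t, _ in train], [y for _, y in train])
prediction = model.predict(tokenize('That is a terrible movie.'))
print(prediction)
```

Working in log probabilities avoids numeric underflow when many small probabilities are multiplied, and add-one smoothing keeps a word that is missing from one class from forcing that class's probability to zero.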