

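In recent years there have been many exciting new developments in text processing, all of which are outside the scope of this book and all of which relate to neural networks. The first is the use of continuous vector representations, also known as word vectors or distributed word representations, as implemented in the word2vec library.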

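Another direction in NLP that has picked up momentum in recent years is the use of recurrent neural networks (RNNs) for text processing. RNNs are a particularly powerful type of neural network that can produce output that is again text, in contrast to classification models that can only assign class labels. The ability to produce text as output makes RNNs well suited for automatic translation and summarization.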


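In the context of text analysis, the dataset is often called the corpus, and each data point, represented as a single text, is called a document. These terms come from the information retrieval (IR) and natural language processing (NLP) communities, both of which deal mostly with text data.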



import numpy as np

from sklearn.datasets import load_files

reviews_train = load_files("data/aclImdb/train/")

# load_files returns a Bunch containing the training texts and training labels

text_train, y_train = reviews_train.data, reviews_train.target

print("type of text_train: {}".format(type(text_train)))

print("length of text_train: {}".format(len(text_train)))


type of text_train: <class 'list'>

length of text_train: 25000


b"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.<br /><br />Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life.<br /><br />I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."


# The reviews contain HTML line breaks; replace them with spaces

text_train = [doc.replace(b"<br />", b" ") for doc in text_train]


array([0, 1])

print("Samples per class (training): {}".format(np.bincount(y_train)))

Samples per class (training): [12500 12500]


reviews_test = load_files("data/aclImdb/test/")

text_test, y_test = reviews_test.data, reviews_test.target

print("Number of documents in test data: {}".format(len(text_test)))

print("Samples per class (test): {}".format(np.bincount(y_test)))

text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Number of documents in test data: 25000

Samples per class (test): [12500 12500]





Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
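The vocabulary-building step can be sketched in a few lines of plain Python. This is only an illustration on a made-up two-document corpus; the `tokenize` helper is hypothetical and merely similar in spirit to the default token pattern of `CountVectorizer`.

```python
import re

# Toy corpus, made up for illustration
docs = ["The cat sat", "the dog sat down"]

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Collect every distinct token, then number the tokens in alphabetical order
unique_tokens = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(unique_tokens)}

print(vocabulary)
```

Each word receives one integer index, and that index later becomes the column of the feature for that word.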


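The following shows the process on the string "This is how you get ants." The output is one vector of word counts for each document: for each word in the vocabulary, we have a count of how often it appears in each document. That means the numeric representation has one feature for each unique word in the whole dataset. Note that the order of the words in the original string is completely irrelevant to the bag-of-words feature representation.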
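To make the point about word order concrete, here is a minimal sketch that encodes the example string and a reordered variant by hand. The `tokenize` and `encode` helpers are hypothetical and are not part of scikit-learn.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def encode(text, vocabulary):
    # One count per vocabulary word, in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vocabulary = sorted(set(tokenize("This is how you get ants.")))
original = encode("This is how you get ants.", vocabulary)
reordered = encode("Ants get you: this is how.", vocabulary)

print(vocabulary)
print(original, reordered)
```

Both strings map to the same count vector, because the representation only records how often each word occurs.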


bards_words = ["The fool doth think he is wise,",

               "but the wise man knows himself to be a fool"]

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

# Fitting tokenizes the training data and builds the vocabulary

vect.fit(bards_words)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',

dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',

lowercase=True, max_df=1.0, max_features=None, min_df=1,

ngram_range=(1, 1), preprocessor=None, stop_words=None,

strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',

tokenizer=None, vocabulary=None)

print("Vocabulary size: {}".format(len(vect.vocabulary_)))

print("Vocabulary content:\n{}".format(vect.vocabulary_))

Vocabulary size: 13

Vocabulary content:

{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}

bag_of_words = vect.transform(bards_words)

print("bag_of_words: {}".format(repr(bag_of_words)))

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'

with 16 stored elements in Compressed Sparse Row format>

print("Dense representation of bag_of_words:\n{}".format(

    bag_of_words.toarray()))


Dense representation of bag_of_words:

[[0 0 1 1 1 0 1 0 0 1 1 0 1]

[1 1 0 1 0 1 0 1 1 1 0 1 1]]



vect = CountVectorizer().fit(text_train)

X_train = vect.transform(text_train)



<25000x74849 sparse matrix of type '<class 'numpy.int64'>'

with 3431196 stored elements in Compressed Sparse Row format>

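# Note: get_feature_names() was removed in scikit-learn 1.2;

# on newer versions call get_feature_names_out() instead.

feature_names = vect.get_feature_names()

print("Number of features: {}".format(len(feature_names)))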

print("First 20 features:\n{}".format(feature_names[:20]))

print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))

print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 74849

First 20 features:

['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02']

Features 20010 to 20030:

['dratted', 'draub', 'draught', 'draughts', 'draughtswoman', 'draw', 'drawback', 'drawbacks', 'drawer', 'drawers', 'drawing', 'drawings', 'drawl', 'drawled', 'drawling', 'drawn', 'draws', 'draza', 'dre', 'drea']

Every 2000th feature:

['00', 'aesir', 'aquarian', 'barking', 'blustering', 'bête', 'chicanery', 'condensing', 'cunning', 'detox', 'draper', 'enshrined', 'favorit', 'freezer', 'goldman', 'hasan', 'huitieme', 'intelligible', 'kantrowitz', 'lawful', 'maars', 'megalunged', 'mostey', 'norrland', 'padilla', 'pincher', 'promisingly', 'receptionist', 'rivals', 'schnaas', 'shunning', 'sparse', 'subset', 'temptations', 'treatises', 'unproven', 'walkman', 'xylophonist']

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)

print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.88

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)

grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))

print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.89

Best parameters: {'C': 0.1}

X_test = vect.transform(text_test)

print("Test score: {:.2f}".format(grid.score(X_test, y_test)))

Test score: 0.88

vect = CountVectorizer(min_df=5).fit(text_train)

X_train = vect.transform(text_train)

print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'

with 3354014 stored elements in Compressed Sparse Row format>
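The `min_df=5` setting above keeps only tokens that appear in at least five documents. A minimal pure-Python sketch of that filtering idea follows, on a made-up tokenized corpus and with a smaller threshold; the helper `filter_by_min_df` is hypothetical and is not how scikit-learn implements it.

```python
from collections import Counter

# Toy tokenized corpus, made up for illustration
docs = [["good", "movie"],
        ["bad", "movie"],
        ["good", "plot", "movie"],
        ["zzyzx"]]

def filter_by_min_df(tokenized_docs, min_df):
    # Document frequency: the number of documents that contain each token.
    # set(doc) makes sure repeats inside one document are counted once.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return sorted(token for token, n in doc_freq.items() if n >= min_df)

kept = filter_by_min_df(docs, min_df=2)
print(kept)
```

Tokens that occur in a single document, such as misspellings or rare names, are dropped, which is why the feature count above shrinks so sharply.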

feature_names = vect.get_feature_names()

print("First 50 features:\n{}".format(feature_names[:50]))

print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))

print("Every 700th feature:\n{}".format(feature_names[::700]))

First 50 features:

['00', '000', '007', '00s', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1000', '100th', '101', '102', '103', '104', '105', '107', '108', '10s', '10th', '11', '110', '112', '116', '117', '11th', '12', '120', '12th', '13', '135', '13th', '14', '140', '14th', '15', '150', '15th', '16', '160', '1600', '16mm', '16s', '16th']

Features 20010 to 20030:

['repentance', 'repercussions', 'repertoire', 'repetition', 'repetitions', 'repetitious', 'repetitive', 'rephrase', 'replace', 'replaced', 'replacement', 'replaces', 'replacing', 'replay', 'replayable', 'replayed', 'replaying', 'replays', 'replete', 'replica']

Every 700th feature:

['00', 'affections', 'appropriately', 'barbra', 'blurbs', 'butchered', 'cheese', 'commitment', 'courts', 'deconstructed', 'disgraceful', 'dvds', 'eschews', 'fell', 'freezer', 'goriest', 'hauser', 'hungary', 'insinuate', 'juggle', 'leering', 'maelstrom', 'messiah', 'music', 'occasional', 'parking', 'pleasantville', 'pronunciation', 'recipient', 'reviews', 'sas', 'shea', 'sneers', 'steiger', 'swastika', 'thrusting', 'tvs', 'vampyre', 'westerns']

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)

grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.89



from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))

print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318

Every 10th stopword:

['they', 'of', 'who', 'found', 'none', 'co', 'full', 'otherwise', 'never', 'have', 'she', 'neither', 'whereby', 'one', 'any', 'de', 'hence', 'wherever', 'whose', 'him', 'which', 'nine', 'still', 'from', 'here', 'what', 'everything', 'us', 'etc', 'mine', 'find', 'most']

# Specifying stop_words="english" uses the built-in list.

# We could also augment it and pass our own.

vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)

X_train = vect.transform(text_train)

print("X_train with stop words:\n{}".format(repr(X_train)))

X_train with stop words:

<25000x26966 sparse matrix of type '<class 'numpy.int64'>'

with 2149958 stored elements in Compressed Sparse Row format>

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)

grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.88
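The code comment above mentions that the built-in stop word list can be augmented and passed in directly. One way this might look is sketched below. The extra words and the two toy reviews are assumptions chosen for illustration, not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical additions: words that are frequent in movie reviews
# but say little about sentiment
extra_stop_words = ["movie", "film"]

# stop_words accepts a list, so convert the combined set
custom_stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Toy reviews, made up for illustration
toy_reviews = ["This movie was a wonderful film", "The film was dreadful"]

vect_custom = CountVectorizer(stop_words=custom_stop_words).fit(toy_reviews)
print(sorted(vect_custom.vocabulary_))
```

Whether such domain-specific additions help would need to be checked with the same grid search used above.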



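Instead of dropping features that are deemed unimportant, another approach is to rescale features by how informative we expect them to be. One of the most common ways to do this is the term frequency-inverse document frequency (tf-idf) method. It gives high weight to any term that appears often in a particular document, but not in many documents in the corpus. If a word appears often in a particular document, but not in very many documents, it is likely to be very descriptive of the content of that document.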
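The weighting can be sketched by hand. The example below assumes the smoothed idf variant that scikit-learn documents as its default (`smooth_idf=True`), idf = ln((1 + N) / (1 + df)) + 1, and applies no length normalization, which corresponds to `norm=None`. The corpus and the `idf` and `tfidf` helpers are made up for illustration.

```python
import math

# Toy tokenized corpus, made up for illustration
docs = [["the", "plot", "was", "great"],
        ["the", "acting", "was", "bad"],
        ["the", "plot", "twist"]]

n_docs = len(docs)

def idf(word):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1
    doc_freq = sum(word in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(word, doc):
    # Raw term count times idf, without length normalization
    return doc.count(word) * idf(word)

for word in ["the", "plot", "twist"]:
    print(word, round(tfidf(word, docs[2]), 3))
```

All three words occur once in the last document, yet "twist", which appears in no other document, gets the largest weight, while "the", which appears everywhere, gets the smallest.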




from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(min_df=5, norm=None), LogisticRegression())


param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5)

grid.fit(text_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.89

vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]

# transform the training dataset:

X_train = vectorizer.transform(text_train)

# find maximum value for each of the features over dataset:

max_value = X_train.max(axis=0).toarray().ravel()

sorted_by_tfidf = max_value.argsort()

# get feature names

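# (on scikit-learn 1.2 and later, use get_feature_names_out() instead)

feature_names = np.array(vectorizer.get_feature_names())

print("Features with lowest tfidf:\n{}".format(

    feature_names[sorted_by_tfidf[:20]]))

print("Features with highest tfidf:\n{}".format(

    feature_names[sorted_by_tfidf[-20:]]))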


Features with lowest tfidf:

['poignant' 'disagree' 'instantly' 'importantly' 'lacked' 'occurred'

'currently' 'altogether' 'nearby' 'undoubtedly' 'directs' 'fond' 'stinker'

'avoided' 'emphasis' 'commented' 'disappoint' 'realizing' 'downhill'


Features with highest tfidf:

['coop' 'homer' 'dillinger' 'hackenstein' 'gadget' 'taker' 'macarthur'

'vargas' 'jesse' 'basket' 'dominick' 'the' 'victor' 'bridget' 'victoria'

'khouri' 'zizek' 'rob' 'timon' 'titanic']

sorted_by_idf = np.argsort(vectorizer.idf_)

print("Features with lowest idf:\n{}".format(

    feature_names[sorted_by_idf[:100]]))


Features with lowest idf:

['the' 'and' 'of' 'to' 'this' 'is' 'it' 'in' 'that' 'but' 'for' 'with'

'was' 'as' 'on' 'movie' 'not' 'have' 'one' 'be' 'film' 'are' 'you' 'all'

'at' 'an' 'by' 'so' 'from' 'like' 'who' 'they' 'there' 'if' 'his' 'out'

'just' 'about' 'he' 'or' 'has' 'what' 'some' 'good' 'can' 'more' 'when'

'time' 'up' 'very' 'even' 'only' 'no' 'would' 'my' 'see' 'really' 'story'

'which' 'well' 'had' 'me' 'than' 'much' 'their' 'get' 'were' 'other'

'been' 'do' 'most' 'don' 'her' 'also' 'into' 'first' 'made' 'how' 'great'

'because' 'will' 'people' 'make' 'way' 'could' 'we' 'bad' 'after' 'any'

'too' 'then' 'them' 'she' 'watch' 'think' 'acting' 'movies' 'seen' 'its'






feature_names, n_top_features=40)

