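When you type "cute kitten" into Google Image Search, how does Google know what you are looking for? The phrase is closely associated with pictures of cute kittens. When you type "dead parrot" into the YouTube search box, how does YouTube know to recommend sketches by the Monty Python comedy troupe? Because every uploaded video comes with a title and a description.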

Summarizing Data

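In Chapter 7 we looked at how to break text content into n-grams, that is, sequences of n words. At the most basic level, this collection can be used to determine which words and phrases are used most often in a piece of text. It can also be used to build a plausible-sounding summary by going back to the original text and extracting the sentences around those most common phrases.

The sample text we will summarize is the inauguration speech of William Henry Harrison, the ninth president of the United States. Harrison's presidency set two records in the history of the office: the longest inauguration speech and the shortest time in office, 32 days.

We will use the full text of his speech (http://pythonscraping.com/files/inaugurationSpeech.txt) as the data source for many of the code examples in this chapter.

With a small change to the n-gram code from Chapter 7, we can obtain frequency data for 2-grams, and then sort the 2-gram frequency dictionary using Python's operator module: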


from urllib.request import urlopen
import re
import string
import operator


def cleanInput(input):
    # Collapse newlines, drop citation markers such as [12], squeeze spaces
    input = re.sub(r'\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(r' +', " ", input)
    # Discard any non-ASCII characters
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        # Keep single-letter words only if they are "a" or "i"
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput


def getNgrams(input, n):
    # Map each n-gram to the number of times it occurs
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output


content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
# Sort the (ngram, count) pairs by count, most frequent first
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNGrams)

Output:

[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ('of our', 29), ('to be', 26), ('from the', 24), ('the people', 24), ('and the', 23), ('it is', 23), ('that the', 23), ('of a', 22), ('of their', 19) ...

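Phrases such as "of the", "in the", and "to the" occur most often, but they do not look important.

A list of the 5,000 most common words is freely available and is more than enough as a basic filter for the most frequent 2-grams. In fact, the first 100 words alone improve the results considerably. We add an isCommon function to apply this filter: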


from urllib.request import urlopen
import re
import string
import operator


def isCommon(ngram):
    # ngram is a list of words; return True if any of them is a common word
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i",
        "that", "for", "you", "he", "with", "on", "do", "say", "this", "they",
        "is", "an", "at", "but", "we", "his", "from", "that", "not", "by",
        "she", "or", "as", "what", "go", "their", "can", "who", "get", "if",
        "would", "her", "all", "my", "make", "about", "know", "will", "as",
        "up", "one", "time", "has", "been", "there", "year", "so", "think",
        "when", "which", "them", "some", "me", "people", "take", "out", "into",
        "just", "see", "him", "your", "come", "could", "now", "than", "like",
        "other", "how", "then", "its", "our", "two", "more", "these", "want",
        "way", "look", "first", "also", "new", "because", "day", "more", "use",
        "no", "man", "find", "here", "thing", "give", "many", "well"]
    for word in ngram:
        if word in commonWords:
            return True
    return False


def cleanText(input):
    input = re.sub(r'\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(r' +', " ", input)
    # Keep "u.s." from being split into two sentences later on
    input = re.sub(r"u\.s\.", "us", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    return input


def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput


def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        words = input[i:i+n]
        # Skip any n-gram that contains a common word
        if isCommon(words):
            continue
        ngramTemp = " ".join(words)
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output


def getFirstSentenceContaining(ngram, content):
    # Return the first sentence in which the n-gram appears, or "" if none does
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""


content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNGrams)

Output:

('united states', 10), ('executive department', 4), ('general government', 4), ('called upon', 3), ('government should', 3), ('whole country', 3), ('mr jefferson', 3), ('chief magistrate', 3), ('same causes', 3), ('legislative body', 3)
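
The summarizing idea described at the start of this section (take the most frequent phrases, then pull out the first sentence that contains each one) can be sketched on a small in-memory text. This is only a minimal illustration: the SAMPLE text, the short COMMON word set, and the function names top_ngrams, first_sentence_containing, and summarize are made up for this example and are not part of the original code.

```python
import re
from collections import Counter

# Hypothetical, much-shortened stand-in for the list of common words
COMMON = {"the", "of", "a", "in", "to", "and", "is", "it", "that", "be", "by"}

# Made-up sample text, used only to keep the example self-contained
SAMPLE = ("The general government serves every state. "
          "Every citizen trusts the general government. "
          "The whole country prospers in peace. "
          "Peace blesses the whole country.")


def top_ngrams(text, n=2, limit=2):
    """Return the `limit` most frequent n-grams that contain no common word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        if not any(word in COMMON for word in gram):
            counts[" ".join(gram)] += 1
    return [gram for gram, _ in counts.most_common(limit)]


def first_sentence_containing(ngram, text):
    """Return the first sentence that contains the n-gram, or "" if none does."""
    for sentence in text.split("."):
        if ngram in sentence.lower():
            return sentence.strip() + "."
    return ""


def summarize(text, n=2, limit=2):
    """Collect, without repeats, the first sentence for each top n-gram."""
    picked = []
    for gram in top_ngrams(text, n, limit):
        sentence = first_sentence_containing(gram, text)
        if sentence and sentence not in picked:
            picked.append(sentence)
    return picked


print(summarize(SAMPLE))
```

On a real speech you would probably raise limit and use a longer common-word list; splitting on "." is a rough way to find sentences and will stumble on abbreviations.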

Markov Models

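Markov text generators are based on the Markov model, which is often used to analyze large sets of random events in which one discrete event is followed by another discrete event with a certain probability that depends on the first.

Consider a simple model of a weather system. If today is sunny, there is a 70% chance that tomorrow will be sunny, a 20% chance that it will be cloudy, and a 10% chance of rain. If today is rainy, there is a 50% chance of rain tomorrow, a 25% chance of sun, and a 25% chance of clouds.

Note the following points.

All the percentages leading away from any one node must add up to exactly 100%. No matter how complicated the system is, one of the possible events must happen in the next step.

Although the weather has only three possible states at any given time, you can use this model to generate an infinite list of weather state transitions.

Only the state of the current node influences the next day's state. If you are on the "sunny" node, it does not matter whether the preceding 100 days were sunny or rainy: the chance of sun tomorrow is still 70%.

Some nodes can be harder to reach than others. The math behind this is fairly complicated, but it is easy to see intuitively that at any point in this system, "rainy" (whose incoming arrows add up to less than 100%) is a much less likely next state than "sunny" or "cloudy".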
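
The weather model can be written down as a table of transition probabilities and sampled step by step. The sketch below is only an illustration: the text gives the "sunny" and "rainy" rows, so the "cloudy" row is an assumed set of values chosen to keep the example complete, and the names TRANSITIONS, next_state, and simulate are hypothetical.

```python
import random

# TRANSITIONS[today][tomorrow] is the probability of moving between states
TRANSITIONS = {
    "sunny": {"sunny": 0.70, "cloudy": 0.20, "rainy": 0.10},
    "rainy": {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.50},
    # Assumed values: the text does not state the probabilities for "cloudy"
    "cloudy": {"sunny": 0.70, "cloudy": 0.15, "rainy": 0.15},
}


def next_state(today, rng=random):
    """Pick tomorrow's weather using only today's state."""
    options = TRANSITIONS[today]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]


def simulate(start, days, seed=None):
    """Return the starting state followed by `days` sampled transitions."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(days):
        states.append(next_state(states[-1], rng))
    return states


print(simulate("sunny", 10, seed=42))
```

Each row sums to 1, matching the rule that the outgoing probabilities of a node add up to 100%, and next_state looks only at the current state, which is the defining property of a Markov chain.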

from urllib.request import urlopen
from random import randint


def wordListSum(wordList):
    # Total number of times any word followed the current word
    total = 0
    for word, value in wordList.items():
        total += value
    return total


def retrieveRandomWord(wordList):
    # Pick a following word at random, weighted by how often it occurred
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word


def buildWordDict(text):
    # Remove newlines and quotes
    text = text.replace("\n", " ")
    text = text.replace("\"", "")

    # Make sure punctuation marks are treated as their own "words,"
    # so that they will be included in the Markov chain
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, " "+symbol+" ")

    words = text.split(" ")
    # Filter out empty words
    words = [word for word in words if word != ""]

    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            # Create a new dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict


text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
wordDict = buildWordDict(text)

# Generate a Markov chain of length 100
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord+" "
    currentWord = retrieveRandomWord(wordDict[currentWord])

print(chain)

Output:

I sincerely believe in Chief Magistrate to make all necessary sacrifices and
oppression of the remedies which we may have occurred to me in the arrangement
and disbursement of the democratic claims them , consolatory to have been best
political power in fervently commending every other addition of legislation , by
the interests which violate that the Government would compare our aboriginal
neighbors the people to its accomplishment . The latter also susceptible of the
Constitution not much mischief , disputes have left to betray . The maxim which
may sometimes be an impartial and to prevent the adoption or
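
To see what the two-level word dictionary looks like, and how the weighted pick in retrieveRandomWord works, it can help to run the same idea on a one-line text. The following is a compact sketch, not the code above: it swaps in collections.Counter and random.choices from the standard library, and the names build_word_dict, pick_next, and generate are made up for this example.

```python
import random
from collections import Counter, defaultdict


def build_word_dict(text):
    """Map each word to a Counter of the words that directly follow it."""
    words = text.split()
    word_dict = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        word_dict[prev][cur] += 1
    return word_dict


def pick_next(followers, rng=random):
    """Choose a following word with probability proportional to its count."""
    return rng.choices(list(followers), weights=list(followers.values()), k=1)[0]


def generate(word_dict, start, length, seed=None):
    """Build a chain of up to `length` words, stopping at a dead end."""
    rng = random.Random(seed)
    chain = [start]
    while len(chain) < length and chain[-1] in word_dict:
        chain.append(pick_next(word_dict[chain[-1]], rng))
    return " ".join(chain)


word_dict = build_word_dict("the cat sat on the mat")
print(dict(word_dict["the"]))   # followers of "the" with their counts
print(generate(word_dict, "cat", 3, seed=1))
```

The check chain[-1] in word_dict stops the chain when it reaches a word that was never followed by anything, such as the last word of the text.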

Six Degrees of Wikipedia

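A breadth-first search first examines all the links directly connected to the start page, rather than following the first link it finds as deep as it can go. If those links do not include the target page (the article you are looking for), it then searches the second level of links, that is, all the links on the pages linked from the start page. The process repeats until either the depth limit is reached (6 levels in this example) or the target page is found.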
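
The level-by-level search described above can be sketched with a queue over an in-memory link graph. This is a minimal illustration under stated assumptions: the LINKS dictionary is a made-up toy graph, not data from the wikipedia database, and find_path is a hypothetical helper name.

```python
from collections import deque

# Made-up toy graph: each page maps to the pages it links to
LINKS = {
    "Kevin Bacon": ["San Diego Comic Con International", "Footloose"],
    "San Diego Comic Con International": ["Brian Froud"],
    "Footloose": ["Kevin Bacon"],
    "Brian Froud": ["Terry Jones"],
    "Terry Jones": ["Eric Idle"],
}


def find_path(links, start, target, max_depth=6):
    """Breadth-first search; return the path to target or None."""
    queue = deque([[start]])      # each queue entry is a path from start
    visited = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        # Do not expand pages that already sit at the depth limit
        if len(path) - 1 >= max_depth:
            continue
        for linked in links.get(page, []):
            if linked not in visited:
                visited.add(linked)
                queue.append(path + [linked])
    return None


print(find_path(LINKS, "Kevin Bacon", "Eric Idle"))
```

Because pages are taken from the queue in the order they were discovered, every page one link away is examined before any page two links away, so the first path found is also a shortest one. The visited set keeps the search from looping through pages that link back to each other.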


from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='root', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE wikipedia")


def getUrl(pageId):
    # Query parameters are passed as a tuple
    cur.execute("SELECT url FROM pages WHERE id = %s", (int(pageId),))
    if cur.rowcount == 0:
        return None
    return cur.fetchone()[0]


def getLinks(fromPageId):
    cur.execute("SELECT toPageId FROM links WHERE fromPageId = %s", (int(fromPageId),))
    if cur.rowcount == 0:
        return None
    return [x[0] for x in cur.fetchall()]


def searchBreadth(targetPageId, currentPageId, depth, nodes):
    if nodes is None or len(nodes) == 0:
        return None
    if depth <= 0:
        for node in nodes:
            if node == targetPageId:
                # Start the path with the target and the page that links to it
                return [node, currentPageId]
        return None
    # depth is greater than 0 -- go deeper!
    for node in nodes:
        found = searchBreadth(targetPageId, node, depth-1, getLinks(node))
        if found is not None:
            # list.append returns None, so append first and then return the list
            found.append(currentPageId)
            return found
    return None


nodes = getLinks(1)
targetPageId = 123428
# Try increasing depths so that shorter paths are found first
for i in range(0, 4):
    found = searchBreadth(targetPageId, 1, i, nodes)
    if found is not None:
        print(found)
        for node in found:
            print(getUrl(node))
        break
else:
    print("No path found")

Below is the link path between the Kevin Bacon article (page ID 1 in the database) and the Eric Idle article (page ID 78520 in the database):

TARGET 134951 FOUND!
PAGE: 156224
PAGE: 155545
PAGE: 3
PAGE: 1

The corresponding article names are: Kevin Bacon → San Diego Comic Con International → Brian Froud → Terry Jones → Eric Idle.

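Natural Language Toolkit

Installation and Setup

The NLTK module can be installed like any other Python module, either by downloading the package directly from the NLTK website or by using one of several third-party installers with the keyword "nltk". For detailed installation instructions, see the NLTK website (http://www.nltk.org/install.html).

After installing the module, you can download the text repositories that come with NLTK, which makes it easy to try out its features. Enter the following at the Python command line: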

>>> import nltk
>>> nltk.download()

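Statistical Analysis with NLTK

NLTK is good at generating statistical information about a piece of text, including word counts, word frequencies, and parts of speech. If all you need is a simple, direct calculation (for example, the number of unique words in a text), importing NLTK is overkill, because it is a very large module. If you need deeper analysis of a text, however, it offers many functions that can compute almost any statistic you want.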
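
For the simple case mentioned above, the standard library is enough. The sketch below counts total and unique words without NLTK; the function name word_stats, the sample sentence, and the very rough tokenizing regular expression are assumptions made for this illustration.

```python
import re
from collections import Counter


def word_stats(text):
    """Return basic word statistics for a piece of text."""
    # Rough tokenizer: runs of letters and apostrophes, lowercased
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "total": len(words),
        "unique": len(counts),
        # Average number of times each distinct word is used
        "uses_per_word": len(words) / len(counts) if counts else 0.0,
        "most_common": counts.most_common(3),
    }


print(word_stats("The cat and the hat and the bat"))
```

This treats "The" and "the" as the same word and ignores punctuation, which differs from NLTK's tokenizer, so the numbers will not match NLTK's exactly on real text.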


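Statistical analysis with NLTK generally begins with the Text object. A Text object can be created from a plain Python string as follows:

from nltk import word_tokenize
from nltk import Text

tokens = word_tokenize("Here is some not very interesting text")
text = Text(tokens)

NLTK also comes with several built-in texts, which can be loaded like this: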

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

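Text objects can be manipulated much like ordinary Python lists, as if they were a list containing every word in the text. Using this property, you can count the unique words in a text and compare that with the total number of words:

>>> words = set(text6)
>>> len(text6)/len(words)
7.833333333333333

This shows that each word in the script was used about eight times on average. You can also put the text into a frequency distribution object, FreqDist, to see which words are most common and how often each one occurs: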

>>> from nltk import FreqDist
>>> fdist = FreqDist(text6)
>>> fdist.most_common(10)
[(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319), (']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)]
>>> fdist["Grail"]
34

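Part-of-Speech Analysis with NLTK

Web scraping often involves search problems. After scraping the text of a website, you might want to search it for the word "google", but only where it is used as a verb, not as the proper noun Google. Or you might want only the company name Google, without relying on capitalization to find it (people may forget to capitalize it and simply write google). This is where the pos_tag function is useful: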


from nltk import word_tokenize, sent_tokenize, pos_tag

sentences = sent_tokenize("Google is one of the best companies in the world. I constantly google myself to see what I'm up to.")
# Part-of-speech tags that mark nouns
nouns = ['NN', 'NNS', 'NNP', 'NNPS']

for sentence in sentences:
    if "google" in sentence.lower():
        taggedWords = pos_tag(word_tokenize(sentence))
        for word in taggedWords:
            # Print the sentence only when "google" is tagged as a noun
            if word[0].lower() == "google" and word[1] in nouns:
                print(sentence)


Python Data Collection 8: Natural Language Processing