Python 爬虫实战（1）

前言：

在我学习完python的基础知识之后，当然想要练练手，加深一下对python以及其语法的理解，
所以听说爬虫特别有成就感，非常有利于学习and娱乐，以及培养学习的兴趣，so就到处百度爬虫的相关文章，网上的确有很多相关的，但我还是决定自己写写,
只有自己写下了讲出来才能代表真的学会了这么技术

我也是第一次学python和抓包，是根据网上的各种讲解以及自己的摸索，慢慢学会的，有什么说的不对的，欢迎指正
刚开始的时候我的使用Anaconda管理包和环境（py3.6），然而后来我在学多线程的时候，就出现了问题：setdaemon(true)一直没效果（设置守护线程后，守护线程本来应该在所有非守护线程执行完就立马结束而不管守护线程是否结束的，but没用，网上各种查也查不到，后来我把代码写到.py文件里直接在cmd里执行该脚本就没问题了）
使用Anaconda是因为我打算入坑深度学习，所以提前熟悉熟悉这个管理工具，科科

本章目的：抓取豆瓣电影网站正在上映列表的评价关键词，并使用词云表示出来

豆瓣正在上映列表如图

图片说明

豆瓣电影《芳华》短评列表如图

图片说明

最终获得的《芳华》短评词云如图

图片说明

抓取步骤大致分为三步，具体的又分为下面几步：

1.访问并获取网页数据并抽取出来有用信息

首先要获取网页数据

获取豆瓣正在上映列表的网页数据
使用Python的urllib模块: 其提供了一个从指定的URL地址获取网页数据(创建一个表示远程url的类文件对象)，通过该对象可以对其进行分析处理，获取想要的数据

函数原型

函数原型

url: 请求的地址（除了url，其他都可以不填）
data: 访问url时请求的参数
timeout: 超时时间

python2.X写法：

12	import urllib urltext = urllib.urlopen(url)

python3.X写法：

12	import urllib.request #from urllib import requesturltext = urllib.request.urlopen(url)

urlopen()方法返回值类型：

如果请求的是http或者https地址，则返回http.client.HTTPResponse对象
如果请求的是ftp或者Data URL地址(以及requests explicitly handled by legacyURLopener and FancyURLopener classes <-原谅我没看懂)，则返回urllib.response.addinfourl对象

上面的对象包含多个方法可供我们使用

geturl() :返回请求的网页地址
info() :返回一个httplib.HTTPMessage对象，表示远程服务器返回的头信息
getcode():返回HTTP状态码
read() , readline() , readlines() , fileno() , close() 这些方法的使用方式与文件对象完全一样

实例

import urlliburltext = urllib.request.urlopen('https://movie.douban.com/nowplaying/beijing/')print (urltext)print (urltext.geturl())print (urltext.info())print (urltext.getcode())

其次,分析网页数据，抓取想要的数据

找到网页上你要的电影列表的位置，看看有什么标签特点

我们发现所有的电影列表都在id为nowplaying的div下面的一个ul下，该ul的class为lists，并且每个电影的li标签的class为list-item
该li标签中有许多熟悉，我们发现data-title为电影标题，data-score为电影评分，data-star为打星…,最最重要的是id，每个电影都不同，可推测应该是电影的唯一标识（编号）；
我们要通过某一个标识来查询该电影的短评，通过查看电影主页的网址（https://movie.douban.com/subject/26862829/ ）可知，这个id就是我们需要的

使用python的BeautifulSoup库进行网页信息的抓取（网页解析库）
BeautifulSoup 是一个可以从 HTML 或 XML 文件中提取数据的 Python 库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup 会帮你节省数小时甚至数天的工作时间.

BeautifulSoup:Beautiful Soup提供一些简单的、python式的函数用来处理导航、搜索、修改分析树等功能。
它是一个工具箱，通过解析文档为用户提供需要抓取的数据，因为简单，所以不需要多少代码就可以写出一个完整的应用程序。
Beautiful Soup自动将输入文档转换为Unicode编码，输出文档转换为utf-8编码。你不需要考虑编码方式，除非文档没有指定一个编码方式，这时，Beautiful Soup就不能自动识别编码方式了。然后，你仅仅需要说明一下原始编码方式就可以了。
Beautiful Soup已成为和lxml、html6lib一样出色的python解释器，为用户灵活地提供不同的解析策略或强劲的速度。(from http://beautifulsoup.readthedocs.io/zh_CN/latest/ )

使用BeautifulSoup解析代码,能够得到一个 BeautifulSoup 的对象,并能按照标准的缩进格式的结构输出

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种: Tag, NavigableString, BeautifulSoup, Comment

Tag: <class 'bs4.element.Tag'>标签对象，两个属性：name, attribute (直接调用：tag.name，tag['class']), 如果是多值属性，则返回list，也可以赋值为多值属性（假的多值属性返回字符串，如id="aaa bbb"）NavigableString: <class 'bs4.element.NavigableString'> tag中的字符串对象，即tag.string; tag中包含的字符串不能编辑,但是可以被替换成其它的字符串,用 replace_with() 方法BeautifulSoup : <class 'bs4.BeautifulSoup'>BeautifulSoup对象表示的是一个文档的全部内容.大部分时候,可以把它当作 Tag 对象,它支持 遍历文档树 和 搜索文档树 中描述的大部分的方法; 一个方法：soup.name # [document]Comment : <class 'bs4.element.Comment'>文档的注释部分，Comment 对象是一个特殊类型的 NavigableString 对象

BeautifulSoup对象使用示例：

解析时，可以传入一段字符串或一个文件句柄.

from bs4 import BeautifulSoup   #导入模块soup = BeautifulSoup(open("index.html")) soup = BeautifulSoup("<html>data</html>")

#首先,文档被转换成Unicode,并且HTML的实例都被转换成Unicode编码#然后,BeautifulSoup选择最合适的解析器来解析这段文档,如果手动指定解析器那么Beautiful Soup会选择指定的解析器来解析文档

soup.title # 标签对象：<title>北京 - 在线购票&影讯
soup.title.name # 标签名称：title
soup.title.string # 标签内容：北京 - 在线购票&影讯
soup.p # 第一个p标签对象：
豆瓣
soup.p[‘class’] # 第一个p标签对象的类属性

123456789101112131415161718

原型：find_all( name , attrs , recursive , string , **kwargs ) 搜索当前tag的所有tag子节点,并判断是否符合过滤器的条件    name:       name 参数可以查找所有名字为 name 的tag,字符串对象会被自动忽略掉.    attrs:      通过属性选择器查询，有两种写法                    1. soup.find_all(class_='value', id='value2')                     2. soup.find_all(attrs={"class": "value", "id":"value2"})    limit:  限制查询结果个数    recursive: 调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False    string: 通过 string 参数可以搜搜文档中的字符串内容. soup.find_all("a", string="value") #查询标签中文字包含value的a标签``` 8. soup.find('a').get('href')   # 找到第一个a标签 并返回其href属性内容 （ find_all() 方法的返回结果是值包含一个元素的列表,而 find() 方法直接返回结果.）9. 更多用法见BeautifulSoup官网中文文档：http://beautifulsoup.readthedocs.io/zh_CN/latest/

* 解析网页代码,并编码为utf-8```pythonimport urlliburltext = urllib.request.urlopen('https://movie.douban.com/nowplaying/beijing/')html_data = urltext.read().decode('utf-8')print (html_data)

获取正在上映列表数据 nowplaying_movie_list列表（List）

12345678910

import urlliburltext = urllib.request.urlopen('https://movie.douban.com/nowplaying/beijing/')html_data = urltext.read().decode('utf-8')

from bs4 import BeautifulSoup as bssoup = bs(html_data, 'html.parser')nowplaying_movie = soup.find_all('div', id = 'nowplaying') # 先获取id为nowplaying的div# print (nowplaying_movie) # 只有一条数据，因为id是唯一的nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_ = 'list-item')# 再获取class为list-item的liprint (nowplaying_movie_list)

至此已经获得了最内部一层的电影数据，可以直接获得每个电影的id了

1	print (nowplaying_movie_list[0]['id'], '\n') #获取第一个电影的id数据

现在我们需要获取其中某一个id，通过这个id获取对应电影的短评，然后就可以进行处理了
你也可以自由发挥，制作一个查询的功能，通过输入电影名称指定某一个电影进行分析

2.分析网页中有用信息并进行处理

首先按照上面的步骤访问电影首页，抽取短评信息，存放到一个List中

首先解析网页代码

123456789

requrl = "https://movie.douban.com/subject/" + nowplaying_movie_list[0]['id'] + "/comments?start=0&limit=20"resp = urllib.request.urlopen(requrl)html_data = resp.read().decode('utf-8')soup = bs(html_data, 'html.parser')

title = soup.find('title') # 直接获取title标签print(title.string) #获取标签中内容comment_div_list = soup.find_all('div', class_ = 'comment')print (comment_div_list) #所有的短片标签列表

通过下面的源码可知，所有的短评文字都放在class为comment-item的div下的一个p标签中，所有我们要得到所有的p标签并组成一个List

commentList = []  #存放所有的短评内容数据 Listfor cm in comment_div_list:    if cm.find_all('p')[0] is not None:        commentList.append(cm.find_all('p')[0].string) #把短评内容存放在列表中print (commentList)

已得短评List，但是该List中包含大量的单引号（List自带的），换行符等不需要的东西，并且由于我们要做成词云，所有的符号都不要，只要文字

12345678

while True:    if None in commentList:        commentList.remove(None) #去除NoneType数据    else:        breakcomments = ''.join(commentList) #拼接字符串comments = comments.replace(' ','').replace("\n", "").replace("\t", "")print (comments)

词云展示的只是关键词，所以去除用户短评中的所有的标点符号（正则表达式）

import re       #正则表达式pattern = re.compile(r'[\u4e00-\u9fa5]+')  #去除标点符号(正则表达式)filterdata = re.findall(pattern, comments)cleaned_comments = ''.join(filterdata) # 把filterdata按照空字符串为间隔连接起来print (cleaned_comments)

目前所有的评价都没有间隔的展示在这里，我们需要把其中的词语取出来得到所有的关键词

使用jieba分词, 把字符串中的所有的词语分出来，组成一个List

结巴（jieba）是国人出的一个精品插件，可以对一段中文进行分词，有三种分词模式，可以适应不同需求。

12345678

jieba.cut 方法接受三个输入参数: 需要分词的字符串；cut_all 参数用来控制是否采用全模式；HMM 参数用来控制是否使用 HMM 模型jieba.cut_for_search 方法接受两个参数：需要分词的字符串；是否使用 HMM 模型。该方法适合用于搜索引擎构建倒排索引的分词，粒度比较细待分词的字符串可以是 unicode 或 UTF-8 字符串、GBK 字符串。注意：不建议直接输入 GBK 字符串，可能无法预料地错误解码成 UTF-8jieba.cut 以及 jieba.cut_for_search 返回的结构都是一个可迭代的 generator，可以使用 for 循环来获得分词后得到的每一个词语(unicode)，或者用jieba.lcut 以及 jieba.lcut_for_search 直接返回 listjieba.Tokenizer(dictionary=DEFAULT_DICT) 新建自定义分词器，可用于同时使用不同词典。jieba.dt 为默认分词器，所有全局分词相关函数都是该分词器的映射。

也可以添加自定义词典 （from： http://blog.csdn.net/qq_27231343/article/details/51898940 ）

123456789101112131415161718192021

#代码示例# encoding=utf-8import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)print("Full Mode: " + "/ ".join(seg_list))  # 全模式

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)print("Default Mode: " + "/ ".join(seg_list))  # 精确模式

seg_list = jieba.cut("他来到了网易杭研大厦")  # 默认是精确模式print("* ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # 搜索引擎模式print(", ".join(seg_list))

#输出结果# Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学# Default Mode: 我/ 来到/ 北京/ 清华大学# 他* 来到* 了* 网易* 杭研* 大厦       (此处，“杭研”并没有在词典中，但是也被Viterbi算法识别出来了)# 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ，, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造

使用jieba分割短评，获取返回的分词List

123	import jiebasegment = jieba.lcut(cleaned_comments)print (segment)

数据中有“的”、“是”、“我”、“你”等虚词（停用词），而这些词在任何场景中都是高频时，并且没有实际的含义，所以我们要他们进行清除。

使用pandas

1234567891011

import pandas as pdwords_df = pd.DataFrame({'segment':segment})  #格式转换 把List转化为Dict# words_df.head()# print(words_df)# print (words_df.segment)#从网上下载常用停用词文件 stopwords.txt 然后对比去除统计结果中所有的停用词stopwords=pd.read_csv("E:/stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')#quoting=3全不引用# print (stopwords.stopword)# print (words_df.segment.isin(stopwords.stopword))words_df = words_df[~words_df.segment.isin(stopwords.stopword)]  #stopwords.txt不能有空格words_df.head()

我的停用词文件： http://p18j2ow6f.bkt.clouddn.com/static/file/stopwords.txt

清洗了关键词以后，我们把剩下的词语进行分类统计，观察每个词语的频率

使用numpy

import numpy    #numpy计算包words_stat = words_df.groupby(by=['segment'])['segment'].agg({"计数":numpy.size}) # 按照segment分类words_stat = words_stat.reset_index().sort_values(by=["计数"],ascending=False)  #词频按照 计数 由大到小排列words_stat.head()

3.制作为词云

12345678910111213

import matplotlib# %matplotlib inline

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)from wordcloud import WordCloud #词云包

wordcloud=WordCloud(font_path="E:/simhei.ttf",background_color="white",max_font_size=80)  #指定字体类型、字体大小和字体颜色# print (wordcloud)word_frequence = {x[0]:x[1] for x in words_stat.head(1000).values}# print (word_frequence)

wordcloud=wordcloud.fit_words(word_frequence)matplotlib.pyplot.imshow(wordcloud)

我的字体文件： http://p18j2ow6f.bkt.clouddn.com/static/file/simhei.ttf

最终效果

图片说明

个人完整代码：

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110

from urllib import request   #python3.X写法#import urllib             #python2.X写法from bs4 import BeautifulSoup as bsimport re       #正则表达式import jieba    #分词包 中文分词操作 结巴分词import pandas as pd import numpy    #numpy计算包

"""python2.X 关于 urllib的用法    import urllib      text = urllib.urlopen(url).read()  

python3.X 关于 urllib的用法    import urllib.request  #from urllib import request    response = urllib.request.urlopen(url)      text = response.read()  """

def getList():    resp = request.urlopen('https://movie.douban.com/nowplaying/beijing/')  #获取url下的影片列表;python2.x下使用urllib.urlopen()    html_data = resp.read().decode('utf-8') # 读取返回的数据(返回页面的html代码)    # print(html_data)

    soup = bs(html_data, 'html.parser') # 解析html代码 开始获取其中的数据    nowplaying_movie = soup.find_all('div', id = 'nowplaying')  #获取id为nowplaying的div标签以及内部的代码 (得到的是一个list)    # print (nowplaying_movie);    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_ = 'list-item') #获取class是list-item的所有li标签    # print (nowplaying_movie_list);    # print (nowplaying_movie_list[0]['id'], '\n');   # 打印第一个影片的id

    """测试代码 开始"""    # test = nowplaying_movie_list[0].find_all('ul')    # print (test)    # test = nowplaying_movie_list[0].find_all('ul')[0].find_all('li')[1]    # print (test)    """测试代码 结束"""

    return nowplaying_movie_list

def getComments(nowplaying_movie_list, num):    requrl = "https://movie.douban.com/subject/" + nowplaying_movie_list[num]['id'] + "/comments?start=0&limit=20" #获取url下的影片短评列表    resp = urllib.request.urlopen(requrl)    html_data = resp.read().decode('utf-8')

    soup = bs(html_data, 'html.parser')

    title = soup.find('title')    print(title.string)

    comment_div_list = soup.find_all('div', class_ = 'comment')    #print (comment_div_list)    commentList = []  #存放所有的短评内容数据    for cm in comment_div_list:        if cm.find_all('p')[0] is not None:            commentList.append(cm.find_all('p')[0].string) #把短评内容存放在列表中    # print (comments)

    comments = ''    for k in range(len(commentList)):        comments = comments + (str(commentList[k])).strip()    #print (comments)    pattern = re.compile(r'[\u4e00-\u9fa5]+')  #去除标点符号(正则表达式)    filterdata = re.findall(pattern, comments)    cleaned_comments = ''.join(filterdata) # 把filterdata按照空字符串为间隔连接起来    # print (cleaned_comments)

    segment = jieba.lcut(cleaned_comments) #list    # print (segment)    words_df = pd.DataFrame({'segment':segment})  #格式转换    # words_df.head()    # print(words_df)    # print (words_df.segment)    # 数据中有“的”、“是”、“我”、“你”等虚词（停用词），而这些词在任何场景中都是高频时，并且没有实际的含义，所以我们要他们进行清除。

    #从网上下载常用停用词文件 stopwords.txt 然后对比去除统计结果中所有的停用词    stopwords=pd.read_csv("E:/stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')#quoting=3全不引用    # print (stopwords.stopword)    # print (words_df.segment.isin(stopwords.stopword))    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]  #stopwords.txt不能有空格    words_df.head()

    #进行词频统计    words_stat = words_df.groupby(by=['segment'])['segment'].agg({"计数":numpy.size}) # 按照segment分类    words_stat = words_stat.reset_index().sort_values(by=["计数"],ascending=False)  #词频按照 计数 由大到小排列    words_stat.head()

    return words_stat

#词云展示def show(words_stat):    import matplotlib    %matplotlib inline

    matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)    from wordcloud import WordCloud #词云包

    wordcloud=WordCloud(font_path="E:/simhei.ttf",background_color="white",max_font_size=80)  #指定字体类型、字体大小和字体颜色    # print (wordcloud)    word_frequence = {x[0]:x[1] for x in words_stat.head(1000).values}    # print (word_frequence)

    wordcloud=wordcloud.fit_words(word_frequence)    matplotlib.pyplot.imshow(wordcloud)

num = 0 #从0开始, 获取豆瓣最新上映电影短评关键信息movie_list = getList()words_stat = getComments(movie_list, num)show(words_stat)

别人家的代码【滑稽】：

1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798

#coding:utf-8__author__ = 'hang'

import warningswarnings.filterwarnings("ignore")import jieba    #分词包import numpy    #numpy计算包import codecs   #codecs提供的open方法来指定打开的文件的语言编码，它会在读取的时候自动转换为内部unicode import reimport pandas as pd  import matplotlib.pyplot as pltfrom urllib import requestfrom bs4 import BeautifulSoup as bs%matplotlib inline

import matplotlibmatplotlib.rcParams['figure.figsize'] = (10.0, 5.0)from wordcloud import WordCloud#词云包

#分析网页函数def getNowPlayingMovie_list():       resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')            html_data = resp.read().decode('utf-8')        soup = bs(html_data, 'html.parser')        nowplaying_movie = soup.find_all('div', id='nowplaying')            nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')        nowplaying_list = []        for item in nowplaying_movie_list:                nowplaying_dict = {}                nowplaying_dict['id'] = item['data-subject']               for tag_img_item in item.find_all('img'):                        nowplaying_dict['name'] = tag_img_item['alt']                        nowplaying_list.append(nowplaying_dict)        return nowplaying_list

#爬取评论函数def getCommentsById(movieId, pageNum):     eachCommentList = [];     if pageNum>0:          start = (pageNum-1) * 20     else:         return False     requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' +'?' +'start=' + str(start) + '&limit=20'     print(requrl)    resp = request.urlopen(requrl)     html_data = resp.read().decode('utf-8')     soup = bs(html_data, 'html.parser')     comment_div_lits = soup.find_all('div', class_='comment')     for item in comment_div_lits:         if item.find_all('p')[0].string is not None:                 eachCommentList.append(item.find_all('p')[0].string)    return eachCommentList

def main():    #循环获取第一个电影的前10页评论    commentList = []    NowPlayingMovie_list = getNowPlayingMovie_list()    for i in range(10):            num = i + 1         commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)        commentList.append(commentList_temp)

    #将列表中的数据转换为字符串    comments = ''    for k in range(len(commentList)):        comments = comments + (str(commentList[k])).strip()

    #使用正则表达式去除标点符号    pattern = re.compile(r'[\u4e00-\u9fa5]+')    filterdata = re.findall(pattern, comments)    cleaned_comments = ''.join(filterdata)

    #使用结巴分词进行中文分词    segment = jieba.lcut(cleaned_comments)    words_df=pd.DataFrame({'segment':segment})

    #去掉停用词    stopwords=pd.read_csv("stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')#quoting=3全不引用    words_df=words_df[~words_df.segment.isin(stopwords.stopword)]

    #统计词频    words_stat=words_df.groupby(by=['segment'])['segment'].agg({"计数":numpy.size})    words_stat=words_stat.reset_index().sort_values(by=["计数"],ascending=False)

    #用词云进行显示    wordcloud=WordCloud(font_path="simhei.ttf",background_color="white",max_font_size=80)    word_frequence = {x[0]:x[1] for x in words_stat.head(1000).values}

    word_frequence_list = []    for key in word_frequence:        temp = (key,word_frequence[key])        word_frequence_list.append(temp)

    wordcloud=wordcloud.fit_words(word_frequence_list)    plt.imshow(wordcloud)

#主函数main()

转载自链接地址: http://python.jobbole.com/88325/

个人博客欢迎来访： http://blog.zj2626.com