The previous post compared a hand-written crawler with one built on the Scrapy framework and discussed Scrapy's advantages. That post only crawled a single page; this one shows how to crawl many pages automatically. Taking the post listings of Hupu's 湿乎乎 (NBA) forum as an example, we will walk through a spider that automatically crawls post information page by page.

1. Page Analysis

Open https://bbs.hupu.com/vote. This is the page where the crawl starts; it looks like the figure below:

Looking at this page, the goal is to extract the relevant information for every post: the post URL, post title, post author, link to the author's profile, number of replies, number of views, publication time, time of the last reply, and the nickname of the last replier.

Next, open the browser developer tools (F12) to inspect the page and define an extraction rule for each of these fields; a quick way to verify the rules is shown after the list.

1. url. As shown in the figure below, the XPath rule for the post link can be defined as: "//a[@class='truetit']/@href"

2. title. From the same element, the title rule can be: "//a[@class='truetit']/text()"

3. author. The figure below shows the author information; the rule can be defined as: "//a[@class='aulink']/text()".

4. authorlink. From the same element, the rule is: "//a[@class='aulink']/@href"

5. pubtime. The publication-time rule can be defined as: "//a[@style='color:#808080;cursor: initial; ']/text()"

6. replyandscan. The reply count and view count are kept together as a single field here and split apart after crawling. As shown in the figure below, the rule can be set to: "//span[@class='ansour box']/text()"

7. lastreplytime. The time of the last reply, as shown in the figure below: "//div[@class='endreply box']/a/text()"

8. endauthor. The nickname of the last replier: "//div[@class='endreply box']/span/text()"
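Before wiring these rules into the project, it can help to confirm that they actually match something. A minimal sketch using scrapy shell (the URL and rules come straight from the list above; the output naturally depends on the live page):

scrapy shell "https://bbs.hupu.com/vote"
# then, inside the shell:
response.xpath("//a[@class='truetit']/@href").extract()[:3]          # first few post links
response.xpath("//a[@class='truetit']/text()").extract()[:3]         # first few titles
response.xpath("//span[@class='ansour box']/text()").extract()[:3]   # raw "replies / views" strings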

2. Creating the Project and Writing items.py

First, create the project following the steps described in the earlier post. The project structure looks like this:
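For reference, the project can be generated with scrapy startproject; a sketch of the command and the layout Scrapy typically produces (the project name autocrawl matches the module paths used in the settings later; details may vary across Scrapy versions):

scrapy startproject autocrawl
cd autocrawl
# Typical generated layout:
# autocrawl/
# ├── scrapy.cfg
# └── autocrawl/
#     ├── __init__.py
#     ├── items.py
#     ├── middlewares.py
#     ├── pipelines.py
#     ├── settings.py
#     └── spiders/
#         └── __init__.py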

Based on the page analysis above, define the required fields in items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AutocrawlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # post URL
    url = scrapy.Field()
    # post title
    title = scrapy.Field()
    # post author
    author = scrapy.Field()
    # link to the author's profile
    authorlink = scrapy.Field()
    # number of replies
    reply = scrapy.Field()
    # number of views
    scan = scrapy.Field()
    # publication time
    pubtime = scrapy.Field()
    # time of the last reply
    lastreplytime = scrapy.Field()
    # last replier
    endauthor = scrapy.Field()
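One thing worth noting about how these fields are used in this project: each field of an AutocrawlItem ends up holding a list with one entry per post on a listing page, and the pipeline later walks those lists in parallel by index. A tiny illustration with made-up values:

from autocrawl.items import AutocrawlItem

item = AutocrawlItem()
item["url"] = ["/27355887.html", "/27330869.html"]         # one entry per post on the page
item["title"] = ["first post title", "second post title"]  # parallel list: item["title"][j] goes with item["url"][j]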

3. Writing pipelines.py

items.py defines the structure of the scraped data, while pipelines.py post-processes the structured items. Here the data is saved to a JSON file, with each post written as one JSON object per line. For better portability, the JSON file is created with a relative path, as shown below:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


class AutocrawlPipeline(object):
    def __init__(self):
        # note: adjust this relative path when porting the code
        self.file = codecs.open("../autocrawl/data.json", "wb", encoding="utf-8")

    def process_item(self, item, spider):
        for j in range(len(item["url"])):
            url = "https://bbs.hupu.com" + item["url"][j]
            title = item["title"][j]
            author = item["author"][j]
            authorlink = item["authorlink"][j]
            reply = item["reply"][j]
            scan = item["scan"][j]
            pubtime = item["pubtime"][j]
            lastreplytime = item["lastreplytime"][j]
            endauthor = item["endauthor"][j]
            oneitem = {"url": url, "title": title, "author": author,
                       "authorlink": authorlink, "reply": reply, "scan": scan,
                       "pubtime": pubtime, "lastreplytime": lastreplytime,
                       "endauthor": endauthor}
            i = json.dumps(oneitem, ensure_ascii=False)
            # one JSON object per line
            line = i + '\n'
            print(line)  # debugging
            self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
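As an aside, Scrapy's built-in feed exports can also write items as JSON lines without a custom pipeline; a hedged sketch of the Scrapy 1.x-style settings (newer Scrapy versions use the FEEDS dict instead). Note that the custom pipeline above additionally flattens each page's parallel lists into one object per post, which the feed export would not do by itself:

# settings.py — alternative output via feed exports (Scrapy 1.x style)
FEED_FORMAT = 'jsonlines'        # one JSON object per line
FEED_URI = 'data.json'           # written relative to where scrapy crawl is run
FEED_EXPORT_ENCODING = 'utf-8'   # keep Chinese text readable instead of \uXXXX escapes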

4. Writing settings.py

settings.py holds the project configuration: registering the pipeline above, disabling cookies, setting a download delay, and so on. The code is as follows:

# -*- coding: utf-8 -*-

# Scrapy settings for autocrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'autocrawl'

SPIDER_MODULES = ['autocrawl.spiders']
NEWSPIDER_MODULE = 'autocrawl.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'autocrawl (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'autocrawl.middlewares.AutocrawlSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'autocrawl.middlewares.AutocrawlDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'autocrawl.pipelines.AutocrawlPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5. Writing the Spider

In the project directory, create a spider named auto with scrapy genspider -t basic auto hupu.com. The start page is the https://bbs.hupu.com/vote page mentioned at the beginning. Looking at page 2 and onwards of the forum, the URLs follow a pattern: https://bbs.hupu.com/vote-page, where page is the page number, starting from 2. The spider extracts each field using the rules defined above. Since the crawl begins from the start page, it must then yield new Request objects with the subsequent page URLs and a callback to the parse function in order to keep crawling. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from autocrawl.items import AutocrawlItem


class AutoSpider(scrapy.Spider):
    name = 'auto'
    allowed_domains = ['hupu.com']
    start_urls = ['https://bbs.hupu.com/vote']

    def parse(self, response):
        item = AutocrawlItem()
        item["url"] = response.xpath("//a[@class='truetit']/@href").extract()
        item["title"] = response.xpath("//a[@class='truetit']/text()").extract()
        item["author"] = response.xpath("//a[@class='aulink']/text()").extract()
        item["authorlink"] = response.xpath("//a[@class='aulink']/@href").extract()
        # item["replyandscan"]
        # The replyandscan strings contain '\xa0' characters that would cause errors,
        # so they are handled separately and split into reply and scan.
        t = response.xpath("//span[@class='ansour box']/text()").extract()
        reply = []
        scan = []
        for u in t:
            tmp = str(u).replace("\xa0", "").split("/")
            if tmp != []:
                reply.append(tmp[0])
                scan.append(tmp[1])
        item["reply"] = reply
        item["scan"] = scan
        item["pubtime"] = response.xpath("//div[@class='author box']//a[@style='color:#808080;cursor: initial; ']/text()").extract()
        item["lastreplytime"] = response.xpath("//div[@class='endreply box']/a/text()").extract()
        item["endauthor"] = response.xpath("//div[@class='endreply box']/span/text()").extract()
        yield item
        for i in range(2, 21):
            url = "https://bbs.hupu.com/vote-" + str(i)
            # Yield a Request with the next page URL and parse as the callback,
            # so the crawl continues automatically.
            yield Request(url, callback=self.parse)
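To make the '\xa0' handling concrete, here is a small standalone check of the split logic used above (the sample string imitates the "replies / views" text on the listing page):

# The listing shows reply and view counts as a single string such as "130\xa0/\xa012658".
raw = "130\xa0/\xa012658"
tmp = raw.replace("\xa0", "").split("/")
print(tmp)                      # ['130', '12658']
reply, scan = tmp[0], tmp[1]    # reply count, view count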

6. Running the Spider

Run the spider as shown below; when it finishes, a new JSON file appears in the project:
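A sketch of the run command, assuming the spider name auto created above; keep in mind that the relative path "../autocrawl/data.json" in the pipeline resolves against the directory the command is run from, so adjust it if the file does not appear where you expect:

scrapy crawl auto    # run the spider named 'auto'; the custom pipeline writes the JSON file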

Opening the JSON file, part of the content looks like this:

{"reply": "130", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "哈登表现没那么差,别一场论行吗", "authorlink": "https://my.hupu.com/107436237696224", "url": "https://bbs.hupu.com/27355887.html", "scan": "12658", "endauthor": "可乐瓶盖", "author": "zppsss"}
{"reply": "166", "lastreplytime": "17:55", "pubtime": "2019-05-10", "title": "应该讨论下CJ麦科勒姆到底差欧文多少呢", "authorlink": "https://my.hupu.com/186718992630848", "url": "https://bbs.hupu.com/27330869.html", "scan": "28306", "endauthor": "啦啦啦魔法能量", "author": "dw0818"}
{"reply": "256", "lastreplytime": "17:55", "pubtime": "2019-05-10", "title": "认真讲,目前东西部没有一个系列赛强度比得上火勇吧?", "authorlink": "https://my.hupu.com/255955476402177", "url": "https://bbs.hupu.com/27335533.html", "scan": "60839", "endauthor": "呵呵Tim催哦", "author": "白边年薪302K值"}
{"reply": "1128", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "[话题团]一念天堂,一念地狱!库里半场33分率队挺进西部决赛!", "authorlink": "https://my.hupu.com/263688482793184", "url": "https://bbs.hupu.com/27348324.html", "scan": "241655", "endauthor": "盖了又盖之锅盖", "author": "耶律"}
{"reply": "96", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "当时这个詹绝杀什么水平", "authorlink": "https://my.hupu.com/88507341799843", "url": "https://bbs.hupu.com/27341919.html", "scan": "37508", "endauthor": "KD超KB", "author": "热心用户"}
{"reply": "6", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "难道我经历的脱臼和库里的脱臼不是一种脱臼?", "authorlink": "https://my.hupu.com/266836632587182", "url": "https://bbs.hupu.com/27359525.html", "scan": "2369", "endauthor": "一入论坛深似海", "author": "左脚的3分特别准"}
{"reply": "0", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "非火蜜的感受,觉得大家还是要理性看待自己主队。", "authorlink": "https://my.hupu.com/224328494463192", "url": "https://bbs.hupu.com/27360738.html", "scan": "31", "endauthor": "gutcci1", "author": "gutcci1"}
{"reply": "14", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "今天过后,有没有人跟我想法一样的,投个票吧", "authorlink": "https://my.hupu.com/116611849283329", "url": "https://bbs.hupu.com/27359705.html", "scan": "883", "endauthor": "七年之役", "author": "莫雷的炮友"}
{"reply": "15", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "看完这场,觉得邓肯,库里和韦德真的堪称伟大", "authorlink": "https://my.hupu.com/274330133718993", "url": "https://bbs.hupu.com/27359012.html", "scan": "2458", "endauthor": "做好事儿不留名", "author": "西西四西西"}
{"reply": "5", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "德安东尼第四节为什么还让让保罗休息?", "authorlink": "https://my.hupu.com/163827142009834", "url": "https://bbs.hupu.com/27351736.html", "scan": "1766", "endauthor": "飘然出世", "author": "史蒂分布莱恩特"}
{"reply": "38", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "窝火出哈登,卡佩拉,换欧文,霍福德,洛奇儿,凯米们愿意么?,愿意就通知经理了", "authorlink": "https://my.hupu.com/198259890671262", "url": "https://bbs.hupu.com/27354862.html", "scan": "5086", "endauthor": "若风是只you", "author": "1348228476"}
{"reply": "10", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "杜兰特压力太大了", "authorlink": "https://my.hupu.com/228144193864995", "url": "https://bbs.hupu.com/27354881.html", "scan": "3487", "endauthor": "伊斯科盘带没我6", "author": "leeson911"}
{"reply": "52", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "我有点看不懂了,保罗到底想干嘛?", "authorlink": "https://my.hupu.com/201450290883476", "url": "https://bbs.hupu.com/27342874.html", "scan": "13602", "endauthor": "瑞克警长", "author": "包头大三中"}
{"reply": "9", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "NBA纪录季后赛库里得分30+勇士29胜4负胜率87.9%", "authorlink": "https://my.hupu.com/52562014793614", "url": "https://bbs.hupu.com/27359803.html", "scan": "1343", "endauthor": "kit杰248j", "author": "Ken任"}
{"reply": "126", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "不懂就问!火勇大战这姑娘是谁", "authorlink": "https://my.hupu.com/95098224075381", "url": "https://bbs.hupu.com/27344624.html", "scan": "38890", "endauthor": "别别别奶", "author": "青峰不会飞"}
{"reply": "116", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "都一个个进来给库里道歉!", "authorlink": "https://my.hupu.com/1037239487635", "url": "https://bbs.hupu.com/27351436.html", "scan": "16960", "endauthor": "老子会算卦", "author": "饭特雷西"}
{"reply": "50", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "我一直想不通格林为什么不练一下中投啊?", "authorlink": "https://my.hupu.com/224421424371559", "url": "https://bbs.hupu.com/27354703.html", "scan": "14623", "endauthor": "给你20块别舔了好吗", "author": "无极威骑扣无极尊"}
{"reply": "151", "lastreplytime": "17:53", "pubtime": "2019-05-10", "title": "火箭宁可裁掉安东尼,也不愿意留着备用?", "authorlink": "https://my.hupu.com/128882595791656", "url": "https://bbs.hupu.com/27336327.html", "scan": "65375", "endauthor": "靓仔鸽", "author": "客队蜜蜜"}
{"reply": "1", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "吹一波今年季后赛坎特", "authorlink": "https://my.hupu.com/133029354296111", "url": "https://bbs.hupu.com/27360668.html", "scan": "75", "endauthor": "丿Wizard丶", "author": "丿Wizard丶"}
{"reply": "49", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "东决G6和西决G6  东詹和西詹的不同命运", "authorlink": "https://my.hupu.com/251117791994202", "url": "https://bbs.hupu.com/27356874.html", "scan": "8413", "endauthor": "田罗纳尔多", "author": "梅西赛后说"}
{"reply": "58", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "求科普:NBA历史上有没有出现拿了常规赛FMVP,却没有进入当年全明星的呢", "authorlink": "https://my.hupu.com/107391370994397", "url": "https://bbs.hupu.com/27342113.html", "scan": "10742", "endauthor": "列宁86", "author": "砍柴的阿然"}
{"reply": "27", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "[流言板]库里:我都不知道克莱是否真的感到了压力,他就是打球", "authorlink": "https://my.hupu.com/236472107528248", "url": "https://bbs.hupu.com/27358263.html", "scan": "7526", "endauthor": "aqi8109", "author": "拖鞋的脱"}
{"reply": "3", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "想说下自己的观点 打转换进攻是对球队后场篮板的保护", "authorlink": "https://my.hupu.com/250971314571985", "url": "https://bbs.hupu.com/27360574.html", "scan": "68", "endauthor": "ZlatanHan", "author": "不痛的不算xx"}
{"reply": "4", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "希望火箭球迷多给卫平老师一些尊重。", "authorlink": "https://my.hupu.com/26206837676547", "url": "https://bbs.hupu.com/27358411.html", "scan": "853", "endauthor": "小豆子de梦想", "author": "小OK姐姐"}
{"reply": "65", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "不懂就问,大帝和字母哥的天赋谁更强!", "authorlink": "https://my.hupu.com/34008750235629", "url": "https://bbs.hupu.com/27354877.html", "scan": "4679", "endauthor": "把海倒进杯", "author": "国产德文韦"}
........
.....
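Because every line of the output file is an independent JSON object, it can be read back line by line; a minimal sketch, assuming the file is named data.json as in the pipeline above:

import json

posts = []
with open("data.json", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            posts.append(json.loads(line))

print(len(posts), "posts loaded")
print(posts[0]["title"], posts[0]["reply"], posts[0]["scan"])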

That is the complete design and implementation of a project that automatically crawls information across multiple pages; hopefully it gives you a deeper understanding of crawling with the Scrapy framework. Note that the approach described here applies to sites with regular URLs: automatic crawling is achieved by finding the URL pattern and generating requests from it. The code for this post is on GitHub: https://github.com/carson0408/AutoCrawler.git
