The previous post compared a hand-written crawler with one built on the Scrapy framework and discussed Scrapy's advantages. That post only crawled a single page; this one shows how to crawl many pages automatically. Taking the post listings of Hupu's 湿乎乎 (NBA) forum as an example, we will walk through a spider that automatically crawls post information page by page.

1. Page Analysis

Open https://bbs.hupu.com/vote. This is the page where the crawl starts; it looks like the figure below:

Looking at this page, the goal is to extract the relevant information for every post: the post URL, post title, post author, link to the author's profile, number of replies, number of views, publication time, time of the last reply, and the nickname of the last replier.

Next, open the browser developer tools (F12) to inspect the page and define an extraction rule for each of these fields; a quick way to verify the rules is shown after the list.

1. url. As shown in the figure below, the XPath rule for the post link can be defined as: "//a[@class='truetit']/@href"

2. title. From the same element, the title rule can be: "//a[@class='truetit']/text()"

3. author. The figure below shows the author information; the rule can be defined as: "//a[@class='aulink']/text()".

4. authorlink. From the same element, the rule is: "//a[@class='aulink']/@href"

5. pubtime. The publication-time rule can be defined as: "//a[@style='color:#808080;cursor: initial; ']/text()"

6. replyandscan. The reply count and view count are kept together as a single field here and split apart after crawling. As shown in the figure below, the rule can be set to: "//span[@class='ansour box']/text()"

7. lastreplytime. The time of the last reply, as shown in the figure below: "//div[@class='endreply box']/a/text()"

8. endauthor. The nickname of the last replier: "//div[@class='endreply box']/span/text()"
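Before wiring these rules into the project, it can help to confirm that they actually match something. A minimal sketch using scrapy shell (the URL and rules come straight from the list above; the output naturally depends on the live page):

scrapy shell "https://bbs.hupu.com/vote"
# then, inside the shell:
response.xpath("//a[@class='truetit']/@href").extract()[:3]          # first few post links
response.xpath("//a[@class='truetit']/text()").extract()[:3]         # first few titles
response.xpath("//span[@class='ansour box']/text()").extract()[:3]   # raw "replies / views" strings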

2. Creating the Project and Writing items.py

First, create the project following the steps described in the earlier post. The project structure looks like this:
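For reference, the project can be generated with scrapy startproject; a sketch of the command and the layout Scrapy typically produces (the project name autocrawl matches the module paths used in the settings later; details may vary across Scrapy versions):

scrapy startproject autocrawl
cd autocrawl
# Typical generated layout:
# autocrawl/
# ├── scrapy.cfg
# └── autocrawl/
#     ├── __init__.py
#     ├── items.py
#     ├── middlewares.py
#     ├── pipelines.py
#     ├── settings.py
#     └── spiders/
#         └── __init__.py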

Based on the page analysis above, define the required fields in items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AutocrawlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # post URL
    url = scrapy.Field()
    # post title
    title = scrapy.Field()
    # post author
    author = scrapy.Field()
    # link to the author's profile
    authorlink = scrapy.Field()
    # number of replies
    reply = scrapy.Field()
    # number of views
    scan = scrapy.Field()
    # publication time
    pubtime = scrapy.Field()
    # time of the last reply
    lastreplytime = scrapy.Field()
    # last replier
    endauthor = scrapy.Field()
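One thing worth noting about how these fields are used in this project: each field of an AutocrawlItem ends up holding a list with one entry per post on a listing page, and the pipeline later walks those lists in parallel by index. A tiny illustration with made-up values:

from autocrawl.items import AutocrawlItem

item = AutocrawlItem()
item["url"] = ["/27355887.html", "/27330869.html"]         # one entry per post on the page
item["title"] = ["first post title", "second post title"]  # parallel list: item["title"][j] goes with item["url"][j]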

3. Writing pipelines.py

items.py defines the structure of the scraped data, while pipelines.py post-processes the structured items. Here the data is saved to a JSON file, with each post written as one JSON object per line. For better portability, the JSON file is created with a relative path, as shown below:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


class AutocrawlPipeline(object):
    def __init__(self):
        # note: adjust this relative path when porting the code
        self.file = codecs.open("../autocrawl/data.json", "wb", encoding="utf-8")

    def process_item(self, item, spider):
        for j in range(len(item["url"])):
            url = "https://bbs.hupu.com" + item["url"][j]
            title = item["title"][j]
            author = item["author"][j]
            authorlink = item["authorlink"][j]
            reply = item["reply"][j]
            scan = item["scan"][j]
            pubtime = item["pubtime"][j]
            lastreplytime = item["lastreplytime"][j]
            endauthor = item["endauthor"][j]
            oneitem = {"url": url, "title": title, "author": author,
                       "authorlink": authorlink, "reply": reply, "scan": scan,
                       "pubtime": pubtime, "lastreplytime": lastreplytime,
                       "endauthor": endauthor}
            i = json.dumps(oneitem, ensure_ascii=False)
            # one JSON object per line
            line = i + '\n'
            print(line)  # debugging
            self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
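As an aside, Scrapy's built-in feed exports can also write items as JSON lines without a custom pipeline; a hedged sketch of the Scrapy 1.x-style settings (newer Scrapy versions use the FEEDS dict instead). Note that the custom pipeline above additionally flattens each page's parallel lists into one object per post, which the feed export would not do by itself:

# settings.py — alternative output via feed exports (Scrapy 1.x style)
FEED_FORMAT = 'jsonlines'        # one JSON object per line
FEED_URI = 'data.json'           # written relative to where scrapy crawl is run
FEED_EXPORT_ENCODING = 'utf-8'   # keep Chinese text readable instead of \uXXXX escapes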

4. Writing settings.py

settings.py holds the project configuration: registering the pipeline above, disabling cookies, setting a download delay, and so on. The code is as follows:

# -*- coding: utf-8 -*-

# Scrapy settings for autocrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'autocrawl'

SPIDER_MODULES = ['autocrawl.spiders']
NEWSPIDER_MODULE = 'autocrawl.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'autocrawl (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'autocrawl.middlewares.AutocrawlSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'autocrawl.middlewares.AutocrawlDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'autocrawl.pipelines.AutocrawlPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5. Writing the Spider

In the project directory, create a spider named auto with scrapy genspider -t basic auto hupu.com. The start page is the https://bbs.hupu.com/vote page mentioned at the beginning. Looking at page 2 and onwards of the forum, the URLs follow a pattern: https://bbs.hupu.com/vote-page, where page is the page number, starting from 2. The spider extracts each field using the rules defined above. Since the crawl begins from the start page, it must then yield new Request objects with the subsequent page URLs and a callback to the parse function in order to keep crawling. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from autocrawl.items import AutocrawlItem


class AutoSpider(scrapy.Spider):
    name = 'auto'
    allowed_domains = ['hupu.com']
    start_urls = ['https://bbs.hupu.com/vote']

    def parse(self, response):
        item = AutocrawlItem()
        item["url"] = response.xpath("//a[@class='truetit']/@href").extract()
        item["title"] = response.xpath("//a[@class='truetit']/text()").extract()
        item["author"] = response.xpath("//a[@class='aulink']/text()").extract()
        item["authorlink"] = response.xpath("//a[@class='aulink']/@href").extract()
        # item["replyandscan"]
        # The replyandscan strings contain '\xa0' characters that would cause errors,
        # so they are handled separately and split into reply and scan.
        t = response.xpath("//span[@class='ansour box']/text()").extract()
        reply = []
        scan = []
        for u in t:
            tmp = str(u).replace("\xa0", "").split("/")
            if tmp != []:
                reply.append(tmp[0])
                scan.append(tmp[1])
        item["reply"] = reply
        item["scan"] = scan
        item["pubtime"] = response.xpath("//div[@class='author box']//a[@style='color:#808080;cursor: initial; ']/text()").extract()
        item["lastreplytime"] = response.xpath("//div[@class='endreply box']/a/text()").extract()
        item["endauthor"] = response.xpath("//div[@class='endreply box']/span/text()").extract()
        yield item
        for i in range(2, 21):
            url = "https://bbs.hupu.com/vote-" + str(i)
            # Yield a Request with the next page URL and parse as the callback,
            # so the crawl continues automatically.
            yield Request(url, callback=self.parse)
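To make the '\xa0' handling concrete, here is a small standalone check of the split logic used above (the sample string imitates the "replies / views" text on the listing page):

# The listing shows reply and view counts as a single string such as "130\xa0/\xa012658".
raw = "130\xa0/\xa012658"
tmp = raw.replace("\xa0", "").split("/")
print(tmp)                      # ['130', '12658']
reply, scan = tmp[0], tmp[1]    # reply count, view count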

6. Running the Spider

Run the spider as shown below; when it finishes, a new JSON file appears in the project:
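A sketch of the run command, assuming the spider name auto created above; keep in mind that the relative path "../autocrawl/data.json" in the pipeline resolves against the directory the command is run from, so adjust it if the file does not appear where you expect:

scrapy crawl auto    # run the spider named 'auto'; the custom pipeline writes the JSON file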

Opening the JSON file, part of the content looks like this:

{"reply": "130", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "哈登表现没那么差,别一场论行吗", "authorlink": "https://my.hupu.com/107436237696224", "url": "https://bbs.hupu.com/27355887.html", "scan": "12658", "endauthor": "可乐瓶盖", "author": "zppsss"}
{"reply": "166", "lastreplytime": "17:55", "pubtime": "2019-05-10", "title": "应该讨论下CJ麦科勒姆到底差欧文多少呢", "authorlink": "https://my.hupu.com/186718992630848", "url": "https://bbs.hupu.com/27330869.html", "scan": "28306", "endauthor": "啦啦啦魔法能量", "author": "dw0818"}
{"reply": "256", "lastreplytime": "17:55", "pubtime": "2019-05-10", "title": "认真讲,目前东西部没有一个系列赛强度比得上火勇吧?", "authorlink": "https://my.hupu.com/255955476402177", "url": "https://bbs.hupu.com/27335533.html", "scan": "60839", "endauthor": "呵呵Tim催哦", "author": "白边年薪302K值"}
{"reply": "1128", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "[话题团]一念天堂,一念地狱!库里半场33分率队挺进西部决赛!", "authorlink": "https://my.hupu.com/263688482793184", "url": "https://bbs.hupu.com/27348324.html", "scan": "241655", "endauthor": "盖了又盖之锅盖", "author": "耶律"}
{"reply": "96", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "当时这个詹绝杀什么水平", "authorlink": "https://my.hupu.com/88507341799843", "url": "https://bbs.hupu.com/27341919.html", "scan": "37508", "endauthor": "KD超KB", "author": "热心用户"}
{"reply": "6", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "难道我经历的脱臼和库里的脱臼不是一种脱臼?", "authorlink": "https://my.hupu.com/266836632587182", "url": "https://bbs.hupu.com/27359525.html", "scan": "2369", "endauthor": "一入论坛深似海", "author": "左脚的3分特别准"}
{"reply": "0", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "非火蜜的感受,觉得大家还是要理性看待自己主队。", "authorlink": "https://my.hupu.com/224328494463192", "url": "https://bbs.hupu.com/27360738.html", "scan": "31", "endauthor": "gutcci1", "author": "gutcci1"}
{"reply": "14", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "今天过后,有没有人跟我想法一样的,投个票吧", "authorlink": "https://my.hupu.com/116611849283329", "url": "https://bbs.hupu.com/27359705.html", "scan": "883", "endauthor": "七年之役", "author": "莫雷的炮友"}
{"reply": "15", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "看完这场,觉得邓肯,库里和韦德真的堪称伟大", "authorlink": "https://my.hupu.com/274330133718993", "url": "https://bbs.hupu.com/27359012.html", "scan": "2458", "endauthor": "做好事儿不留名", "author": "西西四西西"}
{"reply": "5", "lastreplytime": "17:55", "pubtime": "2019-05-11", "title": "德安东尼第四节为什么还让让保罗休息?", "authorlink": "https://my.hupu.com/163827142009834", "url": "https://bbs.hupu.com/27351736.html", "scan": "1766", "endauthor": "飘然出世", "author": "史蒂分布莱恩特"}
{"reply": "38", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "窝火出哈登,卡佩拉,换欧文,霍福德,洛奇儿,凯米们愿意么?,愿意就通知经理了", "authorlink": "https://my.hupu.com/198259890671262", "url": "https://bbs.hupu.com/27354862.html", "scan": "5086", "endauthor": "若风是只you", "author": "1348228476"}
{"reply": "10", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "杜兰特压力太大了", "authorlink": "https://my.hupu.com/228144193864995", "url": "https://bbs.hupu.com/27354881.html", "scan": "3487", "endauthor": "伊斯科盘带没我6", "author": "leeson911"}
{"reply": "52", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "我有点看不懂了,保罗到底想干嘛?", "authorlink": "https://my.hupu.com/201450290883476", "url": "https://bbs.hupu.com/27342874.html", "scan": "13602", "endauthor": "瑞克警长", "author": "包头大三中"}
{"reply": "9", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "NBA纪录季后赛库里得分30+勇士29胜4负胜率87.9%", "authorlink": "https://my.hupu.com/52562014793614", "url": "https://bbs.hupu.com/27359803.html", "scan": "1343", "endauthor": "kit杰248j", "author": "Ken任"}
{"reply": "126", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "不懂就问!火勇大战这姑娘是谁", "authorlink": "https://my.hupu.com/95098224075381", "url": "https://bbs.hupu.com/27344624.html", "scan": "38890", "endauthor": "别别别奶", "author": "青峰不会飞"}
{"reply": "116", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "都一个个进来给库里道歉!", "authorlink": "https://my.hupu.com/1037239487635", "url": "https://bbs.hupu.com/27351436.html", "scan": "16960", "endauthor": "老子会算卦", "author": "饭特雷西"}
{"reply": "50", "lastreplytime": "17:54", "pubtime": "2019-05-11", "title": "我一直想不通格林为什么不练一下中投啊?", "authorlink": "https://my.hupu.com/224421424371559", "url": "https://bbs.hupu.com/27354703.html", "scan": "14623", "endauthor": "给你20块别舔了好吗", "author": "无极威骑扣无极尊"}
{"reply": "151", "lastreplytime": "17:53", "pubtime": "2019-05-10", "title": "火箭宁可裁掉安东尼,也不愿意留着备用?", "authorlink": "https://my.hupu.com/128882595791656", "url": "https://bbs.hupu.com/27336327.html", "scan": "65375", "endauthor": "靓仔鸽", "author": "客队蜜蜜"}
{"reply": "1", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "吹一波今年季后赛坎特", "authorlink": "https://my.hupu.com/133029354296111", "url": "https://bbs.hupu.com/27360668.html", "scan": "75", "endauthor": "丿Wizard丶", "author": "丿Wizard丶"}
{"reply": "49", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "东决G6和西决G6  东詹和西詹的不同命运", "authorlink": "https://my.hupu.com/251117791994202", "url": "https://bbs.hupu.com/27356874.html", "scan": "8413", "endauthor": "田罗纳尔多", "author": "梅西赛后说"}
{"reply": "58", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "求科普:NBA历史上有没有出现拿了常规赛FMVP,却没有进入当年全明星的呢", "authorlink": "https://my.hupu.com/107391370994397", "url": "https://bbs.hupu.com/27342113.html", "scan": "10742", "endauthor": "列宁86", "author": "砍柴的阿然"}
{"reply": "27", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "[流言板]库里:我都不知道克莱是否真的感到了压力,他就是打球", "authorlink": "https://my.hupu.com/236472107528248", "url": "https://bbs.hupu.com/27358263.html", "scan": "7526", "endauthor": "aqi8109", "author": "拖鞋的脱"}
{"reply": "3", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "想说下自己的观点 打转换进攻是对球队后场篮板的保护", "authorlink": "https://my.hupu.com/250971314571985", "url": "https://bbs.hupu.com/27360574.html", "scan": "68", "endauthor": "ZlatanHan", "author": "不痛的不算xx"}
{"reply": "4", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "希望火箭球迷多给卫平老师一些尊重。", "authorlink": "https://my.hupu.com/26206837676547", "url": "https://bbs.hupu.com/27358411.html", "scan": "853", "endauthor": "小豆子de梦想", "author": "小OK姐姐"}
{"reply": "65", "lastreplytime": "17:53", "pubtime": "2019-05-11", "title": "不懂就问,大帝和字母哥的天赋谁更强!", "authorlink": "https://my.hupu.com/34008750235629", "url": "https://bbs.hupu.com/27354877.html", "scan": "4679", "endauthor": "把海倒进杯", "author": "国产德文韦"}
........
.....
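Because every line of the output file is an independent JSON object, it can be read back line by line; a minimal sketch, assuming the file is named data.json as in the pipeline above:

import json

posts = []
with open("data.json", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            posts.append(json.loads(line))

print(len(posts), "posts loaded")
print(posts[0]["title"], posts[0]["reply"], posts[0]["scan"])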

That is the complete design and implementation of a project that automatically crawls information across multiple pages; hopefully it gives you a deeper understanding of crawling with the Scrapy framework. Note that the approach described here applies to sites with regular URLs: automatic crawling is achieved by finding the URL pattern and generating requests from it. The code for this post is on GitHub: https://github.com/carson0408/AutoCrawler.git
