Scrapy distributed crawler: RedisSpider

Scraping book information from Dangdang

Several machines crawl at the same time and share a single Redis store for the crawl state, using scrapy_redis.
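
The original post does not show its project settings. As a minimal sketch, a scrapy_redis project usually points the scheduler and the duplicate filter at Redis in `settings.py`, along the following lines. The Redis address below is an assumption for illustration, and every machine taking part in the crawl would need to point at the same server.

```python
# settings.py (sketch, not taken from the original post)

# Let scrapy_redis schedule requests through Redis instead of local memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the "already seen" request fingerprints across all machines.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis when a spider stops,
# so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumed address of the shared Redis server; replace with your own.
REDIS_URL = "redis://127.0.0.1:6379"
```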

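Requests waiting to be crawled are stored in Redis. Each machine reads a request object, removes its record from Redis, and then crawls it. This is what makes the crawl distributed.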
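
The read-and-remove behaviour can be illustrated without Redis. The sketch below is a toy model only: an in-memory `queue.Queue` stands in for the shared Redis queue, threads stand in for machines, and the URLs are hypothetical. It is not how scrapy_redis is implemented; it just shows that when claiming a request also removes it, no request is handled twice.

```python
import queue
import threading

# In-memory stand-in for the shared Redis request queue (assumption:
# in the real setup this queue lives in Redis and is managed by scrapy_redis).
shared_requests = queue.Queue()
for page in range(1, 21):
    # Hypothetical URLs, for illustration only.
    shared_requests.put(f"http://example.com/list?page={page}")

handled = []                      # (worker name, url) pairs
handled_lock = threading.Lock()   # protects the shared result list


def worker(name):
    """Claim requests until the shared queue is empty."""
    while True:
        try:
            # get_nowait() reads and removes the entry in one step,
            # so no other worker can claim the same request.
            url = shared_requests.get_nowait()
        except queue.Empty:
            return
        with handled_lock:
            handled.append((name, url))


threads = [
    threading.Thread(target=worker, args=(f"machine-{i}",)) for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

urls = [url for _, url in handled]
print(len(urls), len(set(urls)))  # total handled vs. distinct URLs
```

How the work is split between the three workers varies from run to run, but the two printed counts always match.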

import scrapy
from scrapy_redis.spiders import RedisSpider
from copy import deepcopy


class DangdangSpider(RedisSpider):
    name = 'dangdang'
    allowed_domains = ['dangdang.com']
    # The spider starts by reading its start URL from this Redis key.
    redis_key = "dangdang"  # lpush dangdang 'http://book.dangdang.com/'

    def parse(self, response):
        # Top-level categories
        div_list = response.xpath("//div[@class='con flq_body']/div")[:-4]
        print(len(div_list), 'top-level categories')
        for div in div_list:
            item = {}
            item['b_cate'] = div.xpath("./dl/dt//text()").extract()
            # Drop empty strings
            item['b_cate'] = [i.strip() for i in item['b_cate'] if len(i.strip()) > 0]
            print('b_cate:', item['b_cate'])
            # Middle-level categories
            if item['b_cate'] == ['创意文具']:
                # "创意文具" (creative stationery) is skipped for now
                print(item['b_cate'], "pass......")
                item['m_cate'] = None
                item['s_cate_url'] = div.xpath("./dl/dt/a/@ddt-src").extract_first()
                print('s_cate_url:', item['s_cate_url'])
                # yield scrapy.Request(
                #     item['s_cate_url'],
                #     callback=self.parse_special,
                #     meta={'item': deepcopy(item)}
                # )
            else:
                dl_list = div.xpath(".//dl[@class='inner_dl']")
                for dl in dl_list:
                    item['m_cate'] = dl.xpath("./dt//text()").extract()
                    item['m_cate'] = [i.strip() for i in item['m_cate'] if len(i.strip()) > 0]
                    # Sub-categories
                    dd_list = dl.xpath("./dd")
                    for dd in dd_list:
                        item['s_cate'] = dd.xpath("./a/@title").extract_first()
                        item['s_cate_url'] = dd.xpath("./a/@ddt-src").extract_first()
                        # All books in this sub-category
                        if item['s_cate_url'] is not None:
                            yield scrapy.Request(
                                item['s_cate_url'],
                                callback=self.parse_books,
                                meta={'item': deepcopy(item)}
                            )

    def parse_special(self, response):
        ''' Stationery information '''
        pass

    def parse_books(self, response):
        item = response.meta['item']
        # Books in the current sub-category
        li_list = response.xpath("//ul[@class='list_aa ']/li")
        for li in li_list:
            try:
                item['book_price'] = (
                    li.xpath(".//span[@class='num']/text()").extract_first()
                    + li.xpath(".//span[@class='tail']/text()").extract_first()
                )
            except TypeError:
                # extract_first() returned None for a missing price part
                item['book_price'] = 'Unknown'
            item['book_url'] = li.xpath("./a/@href").extract_first()
            if item['book_url'] is not None:
                yield scrapy.Request(
                    item['book_url'],
                    callback=self.parse_book_detail,
                    meta={'item': deepcopy(item)}
                )

    def parse_book_detail(self, response):
        item = response.meta['item']
        # Note: <img> elements carry no text nodes, so this path may need
        # adjusting (for example to the h1 itself) to return the book name.
        item['book_name'] = response.xpath("//div[@class='name_info']/h1/img/text()").extract_first()
        item['book_desc'] = response.xpath("//span[@class='head_title_name']/text()").extract_first()
        # Detailed information for this book
        span_list = response.xpath("//div[@class='messbox_info']/span")
        # Note: the paths below are relative to each <span> selected above,
        # so they only match if the page nests spans inside those spans.
        item['book_author'] = span_list.xpath("./span[1]/a/text()").extract()  # may be several authors
        item['publisher'] = span_list.xpath("./span[2]/a/text()").extract_first()
        item['pub_date'] = span_list.xpath("./span[3]/text()").extract_first()
        print(item)
        # yield item
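
The spider passes `deepcopy(item)` in `meta` rather than `item` itself. The post does not explain why, but the likely reason is that the same dict is mutated on every loop iteration while the queued requests are handled later. The sketch below uses a plain list in place of Scrapy's request queue (an assumption made so it runs standalone) to show the difference.

```python
from copy import deepcopy

item = {}
shared_metas = []   # what callbacks would see if `item` were passed directly
copied_metas = []   # what callbacks see when a deep copy is passed

for title in ["A", "B", "C"]:
    # The loop keeps overwriting the same dict, as in the spider.
    item["s_cate"] = title
    shared_metas.append(item)            # same object every time
    copied_metas.append(deepcopy(item))  # independent snapshot

# By the time the "callbacks" run, the loop has finished.
print([m["s_cate"] for m in shared_metas])
print([m["s_cate"] for m in copied_metas])
```

Without the copy, every entry refers to the final state of the dict; with the copy, each request keeps the category values that were current when it was created.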

Posted on 2019-05-10 16:27 by .Tang

Reposted from: https://www.cnblogs.com/tangpg/p/10845174.html
