Distributed crawling with Scrapy-redis

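Steps for building a distributed crawler:
1. Write the crawler project with Scrapy first.
2. Set up a Redis server.
3. Convert the single-machine version into a distributed one:
a. Modify settings.
b. Change the base class the spider inherits from:
from scrapy_redis.spiders import RedisSpider  (in place of Spider)
from scrapy_redis.spiders import RedisCrawlSpider  (in place of CrawlSpider)
c. Set the redis_key.
d. Start all slaves (every crawler terminal).
e. Push data (the start URL) into Redis on the master:
lpush redis-key url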
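
Step a only says to modify settings. As a minimal sketch of what that usually involves, the lines below could be added to the project's settings.py. The setting names are the ones scrapy-redis documents. The host reuses the 192.168.6.6 address from the redis-cli example in this post. The port (the Redis default) and the optional item pipeline are assumptions, not something the original specifies.

```python
# settings.py (sketch): additions that switch a Scrapy project to scrapy-redis.

# Use the scrapy-redis scheduler so every slave pulls requests from one shared Redis queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through Redis so slaves do not repeat each other's work.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis when a spider closes (allows pause/resume).
SCHEDULER_PERSIST = True

# Optional (assumption): also store scraped items in Redis.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Address of the master's Redis server. The host comes from the redis-cli example;
# the port is the Redis default and is an assumption.
REDIS_HOST = "192.168.6.6"
REDIS_PORT = 6379
```

Every slave runs the same project with these settings, so all of them share one queue and one duplicate filter on the master.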

The myspider_redis.py version

# coding: utf8
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'myspider_redis'
    # Redis key of the request queue
    redis_key = 'myspider:start_urls'
    # Scrapy reads `allowed_domains`; the original `allow_domain` was silently ignored
    allowed_domains = ['hao123.com']

    def parse(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }

manage.py:

from scrapy import cmdline
import os
os.chdir('example/spiders')
cmdline.execute('scrapy runspider mycrawler_redis.py'.split())

redis-cli -h 192.168.6.6
lpush myspider:start_urls http://www.baidu.com
llen myspider:start_urls
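
The same push can also be done from a script instead of redis-cli. The helper below is only a sketch: `seed_start_urls` is a hypothetical name, and it accepts any client object that exposes `lpush` and `llen`, which is what redis-py's `redis.Redis` provides. Using redis-py at all is an assumption, since the original only uses redis-cli.

```python
def seed_start_urls(client, redis_key, urls):
    """Push start URLs onto the spider's Redis list and return the list length.

    `client` is any object with redis-py style `lpush` and `llen` methods.
    """
    urls = list(urls)
    if urls:
        # LPUSH accepts several values at once, mirroring `lpush key url [url ...]`.
        client.lpush(redis_key, *urls)
    # LLEN reports the current list length, like `llen myspider:start_urls`.
    return client.llen(redis_key)


# Usage against the master from the example (requires `pip install redis`):
#   import redis
#   client = redis.Redis(host="192.168.6.6", port=6379)
#   seed_start_urls(client, "myspider:start_urls", ["http://www.baidu.com"])
```

If slaves are already running, the returned length may be lower than the number of URLs pushed, because the spiders pop URLs from the list as soon as they appear.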

The mycrawler_redis.py version

# coding: utf8
from scrapy.spiders import Rule  # extracts URLs according to rules
from scrapy.linkextractors import LinkExtractor  # finds the links that become requests
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):
    """Spider that reads urls from redis queue (mycrawler:start_urls)."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'
    allowed_domains = ['itxdl.cn']

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }

lpush mycrawler:start_urls http://www.baidu.com
