From: https://blog.csdn.net/weixin_37947156/article/details/75082061

Scrapy for Beginners, Part 3 (Scrapy-Redis-based distributed crawling and a cookies pool): https://cuiqingcai.com/4048.html

Before we begin, we need to get to know scrapy-redis's configuration options. PS: all of these settings are written into the Scrapy project's settings.py!

All of Scrapy's default settings

scrapy/settings/default_settings.py

"""
This module contains the default values for all settings used by Scrapy.For more information about these settings you can read the settings
documentation in docs/topics/settings.rstScrapy developers, if you add a setting here remember to:* add it in alphabetical order
* group similar settings without leaving blank lines
* add its documentation to the available settings documentation(docs/topics/settings.rst)"""import sys
from importlib import import_module
from os.path import join, abspath, dirnameimport sixAJAXCRAWL_ENABLED = FalseAUTOTHROTTLE_ENABLED = False
AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0BOT_NAME = 'scrapybot'CLOSESPIDER_TIMEOUT = 0
CLOSESPIDER_PAGECOUNT = 0
CLOSESPIDER_ITEMCOUNT = 0
CLOSESPIDER_ERRORCOUNT = 0COMMANDS_MODULE = ''COMPRESSION_ENABLED = TrueCONCURRENT_ITEMS = 100CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0COOKIES_ENABLED = True
COOKIES_DEBUG = FalseDEFAULT_ITEM_CLASS = 'scrapy.item.Item'DEFAULT_REQUEST_HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8','Accept-Language': 'en',
}DEPTH_LIMIT = 0
DEPTH_STATS_VERBOSE = False
DEPTH_PRIORITY = 0DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
DNS_TIMEOUT = 60DOWNLOAD_DELAY = 0# 用户可自定义的下载处理器
DOWNLOAD_HANDLERS = {}
# 默认的下载处理器
DOWNLOAD_HANDLERS_BASE = {'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler','file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler','http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler','https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler','s3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler','ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}DOWNLOAD_TIMEOUT = 180      # 3minsDOWNLOAD_MAXSIZE = 1024*1024*1024   # 1024m
DOWNLOAD_WARNSIZE = 32*1024*1024    # 32mDOWNLOAD_FAIL_ON_DATALOSS = TrueDOWNLOADER = 'scrapy.core.downloader.Downloader'DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
DOWNLOADER_CLIENT_TLS_METHOD = 'TLS' # Use highest TLS/SSL protocol version supported by the platform,# also allowing negotiationDOWNLOADER_MIDDLEWARES = {}DOWNLOADER_MIDDLEWARES_BASE = {# Engine side'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,# Downloader side
}DOWNLOADER_STATS = TrueDUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'EDITOR = 'vi'
if sys.platform == 'win32':EDITOR = '%s -m idlelib.idle'EXTENSIONS = {}EXTENSIONS_BASE = {'scrapy.extensions.corestats.CoreStats': 0,'scrapy.extensions.telnet.TelnetConsole': 0,'scrapy.extensions.memusage.MemoryUsage': 0,'scrapy.extensions.memdebug.MemoryDebugger': 0,'scrapy.extensions.closespider.CloseSpider': 0,'scrapy.extensions.feedexport.FeedExporter': 0,'scrapy.extensions.logstats.LogStats': 0,'scrapy.extensions.spiderstate.SpiderState': 0,'scrapy.extensions.throttle.AutoThrottle': 0,
}FEED_TEMPDIR = None
FEED_URI = None
FEED_URI_PARAMS = None  # a function to extend uri arguments
FEED_FORMAT = 'jsonlines'
FEED_STORE_EMPTY = False
FEED_EXPORT_ENCODING = None
FEED_EXPORT_FIELDS = None
FEED_STORAGES = {}
FEED_STORAGES_BASE = {'': 'scrapy.extensions.feedexport.FileFeedStorage','file': 'scrapy.extensions.feedexport.FileFeedStorage','stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage','s3': 'scrapy.extensions.feedexport.S3FeedStorage','ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
FEED_EXPORTERS = {}
FEED_EXPORTERS_BASE = {'json': 'scrapy.exporters.JsonItemExporter','jsonlines': 'scrapy.exporters.JsonLinesItemExporter','jl': 'scrapy.exporters.JsonLinesItemExporter','csv': 'scrapy.exporters.CsvItemExporter','xml': 'scrapy.exporters.XmlItemExporter','marshal': 'scrapy.exporters.MarshalItemExporter','pickle': 'scrapy.exporters.PickleItemExporter',
}
FEED_EXPORT_INDENT = 0FILES_STORE_S3_ACL = 'private'
FILES_STORE_GCS_ACL = ''FTP_USER = 'anonymous'
FTP_PASSWORD = 'guest'
FTP_PASSIVE_MODE = TrueHTTPCACHE_ENABLED = False
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_MISSING = False
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_ALWAYS_STORE = False
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_IGNORE_SCHEMES = ['file']
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = []
HTTPCACHE_DBM_MODULE = 'anydbm' if six.PY2 else 'dbm'
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_GZIP = FalseHTTPPROXY_ENABLED = True
HTTPPROXY_AUTH_ENCODING = 'latin-1'IMAGES_STORE_S3_ACL = 'private'
IMAGES_STORE_GCS_ACL = ''ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager'ITEM_PIPELINES = {}
ITEM_PIPELINES_BASE = {}LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FORMATTER = 'scrapy.logformatter.LogFormatter'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
LOG_STDOUT = False
LOG_LEVEL = 'DEBUG'
LOG_FILE = None
LOG_SHORT_NAMES = FalseSCHEDULER_DEBUG = FalseLOGSTATS_INTERVAL = 60.0MAIL_HOST = 'localhost'
MAIL_PORT = 25
MAIL_FROM = 'scrapy@localhost'
MAIL_PASS = None
MAIL_USER = NoneMEMDEBUG_ENABLED = False        # enable memory debugging
MEMDEBUG_NOTIFY = []            # send memory debugging report by mail at engine shutdownMEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 0
MEMUSAGE_NOTIFY_MAIL = []
MEMUSAGE_WARNING_MB = 0METAREFRESH_ENABLED = True
METAREFRESH_MAXDELAY = 100NEWSPIDER_MODULE = ''RANDOMIZE_DOWNLOAD_DELAY = TrueREACTOR_THREADPOOL_MAXSIZE = 10REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 20  # uses Firefox default setting
REDIRECT_PRIORITY_ADJUST = +2REFERER_ENABLED = True
REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'RETRY_ENABLED = True
RETRY_TIMES = 2  # initial response + 2 retries = 3 requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
RETRY_PRIORITY_ADJUST = -1ROBOTSTXT_OBEY = FalseSCHEDULER = 'scrapy.core.scheduler.Scheduler'# 基于磁盘的任务队列(后进先出)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'# 基于内存的任务队列(后进先出)
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'# 优先级队列
SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue'SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'
SPIDER_LOADER_WARN_ONLY = FalseSPIDER_MIDDLEWARES = {}SPIDER_MIDDLEWARES_BASE = {# Engine side'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,# Spider side
}SPIDER_MODULES = []STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
STATS_DUMP = TrueSTATSMAILER_RCPTS = []TEMPLATES_DIR = abspath(join(dirname(__file__), '..', 'templates'))URLLENGTH_LIMIT = 2083USER_AGENT = 'Scrapy/%s (+https://scrapy.org)' % import_module('scrapy').__version__TELNETCONSOLE_ENABLED = 1
TELNETCONSOLE_PORT = [6023, 6073]
TELNETCONSOLE_HOST = '127.0.0.1'
TELNETCONSOLE_USERNAME = 'scrapy'
TELNETCONSOLE_PASSWORD = NoneSPIDER_CONTRACTS = {}
SPIDER_CONTRACTS_BASE = {'scrapy.contracts.default.UrlContract': 1,'scrapy.contracts.default.ReturnsContract': 2,'scrapy.contracts.default.ScrapesContract': 3,
}
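These defaults sit at the lowest priority: anything you put in your project's settings.py overrides them, and command-line options override both. Scrapy's real implementation lives in scrapy.settings.Settings; the toy class below is only a sketch of the precedence rule, not Scrapy's actual code:

```python
# Toy sketch: each setting value carries a priority, and a later set()
# only wins if its priority is at least as high as the stored one.
PRIORITIES = {'default': 0, 'project': 20, 'cmdline': 40}

class ToySettings:
    def __init__(self):
        self._store = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority='project'):
        pri = PRIORITIES[priority]
        if name not in self._store or self._store[name][1] <= pri:
            self._store[name] = (value, pri)

    def get(self, name, default=None):
        return self._store[name][0] if name in self._store else default

s = ToySettings()
s.set('CONCURRENT_REQUESTS', 16, priority='default')  # from default_settings.py
s.set('CONCURRENT_REQUESTS', 32, priority='project')  # from your settings.py
print(s.get('CONCURRENT_REQUESTS'))  # 32
```

This is why simply assigning a value in settings.py is enough to replace any of the defaults above.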

Some of scrapy-redis's default settings

scrapy-redis/defaults.py

import redis

# Key used by all spiders for Redis-based deduplication
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

# Key used by the pipeline to store items in Redis
PIPELINE_KEY = '%(spider)s:items'

# Redis client class to use
REDIS_CLS = redis.StrictRedis
# Redis encoding
REDIS_ENCODING = 'utf-8'
# Redis connection parameters
REDIS_PARAMS = {
    'socket_timeout': 30,
    'socket_connect_timeout': 30,
    'retry_on_timeout': True,
    'encoding': REDIS_ENCODING,
}

# Key under which the scheduler stores requests in Redis
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'
# Use a priority queue for scheduling requests (the default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Key used for deduplication in Redis
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'
# Class implementing the dedup rule: DUPEFILTER_CLASS is used first;
# if SCHEDULER_DUPEFILTER_CLASS is set, that one is used instead
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

START_URLS_KEY = '%(name)s:start_urls'
START_URLS_AS_SET = False
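The %(spider)s and %(name)s placeholders in these key templates are filled in with the spider's name at runtime via ordinary Python %-formatting. A quick sketch (the spider name 'douban' is made up for illustration):

```python
# scrapy-redis key templates as defined in defaults.py
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'
START_URLS_KEY = '%(name)s:start_urls'

spider_name = 'douban'  # hypothetical spider name

# The expanded keys are what you will actually see in Redis:
print(SCHEDULER_QUEUE_KEY % {'spider': spider_name})       # douban:requests
print(SCHEDULER_DUPEFILTER_KEY % {'spider': spider_name})  # douban:dupefilter
print(START_URLS_KEY % {'name': spider_name})              # douban:start_urls
```

Knowing these expanded names is handy when you inspect the queues with redis-cli.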

Settings used with Scrapy-Redis

# Enable Scrapy-Redis scheduling and store the request queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Class that handles request deduplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Serializer for the data stored in Redis; pickle by default
# SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Do not clear the Redis queues, i.e. keep the scheduler and dedup records
# when the spider closes. True = keep, False = flush.
# This makes it possible to pause and resume a crawl.
SCHEDULER_PERSIST = True

# Whether to flush the scheduler and dedup records before starting.
# True = flush, False = keep.
# SCHEDULER_FLUSH_ON_START = True

# DEPTH_PRIORITY = 1   # breadth-first
# DEPTH_PRIORITY = -1  # depth-first

# Use a priority queue for scheduling requests (the default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Other available queues: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # breadth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'  # depth-first

# Maximum idle time, to keep a distributed spider from shutting down while
# it waits. When fetching from the scheduler returns nothing, this is the
# longest time to wait before giving up.
# SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Process items through the scrapy-redis pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Redis key under which serialized items are stored
# REDIS_ITEMS_KEY = '%(spider)s:items'

# Items are serialized with ScrapyJSONEncoder by default.
# You can use any importable path to a callable object.
# REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Host and port to use when connecting to Redis (optional)
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379

# URL used to connect to Redis (optional).
# If set, it takes precedence over REDIS_HOST and REDIS_PORT.
# If no user is given, it defaults to root.
# Example: REDIS_URL = "redis://root:12345678@192.168.0.100:6379"
# REDIS_URL = 'redis://user:pass@hostname:9001'

# Connecting to Redis
REDIS_HOST = '100.100.100.100'         # hostname
REDIS_PORT = 9999                      # port
REDIS_PARAMS = {'password': 'xxx'}     # Redis connection parameters
REDIS_ENCODING = "utf-8"               # Redis encoding; default: 'utf-8'
# or:
REDIS_URL = 'redis://user:pass@hostname:9001'  # connection URL (takes precedence over the above)

# Custom Redis client class
# REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, use Redis 'spop' to fetch start URLs. Useful when you need to
# avoid duplicates in the start URL list; with this enabled, URLs must be
# added with 'sadd', otherwise you will get a type error.
# REDIS_START_URLS_AS_SET = False

# Default start_urls key for RedisSpider and RedisCrawlSpider
# REDIS_START_URLS_KEY = '%(name)s:start_urls'
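With the scheduler and spiders idle-waiting on Redis, a crawl is typically kicked off by pushing start URLs into the start_urls key from outside. A hypothetical example using redis-cli (the spider name 'douban', the URL, and the connection details are placeholders matching the sample settings above; this requires a reachable Redis server):

```shell
# Default: the start_urls key is a Redis list, so push with lpush
redis-cli -h 100.100.100.100 -p 9999 -a xxx lpush douban:start_urls "https://example.com/"

# With REDIS_START_URLS_AS_SET = True the key is a set, so use sadd instead:
# redis-cli -h 100.100.100.100 -p 9999 -a xxx sadd douban:start_urls "https://example.com/"
```

Every idle spider instance watching that key will pick up URLs as they arrive, which is what makes the crawl distributed.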

Pick whichever of these settings you need and write them into your project's settings.py file.

If you've been leaning on Google Translate and can't get through the English docs, look here: http://scrapy-redis.readthedocs.io/en/stable/readme.html

Let's continue modifying the spider program from the previous post.

First, write the Redis settings we need into settings.py.

If your Redis database is configured as described in the previous post, you need at least these three settings:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://root:password@host:port'

Fill in the third line according to your actual setup.
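Beyond settings.py, the spider itself changes very little: inherit from RedisSpider (or RedisCrawlSpider) instead of scrapy.Spider, and replace start_urls with a redis_key. A minimal sketch — the spider name, key, and parse logic are illustrative placeholders, not the actual code from the previous post (requires scrapy and scrapy_redis installed):

```python
from scrapy_redis.spiders import RedisSpider


class DoubanSpider(RedisSpider):
    name = 'douban'                  # hypothetical spider name
    # Instead of start_urls, the spider blocks and waits for URLs
    # pushed to this Redis key (e.g. via lpush from redis-cli)
    redis_key = 'douban:start_urls'

    def parse(self, response):
        # Placeholder parse logic; yield your real items here
        yield {'url': response.url, 'title': response.css('title::text').get()}
```

Run as many copies of this spider as you like on different machines; they all share the same Redis queue and dedup set, so each URL is crawled only once.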
