Scrapy is arguably the most powerful crawling framework available. Its strength lies in its high extensibility and loose coupling, which let users modify and extend it with little effort.

It ships with several spider base classes: scrapy.Spider, CrawlSpider, RedisSpider, and RedisCrawlSpider (a distributed deep crawler). They provide the internal logic for ordinary crawling, deep (rule-following) crawling, and distributed crawling respectively. Below we study each of them from the source code and from their usage.

scrapy.Spider

Source code:

"""
Base class for Scrapy spidersSee documentation in docs/topics/spiders.rst
"""
import logging
import warningsfrom scrapy import signals
from scrapy.http import Request
from scrapy.utils.trackref import object_ref
from scrapy.utils.url import url_is_from_spider
from scrapy.utils.deprecate import create_deprecated_class
from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.utils.deprecate import method_is_overriddenclass Spider(object_ref):"""Base class for scrapy spiders. All spiders must inherit from thisclass."""name = Nonecustom_settings = Nonedef __init__(self, name=None, **kwargs):if name is not None:self.name = nameelif not getattr(self, 'name', None):raise ValueError("%s must have a name" % type(self).__name__)self.__dict__.update(kwargs)if not hasattr(self, 'start_urls'):self.start_urls = []@propertydef logger(self):logger = logging.getLogger(self.name)return logging.LoggerAdapter(logger, {'spider': self})def log(self, message, level=logging.DEBUG, **kw):"""Log the given message at the given log levelThis helper wraps a log call to the logger within the spider, but youcan use it directly (e.g. Spider.logger.info('msg')) or use any otherPython logger too."""self.logger.log(level, message, **kw)@classmethoddef from_crawler(cls, crawler, *args, **kwargs):spider = cls(*args, **kwargs)spider._set_crawler(crawler)return spiderdef set_crawler(self, crawler):warnings.warn("set_crawler is deprecated, instantiate and bound the ""spider to this crawler with from_crawler method ""instead.",category=ScrapyDeprecationWarning, stacklevel=2)assert not hasattr(self, 'crawler'), "Spider already bounded to a " "crawler"self._set_crawler(crawler)def _set_crawler(self, crawler):self.crawler = crawlerself.settings = crawler.settingscrawler.signals.connect(self.close, signals.spider_closed)def start_requests(self):cls = self.__class__if method_is_overridden(cls, Spider, 'make_requests_from_url'):warnings.warn("Spider.make_requests_from_url method is deprecated; it ""won't be called in future Scrapy releases. Please ""override Spider.start_requests method instead (see %s.%s)." % (cls.__module__, cls.__name__),)for url in self.start_urls:yield self.make_requests_from_url(url)else:for url in self.start_urls:yield Request(url, dont_filter=True)def make_requests_from_url(self, url):""" This method is deprecated. """return Request(url, dont_filter=True)def parse(self, response):raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))@classmethoddef update_settings(cls, settings):settings.setdict(cls.custom_settings or {}, priority='spider')@classmethoddef handles_request(cls, request):return url_is_from_spider(request.url, cls)@staticmethoddef close(spider, reason):closed = getattr(spider, 'closed', None)if callable(closed):return closed(reason)def __str__(self):return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))__repr__ = __str__BaseSpider = create_deprecated_class('BaseSpider', Spider)class ObsoleteClass(object):def __init__(self, message):self.message = messagedef __getattr__(self, name):raise AttributeError(self.message)spiders = ObsoleteClass('"from scrapy.spider import spiders" no longer works - use ''"from scrapy.spiderloader import SpiderLoader" and instantiate ''it with your project settings"'
)# Top-level imports
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.spiders.feed import XMLFeedSpider, CSVFeedSpider
from scrapy.spiders.sitemap import SitemapSpider

The parts to pay attention to are name (the spider's name), start_urls (the list of starting URLs to crawl), allowed_domains (the domains crawling is restricted to), and start_requests (the method that kicks off crawling).

name, start_urls, and allowed_domains are attributes that are already stubbed out when you generate a project; a small edit is usually all that is needed. start_requests is the method that produces the initial requests; by default it simply iterates over start_urls and yields a Request for each URL. When a site requires logging in, you can override this method. Since this is straightforward, it is not covered in detail here.
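
As a minimal sketch of that override (the domain, URLs, and form fields below are placeholders, not from any real site), a spider that logs in before crawling might look like this:

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/profile']

    def start_requests(self):
        # Request the login page first instead of iterating start_urls directly.
        yield scrapy.Request('http://example.com/login', callback=self.login)

    def login(self, response):
        # Submit hypothetical credentials through the login form.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Once logged in, go on to crawl the original start URLs.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.logger.info('Crawled %s', response.url)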

CrawlSpider

A deep crawler: based on link-extraction rules, it automatically picks up the links on a page that match the rules, requests and parses them, extracts more links from those pages, and in this way keeps crawling deeper.

Source code:

"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.See documentation in docs/topics/spiders.rst
"""import copy
import sixfrom scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spiderdef identity(x):return xclass Rule(object):def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):self.link_extractor = link_extractorself.callback = callbackself.cb_kwargs = cb_kwargs or {}self.process_links = process_linksself.process_request = process_requestif follow is None:self.follow = False if callback else Trueelse:self.follow = followclass CrawlSpider(Spider):rules = ()def __init__(self, *a, **kw):super(CrawlSpider, self).__init__(*a, **kw)self._compile_rules()def parse(self, response):return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)def parse_start_url(self, response):return []def process_results(self, response, results):return resultsdef _build_request(self, rule, link):r = Request(url=link.url, callback=self._response_downloaded)r.meta.update(rule=rule, link_text=link.text)return rdef _requests_to_follow(self, response):if not isinstance(response, HtmlResponse):returnseen = set()for n, rule in enumerate(self._rules):links = [lnk for lnk in rule.link_extractor.extract_links(response)if lnk not in seen]if links and rule.process_links:links = rule.process_links(links)for link in links:seen.add(link)r = self._build_request(n, link)yield rule.process_request(r)def _response_downloaded(self, response):rule = self._rules[response.meta['rule']]return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)def _parse_response(self, response, callback, cb_kwargs, follow=True):if callback:cb_res = callback(response, **cb_kwargs) or ()cb_res = self.process_results(response, cb_res)for requests_or_item in iterate_spider_output(cb_res):yield requests_or_itemif follow and self._follow_links:for request_or_item in self._requests_to_follow(response):yield request_or_itemdef _compile_rules(self):def get_method(method):if callable(method):return methodelif isinstance(method, six.string_types):return getattr(self, method, None)self._rules = [copy.copy(r) for r in self.rules]for rule in self._rules:rule.callback = get_method(rule.callback)rule.process_links = get_method(rule.process_links)rule.process_request = get_method(rule.process_request)@classmethoddef from_crawler(cls, crawler, *args, **kwargs):spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)spider._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)return spiderdef set_crawler(self, crawler):super(CrawlSpider, self).set_crawler(crawler)self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

CrawlSpider inherits from Spider, so it has the same common attributes and methods, plus a new rules attribute (the collection of link-extraction rules). The important difference is that CrawlSpider implements the parse method internally, so you must not use that name for your own callbacks.

def parse(self, response):
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

It also provides an overridable method:

  • parse_start_url(response)
  • Called when the responses for the start_urls requests come back. This method processes those initial responses and must return an Item object, a Request object, or an iterable containing either (a short sketch follows below).
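
A minimal sketch of overriding parse_start_url (the spider name, URL, and selector are illustrative only, not from the original post):

from scrapy.spiders import CrawlSpider


class StartUrlSpider(CrawlSpider):
    name = 'start_url_example'
    start_urls = ['http://example.com/']

    def parse_start_url(self, response):
        # Receives the responses for start_urls; may yield items and/or requests.
        yield {'start_page_title': response.css('title::text').get()}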

rules

rules contains one or more Rule objects; each Rule defines a specific behaviour for crawling the site. If several rules match the same link, the first one, in the order the rules are defined in this collection, is used.

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

  • link_extractor: a Link Extractor object that defines which links should be extracted (Link Extractor objects are described below).
  • callback: for every link obtained from link_extractor, the value given here is used as the callback function; the callback receives a response as its first argument.
    Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so overriding parse will break the spider.
  • follow: a boolean specifying whether links extracted from responses matched by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
  • process_links: the name of a function in the spider that is called with the list of links extracted by link_extractor; it is mainly used for filtering (see the sketch after this list).
  • process_request: the name of a function in the spider that is called for every request extracted by this rule (used to filter requests).
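
As a sketch of the process_links hook (all names below are illustrative, not from the original post), a rule that filters out unwanted links before they are scheduled could look like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FilteredSpider(CrawlSpider):
    name = 'filtered_example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'/article/'),
             callback='parse_article',
             process_links='drop_logout_links',
             follow=True),
    )

    def drop_logout_links(self, links):
        # Called with the list of links the LinkExtractor found; return the ones to keep.
        return [link for link in links if 'logout' not in link.url]

    def parse_article(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}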

LinkExtractors

class scrapy.linkextractors.LinkExtractor

The purpose of Link Extractors is simple: to extract links.

Each LinkExtractor has a single public method, extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects.

A Link Extractor is instantiated once, and its extract_links method is then called multiple times, with different responses, to extract links.
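
A minimal sketch of using a LinkExtractor by hand (the URL and HTML below are made up):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b'<html><body><a href="/items/1">Item 1</a> <a href="/about">About</a></body></html>'
response = HtmlResponse(url='http://example.com/', body=html, encoding='utf-8')

extractor = LinkExtractor(allow=r'/items/')   # instantiated once
links = extractor.extract_links(response)     # a list of scrapy.link.Link objects
print([link.url for link in links])           # ['http://example.com/items/1']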

class scrapy.linkextractors.LinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

Main parameters:

  • allow: URLs matching the regular expression(s) given here are extracted; if empty, everything matches.
  • deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
  • allow_domains: only links under these domains are extracted.
  • deny_domains: links under these domains are never extracted.
  • restrict_xpaths: XPath expressions that, together with allow, restrict the regions of the page from which links are extracted.

Example

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'Test'
    allowed_domains = ['Test.com']
    start_urls = ['http://Test.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_test', follow=True),
    )

    def parse_test(self, response):
        items = {}
        # ... populate items from the response ...
        return items

RedisSpider and RedisCrawlSpider

Scrapy-redis provides the following four components:

  • Scheduler
  • Duplication Filter (request de-duplication)
  • Item Pipeline
  • Base Spider (spider base classes)

Scheduler

The part of Scrapy that deals directly with the queue of pending requests is the Scheduler: it enqueues new requests (into the Scrapy queue) and pops the next request to be crawled (out of the Scrapy queue). It organizes the pending requests into a dictionary of queues keyed by priority, for example:

{
    priority 0: queue 0,
    priority 1: queue 1,
    priority 2: queue 2,
}

A request's priority then determines which queue it is pushed into, and when dequeuing, queues with smaller priority values are popped first. Managing this dictionary of per-priority queues requires a set of methods on the Scheduler. The stock Scheduler, however, keeps its queues in the memory of a single process, so it cannot be shared by a distributed crawl; scrapy-redis therefore replaces it with its own scheduler component.
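
As a toy illustration of the priority-bucket idea (this is not Scrapy's actual implementation, just a sketch):

from collections import deque

# priority -> FIFO queue of pending requests
queues = {}

def push(request, priority=0):
    # Enqueue the request into the bucket for its priority.
    queues.setdefault(priority, deque()).append(request)

def pop():
    # Dequeue from the lowest-numbered (highest) priority bucket first.
    for priority in sorted(queues):
        if queues[priority]:
            return queues[priority].popleft()
    return None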

Duplication Filter

Scrapy implements request de-duplication with a set: the fingerprint of every request that has already been sent is stored in a set, and the fingerprint of each new request is checked against it. If the fingerprint is already in the set, the request has been sent before; if not, processing continues. The core duplicate check is implemented like this:

def request_seen(self, request):
    # self.fingerprints is the set of fingerprints of requests already seen
    fp = self.request_fingerprint(request)
    # the core duplicate check
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)

In scrapy-redis, de-duplication is handled by the Duplication Filter component, which neatly implements the filter on top of the no-duplicates property of a Redis set. The scrapy-redis scheduler receives requests from the engine, stores each request's fingerprint in a Redis set to check for duplicates, and pushes the requests that are not duplicates into the Redis request queue.

When the engine asks for a request (on behalf of the spider), the scheduler pops a request from the Redis request queue according to priority and returns it to the engine, which hands it to the spider for processing.

Item Pipeline

The engine passes the scraped items (returned by the spider) to the Item Pipeline; the scrapy-redis Item Pipeline stores the scraped items in the Redis items queue.

With this modified Item Pipeline it is easy to pull items out of the items queue by key, which makes it possible to run a separate cluster of item-processing workers.
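
A minimal sketch of the settings that switch a project over to these scrapy-redis components (the setting names are the ones scrapy-redis documents; the Redis URL is a placeholder):

# settings.py (sketch)

# Use the scrapy-redis scheduler so the request queue lives in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the Redis-backed duplication filter instead of the in-memory set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Redis queues between runs so crawls can be paused and resumed.
SCHEDULER_PERSIST = True

# Store scraped items in a Redis list ("<spider>:items") for separate workers.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Where the shared Redis instance lives (placeholder URL).
REDIS_URL = 'redis://localhost:6379'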

Base Spider

scrapy-redis no longer uses Scrapy's original Spider class directly. The rewritten RedisSpider inherits from both Spider and RedisMixin; RedisMixin is the class responsible for reading URLs from redis.

When we define a spider that inherits from RedisSpider, the setup_redis function is called; it connects to the redis database and then registers signals:

  • One for when the spider is idle: the spider_idle handler is called, which in turn calls schedule_next_requests to keep the spider alive, and then raises the DontCloseSpider exception.
  • One for when an item is scraped: the item_scraped handler is called, which also calls schedule_next_requests to fetch the next request.
from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.spiders import Spider, CrawlSpider

from . import connection, defaults
from .utils import bytes_to_str


class RedisMixin(object):
    """Mixin class to implement reading urls from a redis queue."""
    redis_key = None
    redis_batch_size = None
    redis_encoding = None

    # Redis client placeholder.
    server = None

    def start_requests(self):
        """Returns a batch of start requests from redis."""
        return self.next_requests()

    def setup_redis(self, crawler=None):
        """Setup redis connection and idle signal.

        This should be called after the spider has set its crawler object.
        """
        if self.server is not None:
            return

        if crawler is None:
            # We allow optional crawler argument to keep backwards
            # compatibility.
            # XXX: Raise a deprecation warning.
            crawler = getattr(self, 'crawler', None)

        if crawler is None:
            raise ValueError("crawler is required")

        settings = crawler.settings

        if self.redis_key is None:
            self.redis_key = settings.get(
                'REDIS_START_URLS_KEY', defaults.START_URLS_KEY,
            )

        self.redis_key = self.redis_key % {'name': self.name}

        if not self.redis_key.strip():
            raise ValueError("redis_key must not be empty")

        if self.redis_batch_size is None:
            # TODO: Deprecate this setting (REDIS_START_URLS_BATCH_SIZE).
            self.redis_batch_size = settings.getint(
                'REDIS_START_URLS_BATCH_SIZE',
                settings.getint('CONCURRENT_REQUESTS'),
            )

        try:
            self.redis_batch_size = int(self.redis_batch_size)
        except (TypeError, ValueError):
            raise ValueError("redis_batch_size must be an integer")

        if self.redis_encoding is None:
            self.redis_encoding = settings.get('REDIS_ENCODING', defaults.REDIS_ENCODING)

        self.logger.info("Reading start URLs from redis key '%(redis_key)s' "
                         "(batch size: %(redis_batch_size)s, encoding: %(redis_encoding)s",
                         self.__dict__)

        self.server = connection.from_settings(crawler.settings)
        # The idle signal is called when the spider has no requests left,
        # that's when we will schedule new requests from redis queue
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def next_requests(self):
        """Returns a request to be scheduled or none."""
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET', defaults.START_URLS_AS_SET)
        fetch_one = self.server.spop if use_set else self.server.lpop
        # XXX: Do we need to use a timeout here?
        found = 0
        # TODO: Use redis pipeline execution.
        while found < self.redis_batch_size:
            data = fetch_one(self.redis_key)
            if not data:
                # Queue empty.
                break
            req = self.make_request_from_data(data)
            if req:
                yield req
                found += 1
            else:
                self.logger.debug("Request not made from data: %r", data)

        if found:
            self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this method to
        provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.
        """
        url = bytes_to_str(data, self.redis_encoding)
        return self.make_requests_from_url(url)

    def schedule_next_requests(self):
        """Schedules a request if available"""
        # TODO: While there is capacity, schedule a batch of redis requests.
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)

    def spider_idle(self):
        """Schedules a request if available, otherwise waits."""
        # XXX: Handle a sentinel to close the spider.
        self.schedule_next_requests()
        raise DontCloseSpider


class RedisSpider(RedisMixin, Spider):
    """Spider that reads urls from redis queue when idle.

    Attributes
    ----------
    redis_key : str (default: REDIS_START_URLS_KEY)
        Redis key where to fetch start URLs from.
    redis_batch_size : int (default: CONCURRENT_REQUESTS)
        Number of messages to fetch from redis on each attempt.
    redis_encoding : str (default: REDIS_ENCODING)
        Encoding to use when decoding messages from redis queue.

    Settings
    --------
    REDIS_START_URLS_KEY : str (default: "<spider.name>:start_urls")
        Default Redis key where to fetch start URLs from.
    REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
        Default number of messages to fetch from redis on each attempt.
    REDIS_START_URLS_AS_SET : bool (default: False)
        Use SET operations to retrieve messages from the redis queue. If False,
        the messages are retrieve using the LPOP command.
    REDIS_ENCODING : str (default: "utf-8")
        Default encoding to use when decoding messages from redis queue.
    """

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj


class RedisCrawlSpider(RedisMixin, CrawlSpider):
    """Spider that reads urls from redis queue when idle.

    Attributes
    ----------
    redis_key : str (default: REDIS_START_URLS_KEY)
        Redis key where to fetch start URLs from.
    redis_batch_size : int (default: CONCURRENT_REQUESTS)
        Number of messages to fetch from redis on each attempt.
    redis_encoding : str (default: REDIS_ENCODING)
        Encoding to use when decoding messages from redis queue.

    Settings
    --------
    REDIS_START_URLS_KEY : str (default: "<spider.name>:start_urls")
        Default Redis key where to fetch start URLs from.
    REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
        Default number of messages to fetch from redis on each attempt.
    REDIS_START_URLS_AS_SET : bool (default: True)
        Use SET operations to retrieve messages from the redis queue.
    REDIS_ENCODING : str (default: "utf-8")
        Default encoding to use when decoding messages from redis queue.
    """

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisCrawlSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj

The scrapy_redis package provides not only RedisSpider but also RedisCrawlSpider, which combines the Redis queue with CrawlSpider's deep, rule-based crawling. The remaining Redis distributed components will be covered one by one in later posts.

These Redis distributed spiders add a redis_key attribute, which names the key in Redis that the spider works against (by default the start-URL queue "<spider name>:start_urls").

Example

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider


class TestSpider(RedisCrawlSpider):
    name = 'test'
    allowed_domains = ['www.test.com']
    redis_key = 'testspider:start_urls'

    rules = [
        # Follow the link to each page of the listing.
        Rule(link_extractor=LinkExtractor(allow=(r'/?page=\d+',))),
        # Follow each company's detail page and parse it.
        Rule(link_extractor=LinkExtractor(allow=(r'/\d+',)), callback='parse_item'),
    ]

    def parse_item(self, response):
        item = {}
        # ... populate item from the response ...
        return item
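
To start the crawl, a start URL has to be pushed onto the redis_key queue the spider listens on. A minimal sketch, assuming a local Redis instance (host, port, and URL are placeholders):

# Seed the queue the spider reads from; then run the spider with "scrapy crawl test"
# on as many machines as needed.
import redis

r = redis.StrictRedis(host='localhost', port=6379)
r.lpush('testspider:start_urls', 'http://www.test.com/')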

Further configuration is not covered here; later posts will continue to analyse some of these components in depth.
