一. The deduplication rule component (DupeFilter)

Scrapy deduplicates requests by keeping fingerprints in a set(); in the custom filter below, those fingerprints are stored in Redis instead.

The default class to look at is: from scrapy.dupefilters import RFPDupeFilter

a. In the spider: yield Request(..., dont_filter=False) — the default, which means the request is subject to the dedup check.

b. The filter class:

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint
import redis

class XzxDupefilter(BaseDupeFilter):

    def __init__(self, key):
        self.conn = None          # Redis connection, created in open()
        self.key = key            # name of the Redis set that holds the fingerprints

    @classmethod
    def from_settings(cls, settings):
        key = settings.get('DUP_REDIS_KEY')
        return cls(key)

    def open(self):
        # Called when the spider opens; connect to the local Redis instance
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def request_seen(self, request):
        # Reduce the request to its fingerprint and try to add it to the Redis set;
        # sadd returns 0 if the member already exists, i.e. the request was seen before
        fp = request_fingerprint(request)
        added = self.conn.sadd(self.key, fp)
        return added == 0

c. Point to it in settings.py:

# Default dupefilter:
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_CLASS = 'xzx.dupfilter.XzxDupefilter'   # the custom class above

The fingerprint function is what turns each request/URL into a unique identifier:

from scrapy.utils.request import request_fingerprint
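As a quick illustration of what the fingerprint gives you (the URLs are placeholders; note that newer Scrapy versions deprecate request_fingerprint in favour of scrapy.utils.request.fingerprint):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r1 = Request(url='http://example.com/page?a=1&b=2')   # placeholder URL
r2 = Request(url='http://example.com/page?b=2&a=1')   # same resource, query params reordered

print(request_fingerprint(r1))                              # a 40-character sha1 hex digest
print(request_fingerprint(r1) == request_fingerprint(r2))   # True: the URL is canonicalized first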

Note: the scheduler contains the code that decides whether a request gets enqueued at all:

def enqueue_request(self, request):
    # dont_filter=True  -> "not request.dont_filter" is False -> the dedup check is skipped
    # dont_filter=False -> "not request.dont_filter" is True  -> request_seen() decides
    if not request.dont_filter and self.df.request_seen(request):
        return False
    # Not a duplicate (or filtering disabled): push the request onto the scheduler queue
    dqok = self._dqpush(request)
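For example, a spider that deliberately re-visits a page it has already crawled can set dont_filter=True on that request (the spider name and URLs below are hypothetical):

import scrapy

class RefreshSpider(scrapy.Spider):        # hypothetical spider, for illustration
    name = 'refresh'
    start_urls = ['http://example.com/list']

    def parse(self, response):
        # Goes through the dupefilter (dont_filter defaults to False)
        yield scrapy.Request(url='http://example.com/detail/1', callback=self.parse_detail)
        # Skips request_seen() entirely and is always scheduled again
        yield scrapy.Request(url='http://example.com/list', callback=self.parse, dont_filter=True)

    def parse_detail(self, response):
        pass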

二. The scheduler

1. Breadth-first crawling (essentially a FIFO queue)

2. Depth-first crawling (essentially a LIFO stack)

3. Priority queue (e.g. a Redis sorted set when the scheduler is backed by Redis)
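Scrapy itself crawls depth-first by default (LIFO queues); the documented way to switch to breadth-first is through the queue settings rather than by replacing the scheduler. The values below follow that recipe — treat them as a sketch to check against your Scrapy version:

# settings.py -- breadth-first (FIFO) crawling
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Defaults (depth-first, LIFO):
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'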

三. Downloader middleware

These middlewares sit between the engine and the downloader: requests the engine pulls from the scheduler pass through them on the way to the downloader, and responses pass through them on the way back.

a. What are downloader middlewares for in Scrapy?

They give you one place to pre-process every request before it is downloaded (and to post-process every response).

b. User-Agent: the built-in UserAgentMiddleware runs by default and picks up the USER_AGENT value you configure in settings.py:

class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        # e.g. USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
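If a single USER_AGENT is not enough, a common approach is a small custom downloader middleware that picks a random UA per request; the class below is a sketch with made-up names, enabled in place of the built-in middleware:

import random

class RandomUserAgentMiddleware(object):      # hypothetical custom middleware
    UA_LIST = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15',
    ]

    def process_request(self, request, spider):
        # Overwrite rather than setdefault, so every request gets a freshly picked UA
        request.headers['User-Agent'] = random.choice(self.UA_LIST)

# settings.py (adjust the module path to your own project):
# DOWNLOADER_MIDDLEWARES = {
#     'xzx.middlewares.RandomUserAgentMiddleware': 500,
#     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
# }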

c. Redirects: the built-in RedirectMiddleware, also enabled by default:

class BaseRedirectMiddleware(object):

    enabled_setting = 'REDIRECT_ENABLED'

    def __init__(self, settings):
        if not settings.getbool(self.enabled_setting):
            raise NotConfigured

        self.max_redirect_times = settings.getint('REDIRECT_MAX_TIMES')
        self.priority_adjust = settings.getint('REDIRECT_PRIORITY_ADJUST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def _redirect(self, redirected, request, spider, reason):
        ttl = request.meta.setdefault('redirect_ttl', self.max_redirect_times)
        redirects = request.meta.get('redirect_times', 0) + 1

        if ttl and redirects <= self.max_redirect_times:
            redirected.meta['redirect_times'] = redirects
            redirected.meta['redirect_ttl'] = ttl - 1
            redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + \
                [request.url]
            redirected.dont_filter = request.dont_filter
            redirected.priority = request.priority + self.priority_adjust
            logger.debug("Redirecting (%(reason)s) to %(redirected)s from %(request)s",
                         {'reason': reason, 'redirected': redirected, 'request': request},
                         extra={'spider': spider})
            return redirected
        else:
            logger.debug("Discarding %(request)s: max redirections reached",
                         {'request': request}, extra={'spider': spider})
            raise IgnoreRequest("max redirections reached")

    def _redirect_request_using_get(self, request, redirect_url):
        redirected = request.replace(url=redirect_url, method='GET', body='')
        redirected.headers.pop('Content-Type', None)
        redirected.headers.pop('Content-Length', None)
        return redirected


class RedirectMiddleware(BaseRedirectMiddleware):
    """
    Handle redirection of requests based on response status
    and meta-refresh html tag.
    """

    def process_response(self, request, response, spider):
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' not in response.headers or response.status not in allowed_status:
            return response

        location = safe_url_string(response.headers['location'])
        redirected_url = urljoin(request.url, location)

        if response.status in (301, 307, 308) or request.method == 'HEAD':
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        redirected = self._redirect_request_using_get(request, redirected_url)
        return self._redirect(redirected, request, spider, response.status)
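In day-to-day use you rarely touch RedirectMiddleware itself; you steer it per request through meta, as in this hypothetical login spider (URLs are placeholders):

import scrapy

class LoginSpider(scrapy.Spider):          # hypothetical spider, for illustration
    name = 'login'

    def start_requests(self):
        # Don't follow the redirect automatically; inspect the 301/302 response ourselves
        yield scrapy.Request(
            url='http://example.com/login',     # placeholder URL
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            callback=self.parse_redirect,
        )

    def parse_redirect(self, response):
        # The Location header is still present because the middleware returned the response untouched
        self.logger.info('redirected to %s', response.headers.get('Location'))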

d. Cookies: the built-in CookiesMiddleware, also enabled by default.

Usage: in your own spider, yield requests with meta={"cookiejar": 1} (note the lowercase key "cookiejar", which is what the middleware reads) to choose which cookie jar the middleware should use:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse, meta={"cookiejar": 1})
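The jar is not carried forward automatically; each follow-up request has to pass the cookiejar key on itself. Continuing the snippet above (the URL and callback name are made up):

def parse(self, response):
    # Re-use the jar that collected cookies from this response
    yield Request(
        url='http://example.com/profile',                     # placeholder URL
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_profile,
    )

def parse_profile(self, response):
    pass

For reference, the full CookiesMiddleware source that consumes this meta key follows.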

class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return

        # e.g. cookiejarkey = 1 when the spider yields meta={"cookiejar": 1}
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]   # CookieJar object -> an empty jar on first access
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response

    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in request.headers.getlist('Cookie')]
            if cl:
                cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
                msg = "Sending cookies to: {}\n{}".format(request, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _debug_set_cookie(self, response, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in response.headers.getlist('Set-Cookie')]
            if cl:
                cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
                msg = "Received cookies from: {}\n{}".format(response, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _format_cookie(self, cookie):
        # build cookie string
        cookie_str = '%s=%s' % (cookie['name'], cookie['value'])

        if cookie.get('path', None):
            cookie_str += '; Path=%s' % cookie['path']

        if cookie.get('domain', None):
            cookie_str += '; Domain=%s' % cookie['domain']

        return cookie_str

    def _get_request_cookies(self, jar, request):
        if isinstance(request.cookies, dict):
            cookie_list = [{'name': k, 'value': v} for k, v in \
                           six.iteritems(request.cookies)]
        else:
            cookie_list = request.cookies

        cookies = [self._format_cookie(x) for x in cookie_list]
        headers = {'Set-Cookie': cookies}
        response = Response(request.url, headers=headers)

        return jar.make_cookies(response, request)

The default downloader middlewares:

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
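Your own downloader middlewares go into DOWNLOADER_MIDDLEWARES in settings.py and are merged with the base dict above by priority number; setting a built-in entry to None disables it. A sketch (the custom module path is made up):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'xzx.middlewares.XzxProxyMiddleware': 751,                          # hypothetical custom middleware
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None, # example: disable a built-in one
}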

Things to note:

process_request usually returns None, so the request continues down the chain to the downloader.

1. If it returns a Response, the download is skipped and that response is fed into the process_response chain of the installed middlewares (starting from the side closest to the downloader).

2. If it returns a Request, the current request is dropped and the returned request goes back to the scheduler.

process_response must always return something: a Response, a Request, or raise IgnoreRequest.
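A toy middleware that exercises all three outcomes of process_request could look like the sketch below (the in-memory cache and the domain rewrite are invented purely for illustration):

from scrapy.http import HtmlResponse, Request

class ToyDownloaderMiddleware(object):     # hypothetical, for illustration only
    cache = {}                             # url -> html body, a pretend in-memory cache

    def process_request(self, request, spider):
        if request.url in self.cache:
            # Return a Response: the download is skipped and this response enters
            # the process_response chain of the installed middlewares
            return HtmlResponse(url=request.url, body=self.cache[request.url],
                                encoding='utf-8', request=request)
        if 'old-domain.example' in request.url:
            # Return a Request: the current request is dropped and the new one is rescheduled
            return Request(url=request.url.replace('old-domain.example', 'new-domain.example'),
                           callback=request.callback)
        # Return None: processing continues to the next middleware and then the downloader
        return None

    def process_response(self, request, response, spider):
        # process_response must return a Response or a Request (or raise IgnoreRequest)
        return response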

四. Spider middleware

These middlewares sit between the engine and the spider: downloaded responses pass through them on the way into the spider, and the requests/items the spider yields pass through them on the way out.

The built-ins include, among others, the depth middleware (DepthMiddleware), which also handles the priority adjustment described below.

Writing your own spider middleware:

class XzxSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

Enable it in settings.py:

SPIDER_MIDDLEWARES = {
    'xzx.middlewares.XzxSpiderMiddleware': 543,
}

Settings that drive the built-in spider middlewares:

Depth limit:

DEPTH_LIMIT = 8

Priority adjustment:

DEPTH_PRIORITY = 1    # request priorities go 0, -1, -2, -3, ... (deeper pages get lower priority)

DEPTH_PRIORITY = -1   # request priorities go 0, 1, 2, 3, ... (deeper pages get higher priority)
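Roughly speaking, DepthMiddleware tags every request the spider yields with the depth of the response that produced it, adjusts the priority by DEPTH_PRIORITY, and drops requests beyond DEPTH_LIMIT. The function below is a simplified sketch of that logic, not the exact Scrapy source:

# Simplified sketch of the per-request check inside DepthMiddleware
def depth_filter(request, response, maxdepth, prio):
    depth = response.meta.get('depth', 0) + 1
    request.meta['depth'] = depth
    if prio:
        # DEPTH_PRIORITY = 1  -> deeper requests get lower priority (breadth-first tendency)
        # DEPTH_PRIORITY = -1 -> deeper requests get higher priority (depth-first tendency)
        request.priority -= depth * prio
    if maxdepth and depth > maxdepth:
        return False    # beyond DEPTH_LIMIT: the request is dropped
    return True

The built-in spider middlewares and their default order are listed in SPIDER_MIDDLEWARES_BASE: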

SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}

Summary:

1. DupeFilter

- fingerprints are kept in an in-memory set by default

- each URL/request is reduced to a unique fingerprint

- why move the dedup set into Redis? So several crawler processes can share one record of seen requests, and the record survives restarts.

- deduplication interacts with dont_filter on each Request

2. Scheduler

- what do depth-first and breadth-first mean in a crawler?

- which data structures implement them?

- stack (LIFO) -> depth-first

- queue (FIFO) -> breadth-first

- priority queue / sorted set -> priority ordering

3. The open/closed principle:

Closed to modification of the framework source, open through the configuration file: you get the behaviour you want by changing settings (and pointing them at your own classes), not by patching Scrapy itself.
