Python crawlers with Scrapy: an introduction to the framework's important components
1. The duplicate-filter (dedup) component

Dedup is based on a set() of request fingerprints; the fingerprints that are kept can also be stored in Redis (as in the custom filter below).
The default class: from scrapy.dupefilter import RFPDupeFilter
a. In the spider, yield Request(..., dont_filter=False) so the request goes through the filter.
b. A custom filter class:

from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint
import redis

class XzxDupefilter(BaseDupeFilter):
    def __init__(self, key):
        self.conn = None
        self.key = key

    @classmethod
    def from_settings(cls, settings):
        key = settings.get('DUP_REDIS_KEY')
        return cls(key)

    def open(self):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def request_seen(self, request):
        fp = request_fingerprint(request)
        # sadd returns 0 when fp is already in the set, i.e. the request was seen
        added = self.conn.sadd(self.key, fp)
        return added == 0
c. Configure it in settings:

# default dupefilter
# DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_CLASS = 'xzx.dupfilter.XzxDupefilter'  # point at your own class
The function that gives each request URL a unique fingerprint:
from scrapy.utils.request import request_fingerprint
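To make the fingerprint idea concrete, here is a simplified, Scrapy-free sketch. The real request_fingerprint also canonicalizes the URL (e.g. sorts query parameters) before hashing; the function name below is invented for illustration.

```python
import hashlib

def simple_fingerprint(method, url):
    """Simplified stand-in for scrapy.utils.request.request_fingerprint:
    hash the method and URL into a fixed-length hex digest."""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    return h.hexdigest()

fp1 = simple_fingerprint("GET", "http://example.com/page?id=1")
fp2 = simple_fingerprint("GET", "http://example.com/page?id=2")
assert fp1 != fp2      # different URLs give different fingerprints
assert len(fp1) == 40  # a sha1 hex digest is always 40 characters
```

Because every fingerprint has the same fixed length, the dedup set stays compact no matter how long the URLs get.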
Supplement: a snippet in the scheduler applies the rule:

def enqueue_request(self, request):
    # dont_filter=True  -> `not request.dont_filter` is False -> dedup check skipped
    # dont_filter=False -> `not request.dont_filter` is True  -> request_seen() decides
    if not request.dont_filter and self.df.request_seen(request):
        return False
    # add the request to the scheduler's queue
    dqok = self._dqpush(request)
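The dont_filter logic above can be exercised with an in-memory stand-in for the Redis filter; the class and helper names here are invented for the sketch.

```python
class InMemoryDupeFilter:
    """Set-based stand-in for the Redis-backed filter above."""
    def __init__(self):
        self.seen = set()

    def request_seen(self, fp):
        # Mirrors sadd semantics: report whether fp was already present,
        # and record it either way.
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False

def enqueue(df, fp, dont_filter=False):
    # Same condition as the scheduler: drop only when filtering is on
    # AND the fingerprint was seen before.
    if not dont_filter and df.request_seen(fp):
        return False  # dropped as duplicate
    return True       # enqueued

df = InMemoryDupeFilter()
assert enqueue(df, "fp-a") is True                     # first time: enqueued
assert enqueue(df, "fp-a") is False                    # duplicate: dropped
assert enqueue(df, "fp-a", dont_filter=True) is True   # filter bypassed
```

Note that with dont_filter=True the `and` short-circuits, so request_seen() is never even called.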
2. The scheduler

1. Breadth-first (essentially a queue, FIFO)
2. Depth-first (essentially a stack, LIFO)
3. Priority queue (a Redis sorted set, as used by scrapy-redis)
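A minimal sketch of why a stack gives depth-first order and a queue gives breadth-first order, using a toy frontier (the page names are placeholders):

```python
from collections import deque

def crawl_order(lifo):
    """Pop order over a toy frontier: LIFO (stack) pops the newest entry
    first (depth-first), FIFO (queue) pops the oldest (breadth-first)."""
    frontier = deque(["page1", "page2", "page3"])
    order = []
    while frontier:
        order.append(frontier.pop() if lifo else frontier.popleft())
    return order

assert crawl_order(lifo=True) == ["page3", "page2", "page1"]   # stack: DFS
assert crawl_order(lifo=False) == ["page1", "page2", "page3"]  # queue: BFS
```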
3. Downloader middleware

This middleware sits between the engine and the downloader.
a. What is downloader middleware for in Scrapy?
It pre-processes every request object in one place before it is downloaded.
b. The user-agent: the built-in default middleware runs automatically and picks up the USER_AGENT you set in settings:
from scrapy import signals

class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        # e.g. USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
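Following the same setdefault pattern, a rotating user-agent middleware could look like this Scrapy-free sketch; the class name, the UA strings, and the plain-dict headers are all assumptions for illustration, not Scrapy's real API.

```python
import random

class RandomUserAgentMiddleware:
    """Sketch of a downloader middleware that rotates user agents.
    Headers are modeled as a plain dict for the example."""
    def __init__(self, user_agents):
        self.user_agents = user_agents

    def process_request(self, headers):
        # Only fill in the header if the request has not set one itself,
        # mirroring request.headers.setdefault in the real middleware.
        headers.setdefault("User-Agent", random.choice(self.user_agents))
        return headers

mw = RandomUserAgentMiddleware(["UA-1", "UA-2"])
assert mw.process_request({})["User-Agent"] in ("UA-1", "UA-2")
# An explicit per-request UA is never overwritten:
assert mw.process_request({"User-Agent": "mine"})["User-Agent"] == "mine"
```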
c. Redirects are handled by built-in default middleware:
import logging
from urllib.parse import urljoin

from w3lib.url import safe_url_string
from scrapy.exceptions import IgnoreRequest, NotConfigured

logger = logging.getLogger(__name__)

class BaseRedirectMiddleware(object):
    enabled_setting = 'REDIRECT_ENABLED'

    def __init__(self, settings):
        if not settings.getbool(self.enabled_setting):
            raise NotConfigured
        self.max_redirect_times = settings.getint('REDIRECT_MAX_TIMES')
        self.priority_adjust = settings.getint('REDIRECT_PRIORITY_ADJUST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def _redirect(self, redirected, request, spider, reason):
        ttl = request.meta.setdefault('redirect_ttl', self.max_redirect_times)
        redirects = request.meta.get('redirect_times', 0) + 1
        if ttl and redirects <= self.max_redirect_times:
            redirected.meta['redirect_times'] = redirects
            redirected.meta['redirect_ttl'] = ttl - 1
            redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + \
                [request.url]
            redirected.dont_filter = request.dont_filter
            redirected.priority = request.priority + self.priority_adjust
            logger.debug("Redirecting (%(reason)s) to %(redirected)s from %(request)s",
                         {'reason': reason, 'redirected': redirected, 'request': request},
                         extra={'spider': spider})
            return redirected
        else:
            logger.debug("Discarding %(request)s: max redirections reached",
                         {'request': request}, extra={'spider': spider})
            raise IgnoreRequest("max redirections reached")

    def _redirect_request_using_get(self, request, redirect_url):
        redirected = request.replace(url=redirect_url, method='GET', body='')
        redirected.headers.pop('Content-Type', None)
        redirected.headers.pop('Content-Length', None)
        return redirected

class RedirectMiddleware(BaseRedirectMiddleware):
    """
    Handle redirection of requests based on response status
    and meta-refresh html tag.
    """
    def process_response(self, request, response, spider):
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' not in response.headers or response.status not in allowed_status:
            return response

        location = safe_url_string(response.headers['location'])
        redirected_url = urljoin(request.url, location)

        if response.status in (301, 307, 308) or request.method == 'HEAD':
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        redirected = self._redirect_request_using_get(request, redirected_url)
        return self._redirect(redirected, request, spider, response.status)
d. Cookies are handled by a built-in middleware that runs by default.
Usage: in your own spider logic, yield requests with meta={"cookiejar": 1} (lowercase key, matching what the middleware reads):

def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse, meta={"cookiejar": 1})
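The key point is that the middleware keeps one CookieJar per meta["cookiejar"] value in a defaultdict, so distinct keys get fully isolated sessions. A small sketch of that mechanism with the stdlib CookieJar (the account names are placeholders):

```python
from collections import defaultdict
from http.cookiejar import CookieJar

# One jar per key, created lazily on first access, like self.jars
# in the middleware below.
jars = defaultdict(CookieJar)

jar_a = jars["account_a"]
jar_b = jars["account_b"]

assert jar_a is not jar_b          # different keys: independent sessions
assert jars["account_a"] is jar_a  # same key: same jar on every request
assert len(jars) == 2
```

This is how a single spider can log in to the same site under several accounts at once: one cookiejar key per account.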
import logging
from collections import defaultdict

import six
from scrapy.exceptions import NotConfigured
from scrapy.http import Response
from scrapy.http.cookies import CookieJar
from scrapy.utils.python import to_native_str

logger = logging.getLogger(__name__)

class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return
        # e.g. cookiejarkey = 1
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]  # CookieJar object, initially empty
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response

    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in request.headers.getlist('Cookie')]
            if cl:
                cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
                msg = "Sending cookies to: {}\n{}".format(request, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _debug_set_cookie(self, response, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in response.headers.getlist('Set-Cookie')]
            if cl:
                cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
                msg = "Received cookies from: {}\n{}".format(response, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _format_cookie(self, cookie):
        # build cookie string
        cookie_str = '%s=%s' % (cookie['name'], cookie['value'])

        if cookie.get('path', None):
            cookie_str += '; Path=%s' % cookie['path']

        if cookie.get('domain', None):
            cookie_str += '; Domain=%s' % cookie['domain']

        return cookie_str

    def _get_request_cookies(self, jar, request):
        if isinstance(request.cookies, dict):
            cookie_list = [{'name': k, 'value': v} for k, v in
                           six.iteritems(request.cookies)]
        else:
            cookie_list = request.cookies

        cookies = [self._format_cookie(x) for x in cookie_list]
        headers = {'Set-Cookie': cookies}
        response = Response(request.url, headers=headers)

        return jar.make_cookies(response, request)
Default downloader middlewares:

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
Notes:
process_request normally returns nothing (None), and the request moves on to the next middleware.
1. If it returns a Response, the download is skipped and the engine starts calling process_response from the last middleware backwards.
2. If it returns a Request, that request goes straight back to the scheduler.
process_response must have a return value (a Response or a Request, or it must raise IgnoreRequest).
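A toy simulation of these return-value rules; the engine loop and all names here are illustrative, not Scrapy's real code.

```python
def run_request_chain(middlewares, request):
    """Walk the process_request chain. Middlewares return None to pass
    through, ("response", r) to short-circuit, or ("request", r) to
    send a new request back to the scheduler."""
    for mw in middlewares:
        result = mw(request)
        if result is None:
            continue  # on to the next middleware, then the downloader
        kind, value = result
        if kind == "response":
            return ("short-circuit", value)  # download skipped entirely
        if kind == "request":
            return ("reschedule", value)     # back to the scheduler
    return ("download", request)             # nobody intervened

passthrough = lambda req: None                    # the normal case
cached = lambda req: ("response", "cached body")  # e.g. a cache hit
retry = lambda req: ("request", "new-req")        # e.g. a rewritten request

assert run_request_chain([passthrough], "req") == ("download", "req")
assert run_request_chain([passthrough, cached], "req") == ("short-circuit", "cached body")
assert run_request_chain([retry], "req") == ("reschedule", "new-req")
```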
4. Spider middleware

Spider middleware sits between the downloader/engine and the spider: responses pass through it on the way in, and items/requests on the way out.
The defaults include, among others, the depth middleware (which also handles request priority).
Writing your own middleware:
class XzxSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r
Configuration:

SPIDER_MIDDLEWARES = {
    'xzx.middlewares.XzxSpiderMiddleware': 543,
}
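Since process_spider_output is just a generator transform, a Scrapy-free sketch of a filtering hook might look like this; the title-based rule and the plain dicts are invented for illustration.

```python
def process_spider_output(result):
    """Sketch of a spider middleware hook: drop items with an empty
    title and pass everything else through unchanged."""
    for item in result:
        if isinstance(item, dict) and not item.get("title"):
            continue  # filtered out before reaching the pipelines
        yield item

scraped = [{"title": "ok"}, {"title": ""}, {"title": "also ok"}]
kept = list(process_spider_output(scraped))
assert kept == [{"title": "ok"}, {"title": "also ok"}]
```

Because the hook yields rather than building a list, it handles arbitrarily large result streams without buffering them.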
Settings for the built-in spider middleware:
Depth limit:
    DEPTH_LIMIT = 8
Priority:
    DEPTH_PRIORITY = 1   # request priorities go 0, -1, -2, -3, ... (breadth-first)
    DEPTH_PRIORITY = -1  # request priorities go 0, 1, 2, 3, ... (depth-first)
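The priority effect comes from the depth middleware, which shifts each request's priority by depth * DEPTH_PRIORITY below the default of 0. A small sketch of the arithmetic:

```python
def adjust_priority(depth, depth_priority):
    """How the depth middleware derives a request's priority:
    start from the default 0 and subtract depth * DEPTH_PRIORITY."""
    return 0 - depth * depth_priority

# DEPTH_PRIORITY = 1: deeper requests get LOWER priority -> breadth-first
assert [adjust_priority(d, 1) for d in range(4)] == [0, -1, -2, -3]
# DEPTH_PRIORITY = -1: deeper requests get HIGHER priority -> depth-first
assert [adjust_priority(d, -1) for d in range(4)] == [0, 1, 2, 3]
```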
SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}
Summary:
1. DupeFilter
   - fingerprints are kept in a set by default
   - each URL is reduced to a unique fingerprint
   - why put the dedup data in Redis? So multiple crawler processes can share one filter.
   - dedup works together with dont_filter
2. Scheduler
   - what do depth-first and breadth-first mean for a crawler?
   - what can implement them?
     - a stack (depth-first)
     - a queue (breadth-first)
     - a priority set (Redis sorted set)
3. The open-closed principle:
   Closed to modification of the source, open to configuration: change the settings file to get the behavior you want.