Python scraping: a close look at Scrapy's request parameters meta, headers and cookies
Scrapy's request parameters come up all the time in practice, but I had never dug into them. Today I'll explore the three important parameters a Scrapy request carries: headers, cookies and meta.
Native parameters

First, create a myscrapy project and a my_spider spider inside it. We'll request http://httpbin.org/get to inspect the request parameters. Run the spider:
```python
# -*- coding: utf-8 -*-
from scrapy import Spider, Request


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.write_to_file("*" * 40)
        self.write_to_file("response text: %s" % response.text)
        self.write_to_file("response headers: %s" % response.headers)
        self.write_to_file("response meta: %s" % response.meta)
        self.write_to_file("request headers: %s" % response.request.headers)
        self.write_to_file("request cookies: %s" % response.request.cookies)
        self.write_to_file("request meta: %s" % response.request.meta)

    def write_to_file(self, words):
        with open("logging.log", "a") as f:
            f.write(words)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())
```
The information saved to the file:
```
response text:
{"args":{},"headers":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Encoding":"gzip,deflate","Accept-Language":"en","Connection":"close","Host":"httpbin.org","User-Agent":"Scrapy/1.5.1 (+https://scrapy.org)"},"origin":"223.72.90.254","url":"http://httpbin.org/get"}

response headers:
{b'Server': [b'gunicorn/19.8.1'], b'Date': [b'Sun, 22 Jul 2018 10:03:15 GMT'], b'Content-Type': [b'application/json'], b'Access-Control-Allow-Origin': [b'*'], b'Access-Control-Allow-Credentials': [b'true'], b'Via': [b'1.1 vegur']}

response meta:
{'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.5500118732452393}

request headers:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'], b'Accept-Encoding': [b'gzip,deflate']}

request cookies:
{}

request meta:
{'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.5500118732452393}
```
meta
Comparing the output above, the request's and the response's meta are identical: meta exists to carry information on the request and hand it over to the response.
Let's modify the code to test that hand-off:
```python
# -*- coding: utf-8 -*-
from scrapy import Spider, Request


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, meta={"uid": "this is uid of meta"})

    def parse(self, response):
        print("request meta: %s" % response.request.meta.get("uid"))
        print("response meta: %s" % response.meta.get("uid"))
```
The output:

```
request meta: this is uid of meta
response meta: this is uid of meta
```
So both ways of reading the request's meta work. meta behaves like a dict: values are looked up by key exactly as with a dict.
Proxies, too, are set through meta. Here is an example proxy middleware:
```python
import random


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxies = []  # fill in with your own proxy URLs, e.g. "http://host:port"
        proxy = random.choice(proxies)
        request.meta["proxy"] = proxy
```
headers
Open Scrapy's default_settings module, reachable via this import path:

```python
from scrapy.settings import default_settings
```

In it you'll find:
```python
USER_AGENT = 'Scrapy/%s (+https://scrapy.org)' % import_module('scrapy').__version__

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```
Let's modify the request headers and see what the server reports back:
```python
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, headers={"User-Agent": "Chrome"})

    def parse(self, response):
        logging.debug("*" * 40)
        logging.debug("response text: %s" % response.text)
        logging.debug("response headers: %s" % response.headers)
        logging.debug("request headers: %s" % response.request.headers)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())
```
The output:

```
response text:
{"args":{},"headers":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Encoding":"gzip,deflate","Accept-Language":"en","Connection":"close","Host":"httpbin.org","User-Agent":"Chrome"},"origin":"122.71.64.121","url":"http://httpbin.org/get"}

response headers:
{b'Server': [b'gunicorn/19.8.1'], b'Date': [b'Sun, 22 Jul 2018 10:29:26 GMT'], b'Content-Type': [b'application/json'], b'Access-Control-Allow-Origin': [b'*'], b'Access-Control-Allow-Credentials': [b'true'], b'Via': [b'1.1 vegur']}

request headers:
{b'User-Agent': [b'Chrome'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'Accept-Encoding': [b'gzip,deflate']}
```
Both the request headers and the headers the server received and echoed back show the new User-Agent, confirming that the default User-Agent was overridden.
default_settings also enables the UserAgentMiddleware by default:

```python
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
```
Its source:
```python
class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
```
Reading the source carefully, it does nothing more than read and set the User-Agent, so we can model our own middleware on it. Here I use the fake_useragent library to pick a random User-Agent; for details see:
https://blog.csdn.net/mouday/article/details/80476409
In middlewares.py, write the custom middleware:
```python
from fake_useragent import UserAgent


class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.chrome
        request.headers.setdefault(b'User-Agent', user_agent)
```
In settings.py, replace the default middleware with our own:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myscrapy.middlewares.UserAgentMiddleware': 500,
}
```
The output:

```
request headers:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
```
For more on Scrapy request-header settings, see my earlier article:
https://blog.csdn.net/mouday/article/details/80776030
cookies
The earlier output had no response.cookies; adding it raises:

```
AttributeError: 'TextResponse' object has no attribute 'cookies'
```

So a response carries no cookies attribute.
Let's test cookies against http://httpbin.org/cookies:
```python
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/cookies']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, cookies={"username": "pengshiyu"})

    def parse(self, response):
        logging.debug("*" * 40)
        logging.debug("response text: %s" % response.text)
        logging.debug("request headers: %s" % response.request.headers)
        logging.debug("request cookies: %s" % response.request.cookies)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())
```
The result:

```
response text:
{"cookies":{"username":"pengshiyu"}}

request headers:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'], b'Accept-Encoding': [b'gzip,deflate'], b'Cookie': [b'username=pengshiyu']}

request cookies:
{'username': 'pengshiyu'}
```
The server received my cookie value, and the request headers carried the same cookie under the Cookie key. In fact there is no separate cookie channel at all: a browser's cookies are packed into the headers and sent to the server that way. Given that, let's try putting a Cookie entry into headers directly:
```python
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, headers={"Cookie": {"username": "pengshiyu"}})
```
The result:

```
response text:
{"cookies":{}}

request headers:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.5.1 (+https://scrapy.org)'], b'Accept-Encoding': [b'gzip,deflate']}

request cookies:
{}
```
cookies is empty: the setting failed.
Let's find the cookie middleware in default_settings:

```python
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
```
```python
class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return

        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response

    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in request.headers.getlist('Cookie')]
            if cl:
                cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
                msg = "Sending cookies to: {}\n{}".format(request, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _debug_set_cookie(self, response, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in response.headers.getlist('Set-Cookie')]
            if cl:
                cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
                msg = "Received cookies from: {}\n{}".format(response, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _format_cookie(self, cookie):
        # build cookie string
        cookie_str = '%s=%s' % (cookie['name'], cookie['value'])

        if cookie.get('path', None):
            cookie_str += '; Path=%s' % cookie['path']

        if cookie.get('domain', None):
            cookie_str += '; Domain=%s' % cookie['domain']

        return cookie_str

    def _get_request_cookies(self, jar, request):
        if isinstance(request.cookies, dict):
            cookie_list = [{'name': k, 'value': v} for k, v in
                           six.iteritems(request.cookies)]
        else:
            cookie_list = request.cookies

        cookies = [self._format_cookie(x) for x in cookie_list]
        headers = {'Set-Cookie': cookies}
        response = Response(request.url, headers=headers)

        return jar.make_cookies(response, request)
```
Reading through the source, note these methods:
```python
# process_request
jar.add_cookie_header(request)            # add cookies to headers

# process_response
jar.extract_cookies(response, request)    # extract the cookies

# _debug_cookie
request.headers.getlist('Cookie')         # read cookies from headers

# _debug_set_cookie
response.headers.getlist('Set-Cookie')    # read Set-Cookie from headers
```
and these parameters:
```
# settings
COOKIES_ENABLED
COOKIES_DEBUG

# meta
dont_merge_cookies
cookiejar

# headers
Cookie
Set-Cookie
```
Using the cookie code from the start of this section (with the other header fields trimmed from the output for clarity), let's test each of these in turn.
1. COOKIES_ENABLED

COOKIES_ENABLED = True (the default)
```
response text:
{"cookies":{"username":"pengshiyu"}}

request headers:
{b'Cookie': [b'username=pengshiyu']}

request cookies:
{'username': 'pengshiyu'}
```
Everything works.
COOKIES_ENABLED = False
```
response text:
{"cookies":{}}

request headers:
{}

request cookies:
{'username': 'pengshiyu'}
```
Although request.cookies has content, it was never merged into the headers, so the server received no cookie.
Note: to see the cookies a request actually sends, look in the request's headers.
2. COOKIES_DEBUG

COOKIES_DEBUG = False (the default)
DEBUG: Crawled (200) <GET http://httpbin.org/cookies> (referer: None)
COOKIES_DEBUG = True
One extra line is logged, showing the cookie I set:
```
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://httpbin.org/cookies>
Cookie: username=pengshiyu
```
Naturally, the server still receives my cookie normally in debug mode.
3. dont_merge_cookies

Set meta={"dont_merge_cookies": True} (the default is False).
```
response text:
{"cookies":{}}

request headers:
{}

request cookies:
{'username': 'pengshiyu'}
```
The server did not receive my cookie.
4. cookiejar

Read it directly with response.request.meta.get("cookiejar"):
```
response text:
{"cookies":{"username":"pengshiyu"}}

request headers:
{b'Cookie': [b'username=pengshiyu']}

request cookies:
{'username': 'pengshiyu'}

request cookiejar:
None
```
Nothing there.
5. Cookie

Read it directly: response.request.headers.get("Cookie")

```
headers Cookie:
b'username=pengshiyu'
```

So here the cookie has already been serialized into a bytes string.
Now change the Request's cookies parameter:

```python
cookies={"username": "pengshiyu", "password": "123456"}
```
```
# response.request.headers.get("Cookie")
headers Cookie:
b'username=pengshiyu; password=123456'

# request.headers.getlist('Cookie')
headers Cookies:
[b'username=pengshiyu; password=123456']
```
Clearly the two accessors differ: one returns a bytes string, the other a list.
6. Set-Cookie

Likewise, I tried both of the following:

```python
response.headers.get("Set-Cookie")
response.headers.getlist("Set-Cookie")
```

Still nothing:

```
headers Set-Cookie: None
headers Set-Cookies: []
```
To recap, the cookie state so far looks like this:
```
request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'username=pengshiyu; password=123456'
response text: {"cookies":{"password":"123456","username":"pengshiyu"}}
response Set-Cookie: None
response Set-Cookies: []
```
7. Receiving cookies set by the server

Change the request URL to http://httpbin.org/cookies/set/key/value and turn on COOKIES_DEBUG. The debug log now shows:
```
Sending cookies to: <GET http://httpbin.org/cookies/set/key/value>
Cookie: username=pengshiyu; password=123456

Received cookies from: <302 http://httpbin.org/cookies/set/key/value>
Set-Cookie: key=value; Path=/

Redirecting (302) to <GET http://httpbin.org/cookies> from <GET http://httpbin.org/cookies/set/key/value>

Sending cookies to: <GET http://httpbin.org/cookies>
Cookie: key=value; username=pengshiyu; password=123456
```
The log shows two requests were made; watch the cookies change along the way: send -> receive -> send.
The cookies sent on the second request include the cookie the server set during the first, which means Scrapy is managing cookies between server and client for us.
The final cookie output:
```
request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'key=value; username=pengshiyu; password=123456'
response text: {"cookies":{"key":"value","password":"123456","username":"pengshiyu"}}
response Set-Cookie: None
```
request.cookies is unchanged, while request.headers.get("Cookie") has picked up the new value.
8. Receiving a server cookie with an existing key

Change the request URL to http://httpbin.org/cookies/set/username/pengpeng
```
Sending cookies to: <GET http://httpbin.org/cookies/set/username/pengpeng>
Cookie: username=pengshiyu

Received cookies from: <302 http://httpbin.org/cookies/set/username/pengpeng>
Set-Cookie: username=pengpeng; Path=/

Redirecting (302) to <GET http://httpbin.org/cookies> from <GET http://httpbin.org/cookies/set/username/pengpeng>

Sending cookies to: <GET http://httpbin.org/cookies>
Cookie: username=pengshiyu
```
Although username=pengpeng was received, the second request went out with the original cookie username=pengshiyu again.
So cookies set on the client take priority over cookies sent back by the server.
9. Disabling CookiesMiddleware

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
}
```
Request URL: http://httpbin.org/cookies
```
request cookies: {'username': 'pengshiyu'}
request cookiejar: None
request Cookie: None
response text: {"cookies":{}}
response Set-Cookie: None
response Set-Cookies: []
```
The effect is the same as COOKIES_ENABLED = False.
10. A custom cookie pool

```python
import random


class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []  # fill in with your own cookie dicts
        cookie = random.choice(cookies)
        request.cookies = cookie
```
It also needs to be registered in settings:

```python
DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.RandomCookiesMiddleware': 600,
}
```
Note that Scrapy's own CookiesMiddleware runs at priority 700. For our cookies to take effect, they must be set before that middleware processes the request; request middlewares execute in ascending priority order, so our custom cookie middleware needs a value lower than 700:

```python
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
```
Summary
Parameter | Set | Get | Purpose
---|---|---|---
meta | `Request(url, meta={"uid": "100"})` / `request.meta["uid"] = "100"` | `response.request.meta.get("uid")` / `response.meta.get("uid")` | Carries request data over to the response; also used to set proxies
headers | `Request(url, headers={"User-Agent": "chrome"})` / `request.headers["User-Agent"] = "chrome"` / `USER_AGENT = "chrome"` | `response.request.headers.get("User-Agent")` | Sets the client's request-header parameters
cookies | `Request(url, cookies={"username": "pengshiyu"})` / `request.cookies = {"username": "pengshiyu"}` | `response.request.cookies` / `response.request.headers.get("Cookie")` / `response.headers.get("Set-Cookie")` | The Cookie parameter in the client request headers; manages session identity between client and server
The commonly used middlewares:
```python
import random

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.chrome
        request.headers.setdefault(b'User-Agent', user_agent)


class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        proxies = []
        proxy = random.choice(proxies)
        request.meta["proxy"] = proxy


class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []
        cookie = random.choice(cookies)
        request.cookies = cookie
```
Of course, the cookies and proxies lists need to be filled in for your own setup.