urllib2 Internals
Overview
The core classes in urllib2:
Request: a concrete URL request; it carries all the information about the request and is not limited to the HTTP protocol.
OpenerDirector: works together with BaseHandler; by composing different handlers it processes different kinds of requests.
BaseHandler: the base class for everything that takes part in processing a request; concrete handlers all inherit from it.
In urllib2 a request is split into three stages: request, open, and response.
request: gather all the information needed to build the Request object for this call, such as HTTP header fields.
open: the stage that actually performs the request; it wraps the Request object, calls lower-level classes to complete the request, and returns a response.
response: post-process the returned Response object.
There is also an error stage, but it is not triggered proactively.
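The three stages can be seen end to end in a minimal sketch. urllib2 exists only in Python 2; the sketch below uses Python 3's urllib.request, which inherited the same design, and a data: URL so no network is needed:

```python
import base64
import urllib.request

# Encode a payload into a data: URL so the example needs no network.
payload = base64.b64encode(b"hello urllib").decode("ascii")
url = "data:text/plain;base64," + payload

# request stage: build the Request object carrying all request info
req = urllib.request.Request(url)

# open stage: the opener dispatches to the handler for the URL scheme
resp = urllib.request.urlopen(req)

# response stage: consume the returned response object
body = resp.read()
print(body)  # b'hello urllib'
```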
OpenerDirector
Each request is actually carried out by concrete handlers, and a single request may involve several of them. OpenerDirector is the class that ties this together: different handlers can be registered (added) on it to help process a request. Handler methods follow the naming convention protocol_request|open|response, which maps onto the three stages for each protocol. The source follows, annotated with a few comments.
```python
class OpenerDirector:
    def __init__(self):
        # manage the individual handlers
        # all registered handlers
        self.handlers = []
        # methods registered for the individual stages
        self.handle_open = {}
        self.handle_error = {}
        self.process_response = {}
        self.process_request = {}

    # register a handler
    def add_handler(self, handler):
        # duck-type check for a BaseHandler method, ensuring the
        # handler derives from BaseHandler
        if not hasattr(handler, "add_parent"):
            raise TypeError("expected BaseHandler instance, got %r" %
                            type(handler))

        # (validation code elided: it checks that the handler defines
        # stage methods, and sets `added` when it does)

        # if the handler passes validation, add_parent (a BaseHandler
        # method) is called so the handler can reach the OpenerDirector
        # through self.parent; HTTPErrorProcessor relies on this
        if added:
            # the handlers must work in a specific order, the order
            # is specified in a Handler attribute
            bisect.insort(self.handlers, handler)
            handler.add_parent(self)

    def close(self):
        # Only exists for backwards compatibility.
        pass

    # call the methods registered for one kind of protocol in a chain
    def _call_chain(self, chain, kind, meth_name, *args):
        # Handlers raise an exception if no one else should try to handle
        # the request, or return None if they can't but another handler
        # could.  Otherwise, they return the response.
        handlers = chain.get(kind, ())
        for handler in handlers:
            func = getattr(handler, meth_name)

            result = func(*args)
            if result is not None:
                return result

    # the core method: all three stages of a request happen here
    def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        # accept a URL or a Request object
        if isinstance(fullurl, basestring):
            req = Request(fullurl, data)
        else:
            req = fullurl
            if data is not None:
                req.add_data(data)

        req.timeout = timeout
        protocol = req.get_type()

        # pre-process request
        # run the request-stage method of every registered handler
        meth_name = protocol + "_request"
        for processor in self.process_request.get(protocol, []):
            meth = getattr(processor, meth_name)
            req = meth(req)

        # the open stage
        response = self._open(req, data)

        # post-process response
        # run the response-stage method of every registered handler
        meth_name = protocol + "_response"
        for processor in self.process_response.get(protocol, []):
            meth = getattr(processor, meth_name)
            response = meth(req, response)

        return response

    # the open stage itself has three sub-kinds: default, protocol, unknown
    # they are tried in that order; the first method that exists and
    # returns a result wins
    def _open(self, req, data=None):
        result = self._call_chain(self.handle_open, 'default',
                                  'default_open', req)
        if result:
            return result

        protocol = req.get_type()
        result = self._call_chain(self.handle_open, protocol, protocol +
                                  '_open', req)
        if result:
            return result

        return self._call_chain(self.handle_open, 'unknown',
                                'unknown_open', req)

    # error handling is a passive stage: it dispatches to the error
    # methods registered in handle_error
    def error(self, proto, *args):
        # (code elided)
```
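The dispatch machinery above can be exercised with a toy handler for a made-up scheme. This is a sketch against Python 3's urllib.request, which kept the same OpenerDirector design; EchoHandler and the echo scheme are invented for illustration:

```python
import io
import urllib.request
import urllib.response

class EchoHandler(urllib.request.BaseHandler):
    # "echo_open" follows the protocol_open naming rule, so add_handler
    # registers it under handle_open["echo"] and _open dispatches to it
    def echo_open(self, req):
        body = req.full_url.encode("ascii")
        return urllib.response.addinfourl(io.BytesIO(body), {},
                                          req.full_url, code=200)

opener = urllib.request.build_opener(EchoHandler)
resp = opener.open("echo://ping")
print(resp.read())  # b'echo://ping'
```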
Handler
urllib2 ships many handlers for different kinds of requests. The common ones, such as HTTPHandler and FTPHandler, are easy to follow; two that deserve a closer look are HTTPCookieProcessor and HTTPRedirectHandler.
HTTPCookieProcessor handles cookies, which are indispensable for requests that require authentication. In Python the actual cookie logic lives in the cookielib module; this handler simply calls into it, attaching cookies to the outgoing request during the request stage and extracting them from the response during the response stage.
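A sketch of how the pieces wire together, with no network involved. In Python 3 the modules are named http.cookiejar and urllib.request:

```python
import http.cookiejar
import urllib.request

# The jar stores cookies; HTTPCookieProcessor copies them onto outgoing
# requests in http_request and harvests Set-Cookie headers from
# responses in http_response.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# With nothing harvested yet the jar is empty, so no Cookie header is
# attached to a new request.
req = urllib.request.Request("http://example.com/")
jar.add_cookie_header(req)
print(req.get_header("Cookie"))  # None
```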
HTTPRedirectHandler deals with 30x statuses. Its source is shown below; the original English comments already explain it well.
```python
class HTTPRedirectHandler(BaseHandler):
    # maximum number of redirections to any single URL
    # this is needed because of the state that cookies introduce
    max_repeats = 4
    # maximum total number of redirections (regardless of URL) before
    # assuming we're in a loop
    max_redirections = 10

    # carry the current Request's header info over to the new URL,
    # i.e. the redirect target
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        """Return a Request or None in response to a redirect.

        This is called by the http_error_30x methods when a
        redirection response is received.  If a redirection should
        take place, return a new Request to allow http_error_30x to
        perform the redirect.  Otherwise, raise HTTPError if no-one
        else should try to handle this url.  Return None if you can't
        but another Handler might.
        """
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ("GET", "HEAD")
            or code in (301, 302, 303) and m == "POST"):
            # Strictly (according to RFC 2616), 301 or 302 in response
            # to a POST MUST NOT cause a redirection without confirmation
            # from the user (of urllib2, in this case).  In practice,
            # essentially all clients do redirect in this case, so we
            # do the same.
            # be conciliant with URIs containing a space
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k, v) for k, v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type")
                              )
            return Request(newurl,
                           headers=newheaders,
                           origin_req_host=req.get_origin_req_host(),
                           unverifiable=True)
        else:
            raise HTTPError(req.get_full_url(), code, msg, headers, fp)

    # Implementation note: To avoid the server sending us into an
    # infinite loop, the request object needs to track what URLs we
    # have already seen.  Do this by adding a handler-specific
    # attribute to the Request object.
    # handle a 302 status
    def http_error_302(self, req, fp, code, msg, headers):
        # Some servers (incorrectly) return multiple Location headers
        # (so probably same goes for URI).  Use first header.
        # extract the redirect target URL
        if 'location' in headers:
            newurl = headers.getheaders('location')[0]
        elif 'uri' in headers:
            newurl = headers.getheaders('uri')[0]
        else:
            return

        # fix a possible malformed URL
        urlparts = urlparse.urlparse(newurl)
        if not urlparts.path:
            urlparts = list(urlparts)
            urlparts[2] = "/"
            newurl = urlparse.urlunparse(urlparts)

        newurl = urlparse.urljoin(req.get_full_url(), newurl)

        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes
        # build the new request
        new = self.redirect_request(req, fp, code, msg, headers, newurl)
        if new is None:
            return

        # loop detection
        # .redirect_dict has a key url if url was previously visited.
        # loop-detection machinery: record every visited URL in
        # redirect_dict and cap the number of redirections
        if hasattr(req, 'redirect_dict'):
            visited = new.redirect_dict = req.redirect_dict
            if (visited.get(newurl, 0) >= self.max_repeats or
                len(visited) >= self.max_redirections):
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
        else:
            visited = new.redirect_dict = req.redirect_dict = {}
        visited[newurl] = visited.get(newurl, 0) + 1

        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()

        # fetch the new URL's content
        return self.parent.open(new, timeout=req.timeout)

    # all 30x statuses are handled by the 302 implementation
    http_error_301 = http_error_303 = http_error_307 = http_error_302

    inf_msg = "The HTTP server returned a redirect error that would " \
              "lead to an infinite loop.\n" \
              "The last 30x error message was:\n"
```
Error handler
Error handling deserves its own discussion because of how it is wired up: in urllib2, error processing is driven by the HTTPErrorProcessor handler.
```python
class HTTPErrorProcessor(BaseHandler):
    """Process HTTP error responses."""
    handler_order = 1000  # after all other processing

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        # According to RFC 2616, "2xx" code indicates that the client's
        # request was successfully received, understood, and accepted.
        # every non-2xx status is treated as an error and dispatched
        # through OpenerDirector's error method to the matching
        # handler's error routine
        if not (200 <= code < 300):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response

    https_response = http_response
```
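The error chain usually ends in HTTPDefaultErrorHandler, which simply raises an HTTPError. What makes that workable is that HTTPError is both an exception and a response-like object; a sketch with fabricated values (Python 3's urllib.error, the successor of urllib2's HTTPError):

```python
import io
import urllib.error

# Simulate what the default error handler does with a non-2xx status.
err = urllib.error.HTTPError("http://example.com/missing", 404,
                             "Not Found", {}, io.BytesIO(b"gone"))
try:
    raise err
except urllib.error.HTTPError as e:
    print(e.code)    # 404
    print(e.read())  # b'gone'
```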
urlopen, install_opener, build_opener
These are functions of the urllib2 module itself; the module keeps a global variable holding an OpenerDirector instance.
urlopen simply calls that OpenerDirector instance's open method.
install_opener makes a given OpenerDirector instance the current opener.
The key function is build_opener: it decides which handlers the OpenerDirector will contain.
```python
def build_opener(*handlers):
    """Create an opener object from a list of handlers.

    The opener will use several default handlers, including support
    for HTTP, FTP and when applicable, HTTPS.

    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, types.ClassType) or hasattr(obj, "__bases__")

    opener = OpenerDirector()
    # the handlers loaded by default
    # if a subclass of one of these is passed in, it replaces the default
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = set()
    # find the default handlers that are being replaced
    for klass in default_classes:
        for check in handlers:
            # a handler may be passed as a class or as an instance
            if isclass(check):
                if issubclass(check, klass):
                    skip.add(klass)
            elif isinstance(check, klass):
                skip.add(klass)
    # drop the replaced handlers
    for klass in skip:
        default_classes.remove(klass)
    # register the defaults
    for klass in default_classes:
        opener.add_handler(klass())
    # then register the handlers that were passed in
    for h in handlers:
        # instantiate if necessary
        if isclass(h):
            h = h()
        opener.add_handler(h)
    return opener
```
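The subclass-replacement rule is easy to verify. A Python 3 sketch (build_opener kept the same behavior there; VerboseHTTPHandler is an invented name):

```python
import urllib.request

class VerboseHTTPHandler(urllib.request.HTTPHandler):
    """Hypothetical subclass; a real one might log every request."""

opener = urllib.request.build_opener(VerboseHTTPHandler)
classes = [type(h) for h in opener.handlers]

print(VerboseHTTPHandler in classes)          # True
print(urllib.request.HTTPHandler in classes)  # False: replaced by the subclass

# install_opener then makes this opener the one urlopen() delegates to
urllib.request.install_opener(opener)
```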
Summary
urllib2 is clearly very extensible: the loose coupling between the opener and its handlers means we can add a handler for any other protocol. As an example, the original post links to an HTTPClient class that implements file upload (download link in the original post), built on the upload module from https://github.com/seisen/urllib2_file. That module conflicts with HTTPCookieProcessor, so I added two functions so the upload handler is installed only when a file upload is actually needed.
The following can be appended to urllib2_file.py:
```python
def install_FHandler():
    urllib2._old_HTTPHandler = urllib2.HTTPHandler
    urllib2.HTTPHandler = newHTTPHandler
    urllib2._opener = None

def uninstall_FHandler():
    urllib2.HTTPHandler = urllib2._old_HTTPHandler
    urllib2._opener = None
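The same save-and-restore pattern works in Python 3, where the cached module-level opener is also held in a private `_opener` attribute (shown here only to illustrate why the snippet resets it). PatchedHTTPHandler is a hypothetical stand-in for urllib2_file's newHTTPHandler:

```python
import urllib.request

class PatchedHTTPHandler(urllib.request.HTTPHandler):
    """Hypothetical replacement handler (stands in for newHTTPHandler)."""

def install_patch():
    # remember the original, swap in the replacement, and clear the
    # cached module-level opener so urlopen() rebuilds it
    urllib.request._old_HTTPHandler = urllib.request.HTTPHandler
    urllib.request.HTTPHandler = PatchedHTTPHandler
    urllib.request._opener = None

def uninstall_patch():
    urllib.request.HTTPHandler = urllib.request._old_HTTPHandler
    urllib.request._opener = None

install_patch()
print(urllib.request.HTTPHandler is PatchedHTTPHandler)  # True
uninstall_patch()
print(urllib.request.HTTPHandler is PatchedHTTPHandler)  # False
```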
Reposted from: http://xw2423.byr.edu.cn/blog/archives/794