urllib2 Internals
Overview
The core classes in urllib2:
Request: a concrete URL request; it carries all the information about the request and is not limited to the HTTP protocol.
OpenerDirector: works together with BaseHandler; by composing different handlers it processes different kinds of requests.
BaseHandler: the base class for everything that takes part in processing a request; concrete handlers all inherit from it.
In urllib2 a request is split into three stages: request, open, and response.
request: gather all the information needed to build the Request object for this call, such as HTTP header fields.
open: the stage that actually performs the request; it wraps the Request object, calls lower-level classes to complete the request, and returns a response.
response: post-process the returned Response object.
There is also an error stage, but it is not triggered proactively.
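The three stages can be seen end to end in a minimal sketch. urllib2 exists only in Python 2; the sketch below uses Python 3's urllib.request, which inherited the same design, and a data: URL so no network is needed:

```python
import base64
import urllib.request

# Encode a payload into a data: URL so the example needs no network.
payload = base64.b64encode(b"hello urllib").decode("ascii")
url = "data:text/plain;base64," + payload

# request stage: build the Request object carrying all request info
req = urllib.request.Request(url)

# open stage: the opener dispatches to the handler for the URL scheme
resp = urllib.request.urlopen(req)

# response stage: consume the returned response object
body = resp.read()
print(body)  # b'hello urllib'
```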
OpenerDirector
Each request is actually carried out by concrete handlers, and a single request may involve several of them. OpenerDirector is the class that ties this together: different handlers can be registered (added) on it to help process a request. Handler methods follow the naming convention protocol_request|open|response, which maps onto the three stages for each protocol. The source follows, annotated with a few comments.
```python
class OpenerDirector:
    def __init__(self):
        # manage the individual handlers
        # all registered handlers
        self.handlers = []
        # methods registered for the individual stages
        self.handle_open = {}
        self.handle_error = {}
        self.process_response = {}
        self.process_request = {}

    # register a handler
    def add_handler(self, handler):
        # duck-type check for a BaseHandler method, ensuring the
        # handler derives from BaseHandler
        if not hasattr(handler, "add_parent"):
            raise TypeError("expected BaseHandler instance, got %r" %
                            type(handler))

        # (validation code elided: it checks that the handler defines
        # stage methods, and sets `added` when it does)

        # if the handler passes validation, add_parent (a BaseHandler
        # method) is called so the handler can reach the OpenerDirector
        # through self.parent; HTTPErrorProcessor relies on this
        if added:
            # the handlers must work in a specific order, the order
            # is specified in a Handler attribute
            bisect.insort(self.handlers, handler)
            handler.add_parent(self)

    def close(self):
        # Only exists for backwards compatibility.
        pass

    # call the methods registered for one kind of protocol in a chain
    def _call_chain(self, chain, kind, meth_name, *args):
        # Handlers raise an exception if no one else should try to handle
        # the request, or return None if they can't but another handler
        # could.  Otherwise, they return the response.
        handlers = chain.get(kind, ())
        for handler in handlers:
            func = getattr(handler, meth_name)

            result = func(*args)
            if result is not None:
                return result

    # the core method: all three stages of a request happen here
    def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        # accept a URL or a Request object
        if isinstance(fullurl, basestring):
            req = Request(fullurl, data)
        else:
            req = fullurl
            if data is not None:
                req.add_data(data)

        req.timeout = timeout
        protocol = req.get_type()

        # pre-process request
        # run the request-stage method of every registered handler
        meth_name = protocol + "_request"
        for processor in self.process_request.get(protocol, []):
            meth = getattr(processor, meth_name)
            req = meth(req)

        # the open stage
        response = self._open(req, data)

        # post-process response
        # run the response-stage method of every registered handler
        meth_name = protocol + "_response"
        for processor in self.process_response.get(protocol, []):
            meth = getattr(processor, meth_name)
            response = meth(req, response)

        return response

    # the open stage itself has three sub-kinds: default, protocol, unknown
    # they are tried in that order; the first method that exists and
    # returns a result wins
    def _open(self, req, data=None):
        result = self._call_chain(self.handle_open, 'default',
                                  'default_open', req)
        if result:
            return result

        protocol = req.get_type()
        result = self._call_chain(self.handle_open, protocol, protocol +
                                  '_open', req)
        if result:
            return result

        return self._call_chain(self.handle_open, 'unknown',
                                'unknown_open', req)

    # error handling is a passive stage: it dispatches to the error
    # methods registered in handle_error
    def error(self, proto, *args):
        # (code elided)
```
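The dispatch machinery above can be exercised with a toy handler for a made-up scheme. This is a sketch against Python 3's urllib.request, which kept the same OpenerDirector design; EchoHandler and the echo scheme are invented for illustration:

```python
import io
import urllib.request
import urllib.response

class EchoHandler(urllib.request.BaseHandler):
    # "echo_open" follows the protocol_open naming rule, so add_handler
    # registers it under handle_open["echo"] and _open dispatches to it
    def echo_open(self, req):
        body = req.full_url.encode("ascii")
        return urllib.response.addinfourl(io.BytesIO(body), {},
                                          req.full_url, code=200)

opener = urllib.request.build_opener(EchoHandler)
resp = opener.open("echo://ping")
print(resp.read())  # b'echo://ping'
```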
Handler
urllib2 ships many handlers for different kinds of requests. The common ones, such as HTTPHandler and FTPHandler, are easy to follow; two that deserve a closer look are HTTPCookieProcessor and HTTPRedirectHandler.
HTTPCookieProcessor handles cookies, which are indispensable for requests that require authentication. In Python the actual cookie logic lives in the cookielib module; this handler simply calls into it, attaching cookies to the outgoing request during the request stage and extracting them from the response during the response stage.
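A sketch of how the pieces wire together, with no network involved. In Python 3 the modules are named http.cookiejar and urllib.request:

```python
import http.cookiejar
import urllib.request

# The jar stores cookies; HTTPCookieProcessor copies them onto outgoing
# requests in http_request and harvests Set-Cookie headers from
# responses in http_response.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# With nothing harvested yet the jar is empty, so no Cookie header is
# attached to a new request.
req = urllib.request.Request("http://example.com/")
jar.add_cookie_header(req)
print(req.get_header("Cookie"))  # None
```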
HTTPRedirectHandler deals with 30x statuses. Its source is shown below; the original English comments already explain it well.
```python
class HTTPRedirectHandler(BaseHandler):
    # maximum number of redirections to any single URL
    # this is needed because of the state that cookies introduce
    max_repeats = 4
    # maximum total number of redirections (regardless of URL) before
    # assuming we're in a loop
    max_redirections = 10

    # carry the current Request's header info over to the new URL,
    # i.e. the redirect target
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        """Return a Request or None in response to a redirect.

        This is called by the http_error_30x methods when a
        redirection response is received.  If a redirection should
        take place, return a new Request to allow http_error_30x to
        perform the redirect.  Otherwise, raise HTTPError if no-one
        else should try to handle this url.  Return None if you can't
        but another Handler might.
        """
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ("GET", "HEAD")
            or code in (301, 302, 303) and m == "POST"):
            # Strictly (according to RFC 2616), 301 or 302 in response
            # to a POST MUST NOT cause a redirection without confirmation
            # from the user (of urllib2, in this case).  In practice,
            # essentially all clients do redirect in this case, so we
            # do the same.
            # be conciliant with URIs containing a space
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k, v) for k, v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type")
                              )
            return Request(newurl,
                           headers=newheaders,
                           origin_req_host=req.get_origin_req_host(),
                           unverifiable=True)
        else:
            raise HTTPError(req.get_full_url(), code, msg, headers, fp)

    # Implementation note: To avoid the server sending us into an
    # infinite loop, the request object needs to track what URLs we
    # have already seen.  Do this by adding a handler-specific
    # attribute to the Request object.
    # handle a 302 status
    def http_error_302(self, req, fp, code, msg, headers):
        # Some servers (incorrectly) return multiple Location headers
        # (so probably same goes for URI).  Use first header.
        # extract the redirect target URL
        if 'location' in headers:
            newurl = headers.getheaders('location')[0]
        elif 'uri' in headers:
            newurl = headers.getheaders('uri')[0]
        else:
            return

        # fix a possible malformed URL
        urlparts = urlparse.urlparse(newurl)
        if not urlparts.path:
            urlparts = list(urlparts)
            urlparts[2] = "/"
            newurl = urlparse.urlunparse(urlparts)

        newurl = urlparse.urljoin(req.get_full_url(), newurl)

        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes
        # build the new request
        new = self.redirect_request(req, fp, code, msg, headers, newurl)
        if new is None:
            return

        # loop detection
        # .redirect_dict has a key url if url was previously visited.
        # loop-detection machinery: record every visited URL in
        # redirect_dict and cap the number of redirections
        if hasattr(req, 'redirect_dict'):
            visited = new.redirect_dict = req.redirect_dict
            if (visited.get(newurl, 0) >= self.max_repeats or
                len(visited) >= self.max_redirections):
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
        else:
            visited = new.redirect_dict = req.redirect_dict = {}
        visited[newurl] = visited.get(newurl, 0) + 1

        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()

        # fetch the new URL's content
        return self.parent.open(new, timeout=req.timeout)

    # all 30x statuses are handled by the 302 implementation
    http_error_301 = http_error_303 = http_error_307 = http_error_302

    inf_msg = "The HTTP server returned a redirect error that would " \
              "lead to an infinite loop.\n" \
              "The last 30x error message was:\n"
```
Error handler
Error handling deserves its own discussion because of how it is wired up: in urllib2, error processing is driven by the HTTPErrorProcessor handler.
```python
class HTTPErrorProcessor(BaseHandler):
    """Process HTTP error responses."""
    handler_order = 1000  # after all other processing

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        # According to RFC 2616, "2xx" code indicates that the client's
        # request was successfully received, understood, and accepted.
        # every non-2xx status is treated as an error and dispatched
        # through OpenerDirector's error method to the matching
        # handler's error routine
        if not (200 <= code < 300):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response

    https_response = http_response
```
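The error chain usually ends in HTTPDefaultErrorHandler, which simply raises an HTTPError. What makes that workable is that HTTPError is both an exception and a response-like object; a sketch with fabricated values (Python 3's urllib.error, the successor of urllib2's HTTPError):

```python
import io
import urllib.error

# Simulate what the default error handler does with a non-2xx status.
err = urllib.error.HTTPError("http://example.com/missing", 404,
                             "Not Found", {}, io.BytesIO(b"gone"))
try:
    raise err
except urllib.error.HTTPError as e:
    print(e.code)    # 404
    print(e.read())  # b'gone'
```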
urlopen, install_opener, build_opener
These are functions of the urllib2 module itself; the module keeps a global variable holding an OpenerDirector instance.
urlopen simply calls that OpenerDirector instance's open method.
install_opener makes a given OpenerDirector instance the current opener.
The key function is build_opener: it decides which handlers the OpenerDirector will contain.
```python
def build_opener(*handlers):
    """Create an opener object from a list of handlers.

    The opener will use several default handlers, including support
    for HTTP, FTP and when applicable, HTTPS.

    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, types.ClassType) or hasattr(obj, "__bases__")

    opener = OpenerDirector()
    # the handlers loaded by default
    # if a subclass of one of these is passed in, it replaces the default
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = set()
    # find the default handlers that are being replaced
    for klass in default_classes:
        for check in handlers:
            # a handler may be passed as a class or as an instance
            if isclass(check):
                if issubclass(check, klass):
                    skip.add(klass)
            elif isinstance(check, klass):
                skip.add(klass)
    # drop the replaced handlers
    for klass in skip:
        default_classes.remove(klass)
    # register the defaults
    for klass in default_classes:
        opener.add_handler(klass())
    # then register the handlers that were passed in
    for h in handlers:
        # instantiate if necessary
        if isclass(h):
            h = h()
        opener.add_handler(h)
    return opener
```
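The subclass-replacement rule is easy to verify. A Python 3 sketch (build_opener kept the same behavior there; VerboseHTTPHandler is an invented name):

```python
import urllib.request

class VerboseHTTPHandler(urllib.request.HTTPHandler):
    """Hypothetical subclass; a real one might log every request."""

opener = urllib.request.build_opener(VerboseHTTPHandler)
classes = [type(h) for h in opener.handlers]

print(VerboseHTTPHandler in classes)          # True
print(urllib.request.HTTPHandler in classes)  # False: replaced by the subclass

# install_opener then makes this opener the one urlopen() delegates to
urllib.request.install_opener(opener)
```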
Summary
urllib2 is clearly very extensible: the loose coupling between the opener and its handlers means we can add a handler for any other protocol. As an example, the original post links to an HTTPClient class that implements file upload (download link in the original post), built on the upload module from https://github.com/seisen/urllib2_file. That module conflicts with HTTPCookieProcessor, so I added two functions so the upload handler is installed only when a file upload is actually needed.
The following can be appended to urllib2_file.py:
```python
def install_FHandler():
    urllib2._old_HTTPHandler = urllib2.HTTPHandler
    urllib2.HTTPHandler = newHTTPHandler
    urllib2._opener = None

def uninstall_FHandler():
    urllib2.HTTPHandler = urllib2._old_HTTPHandler
    urllib2._opener = None
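The same save-and-restore pattern works in Python 3, where the cached module-level opener is also held in a private `_opener` attribute (shown here only to illustrate why the snippet resets it). PatchedHTTPHandler is a hypothetical stand-in for urllib2_file's newHTTPHandler:

```python
import urllib.request

class PatchedHTTPHandler(urllib.request.HTTPHandler):
    """Hypothetical replacement handler (stands in for newHTTPHandler)."""

def install_patch():
    # remember the original, swap in the replacement, and clear the
    # cached module-level opener so urlopen() rebuilds it
    urllib.request._old_HTTPHandler = urllib.request.HTTPHandler
    urllib.request.HTTPHandler = PatchedHTTPHandler
    urllib.request._opener = None

def uninstall_patch():
    urllib.request.HTTPHandler = urllib.request._old_HTTPHandler
    urllib.request._opener = None

install_patch()
print(urllib.request.HTTPHandler is PatchedHTTPHandler)  # True
uninstall_patch()
print(urllib.request.HTTPHandler is PatchedHTTPHandler)  # False
```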
Reposted from: http://xw2423.byr.edu.cn/blog/archives/794