Overview

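Core classes in urllib2:
Request: a single URL request that carries all the information about that request; it is not limited to the HTTP protocol
OpenerDirector: works together with BaseHandler, combining different handlers to process different kinds of requests
BaseHandler: the base class for everything that takes part in processing a request; every concrete handler inherits from it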

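In urllib2, one request goes through three phases: request, open, and response.
request: builds all the information the Request object needs for this request, such as the headers in the HTTP protocol
open: carries out the actual request; it takes the Request object, calls lower-level classes to perform the request, and returns a response
response: post-processes the returned response object
There is also an error-handling phase, but it is passive and is not triggered directly.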
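
The order of the three phases can be observed without touching the network. The following is a minimal sketch rather than code from urllib2 itself. It assumes Python 3, where the urllib2 machinery lives in `urllib.request`; the `demo` scheme and the `TraceHandler` class are invented purely for illustration.

```python
import io
import urllib.request
from urllib.response import addinfourl


class TraceHandler(urllib.request.BaseHandler):
    """Hypothetical handler that records the order in which phases run."""

    def __init__(self):
        self.calls = []

    def demo_request(self, req):
        # request phase: finish building the Request (e.g. add headers)
        self.calls.append("request")
        req.add_header("X-Demo", "1")
        return req

    def demo_open(self, req):
        # open phase: perform the request and return a response object;
        # here the "server" is an in-memory buffer, so no network is used
        self.calls.append("open")
        return addinfourl(io.BytesIO(b"hello"), {}, req.full_url, 200)

    def demo_response(self, req, response):
        # response phase: post-process the response
        self.calls.append("response")
        return response


handler = TraceHandler()
opener = urllib.request.OpenerDirector()
opener.add_handler(handler)

response = opener.open("demo://example/path")
body = response.read()
print(handler.calls, body)
```

The method names `demo_request`, `demo_open`, and `demo_response` are what make the opener call the handler at each phase for URLs whose scheme is `demo`.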

OpenerDirector

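Each request is actually carried out by handlers, and a single request may involve many of them. The class that ties them together is OpenerDirector: handlers of various kinds are registered with (added to) it to help process a request. Handler methods are generally named protocol_request, protocol_open, or protocol_response, which correspond to the three phases for a given protocol. The code says it best; I have added a few comments of my own.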

# Module-level imports in urllib2.py that this excerpt relies on
import bisect
import socket

class OpenerDirector:
    def __init__(self):
        # manage the individual handlers
        # all registered handlers
        self.handlers = []
        # registered methods, grouped by phase
        self.handle_open = {}
        self.handle_error = {}
        self.process_response = {}
        self.process_request = {}

    # Register a handler
    def add_handler(self, handler):
        # Check for a BaseHandler method to make sure the handler
        # derives from BaseHandler
        if not hasattr(handler, "add_parent"):
            raise TypeError("expected BaseHandler instance, got %r" %
                            type(handler))

        # Validation code omitted here: it mainly checks whether the handler
        # defines any phase methods, and sets `added` if it does

        # If validation succeeds, add_parent (a BaseHandler method) is called
        # so the handler can reach the OpenerDirector through self.parent;
        # HTTPErrorProcessor relies on this
        if added:
            # the handlers must work in a specific order, the order
            # is specified in a Handler attribute
            bisect.insort(self.handlers, handler)
            handler.add_parent(self)

    def close(self):
        # Only exists for backwards compatibility.
        pass

    # Call the method for a given protocol (kind) on each handler in a chain
    def _call_chain(self, chain, kind, meth_name, *args):
        # Handlers raise an exception if no one else should try to handle
        # the request, or return None if they can't but another handler
        # could.  Otherwise, they return the response.
        handlers = chain.get(kind, ())
        for handler in handlers:
            func = getattr(handler, meth_name)

            result = func(*args)
            if result is not None:
                return result

    # The core method: it runs the three phases of a request
    def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        # accept a URL or a Request object
        if isinstance(fullurl, basestring):
            req = Request(fullurl, data)
        else:
            req = fullurl
            if data is not None:
                req.add_data(data)

        req.timeout = timeout
        protocol = req.get_type()

        # pre-process request
        # call the request-phase method of every registered handler
        meth_name = protocol+"_request"
        for processor in self.process_request.get(protocol, []):
            meth = getattr(processor, meth_name)
            req = meth(req)

        # run the open phase
        response = self._open(req, data)

        # post-process response
        # call the response-phase method of every registered handler
        meth_name = protocol+"_response"
        for processor in self.process_response.get(protocol, []):
            meth = getattr(processor, meth_name)
            response = meth(req, response)

        return response

    # The open phase is split into three categories: default, protocol, unknown.
    # They are tried in that order; the first one that yields a result wins.
    def _open(self, req, data=None):
        result = self._call_chain(self.handle_open, 'default',
                                  'default_open', req)
        if result:
            return result

        protocol = req.get_type()
        result = self._call_chain(self.handle_open, protocol, protocol +
                                  '_open', req)
        if result:
            return result

        return self._call_chain(self.handle_open, 'unknown',
                                'unknown_open', req)

    # Error handling is passive: it calls the error methods registered
    # in handle_error
    def error(self, proto, *args):
        pass  # body omitted in this excerpt
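
To make the naming convention concrete, here is a simplified sketch of how a method name can be split into a protocol and a phase. It is not the validation code omitted from `add_handler`; the helper `classify_method` is hypothetical and only illustrates the `protocol_request|open|response` rule, plus the `protocol_error_kind` form used by error handlers such as `http_error_302`.

```python
def classify_method(method_name):
    """Return (protocol, phase, error_kind) for a handler method name,
    or None if the name does not follow the convention.

    Hypothetical helper for illustration only.
    """
    protocol, sep, condition = method_name.partition("_")
    if not sep:
        return None
    if condition in ("request", "open", "response"):
        return (protocol, condition, None)
    if condition.startswith("error_"):
        kind = condition.partition("_")[2]
        # numeric kinds are HTTP status codes, e.g. http_error_302
        return (protocol, "error", int(kind) if kind.isdigit() else kind)
    return None


for name in ("http_request", "ftp_open", "https_response",
             "http_error_302", "http_error_default", "add_parent"):
    print(name, "->", classify_method(name))
```

Names such as `add_parent` or `close` do not match any phase, so a handler that defines only those would have nothing to register.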

Handler

urllib2 ships many handlers for different kinds of requests. The common ones, HTTPHandler and FTPHandler, are easy to understand. Here I want to mention HTTPCookieProcessor and HTTPRedirectHandler.

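HTTPCookieProcessor handles cookies, which are indispensable for many requests that require authentication. In Python, cookie handling is done by the cookielib module; this handler simply calls into it, adding cookies to the request during the request phase and parsing cookies out of the response during the response phase.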
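
The two halves of that behaviour can be exercised directly, without a server. This is a sketch under stated assumptions: it uses Python 3, where `cookielib` is named `http.cookiejar` and urllib2 lives in `urllib.request`; the URLs, the cookie value, and the in-memory `fake_response` are made up for the example.

```python
import email.message
import http.cookiejar
import io
import urllib.request
from urllib.response import addinfourl

jar = http.cookiejar.CookieJar()
processor = urllib.request.HTTPCookieProcessor(jar)

# Response phase: the processor extracts Set-Cookie headers into the jar.
first = urllib.request.Request("http://example.com/login")
headers = email.message.Message()
headers["Set-Cookie"] = "session=abc123; Path=/"
fake_response = addinfourl(io.BytesIO(b""), headers, first.full_url, 200)
processor.http_response(first, fake_response)

# Request phase: the processor adds matching cookies to a later request.
second = urllib.request.Request("http://example.com/profile")
processor.http_request(second)
cookie_header = second.get_header("Cookie")
print(cookie_header)
```

In normal use the processor is not called by hand; it is passed to `build_opener`, and the opener invokes `http_request` and `http_response` on every HTTP request.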

HTTPRedirectHandler handles 30x status codes. Let's look at the source; the English comments already explain it quite clearly.

# Module-level import in urllib2.py that this excerpt relies on
import urlparse

class HTTPRedirectHandler(BaseHandler):
    # maximum number of redirections to any single URL
    # this is needed because of the state that cookies introduce
    max_repeats = 4
    # maximum total number of redirections (regardless of URL) before
    # assuming we're in a loop
    max_redirections = 10

    # Builds a new Request for the redirect target URL, carrying over
    # the headers of the current Request
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        """Return a Request or None in response to a redirect.

        This is called by the http_error_30x methods when a
        redirection response is received.  If a redirection should
        take place, return a new Request to allow http_error_30x to
        perform the redirect.  Otherwise, raise HTTPError if no-one
        else should try to handle this url.  Return None if you can't
        but another Handler might.
        """
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ("GET", "HEAD")
            or code in (301, 302, 303) and m == "POST"):
            # Strictly (according to RFC 2616), 301 or 302 in response
            # to a POST MUST NOT cause a redirection without confirmation
            # from the user (of urllib2, in this case).  In practice,
            # essentially all clients do redirect in this case, so we
            # do the same.
            # be conciliant with URIs containing a space
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k,v) for k,v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type")
                             )
            return Request(newurl,
                           headers=newheaders,
                           origin_req_host=req.get_origin_req_host(),
                           unverifiable=True)
        else:
            raise HTTPError(req.get_full_url(), code, msg, headers, fp)

    # Implementation note: To avoid the server sending us into an
    # infinite loop, the request object needs to track what URLs we
    # have already seen.  Do this by adding a handler-specific
    # attribute to the Request object.
    # Handles the 302 error
    def http_error_302(self, req, fp, code, msg, headers):
        # Some servers (incorrectly) return multiple Location headers
        # (so probably same goes for URI).  Use first header.
        # get the redirect target URL
        if 'location' in headers:
            newurl = headers.getheaders('location')[0]
        elif 'uri' in headers:
            newurl = headers.getheaders('uri')[0]
        else:
            return

        # fix a possible malformed URL
        urlparts = urlparse.urlparse(newurl)
        if not urlparts.path:
            urlparts = list(urlparts)
            urlparts[2] = "/"
        newurl = urlparse.urlunparse(urlparts)

        newurl = urlparse.urljoin(req.get_full_url(), newurl)

        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes
        # build the new request
        new = self.redirect_request(req, fp, code, msg, headers, newurl)
        if new is None:
            return

        # loop detection
        # .redirect_dict has a key url if url was previously visited.
        # Loop detection guards against redirect cycles: visited URLs are
        # recorded in redirect_dict and the number of redirects is capped
        if hasattr(req, 'redirect_dict'):
            visited = new.redirect_dict = req.redirect_dict
            if (visited.get(newurl, 0) >= self.max_repeats or
                len(visited) >= self.max_redirections):
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
        else:
            visited = new.redirect_dict = req.redirect_dict = {}
        visited[newurl] = visited.get(newurl, 0) + 1

        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()

        # fetch the content of the new URL
        return self.parent.open(new, timeout=req.timeout)

    # All 30x errors are handled by the 302 implementation
    http_error_301 = http_error_303 = http_error_307 = http_error_302

    inf_msg = "The HTTP server returned a redirect error that would " \
              "lead to an infinite loop.\n" \
              "The last 30x error message was:\n"
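
The loop-detection rule is easier to see in isolation. The sketch below reimplements only the counting logic with a plain dict; `may_follow` is a hypothetical helper, not part of urllib2, and the limits are passed in with the same defaults as `max_repeats` and `max_redirections` above.

```python
def may_follow(visited, newurl, max_repeats=4, max_redirections=10):
    """Return True and record the visit if the redirect is allowed.

    `visited` maps each redirect target to the number of times it has
    been followed, like Request.redirect_dict in the handler.
    Hypothetical helper for illustration only.
    """
    if (visited.get(newurl, 0) >= max_repeats or
            len(visited) >= max_redirections):
        return False
    visited[newurl] = visited.get(newurl, 0) + 1
    return True


# The same URL may be followed max_repeats times, then it is refused.
same_url = {}
repeat_results = [may_follow(same_url, "http://example.com/a") for _ in range(5)]

# Distinct URLs are refused once max_redirections of them were seen.
distinct = {}
distinct_results = [may_follow(distinct, "http://example.com/%d" % i)
                    for i in range(11)]

print(repeat_results)
print(distinct_results)
```

Where this helper returns False, the real handler raises HTTPError with `inf_msg` prepended to the server's message.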

Error handler

Error handling deserves its own section because it is a special case: in urllib2, errors are dispatched by a handler called HTTPErrorProcessor.

class HTTPErrorProcessor(BaseHandler):
    """Process HTTP error responses."""
    handler_order = 1000  # after all other processing

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        # According to RFC 2616, "2xx" code indicates that the client's
        # request was successfully received, understood, and accepted.
        # Any status that is not 2xx is treated as an error and passed to
        # OpenerDirector.error, which dispatches it to the matching
        # handler method
        if not (200 <= code < 300):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response

    https_response = http_response
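
The dispatch path from a non-2xx response to an `http_error_NNN` method can be traced end to end with an in-memory stand-in for the server. This is a sketch assuming Python 3's `urllib.request`; `FakeServer` and `NotFoundFallback` are hypothetical handlers written for this example, and no network access takes place because `FakeServer` answers the `http_open` call itself.

```python
import io
import urllib.request
from urllib.response import addinfourl


class FakeServer(urllib.request.BaseHandler):
    """Hypothetical stand-in for HTTPHandler: every request gets a 404."""

    def http_open(self, req):
        response = addinfourl(io.BytesIO(b""), {}, req.full_url, 404)
        response.msg = "Not Found"  # HTTPErrorProcessor reads response.msg
        return response


class NotFoundFallback(urllib.request.BaseHandler):
    """Hypothetical error handler that turns a 404 into a fallback page."""

    def http_error_404(self, req, fp, code, msg, hdrs):
        return addinfourl(io.BytesIO(b"fallback page"), hdrs, req.full_url, 200)


opener = urllib.request.OpenerDirector()
for handler in (FakeServer(), urllib.request.HTTPErrorProcessor(),
                NotFoundFallback()):
    opener.add_handler(handler)

# open phase -> 404; response phase -> HTTPErrorProcessor -> error()
# -> NotFoundFallback.http_error_404
response = opener.open("http://example.invalid/missing")
body = response.read()
print(response.code, body)
```

An opener built with `build_opener` also includes HTTPDefaultErrorHandler, which raises HTTPError when no more specific `http_error_NNN` method returns a response.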

urlopen,install_opener,build_opener

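These are module-level functions of urllib2. The module keeps a global variable that holds an OpenerDirector instance.
urlopen calls the open method of that OpenerDirector instance.
install_opener makes a given OpenerDirector instance the current opener.
The key one is build_opener, which decides which handlers the OpenerDirector contains.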

# Module-level import in urllib2.py that this excerpt relies on
import httplib

def build_opener(*handlers):
    """Create an opener object from a list of handlers.

    The opener will use several default handlers, including support
    for HTTP, FTP and when applicable, HTTPS.

    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, types.ClassType) or hasattr(obj, "__bases__")

    opener = OpenerDirector()
    # Handlers loaded by default;
    # if a subclass of one of them is passed in, the subclass replaces it
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = set()
    # Find the default handlers that are being replaced
    for klass in default_classes:
        for check in handlers:
            # a handler may be passed as a class or as an instance
            if isclass(check):
                if issubclass(check, klass):
                    skip.add(klass)
            elif isinstance(check, klass):
                skip.add(klass)
    # Drop the default handlers that are being replaced
    for klass in skip:
        default_classes.remove(klass)
    # Add the remaining default handlers
    for klass in default_classes:
        opener.add_handler(klass())
    # Then add the handlers that were passed in
    for h in handlers:
        # instantiate if a class was given
        if isclass(h):
            h = h()
        opener.add_handler(h)
    return opener
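
The replacement rule can be checked by inspecting the opener that `build_opener` returns. This is a sketch assuming Python 3's `urllib.request`, whose `build_opener` follows the same logic; `NoRedirect` is a hypothetical subclass written for the example.

```python
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical subclass that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "I will not handle this"; the 30x response
        # then falls through to the default error handler.
        return None


# Passing the class (not an instance) is allowed; build_opener instantiates it.
opener = urllib.request.build_opener(NoRedirect)

redirect_handlers = [h for h in opener.handlers
                     if isinstance(h, urllib.request.HTTPRedirectHandler)]
print([type(h).__name__ for h in redirect_handlers])
```

Because `NoRedirect` is a subclass of HTTPRedirectHandler, the default HTTPRedirectHandler is skipped and only the subclass ends up in the opener. Calling `urllib.request.install_opener(opener)` would then make `urlopen` use it.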

Summary

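urllib2 is clearly very extensible: the loose coupling between the opener and its handlers lets us add a handler for any other protocol. As an example, I wrote an HTTPClient class that supports file upload. It uses the upload module from https://github.com/seisen/urllib2_file, but that module conflicts with HTTPCookieProcessor, so I added two functions that switch the upload support on only when a file actually needs to be uploaded.
They can be appended to the end of urllib2_file.py: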

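def install_FHandler():
    # Swap in the upload-capable handler from urllib2_file and reset the
    # cached global opener so the next urlopen call rebuilds it
    urllib2._old_HTTPHandler = urllib2.HTTPHandler
    urllib2.HTTPHandler = newHTTPHandler
    urllib2._opener = None

def uninstall_FHandler():
    # Restore the original HTTPHandler and reset the cached global opener
    urllib2.HTTPHandler = urllib2._old_HTTPHandler
    urllib2._opener = None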

Reposted from: http://xw2423.byr.edu.cn/blog/archives/794
