The urllib library

urllib is Python's built-in HTTP request library. It consists of four modules: request, error, parse, and robotparser.

This post mainly covers the request and parse modules.

request

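This is the HTTP request module, used mainly to simulate sending requests. Just as you type a URL into a browser and press Enter, you pass this module a few parameters and it carries out the same process.

1. urlopen()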

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

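Using this function is simple: pass it a URL and it requests the page. Here is the code:

import urllib.request


def urltest1():
    # Fetch the Python home page
    response = urllib.request.urlopen('https://www.python.org')
    print(response.read().decode('utf-8'))  # page source
    print(response.status)                  # status code of the response
    print(response.getheaders())            # all response headers
    print(response.getheader('Server'))     # the Server response header
    # The body was already consumed above, so this read() returns b'' (still bytes)
    print(type(response.read()))


if __name__ == '__main__':
    urltest1()

Running it prints the page source, the status code, the response headers, the value of the Server header, and the type of the body (bytes).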
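The same call can be sketched with a `with` block, so that the response is closed automatically. The helper name `fetch_text` and the fallback to UTF-8 when the server declares no charset are assumptions made for illustration; they are not part of the original code.

```python
import urllib.request


def fetch_text(url, timeout=10.0):
    """Return the body of `url` decoded to str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in Content-Type; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or 'utf-8'
    return raw.decode(charset)
```

Calling `fetch_text('https://www.python.org')` would return the page source as a string rather than printing it.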

The data parameter

You can also pass the data argument, which holds the data to send. When data is given, the request method is POST rather than GET. In other words, the call simulates a form submission and sends the data by POST.

import urllib.parse
import urllib.request


def urltest2():
    data = urllib.parse.urlencode({'word': 'hello'})  # dict -> query string
    data = data.encode('utf-8')                       # str -> bytes
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    print(response.read().decode('utf-8'))            # bytes -> str


if __name__ == '__main__':
    urltest2()

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "json": null,
  "origin": "119.101.46.252, 119.101.46.252",
  "url": "https://httpbin.org/post"
}

Process finished with exit code 0

As you can see, the parameter we passed shows up in the form field.
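The switch from GET to POST can be checked without sending anything, by building Request objects and asking them which method they would use. This is only a sketch: the `http://httpbin.org/get` URL is an assumption added for contrast, and no request is actually sent.

```python
import urllib.parse
import urllib.request

form = {'word': 'hello'}
body = urllib.parse.urlencode(form).encode('utf-8')

# Without data the request is a GET; with data it becomes a POST.
get_req = urllib.request.Request('http://httpbin.org/get')  # URL is an assumption
post_req = urllib.request.Request('http://httpbin.org/post', data=body)

print(body)                   # b'word=hello'
print(len(body))              # 10
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

The body is 10 bytes long, which matches the Content-Length header in the output above.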

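The timeout parameter

timeout sets the time limit in seconds. If the request times out, an exception is raised. When crawling pages you can set this limit so that a page is skipped if it does not respond for a long time. This can be done with a try/except statement.

Other parameters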
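The try/except approach might look like the sketch below. The helper name `fetch_or_skip` and the choice to return None for a skipped page are assumptions for illustration. A timeout while connecting arrives wrapped in `urllib.error.URLError`, while a timeout while waiting for the response is raised as `socket.timeout` directly, so both are handled.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=5.0):
    """Return the response body, or None if the request timed out (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except socket.timeout:
        # Raised directly when waiting for or reading the response times out.
        return None
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise  # any other error is not a timeout, so let it propagate
```

A crawler loop could call `fetch_or_skip(url, timeout=1)` for each page and simply continue when it gets None back.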

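context: specifies the SSL settings; it must be an ssl.SSLContext object

cafile: a file containing CA certificates

capath: a directory containing CA certificates

cadefault can be ignored; it is deprecated. (cafile and capath have also been deprecated in favor of context since Python 3.6, and all three were removed in Python 3.13.)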
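A minimal sketch of passing SSL settings through context follows. The helper name `open_https` and the commented-out CA bundle path are placeholders, not something taken from the original post.

```python
import ssl
import urllib.request

# Default context: verifies server certificates and host names.
context = ssl.create_default_context()
# To trust a specific CA bundle instead (the path is a placeholder):
# context = ssl.create_default_context(cafile='/path/to/ca-bundle.pem')


def open_https(url, timeout=10.0):
    """Open `url` using the explicit SSL context (hypothetical helper)."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)
```

Because the context object carries the certificate settings, it covers what cafile and capath were used for.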

The Request class

The parameters above are not enough to build a complete request. To add information such as headers, use the Request class:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

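First, what each parameter means:

url: required; the address of the page.

data: the data to send, as bytes. If the data is a dict, it can be encoded with urlencode from the urllib.parse module.

headers: the request headers.

origin_req_host: the host name or IP address of the requester.

unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it (for example, an image fetched automatically while loading a page). Defaults to False.

method: the HTTP method to use for the request.

We still send the request with urlopen, except that the argument is now a Request object.

from urllib import parse, request


def urltest():
    url = 'http://httpbin.org/post'
    dic = {'name': 'Germey'}
    data = parse.urlencode(dic).encode('utf-8')  # form data as bytes
    headers = {  # request headers
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'Host': 'httpbin.org'
    }
    # Build a Request object
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode())


if __name__ == '__main__':
    urltest()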

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
  },
  "json": null,
  "origin": "119.101.46.252, 119.101.46.252",
  "url": "https://httpbin.org/post"
}

Process finished with exit code 0
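A Request object can also be inspected and adjusted before it is sent. The sketch below uses a `PUT` method, the URL `http://httpbin.org/put`, and a shortened User-Agent string, all chosen only for illustration; nothing is sent over the network.

```python
from urllib import request

# method overrides the default that would otherwise be inferred from data.
req = request.Request('http://httpbin.org/put', data=b'name=Germey', method='PUT')
# Headers can also be added after construction.
req.add_header('User-Agent', 'Mozilla/5.0')

print(req.get_method())              # PUT
print(req.full_url)                  # http://httpbin.org/put
# Header names are stored capitalized, so look them up as 'User-agent'.
print(req.get_header('User-agent'))  # Mozilla/5.0
print(req.has_header('User-agent'))  # True
```

Using add_header is handy when the headers depend on something decided after the Request object is created.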

Handler

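The steps above are enough to build a request, but what about more advanced tasks such as handling cookies? For those there is a more advanced tool, the Handler. What is a handler? You can think of it as a tool for one particular job, such as login authentication, cookies, or proxies. With handlers we can do nearly everything an HTTP request involves.

First, the BaseHandler class is the parent class of all handlers. Some of the classes involved are:

HTTPDefaultErrorHandler: handles HTTP error responses.

HTTPRedirectHandler: handles redirects.

HTTPCookieProcessor: handles cookies.

ProxyHandler: sets a proxy.

HTTPPasswordMgr: manages passwords. It is not a handler itself; it holds the user name and password table that the authentication handlers use.

HTTPBasicAuthHandler: manages authentication. If a URL requires authentication, this handler takes care of it.

Next comes the OpenerDirector class, which is closely tied to handlers and which we will simply call an opener. urlopen is in effect an opener that urllib provides for us. When we need more advanced features than urlopen offers, we have to go one level deeper and configure things ourselves, and that means using an opener. An opener has an open method whose return type is the same as that of urlopen. In short, you build an opener out of handlers.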
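As a sketch of building an opener out of handlers, the block below combines a proxy handler with a basic-auth handler. The URL, proxy address, user name, and password are placeholder values, and the block only constructs the opener without opening anything.

```python
import urllib.request

# Placeholder values for illustration only.
url = 'http://localhost:5000/'
proxy = {'http': 'http://127.0.0.1:8080'}

# The password manager stores credentials; realm None means "any realm".
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, 'username', 'password')

auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
proxy_handler = urllib.request.ProxyHandler(proxy)

# build_opener chains the given handlers together with the default ones.
opener = urllib.request.build_opener(proxy_handler, auth_handler)

# opener.open(url) would send the request through the proxy and answer a
# 401 challenge with the stored credentials.
# urllib.request.install_opener(opener) would make urlopen() use this opener.
```

Each handler covers one concern, so adding cookie support later would only mean passing one more handler to build_opener.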

As an example, the code below fetches cookies and then uses the saved cookies to make a request:

import http.cookiejar
import urllib.request


def getcookies():
    # Save cookies in Mozilla format
    filename = 'cookies.txt'  # output path
    # cookies = http.cookiejar.CookieJar()
    cookies = http.cookiejar.MozillaCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookies)   # create a handler
    opener = urllib.request.build_opener(handler)           # build an opener from the handler
    response = opener.open('http://www.baidu.com')          # open the URL
    cookies.save(ignore_discard=True, ignore_expires=True)  # write the cookies file

    # Save cookies in LWP format
    filename1 = 'cookies1.txt'
    cookies1 = http.cookiejar.LWPCookieJar(filename1)
    handler1 = urllib.request.HTTPCookieProcessor(cookies1)  # create a handler
    opener1 = urllib.request.build_opener(handler1)          # build an opener from the handler
    response1 = opener1.open('http://www.baidu.com')
    cookies1.save(ignore_discard=True, ignore_expires=True)  # write the cookies file

    for item in cookies:  # print the cookies
        print(item.name + "=" + item.value)


def usecookies():
    # Load cookies from a file
    cookies = http.cookiejar.LWPCookieJar()
    cookies.load('cookies1.txt', ignore_expires=True, ignore_discard=True)  # read the local cookies file
    handler = urllib.request.HTTPCookieProcessor(cookies)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    print(response.read().decode('utf-8'))  # print the page source

Running getcookies saves the cookies to files in two formats and prints the cookies.

usecookies loads the locally saved cookies file, and running it prints the page source of the site.
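The save and load steps can be tried without any network access by putting a hand-built cookie into the jar. This is only a sketch: the helper `make_cookie`, the cookie name and value, the domain `example.com`, and the file name are all made up for illustration, since in real use the server sets the cookies.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'cookies_demo.txt')

    jar = http.cookiejar.LWPCookieJar(path)
    jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    # Without ignore_discard=True this session cookie would not be written.
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    restored = [(c.name, c.value) for c in loaded]

print(restored)  # [('session', 'abc123')]
```

The ignore_discard and ignore_expires flags matter on both sides: session cookies and expired cookies are skipped on save and on load unless these flags are set.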

parse

The code below shows some of the URL-handling functions in the parse module:

from urllib import parse


def parsetest():
    # urlparse: without a leading '//' the host is parsed as part of the path
    result = parse.urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
    # result is a named tuple: fields are available by index or by attribute name
    print(result, result.scheme, result[0])

    # urlunparse: the iterable must have exactly 6 items
    data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']
    print(parse.urlunparse(data))

    # urlsplit: like urlparse, but params stay inside the path
    print(parse.urlsplit('http://www.baidu.com/index.html;user?id=5#comment'))

    # urlunsplit: the iterable must have exactly 5 items
    data = ['http', 'www.baidu.com', 'index.html', 'id=5', 'comment']
    print(parse.urlunsplit(data))

    # urljoin: the base fills in whatever the second URL lacks;
    # a complete second URL is used as is
    print(parse.urljoin('www.baidu.com#comment', '?category=2'))
    print(parse.urljoin('http://www.baidu.com', 'https://www.baidu.com/about.html'))

    # urlencode: dict -> query string
    params = {'name': 'germy', 'age': 22}
    base_url = 'http://www.baidu.com?'
    url = base_url + parse.urlencode(params)
    print(url)

    # parse_qs: the inverse of urlencode
    query = 'name=germy&age=22'
    print(parse.parse_qs(query))
    # parse_qsl: like parse_qs, but returns a list of tuples
    print(parse.parse_qsl(query))

    # quote: percent-encode a non-ASCII parameter
    keyword = '中文'
    url = 'http://www.baidu.com/s?wd=' + parse.quote(keyword)
    print(url)
    # unquote: decode the percent-encoded URL back to text
    print(parse.unquote(url))


if __name__ == '__main__':
    parsetest()

Output:

ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment') https https
http://www.baidu.com/index.html;user?id=5#comment
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
http://www.baidu.com/index.html?id=5#comment
www.baidu.com?category=2
https://www.baidu.com/about.html
http://www.baidu.com?name=germy&age=22
{'name': ['germy'], 'age': ['22']}
[('name', 'germy'), ('age', '22')]
http://www.baidu.com/s?wd=%E4%B8%AD%E6%96%87
http://www.baidu.com/s?wd=中文

Process finished with exit code 0
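In the first line of the output, netloc is empty and the host ended up in path, because urlparse only recognizes a network location that starts with `//`. The sketch below shows the same URL with a leading `//`, plus a few urljoin calls against a full base URL. The base URL `http://www.baidu.com/a/b.html` is an assumption chosen for illustration.

```python
from urllib import parse

# A leading '//' marks the start of the network location.
result = parse.urlparse('//www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result.scheme)   # https
print(result.netloc)   # www.baidu.com
print(result.path)     # /index.html
print(result.params)   # user

# urljoin resolves the second argument relative to a complete base URL.
base = 'http://www.baidu.com/a/b.html'
print(parse.urljoin(base, 'c.html'))       # http://www.baidu.com/a/c.html
print(parse.urljoin(base, '/about.html'))  # http://www.baidu.com/about.html
print(parse.urljoin(base, '?category=2'))  # http://www.baidu.com/a/b.html?category=2
```

With a scheme-less base such as 'www.baidu.com#comment', urljoin has no host to work with, which is why the earlier output is just 'www.baidu.com?category=2'.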
-----------------------------------------------------------------------------

The author finds this library too cumbersome to use and usually reaches for requests instead...

That is all for this brief introduction. Over~
