
Installing the Requests library

C:\Windows\System32>pip install requests
Requirement already satisfied: requests in d:\python\python37\lib\site-packages (2.20.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in d:\python\python37\lib\site-packages (from requests) (3.0.4)
Requirement already satisfied: idna<2.8,>=2.5 in d:\python\python37\lib\site-packages (from requests) (2.7)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in d:\python\python37\lib\site-packages (from requests) (1.24)
Requirement already satisfied: certifi>=2017.4.17 in d:\python\python37\lib\site-packages (from requests) (2018.10.15)
Testing the requests library
>>> import requests
Visit Baidu:
>>> r = requests.get("http://www.baidu.com")
Check the status code:
>>> r.status_code
200
Print the page content:
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
The seven main methods of the Requests library:
requests.request()  constructs a request; the base method that underpins all of the following
requests.get()      the main method for fetching an HTML page; corresponds to HTTP GET
requests.head()     fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()     submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()      submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()    submits a partial-modification request; corresponds to HTTP PATCH
requests.delete()   submits a delete request; corresponds to HTTP DELETE
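As a quick illustration of the mapping between these methods and HTTP verbs, here is a sketch against the public echo service httpbin.org (this example is not in the original notes):

import requests

# each convenience method issues the matching HTTP verb
r = requests.get("https://httpbin.org/get")        # GET
print(r.status_code)                               # expect 200

r = requests.head("https://httpbin.org/get")       # HEAD: same headers, empty body
print(r.headers.get("Content-Type"), len(r.text))  # len(r.text) == 0

r = requests.delete("https://httpbin.org/delete")  # DELETE
print(r.status_code)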

The get() method

r = requests.get(url)
This constructs a Request object that asks the server for a resource, and returns a Response object containing the server's resource. The full signature:
requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters appended to the URL, dict or bytes, optional
**kwargs: 12 access-control parameters
The two most important objects here are Request and Response.
The Response object contains the content returned to the crawler:
>>> type(r)
<class 'requests.models.Response'>
The response is an instance of the Response class. Inspect the headers:
>>> r.headers
{'Transfer-Encoding': 'chunked', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 23 Oct 2018 08:25:38 GMT', 'Keep-Alive': 'timeout=38', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:43 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/'}
These are the header fields returned by the server. The main attributes of a Response object:
r.status_code        HTTP status of the request; 200 means success, 404 means failure
r.text               the response body as a string, i.e. the page content at the URL
r.encoding           the encoding guessed from the HTTP headers
r.apparent_encoding  the encoding inferred from the content itself (a fallback)
r.content            the response body in binary form
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9a产å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;京ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding = 'utf-8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>> r.encoding
'utf-8'
r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1.
r.apparent_encoding: the encoding inferred by analyzing the page content.
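This leads to a common idiom (a sketch restating the session above): fall back to apparent_encoding before reading r.text, or decode the raw bytes directly.

import requests

r = requests.get("http://www.baidu.com")
# the header-based guess may be wrong; prefer the content-based guess
if r.encoding == 'ISO-8859-1':
    r.encoding = r.apparent_encoding
html = r.text

# equivalent route via the raw bytes
html2 = r.content.decode(r.apparent_encoding)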

A general code framework for fetching web pages

requests.ConnectionError   network connection errors, such as DNS lookup failure or a refused connection
requests.HTTPError         HTTP error
requests.URLRequired       a valid URL is required
requests.TooManyRedirects  the maximum number of redirects was exceeded
requests.ConnectTimeout    timed out while connecting to the remote server
requests.Timeout           the request URL timed out
r.raise_for_status()       raises requests.HTTPError if the status is not 200
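The framework below uses a bare except for brevity; a variant (a sketch, not from the original notes) that catches the specific exceptions listed above looks like this:

import requests

def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raises requests.HTTPError if status != 200
        r.encoding = r.apparent_encoding
        return r.text
    except requests.Timeout:
        return "timed out"
    except requests.ConnectionError:
        return "connection failed"
    except requests.HTTPError as e:
        return "HTTP error: %s" % e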
With the http:// prefix:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Error"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

====================== RESTART: C:\Python3.7.0\test.py ======================
(The script prints the same Baidu homepage HTML shown above.)
Without the http:// prefix:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Error"

if __name__ == "__main__":
    url = "www.baidu.com"
    print(getHTMLText(url))

====================== RESTART: C:\Python3.7.0\test.py ======================
Error
  • Tests
Test 1
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

def url(url):
    try:
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        return r.status_code
    except:
        print("Error")

print(url("http://www.baidu.com"))

====================== RESTART: C:\Python3.7.0\test.py ======================
200
Test 2
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

def url(url):
    try:
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        return r.status_code
    except:
        print("Error")

print(url("www.baidu.com"))

====================== RESTART: C:\Python3.7.0\test.py ======================
Error
None
Test 3
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

def url(url):
    try:
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        return r.status_code
    except:
        url = "http://" + url
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        return r.status_code

print(url("www.baidu.com"))

====================== RESTART: C:\Python3.7.0\test.py ======================
200

The HTTP protocol

HTTP: Hypertext Transfer Protocol.
A stateless application-layer protocol based on the request/response model.
- "Request/response": the user sends a request and the server sends back the corresponding response.
- "Stateless": one request has no relation to the next.
HTTP uses the URL as the identifier for locating a network resource. URL format: http://host[:port][path]
host: a legal Internet host domain name or IP address
port: the port number; defaults to 80
path: the path of the requested resource
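For example (an illustrative URL, not from the original notes): in http://www.example.com:8080/docs/index.html, the host is www.example.com, the port is 8080, and the path is /docs/index.html.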
A URL is the Internet path for accessing a resource via HTTP; one URL corresponds to one data resource.
HTTP operations on resources:
GET     requests the resource at the URL
HEAD    requests the response message report for the resource at the URL, i.e. its header information
POST    requests that new data be appended to the resource at the URL
PUT     requests that a resource be stored at the URL, overwriting the resource already there
PATCH   requests a partial update of the resource at the URL, i.e. changes part of its content
DELETE  requests deletion of the resource stored at the URL
Suppose the URL holds a dataset UserInfo with 20 fields, including UserID and UserName.
Requirement: the user changed UserName; everything else is unchanged.
- With PATCH, only a partial update for UserName is submitted to the URL.
- With PUT, all 20 fields must be submitted together; fields that are not submitted are deleted.
The main benefit of PATCH: it saves network bandwidth.
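A minimal sketch of the difference against httpbin.org (UserInfo here is hypothetical and trimmed to two fields; httpbin simply echoes back the form fields it receives):

import requests

# PATCH: submit only the changed field
r = requests.patch("https://httpbin.org/patch", data={'UserName': 'new_name'})
print(r.json()['form'])   # {'UserName': 'new_name'}

# PUT: must resubmit every field, or the omitted ones are lost
user_info = {'UserID': '42', 'UserName': 'new_name'}  # ... plus the other 18 fields
r = requests.put("https://httpbin.org/put", data=user_info)
print(r.json()['form'])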

Using the requests library

>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Connection': 'close', 'Cache-Control': 'max-age:86400', 'Date': 'Tuesday, 23-Oct-18 17:38:31 CST', 'Expires': 'Wed, 24 Oct 2018 17:38:31 GMT', 'Keep-Alive': 'timeout=38', 'Location': 'https://httpbin.org/get', 'Content-Length': '0'}
>>> r.text
''
Test 1
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("https://httpbin.org/post", data=payload)
print(r.text)

====================== RESTART: C:\Python3.7.0\test.py ======================
{"args": {}, "data": "", "files": {}, "form": {"key1": "value1", "key2": "value2"}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Connection": "close", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.20.0"}, "json": null, "origin": "178.128.121.252", "url": "https://httpbin.org/post"
}
POSTing a dict to a URL automatically encodes it as form data (a form).
Test 2
>>> r = requests.post("https://httpbin.org/post",data = 'ABC')
>>> print(r.text)
{"args": {}, "data": "ABC", "files": {}, "form": {}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Connection": "close", "Content-Length": "3", "Host": "httpbin.org", "User-Agent": "python-requests/2.20.0"}, "json": null, "origin": "178.128.121.252", "url": "https://httpbin.org/post"
}
POSTing a string to a URL automatically encodes it as data.
An error case:
>>> r = requests.post("http://httpbin.org/post",data=payload)
>>> print(r.text)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>405 Method Not Allowed</title>
<h1>Method Not Allowed</h1>
<p>The method is not allowed for the requested URL.</p>
This endpoint uses an SSL certificate and must be accessed over https.
References:
http://docs.python-requests.org/en/latest/user/quickstart/#more-complicated-post-requests
http://docs.python-requests.org/en/latest/
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.put("https://httpbin.org/put", data=payload)
print(r.text)

====================== RESTART: C:\Python3.7.0\test.py ======================
{"args": {}, "data": "", "files": {}, "form": {"key1": "value1", "key2": "value2"}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Connection": "close", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.20.0"}, "json": null, "origin": "178.128.121.252", "url": "https://httpbin.org/put"
}

The main methods in detail

  • requests.request(method, url, **kwargs)
method: the request method, corresponding to the seven types GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS
url: the URL of the page to fetch
**kwargs: 13 access-control parameters
The seven values of method:
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)
**kwargs, the access-control parameters:
params: dict or byte sequence, added to the URL as query parameters
>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request("GET","https://httpbin.org/ws", params = kv)
>>> print(r.url)
https://httpbin.org/ws?key1=value1&key2=value2
data: dict, byte sequence, or file object, used as the content of the Request
>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request("POST","https://httpbin.org/ws", data = kv)
>>> body = 'body content'
>>> r = requests.request("POST","https://httpbin.org/ws", data = body)
json: data in JSON format, used as the content of the Request
>>> kv = {'key1': 'value1'}
>>> r = requests.request("POST","https://httpbin.org/ws", json = kv)headers:字典,HTTP定制头
>>> hd = {'user-agent':'Chrome/10'}
>>> r = requests.request('POST','https://httpbin.org/ws',headers = hd)
Simulates a particular browser version.
cookies: dict or CookieJar; the cookies in the Request
auth: tuple; supports HTTP authentication
files: dict; for transferring files
>>> fs = {'file':open('data.xls','rb')}
>>> r = requests.request('POST','https://httpbin.org/ws',files = fs)
Submits a file to a given URL.
timeout: sets the timeout, in seconds
>>> r = requests.request('GET','http://www.baidu.com',timeout = 10)
If the request has not returned within that window, a timeout exception is raised.
proxies: dict; sets proxy servers for the request, and can carry login credentials
>>> pxs = {'http':'http://user:pass@10.10.10.1:1234','https':'https://10.10.10.1:4321'}
>>> r = requests.request('GET','http://www.baidu.com',proxies = pxs)
This effectively hides the real IP address and prevents back-tracing.
Since the example proxies do not actually exist, the request above fails with:
Traceback (most recent call last):
  File "D:\Python\Python37\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "D:\Python\Python37\lib\site-packages\urllib3\util\connection.py", line 80, in create_connection
    raise err
  File "D:\Python\Python37\lib\site-packages\urllib3\util\connection.py", line 70, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "D:\Python\Python37\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "D:\Python\Python37\lib\http\client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "D:\Python\Python37\lib\http\client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "D:\Python\Python37\lib\http\client.py", line 1016, in _send_output
    self.send(msg)
  File "D:\Python\Python37\lib\http\client.py", line 956, in send
    self.connect()
  File "D:\Python\Python37\lib\site-packages\urllib3\connection.py", line 181, in connect
    conn = self._new_conn()
  File "D:\Python\Python37\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000001ABE91220F0>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Python\Python37\lib\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "D:\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\Python\Python37\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.10.1', port=1234): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001ABE91220F0>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#55>", line 1, in <module>
    r = requests.request('GET','http://www.baidu.com',proxies = pxs)
  File "D:\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Python\Python37\lib\site-packages\requests\sessions.py", line 524, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Python\Python37\lib\site-packages\requests\sessions.py", line 637, in send
    r = adapter.send(request, **kwargs)
  File "D:\Python\Python37\lib\site-packages\requests\adapters.py", line 510, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPConnectionPool(host='10.10.10.1', port=1234): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001ABE91220F0>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。')))
Advanced options:
allow_redirects: True/False, default True; toggles whether redirects are followed
stream: True/False, default False; when False the response body is downloaded immediately, when True it is fetched lazily (streamed)
verify: True/False, default True; toggles SSL certificate verification
cert: path to a local SSL certificate
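A sketch combining several of these parameters in a single call (the cookie, username, and password values are invented for illustration):

import requests

r = requests.request(
    'GET', 'https://httpbin.org/cookies',
    cookies={'session': 'abc123'},  # cookies: dict or CookieJar
    auth=('user', 'pass'),          # auth: tuple for HTTP Basic authentication
    allow_redirects=True,           # follow redirects (the default)
    verify=True,                    # verify the SSL certificate (the default)
    timeout=10,                     # seconds before a Timeout is raised
)
print(r.json())  # httpbin echoes back the cookies it received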
  • requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters appended to the URL, dict or bytes, optional
**kwargs: 12 access-control parameters
Apart from params being promoted to a named parameter, the **kwargs are the same as for requests.request().
  • requests.head(url, **kwargs)
url: the URL of the page to fetch
**kwargs: 13 access-control parameters
  • requests.post(url, data=None, json=None, **kwargs)
url: the URL of the page to update
data: dict, byte sequence, or file; the content of the Request
json: data in JSON format; the content of the Request
**kwargs: 11 access-control parameters
  • requests.put(url, data=None, **kwargs)
url: the URL of the page to update
data: dict, byte sequence, or file; the content of the Request
**kwargs: 12 access-control parameters
  • requests.patch(url, data=None, **kwargs)
url: the URL of the page to update
data: dict, byte sequence, or file; the content of the Request
**kwargs: 12 access-control parameters
  • requests.delete(url, **kwargs)
url: the URL of the page to delete
**kwargs: 13 access-control parameters

Restrictions on web crawlers

Source checking: restricting by User-Agent
- Inspect the User-Agent field in the incoming HTTP request headers and respond only to browsers and friendly crawlers.
Published notice: the Robots protocol
- Tells all crawlers the site's crawling policy and asks them to comply.

The Robots protocol

Robots Exclusion Standard
User-agent: names the crawler that the following rules apply to
Disallow: the directories that crawler is forbidden to access
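For illustration, a minimal hypothetical robots.txt built from those two fields (not taken from any real site):

User-agent: *
Disallow: /admin/
Disallow: /tmp/

User-agent: BadBot
Disallow: /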

Using the Robots protocol

Web crawlers: identify robots.txt, automatically or manually, and only then crawl the content.
Binding force: the Robots protocol is advisory rather than binding; a crawler may choose not to comply, but it then runs a legal risk.
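The identification step can be automated with the standard library; a sketch using urllib.robotparser (not part of the original notes):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()  # fetch and parse robots.txt

# check whether a given crawler may fetch a given URL under the parsed rules
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=Python"))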

Spoofing browser header fields to get around a site's serving rules

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

url = "https://www.amazon.cn/dp/B07GVXHCXH/ref=cngwdyfloorv2_recs_0/458-7185842-7809912?pf_rd_m=A1AJ19PSB66TGU&pf_rd_s=desktop-2&pf_rd_r=Q0MXZCKWT7ZHNRYVWBAS&pf_rd_r=Q0MXZCKWT7ZHNRYVWBAS&pf_rd_t=36701&pf_rd_p=d0690322-dfc8-4e93-ac2c-8e2eeacbc49e&pf_rd_p=d0690322-dfc8-4e93-ac2c-8e2eeacbc49e&pf_rd_i=desktop"
hd = {'user-agent': 'Mozilla/5.0'}
try:
    r = requests.get(url, headers=hd)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print("Error")

Submitting search keywords

Baidu keyword interface:
http://www.baidu.com/s?wd=keyword
360 keyword interface:
http://www.so.com/s?q=keyword
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

try:
    kv = {'wd': 'Python'}
    r = requests.get("http://baidu.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("Error")

====================== RESTART: C:\Python3.7.0\test.py ======================
http://www.baidu.com/s?wd=Python
390455

Crawling images from the web

National Geographic:
http://www.ngchina.com.cn/
http://www.nationalgeographic.com.cn
http://image.ngchina.com.cn/2018/1023/20181023024551199.jpg
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests

path = "abc.jpg"
url = "http://image.ngchina.com.cn/2018/1023/20181023024551199.jpg"
r = requests.get(url)
with open(path, 'wb') as f:
    f.write(r.content)  # the with block closes the file automatically
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests
import os

url = "http://image.ngchina.com.cn/2018/1023/20181023024551199.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except:
    print("Error")

====================== RESTART: C:\Python3.7.0\test.py ======================
File saved
>>>
====================== RESTART: C:\Python3.7.0\test.py ======================
File already exists

IP geolocation lookup

http://m.ip138.com/ip.asp?ip=ipaddress
import requests

ip = input()
kv = {'ip': ip}
r = requests.get("http://m.ip138.com/ip.asp", params=kv)  # pass the dict via params, not the raw string
print(r.request.url)

Failed.
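One hedged guess at the failure, not verified in the original notes: like the Amazon example above, the site may only serve pages to browser-like User-Agents, so the same header trick is worth trying:

import requests

ip = input()
kv = {'ip': ip}
hd = {'user-agent': 'Mozilla/5.0'}  # pretend to be a browser, as in the Amazon example
try:
    r = requests.get("http://m.ip138.com/ip.asp", params=kv, headers=hd, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # assumption: the location info sits near the end of the page
except:
    print("Error")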

Reposted from: https://my.oschina.net/hellopasswd/blog/2251410
