Web Scraping Notes 2: Using requests and the HTTP Protocol

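The previous post listed the five steps of writing a crawler. Steps 1 and 2 cover analyzing the requirements and finding the website, so this post records the third step:

Step 3: download the content the website returns

In other words: how do we fetch a page's HTML and related information from a program?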

Environment:
Anaconda
PyCharm

import requests      # import the requests module
url = 'https://www.baidu.com'       # use Baidu as the example URL
# send a GET request and receive the response for this URL
response = requests.get(url)
print(response)

Output: <Response [200]>

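If the URL is reachable and returns its HTML normally, the status code is 200.

The status code tells you whether the request succeeded. The most common codes are:
200: OK. The request was sent and the response was received successfully.
400: Bad Request. The server received the request but could not understand its details, so it cannot process it.
404: Not Found. The server has nothing at that URL, for example because the target page or API was moved or updated without keeping backward compatibility.
500: Internal Server Error. Something failed fatally on the server side and the service did not catch the error.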
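
The status codes above can be looked up and grouped with Python's standard library. The sketch below is only an illustration: `describe_status` is a hypothetical helper name, and the grouping into "success", "client error" and "server error" follows the usual 2xx/4xx/5xx ranges.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard phrase, and the range it belongs to."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # not a status code the standard library knows
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 400, 404, 500):
    print(describe_status(code))
```

With requests, the integer to pass in is available as `response.status_code`.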

print(response.text)     # the HTML of the response, as a str

Output: <!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç™¾åº¦ä¸€ä¸‹ class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>åœ°å›¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§†é¢‘</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç™»å½•</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç™»å½•</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ›´å¤šäº§å“</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å
³äºŽç™¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç”¨ç™¾åº¦å‰å¿
è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§åé¦ˆ</a>&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

At this point you will notice that the returned HTML contains a lot of garbled characters.
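
A likely cause, stated here as an assumption about this particular response: when a text response's Content-Type header names no charset, requests falls back to ISO-8859-1, so the page's UTF-8 bytes get decoded with the wrong codec. The sketch below reproduces the effect with plain strings and needs no network access.

```python
original = "百度一下，你就知道"          # the real page title
raw_bytes = original.encode("utf-8")    # the bytes a UTF-8 page would send

# Decoding UTF-8 bytes with the wrong codec yields garbled text.
garbled = raw_bytes.decode("iso-8859-1")
# Decoding with the codec the page actually uses restores the text.
restored = raw_bytes.decode("utf-8")

print(garbled == original)   # False
print(restored == original)  # True
```

The bytes themselves are intact in both cases; only the choice of codec decides whether the text is readable.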

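How to fix the garbled text:

For example, our URL is https://www.baidu.com. First open this address in a browser.
Right-click the page and choose Inspect. In the HTML that appears, find the meta tag inside head and look for charset=utf-8; utf-8 is the encoding the page uses.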
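
The same lookup can be sketched in code instead of by hand. This is a minimal illustration, not a full HTML parser: `find_charset` is a hypothetical helper that returns the first `charset=...` declaration it finds in the raw bytes, and `sample` is a shortened stand-in for the real page.

```python
import re
from typing import Optional


def find_charset(html_bytes: bytes) -> Optional[str]:
    """Return the first charset=... value found in the raw HTML, lowercased."""
    match = re.search(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", html_bytes, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


sample = b"<html><head><meta http-equiv=content-type content=text/html;charset=utf-8></head></html>"
print(find_charset(sample))  # utf-8
```

With requests, the bytes would come from `response.content`, and the result could be assigned to `response.encoding` before reading `response.text`.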

Set the encoding on the response

response.encoding = 'utf-8'    # set the encoding explicitly; the garbled text disappears
print(response.text)

Output: <!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

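Summary:

import requests      # import the module
url = 'https://www.baidu.com'
response = requests.get(url)      # send a GET request and receive the response for the URL
response.encoding = 'utf-8'       # set the encoding explicitly so the text is not garbled
print(response)          # prints <Response [200]>, meaning the request succeeded
print(response.text)     # the HTML of the response, as a str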

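The HTTP protocol:

A protocol is a set of rules that both sides follow. HTTP is the set of rules the client (the browser) and the server use to "talk" to each other.
HTTP methods:

GET: retrieve information from the site through a URL, without changing anything on the server
POST: send information to the site through a URL, which can change the site's state
and others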
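
The difference between the two methods can be seen by building requests without sending them. The sketch below uses the standard library's `urllib` rather than requests so that it runs offline; the `https://example.com/login` URL and the form fields are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters travel in the URL's query string.
query = urlencode({"wd": "python"})
get_request = Request(f"https://www.baidu.com/s?{query}")

# POST: the parameters travel in the request body, as bytes.
# The URL and the form fields here are hypothetical.
form = urlencode({"user": "alice", "note": "hello"}).encode("utf-8")
post_request = Request("https://example.com/login", data=form)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

With requests, the corresponding calls are `requests.get(url, params=...)` and `requests.post(url, data=...)`.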

What HTTP transmits:

Request: URL + request headers
Response: HTML + response headers
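
To make the two halves concrete, the sketch below splits a hand-written request and response into their parts. The messages are invented for illustration and are not captured traffic; `split_message` is a hypothetical helper that assumes well-formed `Name: value` header lines.

```python
# Hand-written messages, for illustration only.
raw_request = (
    "GET /nn/ HTTP/1.1\r\n"
    "Host: www.xicidaili.com\r\n"
    "User-Agent: demo-client\r\n"
    "\r\n"
)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body>hello</body></html>"
)


def split_message(raw: str):
    """Split a raw HTTP message into (start line, headers dict, body)."""
    head, _, body = raw.partition("\r\n\r\n")   # a blank line ends the headers
    start_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return start_line, headers, body


status_line, response_headers, html = split_message(raw_response)
print(status_line)
print(response_headers["Content-Type"])
print(html)
```

With requests, the same pieces are exposed as `response.status_code`, `response.headers` and `response.text`.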

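Three important fields in the request headers:

User-Agent: identifies what kind of client is making the request
Referer: the address of the source page that linked to the current one
Cookie: stored information that corresponds to the Session on the server side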
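
These three fields are usually written as a plain dict. The sketch below builds one and uses the standard library to show that a Cookie value is just a list of name=value pairs. Every value in it is a placeholder, not taken from a real browser session.

```python
from http.cookies import SimpleCookie

# All values below are placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "https://www.example.com/",
    "Cookie": "sessionid=abc123; theme=dark",
}

# A Cookie header holds name=value pairs separated by "; ".
cookie = SimpleCookie()
cookie.load(headers["Cookie"])
cookies = {name: morsel.value for name, morsel in cookie.items()}

print(cookies)
```

A dict shaped like `headers` can be passed to requests as `requests.get(url, headers=headers)`.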

For the details of the HTTP protocol, consult further reference material.

Example:

Take "https://www.xicidaili.com/nn/" as an example

import requests      # import the package
url = "https://www.xicidaili.com/nn/"          # the URL
response = requests.get(url)           # the response for the URL
# response.text is a str
# response.content is bytes
# when the encoding is utf-8, response.text is equivalent to response.content.decode("utf-8")
# save the HTML to a file
with open("xicidaili.html", "wb") as f:            # create the file in binary mode
    f.write(response.content)                      # write the response body to it

Output (contents of xicidaili.html):
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body bgcolor="white">
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>

As you can see, 503 is an error status and the response is not the page we wanted. This is because we left out the HTTP request headers.
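
One way to avoid saving an error page by accident is to check the status code before writing the file. The sketch below is a minimal illustration with simulated responses, so it runs without network access; `save_if_ok` is a hypothetical helper name.

```python
import os
import tempfile


def save_if_ok(status_code: int, content: bytes, path: str) -> bool:
    """Write content to path only when the status code signals success."""
    if not 200 <= status_code < 300:
        print(f"not saving: server answered {status_code}")
        return False
    with open(path, "wb") as f:
        f.write(content)
    return True


# Simulated responses written to a temporary directory.
target = os.path.join(tempfile.mkdtemp(), "xicidaili.html")
print(save_if_ok(503, b"<h1>503 Service Temporarily Unavailable</h1>", target))
print(save_if_ok(200, b"<html>ok</html>", target))
```

With requests, the arguments would be `response.status_code` and `response.content`.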

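So we add request headers to the program above:

1) First find out what the request headers contain:

Open the URL above in a browser, press F12 (or Fn+F12), click Network, refresh the page, and click the first entry in the Name column. You can then see the Response Headers and the Request Headers.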
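
Header text copied out of the Network panel can be turned into a dict with a few lines of code. The sketch below is a minimal illustration: `headers_from_text` is a hypothetical helper, and the Referer value in the sample is made up.

```python
def headers_from_text(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, malformed lines, and HTTP/2 pseudo-headers
        # such as ":authority", which start with a colon.
        if not line or line.startswith(":") or ":" not in line:
            continue
        name, value = line.split(":", 1)   # split on the first colon only
        headers[name.strip()] = value.strip()
    return headers


# Sample text in the shape the browser shows; the Referer value is a placeholder.
copied = """
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36
Referer: https://www.example.com/
"""
print(headers_from_text(copied))
```

Splitting on the first colon only keeps values that contain colons themselves, such as URLs, intact.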

import requests
url = "https://www.xicidaili.com/nn/"
# add request headers as a dict; start with only User-Agent and see whether that is enough
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",  # if this is not enough, add the other request header fields as well
}
response = requests.get(url, headers=headers)   # pass the headers along with the request
# save the HTML to a file
with open("xicidaili.html", "wb") as f:
    f.write(response.content)

Output:
<!DOCTYPE html>
<html>
<head><title>国内高匿免费HTTP代理IP__第1页国内高匿</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/><meta name="Description" content="国内高匿免费HTTP代理" /><meta name="Keywords" content="国内高匿,免费高匿代理,免费匿名代理,隐藏IP" /><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0"><meta name="applicable-device"content="pc,mobile"><link rel="stylesheet" media="screen" href="//fs.xicidaili.com/assets/application-9cf10a41b67e112fe8dd709caf886c0556b7939174952800b56a22c7591c7d40.css" /><meta name="csrf-param" content="authenticity_token" />
<meta name="csrf-token" content="qn2F+KCVgJtvALkmLuqo4lr++aSdVFWgWauEHp3x4XbXPBNCdSGxPjz2WBaHH5OpNzJDvcte4Hua/u5G1pnOmg==" />
</head>
<body>
,,,,,,,,,,,

That works. If it still fails, add the Referer and Cookie values from the Request Headers to the dict as well; if that still fails, add every field from the Request Headers to the dict.
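
The escalation described above can be sketched as a loop over header sets, from smallest to largest. Everything here is an assumption made for illustration: `fetch_with_fallback` and `fake_fetch` are hypothetical names, the Referer and Cookie values are placeholders, and the fake server stands in for the real site so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional, Tuple

Fetch = Callable[[str, Dict[str, str]], Tuple[int, bytes]]


def fetch_with_fallback(url: str, header_sets: List[Dict[str, str]], fetch: Fetch) -> Optional[bytes]:
    """Try each header set in order and return the first body served with 200."""
    for headers in header_sets:
        status, body = fetch(url, headers)
        if status == 200:
            return body
    return None  # every header set was rejected


# Header sets from smallest to largest; Referer and Cookie are placeholders.
ua = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
with_referer_cookie = {**ua, "Referer": "https://www.example.com/", "Cookie": "sessionid=abc123"}


def fake_fetch(url: str, headers: Dict[str, str]) -> Tuple[int, bytes]:
    """Stand-in server that only answers 200 when a Cookie header is sent."""
    return (200, b"<html>ok</html>") if "Cookie" in headers else (503, b"")


print(fetch_with_fallback("https://www.xicidaili.com/nn/", [ua, with_referer_cookie], fake_fetch))
```

For real use, `fetch` could wrap `requests.get(url, headers=headers)` and return `(response.status_code, response.content)`.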