Python spider, part 3: the requests module for web scraping

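Introduction to requests

requests is a third-party library for handling HTTP requests. It is not bundled with Python 3, so install it first (pip install requests) and then import it.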

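Basic usage of the requests module

Sending a simple request and reading the response

1. requests.get()

Where GET requests are used

Downloading web pages

Searching

1.1 Downloading a web page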

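import requests  # install the requests library first

response = requests.get('https://www.baidu.com/')  # send the HTTP request
response.encoding = "utf-8"  # decode the body as UTF-8; otherwise the text is garbled
print(response.text)  # print the page content
print(response.status_code)  # print the status code; 200 means OK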

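response.text

Type: str

Decoding: uses the encoding that requests infers for the response, an educated guess based on the HTTP headers

Changing the encoding: response.encoding = "gbk"

response.content

Type: bytes

Decoding: none; these are the raw bytes

Decoding it yourself: response.content.decode("utf8")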
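The difference between response.text and response.content comes down to who does the decoding. A minimal sketch, using plain bytes as a stand-in for a real response body (the sample string and the choice of GBK are assumptions made for illustration):

```python
# Stand-in for response.content: raw bytes as they might arrive from a GBK-encoded page.
raw = "百度一下，你就知道".encode("gbk")

# Decoding with the right codec recovers the text,
# which is what response.content.decode("gbk") would do.
text_ok = raw.decode("gbk")

# Decoding with the wrong codec produces garbled text
# (or raises UnicodeDecodeError without errors="replace").
text_bad = raw.decode("utf-8", errors="replace")

print(text_ok)
print(text_ok == text_bad)  # the wrong codec does not round-trip
```

Setting response.encoding before reading response.text has the same effect as choosing the codec in the decode call here.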

Analyzing the page encoding (screenshots not preserved)

For convenience the downloaded page is opened in PyCharm here; a browser works just as well.

1.2 Saving an image

import requests  # install the requests library first

response = requests.get('https://www.baidu.com/img/bd_logo1.png')  # send the HTTP request
print(response.status_code)  # print the status code; 200 means OK

with open('baidu.png', 'wb') as f:  # an image is binary (bytes) data
    f.write(response.content)

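1.3 Searching

Notes on URL parameters

Many parameters in a URL are not needed. In a Baidu search URL, for example, only one field (wd) matters and the rest can be deleted.

Likewise, when you meet a URL with many parameters later on, try deleting parameters to see which ones are actually required.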
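As a rough illustration of trimming a URL down to the parameters that matter, the sketch below uses only the standard library. The helper name keep_params and the sample URL with its extra parameters are hypothetical, not taken from a real Baidu request:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url, wanted):
    """Return url with only the query parameters whose names are in `wanted`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in wanted
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical search URL carrying several parameters that are not needed.
long_url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&wd=python&rsv_pq=abc123"
short_url = keep_params(long_url, {"wd"})
print(short_url)
```

In practice you would request both URLs and compare the responses to confirm that the dropped parameters really were optional.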

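Removing the redundant parameters

import requests

query_string = input(":")
params = {"wd": query_string}  # equivalent form: requests.get("https://www.baidu.com/s", params=params)
url = "https://www.baidu.com/s?wd={}".format(query_string)

response = requests.get(url)
print(response.text)
print(response.request.headers)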

Baidu's anti-scraping check looks at the User-Agent, so supply one taken from a real browser (plenty are listed online).

# coding=utf-8
import requests

query_string = input(":")
params = {"wd": query_string}
url = "https://www.baidu.com/s?wd={}".format(query_string)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"}

response = requests.get(url, headers=headers)
print(response.text)
print(response.request.headers)


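More anti-scraping measures, and ways around them, are covered in a later topic.

2. requests.post()

Where POST requests are used:

Login and registration (POST keeps the fields out of the URL, so they do not end up in browser history or server logs; it still needs HTTPS to be safe in transit)

Sending large text content (POST places no limit on the length of the data)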
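To see where the data actually goes in each case, requests can build a request without sending it. The sketch below prepares a GET and a POST carrying the same field; the kw field and the sug URL come from the translation example in this article, and using that URL for the GET is purely for comparison:

```python
import requests

# Build (but do not send) a GET and a POST that carry the same field.
get_req = requests.Request(
    "GET", "https://fanyi.baidu.com/sug", params={"kw": "hello"}
).prepare()
post_req = requests.Request(
    "POST", "https://fanyi.baidu.com/sug", data={"kw": "hello"}
).prepare()

print(get_req.url)    # the field is part of the URL
print(post_req.url)   # the URL carries no query string
print(post_req.body)  # the field travels in the request body
print(post_req.headers["Content-Type"])
```

Because nothing is sent, this is a safe way to check what a params or data argument will produce before pointing the code at a real site.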

Baidu word translation

import json

import requests


def fanyi(keyword):
    base_url = 'https://fanyi.baidu.com/sug'
    # build the form data
    data = {'kw': keyword}
    # imitate a browser
    header = {
        "User-Agent": "mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    req = requests.post(url=base_url, data=data, headers=header)
    # the response body is a JSON string
    str_json = req.text
    # convert the JSON into a dict
    myjson = json.loads(str_json)
    info = myjson['data'][0]['v']
    print(info)


if __name__ == '__main__':
    while True:
        keyword = input('Enter a word to translate (q to quit): ')
        if keyword == 'q':
            break
        fanyi(keyword)


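requests in detail

The Python standard library offers modules such as urllib, urllib2 and httplib for HTTP requests (these are the Python 2 names; Python 3 has urllib.request and http.client), but their APIs are clumsy. They were built for a different era and a different internet, and they need a huge amount of work, even method overrides, to accomplish the simplest tasks.

Request types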

requests.get(url, params=None, **kwargs)

requests.post(url, data=None, json=None, **kwargs)

requests.put(url, data=None, **kwargs)

requests.head(url, **kwargs)

requests.delete(url, **kwargs)

requests.patch(url, data=None, **kwargs)

requests.options(url, **kwargs)

# All of the methods above are built on top of this one

requests.request(method, url, **kwargs)

requests wraps the common HTTP methods for you, so you simply call the matching function. The full set of parameters is documented in the library source:

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)


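Advanced usage of the requests module

1. Using proxies

Why use a proxy

So that the server does not see all the requests as coming from the same client

To keep our real address from being exposed and traced back to us

proxies = {
    "http": "http://12.34.56.78:9000",
    "https": "https://12.34.56.78:9000",
}

requests.get("http://www.baidu.com", proxies=proxies)
requests.post("http://www.baidu.com", proxies=proxies)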

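Types of proxy IP

Transparent proxy

Anonymous proxy

Distorting proxy

Elite proxy (high-anonymity proxy)

Choosing an IP:

A high-anonymity proxy gives the server no sign that a proxy is in use; the other three types can all be detected.

By protocol, proxies divide into HTTP, HTTPS, SOCKS and so on; pick the one that matches the protocol of the site being scraped.

Refreshing the proxy pool

With purchased proxies (Beagle and the like), a large share (often more than 60%) may turn out to be unusable, so a program needs to test which ones work and delete the dead ones.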
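One way to structure that check is to separate the filtering loop from the test itself. The sketch below is an assumption about how such a checker might look, not a description of any particular proxy vendor's tooling; the names filter_working and check_with_requests are hypothetical, and the proxy addresses are placeholders:

```python
def filter_working(proxies, check):
    """Return the proxies for which check(proxy) is truthy; errors count as failures."""
    alive = []
    for proxy in proxies:
        try:
            if check(proxy):
                alive.append(proxy)
        except Exception:
            # Any failure (refused connection, timeout, bad reply) drops the proxy.
            pass
    return alive


def check_with_requests(proxy, test_url="http://httpbin.org/get", timeout=3):
    """One possible check: fetch test_url through the proxy and expect HTTP 200."""
    import requests  # imported here so filter_working has no third-party dependency

    resp = requests.get(test_url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
    return resp.status_code == 200


def fake_check(proxy):
    """Offline stand-in for a real check: only one placeholder address 'works'."""
    return proxy.endswith("78:9000")


pool = ["http://12.34.56.78:9000", "http://12.34.56.79:9000", "http://12.34.56.80:9000"]
print(filter_working(pool, fake_check))
```

For a real pool, call filter_working(pool, check_with_requests) and rerun it periodically, since proxies that work now may stop working later.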

2. Getting cookies

Method 1

requests.utils.dict_from_cookiejar converts a CookieJar object into a dict

import requests

url = "http://www.baidu.com"
response = requests.get(url)
print(type(response.cookies))

cookies = requests.utils.dict_from_cookiejar(response.cookies)
print(cookies)
'''
{'BDORZ': '27315'}
'''
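The conversion also works in the other direction, which is handy when cookies have been saved as a plain dict. A small offline sketch, building the jar by hand instead of fetching a page (the cookie value simply mirrors the sample output above):

```python
import requests
from requests.cookies import RequestsCookieJar

# Build a jar by hand; a real one would come from response.cookies.
jar = RequestsCookieJar()
jar.set("BDORZ", "27315", domain=".baidu.com", path="/")

# CookieJar -> dict
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)

# dict -> CookieJar, ready to pass as cookies=... in a later request
jar_again = requests.utils.cookiejar_from_dict(as_dict)
print(jar_again.get("BDORZ"))
```

Note that the dict form keeps only names and values; the domain and path stored in the original jar are lost in the round trip.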

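Method 2 (Windows version): read the cookies that Chrome has stored locally

import os
import sqlite3

import requests
from bs4 import BeautifulSoup
from win32crypt import CryptUnprotectData  # from pywin32, Windows only

# The original called set_cookie() and set_header(), which were not shown;
# a plain cookie dict and this fixed header stand in for them.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"}


def getcookiefromchrome(host='.oschina.net'):
    'Read the cookies Chrome has stored for one site'
    # Applies to older Chrome releases; newer ones changed both the encryption and the file location.
    cookiepath = os.environ['LOCALAPPDATA'] + r"\Google\Chrome\User Data\Default\Cookies"
    sql = "select host_key,name,encrypted_value from cookies where host_key=?"
    with sqlite3.connect(cookiepath) as conn:
        cu = conn.cursor()
        cookies = {name: CryptUnprotectData(encrypted_value)[1].decode()
                   for host_key, name, encrypted_value in cu.execute(sql, (host,)).fetchall()}
    return cookies


def ret_soup(url, cookies_dict):
    'Return a BeautifulSoup object for the page'
    response = requests.get(url, cookies=cookies_dict, headers=HEADERS, proxies=None, timeout=10)
    response.encoding = "utf8"
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


def start():
    'Program entry point'
    cookies_dict = getcookiefromchrome(host='.bitinfocharts.com')
    site = 'https://bitinfocharts.com'
    baseurl = site + '/top-100-richest-bitcoin-addresses.html'
    soup = ret_soup(baseurl, cookies_dict)
    print(soup)


if __name__ == '__main__':
    start()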

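3. Handling certificate errors

This error appears when the site's SSL certificate cannot be verified (ssl.CertificateError; requests reports it as requests.exceptions.SSLError). Passing verify=False skips the verification.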

import requests

url = "https://www.12306.cn/mormhweb/"

response = requests.get(url,verify=False)

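4. The timeout parameter

When browsing the web we often hit network hiccups, and a request may wait a long time and still return nothing.

In a scraper, a request that hangs drags down the efficiency of the whole project. So we put a hard limit on each request: it must return within a given time or raise an error.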

response = requests.get(url,timeout=5)
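When the limit is exceeded, requests raises requests.exceptions.Timeout (or one of its subclasses), which the caller has to handle. A minimal sketch, assuming a hypothetical helper fetch_or_none; the get argument and the always_times_out stand-in exist only so the example can run without a network:

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10), get=requests.get):
    """GET url; return the response, or None if the request timed out.

    timeout may be a single number or a (connect, read) tuple.
    """
    try:
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None


def always_times_out(url, timeout):
    """Stand-in for requests.get that simulates a server which never answers."""
    raise requests.exceptions.ReadTimeout(f"no data from {url} within {timeout}s")


print(fetch_or_none("http://example.com/slow", timeout=5, get=always_times_out))
```

Returning None is one design choice among several; a scraper might instead log the URL and queue it for a later retry.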

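5. The retrying module

The timeout above keeps slow requests from stalling us. But in ordinary browsing, when a page is very slow we usually just refresh it. Can the code "refresh" a request in the same way?

The retrying module: https://pypi.org/project/retrying/

Using the retrying module

Use the retry decorator that retrying provides

Apply it as a decorator so that the decorated function is run repeatedly

retry accepts the parameter stop_max_attempt_number: when the function raises, it is run again, up to that maximum number of attempts. If every attempt raises, the error propagates from the function; if any attempt succeeds, the program carries on.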
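To make that behaviour concrete, here is a hand-rolled decorator with the same stop-after-N-attempts semantics. It is a simplified illustration written for this article, not the retrying library's implementation; the name retry_sketch and the flaky function are hypothetical:

```python
import functools


def retry_sketch(stop_max_attempt_number=3):
    """Re-run the wrapped function until it succeeds or the attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, stop_max_attempt_number + 1):
                try:
                    return func(*args, **kwargs)  # success: stop retrying
                except Exception:
                    if attempt == stop_max_attempt_number:
                        raise  # every attempt failed: let the last error out
        return wrapper
    return decorator


calls = []


@retry_sketch(stop_max_attempt_number=3)
def flaky():
    """Fails on the first two calls, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky(), len(calls))
```

The real library adds options this sketch leaves out, such as waiting between attempts.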

# parse.py
import requests
from retrying import retry

headers = {}


@retry(stop_max_attempt_number=3)  # at most 3 attempts; the error is raised only if all 3 fail
def _parse_url(url):
    response = requests.get(url, headers=headers, timeout=3)  # a timeout raises and triggers a retry
    assert response.status_code == 200  # a non-200 status also raises and triggers a retry
    return response


def parse_url(url):
    try:  # catch the exception
        response = _parse_url(url)
    except Exception as e:
        print(e)
        response = None
    return response

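Getting past basic anti-scraping measures

The examples above show that:

Basic anti-scraping measures:
1. Checking whether the client looks like a browser, using fields in the request headers (User-Agent, Cookie, ...)
2. Rate limiting and banning: throttling or blocking a particular IP or account

Countermeasures:
1. Copy the browser's header values into the program's headers
2. Use multiple proxy IPs or multiple accounts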
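The two countermeasures can be combined by rotating through a list of header values and proxies. The sketch below only builds the keyword arguments and sends nothing; the helper name next_request_kwargs is hypothetical, the two User-Agent strings are the ones used elsewhere in this article, and the proxy addresses are placeholders:

```python
import itertools

# Placeholder values: substitute User-Agent strings and proxies you are allowed to use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
]
PROXIES = ["http://12.34.56.78:9000", "http://12.34.56.79:9000"]

_ua_cycle = itertools.cycle(USER_AGENTS)
_proxy_cycle = itertools.cycle(PROXIES)


def next_request_kwargs():
    """Return headers/proxies keyword arguments for the next requests call."""
    proxy = next(_proxy_cycle)
    return {
        "headers": {"User-Agent": next(_ua_cycle)},
        "proxies": {"http": proxy, "https": proxy},
    }


first = next_request_kwargs()
second = next_request_kwargs()
print(first["proxies"]["http"], second["proxies"]["http"])
# Intended use: requests.get(url, timeout=5, **next_request_kwargs())
```

Round-robin rotation is the simplest policy; picking at random or retiring proxies that fail are equally plausible variations.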

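Common anti-scraping measures

Request headers

user-agent: the device and browser the user is on

Referer: "xxx"

content-type: application/json

host

Requiring a cookie or token with the request

Encryption

Detecting IP changes

Limiting request frequency

CAPTCHAs

Hiding part of the data on the login page

Loading content dynamically with JavaScript, which is complex to analyze

Spotting large numbers of requests that load only the HTML and no CSS, JS or media

Serving fake data once a scraper is detected

A robust account system

More examples

Scraping Autohome (汽车之家)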

import requests
from bs4 import BeautifulSoup  # install the BeautifulSoup4 library first

response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = 'gbk'

soup = BeautifulSoup(response.text, 'html.parser')
tag = soup.find(id='auto-channel-lazyload-article')  # BeautifulSoup tags support chained calls
h3 = tag.find(name='h3')
print(h3)

import re

import requests
from bs4 import BeautifulSoup

# Find all the news items:
# title, summary, url, image

HTTPS = 'https:'  # URLs in the page lack the scheme; prepend it before requesting

response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = 'gbk'

soup = BeautifulSoup(response.text, 'html.parser')
li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')[:3]  # take only 3 items

for li in li_list:
    title = li.find('h3')
    if not title:
        continue
    summary = li.find('p').text
    # li.find('a').attrs is a dict
    url = HTTPS + li.find('a').get('href')  # same as li.find('a').attrs['href']
    img = HTTPS + li.find('img').get('src')

    # download the image
    res = requests.get(img)
    re_image_name = re.match(r'.*__(.*)\.jpg', img)
    if re_image_name:
        image_name = re_image_name.group(1)
        file_name = "%s.jpg" % (image_name,)
        with open(file_name, 'wb') as f:
            f.write(res.content)
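Prepending a fixed scheme works when every link in the page is protocol-relative. A more forgiving alternative, offered here as a suggestion and not as the article's method, is urllib.parse.urljoin, which resolves each kind of link against the page URL. The sample paths below are hypothetical:

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.autohome.com.cn/news/"

# Hypothetical href/src values of the kinds found in real pages.
samples = [
    "//www.autohome.com.cn/news/202001/1.html",  # protocol-relative
    "/news/2.html",                              # root-relative
    "https://www.autohome.com.cn/news/3.html",   # already absolute
]

for raw in samples:
    print(urljoin(PAGE_URL, raw))
```

All three forms come out as complete https URLs, so the scraping loop does not need to know which form the page happens to use.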

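Logging in to GitHub

Open the browser's developer tools and look at the Form Data

Tip: in the Network panel, the entry for the current page is highlighted in blue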

import requests
from bs4 import BeautifulSoup

# get the token
r1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(r1.text, 'html.parser')
token = s1.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
r1_cookie_dict = r1.cookies.get_dict()

# POST the username, password and token to the server
"""
utf8:✓
authenticity_token:ollV+avLm6Fh3ZevegPO7gOH7xUzEBL0NWdA1aOQ1IO3YQspjOHbfnaXJOtVLQ95BtW9GZlaCIYd5M6v7FGUKg==
password:
commit:Sign in
"""
r2 = requests.post(
    'https://github.com/session',
    data={
        "utf8": '✓',
        "authenticity_token": token,
        'login': 'xxx',
        'password': 'xxx',
        'commit': 'Sign in'
    },
    cookies=r1_cookie_dict
)
r2_cookie_dict = r2.cookies.get_dict()

# Most sites that need a login only require the cookie returned by the login POST;
# some, like GitHub, also need the cookie from the GET of the login page
cookie_dict = {}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)

r3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict
)

r4 = requests.get(
    url='https://github.com/ecithy/online-edu',
    cookies=cookie_dict
)
print(r4.text)

Logging in to Chouti (抽屉)

Method 1

# 1. log in, cookie
# 2. tag url, xxxx

import requests

# 1. get the cookie
r0 = requests.get('http://dig.chouti.com/')
r0_cookie_dict = r0.cookies.get_dict()

# 2. send the username and password along with the cookie
r1 = requests.post(
    'http://dig.chouti.com/login',
    data={
        'phone': 'xxx',
        'password': 'xxx',
        'oneMonth': 1
    },
    cookies=r0_cookie_dict
)
r1_cookie_dict = r1.cookies.get_dict()

cookie_dict = {}
cookie_dict.update(r0_cookie_dict)
cookie_dict.update(r1_cookie_dict)

r2 = requests.post('http://dig.chouti.com/link/vote?linksId=13915601', cookies=cookie_dict)
print(r2.text)

Method 2: session (stores cookies automatically, which avoids handling them by hand)

import requests

session = requests.Session()
r1 = session.get(url="http://dig.chouti.com/")
r2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "xxx",
        'password': "xxx",
        'oneMonth': ""
    }
)
r3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=13915601"
)
print(r3.text)
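What the session does behind the scenes can be observed without sending anything. In this sketch a cookie is placed in the session by hand, standing in for one that a login response would have set (the name sid and its value are made up), and prepare_request shows the header the next request would carry:

```python
import requests

session = requests.Session()

# Pretend a login response set this cookie; the name and value are made up.
session.cookies.set("sid", "abc123")

# prepare_request builds the outgoing request without sending it.
prepared = session.prepare_request(requests.Request("GET", "http://dig.chouti.com/"))
print(prepared.headers.get("Cookie"))
```

Every later call on the same session object gets the stored cookies attached in this way, which is why the manual cookie_dict merging of method 1 is no longer needed.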

Logging in to Zhihu

# -*- coding: utf-8 -*-
__author__ = 'hy'

import re

import requests

try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")  # file-backed jar, so load() and save() work
try:
    session.cookies.load(ignore_discard=True)  # load saved cookies
except Exception:
    print("Could not load cookies")

agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    'User-Agent': agent
}


def is_login():
    # use the status code returned by a personal page to tell whether we are logged in
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        print('Please log in')
    else:
        print('You are logged in')


def get_xsrf():
    # fetch the xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    # re.search scans the whole page; re.match with a '.*' prefix cannot get past the first line
    match_obj = re.search('name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj.group(1)
    else:
        return ""


def zhihu_login(account, password):
    # Zhihu login
    if re.match(r"^1\d{10}", account):
        print("Logging in with a phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
        }
    elif "@" in account:
        # the account is an email address
        print("Logging in with an email address")
        post_url = "https://www.zhihu.com/login/email"
        post_data = {
            "_xsrf": get_xsrf(),
            "email": account,
            "password": password
        }
    else:
        print("The account must be a phone number or an email address")
        return

    response_text = session.post(post_url, data=post_data, headers=header)
    session.cookies.save(ignore_discard=True)  # persist the cookies for the next run


zhihu_login("xxx", "xxx")
is_login()
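When pulling a hidden-field token such as _xsrf out of a page, the choice between re.match and re.search matters. The sketch below runs both against a made-up HTML fragment (the real page layout may differ):

```python
import re

# Made-up HTML fragment with the token on the third line.
html = '<html>\n<body>\n<input type="hidden" name="_xsrf" value="abc123"/>\n</body>\n</html>'

pattern = 'name="_xsrf" value="(.*?)"'

# re.match only tries at position 0, and '.' does not cross newlines by default,
# so a '.*' prefix cannot reach a token that sits on a later line.
with_match = re.match(".*" + pattern, html)

# re.search scans the whole string, so no prefix is needed.
with_search = re.search(pattern, html)

print(with_match)
print(with_search.group(1))
```

Passing re.DOTALL to re.match would also work, but re.search states the intent more directly.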
