Python libraries for scraping images_16: Python crawlers, scraping images in bulk with the Requests library
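Requests is an HTTP client library for Python.
Requests supports HTTP keep-alive and connection pooling, cookie-based session persistence, file uploads, automatic decoding of response content, internationalized URLs, and automatic encoding of POST data.
It is a high-level wrapper (built on urllib3) that makes network requests from Python far more approachable. With Requests you can easily reproduce most of the HTTP operations a browser performs. Modern, international, friendly.
Within a session, Requests keeps connections alive automatically (keep-alive).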
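Contents
I. Requests basics
II. Sending requests and receiving responses (basic GET requests)
III. Sending requests and receiving responses (basic POST requests)
IV. Response attributes
V. Proxies
VI. Cookies and sessions
VII. Case studies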
I. Requests basics
1. Installing the Requests library
pip install requests
2. Using the Requests library
import requests
II. Sending requests and receiving responses (basic GET requests)
response = requests.get(url)
1. Passing params
Parameters written directly in the URL
response = requests.get("http://httpbin.org/get?name=zhangsan&age=22")
print(response.text)
Parameters passed through the params argument of get
data = {
    "name": "zhangsan",
    "age": 30
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)
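To make the effect of params concrete, here is a minimal standard-library sketch of the URL that the second call ends up requesting. Requests performs this encoding internally; `build_url` is a hypothetical helper used only for illustration and is not part of Requests.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append params to base as a URL-encoded query string."""
    return base + "?" + urlencode(params)


url = build_url("http://httpbin.org/get", {"name": "zhangsan", "age": 30})
print(url)  # http://httpbin.org/get?name=zhangsan&age=30
```

Passing a dict is usually the safer choice, because non-ASCII values such as Chinese keywords are percent-encoded for you.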
2. Sending custom request headers (the headers argument)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
}
response = requests.get("http://httpbin.org/get", headers=headers)
print(response.text)
III. Sending requests and receiving responses (basic POST requests)
response = requests.post(url, data=data, headers=headers)
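IV. Response attributes
Attribute               Description
response.text           The response body as str (decoded to Unicode)
response.content        The response body as bytes
response.status_code    The HTTP status code of the response
response.headers        The response headers
response.request        The request that produced this response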
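The difference between response.text and response.content comes down to decoding: content is the raw bytes, and text is those bytes decoded with the encoding Requests settled on (exposed as response.encoding). The sketch below imitates that step with plain bytes and no network access; the sample string is made up for illustration.

```python
# Bytes as they would arrive over the wire (what response.content holds)
raw = "百度贴吧".encode("utf-8")

# Decoding with the right encoding recovers the text (what response.text holds)
text = raw.decode("utf-8")

# Decoding with the wrong encoding produces garbled characters instead
garbled = raw.decode("gbk", errors="replace")

print(type(raw).__name__, type(text).__name__)  # bytes str
print(text == "百度贴吧", garbled == text)       # True False
```

If response.text looks garbled on a real page, setting response.encoding before reading text, or decoding response.content yourself, is the usual way out.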
V. Proxies
proxies = {
    "http": "https://175.44.148.176:9000",
    "https": "https://183.129.207.86:14002"
}
response = requests.get("https://www.baidu.com/", proxies=proxies)
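VI. Cookies and sessions
Benefit of using cookies and sessions: many sites only return the data you want after you log in (or obtain some other permission).
Drawback of using cookies and sessions: one set of cookies and one session usually map to a single user. If you request too quickly or too often, the server can easily identify you as a crawler, and the account may be penalized.
1. Avoid cookies when you do not need them.
2. To fetch pages that are only available after login, requests must carry cookies. In that case, slow down data collection as much as possible to keep the account safe.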
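One simple way to slow collection down is to pause between consecutive requests. The sketch below is only an illustration: `fetch_slowly` and its `delay` parameter are hypothetical names, and `str.upper` stands in for a real call such as session.get so the example runs without a network connection.

```python
import time


def fetch_slowly(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in for session.get so the sketch runs offline
pages = fetch_slowly(["page1", "page2", "page3"], fetch=str.upper, delay=0.1)
print(pages)  # ['PAGE1', 'PAGE2', 'PAGE3']
```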
1. cookie
(1) Getting cookie information
response.cookies
2. session
(1) Creating a session object
session = requests.session()
Example:
def login_renren():
    login_url = 'http://www.renren.com/SysHome.do'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    }
    session = requests.session()
    login_data = {
        "email": "your-account",
        "password": "your-password"
    }
    # Log in first; the session stores the cookies the server sets
    response = session.post(login_url, data=login_data, headers=headers)
    # Later requests on the same session send those cookies automatically
    response = session.get("http://www.renren.com/971909762/newsfeed/photo")
    print(response.text)

login_renren()
VII. Case studies
Case 1: Crawling Baidu Tieba pages (GET requests)
import os
import sys

import requests


class BaiduTieBa:
    def __init__(self, name, pn):
        self.name = name
        # Base URL ending in "pn="; the offset is appended per page below
        self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=".format(name)
        self.headers = {
            # "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
            # Use an older User-Agent; that browser does not support JS
            "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
        }
        # The pn offset advances by 50 for each page
        self.url_list = [self.url + str(i * 50) for i in range(pn)]
        print(self.url_list)

    def get_data(self, url):
        """
        Fetch one page.
        :param url: URL of the page to request
        :return: the response body as bytes
        """
        response = requests.get(url, headers=self.headers)
        return response.content

    def save_data(self, data, num):
        """
        Save one page to disk.
        :param data: page content as bytes
        :param num: page index, used in the file name
        :return: None
        """
        os.makedirs("./pages", exist_ok=True)
        file_name = "./pages/" + self.name + "_" + str(num) + ".html"
        with open(file_name, "wb") as f:
            f.write(data)

    def run(self):
        for num, url in enumerate(self.url_list):
            data = self.get_data(url)
            self.save_data(data, num)


if __name__ == "__main__":
    name = sys.argv[1]
    pn = int(sys.argv[2])
    baidu = BaiduTieBa(name, pn)
    baidu.run()
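The list-page URLs in Case 1 differ only in the pn offset, which steps by 50. As a minimal sketch of that pattern, the hypothetical helper `build_page_urls` below builds the same kind of list with the standard library, which also percent-encodes a Chinese forum name. The `page_size` default of 50 is taken from the code above, not from any Baidu documentation.

```python
from urllib.parse import urlencode


def build_page_urls(name, pages, page_size=50):
    """Return one list-page URL per page, stepping pn by page_size."""
    base = "http://tieba.baidu.com/f?"
    return [
        base + urlencode({"kw": name, "ie": "utf-8", "pn": i * page_size})
        for i in range(pages)
    ]


for url in build_page_urls("python", 3):
    print(url)
# http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0
# http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50
# http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100
```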
Case 2: Translating with Jinshan Ciba (iciba) (POST requests)
import json
import sys

import requests


class JinshanCiBa:
    def __init__(self, words):
        self.url = "http://fy.iciba.com/ajax.php?a=fy"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
            "X-Requested-With": "XMLHttpRequest"
        }
        self.post_data = {
            "f": "auto",
            "t": "auto",
            "w": words
        }

    def get_data(self):
        """
        Send the POST request.
        :return: the response body as str
        """
        response = requests.post(self.url, data=self.post_data, headers=self.headers)
        return response.text

    def show_translation(self):
        """
        Print the translation result.
        :return: None
        """
        response = self.get_data()
        # json.loads no longer accepts an encoding argument (removed in Python 3.9)
        json_data = json.loads(response)
        if json_data['status'] == 0:
            translation = json_data['content']['word_mean']
        elif json_data['status'] == 1:
            translation = json_data['content']['out']
        else:
            translation = None
        print(translation)

    def run(self):
        self.show_translation()


if __name__ == "__main__":
    words = sys.argv[1]
    ciba = JinshanCiBa(words)
    ciba.run()
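The branching on status in Case 2 can be tried out without calling the service. The sketch below applies the same rule to hand-written JSON strings; the sample bodies are hypothetical and only mirror the fields that the code above reads (status, content.word_mean, content.out). `extract_translation` is an illustrative name.

```python
import json


def extract_translation(body):
    """Pick the translation out of a JSON response body, or return None."""
    data = json.loads(body)
    content = data.get("content", {})
    if data.get("status") == 0:
        return content.get("word_mean")
    if data.get("status") == 1:
        return content.get("out")
    return None


# Hypothetical response bodies shaped like the fields used in Case 2
print(extract_translation('{"status": 1, "content": {"out": "hello"}}'))
print(extract_translation('{"status": 0, "content": {"word_mean": ["n. test"]}}'))
print(extract_translation('{"status": -1}'))
```

Using dict.get here means an unexpected payload yields None rather than a KeyError.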
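Case 3: Scraping images from Baidu Tieba
(1) Basic version
Extract image URLs from the pages that were already downloaded, then fetch the images (see Case 1 for how the pages are downloaded)
import os

import requests
from lxml import etree


class DownloadPhoto:
    def __init__(self):
        os.makedirs("./photo", exist_ok=True)

    def download_img(self, url):
        response = requests.get(url)
        # Everything after the last "/" is used as the file name
        index = url.rfind('/')
        file_name = url[index + 1:]
        print("Downloading image: " + file_name)
        save_name = "./photo/" + file_name
        with open(save_name, "wb") as f:
            f.write(response.content)

    def parse_photo_url(self, page):
        html = etree.parse(page, etree.HTMLParser())
        # The full-size image URL is stored in the bpic attribute of each thumbnail
        nodes = html.xpath("//a[contains(@class, 'thumbnail')]/img/@bpic")
        print(nodes)
        print(len(nodes))
        for node in nodes:
            self.download_img(node)


if __name__ == "__main__":
    download = DownloadPhoto()
    # Adjust the range to the number of pages saved in Case 1
    for i in range(6000):
        download.parse_photo_url("./pages/校花_{}.html".format(i))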
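Taking everything after the last "/" works as long as the image URL has no query string. As a hedged alternative, the sketch below parses the URL first and keeps only the last path segment. `filename_from_url` is a hypothetical helper, and the example.com URLs are placeholders rather than real Tieba image addresses.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring query and fragment."""
    return posixpath.basename(urlsplit(url).path)


print(filename_from_url("http://example.com/forum/pic/item/abc123.jpg"))
print(filename_from_url("http://example.com/pic/abc123.jpg?size=large"))
# Both calls print: abc123.jpg
```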
(2) Multithreaded version
main.py
from lxml import etree

from file_download import DownLoadExecutioner, file_download


class XiaoHua:
    def __init__(self, init_url):
        self.init_url = init_url
        self.download_executioner = DownLoadExecutioner()

    def start(self):
        self.download_executioner.start()
        self.download_img(self.init_url)

    def download_img(self, url):
        # Follow "next page" links in a loop instead of recursing,
        # and stop when a page has no next link
        while url:
            html_text = file_download(url, type='text')
            html = etree.HTML(html_text)
            img_urls = html.xpath("//a[contains(@class,'thumbnail')]/img/@bpic")
            self.download_executioner.put_task(img_urls)
            # Get the link to the next page
            next_page = html.xpath("//div[@id='frs_list_pager']/a[contains(@class,'next')]/@href")
            url = "http:" + next_page[0] if next_page else None


if __name__ == '__main__':
    x = XiaoHua("http://tieba.baidu.com/f?kw=校花&ie=utf-8")
    x.start()
file_download.py
import os
import threading
from queue import Queue

import requests


def file_download(url, type='content'):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    r = requests.get(url, headers=headers)
    # Return decoded text for HTML pages, raw bytes for everything else
    if type == 'text':
        return r.text
    return r.content


class DownLoadExecutioner(threading.Thread):
    def __init__(self):
        super().__init__()
        self.q = Queue(maxsize=50)
        # Directory where images are saved
        self.save_dir = './img/'
        os.makedirs(self.save_dir, exist_ok=True)
        # Number of images downloaded so far
        self.index = 0

    def put_task(self, urls):
        if isinstance(urls, list):
            for url in urls:
                self.q.put(url)
        else:
            self.q.put(urls)

    def run(self):
        while True:
            # Blocks until the crawler puts another URL on the queue
            url = self.q.get()
            content = file_download(url)
            # Take the image name from the URL
            index = url.rfind('/')
            file_name = url[index+1:]
            save_name = self.save_dir + file_name
            with open(save_name, 'wb+') as f:
                f.write(content)
            self.index += 1
            print(save_name + " downloaded. Total images so far: " + str(self.index))
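The worker loop above runs forever, so the program never exits on its own once crawling is done. One common way to let such a consumer thread finish is to put a sentinel value on the queue. The sketch below shows the idea with stand-in names (`STOP`, `worker`, and `str.upper` in place of a real download); it is not a change to the class above.

```python
import threading
from queue import Queue

STOP = object()  # sentinel: tells the worker there is no more work


def worker(q, handle, done):
    """Take items off q and pass them to handle until STOP arrives."""
    while True:
        item = q.get()
        if item is STOP:
            break
        done.append(handle(item))


q = Queue(maxsize=50)
done = []
t = threading.Thread(target=worker, args=(q, str.upper, done))
t.start()

for url in ["a.jpg", "b.jpg", "c.jpg"]:
    q.put(url)
q.put(STOP)  # the producer is finished
t.join()     # returns once the worker has processed everything before STOP

print(done)  # ['A.JPG', 'B.JPG', 'C.JPG']
```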
(3) Thread pool version
main.py
from lxml import etree

from file_download_pool import DownLoadExecutionerPool, file_download


class XiaoHua:
    def __init__(self, init_url):
        self.init_url = init_url
        self.download_executioner = DownLoadExecutionerPool()

    def start(self):
        self.download_img(self.init_url)

    def download_img(self, url):
        # Follow "next page" links in a loop instead of recursing,
        # and stop when a page has no next link
        while url:
            html_text = file_download(url, type='text')
            html = etree.HTML(html_text)
            img_urls = html.xpath("//a[contains(@class,'thumbnail')]/img/@bpic")
            self.download_executioner.put_task(img_urls)
            # Get the link to the next page
            next_page = html.xpath("//div[@id='frs_list_pager']/a[contains(@class,'next')]/@href")
            url = "http:" + next_page[0] if next_page else None


if __name__ == '__main__':
    x = XiaoHua("http://tieba.baidu.com/f?kw=校花&ie=utf-8")
    x.start()
file_download_pool.py
import concurrent.futures as futures
import os
import threading

import requests


def file_download(url, type='content'):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    r = requests.get(url, headers=headers)
    # Return decoded text for HTML pages, raw bytes for everything else
    if type == 'text':
        return r.text
    return r.content


class DownLoadExecutionerPool:
    def __init__(self):
        # Directory where images are saved
        self.save_dir = './img_pool/'
        os.makedirs(self.save_dir, exist_ok=True)
        # Number of images downloaded so far; several worker threads
        # update it, so it is guarded by a lock
        self.index = 0
        self.lock = threading.Lock()
        # Thread pool
        self.ex = futures.ThreadPoolExecutor(max_workers=30)

    def put_task(self, urls):
        if isinstance(urls, list):
            for url in urls:
                self.ex.submit(self.save_img, url)
        else:
            self.ex.submit(self.save_img, urls)

    def save_img(self, url):
        content = file_download(url)
        # Take the image name from the URL
        index = url.rfind('/')
        file_name = url[index+1:]
        save_name = self.save_dir + file_name
        with open(save_name, 'wb+') as f:
            f.write(content)
        with self.lock:
            self.index += 1
            count = self.index
        print(save_name + " downloaded. Total images so far: " + str(count))
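A task handed to submit runs in the background, and any exception it raises stays hidden until the returned future is inspected. The sketch below shows one way to collect results and failures with as_completed. `fake_download` is a stand-in for save_img that fails on purpose for one URL, so nothing here touches the network.

```python
import concurrent.futures as futures


def fake_download(url):
    """Stand-in for save_img: fails for one URL to show error handling."""
    if "bad" in url:
        raise ValueError("cannot download " + url)
    return url


urls = ["a.jpg", "bad.jpg", "c.jpg"]
ok, failed = [], []

# Leaving the with-block waits for all submitted tasks to finish
with futures.ThreadPoolExecutor(max_workers=3) as ex:
    future_to_url = {ex.submit(fake_download, u): u for u in urls}
    for fut in futures.as_completed(future_to_url):
        url = future_to_url[fut]
        try:
            ok.append(fut.result())  # re-raises any exception from the task
        except ValueError:
            failed.append(url)

print(sorted(ok), failed)  # ['a.jpg', 'c.jpg'] ['bad.jpg']
```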