Crawling high-resolution Bing images by keyword with Python

From earlier blog posts, I learned that Bing lets you search for images by keyword through a URL:

https://www.bing.com/images/async?q=<keyword>&first=<index of the first image>&count=<number of images>&mmasync=1
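To make the parameters concrete, here is a minimal sketch that builds one URL per result page. The helper name `build_page_urls` and the default page size of 30 are assumptions for illustration; only the base URL and the four query parameters come from the pattern above. `urlencode` percent-encodes the keyword, so non-ASCII keywords are safe in the query string.

```python
from urllib.parse import urlencode

BING_ASYNC_URL = 'https://www.bing.com/images/async'


def build_page_urls(keyword, amount, per_page=30):
    """Return one async URL per result page needed to cover `amount` images.

    Hypothetical helper: the name and the per_page default are assumptions.
    """
    page_count = -(-amount // per_page)  # ceiling division
    urls = []
    for page in range(page_count):
        query = urlencode({
            'q': keyword,               # search keyword
            'first': page * per_page,   # index of the first image on this page
            'count': per_page,          # number of images requested
            'mmasync': 1,
        })
        urls.append('{}?{}'.format(BING_ASYNC_URL, query))
    return urls


urls = build_page_urls('电脑壁纸', 100)
print(len(urls))
print(urls[0])
```

Each URL differs only in `first`, which advances by `count` from page to page.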

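Building on this URL, I wrote a crawler class that downloads a fixed number of high-resolution Bing images for a given keyword. Calling it takes a single Python statement. Because a thread pool requests the images concurrently, downloads are fast; 300 high-resolution images a minute is no problem: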

# Keyword: 电脑壁纸 (computer wallpaper)
# Number of images wanted: 100
# Directory to save images in: 'E:\images'
BingImagesSpider('电脑壁纸', 100, r'E:\images').run()
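The speed comes from overlapping many I/O waits in a thread pool. The sketch below shows the idea without touching the network: `fake_download` is a hypothetical stand-in that sleeps instead of issuing an HTTP request, and the pool size of 8 is an arbitrary choice for the demo.

```python
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing.Pool API
from time import sleep, time


def fake_download(index):
    """Hypothetical stand-in for an HTTP request: sleeps instead of using the network."""
    sleep(0.1)
    return index * 2


start = time()
with Pool(8) as pool:
    # map() blocks until every task is done and keeps the results in input order
    results = pool.map(fake_download, range(16))
elapsed = time() - start

print(results[:4])
print('elapsed: {:.2f}s'.format(elapsed))
```

Run one after another, the 16 calls would sleep for 1.6 seconds in total; with 8 threads the waits overlap and the whole batch should finish in a fraction of that.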

The source code of the crawler class is as follows:

import json
import os
from multiprocessing.dummy import Pool
from time import time

import requests
from lxml import etree


# Purpose: crawl Bing images by keyword and amount, and save them to the given path.
# Usage: a single call, e.g. BingImagesSpider('电脑美女壁纸', 200, r'E:\images').run()
class BingImagesSpider:
    thread_amount = 1000  # thread pool size; the pool overlaps I/O requests to cut total HTTP time
    per_page_images = 30  # images requested per Bing page
    count = 0  # images attempted
    success_count = 0  # images saved successfully
    # Characters in image titles that are replaced with spaces
    ignore_chars = ['|', '.', '，', ',', '/', '@', ':', '：', ';', '；', '[', ']', '+']
    # Allowed image types
    image_types = ['bmp', 'jpg', 'png', 'tif', 'gif', 'pcx', 'tga', 'exif', 'fpx', 'svg', 'psd', 'cdr', 'pcd', 'dxf', 'ufo', 'eps', 'ai', 'raw', 'WMF', 'webp']
    # Request headers
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
    # Bing image URL
    bing_image_url_pattern = 'https://www.bing.com/images/async?q={}&first={}&count={}&mmasync=1'

    def __init__(self, keyword, amount, path='./'):
        # keyword: keyword to crawl
        # amount: number of images to crawl
        # path: directory the images are saved in
        self.keyword = keyword
        self.amount = amount
        self.path = path
        self.thread_pool = Pool(self.thread_amount)

    def __del__(self):
        self.thread_pool.close()
        self.thread_pool.join()

    # Purpose: request one Bing image page
    def request_homepage(self, url):
        # url: URL of the Bing image page
        return requests.get(url, headers=self.headers)

    # Purpose: parse a Bing page and return the info of every image on it as a list.
    # Each image is a dict with the keys image_title, image_type, image_md5, image_url
    def parse_homepage_response(self, response):
        # response: the response from Bing
        # Each image's info is a JSON string stored in the "m" attribute
        tree = etree.HTML(response.text)
        m_list = tree.xpath('//*[@class="imgpt"]/a/@m')

        # Process each image separately
        info_list = []
        for m in m_list:
            dic = json.loads(m)

            # Replace characters that should not appear in the file name
            image_title = dic['t']
            for char in self.ignore_chars:
                image_title = image_title.replace(char, ' ')
            image_title = image_title.strip()

            # Some image URLs carry no recognizable format; fall back to jpg
            image_type = dic['murl'].split('.')[-1]
            if image_type not in self.image_types:
                image_type = 'jpg'

            # Store each image's info as a dict
            info = dict()
            info['image_title'] = image_title
            info['image_type'] = image_type
            info['image_md5'] = dic['md5']
            info['image_url'] = dic['murl']
            info_list.append(info)
        return info_list

    # Purpose: request one image and save it to the path given at construction
    def request_and_save_image(self, info):
        # info: dict with the keys image_title, image_type, image_md5, image_url
        filename = '{} {}.{}'.format(self.count, info['image_title'], info['image_type'])
        filepath = os.path.join(self.path, filename)

        try:
            # Request the image
            response = requests.get(info['image_url'], headers=self.headers, timeout=1.5)
            # Save the image
            with open(filepath, 'wb') as fp:
                fp.write(response.content)
            # Log the result
            self.count += 1
            self.success_count += 1
            print('{}: saving {} done.'.format(self.count, filepath))
        except (requests.exceptions.RequestException, OSError) as e:
            # OSError also covers file names the operating system rejects
            self.count += 1
            print('{}: saving {} failed. url: {}'.format(self.count, filepath, info['image_url']))
            print('\t tip:', e)

    # Purpose: remove duplicate entries from the image info list
    def deduplication(self, info_list):
        result = []
        # Use the image's md5 as its unique identifier
        md5_set = set()
        for info in info_list:
            if info['image_md5'] not in md5_set:
                result.append(info)
                md5_set.add(info['image_md5'])
        return result

    # Purpose: run the crawler and download the images
    def run(self):
        # Create the directory the images are saved in
        os.makedirs(self.path, exist_ok=True)

        # Build the list of Bing pages to crawl from the keyword and the amount wanted.
        # Some images repeat, so request 1.5 times as many up front.
        homepage_urls = []
        for i in range(int(self.amount / self.per_page_images * 1.5) + 1):
            url = self.bing_image_url_pattern.format(self.keyword, i * self.per_page_images, self.per_page_images)
            homepage_urls.append(url)
        print('homepage_urls len {}'.format(len(homepage_urls)))

        # Request all Bing pages through the thread pool
        homepage_responses = self.thread_pool.map(self.request_homepage, homepage_urls)

        # Parse the info of every image (image_title, image_type, image_md5, image_url) from the pages
        info_list = []
        for response in homepage_responses:
            result = self.parse_homepage_response(response)
            info_list += result
        print('info amount before deduplication', len(info_list))

        # Drop duplicate images so nothing is downloaded twice
        info_list = self.deduplication(info_list)
        print('info amount after deduplication', len(info_list))
        info_list = info_list[:self.amount]
        print('info amount after split', len(info_list))

        # Download and save all images
        self.thread_pool.map(self.request_and_save_image, info_list)
        print('all done. {} successfully downloaded, {} failed.'.format(self.success_count, self.count - self.success_count))


if __name__ == '__main__':
    # Keyword: 电脑壁纸 (computer wallpaper)
    # Number of images wanted: 100
    # Directory to save images in: 'E:\images'
    start = time()
    BingImagesSpider('电脑壁纸', 100, r'E:\images').run()
    print(time() - start)
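One detail worth a closer look is how the image type is derived from the URL. Splitting the whole URL on '.' misreads addresses that carry a query string, so they always fall back to jpg. The sketch below is one possible alternative, not part of the class above: `guess_image_type` and the shortened `ALLOWED_TYPES` set are hypothetical, and the example.com URLs are placeholders.

```python
import os
from urllib.parse import urlparse

# Hypothetical, shortened list of accepted extensions (an assumption for this sketch)
ALLOWED_TYPES = {'bmp', 'jpg', 'png', 'gif', 'webp', 'svg', 'tif'}


def guess_image_type(url, default='jpg'):
    """Read the extension from the URL path only, ignoring query string and fragment."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lstrip('.').lower()
    return ext if ext in ALLOWED_TYPES else default


print(guess_image_type('https://example.com/a/b.PNG?w=1920'))
print(guess_image_type('https://example.com/image'))
```

The first call yields `png` because the query string is discarded and the extension is lowercased before the lookup; the second has no extension and falls back to the default.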
