Python Crawler (3): Downloading Images with Multithreading and Regex Matching (the wallhaven wallpaper site)

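1. The wallhaven wallpaper site

The images on this site are offered for download, and among wallpaper sites the quality is very high, with plenty of high-resolution pictures.
For details, visit its home page: wallhaven (https://wallhaven.cc/)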

2. Analyzing the page structure

1) Working out the address of every results page
The site's home page is:

https://wallhaven.cc/

After searching for the keyword china, the address becomes:

https://wallhaven.cc/search?q=china

After scrolling down to the second page, the address becomes:

https://wallhaven.cc/search?q=china&page=2

This gives us a general address pattern:

https://wallhaven.cc/search?q={}&page={}

The two pairs of curly braces are filled with the keyword and the page number, in that order.
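As a minimal sketch, the pattern can be filled in with str.format. The helper name build_page_urls is made up for this illustration, and the quote_plus call is an addition that is not part of the original pattern: it only matters for keywords containing spaces or other characters that are unsafe in a query string.

```python
from urllib.parse import quote_plus

BASE_URL = "https://wallhaven.cc/search?q={}&page={}"


def build_page_urls(keyword, page_count):
    """Return the search-result URLs for pages 1..page_count."""
    # quote_plus is an assumption of this sketch: it escapes spaces and
    # other characters that cannot appear literally in a query string.
    encoded = quote_plus(keyword)
    return [BASE_URL.format(encoded, page) for page in range(1, page_count + 1)]


for url in build_page_urls("china", 2):
    print(url)
```

For the keyword china this produces the same two addresses shown above, with page=1 written out explicitly for the first page.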

2) Getting the address of every image
Look at the URL of one image on the results page:

https://th.wallhaven.cc/small/mp/mp3dwm.jpg

Opening that image and viewing it on its own, however, shows this address:

https://w.wallhaven.cc/full/mp/wallhaven-mp3dwm.jpg

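Comparing the two images, the thumbnail shown on the results page is clearly of poor quality, and compression has also distorted its dimensions, so it is of limited use as a wallpaper or for anything else. So we want the high-resolution image from the detail page instead. At this point there are two ways to obtain the image address:

Go from the results page into each image's page and read the download address there
Take the low-resolution image address from the results page and rebuild it into the high-resolution download address

Here I choose the second method, which also lets me brush up on regular expressions I have not used in a long time.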
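The rewrite in the second method can be sketched as one small function. This is only an illustration: the name to_full_url is made up, and it assumes every thumbnail follows the th.wallhaven.cc/small/... pattern of the example above and that the full-size file keeps the thumbnail's file extension, which the two example URLs suggest but do not guarantee.

```python
import re


def to_full_url(thumb_url):
    """Turn a thumbnail URL into a (file_name, full_size_url) pair."""
    # Swap the thumbnail host and folder for the full-size ones.
    url = thumb_url.replace("//th.", "//w.", 1).replace("/small/", "/full/", 1)
    # Everything after the last "/" is the file name.
    tail = re.search(r"[^/]+$", url).group()
    new_tail = "wallhaven-" + tail
    # Replace only the trailing file name, leaving the rest of the URL intact.
    return new_tail, url[: -len(tail)] + new_tail


name, full = to_full_url("https://th.wallhaven.cc/small/mp/mp3dwm.jpg")
print(name)
print(full)
```

Matching on "//th." and "/small/" rather than on the bare strings "th" and "small" keeps the replacement from touching an image ID that happens to contain those letters.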

3. Implementation

1) Import the required libraries

import os
import requests
import redis
from lxml import etree
import urllib.request
import time
import threading
from queue import Queue
import re

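2) I plan to use redis later as middleware for storing URLs and similar information. It is not used yet, so I only define the connection here (this step can be skipped). If no Redis server is running on localhost:6379, r.ping() raises a connection error, so leave these lines out in that case.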

pool = redis.ConnectionPool(host='localhost',port=6379,decode_responses=True)
r = redis.Redis(connection_pool=pool)
print(r.ping())
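One possible use for the connection, sketched here purely as an assumption about where this could go, is to remember which image URLs have already been handled. The function name mark_if_new and the key name wallhaven:seen_urls are hypothetical; the only Redis command used is SADD, which returns how many members were newly added to the set.

```python
SEEN_KEY = "wallhaven:seen_urls"  # hypothetical key name for this sketch


def mark_if_new(client, url, key=SEEN_KEY):
    """Record url in a Redis set; return True only the first time it is seen."""
    # SADD returns the number of members actually added,
    # so 0 means the URL was already in the set.
    return client.sadd(key, url) == 1
```

With the connection defined above, a call would look like mark_if_new(r, img_url), and a download could be skipped whenever it returns False.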

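3) Main functions and the crawling process
Notes:

Fill in the contents of Headers according to your own setup
Producer class: collects the addresses of all the images
Consumer class: downloads the images
The keyword I use here is 'anime'; input() can be used instead to choose the keyword interactively
If you find a mistake or something that does not look right, please contact me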

Headers = {
    'authority': 'wallhaven.cc',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'cookie': '',  # fill in your own
    'referer': 'https://wallhaven.cc/',
    'user-agent': '',  # replace with your own
}
from queue import Empty  # raised by get_nowait() when the queue is empty


class Producer(threading.Thread):
    """Fetches search-result pages and queues the full-size image URLs."""

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            # get_nowait() avoids blocking forever when another producer
            # takes the last page between an empty() check and get().
            try:
                page_url = self.page_queue.get_nowait()
            except Empty:
                break
            try:
                self.parse_indexPage(self.deal_url(page_url))
            except requests.RequestException as exc:
                print("page fetch failed:", page_url, exc)

    def deal_url(self, url):
        response = requests.get(url, headers=Headers, timeout=10)
        response.encoding = response.apparent_encoding
        return response.text

    def parse_indexPage(self, text):
        html = etree.HTML(text)
        wallpaper_urls = html.xpath("//img[@alt='loading']/@data-src")
        for url in wallpaper_urls:
            url = url.replace('th', 'w', 1)     # replace the first "th" with "w"
            url = url.replace('small', 'full')  # replace "small" with "full"
            # regex: everything after the last "/" in the URL
            last_urlTail = re.search(r'[^/]+(?!.*/)', url).group()
            new_urlTail = 'wallhaven-' + last_urlTail
            url = url.replace(last_urlTail, new_urlTail)  # rebuild the address
            self.img_queue.put((new_urlTail, url))


class Consumer(threading.Thread):
    """Downloads every image taken from img_queue."""

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            item = self.img_queue.get()
            if item is None:  # sentinel put by main(): no more images
                break
            img_name, img_url = item
            self.save_img(img_name, img_url)

    def save_img(self, img_name, img_url):
        root = './3_figure/'
        path = root + img_name
        try:
            os.makedirs(root, exist_ok=True)
            if not os.path.exists(path):
                read_figure = requests.get(img_url, timeout=30)
                read_figure.raise_for_status()  # do not save an error page as an image
                with open(path, 'wb') as f:
                    f.write(read_figure.content)
                print(path + " save ok!")
            else:
                print('file already saved')
        except (requests.RequestException, OSError) as exc:
            print("download failed:", img_url, exc)


def main():
    base_url = 'https://wallhaven.cc/search?q={}&page={}'
    # keyword = input("please input the keyword")
    keyword = 'anime'
    page_queue = Queue(20)
    img_queue = Queue(200)
    page_num = 10
    for x in range(1, page_num + 1):
        url = base_url.format(keyword, x)
        page_queue.put(url)
        print(url)

    producers = [Producer(page_queue, img_queue) for _ in range(5)]
    consumers = [Consumer(page_queue, img_queue) for _ in range(5)]
    for t in producers + consumers:
        t.start()
    # Once every producer has finished, send one None per consumer so that
    # each consumer leaves its loop instead of waiting on an empty queue.
    for t in producers:
        t.join()
    for _ in consumers:
        img_queue.put(None)
    for t in consumers:
        t.join()


if __name__ == '__main__':
    main()
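The hand-off between producer and consumer threads can be tried out without any network access. The sketch below is not the crawler itself: it uses plain functions instead of Thread subclasses, fake work instead of downloads, and a None sentinel (one common way, assumed here, to tell consumers that no more items will arrive). All names in it are made up for the illustration.

```python
import threading
from queue import Queue

STOP = None  # sentinel telling a consumer to exit


def producer(pages, items):
    # Stand-in for "fetch a results page and extract its image URLs".
    for page in pages:
        for n in range(3):
            items.put(f"page{page}-img{n}")


def consumer(items, done, lock):
    while True:
        item = items.get()
        if item is STOP:
            break
        with lock:
            done.append(item)  # stand-in for "download and save the image"


def run_pipeline(page_count=4, consumer_count=2):
    items, done, lock = Queue(maxsize=5), [], threading.Lock()
    pages = list(range(1, page_count + 1))
    half = len(pages) // 2
    producers = [threading.Thread(target=producer, args=(chunk, items))
                 for chunk in (pages[:half], pages[half:])]
    consumers = [threading.Thread(target=consumer, args=(items, done, lock))
                 for _ in range(consumer_count)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()            # every item has been queued after this point
    for _ in consumers:
        items.put(STOP)     # one sentinel per consumer
    for t in consumers:
        t.join()
    return sorted(done)


print(len(run_pipeline()))
```

Because the sentinels are queued only after all producers have finished, a consumer never exits while work is still being produced, and it never blocks forever on an empty queue.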

4. Results

I downloaded 10 pages of images, 24 per page, 240 in total:

Opening one of the images:

The quality looks quite good.
