Scraping wallpaper data with Python: crawling wallhaven wallpapers

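Today I stumbled across yet another wallpaper site, and it seems to have a lot of wallpapers. A lot? Then I can't resist: let's crawl it.

The site we are crawling today is https://wallhaven.cc/.

Let's analyze it first:

On the home page, we start by looking for tags, and sure enough there is a prominent link to the tags page.

After clicking through, we find:

There are three levels of tag categories, and each third-level tag corresponds directly to many images.

Suppose we click one at random, say "anime girls"; we see that its tag id is 5.

The first page shows only a few images, but there is a button to see more, so we click it.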

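The URL is quite distinctive: it carries q=tagId (in the form id:5 for this tag). My guess is that q stands for query.

Later testing showed that we can also search with a keyword we type ourselves, in which case q equals that keyword. Today, though, we want to crawl the site by tag.

Scroll down a little with the mouse and a page marker for pagination appears.

%3A is simply the colon (:), which appears percent-encoded as %3A in the address bar.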
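To make the encoding concrete, here is a minimal sketch using the standard library. The helper name `build_search_url` is my own and not part of the site or the original code; the `id:5` value follows the tag id found above.

```python
from urllib.parse import urlencode, unquote

SEARCH_URL = 'https://wallhaven.cc/search?'


def build_search_url(query):
    """Build a search URL; urlencode percent-encodes ':' as %3A."""
    return SEARCH_URL + urlencode({'q': query})


tag_url = build_search_url('id:5')
print(tag_url)           # the colon shows up as %3A
print(unquote(tag_url))  # decoding turns %3A back into ':'
```

The same helper works for a free-text keyword, since `q` is just the query string in both cases.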

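Scrolling down triggers pagination, and the new pages are shown within the same page, so we open the Network tab of the developer tools and look at the XHR requests.

That settles it: once we have a tagId, we only need to add the page parameter to get every page of results.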
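As a rough sketch of that idea, the generator below produces the URL of each results page for one tag. The function name `page_urls` and the fixed page range are assumptions for illustration; the real crawler does not know the last page in advance and stops when the site signals the end.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://wallhaven.cc/search?'


def page_urls(tag, first_page, last_page):
    """Yield one search URL per page for the given tag query."""
    for page in range(first_page, last_page + 1):
        yield SEARCH_URL + urlencode({'q': tag, 'page': page})


for url in page_urls('id:5', 1, 3):
    print(url)
```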

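We locate each image through the page elements. Usually we see a thumbnail first, and only after clicking it do we get the high-resolution image.

From experience, a thumbnail and its full-size image usually correspond directly, and their URLs follow a pattern.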

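[Three example pairs appeared here, each a thumbnail URL followed by the URL of its full-size image; the URLs themselves are not preserved in this copy.]

Comparing the pairs confirms that there is indeed a pattern.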

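To summarize the rule: take the file name of the thumbnail; the full-size image lives at https://w.wallhaven.cc/full/{first two characters of the file name}/wallhaven-{file name}.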
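The rule can be sketched as a small function, mirroring the `ImgUrl` template used in the full code. The thumbnail URL in the example is made up purely for illustration, and the sketch assumes the full-size image keeps the thumbnail's file extension, which may not always hold.

```python
FULL_IMAGE_URL = 'https://w.wallhaven.cc/full/{}/wallhaven-{}'


def full_image_url(thumb_url):
    """Derive the full-size image URL from a thumbnail URL."""
    file_name = thumb_url.split('/')[-1]  # e.g. 'abc123.jpg'
    # The first two characters of the file name form the directory
    return FULL_IMAGE_URL.format(file_name[:2], file_name)


# Hypothetical thumbnail URL, not taken from the site
example_thumb = 'https://th.wallhaven.cc/small/ab/abc123.jpg'
print(full_image_url(example_thumb))
```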

That completes the analysis.

The complete code is as follows:

import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup

rootrurl = 'https://wallhaven.cc/?'
searchUrl = 'https://wallhaven.cc/search?'
ImgUrl = 'https://w.wallhaven.cc/full/{}/wallhaven-{}'
save_dir = 'D:/estimages/'

# Request headers that make the crawler look like a browser
headers = {
    "Referer": rootrurl,
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive'
}


def saveOneImg(dir_path, img_url):
    # Same browser-like headers, but with the image URL as Referer, to avoid
    # HTTP 403 responses caused by hotlink protection
    new_headers = dict(headers, Referer=img_url)
    try:
        img = requests.get(img_url, headers=new_headers)  # request the actual image URL
        if img.status_code != 200:
            return False
        # The URL already ends with the file name, extension included
        file_name = img_url.split('/')[-1].split('?')[0]
        with open(os.path.join(dir_path, file_name), 'wb') as f:
            f.write(img.content)  # write the image to a local file
        print(img_url)
        return True
    except Exception as e:
        print('exception occurs: ' + img_url)
        print(e)
        return False


def processOnePages(tmpDir, imgs):
    for img in imgs:
        src = img.get('data-src')
        if not src:
            continue  # skip <img> tags that carry no data-src attribute
        code = src.split('/')[-1]  # file name of the thumbnail
        # Build the full-size image URL. It reuses the thumbnail's extension,
        # so an image stored under another extension fails and is skipped.
        saveOneImg(tmpDir, ImgUrl.format(code[:2], code))


def oneSpiderProcess(name, tag):
    tmpDir = os.path.join(save_dir, name)
    os.makedirs(tmpDir, exist_ok=True)
    page = 1
    while True:
        params = {'q': tag, 'page': page}
        url = searchUrl + urlencode(params)
        print('current page is: %s' % url)
        html = BeautifulSoup(requests.get(url, headers=headers).text, features="html.parser")
        # The crawler treats a pagination-notice footer as the end of the results
        if html.find('footer', {'class': 'pagination-notice'}) is not None:
            break
        section = html.find('section', {'class': 'thumb-listing-page'})
        listing = section.find('ul') if section is not None else None
        if listing is None:
            break  # no thumbnail listing on this page, so stop
        imgs = listing.find_all('img')
        page = page + 1
        processOnePages(tmpDir, imgs)


def getAllTags():
    # These should really be collected from the tags page; for this demo
    # a few are filled in by hand
    tags = {'初音未来': 'id:3', '最终幻想VII': 'id:2659', 'Uzumaki Naruto': 'id:1188'}
    return tags


if __name__ == '__main__':
    taglist = getAllTags()
    # One task per tag, run on a thread pool with at most 5 workers
    with ThreadPoolExecutor(max_workers=5) as t:
        for name, tag in taglist.items():
            t.submit(oneSpiderProcess, name, tag)
    # Leaving the with block waits for all submitted tasks to finish

    # test one tag
    # oneSpiderProcess('初音未来', 'id:3')
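On waiting for all the threads: a `ThreadPoolExecutor` used as a context manager already blocks on exit until every submitted task has finished, so no polling loop is needed. The sketch below shows one way to also collect each task's result (or its exception). `crawl_tag` is a hypothetical stand-in for `oneSpiderProcess` that only returns a string, so the example runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_tag(name, tag):
    """Hypothetical stand-in for the per-tag crawler."""
    return '{} ({}) done'.format(name, tag)


def run_all(tags, max_workers=5):
    """Run one task per tag and return the results once all have finished."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(crawl_tag, name, tag) for name, tag in tags.items()]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker
            results.append(future.result())
    return results


print(run_all({'Uzumaki Naruto': 'id:1188'}))
```

Calling `result()` matters in practice: with a bare `submit`, an exception inside a worker is stored on the future and never shown unless something asks for it.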


Original link: https://blog.csdn.net/qq_29367075/article/details/111940621
