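Preface

This post builds a multithreaded Python crawler that downloads every photo published on the 唯美女生 (Vmgirls) photography site.

Target site

唯美女生 (Vmgirls): https://www.vmgirls.com/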

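Dependencies

pip install requests
pip install beautifulsoup4
pip install lxml
pip install fake_useragent
pip install tqdm

requests: sends HTTP requests to a page and returns the response.
beautifulsoup4: locates and parses elements in the page.
lxml: the parser backend that the script passes to BeautifulSoup ('lxml').
fake_useragent: generates a random, spoofed User-Agent string.
tqdm: prints a progress bar for the downloads.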

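Crawling approach

The goal is to download every image on the site. The images live inside individual articles, so the article links have to be found first.

Well-maintained sites usually provide a sitemap that lists the title and link of every article published so far. Fortunately, this site has one.

The crawler reads all article titles and links from the sitemap, uses each title as the name of the folder the images are saved in, and extracts the image URLs from each article page to save them locally.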
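The first step, turning the sitemap into a list of (link, folder name) pairs, can be sketched with the standard library alone. This is only an illustration: the `LinkCollector` class, the `unique_folder_names` helper, and the HTML fragment are hypothetical, and the real sitemap markup may differ. Unlike the full script, which narrows the search with the CSS selector `h3 + ul li a`, this sketch collects every `<a>` tag that has both an `href` and a `title`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, title) pairs from every <a> tag that carries both attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        if attr.get("href") and attr.get("title"):
            self.links.append((attr["href"], attr["title"]))


def unique_folder_names(links):
    """Append a counter to repeated titles so each article gets its own folder."""
    seen = {}
    result = []
    for href, title in links:
        seen[title] = seen.get(title, 0) + 1
        folder = title if seen[title] == 1 else f"{title}{seen[title]}"
        result.append((href, folder))
    return result


# Hypothetical sitemap fragment; the real page's markup may differ.
SAMPLE = """
<h3>Articles</h3>
<ul>
  <li><a href="a/1.html" title="Spring">Spring</a></li>
  <li><a href="a/2.html" title="Summer">Summer</a></li>
  <li><a href="a/3.html" title="Spring">Spring</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE)
articles = unique_folder_names(collector.links)
print(articles)
```

The second "Spring" article ends up in a folder named "Spring2", which mirrors how the full script names folders for duplicate titles.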

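As of April 28, 2021, the site had published 1,363 articles. To speed up the crawl, a thread pool handles the articles concurrently, with each article's link and title submitted as a separate task.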
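A rough sketch of that one-task-per-article pattern is shown here with `concurrent.futures`. The `crawl_article` worker is a placeholder that performs no network access, `crawl_all` is a hypothetical helper, and `max_workers=8` is an arbitrary choice. One point worth noting: an exception raised inside a submitted task is stored on its `Future` and only surfaces when `result()` is called, so the sketch collects failures explicitly instead of letting them pass silently.

```python
import concurrent.futures as cf


def crawl_article(link, title):
    """Placeholder worker: a real one would download every image of the article."""
    if not link.startswith("https://"):
        raise ValueError(f"unexpected link: {link}")
    return f"{title}: done"


def crawl_all(articles, max_workers=8):
    """Run crawl_article over all articles and collect successes and failures."""
    finished, failed = [], []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its article title for error reporting.
        futures = {pool.submit(crawl_article, link, title): title
                   for link, title in articles}
        for future in cf.as_completed(futures):
            title = futures[future]
            try:
                finished.append(future.result())
            except Exception as exc:
                failed.append((title, str(exc)))
    return finished, failed


finished, failed = crawl_all([
    ("https://example.com/a.html", "A"),
    ("ftp://example.com/b.html", "B"),
])
print(finished, failed)
```

Keeping the failed titles around makes it possible to retry just those articles after the pool has finished.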

唯美女生 (Vmgirls) sitemap: https://www.vmgirls.com/sitemap.html

Full code

GitHub: https://github.com/XavierJiezou/python-vmgirls-crawl

import os
import time
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
import concurrent.futures as cf
from fake_useragent import UserAgent


class VmgirlsDownloader():
    def __init__(self):
        self.root = 'vmgs'  # top-level download folder
        os.makedirs(self.root, exist_ok=True)
        self.site = 'https://www.vmgirls.com/'
        self.sitemap = 'https://www.vmgirls.com/sitemap.html'  # article list is read from the sitemap
        self.headers = {'referer': self.site, 'user-agent': UserAgent().random}
        self.page()
        self.main()

    def page(self):
        """Collect [link, title] for every article and create one folder per title."""
        resp = requests.get(self.sitemap, headers=self.headers)
        time.sleep(5)
        soup = BeautifulSoup(resp.content, 'lxml')
        temp = soup.select('h3 + ul li a')  # locate the article list
        articles = []
        temp_dict = {}  # how many times each title has been seen
        for item in temp:
            href = self.site + item.get('href')
            title = item.get('title')
            if temp_dict.get(title) is None:
                temp_dict[title] = 1
            else:
                temp_dict[title] += 1
                title += str(temp_dict[title])  # duplicate titles get a numeric suffix
            os.makedirs(os.path.join(self.root, title), exist_ok=True)
            articles.append([href, title])
        self.articles = articles

    def save(self, img_link, img_path):
        """Download one image and write it to img_path."""
        resp = requests.get(img_link, headers=self.headers)
        time.sleep(3)
        with open(img_path, 'wb') as f:
            f.write(resp.content)

    def down(self, article_link, article_title):
        """Download every image found in one article."""
        resp = requests.get(article_link, headers=self.headers)
        time.sleep(5)
        soup = BeautifulSoup(resp.content, 'lxml')
        imgs = soup.select('div.nc-light-gallery img')  # locate all images in the article
        # Number images by position so that a re-run skips files that already
        # exist instead of getting stuck on the first existing file name.
        for name, item in enumerate(tqdm(imgs, desc=article_title), start=1):
            # Image URLs are treated as protocol-relative ('//host/path'), as in the original script.
            if 'https:' not in item.get('src'):
                img_link = 'https:' + item.get('src')
            else:
                img_link = 'https:' + item.get('srcset').split(' ')[0]
            img_path = f'{self.root}/{article_title}/{name}.{img_link.split(".")[-1]}'
            if not os.path.exists(img_path):
                self.save(img_link, img_path)

    def main(self):
        """Submit one download task per article to a thread pool."""
        with cf.ThreadPoolExecutor() as tp:
            for article_link, article_title in self.articles:
                tp.submit(self.down, article_link, article_title)


if __name__ == '__main__':
    VmgirlsDownloader()
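The script builds image URLs, file extensions, and folder names through plain string concatenation. Below is a minimal sketch of more defensive helpers built on the standard library. All four function names are hypothetical and not part of the original script, the example hosts are placeholders, and the set of characters replaced in folder names is an assumption based on common Windows restrictions.

```python
import os
import re
from urllib.parse import urljoin, urlparse

SITE = "https://www.vmgirls.com/"


def first_srcset_url(srcset):
    """Return the URL of the first candidate in a srcset attribute value."""
    return srcset.split(",")[0].strip().split(" ")[0]


def absolute_url(raw, base=SITE):
    """Resolve relative and protocol-relative URLs against the site root."""
    return urljoin(base, raw)


def file_extension(url, default=".jpg"):
    """Take the extension from the URL path, ignoring any query string."""
    ext = os.path.splitext(urlparse(url).path)[1]
    return ext or default


def safe_folder_name(title):
    """Replace characters that commonly break file paths with an underscore."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


srcset = "//cdn.example.com/a.jpg 300w, //cdn.example.com/b.jpg 600w"
link = absolute_url(first_srcset_url(srcset))
print(link, file_extension(link), safe_folder_name("a/b: c?"))
```

`urljoin` leaves absolute URLs untouched and fills in the scheme for protocol-relative ones, so a single call covers both branches that the script handles by hand.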

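Results

Download the 1.62 GB image set: OneDrive | Baidu Netdisk (extraction code: 2233) | Tianyi Cloud

Item	Description
Target site	https://www.vmgirls.com/ (唯美女生)
Crawl date	April 28, 2021
Image count	17,601
Total size	1,742,902,332 bytes (about 1.62 GB)
Image types	png, jpg, and jpeg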
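The size figure in the table can be checked with a few lines of arithmetic. The sketch below uses only the two numbers from the table. It shows that "1.62 GB" is measured in binary units (1 GB = 1024^3 bytes) and that the average image comes to roughly 97 KiB.

```python
total_bytes = 1_742_902_332
image_count = 17_601

size_gib = total_bytes / 1024 ** 3          # bytes to binary gigabytes
avg_kib = total_bytes / image_count / 1024  # average size per image

print(f"{size_gib:.2f} GiB in total, about {avg_kib:.0f} KiB per image")
```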

Single-image preview: [image]

Multi-image preview: [image]

References

https://github.com/psf/requests
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
https://github.com/hellysmile/fake-useragent
https://github.com/tqdm/tqdm

