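Batch-scraping an image site with anti-crawler protection using requests + bs4

Overview: scraping an image site that uses anti-crawler measures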

Preview

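Problem encountered:

When I first ran the crawler, every image it downloaded turned out to be the same promotional image, served through a redirect.
Solution: set the Referer field in the requests headers so that it points to the top-level domain of the site being scraped (adjust this to the site in question).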
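
As a minimal sketch of that fix (not necessarily the exact headers the site requires): the helpers below derive the site root from any page URL and use it as the Referer for image requests. The names `site_root` and `build_image_headers` and the sample URL are hypothetical, introduced only for illustration; whether the image host accepts the site root or needs a deeper path as Referer depends on the site.

```python
from urllib.parse import urlsplit

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
)


def site_root(page_url):
    """Return scheme://host/ for a page URL (hypothetical helper)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/"


def build_image_headers(page_url):
    """Build headers for an image request, with the site root as Referer."""
    return {"User-Agent": USER_AGENT, "Referer": site_root(page_url)}


headers = build_image_headers("http://www.xxx.com/xiaohua/1264_2.html")
print(headers["Referer"])
# The download itself would then look like this (needs the requests package):
# requests.get(image_url, headers=headers, stream=True)
```

If the image host still answers with the promotional image, trying the section URL or the gallery page URL as Referer is the next thing to test.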

Crawler code

import json
import os
import random
import re
import time
from contextlib import closing

import requests
from bs4 import BeautifulSoup

# Download directory
DOWNLOAD_PATH = 'C:\\pictures\\美女校花\\'
# Maximum number of list pages
MAX_PAGES = 30
# Base URL of the site being scraped
BASE_URL = 'http://www.xxx.com/'
BEAUTY_BASE_URL = BASE_URL + 'xiaohua/'
# List page URL
LIST_URL_ITEM = 'list_1_1.html'
# Image host, e.g. http://img1.xxx.me/pic/3683/1.jpg
BASIC_IMG_URL = 'http://img1.xxx.me/pic/'

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")


class mmPicture(object):
    """Image downloader."""

    def __init__(self):
        super(mmPicture, self).__init__()
        self.offset = 1
        self.list_url = BEAUTY_BASE_URL
        self.all_group_links = []
        self.all_img_links = []

    # Request the list pages
    def requestDataList(self):
        headers = {"User-Agent": USER_AGENT}
        if self.offset != 1:
            self.list_url = BEAUTY_BASE_URL + 'list_1_' + str(self.offset) + '.html'
        print(self.list_url)
        with closing(requests.get(self.list_url, headers=headers)) as response:
            response.encoding = "GBK"
            soup = BeautifulSoup(response.text, "html.parser")
            for sou in soup.find_all("dd"):
                link = sou.find('a')
                # Keep only gallery links; the host name in the pattern is as in
                # the original and has to match the site behind BASE_URL
                if link is not None and re.match(r'.*/{2}www\.mm131\.com/.*', link.get('href', '')):
                    self.all_group_links.append(link['href'])
        self.offset += 1
        print(self.offset)
        if self.offset < MAX_PAGES:
            time.sleep(1)  # pause before requesting the next list page
            self.requestDataList()
        else:
            with open('./group.json', 'w') as f:
                f.write(json.dumps(self.all_group_links))
            # Continue with the detail pages
            self.requestDetail()

    # Request the detail pages
    def requestDetail(self):
        # Read the gallery links saved locally
        with open('./group.json', 'r') as f:
            self.all_group_links = json.load(f)
        headers = {"User-Agent": USER_AGENT}
        # Collect the image links of every gallery
        for list_item_url in self.all_group_links:
            with closing(requests.get(list_item_url, headers=headers)) as response:
                response.encoding = "GBK"
                soup = BeautifulSoup(response.text, "html.parser")
                # Pagination links look like href="1264_2.html"
                page_links = soup.select('.page-en')
                for page in page_links:
                    match = re.match(r'\d+_\d+', page['href'])
                    if match is None:
                        continue  # skip hrefs without the <id>_<page> form
                    group_id, page_no = match[0].split('_')
                    img_url = group_id + '/' + page_no + '.jpg'
                    self.all_img_links.append({'name': soup.find('h5').string, 'url': img_url})
            print(list_item_url)
        # Save the image links to a local file
        with open('./links.json', 'w') as f:
            f.write(json.dumps(self.all_img_links))
        # Download all images
        self.donwloadALLImgs()

    # Download all images
    def donwloadALLImgs(self):
        with open('./links.json', 'r') as f:
            self.all_img_links = json.load(f)
        for imgItem in self.all_img_links:
            print(BASIC_IMG_URL + imgItem['url'])
            # Download the file
            self.downloads(imgItem)

    # Create a directory
    def mkdir(self, path):
        # Strip leading and trailing whitespace
        path = path.strip()
        # Strip a trailing backslash
        path = path.rstrip("\\")
        if not os.path.exists(path):
            # Create the directory if it does not exist yet
            os.makedirs(path)
            return True
        # The directory already exists, so nothing is created
        return False

    # Download one image
    def downloads(self, item):
        headers = {
            "User-Agent": USER_AGENT,
            # Without a Referer the image host returns the promotional image
            "Referer": BEAUTY_BASE_URL,
        }
        with closing(requests.get(BASIC_IMG_URL + item['url'], headers=headers, stream=True)) as response:
            chunk_size = 1024  # bytes read per chunk
            # Total body size in MB (0 if the server sends no content-length)
            content_size = int(response.headers.get('content-length', 0)) / chunk_size / chunk_size
            data_count = 0
            print('\nDownload started:\n')
            # Create the target directory
            self.mkdir(DOWNLOAD_PATH + item['name'])
            # Write the file chunk by chunk
            file_path = DOWNLOAD_PATH + item['name'] + '\\' + item['name'] + str(random.randrange(0, 1000)) + '.jpg'
            with open(file_path, 'wb') as file:
                for data in response.iter_content(chunk_size=chunk_size):
                    file.write(data)
                    data_count += len(data) / chunk_size / chunk_size
                    if content_size:
                        now_progress = (data_count / content_size) * 100
                        print("\r Download progress: %d%%(%d M/%d M) - %s " % (now_progress, data_count, content_size, item['name']), end=" ")
            print('\n\nDownload finished!\n')


# Download the images.
# donwloadALLImgs() reads ./links.json, which is written by requestDetail();
# on a first run call mm.requestDataList() instead to run the whole pipeline.
mm = mmPicture()
mm.donwloadALLImgs()
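
The URL handling in the crawler can be tried out without touching the network. The sketch below reproduces its two string transformations under the same constants: building a list-page URL from the page number, and turning a pagination href such as 1264_2.html into the image path 1264/2.jpg. The function names `list_page_url` and `image_url_from_href` are hypothetical, and returning None for an href that does not match is an assumption made here for illustration.

```python
import re

BEAUTY_BASE_URL = "http://www.xxx.com/xiaohua/"
BASIC_IMG_URL = "http://img1.xxx.me/pic/"

# Matches "<gallery id>_<page number>" at the start of a pagination href
HREF_PATTERN = re.compile(r"(\d+)_(\d+)")


def list_page_url(offset):
    """Page 1 is the section root; later pages are list_1_<n>.html."""
    if offset == 1:
        return BEAUTY_BASE_URL
    return f"{BEAUTY_BASE_URL}list_1_{offset}.html"


def image_url_from_href(href):
    """Map e.g. '1264_2.html' to the image URL '<host>/1264/2.jpg'."""
    match = HREF_PATTERN.match(href)
    if match is None:
        return None  # href without the <id>_<page> form
    gallery_id, page_no = match.groups()
    return f"{BASIC_IMG_URL}{gallery_id}/{page_no}.jpg"


print(list_page_url(2))                    # http://www.xxx.com/xiaohua/list_1_2.html
print(image_url_from_href("1264_2.html"))  # http://img1.xxx.me/pic/1264/2.jpg
```

Keeping these as pure functions makes it easy to check the URL scheme against a few hrefs copied from the site before starting a long crawl.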

