Python: My First Crawler Runs into Hotlink Protection (Part 2)

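In the previous post I ran into a hotlink-protection problem. This post explains how the protection works and how I got past it.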

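How hotlink protection works
The HTTP standard defines a dedicated header field, Referer, that records where a request came from.
First, it lets a site trace which page a visitor arrived from.
Second, for a resource file, it reveals the address of the web page that embeds and displays it.
Most hotlink-protection schemes are therefore built on this Referer field.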
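
How a server might apply such a check can be sketched as follows. This is only an illustration, not the actual logic of any particular site: the function name is_allowed_referer, the allowed_hosts whitelist, and the allow_empty policy flag are all assumptions made for the example.

```python
from urllib.parse import urlparse


def is_allowed_referer(referer, allowed_hosts, allow_empty=True):
    """Decide whether to serve a resource, based on the Referer header value.

    referer       -- raw value of the Referer header, or None if absent
    allowed_hosts -- set of host names permitted to embed the resource
    allow_empty   -- hypothetical policy flag: serve requests with no Referer?
    """
    if not referer:
        # Direct visits and some privacy tools send no Referer at all.
        return allow_empty
    host = urlparse(referer).hostname
    return host is not None and host in allowed_hosts


# Hypothetical whitelist: only pages on the site itself may embed its images.
ALLOWED = {"www.mzitu.com"}

print(is_allowed_referer("https://www.mzitu.com/some-page", ALLOWED))  # True
print(is_allowed_referer("https://other.example/gallery", ALLOWED))    # False
```

A crawler gets past a check of this kind by sending a Referer value that the server accepts, which is the approach used in the rest of this post.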

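What hotlink protection is for
On many sites, such as the C2C marketplaces Taobao, Paipai, and Youa, listing a product requires a description, which in turn requires image hosting. To keep photos they worked hard to shoot from being embedded by others, sellers need hotlink protection.
Many image hosts offer hotlink protection, such as Youzhaopian (有照片), Yupoo, Baidu Album, QQ Album, and NetEase Album, but few of them both allow external links from online shops and provide hotlink protection.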

On https://www.mzitu.com/, start from the home page, open the HTML page of each image set, and locate the corresponding image.

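We can get the image's address; the only extra work is to keep updating the Referer field in the HTTP header.
The key point is that every image request must carry a Referer set to the URL of the page that displays that image.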

Each time an image is requested, pass in the freshly updated headers, and the image downloads without trouble.
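
A minimal sketch of that header handling, assuming the pages of an image set are numbered group_url/1, group_url/2, and so on. The helper names build_image_headers and image_page_urls, and the sample URL ending in 12345, are hypothetical and used only for illustration.

```python
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
              "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1")


def build_image_headers(page_url, user_agent=USER_AGENT):
    """Headers for one image request: the Referer is the page showing the image."""
    return {"User-Agent": user_agent, "Referer": page_url}


def image_page_urls(group_url, total):
    """URLs of the numbered pages of one image set (assumed layout: group_url/N)."""
    base = group_url.rstrip("/")
    return ["{}/{}".format(base, i) for i in range(1, total + 1)]


# Hypothetical image-set URL with three pages.
for page_url in image_page_urls("https://www.mzitu.com/12345", 3):
    headers = build_image_headers(page_url)
    print(page_url, "->", headers["Referer"])
```

With the requests library, the resulting dictionary is passed straight through, as in requests.get(img_url, headers=build_image_headers(page_url)).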

The complete code, for reference:

import os
import requests
from bs4 import BeautifulSoup

rootrurl = 'https://www.mzitu.com/'
save_dir = 'D:/estimages/'
no_more_pages = 'END'
max_pages = 10

# A set holds no duplicates; it records the URLs of images already saved.
image_cache = set()

# Request headers that make the crawler look like a browser.
headers = {"Referer": "https://www.mzitu.com/",
           'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}


def getNextPageUrl(html):
    # Find the "next page" link in the navigation bar.
    ahref = html.find('a', {'class': 'next page-numbers'})
    if ahref is None:
        print('no more page')
        return no_more_pages
    return ahref.get('href')


def findTheNum(navi):
    # The largest number shown in the page navigation is the page count.
    num = 0
    for span in navi.find_all('span'):
        # span.string is None when the span has nested tags, so guard first.
        if span.string and span.string.isdigit():
            num = max(num, int(span.string))
    return num


def deepSaveImgs(href, saveDir):
    html = BeautifulSoup(requests.get(href, headers=headers).text, features="html.parser")
    # Find the number of pages in this image set.
    total = findTheNum(html.find('div', {'class': 'pagenavi'}))
    print('total of this group is %d.' % total)
    for i in range(1, total + 1):
        url = '{}/{}'.format(href, i)  # URL of the HTML page that shows image i
        html = BeautifulSoup(requests.get(url, headers=headers).text, features="html.parser")
        img_url = html.find('img', attrs={'class': 'blur'}).get('src')  # actual image address
        # The site uses hotlink protection, so reset the Referer to the page URL.
        # The expected value is visible under Request Headers in the browser's
        # F12 network panel.
        new_headers = {'User-Agent': headers['User-Agent'], 'Referer': url}
        img = requests.get(img_url, headers=new_headers)  # request the image itself
        print(img.url)
        # Write the image to a local file.
        with open('{}/{}'.format(saveDir, img.url.split("/")[-1]), 'wb') as jpg:
            jpg.write(img.content)
        image_cache.add(img.url)


def saveImgs(html, mainidx):
    lis = html.find('ul', {'id': 'pins'}).find_all('li')
    subidx = 1
    for link in lis:
        # Step 1: save the cover image and create the folder.
        a = link.find('a')
        href = a.get('href')
        img = a.find('img').get('data-original')
        print('cover image: ' + img)
        tag = '{}{}/{}/'.format(save_dir, mainidx, subidx)
        if not os.path.exists(tag):
            os.makedirs(tag)
        with open('{}/{}'.format(tag, "coverImg_" + img.split("/")[-1]), 'wb') as jpg:
            # Send the headers here too; the original call omitted them, so the
            # cover request carried no Referer at all.
            jpg.write(requests.get(img, headers=headers).content)
        image_cache.add(img)
        # Step 2: enter the image set's own pages and save every image.
        deepSaveImgs(href, tag)
        subidx = subidx + 1


if __name__ == '__main__':
    url = rootrurl
    idx = 1
    while True:
        print("next page: " + url)
        html = BeautifulSoup(requests.get(url, headers=headers).text, features="html.parser")
        saveImgs(html, idx)  # process the current listing page
        if idx >= max_pages:
            break
        idx = idx + 1
        url = getNextPageUrl(html)  # get the next listing page
        if url == no_more_pages:
            break

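It works quite well. Locally, the images are saved into a two-level directory tree: one folder per listing page, with one subfolder per image set inside it.
As before, I can't show the contents here, so I'll leave it at that.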
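
The two-level layout can be sketched on its own, without any network access. The helper name group_dir is hypothetical, and a temporary directory stands in for the real save_dir.

```python
import os
import tempfile


def group_dir(save_dir, mainidx, subidx):
    """Create and return save_dir/<listing page index>/<image set index>."""
    path = os.path.join(save_dir, str(mainidx), str(subidx))
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


with tempfile.TemporaryDirectory() as root:
    group_dir(root, 1, 1)
    group_dir(root, 1, 2)
    group_dir(root, 2, 1)
    for main in sorted(os.listdir(root)):
        print(main, sorted(os.listdir(os.path.join(root, main))))
```

Using os.path.join and exist_ok=True avoids both the doubled slash and the separate os.path.exists check that the string-formatting version needs.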

This post draws on:
https://www.cnblogs.com/Mail-maomao/p/7955194.html
