Python Tips (1): Bulk-Scraping Baidu Images Without the 30-Image Limit

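Preface: Whether for everyday use (hunting for wallpapers, pictures of pretty girls, and so on) or for research projects (image recognition), we often need to collect a large batch of images of some kind. Most of the time that means clicking and downloading them one by one. Is there a way to have a program do this work for us? There is: a web crawler.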

1. What You Need:

- Python
- os
- re
- time
- random
- requests (third-party; install with pip install requests)

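2. Why This Is Hard:

Baidu has anti-crawler measures, and many of the crawler scripts found online can only fetch 30 images (one page).
At this point you might say: can't you just step through the pages one at a time to get images from multiple pages?
That was my first thought too, but then I found that the Baidu results page has no pagination, so the URL has no page number for us to change.
After digging through a lot of material online, I finally found a crawler that could walk through the results page by page. But when I happily ran it to completion, the number of downloaded images fell far short of the page count I had set. Why?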
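The page-by-page traversal mentioned above steps through a result offset rather than a page number. Here is a minimal sketch, assuming the `pn` (offset) and `rn` (batch size, 30) query parameters that appear in the program in Section 3. `page_offsets` is a hypothetical helper name used only for illustration; it is not part of the original script.

```python
def page_offsets(pages, page_size=30):
    """Return the offset of the first result on each page (assumed to map to 'pn')."""
    return [page * page_size for page in range(pages)]


# Three pages of 30 results each start at offsets 0, 30 and 60.
print(page_offsets(3))  # [0, 30, 60]
```

Each offset would then be sent as the `pn` value of one request, so "page 2" is simply the batch that starts at result 30.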
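It turned out the crawler was fetching too fast. Baidu's backend recognized that a program, not a person, was doing the downloading, so after a certain number of images it started rejecting my requests.
So is there nothing to be done? How do we get around the backend's detection?
......
There may well be a way, but that's beyond a novice like me, haha. Still, even if we can't get around the detection, we can fool it.
How? The trick is simpler than you'd imagine!!
Just add a delay before sending each request, with the length of the delay produced by the random module. That makes it much harder for the system to tell that a program is at work!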
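The random-delay idea can be sketched on its own, separate from any Baidu-specific details. This is only an illustration under assumptions: `fetch_with_jitter` and its parameters are hypothetical names, and it uses `random.uniform` for fractional delays, whereas the program in Section 3 uses `random.randint(0, 3)`. The `sleep` argument exists only so the delay can be swapped out when experimenting.

```python
import random
import time


def fetch_with_jitter(fetch, items, min_delay=0.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(item) for each item, pausing a random interval before each call."""
    results = []
    for item in items:
        # A different pause each time, so requests do not arrive at a fixed rhythm.
        sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(item))
    return results


# Stand-in for a real request function; tiny delays keep the demo fast.
print(fetch_with_jitter(str.upper, ["a", "b"], max_delay=0.01))  # ['A', 'B']
```

In a real crawler, `fetch` would be a function that sends one request, for example a small wrapper around `requests.get`.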

3. Writing the Program:

# -*- coding:utf-8 -*-
import os
import random
import re
import time

import requests

URL = 'https://image.baidu.com/search/acjson'


def makeParams(keyword, pn):
    # Query parameters for one batch of 30 results; pn is the result offset.
    return {
        'tn': 'resultjson_com', 'ipn': 'rj', 'ct': 201326592, 'is': '',
        'fp': 'result', 'queryWord': keyword, 'cl': 2, 'lm': -1,
        'ie': 'utf-8', 'oe': 'utf-8', 'adpicid': '', 'st': -1, 'z': '',
        'ic': 0, 'word': keyword, 's': '', 'se': '', 'tab': '',
        'width': '', 'height': '', 'face': 0, 'istype': 2, 'qc': '',
        'nc': 1, 'fr': '', 'pn': pn, 'rn': 30, 'gsm': '1e',
        '1488942260214': '',
    }


def getManyPages(keyword, pages):
    # Earlier version with no delay between requests; kept for comparison.
    urls = []
    for pn in range(30, 30 * pages + 30, 30):
        response = requests.get(URL, params=makeParams(keyword, pn))
        urls.append(response.json().get('data'))
    return urls


def getpage(key, page):
    result = []
    for pn in range(0, page * 30 + 30, 30):
        params = makeParams(key, pn)
        randnumber1 = random.randint(0, 3)  # generate a random number
        time.sleep(randnumber1)  # sleep for that many seconds
        print(params)
        try:
            result.append(requests.get(URL, params=params).json().get('data'))
        except Exception:  # still rejected despite the delay
            # print('error\n')
            randnumber2 = random.randint(5, 10)  # wait a longer random time
            time.sleep(randnumber2)
    return result


def getImg(dataList, localPath, keyword):
    os.makedirs(localPath, exist_ok=True)  # make sure the output directory exists
    i = 1
    for page in dataList:
        if not page:  # this batch has no 'data'
            continue
        for each in page:
            thumb = each.get('thumbURL')
            if thumb is None:  # skip entries without a thumbnail URL
                continue
            print('downloading:%s' % thumb)
            try:
                pic = requests.get(thumb, timeout=10)
            except requests.exceptions.RequestException:
                print('error: This photo cannot be downloaded')
                continue
            path = os.path.join(localPath, keyword + '_' + str(i) + '.jpg')
            with open(path, 'wb') as fp:
                fp.write(pic.content)
            i += 1


def dowmloadPic(html, keyword):
    # Alternative: pull full-size image URLs out of raw page HTML (not used below).
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    os.makedirs('image', exist_ok=True)
    i = 1
    print('Found images for keyword: ' + keyword + ', downloading...')
    for each in pic_url:
        print('No ' + str(i) + '.jpg is downloading, URL:' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            print('error: This photo cannot be downloaded')
            continue
        path = 'image/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


if __name__ == '__main__':
    # dataList = getManyPages('shoes', 10)  # keyword and number of pages
    # getImg(dataList, 'image', 'shoes')  # output directory
    keyword = 'shoes'  # change keyword to search for different images
    dataList = getpage(keyword, 50)  # keyword and number of pages
    getImg(dataList, 'image', keyword)  # output directory

4. Running the Program and Conclusion

As we can see, the program scraped 1422 images.

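In theory this crude trick can fool the backend, but requests still get blocked sometimes, so you may need to run it a few times to get the number of images you want.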
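In the program above, a request that is rejected gets skipped rather than tried again. One possible way to cut down on manual reruns is to retry a blocked request after a longer random wait. The sketch below is an assumption-laden illustration, not part of the original script: `retry_with_backoff`, `attempts` and `base_delay` are hypothetical names, and the 5 to 10 second base range simply mirrors the `random.randint(5, 10)` wait used in Section 3.

```python
import random
import time


def retry_with_backoff(action, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Run action(); on failure wait a random, growing delay, then try again."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay..2*base_delay seconds, scaled up on each retry.
            sleep(random.uniform(base_delay, 2 * base_delay) * (attempt + 1))
```

Here `action` would be a zero-argument function that performs one page request, for example `lambda: requests.get(url, params=p).json()`. Whether this is enough to avoid being blocked is not something the sketch can guarantee.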
