Python Crawler - Scraping Images from Duitang

Duitang hosts a large number of photos. This post uses Python to download them.

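Open the search bar and search for book. Many photos come up. Open one of the result links, then click the photo, and you get a URL like https://b-ssl.duitang.com/uploads/item/201205/02/20120502002005_Aja53.jpeg. This is the real address of the photo.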

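We now know what a real image URL looks like, but the search returns far too many images to open one by one and copy each URL by hand. That would be hopelessly inefficient.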

Open the browser's developer tools (Inspect), switch to the Network tab, and refresh the page.

The Name column fills up with files. These are all the files that make up the page.

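Open one of the images and click Preview to see it. Judging by the image size and the URL format, however, it does not match the real URL we are looking for.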

Scroll down the page and new files keep appearing under Name. Click a JSON file whose name looks like
?kw=book&type=feed&include_fields=top_comments%2Cis_root%2Csource_link%2Citem%2Cbuyable%2Croot_id%2Cstatus%2Clike_count%2Clike_id%2Csender%2Calbum%2Creply_count%2Cfavorite_blog_id&type=&start=24&=1549346071315
and you will find fields such as path: "https://b-ssl.duitang.com/uploads/item/201801/02/20180102151225_twrmN.jpeg". From this we can conclude that the page content is loaded through Ajax.
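To make the request easier to read, here is a minimal sketch that decodes the query string shown above with the standard library. The variable names are illustrative only; the query text is copied as it appears in the Network tab.

```python
from urllib.parse import parse_qs, urlsplit

# Query string copied from the request listed in the Network tab.
query = (
    "?kw=book&type=feed&include_fields=top_comments%2Cis_root%2Csource_link"
    "%2Citem%2Cbuyable%2Croot_id%2Cstatus%2Clike_count%2Clike_id%2Csender"
    "%2Calbum%2Creply_count%2Cfavorite_blog_id&type=&start=24&=1549346071315"
)

# keep_blank_values=True keeps parameters whose value is empty.
params = parse_qs(urlsplit(query).query, keep_blank_values=True)

kw = params["kw"][0]                              # the search keyword
start = int(params["start"][0])                   # offset of the first result
fields = params["include_fields"][0].split(",")   # %2C decodes to a comma

print(kw, start, fields)
```

The two values that matter for crawling are kw (the search keyword) and start (the offset of the first result in the batch).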

Build the request that fetches the JSON file:

def get_html(self):
    # Search endpoint that returns one batch of results as JSON.
    url = 'https://www.duitang.com/napi/blog/list/by_search/'
    params = {
        'kw': self.kw,
        'type': 'feed',
        # Plain commas here: requests URL-encodes parameter values itself,
        # so a pre-encoded '%2C' would be sent as '%252C'.
        'include_fields': 'top_comments,is_root,source_link,item,buyable,'
                          'root_id,status,like_count,like_id,sender,album,'
                          'reply_count,favorite_blog_id',
        '_type': '',
        'start': self.start,
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        response = requests.get(url, params=params, headers=headers)
        if response.status_code == 200:
            return response.text
    except requests.ConnectionError as e:
        print(e)
    # Any failure falls through and returns None.
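The paging behaviour can be sketched without touching the network. The example below assumes a batch size of 24, inferred only from the start=24 seen on the second request, and uses hypothetical names (API_URL, PAGE_SIZE, page_url). It relies on urllib.parse.urlencode, which, to my knowledge, is also what requests uses to build the query string from params.

```python
from urllib.parse import urlencode

API_URL = "https://www.duitang.com/napi/blog/list/by_search/"
PAGE_SIZE = 24  # assumption: inferred from start=24 on the second request

INCLUDE_FIELDS = (
    "top_comments,is_root,source_link,item,buyable,root_id,status,"
    "like_count,like_id,sender,album,reply_count,favorite_blog_id"
)


def page_url(kw, page):
    """Build the URL for the given zero-based page of search results."""
    params = {
        "kw": kw,
        "type": "feed",
        "include_fields": INCLUDE_FIELDS,  # commas become %2C when encoded
        "_type": "",
        "start": page * PAGE_SIZE,
    }
    return API_URL + "?" + urlencode(params)


for page in range(3):
    print(page_url("book", page))

# A value that is already percent-encoded gets encoded a second time.
print(urlencode({"include_fields": "a%2Cb"}))
```

The last line shows why the field list is best written with literal commas: an already encoded %2C would come out as %252C.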

Check that each field is present before using it:

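def test(self, response):
    # Walk data -> object_list -> photo -> path, skipping anything missing.
    result = json.loads(response)
    data = result.get('data')
    if data:
        object_list = data.get('object_list')
        if object_list:
            for i in object_list:
                items = {}
                photo = i.get('photo')
                if photo:
                    path = photo.get('path')
                    if path:
                        items['path'] = path
                        # Only yield entries that actually carry an image URL.
                        yield items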
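To see the field checks in action, here is a small self-contained sketch run against a made-up response. The sample payload is hypothetical: only its nesting (data, object_list, photo, path) mirrors what the parser above expects, and extract_paths is an illustrative name, not part of the original script.

```python
import json

# Hypothetical response body; real responses contain many more fields.
sample = json.dumps({
    "data": {
        "object_list": [
            {"photo": {"path": "https://b-ssl.duitang.com/uploads/item/"
                               "201801/02/20180102151225_twrmN.jpeg"}},
            {"photo": None},  # entry without a photo
            {},               # entry without any fields
        ]
    }
})


def extract_paths(text):
    """Yield every photo path found in a JSON response body."""
    result = json.loads(text)
    object_list = (result.get("data") or {}).get("object_list") or []
    for obj in object_list:
        path = (obj.get("photo") or {}).get("path")
        if path:
            yield path


paths = list(extract_paths(sample))
print(paths)
```

Entries with a missing or empty photo are skipped instead of raising an exception, which is the point of the nested checks.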

Use requests once more, this time to download the photo itself:

def get_html_2(self, items):
    try:
        url = items.get('path')
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }
        if 'gif_jpeg' in url:
            # Dropping the last 5 characters ('_jpeg') leaves the GIF URL.
            response = requests.get(url[:-5], headers=headers)
            if response.status_code == 200:
                return ('gif', response)
        elif 'png' in url:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return ('png', response)
        # "'jpg' or 'jpeg' in url" is always true, so test each one explicitly.
        elif 'jpg' in url or 'jpeg' in url:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return ('jpg', response)
        else:
            print('Unknown format.')
    except requests.ConnectionError as e:
        print(e)
    # Returns None when the format is unknown or the request fails.
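The format check can also be separated from the download so it is testable offline. The sketch below uses a hypothetical helper, detect_format, and example.com URLs made up for illustration. It assumes, as the slicing above implies, that GIF URLs end in ".gif_jpeg". It also demonstrates a common pitfall: an expression such as 'jpg' or 'jpeg' in url is always truthy, because the non-empty string 'jpg' is evaluated on its own.

```python
import os
from urllib.parse import urlsplit


def detect_format(url):
    """Return (format, download_url), or None for an unknown extension."""
    if url.endswith(".gif_jpeg"):
        # Assumption: stripping the trailing '_jpeg' gives the GIF URL.
        url = url[:-len("_jpeg")]
    ext = os.path.splitext(urlsplit(url).path)[1].lower().lstrip(".")
    if ext == "jpeg":
        ext = "jpg"
    if ext in ("gif", "png", "jpg"):
        return ext, url
    return None


print(detect_format("https://example.com/a/b.gif_jpeg"))
print(detect_format("https://example.com/a/c.JPEG"))
print(detect_format("https://example.com/a/d.webp"))

# Pitfall: this condition is truthy for every URL, even a .webp one.
always_true = bool("jpg" or "jpeg" in "https://example.com/a/d.webp")
print(always_true)
```

Matching on the file extension rather than on a substring also avoids false hits when "png" or "jpg" happens to appear elsewhere in the URL.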

Finally, save the photo to local disk:

def write_into_file(self, fmt, response):
    # One sub-directory per keyword.
    save_dir = os.path.join(DIST_DIR, self.kw)
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    if fmt not in ('gif', 'png', 'jpg'):
        return
    # Name the file after the MD5 of its content so duplicates are skipped.
    name = md5(response.content).hexdigest()
    file_path = os.path.join(save_dir, '{0}.{1}'.format(name, fmt))
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(response.content)
    else:
        print('Already Downloaded {0}.{1}'.format(name, fmt))
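The duplicate check works because the file name is derived from the content. A minimal sketch of that idea, using a hypothetical save_bytes helper, a temporary directory, and fake bytes instead of a real download:

```python
import os
import tempfile
from hashlib import md5


def save_bytes(directory, content, ext):
    """Write content to <md5 of content>.<ext>; return (path, was_written)."""
    os.makedirs(directory, exist_ok=True)
    name = "{0}.{1}".format(md5(content).hexdigest(), ext)
    file_path = os.path.join(directory, name)
    if os.path.exists(file_path):
        return file_path, False  # same content was saved before
    with open(file_path, "wb") as f:
        f.write(content)
    return file_path, True


with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(tmp, b"fake image bytes", "jpg")
    second = save_bytes(tmp, b"fake image bytes", "jpg")
    saved_count = len(os.listdir(tmp))

print(first[1], second[1], saved_count)
```

Identical bytes always hash to the same name, so the same image reached through two different URLs is stored only once.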

Putting it all together:

import json
import os
import time
from hashlib import md5

import requests

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
DIST_DIR = os.path.join(BASE_DIR, 'dist')

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}


class Spider:
    def __init__(self, kw, start=0):
        self.kw = kw
        self.start = start

    def get_html(self):
        # Fetch one batch of search results as JSON text.
        url = 'https://www.duitang.com/napi/blog/list/by_search/'
        params = {
            'kw': self.kw,
            'type': 'feed',
            # Plain commas: requests URL-encodes parameter values itself.
            'include_fields': 'top_comments,is_root,source_link,item,buyable,'
                              'root_id,status,like_count,like_id,sender,album,'
                              'reply_count,favorite_blog_id',
            '_type': '',
            'start': self.start,
        }
        try:
            response = requests.get(url, params=params, headers=HEADERS)
            if response.status_code == 200:
                return response.text
        except requests.ConnectionError as e:
            print(e)

    def test(self, response):
        # Yield {'path': ...} for every entry that carries an image URL.
        result = json.loads(response)
        data = result.get('data')
        if data:
            object_list = data.get('object_list')
            if object_list:
                for i in object_list:
                    items = {}
                    photo = i.get('photo')
                    if photo:
                        path = photo.get('path')
                        if path:
                            items['path'] = path
                            yield items

    def get_html_2(self, items):
        # Download one image; return (format, response) or None on failure.
        try:
            url = items.get('path')
            if 'gif_jpeg' in url:
                # Dropping the last 5 characters ('_jpeg') leaves the GIF URL.
                response = requests.get(url[:-5], headers=HEADERS)
                if response.status_code == 200:
                    return ('gif', response)
            elif 'png' in url:
                response = requests.get(url, headers=HEADERS)
                if response.status_code == 200:
                    return ('png', response)
            elif 'jpg' in url or 'jpeg' in url:
                response = requests.get(url, headers=HEADERS)
                if response.status_code == 200:
                    return ('jpg', response)
            else:
                print('Unknown format.')
        except requests.ConnectionError as e:
            print(e)

    def write_into_file(self, fmt, response):
        save_dir = os.path.join(DIST_DIR, self.kw)
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        if fmt not in ('gif', 'png', 'jpg'):
            return
        # Name the file after the MD5 of its content so duplicates are skipped.
        name = md5(response.content).hexdigest()
        file_path = os.path.join(save_dir, '{0}.{1}'.format(name, fmt))
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(response.content)
        else:
            print('Already Downloaded {0}.{1}'.format(name, fmt))


def main():
    kw = input('Enter the keyword: ')
    start_time = time.time()
    counter = 0
    for start in range(0, 3600, 24):
        spider = Spider(kw, start=start)
        response = spider.get_html()
        if response is None:
            break
        # test() is a generator, so build a list to detect an empty batch.
        items = list(spider.test(response))
        if not items:
            break
        for item in items:
            result = spider.get_html_2(item)
            if result is None:
                # Download failed or the format is unknown; skip this image.
                continue
            fmt, image = result
            url = item['path'][:-5] if fmt == 'gif' else item['path']
            print('Downloading: {0} It costs {1}s.'.format(url, time.time() - start_time))
            counter += 1
            spider.write_into_file(fmt, image)
    print('Get {0}. It costs {1}s'.format(counter, time.time() - start_time))


if __name__ == '__main__':
    main()

With this, the photos on Duitang can be downloaded.
