Crawling every image in the 妹子图 category of the 糗妹妹 (qiumeimei) site with Scrapy

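The goal is to crawl all the images and create one folder for the images of each listing page. The tricky part is that many of the images are .gif files, so the default download rule has to be overridden.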

Create the Scrapy project

scrapy startproject qiumeimei

Create the spider

cd qiumeimei
scrapy genspider -t crawl qmm www.xxx.com

Define the item fields in items.py

import scrapy


class QiumeimeiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = scrapy.Field()
    image_url = scrapy.Field()

Write the main spider in qmm.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qiumeimei.items import QiumeimeiItem


class QmmSpider(CrawlSpider):
    name = 'qmm'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.qiumeimei.com/image']

    # Follow every pagination link and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'http://www.qiumeimei.com/image/page/\d+'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # The last URL segment is the page number; fall back to '1' otherwise
        page = response.url.split('/')[-1]
        if not page.isdigit():
            page = '1'
        image_urls = response.xpath('//div[@class="main"]/p/img/@data-lazy-src').extract()
        for image_url in image_urls:
            item = QiumeimeiItem()
            item['image_url'] = image_url
            item['page'] = page
            yield item
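The page-number logic in parse_item can be tried on its own, without running a crawl. The following is a minimal sketch and not part of the original project: page_from_url is a hypothetical helper that repeats the split('/')[-1] and isdigit() steps, and PAGE_LINK reuses the pattern from the Rule with the dots escaped.

```python
import re

# Same pattern as the Rule's `allow` argument, with the dots escaped.
PAGE_LINK = re.compile(r'http://www\.qiumeimei\.com/image/page/\d+')


def page_from_url(url):
    """Return the page number held in the last URL segment, or '1' if none."""
    last = url.split('/')[-1]
    return last if last.isdigit() else '1'


if __name__ == '__main__':
    for url in ('http://www.qiumeimei.com/image',
                'http://www.qiumeimei.com/image/page/7'):
        # Show whether the Rule pattern matches and which folder name results
        print(url, bool(PAGE_LINK.search(url)), page_from_url(url))
```

Note that a URL ending in a slash has an empty last segment, so this logic would file its images under page '1'.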

Define the download rules in pipelines.py

import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.utils.misc import md5sum
# Scrapy's built-in pipeline for downloading and processing images
from scrapy.pipelines.images import ImagesPipeline
# Root directory for stored images, as configured in settings.py
from qiumeimei.settings import IMAGES_STORE as images_store


# The pipeline must inherit from ImagesPipeline
class QiumeimeiPipeline(ImagesPipeline):

    # Name each stored file after the last segment of its URL
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return file_name

    # Override the parent method that issues the download requests
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item['image_url'])

    # Called once the image for an item has been stored
    def item_completed(self, results, item, info):
        # print(results)
        page = item['page']
        print('Downloading images from page ' + page)
        image_url = item['image_url']
        image_name = image_url.split('/')[-1]
        old_name_list = [x['path'] for t, x in results if t]
        if not old_name_list:
            # The download failed, so there is no file to move
            raise DropItem('Image download failed: ' + image_url)
        # Path where the image was actually stored
        old_name = images_store + old_name_list[0]
        image_path = images_store + page + "/"
        # Check whether the directory for this page already exists
        if not os.path.exists(image_path):
            # Create the directory for the current page number
            os.mkdir(image_path)
        # New name inside the page directory
        new_name = image_path + image_name
        # Move the file into the page directory
        os.rename(old_name, new_name)
        return item

    # Override the download rule so that GIFs are stored unconverted
    def image_downloaded(self, response, request, info):
        checksum = None
        for path, image, buf in self.get_images(response, request, info):
            if checksum is None:
                buf.seek(0)
                checksum = md5sum(buf)
            width, height = image.size
            if self.check_gif(image):
                # Write the original response body so the animation is kept
                self.persist_gif(path, response.body, info)
            else:
                self.store.persist_file(
                    path, buf, info,
                    meta={'width': width, 'height': height},
                    headers={'Content-Type': 'image/jpeg'})
        return checksum

    def check_gif(self, image):
        # An image whose format is None is treated as a GIF
        return image.format is None

    def persist_gif(self, key, data, info):
        absolute_path = self.store._get_filesystem_path(key)
        self.store._mkdir(os.path.dirname(absolute_path), info)
        # use 'b' to write binary data
        with open(absolute_path, 'wb') as f:
            f.write(data)
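check_gif above infers a GIF from the processed image having no format. A different approach, which the original pipeline does not use, is to look at the first bytes of response.body: GIF files start with the signature GIF87a or GIF89a. A minimal sketch of that check follows; looks_like_gif is a hypothetical name introduced only for illustration.

```python
# The two signatures a GIF file can start with.
GIF_SIGNATURES = (b'GIF87a', b'GIF89a')


def looks_like_gif(data):
    """Return True if the raw bytes begin with a GIF file signature."""
    return data[:6] in GIF_SIGNATURES
```

Inside image_downloaded, such a check could be called as looks_like_gif(response.body) in place of self.check_gif(image), assuming the server returns the file unmodified.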

Set the User-Agent request header and enable the download pipeline in settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'

ITEM_PIPELINES = {
    'qiumeimei.pipelines.QiumeimeiPipeline': 300,
}
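pipelines.py imports IMAGES_STORE from qiumeimei.settings, but the settings listed here do not define it, so the import would fail without it. A minimal sketch of the missing line follows; the directory name is an assumption, since the original post does not show the value.

```python
# settings.py (addition)
# Assumed value: the original post does not show this setting.
# Keep the trailing slash, because pipelines.py builds file paths
# by plain string concatenation (images_store + page + "/").
IMAGES_STORE = './images/'
```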

Run the spider

scrapy crawl qmm --nolog

Check that the folders were downloaded successfully

The .gif files are animated images.

Done.
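One way to check the result without opening each folder is to count the files per page folder by extension. The sketch below is not from the original post: summarize is a hypothetical helper, and the directory passed to it should be whatever IMAGES_STORE points to.

```python
import os
from collections import Counter


def summarize(images_root):
    """Map each page folder name to a Counter of the file extensions in it."""
    summary = {}
    for entry in sorted(os.listdir(images_root)):
        folder = os.path.join(images_root, entry)
        if not os.path.isdir(folder):
            # Skip stray files left directly under the root directory
            continue
        summary[entry] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder)
        )
    return summary
```

Calling summarize on the image directory returns one entry per page folder, so a page with no '.gif' count, or a missing page number, stands out quickly.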

Reposted from: https://www.cnblogs.com/nmsghgnv/p/11359877.html
