Scraping high-resolution images from Autohome (汽车之家) with the Scrapy framework

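Unlike the previous post, this one scrapes the high-resolution images rather than just the thumbnails.
First, take a look at the page to scrape: https://car.autohome.com.cn/pic/series/3464.html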

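The previous post only scraped the thumbnails on this page, and many of the images under each category were missed. The goal this time is to follow each category into its detail page and scrape the high-resolution images there.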

Preparation

Site URL format

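First, look at the URL format of the target site.
The start page is:
https://car.autohome.com.cn/pic/series/3464.html
Inspecting the "More" link of one category gives: https://car.autohome.com.cn/pic/series/3464-10.html#pvareaid=2042222
The part after # can be dropped, leaving:
https://car.autohome.com.cn/pic/series/3464-10.html
Inspecting the "More" link of another category gives:
https://car.autohome.com.cn/pic/series/3464-3.html
In addition, a category's detail page may span several pages. The seats (车厢座椅) detail page, for example, has many pages, and the URL of its second page is:
https://car.autohome.com.cn/pic/series/3464-3-p2.html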

The URLs all follow a very similar pattern, so CrawlSpider is a good fit.
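
To make the URL pattern concrete, here is a minimal sketch that checks the example URLs above against one regular expression. The pattern and the `classify` helper are my own illustration, not part of the project, and the pattern is stricter than the `3464.+` expression used in the spider's rule.

```python
import re

# Hypothetical pattern: the series page, optionally followed by "-<category id>"
# and "-p<page number>", as seen in the example URLs.
SERIES_URL = re.compile(
    r"^https://car\.autohome\.com\.cn/pic/series/3464"
    r"(?:-(?P<category>\d+))?(?:-p(?P<page>\d+))?\.html$"
)


def classify(url):
    """Return (category_id, page_number) for a series URL, or None if it does not match."""
    url = url.split("#", 1)[0]  # the fragment after '#' can be dropped
    match = SERIES_URL.match(url)
    if match is None:
        return None
    category = match.group("category")
    page = match.group("page")
    # A missing category means the start page; a missing page suffix means page 1.
    return (int(category) if category else None, int(page) if page else 1)


for example in [
    "https://car.autohome.com.cn/pic/series/3464.html",
    "https://car.autohome.com.cn/pic/series/3464-10.html#pvareaid=2042222",
    "https://car.autohome.com.cn/pic/series/3464-3-p2.html",
]:
    print(example, classify(example))
```

Because every category and every page shares this shape, a single link-extraction rule is enough to reach all of them.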

settings configuration

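ITEM_PIPELINES = {
    # 'bmw.pipelines.BmwPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    # The framework downloads the images and they are sorted into per-category
    # directories, so neither the default pipeline nor the stock ImagesPipeline is used
    'bmw.pipelines.BMWImagesPipeline': 1,
}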

# Image download directory, used by the images pipeline
# (this needs `import os` at the top of settings.py)
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')

Request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                  'like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}

ROBOTSTXT_OBEY = False

Program

The overall structure of the program is as follows:

bmw5.py

from bmw.items import BmwItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

"""
High-resolution image URL: https://car2.autoimg.cn/cardfs/product/g29/M04/71/D7/autohomecar__ChsEn10CSUmAHmqsAAm2XOgbAEI053.jpg
Thumbnail URL: https://car2.autoimg.cn/cardfs/product/g29/M04/71/D7/240x180_0_q95_c42_autohomecar__ChsEn10CSUmAHmqsAAm2XOgbAEI053.jpg
The two URLs differ only in the 240x180_0_q95_c42_ prefix, so that prefix tells them apart.
"""


class Bmw5Spider(CrawlSpider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/3464.html']

    # With CrawlSpider, the parse method must not be overridden
    rules = (
        Rule(LinkExtractor(allow=r'https://car.autohome.com.cn/pic/series/3464.+'),
             follow=True, callback='parse_page'),
    )

    def parse_page(self, response):
        category = response.xpath('//div[@class="uibox"]/div/text()').get()
        # Note: all images sit inside the div whose class is
        # "uibox-con carpic-list03 border-b-solid". That is several class names,
        # so contains() is needed.
        # The URLs found here are thumbnails, so apply the conversion described
        # in the docstring above.
        srcs = response.xpath('//div[contains(@class, "uibox-con")]/ul/li//img/@src').getall()
        srcs = list(map(lambda x: x.replace("240x180_0_q95_c42_", ""), srcs))
        srcs = ["https:" + url for url in srcs]
        yield BmwItem(category=category, image_urls=srcs)

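The thumbnail-to-full-size conversion in the spider can be tried out on its own. This is a minimal sketch: the `to_full_size` helper and the generalized size pattern are my own assumptions. The only prefix actually observed on the site is `240x180_0_q95_c42_`.

```python
import re

# Assumption: a thumbnail differs from the full-size image only by a size prefix
# such as "240x180_0_q95_c42_" in front of the file name. Other sizes are a guess.
THUMB_PREFIX = re.compile(r"\d+x\d+_\d+_q\d+_c\d+_(?=autohomecar__)")


def to_full_size(src):
    """Turn a (possibly protocol-relative) thumbnail URL into the full-size URL."""
    url = THUMB_PREFIX.sub("", src)
    if url.startswith("//"):
        # The page uses protocol-relative src attributes, so add the scheme.
        url = "https:" + url
    return url


thumb = ("//car2.autoimg.cn/cardfs/product/g29/M04/71/D7/"
         "240x180_0_q95_c42_autohomecar__ChsEn10CSUmAHmqsAAm2XOgbAEI053.jpg")
print(to_full_size(thumb))
```

A URL that is already full size passes through unchanged, so the helper is safe to apply to every `src` on the page.
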
pipelines.py

import os
from urllib import request
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings


class BmwPipeline:
    def __init__(self):
        # os.path.dirname(__file__) gives the directory containing this file.
        # For pipelines.py that is bmw, but there is another bmw above it,
        # so dirname is applied twice. The join then creates the images
        # folder under the top-level bmw directory.
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        # Type of item: <class 'bmw.items.BmwItem'>
        category = item['category']
        urls = item['image_urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            request.urlretrieve(url, os.path.join(category_path, url.split('_')[-1]))
        return item


class BMWImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_objs = super(BMWImagesPipeline, self).get_media_requests(item, info)
        # Attach the item to each request so file_path can read its category
        for request_obj in request_objs:
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None):
        path = super(BMWImagesPipeline, self).file_path(request, response=response, info=info)
        category = request.item.get('category')
        images_store = settings.IMAGES_STORE
        category_path = os.path.join(images_store, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        # Drop the default "full/" prefix and store the file under its category
        image_name = path.replace("full/", "")
        image_path = os.path.join(category_path, image_name)
        return image_path
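
The path rewriting done in `file_path` can be illustrated without Scrapy. The sketch below assumes that the default ImagesPipeline path has the form `full/<SHA-1 of the URL>.jpg`, which is my understanding of Scrapy's default behavior. Both helper functions, the store path, and the category name are hypothetical. It uses `posixpath` instead of `os.path` only so that the output looks the same on every platform.

```python
import hashlib
import posixpath


def default_image_path(url):
    """Approximate the default relative path (assumed form): full/<sha1 of the URL>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"


def category_image_path(images_store, category, url):
    """Drop the 'full/' prefix and place the file under <images_store>/<category>/."""
    image_name = default_image_path(url).replace("full/", "")
    return posixpath.join(images_store, category, image_name)


# Hypothetical store directory, category name, and URL, for illustration only.
path = category_image_path("/tmp/images", "seats", "https://example.com/a.jpg")
print(path)
```

Each image therefore ends up in a folder named after its category, with the hashed file name that the framework would have generated anyway.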

items.py

import scrapy


class BmwItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

Problems encountered

The following error came up at run time:

XX Spider.parse callback is not defined

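After looking into it, I found the cause: the spider was written in the CrawlSpider style, but its class still inherited from scrapy.Spider instead of CrawlSpider.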