Scrapy Crawler Framework (10): Downloading Files and Images

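Contents

Downloading Files and Images
- (I) FilesPipeline and ImagesPipeline
  - 1. Using FilesPipeline
  - 2. Using ImagesPipeline
- (II) Project example: downloading the matplotlib example source files
  - 1. Page analysis
  - 2. Implementation
    - (1) Create the project
    - (2) Enable FilesPipeline
    - (3) Wrap the data in an Item
    - (4) Write the spider
- (III) Project example: downloading 360 images
  - 1. Page analysis
  - 2. Implementation
    - (1) Create the project
    - (2) Build the requests
    - (3) Extract the information
    - (4) Store the data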

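Downloading Files and Images

In earlier chapters we learned how to scrape information from web pages, which is only the most typical use of a crawler. Downloading files is another common requirement in practice, for example fetching images, videos, Word documents, PDF files, or archives from a site. This chapter shows how to download files and images with Scrapy.

(I) FilesPipeline and ImagesPipeline

Scrapy ships with two Item Pipelines dedicated to downloading files and images:

FilesPipeline
ImagesPipeline

You can think of these two Item Pipelines as special-purpose downloaders. You pass the URLs of the files or images to download through one special field of the item. The pipeline downloads them to local storage automatically and writes the download results into another special field of the item, so you can look them up in the exported file. The following sections describe how to use them.

1. Using FilesPipeline

A simple example illustrates FilesPipeline. The following page offers several novels for download in PDF format:

<html>
<body>
...
<a href='/book/sg.pdf'>Download "Romance of the Three Kingdoms"</a>
<a href='/book/shz.pdf'>Download "Water Margin"</a>
<a href='/book/hlm.pdf'>Download "Dream of the Red Chamber"</a>
<a href='/book/xyj.pdf'>Download "Journey to the West"</a>
...
</body>
</html>

To download every PDF file on the page with FilesPipeline, follow these steps:

Step 1: Enable FilesPipeline in the configuration file settings.py. It is usually placed ahead of the other Item Pipelines: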
```
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
```
Step 2: In settings.py, set FILES_STORE to the directory where downloaded files are saved, for example:
```
FILES_STORE = '/home/liushuo/Download/scrapy'
```
Step 3: When the Spider parses a page that contains download links, collect the URLs of all files to download into a list and assign it to the item's file_urls field (item['file_urls']). For each item it processes, FilesPipeline reads item['file_urls'] and downloads every URL in it. A sample Spider:
```
import scrapy


class DownloadBookSpider(scrapy.Spider):
    ...

    def parse(self, response):
        item = {}
        item['file_urls'] = []
        # Collect the absolute URL of every link on the page.
        for url in response.xpath('//a/@href').extract():
            download_url = response.urljoin(url)
            item['file_urls'].append(download_url)
        yield item
```

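After FilesPipeline has downloaded every file in item['file_urls'], it collects the download results into another list and assigns it to the item's files field (item['files']). Each result contains:

path: the local path of the downloaded file, relative to FILES_STORE;
checksum: the checksum of the file (an MD5 hash of its contents);
url: the URL the file was downloaded from.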
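
To make the shape of the files field concrete, the sketch below walks a populated item and works out where each file landed on disk. The sample item is hand-written for illustration (its URL, path, and checksum are made-up values), and summarize_downloads is a hypothetical helper, not part of Scrapy. Only the key names follow the result fields listed above.

```python
import os


def summarize_downloads(item, files_store):
    """Return (url, local_path) pairs for every file recorded in item['files']."""
    pairs = []
    for result in item.get('files', []):
        # 'path' is relative to FILES_STORE, so join the two to locate the file.
        pairs.append((result['url'], os.path.join(files_store, result['path'])))
    return pairs


# Hand-written sample item (assumed values, for illustration only).
sample_item = {
    'file_urls': ['http://example.com/book/sg.pdf'],
    'files': [{
        'url': 'http://example.com/book/sg.pdf',
        'path': 'full/0123456789abcdef0123456789abcdef01234567.pdf',
        'checksum': '0123456789abcdef0123456789abcdef',
    }],
}

for url, path in summarize_downloads(sample_item, '/home/liushuo/Download/scrapy'):
    print(url, '->', path)
```

The same loop works on records loaded from an exported JSON file, since the exporter writes the files field unchanged.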

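2. Using ImagesPipeline

An image is also a file, so downloading an image is essentially downloading a file. ImagesPipeline is a subclass of FilesPipeline and is used in much the same way. Only the item fields and setting names differ, as the table below shows:

| | FilesPipeline | ImagesPipeline |
|---|---|---|
| Import path | scrapy.pipelines.files.FilesPipeline | scrapy.pipelines.images.ImagesPipeline |
| Item fields | file_urls, files | image_urls, images |
| Download directory | FILES_STORE | IMAGES_STORE |

On top of FilesPipeline, ImagesPipeline adds a few image-specific features:

Generating thumbnails: to enable this, set IMAGES_THUMBS in settings.py. It is a dictionary whose keys are thumbnail names and whose values are thumbnail sizes: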
```
IMAGES_THUMBS = {'small':(50, 50), 'big':(270, 270),}
```
With this enabled, downloading one image produces three local files (the original plus two thumbnails) at the following paths:
```
[IMAGES_STORE]/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
[IMAGES_STORE]/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
[IMAGES_STORE]/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
```
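
The layout above can be reproduced with a short sketch. It assumes the default naming scheme, in which the file name is the SHA1 hash of the image URL and images are stored as .jpg. image_paths is a hypothetical helper written for this illustration, and the URL is a placeholder.

```python
import hashlib


def image_paths(url, thumbs):
    """Return the relative paths of the original image and of each thumbnail."""
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    paths = [f'full/{guid}.jpg']
    for thumb_id in thumbs:
        # One sub-directory per key of IMAGES_THUMBS.
        paths.append(f'thumbs/{thumb_id}/{guid}.jpg')
    return paths


IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}
for path in image_paths('http://example.com/photo.png', IMAGES_THUMBS):
    print(path)
```

Every path is relative to IMAGES_STORE, and all three share the same hash because it depends only on the URL.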
Filtering out images that are too small: to enable this, set IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT in settings.py. They specify the minimum width and height of an image:
```
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```
With this enabled, a downloaded 105×200 image is discarded because its width is below the minimum.
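
The size rule can be written as a one-line predicate. This is a minimal sketch of the check described above, not the pipeline's own code. passes_size_filter is a hypothetical name, and the defaults are the values from the settings example.

```python
def passes_size_filter(width, height, min_width=110, min_height=110):
    """Keep an image only if both sides reach the configured minimum."""
    return width >= min_width and height >= min_height


print(passes_size_filter(105, 200))  # False: width 105 is below 110
print(passes_size_filter(110, 110))  # True: both sides meet the minimum
```

An image fails the check as soon as either side is too small, which is why a tall but narrow image is still rejected.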

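(II) Project example: downloading the matplotlib example source files

Let's build a hands-on project that downloads files with FilesPipeline. matplotlib is a well-known Python plotting library, widely used in scientific computing and data analysis. The matplotlib website provides many example programs. Visiting http://matplotlib.org/examples/index.html in a browser shows the example listing page. There are several hundred examples, grouped into categories. On each example page you can read the source code or click the source code button to download the source file. To download the source files of all examples to local disk, we can write a crawler.

1. Page analysis

First, let's see how to get the links to all example pages from the listing page http://matplotlib.org/examples/index.html. Use the scrapy shell command to download the page, then call view to inspect it in a browser.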

$ scrapy shell http://matplotlib.org/examples/index.html
...
>>> view(response)

All links to example pages sit inside an <li class="toctree-l2"> element under <div class="toctree-wrapper compound">, for example:

<li class="toctree-l2"><a class="reference internal" href="animation/animate_decay.html">animate_decay</a>
</li>

Use LinkExtractor to extract the links to all example pages:

>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
>>> links = le.extract_links(response)
>>> [link.url for link in links]
['http://matplotlib.org/examples/animation/animate_decay.html',
'http://matplotlib.org/examples/animation/basic_example.html',
'http://matplotlib.org/examples/animation/basic_example_writer.html',
'http://matplotlib.org/examples/animation/bayes_update.html',
'http://matplotlib.org/examples/animation/double_pendulum_animated.html',
'http://matplotlib.org/examples/animation/dynamic_image.html',
......
'http://matplotlib.org/examples/widgets/slider_demo.html',
'http://matplotlib.org/examples/widgets/span_selector.html']
>>> len(links)
506

Next, analyze an example page. Call fetch to download the first example page, then call view to inspect it in a browser:

>>> fetch('http://matplotlib.org/examples/animation/animate_decay.html')
2021-03-24 21:13:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://matplotlib.org/2.0.2/examples/animation/animate_decay.html> (referer: None)
>>> view(response)
True

On the first example page, the download address of the source file is in an <a class="reference external"> element, which is the source code link on the page:

>>> href = response.css('a.reference.external::attr(href)').extract_first()
>>> href
'animate_decay.py'
>>> response.urljoin(href)
'https://matplotlib.org/2.0.2/examples/animation/animate_decay.py'
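
The relative-to-absolute step in the session above can be reproduced without Scrapy. The sketch below uses the standard library's urljoin with the page URL and href taken from the shell session, on the assumption that the page declares no separate base URL.

```python
from urllib.parse import urljoin

# Values taken from the shell session above.
page_url = 'https://matplotlib.org/2.0.2/examples/animation/animate_decay.html'
href = 'animate_decay.py'

# A relative href replaces the last path segment of the page URL.
download_url = urljoin(page_url, href)
print(download_url)
# https://matplotlib.org/2.0.2/examples/animation/animate_decay.py
```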

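That completes the page analysis. Now let's write the project.

2. Implementation

(1) Create the project

First create a Scrapy project named matplotlib, then create the Spider with the scrapy genspider command: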

(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case$ scrapy startproject matplotlib
New Scrapy project 'matplotlib', using template directory '/home/pyvip/.virtualenvs/pyspider/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/pyvip/project/Python_Spider/Spider_Project/Simple_Case/matplotlib

You can start your first spider with:
    cd matplotlib
    scrapy genspider example example.com
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case$ cd matplotlib/matplotlib
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ scrapy genspider examples matplotlib.org
Created spider 'examples' using template 'basic' in module:
  matplotlib.spiders.examples

(2) Enable FilesPipeline

Enable FilesPipeline in settings.py and set the download directory:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 300,}
FILES_STORE = 'examples_src'

(3) Wrap the data in an Item

Implement ExamplesItem with the two fields file_urls and files. Add the following to items.py:

import scrapy


class ExamplesItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

(4) Write the spider

To implement ExamplesSpider, first set the starting point of the crawl by editing start_urls in spiders/examples.py:

import scrapy


class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

    def parse(self, response):
        pass

The parse method handles the example listing page. It extracts the link to each example page and submits a Request for it. The details of link extraction were covered in the page analysis. The implementation of parse:

import scrapy
from scrapy.linkextractors import LinkExtractor


class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

    def parse(self, response):
        # Take every link in the table of contents except the index pages.
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound', deny='/index.html$')
        links = le.extract_links(response)
        print(len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        pass
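
The effect of the deny argument can be illustrated with the standard re module: a link is dropped when the pattern is found in its absolute URL. The sketch below applies the same pattern to three sample URLs chosen for this illustration. It mimics the filtering idea only and is not LinkExtractor itself.

```python
import re

# Same pattern as the deny argument above.
deny_pattern = re.compile(r'/index.html$')

# Sample URLs (assumed for illustration).
candidate_urls = [
    'https://matplotlib.org/2.0.2/examples/index.html',
    'https://matplotlib.org/2.0.2/examples/animation/index.html',
    'https://matplotlib.org/2.0.2/examples/animation/animate_decay.html',
]

# Keep a URL only if the deny pattern does not match it.
kept = [url for url in candidate_urls if not deny_pattern.search(url)]
print(kept)
```

Only the animate_decay.html URL survives, because the other two end in /index.html.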

In the code above, the example pages are handled by the parse_example method, which we implement next. Each example page contains the download link of its source file. In parse_example we get the URL of the source file, put it in a list, and assign the list to the file_urls field of an ExamplesItem. The implementation of parse_example:

import scrapy
from scrapy.linkextractors import LinkExtractor

from ..items import ExamplesItem


class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound', deny='/index.html$')
        links = le.extract_links(response)
        print(len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        # The first external reference link on the page is the source file.
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = ExamplesItem()
        example['file_urls'] = [url]
        return example

With the code complete, run the crawler and check the results:

(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ scrapy crawl examples -o examples.json
...
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ ls
examples.json  examples_src  __init__.py  items.py  middlewares.py  pipelines.py  __pycache__  settings.py  spiders

After the run finishes, examples.json contains the download results:

(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ cat examples.json
...
{"file_urls": ["https://matplotlib.org/2.0.2/examples/animation/rain.py"], "files": [{"url": "https://matplotlib.org/2.0.2/examples/animation/rain.py", "path": "full/be835bf299dea6dc4f368a3eda497bd78b186d75.py", "checksum": "5b08c716535577b1bca16810d8d45de0", "status": "downloaded"}]},
{"file_urls": ["https://matplotlib.org/2.0.2/examples/animation/basic_example.py"], "files": [{"url": "https://matplotlib.org/2.0.2/examples/animation/basic_example.py", "path": "full/932d5f7e17dc6eab6f9fdef2c68f136b5755646d.py", "checksum": "1d4afc0910f6abc519e6ecd32c66896a", "status": "downloaded"}]},
{"file_urls": ["https://matplotlib.org/2.0.2/examples/animation/bayes_update.py"], "files": [{"url": "https://matplotlib.org/2.0.2/examples/animation/bayes_update.py", "path": "full/eff52b989f81129694d042179f65742d09b612b8.py", "checksum": "715610c4375a1d749bc26b39cf7e7199", "status": "downloaded"}]},
{"file_urls": ["https://matplotlib.org/2.0.2/examples/animation/basic_example_writer.py"], "files": [{"url": "https://matplotlib.org/2.0.2/examples/animation/basic_example_writer.py", "path": "full/b0e207cda863f7f41adf20766034b07b11c57817.py", "checksum": "3f05d70f96cec4f10207e4148e9921d7", "status": "downloaded"}]}

Now look at the download directory examples_src:

(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ tree examples_src
examples_src
└── full
    ├── 003461a7cb514d061a36028d09f33a6d0fced5b8.py
    ├── 01c1f9d780f9616da35d5520f7e86580c8c4ae19.py
    ├── 01f6ec589f07462a120658a36fb9043a9658003c.py
    ├── 03794363e57051be8c5a07dce1655f89e0b01077.py
    ├── 038a85aec92104898ad44093059f415bef429bc1.py
    ├── ...
    ├── fb669202877e31452d80459aa2081125a6306b15.py
    ├── fbebccd05ab23379dc3d15196dd31c1bd75030c1.py
    ├── fc7bda5ccfb184b6866b5e4496832af0a550944d.py
    ├── fca1e505e0925b10f62e485fef378ef1ad506223.py
    └── ffa7d0c03645a8d460d2a8d94ede826a761560d2.py

1 directory, 506 files

As shown above, 506 source files were downloaded into examples_src/full, and each file name is a 40-character hexadecimal string: the SHA1 hash of the file's URL. For example, for the file URL:

http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py

the SHA1 hash of the URL is:

d9b551310a6668ccf43871e896f2fe6e0228567d

so the file is stored at:

# [FILES_STORE]/full/[SHA1_HASH_VALUE].py
examples_src/full/d9b551310a6668ccf43871e896f2fe6e0228567d.py
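
The default naming rule can be approximated in a few lines. This is a simplified sketch rather than Scrapy's actual implementation. default_file_path is a hypothetical helper that hashes the URL with SHA1 and keeps the extension found in the URL path.

```python
import hashlib
import os
from urllib.parse import urlparse


def default_file_path(url):
    """Build 'full/<sha1 of url><extension>' for a download URL."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    ext = os.path.splitext(urlparse(url).path)[1]
    return f'full/{digest}{ext}'


print(default_file_path('http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py'))
```

Because the hash covers the whole URL, two files with the same base name in different categories get different local names.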

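This naming scheme prevents files with the same name from overwriting each other, but the names are not readable: you cannot tell what a file contains from its name. We would like the example files to be saved into separate directories by category. To do that, we could write a separate script that renames the files based on the information in examples.json, or we could change the rule FilesPipeline uses to name files. We take the second approach here.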
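
For comparison, the first option (a separate renaming script) might look like the sketch below. plan_renames is a hypothetical helper that only computes the old and new relative paths. The record is copied from the examples.json output shown earlier, and the actual renaming on disk is left out.

```python
import posixpath
from urllib.parse import urlparse


def plan_renames(records):
    """Map each hashed download path to a '<category>/<file name>' path."""
    plan = []
    for record in records:
        for result in record.get('files', []):
            url_path = urlparse(result['url']).path
            category = posixpath.basename(posixpath.dirname(url_path))
            target = posixpath.join(category, posixpath.basename(url_path))
            plan.append((result['path'], target))
    return plan


# One record copied from the examples.json output.
records = [{
    "file_urls": ["https://matplotlib.org/2.0.2/examples/animation/rain.py"],
    "files": [{
        "url": "https://matplotlib.org/2.0.2/examples/animation/rain.py",
        "path": "full/be835bf299dea6dc4f368a3eda497bd78b186d75.py",
        "checksum": "5b08c716535577b1bca16810d8d45de0",
        "status": "downloaded",
    }],
}]

for src, dst in plan_renames(records):
    print(src, '->', dst)
```

A full script would load the records from examples.json and move each file under FILES_STORE from the first path to the second.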

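Reading the FilesPipeline source shows that its file_path method decides the file name. We implement a subclass of FilesPipeline and override file_path to get the naming rule we want. The last two components of each source file URL are the category and the file name, for example: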

http://matplotlib.org/mpl_examples/(axes_grid/demo_floating_axes.py)

The part in parentheses can serve as the file path. Implement MyFilesPipeline in pipelines.py:

from os.path import basename, dirname, join
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    # Newer Scrapy releases also pass item as a keyword-only argument.
    def file_path(self, request, response=None, info=None, *, item=None):
        path = urlparse(request.url).path
        # Keep only "<category>/<file name>" from the URL path.
        return join(basename(dirname(path)), basename(path))

Update the settings to use MyFilesPipeline instead of FilesPipeline:

ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    'matplotlib.pipelines.MyFilesPipeline': 1,
}

Delete all previously downloaded files, rerun the crawler, and look at the examples_src directory again:

(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ rm -r examples_src/full
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ rm examples.json
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ scrapy crawl examples -o examples.json
...
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/matplotlib/matplotlib$ tree examples_src
examples_src
├── animation
│   ├── animate_decay.py
│   ├── basic_example.py
│   ├── basic_example_writer.py
│   ├── bayes_update.py
│   ├── double_pendulum_animated.py
│   ├── dynamic_image2.py
│   ├── dynamic_image.py
│   ├── histogram.py
│   ├── moviewriter.py
│   ├── rain.py
│   ├── random_data.py
│   ├── simple_3danim.py
│   ├── simple_anim.py
│   ├── strip_chart_demo.py
│   ├── subplots.py
│   └── unchained.py
├── api
│   ├── ...
└── widgets
    ├── buttons.py
    ├── check_buttons.py
    ├── cursor.py
    ├── lasso_selector_demo.py
    ├── menu.py
    ├── multicursor.py
    ├── radio_buttons.py
    ├── rectangle_selector.py
    ├── slider_demo.py
    └── span_selector.py

26 directories, 506 files

The output shows that the 506 files were downloaded into 26 directories by category, which is exactly what we wanted.

That completes the file download project.

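(III) Project example: downloading 360 images

This example crawls the 360 photography gallery and saves the photos locally.

1. Page analysis

The target site is https://image.so.com. Open the page and switch to the photography tab, which shows a large number of photos. Open the browser developer tools, switch the network filter to XHR, and scroll down the page. A series of Ajax requests appears.

Inspect the details of one request and look at the structure of the returned data.

The response is JSON. Its list field holds the details of the individual images: the ID, name, link, thumbnail, and other information for 30 images.

Looking at the Ajax request parameters, one parameter, sn, keeps changing, and it is clearly the offset. When sn is 30 the first 30 images are returned, and when sn is 60 images 31 to 60 are returned. The ch parameter is the photography category, listtype is the sort order, and the temp parameter can be ignored. So when crawling we only need to vary sn. Next we implement the crawl with Scrapy and save the images locally.

2. Implementation

(1) Create the project

First create a new project: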

(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case$ scrapy startproject image360
New Scrapy project 'image360', using template directory '/home/pyvip/.virtualenvs/pyspider/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/pyvip/project/Python_Spider/Spider_Project/Simple_Case/image360

You can start your first spider with:
    cd image360
    scrapy genspider example example.com

Then create a Spider:

(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case$ cd image360/image360
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/image360/image360$ ls
__init__.py  items.py  middlewares.py  pipelines.py  settings.py  spiders
(pyspider) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/image360/image360$ scrapy genspider images images.so.com
Created spider 'images' using template 'basic' in module:
  image360.spiders.images

The Spider has now been created.

(2) Build the requests

First define the number of pages to crawl. Add a MAX_PAGE variable to settings.py:

MAX_PAGE = 50

Then define the start_requests() method in images.py to generate the 50 requests:

from urllib.parse import urlencode

from scrapy import Spider, Request


class ImagesSpider(Spider):
    name = 'images'
    # The Ajax API is served from image.so.com, so allow that domain.
    allowed_domains = ['image.so.com']
    start_urls = ['https://image.so.com/']

    def start_requests(self):
        data = {'ch': 'photography', 'listtype': 'new'}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

Here we first define the two fixed parameters, and the sn parameter is generated in the loop. urlencode() converts the dictionary into URL query parameters, which gives the full URL used to build and yield each Request.
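
The URL construction can be checked on its own, outside Scrapy. The sketch below reuses the base URL and parameters from the spider. build_page_urls is a hypothetical helper, and the page size of 30 is the value observed in the page analysis.

```python
from urllib.parse import urlencode


def build_page_urls(max_page, page_size=30):
    """Return one API URL per page, with sn set to page * page_size."""
    base_url = 'https://image.so.com/zjl?'
    urls = []
    for page in range(1, max_page + 1):
        params = {'ch': 'photography', 'listtype': 'new', 'sn': page * page_size}
        urls.append(base_url + urlencode(params))
    return urls


for url in build_page_urls(3):
    print(url)
```

With MAX_PAGE = 50 the last URL carries sn=1500, which matches the final request in the crawl log.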

Then set ROBOTSTXT_OBEY in settings.py to False. Otherwise the pages cannot be crawled:

ROBOTSTXT_OBEY = False

Finally, run the crawler. The log shows that all the requests succeed:

(py3env) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/image360/image360$ scrapy crawl images
......
2021-03-23 15:32:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://image.so.com/zjl?ch=photography&listtype=new&sn=960>
{'id': 'a6e358a426a5fadd396ad5510253b3e4',
 'thumb': 'https://p1.ssl.qhimgs1.com/sdr/200_200_/t016c75c6e93b92f53b.jpg',
 'title': '住宿加次日早餐,签到,正面,盛开,薰衣草种植区,乡村,普罗旺斯,法国',
 'url': 'https://p1.ssl.qhimgs1.com/t016c75c6e93b92f53b.jpg'}
item {'id': '1b7be74871b3d456255ac801d25d84c5',
 'thumb': 'https://p2.ssl.qhimgs1.com/sdr/200_200_/t011eb18a15d14d9a08.jpg',
 'title': '木桌子,秋天,装饰,瓷器',
 'url': 'https://p2.ssl.qhimgs1.com/t011eb18a15d14d9a08.jpg'}
2021-03-23 15:32:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://image.so.com/zjl?ch=photography&listtype=new&sn=960>
{'id': '1b7be74871b3d456255ac801d25d84c5',
 'thumb': 'https://p2.ssl.qhimgs1.com/sdr/200_200_/t011eb18a15d14d9a08.jpg',
 'title': '木桌子,秋天,装饰,瓷器',
 'url': 'https://p2.ssl.qhimgs1.com/t011eb18a15d14d9a08.jpg'}
2021-03-23 15:32:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://image.so.com/zjl?ch=photography&listtype=new&sn=1500> (referer: None)
2021-03-23 15:32:13 [scrapy.core.engine] INFO: Closing spider (finished)
2021-03-23 15:32:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15762,
 'downloader/request_count': 50,
 'downloader/request_method_count/GET': 50,
 'downloader/response_bytes': 227320,
 'downloader/response_count': 50,
 'downloader/response_status_count/200': 50,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 3, 23, 7, 32, 13, 613682),
 'item_scraped_count': 987,
 'log_count/DEBUG': 1038,
 'log_count/INFO': 7,
 'memusage/max': 52457472,
 'memusage/startup': 52457472,
 'response_received_count': 50,
 'scheduler/dequeued': 50,
 'scheduler/dequeued/memory': 50,
 'scheduler/enqueued': 50,
 'scheduler/enqueued/memory': 50,
 'start_time': datetime.datetime(2021, 3, 23, 7, 32, 6, 243451)}
2021-03-23 15:32:13 [scrapy.core.engine] INFO: Spider closed (finished)

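As the output shows, every request returned status code 200, which means the image information was fetched successfully.

(3) Extract the information

First define an Item named ImageItem in items.py:

from scrapy import Field, Item


class ImageItem(Item):
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()

We define four fields: the image ID, link, title, and thumbnail. Next, extract this information in the Spider by rewriting parse() as follows: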

import json
from urllib.parse import urlencode

from scrapy import Spider, Request

from ..items import ImageItem


class ImagesSpider(Spider):
    name = 'images'
    # The Ajax API is served from image.so.com, so allow that domain.
    allowed_domains = ['image.so.com']
    start_urls = ['https://image.so.com/']

    def start_requests(self):
        ...  # unchanged from the previous step

    def parse(self, response):
        result = json.loads(response.text)
        for image in result.get('list'):
            item = ImageItem()
            item['id'] = image.get('id')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('title')
            item['thumb'] = image.get('qhimg_thumb')
            print('item', item)
            yield item

We parse the JSON, iterate over its list field to read each image's information, and assign the values to an ImageItem to produce the Item object. That completes the extraction.
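
The field mapping can be tried without running a crawl. The sketch below parses a hand-written JSON body whose keys follow the ones used in parse(). The values are made up for illustration, and extract_images is a hypothetical helper that returns plain dicts instead of ImageItem objects.

```python
import json

# Hand-written sample response body (assumed values, for illustration only).
sample_body = json.dumps({
    'list': [
        {
            'id': 'abc123',
            'title': 'sample title',
            'qhimg_url': 'https://example.com/full.jpg',
            'qhimg_thumb': 'https://example.com/thumb.jpg',
        },
    ]
})


def extract_images(body):
    """Map each entry of the 'list' field to the four fields of ImageItem."""
    result = json.loads(body)
    images = []
    # Fall back to an empty list so a response without 'list' yields nothing.
    for image in result.get('list') or []:
        images.append({
            'id': image.get('id'),
            'url': image.get('qhimg_url'),
            'title': image.get('title'),
            'thumb': image.get('qhimg_thumb'),
        })
    return images


print(extract_images(sample_body))
```

The `or []` guard is a small design choice: iterating over a missing list field would otherwise raise a TypeError.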

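(4) Store the data

Next we save the images locally.

Scrapy provides Pipelines dedicated to downloads, covering both files and images. The official documentation is at https://doc.scrapy.org/en/latest/topics/media-pipeline.html. Files and images are downloaded the same way pages are fetched, so downloads are asynchronous and concurrent, which makes them efficient. Let's walk through the implementation.

First define where files are stored by adding an IMAGES_STORE variable to settings.py: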

IMAGES_STORE = './images'

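Here the path is the images subfolder of the current directory, so downloaded images are saved into the project's images folder. By default the built-in ImagesPipeline reads the Item's image_urls field, expects it to be a list, and downloads every URL in it. The Items we generate, however, store the image link in a different field, and as a single URL rather than a list. To download them we therefore need to redefine part of the download logic: define a custom ImagePipeline that inherits from the built-in ImagesPipeline and overrides a few methods:

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImagePipeline(ImagesPipeline):
    # Newer Scrapy releases also pass item as a keyword-only argument.
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last segment of the image URL as the file name.
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        # Keep the paths of the downloads that succeeded.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        # The item holds a single URL, so yield one Request for it.
        yield Request(item['url'])

ImagePipeline inherits from Scrapy's built-in ImagesPipeline and overrides the following methods.

get_media_requests(): its first argument, item, is the Item object produced by the crawl. We read its url field and build a Request from it. The Request is added to the scheduling queue, where it waits to be scheduled and downloaded.
file_path(): its first argument, request, is the Request object of the current download. The method returns the name under which the file is saved. Here we use the last part of the image link as the file name: split() breaks the link apart and we return the last piece, so the downloaded image is saved under that name.
item_completed(): this method is called when all downloads for a single Item have finished. Not every image downloads successfully, so we examine the results and discard the failures. If an image fails to download, there is no need to save its Item to the database. The first argument, results, holds the download results for the Item. It is a list of tuples, each carrying the success or failure information of one download. We iterate over the results and collect the paths of the successful downloads. If the list is empty, the image for this Item failed to download, so we raise DropItem and the Item is ignored. Otherwise we return the Item, which marks it as valid.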
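
The filtering step in item_completed() is easier to see with a concrete results list. The sketch below uses hand-written sample data: each entry is a (success, info) tuple, and a plain Exception stands in for the failure object Scrapy would pass. successful_paths is a hypothetical helper that mirrors the list comprehension in the pipeline.

```python
def successful_paths(results):
    """Return the stored paths of the downloads that succeeded."""
    return [info['path'] for ok, info in results if ok]


# Hand-written sample results (assumed values, for illustration only).
sample_results = [
    (True, {'url': 'https://example.com/a.jpg', 'path': 'a.jpg', 'checksum': '0' * 32}),
    (False, Exception('download failed')),  # stand-in for the failure object
]

print(successful_paths(sample_results))  # ['a.jpg']
print(successful_paths([(False, Exception('download failed'))]))  # []
```

An empty list is the condition under which the pipeline raises DropItem.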

Finally, set ITEM_PIPELINES in settings.py. The module path starts with the project name, image360:

ITEM_PIPELINES = {
    'image360.pipelines.ImagePipeline': 300,
}

Then run the program to start the crawl:

(py3env) pyvip@VIP:~/project/Python_Spider/Spider_Project/Simple_Case/image360/image360$ scrapy crawl images
......

When the crawler finishes, the folder holding the saved images appears under the project folder:

[Figure: downloaded 360 photography images]

How to save the scraped image information to a MySQL or MongoDB database is covered in detail in later chapters.


If you find any errors in this article, please leave a comment.