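Using pipelines and items in the Scrapy framework to extract items and save data

1. Extracting items with Scrapy

To extract data from web pages, Scrapy uses a mechanism called selectors, which is based on XPath and CSS expressions.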

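Selectors have five basic methods, listed below:

Method	Description
extract()	Returns the selected data as a list of unicode strings
extract_first()	Returns the first selected unicode string, or None if nothing matched
re()	Returns a list of unicode strings extracted by applying the regular expression passed as the argument
xpath()	Returns a list of selectors representing the nodes selected by the given XPath expression
css()	Returns a list of selectors representing the nodes selected by the given CSS expression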

2. Scrapy Shell

To see quickly what a selector returns, we can use the Scrapy Shell:
scrapy shell "http://www.163.com"

Note that on Windows the URL must be wrapped in double quotes.

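3. Writing the output to a file: two ways to save

3.1 Plain Python

# movie_name and movie_core are the lists of strings extracted earlier
with open("movie.txt", 'wb') as f:
    for n, c in zip(movie_name, movie_core):
        # Avoid naming the variable str, which would shadow the built-in
        line = n + ":" + c + "\n"
        f.write(line.encode())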

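3.2 Scrapy's built-in export

Scrapy has four main built-in export formats: JSON, JSON lines, CSV and XML.

We export the results as JSON, the most commonly used format.

Run the following command in the console: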

scrapy crawl dmoz -o douban.json -t json

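-o is followed by the export file name and -t by the export format. The -t option can be omitted, as in the commands below, because the format is inferred from the file extension.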

scrapy crawl qidian -o qidian.json
scrapy crawl qidian -o qidian.csv
scrapy crawl qidian -o qidian.xml
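The same exports can also be declared in the project settings instead of on the command line. This is a minimal sketch, assuming a Scrapy version that supports the FEEDS setting (2.1 or later); the file names are simply reused from the commands above.

```python
# settings.py (sketch): one entry per output file.
FEEDS = {
    "qidian.json": {"format": "json", "encoding": "utf8"},
    "qidian.csv": {"format": "csv"},
    "qidian.xml": {"format": "xml"},
}
```

With this in place, a plain scrapy crawl qidian should write all three files in one run.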

3.3 Non-ASCII text appears as \uXXXX escapes in JSON files saved by Scrapy

Add the following to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'
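The escapes come from JSON's ASCII-safe output mode. The sketch below reproduces the effect with Python's standard json module rather than with Scrapy itself; the sample record is invented for illustration.

```python
import json

record = {"name": "肖申克的救赎", "score": "9.6"}  # sample data for illustration

escaped = json.dumps(record)                       # ASCII-only output with \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)

# Both forms decode back to the same data, so the escaped file is still valid JSON.
assert json.loads(escaped) == json.loads(readable) == record
```

In other words, the escaped file is not corrupted, only harder to read; setting the encoding changes how the text is written, not what it contains.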

3.4 Garbled text in CSV files saved by Scrapy

Add the following to settings.py:

FEED_EXPORT_ENCODING = 'gb18030'
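Garbled CSV text usually means that the program opening the file assumes a different encoding from the one used to write it. The sketch below shows that mismatch in plain Python, assuming the reader expects a GB-family encoding (as spreadsheet programs on Chinese-language Windows often do); the sample string is only for illustration.

```python
text = "肖申克的救赎"  # sample string for illustration

# Written as UTF-8 but read as gb18030: the characters come out wrong.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("gb18030", errors="replace")

# Written as gb18030 and read as gb18030: the text survives.
gb_bytes = text.encode("gb18030")
restored = gb_bytes.decode("gb18030")

print(garbled != text, restored == text)
```

If the CSV is consumed by a tool that expects UTF-8 instead, 'utf-8' remains the appropriate setting.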

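4. Using pipelines and items in Scrapy to print and save extracted data

To extract data from an ordinary HTML site, inspect the page source to work out the XPath. On inspection, the data sits inside a ul tag, so we select elements within its li tags.
The code below shows how different kinds of data are extracted: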


# -*- coding: utf-8 -*-
import scrapy


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    allowed_domains = ['qidian.com']
    start_urls = ['https://www.qidian.com/rank/yuepiao?chn=21']

    def parse(self, response):
        names = response.xpath('//h4/a/text()').extract()
        authors = response.xpath('//p[@class="author"]/a[1]/text()').extract()
        # print(names, ':', authors)
        books = []
        for name, author in zip(names, authors):
            books.append({"name": name, "author": author})
        return books

Returned content

{'movie_name': ['肖申克的救赎', '霸王别姬', '这个杀手不太冷', '阿甘正传', '美丽人生',
                '千与千寻', '泰坦尼克号', '辛德勒的名单', '盗梦空间', '机器人总动员',
                '海上钢琴师', '三傻大闹宝莱坞', '忠犬八公的故事', '放牛班的春天', '大话西游之大圣娶亲',
                '教父', '龙猫', '楚门的世界', '乱世佳人', '熔炉',
                '触不可及', '天堂电影院', '当幸福来敲门', '无间道', '星际穿越'],
 'movie_core': ['9.6', '9.5', '9.4', '9.4', '9.5', '9.2', '9.2', '9.4', '9.3', '9.3',
                '9.2', '9.1', '9.2', '9.2', '9.2', '9.2', '9.1', '9.1', '9.2', '9.2',
                '9.1', '9.1', '8.9', '9.0', '9.1']}

4.1 Using pipelines to print and save extracted data


Printing the returned data through a pipeline, then saving it

# -*- coding: utf-8 -*-
import scrapy


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/films?showType=3']

    def parse(self, response):
        names = response.xpath('//div[@class="channel-detail movie-item-title"]/a/text()').extract()
        # scores_div = response.xpath('//div[@class="channel-detail channel-detail-orange"]')
        # scores = []
        # for score in scores_div:
        #     scores.append(score.xpath('string(.)').extract_first())
        # Shorthand for the loop above:
        scores = [score.xpath('string(.)').extract_first() for score in
                  response.xpath('//div[@class="channel-detail channel-detail-orange"]')]
        # Option 1: print to the console directly
        # for name, score in zip(names, scores):
        #     print(name, ':', score)
        # Option 2: push each record to the pipeline with yield.
        # The yielded object must be a dict or an Item; here it is a dict.
        for name, score in zip(names, scores):
            yield {'name': name, 'score': score}

Output in pipelines.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class SpiderPipeline:
    def process_item(self, item, spider):
        print(item)
        # Return the item so that any later pipeline still receives it
        return item

Note: by default, what the pipeline prints appears mixed in with the log output. The pipeline must also be enabled in the settings file:

ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,
}

Saving to a file
Approach 1

import json


class SpiderPipeline:
    def process_item(self, item, spider):
        # Append mode ('a'): the file is reopened for every item, which adds overhead
        with open('movie.txt', 'a', encoding='utf-8') as f:
            # item is a dict; dumps converts it to a string
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
        print(item)
        return item

Approach 2: open and close the file in the open_spider and close_spider hooks

import json


class SpiderPipeline:
    def open_spider(self, spider):
        self.filename = open('movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item is a dict; dumps converts it to a string
        self.filename.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.filename.close()
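To see the order in which the pipeline hooks are called, the sketch below drives a pipeline of the second style by hand, without running a crawl. The class name MovieFilePipeline, the path argument and the sample items are hypothetical; in a real project Scrapy makes these calls itself and passes the actual spider object.

```python
import json
import os
import tempfile


class MovieFilePipeline:
    """Same shape as the open/close approach, with the output path made configurable."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called once per item; writes one JSON object per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), 'movie.txt')
pipeline = MovieFilePipeline(path)

pipeline.open_spider(spider=None)  # None stands in for the spider in this demo
for item in [{'name': 'A', 'score': '9.6'}, {'name': 'B', 'score': '9.5'}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

The path parameter exists only to keep the demo away from the working directory. Scrapy creates pipelines without constructor arguments unless a from_crawler class method is defined, so a pipeline registered in ITEM_PIPELINES would need one or a fixed file name.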

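4.2 Using items to print and save extracted data

Wrapping extracted content in an Item
Scrapy uses spiders to extract data from web pages, and uses the Item class to produce the output objects that hold the scraped data. An Item object behaves like a custom Python dict, so field values can be read with standard dict syntax.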

4.2.1 Definition
# -*- coding: utf-8 -*-
import scrapy
from spider.items import MovieItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/films?showType=3']

    def parse(self, response):
        names = response.xpath('//div[@class="channel-detail movie-item-title"]/a/text()').extract()
        # scores_div = response.xpath('//div[@class="channel-detail channel-detail-orange"]')
        # scores = []
        # for score in scores_div:
        #     scores.append(score.xpath('string(.)').extract_first())
        scores = [score.xpath('string(.)').extract_first() for score in
                  response.xpath('//div[@class="channel-detail channel-detail-orange"]')]
        for name, score in zip(names, scores):
            # Create a new item for each record so that yielded items do not share state
            item = MovieItem()
            item['names'] = name
            item['scores'] = score
            yield item

The item class in items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    names = scrapy.Field()
    scores = scrapy.Field()

Saving the data
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

Approach 1

class SpiderPipeline:
    def process_item(self, item, spider):
        # Append mode ('a'): the file is reopened for every item, which adds overhead
        with open('movie.txt', 'a', encoding='utf-8') as f:
            # item is now a MovieItem, not a plain dict. Passing it to dumps directly raises
            # TypeError: Object of type MovieItem is not JSON serializable
            # so convert it with dict(item) first
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        # print(item)
        return item

Approach 2

import json


class SpiderPipeline:
    def open_spider(self, spider):
        self.filename = open('movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item is a MovieItem, so convert it with dict(item) before calling dumps;
        # otherwise: TypeError: Object of type MovieItem is not JSON serializable
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.filename.close()