The Scrapy pipelines file (pipelines.py)

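Many people who are new to Scrapy are unsure what pipelines are for. A useful way to think about it: spiders parse pages, middlewares handle requests and downloads, and everything that happens to the data afterwards can be left to pipelines.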

I. Pipelines are highly reusable

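After using Scrapy for several years, my impression is that pipelines are highly reusable. A well-written pipeline works for almost any spider, and all the spiders in a project can share one pipelines.py. For a new project, copy a pipelines.py you have already written and change a few lines to get a lot of functionality.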
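Which pipelines run, and in what order, is set per project in settings.py through ITEM_PIPELINES. A minimal sketch, assuming a project package named myproject (a hypothetical name) that contains the pipeline classes from the next section; the priority numbers are arbitrary choices, and lower numbers run first:

```python
# settings.py (sketch). "myproject" is a hypothetical package name.
# Values are priorities in the 0-1000 range; lower numbers run first.
ITEM_PIPELINES = {
    "myproject.pipelines.TimePipeline": 100,      # clean and annotate items
    "myproject.pipelines.DownloadPipeline": 200,  # download the images
    "myproject.pipelines.MongoDBPipeline": 300,   # store the metadata
}

# Order in which process_item is called for each item.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Reusing a pipeline in another project then comes down to copying the class and adding one entry to this dict.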

II. What pipelines are mainly used for

1. Post-processing data: cleaning, deduplication, merging, timestamping…

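The pipeline below processes data for an image-download spider and mainly prepares items for the download step: it adds a timestamp, drops invalid items that have no src, adds a Referer, and sets the storage path dirpath. Only a relative path is needed here; the base directory is set in settings.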

    from datetime import datetime

    from scrapy.exceptions import DropItem


    class TimePipeline(object):
        def process_item(self, item, spider):
            if item['src'] is None:
                raise DropItem('Drop empty item!!')
            # Store the timestamp as a string so the item stays JSON-serializable.
            item['crawled'] = str(datetime.utcnow())
            item['spider'] = spider.name
            if spider.name == 'meitulu':
                item['Referer'] = 'https://www.meitulu.com/img.html?img=' + item['src']
                item['dirpath'] = 'meitulu_scrapy/' + item['title'] + '/' + item['dirname'] + '/' + item['picname'] + '.jpg'
            elif spider.name == 'meituri':
                item['Referer'] = 'https://www.meituri.com/bigimg.html?img=' + item['src']
                item['dirpath'] = 'meituri_scrapy/' + item['title'] + '/' + item['dirname'] + '/' + item['picname'] + '.jpg'
            elif spider.name == 'meituri_sql':
                item['title'] = item['title'].replace(' ', '')
                item['picname'] = item['picname'].replace(' ', '')
            elif spider.name == 'jmrenti':
                item['picname'] = item['picname'].replace('/', '_')
                item['src'] = 'http://www.jmrenti.org' + item['src']
                item['Referer'] = 'http://www.jmrenti.org/'
                item['dirpath'] = 'jmrenti_scrapy/' + item['title'] + '/' + item['dirname'] + '/' + item['picname'] + '.jpg'
            elif spider.name == 'lituwu':
                item['picname'] = item['picname'].replace(' ', '')
                item['src'] = 'https://www.lituwu.com/' + item['src']
                item['Referer'] = 'https://www.lituwu.com/'
                item['dirpath'] = 'lituwu_scrapy/' + item['title'] + '/' + item['dirname'] + '/' + item['picname'] + '.jpg'
            elif spider.name == 'mntup':
                item['src'] = 'https://www.mntup.com' + item['src']
                item['Referer'] = 'https://www.mntup.com/'
                # item['dirpath'] = 'mntup_scrapy/' + item['title'] + '/' + item['dirname'] + '/' + item['picname'] + '.jpg'
            return item
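As more spiders are added, the if/elif chain in TimePipeline grows by one branch per site. One possible alternative, shown here only as a sketch, is to keep the per-site differences in a lookup table, so that supporting a new site means adding an entry rather than a branch. The names SITE_RULES and apply_site_rules are hypothetical, only two of the sites are included, and the picname cleanup is left out:

```python
# Per-spider differences as data. SITE_RULES and apply_site_rules are
# hypothetical names for illustration, not part of Scrapy.
SITE_RULES = {
    "meitulu": {
        "src_prefix": "",
        "referer": "https://www.meitulu.com/img.html?img={src}",
        "root": "meitulu_scrapy",
    },
    "jmrenti": {
        "src_prefix": "http://www.jmrenti.org",
        "referer": "http://www.jmrenti.org/",
        "root": "jmrenti_scrapy",
    },
}


def apply_site_rules(item, spider_name, rules=SITE_RULES):
    """Return a copy of item with src, Referer and dirpath filled in."""
    rule = rules.get(spider_name)
    if rule is None:
        return dict(item)  # unknown spider: leave the item unchanged
    out = dict(item)
    out["src"] = rule["src_prefix"] + out["src"]
    out["Referer"] = rule["referer"].format(src=out["src"])
    out["dirpath"] = "/".join(
        [rule["root"], out["title"], out["dirname"], out["picname"] + ".jpg"]
    )
    return out


example = apply_site_rules(
    {"src": "/a/1.jpg", "title": "t", "dirname": "d", "picname": "p"},
    "jmrenti",
)
print(example["src"])      # http://www.jmrenti.org/a/1.jpg
print(example["dirpath"])  # jmrenti_scrapy/t/d/p.jpg
```

Inside process_item, such a function would be called with spider.name after the empty-src check.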

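2. Storing data on the file system

The following pipeline appends each item as a JSON object to a file named after the spider.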

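    import json


    class TextPipeline(object):
        def process_item(self, item, spider):
            content = json.dumps(dict(item), ensure_ascii=False) + ',\n\n'
            textpath = spider.name + '.json'
            # ensure_ascii=False writes non-ASCII text as-is, so set the encoding explicitly.
            with open(textpath, 'a', encoding='utf-8') as fp:
                fp.write(content)
            return item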
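Appending objects separated by commas produces a file that a JSON parser cannot load in one call. A common alternative is JSON Lines, one object per line, with the file opened once per crawl in open_spider and closed in close_spider (both are standard pipeline hooks). A minimal sketch; the class name JsonLinesPipeline, the helper read_jsonl, and the out_dir argument are assumptions made for illustration:

```python
import json
import os


class JsonLinesPipeline:
    """Write each item as one JSON object per line to <spider name>.jsonl."""

    def __init__(self, out_dir="."):
        # out_dir is a hypothetical option; it defaults to the working directory.
        self.out_dir = out_dir
        self.fp = None

    def open_spider(self, spider):
        # Open the file once per crawl instead of once per item.
        path = os.path.join(self.out_dir, spider.name + ".jsonl")
        self.fp = open(path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()


def read_jsonl(path):
    """Load a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]
```

Each line can then be parsed on its own, so a partially written file is still readable up to the last complete line.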

3. Storing data in a database

MongoDB

    from pymongo import MongoClient


    class MongoDBPipeline(object):
        '''MongoDB pipeline'''

        def open_spider(self, spider):
            '''Connect to the database.'''
            db_url = spider.settings.get('MONGODB_URL', 'mongodb://localhost:27017')
            db_name = spider.settings.get('MONGODB_DB_NAME', 'scrapy_default')
            self.db_client = MongoClient(db_url)
            self.db = self.db_client[db_name]
            self.collection = self.db[spider.name]

        def process_item(self, item, spider):
            '''Insert the item into the collection named after the spider.'''
            # insert_one adds an _id field to the dict it receives, so insert a copy
            # and pass the original item on to later pipelines.
            self.collection.insert_one(dict(item))
            return item

        def close_spider(self, spider):
            '''Close the connection.'''
            self.db_client.close()

MySQL: below is a pipeline I wrote earlier for storing epidemic data. It is less flexible than MongoDB.

    import pymysql


    class MysqlPipeline(object):
        '''MySQL pipeline'''

        def open_spider(self, spider):
            '''Connect to the database.'''
            mysql_db = spider.settings.get('MYSQL_DB')
            # PyMySQL 1.0+ accepts connection parameters as keyword arguments only.
            self.db = pymysql.connect(host=mysql_db['HOST'], user=mysql_db['USER'],
                                      password=mysql_db['PASSWORD'], database=mysql_db['NAME'])
            self.cursor = self.db.cursor()

        def process_item(self, item, spider):
            '''Insert the item.'''
            item = dict(item)  # convert the item to a plain dict
            # Pass the values separately so the driver quotes them,
            # instead of formatting them into the SQL string.
            sql = '''INSERT INTO china2020(date, suspectedCount, confirmedCount, curedCount, deadCount, seriousCount, suspectedIncr, confirmedIncr, curedIncr, deadIncr, seriousIncr) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
            values = (item['datadate'], item['suspectedCount'], item['confirmedCount'],
                      item['curedCount'], item['deadCount'], item['seriousCount'],
                      item['suspectedIncr'], item['confirmedIncr'], item['curedIncr'],
                      item['deadIncr'], item['seriousIncr'])
            try:
                self.cursor.execute(sql, values)
                self.db.commit()
                print('---------------MySQL write succeeded------------')
            except pymysql.MySQLError:
                self.db.rollback()
                print('---------------MySQL write failed------------')
            return item
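With pymysql, values can be passed to cursor.execute separately from the SQL text, which leaves quoting to the driver. The helper below is a sketch of building such a statement from a column-to-item-key mapping, so the column list is written only once. build_insert and COLUMNS are hypothetical names, and only three of the eleven columns are listed to keep it short:

```python
# Column name -> item key. Only three of the eleven columns are listed here.
COLUMNS = {
    "date": "datadate",
    "confirmedCount": "confirmedCount",
    "curedCount": "curedCount",
}


def build_insert(table, columns, item):
    """Return (sql, params) suitable for cursor.execute(sql, params).

    Table and column names must come from trusted code, never from scraped
    data; only the values are passed as parameters.
    """
    names = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({names}) VALUES ({placeholders})"
    params = tuple(item[key] for key in columns.values())
    return sql, params


sql, params = build_insert(
    "china2020",
    COLUMNS,
    {"datadate": "2020-02-01", "confirmedCount": 100, "curedCount": 5},
)
print(sql)
print(params)
```

The sample values are made up. In a pipeline, the returned pair would be handed to self.cursor.execute(sql, params).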

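4. Downloading images, videos, and other binary files

The code below is based on the official documentation. It stores images in directories according to their path, which is very handy.
Note that the storage path you defined has to be passed through meta and picked up in file_path.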

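    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline


    class DownloadPipeline(ImagesPipeline):
        '''Download images.'''

        def get_media_requests(self, item, info):
            # Request the image and pass the storage path along in meta.
            yield scrapy.Request(item['src'], meta={'name': item['picstore']})

        def item_completed(self, results, item, info):
            # results holds one (success, detail) tuple per request.
            if not results[0][0]:
                raise DropItem('Download failed')
            return item

        def file_path(self, request, response=None, info=None, *, item=None):
            # Receive the storage path passed through meta above.
            picname = request.meta['name']
            return picname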
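item_completed receives results as a list with one (success, detail) tuple per request yielded from get_media_requests; for a successful download, detail is a dict that includes the stored path. Checking only results[0][0] looks at the first request alone, which is enough here because one request is yielded per item. For items with several images, a sketch like the following collects every stored path. downloaded_paths is a hypothetical helper, and the sample results list is made up to mimic that shape:

```python
def downloaded_paths(results):
    """Return the stored paths of all successful downloads."""
    return [detail["path"] for ok, detail in results if ok]


# Made-up sample in the shape passed to item_completed.
sample_results = [
    (True, {"url": "https://example.com/1.jpg", "path": "site/t/d/1.jpg"}),
    (False, Exception("download error")),  # Scrapy passes a Failure object here
    (True, {"url": "https://example.com/2.jpg", "path": "site/t/d/2.jpg"}),
]

print(downloaded_paths(sample_results))
```

An item could then be dropped only when the returned list is empty, rather than whenever the first request fails.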
