Web Scraping: Persisting Data with the Scrapy Framework / Saving to CSV and JSON Files

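Contents

Review
- selenium + phantomjs/chrome/firefox
- Using the execjs module
Today's Notes
- The Scrapy framework
- A first try
- Maoyan movies example
- Key points
- Persisting data (MySQL)
  - Implementation steps
- Saving to CSV and JSON files
- Daomu Biji novel scraping example (three page levels)
Today's Tasks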

Review

selenium + phantomjs/chrome/firefox

Headless mode (chromedriver | firefox)

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
browser.get(url)

Running a JS script in the browser

browser.execute_script(
    'window.scrollTo(0,document.body.scrollHeight)'
)
time.sleep(2)

Common selenium operations

# 1. Keyboard operations
from selenium.webdriver.common.keys import Keys
node.send_keys(Keys.SPACE)
node.send_keys(Keys.CONTROL, 'a')
node.send_keys(Keys.CONTROL, 'c')
node.send_keys(Keys.CONTROL, 'v')
node.send_keys(Keys.ENTER)

# 2. Mouse operations
from selenium.webdriver import ActionChains
mouse_action = ActionChains(browser)
mouse_action.move_to_element(node)
mouse_action.perform()

# 3. Switching window handles
all_handles = browser.window_handles
browser.switch_to.window(all_handles[1])

# 4. iframe sub-frames
browser.switch_to.frame(iframe_element)

Using the execjs module

# 1. Install
sudo pip3 install pyexecjs

# 2. Use (run JS code)
import execjs

with open('file.js', 'r') as f:
    js = f.read()

obj = execjs.compile(js)
result = obj.eval('string')

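Today's Notes

The Scrapy framework

Definition

An asynchronous processing framework that is highly configurable and extensible, and one of the most widely used crawler frameworks in Python.

Installation

# Ubuntu
1. Install the dependencies
   1. sudo apt-get install libffi-dev
   2. sudo apt-get install libssl-dev
   3. sudo apt-get install libxml2-dev
   4. sudo apt-get install python3-dev
   5. sudo apt-get install libxslt1-dev
   6. sudo apt-get install zlib1g-dev
   7. sudo pip3 install -I -U service_identity
2. Install the Scrapy framework
   1. sudo pip3 install Scrapy

# Windows
From a cmd prompt (run as administrator): python -m pip install Scrapy
# Possible error: Microsoft Visual C++ 14.0 is required xxx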

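The five components of the Scrapy framework

1. Engine        : the core of the whole framework
2. Scheduler     : maintains the request queue
3. Downloader    : fetches the response objects
4. Spider        : parses and extracts the data
5. Item Pipeline : stores and processes the data
**********************************
# Downloader Middlewares: sit between the engine and the downloader; wrap requests (random proxies, etc.)
# Spider Middlewares: sit between the engine and the spider; can modify attributes of the response object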

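The Scrapy crawl workflow

# When the crawler project starts
1. The engine asks the spider for the first URL to crawl and hands it to the scheduler to be enqueued.
2. The scheduler processes the request, dequeues it, and passes it through the downloader middlewares to the downloader.
3. Once the downloader has the response object, it passes it through the spider middlewares to the spider.
4. The spider extracts the data:
   1. Data is handed to the pipeline file for storage.
   2. URLs that need to be followed are handed back to the scheduler to be enqueued, and the cycle repeats.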
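The four steps above can be pictured as a small loop. The following is a toy sketch in plain Python, not Scrapy's actual engine: FAKE_SITE, download, parse and crawl are all hypothetical names invented for the illustration, and the `seen` set only imitates the duplicate filtering a scheduler would do.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> (data on the page, links to follow)
FAKE_SITE = {
    "page1": ("data1", ["page2", "page3"]),
    "page2": ("data2", []),
    "page3": ("data3", ["page1"]),  # links back to a page already seen
}


def download(url):
    """Downloader: return a 'response' for the URL."""
    return FAKE_SITE[url]


def parse(response):
    """Spider: yield the extracted data first, then the URLs to follow."""
    data, links = response
    yield ("item", data)
    for link in links:
        yield ("request", link)


def crawl(start_url):
    queue = deque([start_url])  # scheduler queue
    seen = {start_url}          # imitates duplicate filtering
    stored = []                 # stand-in for the pipeline
    while queue:
        url = queue.popleft()   # step 2: dequeue and download
        for kind, value in parse(download(url)):  # steps 3 and 4
            if kind == "item":
                stored.append(value)      # step 4.1: data goes to the pipeline
            elif value not in seen:
                seen.add(value)
                queue.append(value)       # step 4.2: new URL is enqueued
    return stored


print(crawl("page1"))
```

The real framework does the same hand-offs asynchronously, so several downloads are in flight at once instead of one per loop iteration.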

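Common Scrapy commands

# 1. Create a crawler project (capitalize the first letter)
scrapy startproject <project name>
# 2. Create a spider file
scrapy genspider <spider name> <domain>
# 3. Run the spider
scrapy crawl <spider name>

Baidu                   # project folder
├── Baidu               # project directory
│   ├── items.py        # defines the data structure
│   ├── middlewares.py  # middlewares
│   ├── pipelines.py    # data processing
│   ├── settings.py     # global configuration
│   └── spiders
│       └── baidu.py    # spider file
└── scrapy.cfg          # basic project configuration file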

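The global configuration file settings.py in detail

# 1. Define the User-Agent
USER_AGENT = 'Mozilla/5.0'
# 2. Whether to obey the robots protocol; usually set to False
ROBOTSTXT_OBEY = False
# 3. Maximum number of concurrent requests; the default is 16
CONCURRENT_REQUESTS = 32
# 4. Download delay (seconds)
DOWNLOAD_DELAY = 1
# 5. Request headers; the User-Agent can also be added here
DEFAULT_REQUEST_HEADERS = {}
# 6. Item pipelines
ITEM_PIPELINES = {
    '<project directory name>.pipelines.<class name>': 300,
}
# 7. Spider middlewares (543 is the priority, range 1-1000; the smaller the number, the higher the priority)
SPIDER_MIDDLEWARES = {
    'Baidu.middlewares.BaiduSpiderMiddleware': 543,
}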
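To illustrate how the numbers in ITEM_PIPELINES are read: an item passes through the enabled pipelines in ascending order of value. The sketch below only mimics that ordering with a sort and does not call Scrapy; both pipeline paths are hypothetical names made up for the example.

```python
# Hypothetical pipeline paths; only the numbers matter here
ITEM_PIPELINES = {
    'Baidu.pipelines.BaiduPipeline': 300,
    'Baidu.pipelines.BaiduMysqlPipeline': 200,
}

# Lower value first: the order in which an item would visit the pipelines
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these values the 200 pipeline sees each item before the 300 one, which is why every process_item() has to return the item for the next pipeline.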

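Steps for creating a crawler project

1. Create the project: scrapy startproject <project name>
2. cd <project folder>
3. Create the spider file: scrapy genspider <file name> <domain>
4. Define the target data (items.py)
5. Write the spider (<file name>.py)
6. Write the pipeline file (pipelines.py)
7. Adjust the global configuration (settings.py)
8. Run the spider: scrapy crawl <spider name>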

Running a crawler project from PyCharm

1. Create begin.py (in the same directory as scrapy.cfg)
2. Contents of begin.py:
   from scrapy import cmdline
   cmdline.execute('scrapy crawl maoyan'.split())
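cmdline.execute() takes the command as a list of words, which is why begin.py calls .split() on the command string. A small sketch of what that list looks like, and of shlex.split() as one option when an argument contains a space; the output file name "top 100.json" is a hypothetical example.

```python
import shlex

# Simple commands: str.split() is enough
argv = 'scrapy crawl maoyan'.split()
print(argv)

# An argument containing a space needs shell-style splitting
argv_quoted = shlex.split('scrapy crawl maoyan -o "top 100.json"')
print(argv_quoted)
```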

A first try

Goal

Open the Baidu home page, scrape the title '百度一下，你就知道', and print it to the terminal
/html/head/title/text()

Implementation steps

1. Create the project Baidu and the spider file baidu

1. scrapy startproject Baidu
2. cd Baidu
3. scrapy genspider baidu www.baidu.com

2. Write the spider file baidu.py and extract the data with xpath

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # response: the response object for the start URL
        # extract_first(): serializes the selector list and returns the first element
        result = response.xpath('/html/head/title/text()').extract_first()
        print('*' * 50)
        print(result)
        print('*' * 50)

3. Global configuration, settings.py

USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

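4. Create run.py (in the same directory as scrapy.cfg)

A file under the Baidu directory

from scrapy import cmdline
# Equivalent to typing the command in a terminal (.split() turns the command string into a list of words)
cmdline.execute('scrapy crawl baidu'.split())

5. Start the crawler

Simply run the run.py file.

Think through what happens when it runs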

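Maoyan movies example

Goal

URL: Baidu search -> Maoyan Movies -> Charts -> Top 100 chart
Content: movie name, lead actors, release time

Implementation steps

1. Create the project and the spider file

# 1. Create the crawler project
# 2. Create the spider file
# https://maoyan.com/board/4?offset=0

2. Define the data structure to scrape (items.py)

name = scrapy.Field()
star = scrapy.Field()
time = scrapy.Field()

3. Write the spider file (maoyan.py)

1. Base xpath, matching the list of node objects, one per movie
   dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
2. for dd in dd_list:
       movie name   = dd.xpath('./a/@title')
       lead actors  = dd.xpath('.//p[@class="star"]/text()')
       release time = dd.xpath('.//p[@class="releasetime"]/text()')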

Implementation 1

# -*- coding: utf-8 -*-
import scrapy
from ..items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']
    offset = 0

    def parse(self, response):
        # response: the response object for a URL in start_urls
        # Base xpath: matches the list of dd nodes, one per movie
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        for dd in dd_list:
            # Create a fresh item per movie so yielded items do not share state
            item = MaoyanItem()
            # From version 1.6 on, use get(); assign with [], not with '.'
            item['name'] = dd.xpath('./a/@title').get()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get()
            # Hand the item to the pipeline
            yield item

        if self.offset < 90:
            self.offset += 10
            url = 'https://maoyan.com/board/4?offset=' + str(self.offset)
            # Hand the URL to the scheduler to be enqueued
            yield scrapy.Request(url=url, callback=self.parse)

Implementation 2

# -*- coding: utf-8 -*-
import scrapy
from ..items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']

    # Override Scrapy's start_requests() method:
    # build every page URL up front and hand them all to the scheduler
    def start_requests(self):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset=' + str(offset)
            # Hand the URL to the scheduler to be enqueued
            yield scrapy.Request(url=url, callback=self.parse_html)

    def parse_html(self, response):
        # response: the response object for one of the URLs requested above
        # Base xpath: matches the list of dd nodes, one per movie
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        for dd in dd_list:
            # Create a fresh item per movie so yielded items do not share state
            item = MaoyanItem()
            # From version 1.6 on, use get(); assign with [], not with '.'
            item['name'] = dd.xpath('./a/@title').get()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get()
            # Hand the item to the pipeline
            yield item
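The second implementation builds all ten page URLs before any response arrives. The URL arithmetic can be checked on its own without Scrapy; in this sketch BASE_URL and page_urls are hypothetical helper names, not part of the project.

```python
BASE_URL = 'https://maoyan.com/board/4?offset={}'


def page_urls(last_offset=90, step=10):
    """Return the URL of every page from offset 0 up to last_offset."""
    return [BASE_URL.format(offset) for offset in range(0, last_offset + 1, step)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Because the URLs do not depend on earlier responses, the scheduler can have them all queued at once, whereas implementation 1 requests each page only after the previous one has been parsed.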

4. Define the pipeline file (pipelines.py)

class MaoyanPipeline(object):
    def process_item(self, item, spider):
        # Print the data
        print(item["name"])
        return item

5. Global configuration file (settings.py)

BOT_NAME = 'Maoyan'
SPIDER_MODULES = ['Maoyan.spiders']
NEWSPIDER_MODULE = 'Maoyan.spiders'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'WARNING'  # suppress debug output
# Request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0',
}
# Enable the pipeline
ITEM_PIPELINES = {
    'Maoyan.pipelines.MaoyanPipeline': 300,
}

6. Create and run the file (run.py)

from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())

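Key points

node_object.xpath('')

1. Returns a list whose elements are selectors: [<Selector data='A'>]
2. list.extract(): serializes every selector in the list to a Unicode string: ['A', 'B', 'C']
3. list.extract_first() or get(): returns the first serialized element of the list (a string)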

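Log variables and log levels (settings.py)

# Log-related variables
LOG_LEVEL = 'WARNING'
LOG_FILE = '<file name>.log'

# Log levels
5 CRITICAL : critical errors
4 ERROR    : ordinary errors
3 WARNING  : warnings
2 INFO     : general information (checking middleware configuration, etc.)
1 DEBUG    : debugging information
# Note: only messages at the current level or at a more severe level are shown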
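The filtering rule in the note can be tried with Python's standard logging module, which uses the same level names. This is a sketch of the rule only, not of Scrapy's own logging setup; the logger name "level_demo" is a hypothetical choice, and Python's internal numeric values are 10 to 50 rather than the 1 to 5 ranking used above.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("level_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)          # plays the role of LOG_LEVEL = 'WARNING'
logger.propagate = False                  # keep the demo output in our own stream

handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s"))
logger.addHandler(handler)

# Emit one message at every level, from least to most severe
for level in (logging.DEBUG, logging.INFO, logging.WARNING,
              logging.ERROR, logging.CRITICAL):
    logger.log(level, "message")

shown = stream.getvalue().split()
print(shown)  # ['WARNING', 'ERROR', 'CRITICAL']
```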

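Using the pipeline file

1. In the spider file, instantiate the class defined in items.py and assign the scraped data to the object. The class must be imported with the relative `..` form, otherwise the import fails even though the file exists:
   from ..items import MaoyanItem
   item = MaoyanItem()
2. Write the pipeline file (pipelines.py)
3. Enable the pipeline (settings.py)
   ITEM_PIPELINES = { '<project directory name>.pipelines.<class name>': priority }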

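Persisting data (MySQL)

Implementation steps

1. Define the relevant variables in settings.py (database settings: host name, user name, etc.)
2. Import the settings module in pipelines.py
   def open_spider(self, spider):
       # Runs once when the spider starts; used to connect to the database
   def close_spider(self, spider):
       # Runs once when the spider finishes; used to close the database connection
3. Add this pipeline in settings.py
   ITEM_PIPELINES = {'': 200}
   # Note: process_item() must always return item ***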
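The open_spider / process_item / close_spider lifecycle can be sketched as follows. To keep the example runnable without a database server, it swaps MySQL for the standard library's sqlite3; DB_PATH, the class name MaoyanDbPipeline, the table name film and its columns are all assumptions made for the illustration, not part of the original project.

```python
import sqlite3

# Hypothetical setting; in a real project this would sit in settings.py
# next to the MySQL host, user and password values.
DB_PATH = ":memory:"


class MaoyanDbPipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts: open the connection
        self.db = sqlite3.connect(DB_PATH)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS film "
            "(name TEXT, star TEXT, release_time TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized insert: values are passed separately from the SQL text
        self.db.execute(
            "INSERT INTO film VALUES (?, ?, ?)",
            (item["name"], item["star"], item["time"]),
        )
        self.db.commit()
        return item  # must be returned so later pipelines receive it

    def close_spider(self, spider):
        # Runs once when the spider finishes: close the connection
        self.db.close()
```

A MySQL version would keep the same three methods and replace sqlite3.connect() with the MySQL driver's connect call fed from the settings values; with pymysql the placeholders are written %s instead of ?.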

Saving to CSV and JSON files

Command format

scrapy crawl maoyan -o maoyan.csv
scrapy crawl maoyan -o maoyan.json
# Set the export encoding in settings.py (mainly relevant to JSON files)
FEED_EXPORT_ENCODING = 'utf-8'
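The reason the encoding setting matters mostly for JSON: when it is left unset, JSON exports write non-ASCII characters as \uXXXX escape sequences. The standard json module shows the same effect, so the sketch below uses it as an analogy; it is not Scrapy's exporter code, and the sample item is made up.

```python
import json

item = {"name": "霸王别姬"}  # sample item for illustration

escaped = json.dumps(item)                       # non-ASCII written as \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # characters written as-is

print(escaped)
print(readable)
```

Both strings decode to the same data; the difference is only whether the file is readable when opened in an editor.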

Summary

Workflow

1. Create the project and the spider file

2. items.py: define the data structure

3. spider.py: parse the data

4. pipelines.py: process the data

5. settings.py: global configuration

6. run.py: run the crawler

Response methods and attributes

response.xpath('')

response.text: string

response.body: bytes

Selector lists

xxx.xpath().extract()

xxx.xpath().extract_first()

xxx.xpath().get()

Overriding the start_requests() method

Remove the start_urls variable

def start_requests(self)

Common settings.py variables

LOG_LEVEL = ''

LOG_FILE = ''

FEED_EXPORT_ENCODING = ''

Storing data

class XxxPipeline(object):
    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        # must return item
        return item

    def close_spider(self, spider):
        pass

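Daomu Biji novel scraping example (three page levels)

Goal

# Scrape the full text of every chapter of Daomu Biji volumes 1-8 from the target site and save it to local files
1. URL: http://www.daomubiji.com/

Preparing the xpath expressions

1. First-level page xpath:
   a nodes: //li[contains(@id,"menu-item-20")]/a
   title  : ./text()
   link   : ./@href
2. Second-level page base xpath: //article
   After iterating with a for loop:
   name = article.xpath('./a/text()').get()
   link = article.xpath('./a/@href').get()
3. Third-level page xpath:
   response.xpath('//article[@class="article-content"]//p/text()').extract()
# Result: ['p1', 'p2', 'p3', '']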
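The extracted list can end with empty strings, as in the result above. One way to turn it into clean text is to strip each entry and drop the empty ones before joining; join_paragraphs is a hypothetical helper name, and the spider in this example joins the list directly instead.

```python
def join_paragraphs(content_list):
    """Strip each text node, drop the empty ones, and join with newlines."""
    cleaned = (text.strip() for text in content_list)
    return "\n".join(text for text in cleaned if text)


print(join_paragraphs(['p1', 'p2', 'p3', '']))
```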

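Project implementation

1. Create the project and the spider file

1. Create the project:
2. Create the spider:

2. Define the data structure to scrape - items.py

import scrapy


class DaomuItem(scrapy.Item):
    # Decide which fields the pipeline needs when it processes the data
    # 1. First-level page title - needed to create the folder
    title = scrapy.Field()
    # 2. Second-level page title - needed to create the file
    name = scrapy.Field()
    # 3. Novel content
    content = scrapy.Field()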

3. Spider file that scrapes the data - daomu.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import DaomuItem


class DaomuSpider(scrapy.Spider):
    name = 'daomu'
    allowed_domains = ['www.daomubiji.com']
    start_urls = ['http://www.daomubiji.com/']

    def parse(self, response):
        a_list = response.xpath('//li[contains(@id,"menu-item-20")]/a')
        for a in a_list:
            item = DaomuItem()
            item["title"] = a.xpath('./text()').get()
            link = a.xpath('./@href').get()
            # Hand the link to the scheduler to be enqueued
            yield scrapy.Request(
                url=link,
                # meta: passes data between different parse callbacks
                meta={"item": item},
                callback=self.parse_two_html
            )

    # Parse the second-level page
    def parse_two_html(self, response):
        # Retrieve the item
        item = response.meta['item']
        a_list = response.xpath('//article')
        for article in a_list:
            name = article.xpath('./a/text()').get()
            two_link = article.xpath('./a/@href').get()
            # Hand the link to the scheduler to be enqueued
            yield scrapy.Request(
                url=two_link,
                meta={"item": item, "name": name},
                callback=self.parse_three_html
            )

    # Parse the third-level page
    def parse_three_html(self, response):
        # Copy the item so chapters of the same volume do not share one object
        item = response.meta['item'].copy()
        item["name"] = response.meta['name']
        content_list = response.xpath('//article[@class="article-content"]//p/text()').extract()
        item["content"] = "\n".join(content_list)
        return item

4. Pipeline file that processes the data - pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os


class DaomuPipeline(object):
    def process_item(self, item, spider):
        # Create the folder for this volume if it does not exist yet
        directory = '/home/tarena/novel/{}/'.format(item["title"])
        if not os.path.exists(directory):
            os.makedirs(directory)

        filename = directory + '{}.txt'.format(item["name"])
        # Write as UTF-8 so the Chinese text is saved the same way on every platform
        with open(filename, "w", encoding="utf-8") as f:
            f.write(item["content"])
        print(item["name"] + "*" * 10 + " download finished!")
        return item
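The file-writing step of the pipeline can be tried on its own. The sketch below is a variant under stated assumptions, not the pipeline itself: save_chapter is a hypothetical helper, it builds paths with os.path.join and relies on exist_ok=True instead of checking os.path.exists first, and it writes into a temporary directory rather than /home/tarena/novel/.

```python
import os
import tempfile


def save_chapter(base_dir, title, name, content):
    """Write one chapter to <base_dir>/<title>/<name>.txt and return the path."""
    directory = os.path.join(base_dir, title)
    os.makedirs(directory, exist_ok=True)  # no error if the folder already exists
    filename = os.path.join(directory, '{}.txt'.format(name))
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)
    return filename


# Demo in a throwaway directory; the volume and chapter names are made up
with tempfile.TemporaryDirectory() as tmp:
    path = save_chapter(tmp, 'volume1', 'chapter1', 'first line\nsecond line')
    with open(path, encoding='utf-8') as f:
        saved = f.read()

print(saved)
```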

5. Global configuration - settings.py

BOT_NAME = 'Daomu'

SPIDER_MODULES = ['Daomu.spiders']
NEWSPIDER_MODULE = 'Daomu.spiders'

ROBOTSTXT_OBEY = False
LOG_LEVEL = 'WARNING'  # suppress debug output

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0',
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Daomu.pipelines.DaomuPipeline': 300,
}

6. Run file - run.py

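Today's Tasks

1. What are the main components of the Scrapy framework, and how do they work together?
2. Try rewriting the Tencent recruitment crawler with Scrapy
   response.text: gets the content of the page response
3. Try rewriting the Douban movies crawler with Scrapy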
