Crawler Development 10: Log Levels and Passing Data Between Requests in the Scrapy Framework

Today's overview

Log levels
Passing data between requests

Today's details

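1. Scrapy log levels

  - When you run a spider with scrapy crawl spiderName (the argument is the spider's name attribute, not the name of the file), what is printed to the terminal is Scrapy's log output.

  - Log levels, from most to least severe:

        CRITICAL: critical errors

        ERROR: general errors

        WARNING: warnings

        INFO: general information

        DEBUG: debugging information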

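  - Controlling the log output:

    In the settings.py configuration file, add

LOG_LEVEL = 'ERROR'

    (or any other level name). Only messages at that level or above are then logged. The default level is DEBUG.

LOG_FILE = 'log.txt' writes the log to the given file instead of printing it to the terminal.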
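Scrapy's logging is built on Python's standard logging module, so the effect of a level threshold can be sketched with the standard library alone. The helper name emitted_levels and the logger names below are made up for illustration; this is not Scrapy's own code, only a small demonstration of how a threshold such as LOG_LEVEL filters messages.

```python
import io
import logging


def emitted_levels(threshold):
    """Return the names of the levels that get through a given threshold."""
    stream = io.StringIO()
    # Hypothetical logger name, one per threshold so calls do not interfere.
    logger = logging.getLogger("demo." + threshold)
    logger.setLevel(getattr(logging, threshold))
    logger.propagate = False  # keep the demo output out of the root logger

    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.handlers = [handler]

    # Emit one message at each level; the logger drops those below the threshold.
    for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
        logger.log(getattr(logging, level), "message")
    return stream.getvalue().split()


if __name__ == "__main__":
    print(emitted_levels("WARNING"))
```

With the threshold set to WARNING, only the WARNING and ERROR messages are kept, which is the same behaviour LOG_LEVEL = 'WARNING' gives in settings.py.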

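2. Passing data between requests

  - Sometimes the data we want is not all on one page. On a movie site, for example, the title and score appear on the first-level (listing) page, while the remaining details are on each movie's second-level (detail) page. The partially filled item then has to travel with the request for the detail page, so that the callback parsing that page can finish filling it in. This is what passing data between requests, through the meta argument of scrapy.Request, is for.

  - Example: crawl the movie site www.id97.com, scraping each movie's name, genre, and score from the first-level page, and its release date, director, and running time from the second-level page.

  The spider file: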

# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.id97.com']
    start_urls = ['http://www.id97.com/']

    def parse(self, response):
        div_list = response.xpath('//div[@class="col-xs-1-5 movie-item"]')
        for div in div_list:
            item = MovieproItem()
            item['name'] = div.xpath('.//h1/a/text()').extract_first()
            item['score'] = div.xpath('.//h1/em/text()').extract_first()
            # xpath('string(.)') returns the text of all descendant nodes of the
            # current node; '.' denotes the current node
            item['kind'] = div.xpath('.//div[@class="otherinfo"]').xpath('string(.)').extract_first()
            item['detail_url'] = div.xpath('./div/a/@href').extract_first()
            # Request the second-level detail page and parse it in parse_detail;
            # the meta argument carries the item along with the Request
            yield scrapy.Request(url=item['detail_url'], callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        # Retrieve the item passed in through response.meta
        item = response.meta['item']
        item['actor'] = response.xpath('//div[@class="row"]//table/tr[1]/a/text()').extract_first()
        item['time'] = response.xpath('//div[@class="row"]//table/tr[7]/td[2]/text()').extract_first()
        item['long'] = response.xpath('//div[@class="row"]//table/tr[8]/td[2]/text()').extract_first()
        # Hand the completed item over to the pipeline
        yield item
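The meta hand-off used in the spider can be looked at in isolation. The following is a minimal sketch that runs without Scrapy: FakeRequest, FakeResponse, PAGES, and crawl are hypothetical stand-ins invented for this illustration (they are not Scrapy classes), and the page contents are made up. It only shows the idea that whatever is placed in a request's meta is available again on the response given to the callback.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request (not the real class)."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response."""
    url: str
    data: dict
    meta: dict = field(default_factory=dict)


# Made-up page contents keyed by URL, replacing real HTTP downloads.
PAGES = {
    "list": {"movies": [{"name": "Movie A", "detail_url": "detail/a"}]},
    "detail/a": {"time": "2019-01-01"},
}


def parse(response):
    # First-level page: start the item, then pass it along via meta.
    for movie in response.data["movies"]:
        item = {"name": movie["name"]}
        yield FakeRequest(movie["detail_url"], parse_detail, meta={"item": item})


def parse_detail(response):
    # Second-level page: pick the item back up and finish filling it in.
    item = response.meta["item"]
    item["time"] = response.data["time"]
    yield item


def crawl(start_url):
    """Tiny engine: follow requests, copy meta onto the response, collect items."""
    items = []
    queue = [FakeRequest(start_url, parse)]
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request.url, PAGES[request.url], request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    print(crawl("list"))
```

The item created in parse comes out of parse_detail with both fields set, which mirrors how the real spider combines first-level and second-level data into one item.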

  The items file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    score = scrapy.Field()
    time = scrapy.Field()
    long = scrapy.Field()
    actor = scrapy.Field()
    kind = scrapy.Field()
    detail_url = scrapy.Field()

  The pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class MovieproPipeline(object):
    def __init__(self):
        # Open as UTF-8 so text written with ensure_ascii=False is stored correctly
        self.fp = open('data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        dic = dict(item)
        print(dic)
        # Write one JSON object per line so the records stay separable
        json.dump(dic, self.fp, ensure_ascii=False)
        self.fp.write('\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
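As the template comment in the pipeline file says, the pipeline only runs once it is enabled in settings.py. A minimal sketch of that entry follows. The module path moviePro.pipelines is an assumption inferred from the spider's import of moviePro.items, and 300 is an arbitrary priority (lower values run earlier).

```python
# settings.py (fragment)
# Assumed module path: the pipeline class is taken to live in moviePro/pipelines.py.
ITEM_PIPELINES = {
    "moviePro.pipelines.MovieproPipeline": 300,  # priority, lower runs first
}
```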

Reposted from: https://www.cnblogs.com/sunny666/p/10542647.html
