Introduction to Items and How to Use Them - 2.0
Using items
We will learn how to use items by scraping the Sunshine Hotline inquiry platform (阳光热线问政平台).
Goal: the number, link, title, and content of every complaint post
URL: http://wz.sun0769.com/political/index/politicsNewest?id=1
About the site
1. Each post corresponds to one `<li>` tag
2. Find the URL whose response carries the data
3. The URL changes with the page number according to a regular pattern:
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=3
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=<page number>
4. The detail-page request is a GET request
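The page-number pattern observed above makes it easy to generate the listing URLs in code; a quick sketch:

```python
# Build the listing-page URLs following the pattern observed above
base = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page={}"
urls = [base.format(page) for page in range(1, 4)]
for url in urls:
    print(url)
```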
Project layout
sun.py
```python
# -*- coding: utf-8 -*-
import scrapy

from ..items import SunItem


class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    def parse(self, response):
        # Group the listing into one <li> node per post
        li_list = response.xpath('//div[@class="width-12"]/ul[@class="title-state-ul"]//li[@class="clear"]')
        for li in li_list:
            item = SunItem()
            item["num"] = li.xpath('.//span[@class="state1"]/text()').extract_first()
            item["title"] = li.xpath('.//span[@class="state3"]/a[1]/text()').extract_first()
            # extract_first() may return None, so guard before stripping
            item["response_time"] = (li.xpath('.//span[@class="state4"]/text()').extract_first() or "").strip()
            item["response_time"] = item["response_time"].split(":")[-1]
            item["ask_time"] = li.xpath('.//span[@class="state5 "]/text()').extract_first()
            item["detail_url"] = "http://wz.sun0769.com" + li.xpath('.//span[@class="state3"]/a[1]/@href').extract_first()
            yield scrapy.Request(
                item["detail_url"],
                callback=self.parse_detail_url,
                meta={"item": item}  # pass the partially filled item via meta
            )
        # Pagination
        for page in range(2, 4):
            next_url = f"http://wz.sun0769.com/political/index/politicsNewest?id=1&page={page}"
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_detail_url(self, response):
        """Parse the detail page."""
        item = response.meta["item"]  # retrieve the item passed via meta
        item["content"] = response.xpath('//div[@class="details-box"]/pre/text()').extract_first()
        # Note: there may be several images or videos, or none at all
        item["img"] = response.xpath('//div[@class="clear details-img-list Picture-img"]/img/@src').extract()
        item["video"] = response.xpath('//div[@class="vcp-player"]/video/@src').extract()
        yield item
```
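The spider strips a label prefix from `response_time` with `split(":")[-1]`; the trick in isolation (the sample string is made up):

```python
# split(":")[-1] keeps everything after the last colon,
# dropping a "label:" prefix such as the one on the listing page
raw = "Response time:2020-07-10"
print(raw.split(":")[-1])  # -> 2020-07-10
```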
items.py
Define which fields we want to scrape
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunItem(scrapy.Item):
    # define the fields for your item here like:
    num = scrapy.Field()
    title = scrapy.Field()
    response_time = scrapy.Field()
    ask_time = scrapy.Field()
    detail_url = scrapy.Field()
    content = scrapy.Field()
    img = scrapy.Field()
    video = scrapy.Field()
```
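The point of declaring fields is that `scrapy.Item` behaves like a dict but rejects keys that were never declared, catching typos early. A minimal pure-Python sketch of that behaviour (not Scrapy's actual implementation):

```python
class StrictItem(dict):
    """Dict that only accepts a fixed set of field names, like scrapy.Item."""
    fields = {"num", "title", "response_time", "ask_time",
              "detail_url", "content", "img", "video"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)

item = StrictItem()
item["num"] = "1"          # fine: declared field
try:
    item["author"] = "x"   # rejected: not declared
except KeyError:
    print("undeclared field rejected")
```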
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import logging

logger = logging.getLogger(__name__)


class SunPipeline:
    def process_item(self, item, spider):
        if item.get("content"):  # content may be missing on some detail pages
            item["content"] = self.process_content(item["content"])
        # logger.warning(item)
        return item

    def process_content(self, content):
        """Clean up the content string in the item."""
        new_content = content.replace("\r\n", "").replace("\xa0", "")
        return new_content
```
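The `process_content` cleanup can be exercised on its own; the sample string here is made up:

```python
# Same cleanup as the pipeline's process_content, runnable standalone
def process_content(content):
    # Remove Windows line breaks and non-breaking spaces
    return content.replace("\r\n", "").replace("\xa0", "")

print(process_content("Line one\r\n\xa0Line two"))  # -> Line oneLine two
```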
settings.py
```python
# -*- coding: utf-8 -*-

# Scrapy settings for Sun project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Sun'

SPIDER_MODULES = ['Sun.spiders']
NEWSPIDER_MODULE = 'Sun.spiders'

LOG_LEVEL = "WARN"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Sun.middlewares.SunSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Sun.middlewares.SunDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Sun.pipelines.SunPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
cmd
```shell
scrapy startproject Sun
cd Sun
scrapy genspider sun
scrapy crawl sun
```