Introduction to Items and How to Use Them - 2.0
Using items
We will learn how to use items by scraping the Sunshine Hotline inquiry platform (阳光热线问政平台).
Goal: the number, link, title, and content of every complaint post
URL: http://wz.sun0769.com/political/index/politicsNewest?id=1
About the site
1. Each post corresponds to one `<li>` tag
2. Find the URL whose response carries the data
3. The URL changes with the page number according to a regular pattern:
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=3
- http://wz.sun0769.com/political/index/politicsNewest?id=1&page=<page number>
4. The detail-page request is a GET request
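The page-number pattern observed above makes it easy to generate the listing URLs in code; a quick sketch:

```python
# Build the listing-page URLs following the pattern observed above
base = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page={}"
urls = [base.format(page) for page in range(1, 4)]
for url in urls:
    print(url)
```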
Project layout
sun.py
```python
# -*- coding: utf-8 -*-
import scrapy

from ..items import SunItem


class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    def parse(self, response):
        # Group the listing into one <li> node per post
        li_list = response.xpath('//div[@class="width-12"]/ul[@class="title-state-ul"]//li[@class="clear"]')
        for li in li_list:
            item = SunItem()
            item["num"] = li.xpath('.//span[@class="state1"]/text()').extract_first()
            item["title"] = li.xpath('.//span[@class="state3"]/a[1]/text()').extract_first()
            # extract_first() may return None, so guard before stripping
            item["response_time"] = (li.xpath('.//span[@class="state4"]/text()').extract_first() or "").strip()
            item["response_time"] = item["response_time"].split(":")[-1]
            item["ask_time"] = li.xpath('.//span[@class="state5 "]/text()').extract_first()
            item["detail_url"] = "http://wz.sun0769.com" + li.xpath('.//span[@class="state3"]/a[1]/@href').extract_first()
            yield scrapy.Request(
                item["detail_url"],
                callback=self.parse_detail_url,
                meta={"item": item}  # pass the partially filled item via meta
            )
        # Pagination
        for page in range(2, 4):
            next_url = f"http://wz.sun0769.com/political/index/politicsNewest?id=1&page={page}"
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_detail_url(self, response):
        """Parse the detail page."""
        item = response.meta["item"]  # retrieve the item passed via meta
        item["content"] = response.xpath('//div[@class="details-box"]/pre/text()').extract_first()
        # Note: there may be several images or videos, or none at all
        item["img"] = response.xpath('//div[@class="clear details-img-list Picture-img"]/img/@src').extract()
        item["video"] = response.xpath('//div[@class="vcp-player"]/video/@src').extract()
        yield item
```
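The spider strips a label prefix from `response_time` with `split(":")[-1]`; the trick in isolation (the sample string is made up):

```python
# split(":")[-1] keeps everything after the last colon,
# dropping a "label:" prefix such as the one on the listing page
raw = "Response time:2020-07-10"
print(raw.split(":")[-1])  # -> 2020-07-10
```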
items.py
Define which fields we want to scrape
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunItem(scrapy.Item):
    # define the fields for your item here like:
    num = scrapy.Field()
    title = scrapy.Field()
    response_time = scrapy.Field()
    ask_time = scrapy.Field()
    detail_url = scrapy.Field()
    content = scrapy.Field()
    img = scrapy.Field()
    video = scrapy.Field()
```
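The point of declaring fields is that `scrapy.Item` behaves like a dict but rejects keys that were never declared, catching typos early. A minimal pure-Python sketch of that behaviour (not Scrapy's actual implementation):

```python
class StrictItem(dict):
    """Dict that only accepts a fixed set of field names, like scrapy.Item."""
    fields = {"num", "title", "response_time", "ask_time",
              "detail_url", "content", "img", "video"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)

item = StrictItem()
item["num"] = "1"          # fine: declared field
try:
    item["author"] = "x"   # rejected: not declared
except KeyError:
    print("undeclared field rejected")
```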
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import logging

logger = logging.getLogger(__name__)


class SunPipeline:
    def process_item(self, item, spider):
        if item.get("content"):  # content may be missing on some detail pages
            item["content"] = self.process_content(item["content"])
        # logger.warning(item)
        return item

    def process_content(self, content):
        """Clean up the content string in the item."""
        new_content = content.replace("\r\n", "").replace("\xa0", "")
        return new_content
```
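The `process_content` cleanup can be exercised on its own; the sample string here is made up:

```python
# Same cleanup as the pipeline's process_content, runnable standalone
def process_content(content):
    # Remove Windows line breaks and non-breaking spaces
    return content.replace("\r\n", "").replace("\xa0", "")

print(process_content("Line one\r\n\xa0Line two"))  # -> Line oneLine two
```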
settings.py
```python
# -*- coding: utf-8 -*-

# Scrapy settings for Sun project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Sun'

SPIDER_MODULES = ['Sun.spiders']
NEWSPIDER_MODULE = 'Sun.spiders'

LOG_LEVEL = "WARN"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Sun.middlewares.SunSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Sun.middlewares.SunDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Sun.pipelines.SunPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
cmd
```shell
scrapy startproject Sun
cd Sun
scrapy genspider sun
scrapy crawl sun
```