Using Items

We learn how to use items by scraping the Sunshine Hotline politics platform (阳光热线问政平台).
Goal: the number, link, title, and content of every complaint post.
URL: http://wz.sun0769.com/political/index/politicsNewest?id=1

About the site

1. Each post corresponds to one `li` tag.

2. Find the URL that the post data is actually served from.

3. The URL varies with the page number in a simple pattern (see the quick check after this list):

   • http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1
   • http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2
   • http://wz.sun0769.com/political/index/politicsNewest?id=1&page=3
   • http://wz.sun0769.com/political/index/politicsNewest?id=1&page=<page number>

4. The request for a post's detail page is a plain GET request.
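Points 3 and 4 are easy to verify before writing any spider code. A minimal sketch, assuming the third-party `requests` package is installed (it is not part of the Scrapy project itself):

```python
import requests

# Walk the first three list pages; both list and detail pages answer to a
# plain GET, so a 200 status here confirms points 3 and 4 above.
for page in range(1, 4):
    url = f"http://wz.sun0769.com/political/index/politicsNewest?id=1&page={page}"
    resp = requests.get(url)
    print(page, resp.status_code, len(resp.text))
```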

Project files

sun.py

```python
# -*- coding: utf-8 -*-
import scrapy
from ..items import SunItem


class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    def parse(self, response):
        # Each <li class="clear"> in the list is one post
        li_list = response.xpath('//div[@class="width-12"]/ul[@class="title-state-ul"]//li[@class="clear"]')
        for li in li_list:
            item = SunItem()
            item["num"] = li.xpath('.//span[@class="state1"]/text()').extract_first()
            item["title"] = li.xpath('.//span[@class="state3"]/a[1]/text()').extract_first()
            # default="" keeps .strip() from crashing when the span is missing
            item["response_time"] = li.xpath('.//span[@class="state4"]/text()').extract_first(default="").strip()
            item["response_time"] = item["response_time"].split(":")[-1]
            # the trailing space in "state5 " matches the site's actual class attribute
            item["ask_time"] = li.xpath('.//span[@class="state5 "]/text()').extract_first()
            item["detail_url"] = "http://wz.sun0769.com" + li.xpath('.//span[@class="state3"]/a[1]/@href').extract_first()
            yield scrapy.Request(
                item["detail_url"],
                callback=self.parse_detail_url,
                meta={"item": item}  # pass the half-filled item to the detail callback
            )
        # Paging: fetch pages 2 and 3 as well (widen the range for more pages)
        for page in range(2, 4):
            next_url = f"http://wz.sun0769.com/political/index/politicsNewest?id=1&page={page}"
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_detail_url(self, response):
        """Handle the detail-page data."""
        item = response.meta["item"]  # take the item back out of meta
        item["content"] = response.xpath('//div[@class="details-box"]/pre/text()').extract_first()
        # Note: there may be several images/videos, or none at all, hence .extract()
        item["img"] = response.xpath('//div[@class="clear details-img-list Picture-img"]/img/@src').extract()
        item["video"] = response.xpath('//div[@class="vcp-player"]/video/@src').extract()
        yield item
```
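The hand-off above relies on `meta`. On Scrapy 1.7 and later, `cb_kwargs` passes the item straight into the callback's signature instead. A minimal sketch of the same pattern (the spider name `sun_sketch` is made up for illustration, and this assumes Scrapy >= 1.7):

```python
import scrapy


class SunSketchSpider(scrapy.Spider):
    """Sketch only: the same list-page/detail-page hand-off via cb_kwargs."""
    name = 'sun_sketch'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    def parse(self, response):
        for li in response.xpath('//li[@class="clear"]'):
            item = {"title": li.xpath('.//span[@class="state3"]/a[1]/text()').extract_first()}
            detail_url = response.urljoin(li.xpath('.//span[@class="state3"]/a[1]/@href').extract_first())
            # cb_kwargs delivers item as a plain keyword argument of the callback
            yield scrapy.Request(detail_url, callback=self.parse_detail, cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        item["content"] = response.xpath('//div[@class="details-box"]/pre/text()').extract_first()
        yield item
```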

items.py

Here we declare which fields we want to scrape.

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunItem(scrapy.Item):
    # define the fields for your item here like:
    num = scrapy.Field()
    title = scrapy.Field()
    response_time = scrapy.Field()
    ask_time = scrapy.Field()
    detail_url = scrapy.Field()
    content = scrapy.Field()
    img = scrapy.Field()
    video = scrapy.Field()
```
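The point of declaring fields is that a `scrapy.Item` behaves like a dict but rejects any key you never declared, which catches typos early. A quick sketch, assuming the project is named Sun as in this post:

```python
from Sun.items import SunItem

item = SunItem()
item["title"] = "test post"   # fine: "title" is a declared field
print(item["title"])          # dict-style access works
print(dict(item))             # {'title': 'test post'}

try:
    item["tittle"] = "oops"   # misspelled, never declared
except KeyError as err:
    print("rejected:", err)   # scrapy.Item raises KeyError for unknown fields
```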
    

pipelines.py

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import logging

logger = logging.getLogger(__name__)


class SunPipeline:
    def process_item(self, item, spider):
        item["content"] = self.process_content(item["content"])
        # logger.warning(item)
        return item

    def process_content(self, content):
        """Clean up the content string in the item."""
        if content is None:  # extract_first() returns None when the <pre> is missing
            return ""
        new_content = content.replace("\r\n", "").replace("\xa0", "")
        return new_content
```
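Pipelines are also the natural place to persist items. A sketch of a second pipeline that writes each item out as one JSON line; the class name and output file name here are illustrative, and it would need its own ITEM_PIPELINES entry (e.g. at priority 400, so it runs after SunPipeline has cleaned the content):

```python
import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("sun_items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```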
    

settings.py


```python
# -*- coding: utf-8 -*-

# Scrapy settings for Sun project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Sun'

SPIDER_MODULES = ['Sun.spiders']
NEWSPIDER_MODULE = 'Sun.spiders'

LOG_LEVEL = "WARN"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Sun.middlewares.SunSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Sun.middlewares.SunDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Sun.pipelines.SunPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
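Only two lines here are project-specific: LOG_LEVEL = "WARN" silences everything below warnings, and ITEM_PIPELINES switches SunPipeline on (lower numbers run earlier in the pipeline chain). If you prefer to keep such tweaks out of the global file, Scrapy also accepts them per spider via the `custom_settings` class attribute; a minimal sketch, with a hypothetical spider name:

```python
import scrapy


class QuietSpider(scrapy.Spider):
    """Sketch only: custom_settings overrides settings.py for this one spider."""
    name = 'quiet'  # hypothetical name, for illustration
    custom_settings = {
        "LOG_LEVEL": "WARN",
        "DOWNLOAD_DELAY": 1,  # be gentler with the site than the default of 0
    }
```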

cmd

```
scrapy startproject Sun
cd Sun
scrapy genspider sun sun0769.com
scrapy crawl sun
```
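As a quick alternative to a writer pipeline, Scrapy's built-in feed exporter can dump the scraped items straight to a file: `scrapy crawl sun -o sun.json`.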
    
