Target: http://wz.sun0769.com/political/index/politicsNewest
Requirement: scrape each post's title, link, and publish time from the list pages, plus the content of each detail page.
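The post does not show how the project skeleton was created; presumably the standard Scrapy scaffolding was used (the project name yangguang and spider name yg come from the files below, and the domain argument matches the commented-out allowed_domains in yg.py):

scrapy startproject yangguang
cd yangguang
scrapy genspider yg www.ya.com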

settings.py:

# Scrapy settings for yangguang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yangguang'

SPIDER_MODULES = ['yangguang.spiders']
NEWSPIDER_MODULE = 'yangguang.spiders'

LOG_LEVEL = 'ERROR'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.68'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yangguang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
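The only departures from the stock template above are LOG_LEVEL, USER_AGENT, ROBOTSTXT_OBEY = False, and ITEM_PIPELINES. As a side note not in the original post: since Scrapy 2.1 the FEEDS setting can also export items straight to a file, with no custom pipeline at all. A minimal sketch (the output file name is arbitrary):

FEEDS = {
    'yangguang.json': {   # hypothetical output path, not from the post
        'format': 'json',
        'encoding': 'utf8',
    },
}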

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class YangguangPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
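This pipeline only prints each item. A common next step is to persist items instead; here is a minimal sketch (my own naming, not part of the original project) of a pipeline that appends one JSON object per line, using the open_spider/close_spider hooks that Scrapy calls automatically:

import json

from itemadapter import ItemAdapter


class JsonLinesPipeline:
    # Hypothetical pipeline, not in the original post.
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.fp = open('yangguang.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese text readable
        line = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.fp.write(line + '\n')
        return item

To use it, it would be registered in ITEM_PIPELINES alongside (or instead of) YangguangPipeline.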

yg.py:

import scrapy
from yangguang.items import YangguangItem


class YgSpider(scrapy.Spider):
    name = 'yg'
    # allowed_domains = ['www.ya.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest']
    # url = 'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1'
    page_num = 2

    def parse(self, response):
        # grouping: one <li> per post on the list page
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            item = YangguangItem()
            item['title'] = li.xpath('./span[3]/a/text()').extract_first()
            item['href'] = 'http://wz.sun0769.com' + li.xpath('./span[3]/a/@href').extract_first()
            item['publish_time'] = li.xpath('./span[5]/text()').extract_first()
            # hand the half-filled item to the detail-page callback via meta
            yield scrapy.Request(item['href'], callback=self.parse_detail, meta={'item': item})

        # pagination: keep requesting the next list page until page_num exceeds 5
        new_url = f'http://wz.sun0769.com/political/index/politicsNewest?id=1&page={self.page_num}'
        self.page_num += 1
        if self.page_num <= 5:
            yield scrapy.Request(new_url, callback=self.parse)

    def parse_detail(self, response):
        # handle the detail page
        item = response.meta['item']
        item['content'] = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre//text()').extract()
        # print(item)
        yield item
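yg.py imports YangguangItem from yangguang.items, but the post never shows items.py. A minimal reconstruction consistent with the four fields the spider assigns:

items.py (reconstructed, not shown in the original post):

import scrapy


class YangguangItem(scrapy.Item):
    title = scrapy.Field()         # post title from the list page
    href = scrapy.Field()          # absolute URL of the detail page
    publish_time = scrapy.Field()  # publish time from the list page
    content = scrapy.Field()       # detail-page body text, filled in parse_detail

With these files in place, running scrapy crawl yg from the project root starts the crawl; since page_num starts at 2 and pagination stops once it exceeds 5, the spider covers list pages 1 through 4.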
