Table of contents

Preface

Environment setup

Recommended plugin

Crawl target

Project creation

webdriver deployment

Project code

Item definition

Middleware definition

Defining the spider

Pipeline: writing the results to text

Settings changes

Verifying the results

Summary


Preface

Out of boredom, I wrote a crawler to fetch Baidu's COVID-19 data. To be clear, this is purely for research. Also, the page is likely to change frequently as anti-crawling measures are applied, so the corresponding XPath expressions may need adjusting.

GitHub repository: code repository

This article mainly uses the Scrapy framework.

Environment setup

This section just makes a few quick recommendations.
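As a minimal sketch of the environment, the two packages this project relies on can be installed with pip (versions are left unpinned here; the chrome_options style used in the code below matches Selenium 3.x):

pip install scrapy selenium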

Recommended plugin

Let me first recommend a Google Chrome extension, XPath Helper, which can verify whether your XPath syntax is correct.
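For example, you can paste the province-table expression used later in this article into XPath Helper and confirm it highlights the first table row (the element id and structure reflect the page at the time of writing):

//*[@id='nationTable']/table/tbody/tr[1]/td/text()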

Crawl target

Page to crawl: 实时更新:新型冠状病毒肺炎疫情地图 (Baidu's live-updating COVID-19 map).

The crawl targets chosen are the nationwide figures and the figures for each province.

Project creation

Create the project with the scrapy command:

scrapy startproject yqsj
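For reference, this generates Scrapy's standard project layout; the files edited in the rest of this article all live here:

yqsj/
    scrapy.cfg
    yqsj/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py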

webdriver deployment

I won't go through this again here; you can refer to the deployment steps in my earlier article: (Scrapy框架)爬虫2021年CSDN全站综合热榜标题热词 | 爬虫案例_阿良的博客-CSDN博客
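Once chromedriver is in place, a minimal sanity check looks like this (the executable_path is the one used later in the spider; adjust it to wherever your chromedriver lives):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=options,
                           executable_path="E:\\chromedriver_win32\\chromedriver.exe")
browser.get("https://voice.baidu.com/act/newpneumonia/newpneumonia#tab0")
print(browser.title)  # a page title here means Selenium and chromedriver are wired up
browser.quit()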

Project code

Time to write some code. First, a look at the issue with Baidu's province-level data.

The page only shows the full table after you click the "expand all" span, so when extracting the page source we have to open the page in a simulated browser and click that button first. With that approach settled, let's work through it step by step.
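Reduced to its essence, the interaction is just a Selenium click before grabbing the source (the downloader middleware below does exactly this, with waits added):

browser.get(url)
browser.find_element_by_xpath("//*[@id='nationTable']/div/span").click()  # expand the full table
html = browser.page_source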

Item definition

Define two classes, YqsjProvinceItem and YqsjChinaItem, for the per-province data and the national data respectively.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class YqsjProvinceItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    location = scrapy.Field()
    new = scrapy.Field()
    exist = scrapy.Field()
    total = scrapy.Field()
    cure = scrapy.Field()
    dead = scrapy.Field()


class YqsjChinaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # active confirmed cases
    exist_diagnosis = scrapy.Field()
    # asymptomatic cases
    asymptomatic = scrapy.Field()
    # current suspected cases
    exist_suspecte = scrapy.Field()
    # current severe cases
    exist_severe = scrapy.Field()
    # cumulative confirmed cases
    cumulative_diagnosis = scrapy.Field()
    # imported from overseas
    overseas_input = scrapy.Field()
    # cumulative cured
    cumulative_cure = scrapy.Field()
    # cumulative deaths
    cumulative_dead = scrapy.Field()
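As a quick illustration, Scrapy items behave much like dicts (the values here are hypothetical):

item = YqsjProvinceItem()
item['location'] = '湖北'  # hypothetical value
item['new'] = '0'          # hypothetical value
print(dict(item))          # {'location': '湖北', 'new': '0'}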

Middleware definition

After opening the page, we need to click "expand all" once.

Full code:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import ActionChains
import time


class YqsjSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class YqsjDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        # return None
        try:
            spider.browser.get(request.url)
            spider.browser.maximize_window()
            time.sleep(2)
            # click the "expand all" span so the full province table is rendered
            spider.browser.find_element_by_xpath("//*[@id='nationTable']/div/span").click()
            # ActionChains(spider.browser).click(searchButtonElement)
            time.sleep(5)
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('Timeout exception: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
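Because process_request returns an HtmlResponse built from Selenium's page_source, Scrapy's own downloader is bypassed for these requests and parse() receives the fully rendered, expanded table. Note that the finally block closes the browser window after each request; that is fine here because the spider only visits a single start URL.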

Defining the spider

The spider fetches the national figures as well as the per-province figures. Full code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/7 22:05
# @Author  : 至尊宝
# @Site    :
# @File    : baidu_yq.py

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from yqsj.items import YqsjChinaItem, YqsjProvinceItem


class YqsjSpider(scrapy.Spider):
    name = 'yqsj'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://voice.baidu.com/act/newpneumonia/newpneumonia#tab0']
    china_xpath = "//div[contains(@class, 'VirusSummarySix_1-1-317_2ZJJBJ')]/text()"
    province_xpath = "//*[@id='nationTable']/table/tbody/tr[{}]/td/text()"
    province_xpath_1 = "//*[@id='nationTable']/table/tbody/tr[{}]/td/div/span/text()"

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # run Chrome in headless mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        country_info = response.xpath(self.china_xpath)
        yq_china = YqsjChinaItem()
        yq_china['exist_diagnosis'] = country_info[0].get()
        yq_china['asymptomatic'] = country_info[1].get()
        yq_china['exist_suspecte'] = country_info[2].get()
        yq_china['exist_severe'] = country_info[3].get()
        yq_china['cumulative_diagnosis'] = country_info[4].get()
        yq_china['overseas_input'] = country_info[5].get()
        yq_china['cumulative_cure'] = country_info[6].get()
        yq_china['cumulative_dead'] = country_info[7].get()
        yield yq_china
        # iterate over the province rows (range(1, 35) covers tr[1] to tr[34])
        for x in range(1, 35):
            path = self.province_xpath.format(x)
            path1 = self.province_xpath_1.format(x)
            province_info = response.xpath(path)
            province_name = response.xpath(path1)
            yq_province = YqsjProvinceItem()
            yq_province['location'] = province_name.get()
            yq_province['new'] = province_info[0].get()
            yq_province['exist'] = province_info[1].get()
            yq_province['total'] = province_info[2].get()
            yq_province['cure'] = province_info[3].get()
            yq_province['dead'] = province_info[4].get()
            yield yq_province

Pipeline: writing the results to text

Write the results out in a fixed text format. Full code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from yqsj.items import YqsjChinaItem, YqsjProvinceItem


class YqsjPipeline:
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if isinstance(item, YqsjChinaItem):
            self.file.write(
                "National data\nActive confirmed\t{}\nAsymptomatic\t{}\nCurrent suspected\t{}\n"
                "Current severe\t{}\nCumulative confirmed\t{}\nImported from overseas\t{}\n"
                "Cumulative cured\t{}\nCumulative deaths\t{}\n".format(
                    item['exist_diagnosis'],
                    item['asymptomatic'],
                    item['exist_suspecte'],
                    item['exist_severe'],
                    item['cumulative_diagnosis'],
                    item['overseas_input'],
                    item['cumulative_cure'],
                    item['cumulative_dead']))
        if isinstance(item, YqsjProvinceItem):
            self.file.write(
                "Province: {}\tNew: {}\tActive: {}\tTotal: {}\tCured: {}\tDeaths: {}\n".format(
                    item['location'],
                    item['new'],
                    item['exist'],
                    item['total'],
                    item['cure'],
                    item['dead']))
        return item

    def close_spider(self, spider):
        self.file.close()
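One design note: the pipeline only runs once it is registered in ITEM_PIPELINES, which the settings changes in the next section take care of (priority 300).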

Settings changes

Use these as a direct reference and adjust to taste:

# Scrapy settings for yqsj project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yqsj'

SPIDER_MODULES = ['yqsj.spiders']
NEWSPIDER_MODULE = 'yqsj.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yqsj (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'yqsj.middlewares.YqsjSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'yqsj.middlewares.YqsjDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yqsj.pipelines.YqsjPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Verifying the results

Let's look at the result file.
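To reproduce this, run the spider from the project root and then open result.txt (the file created by the pipeline above):

scrapy crawl yqsj
type result.txt  # Windows; use `cat result.txt` on Linux/macOS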

Summary

Emmmm, I was bored and wrote this for fun; there isn't much to summarize.

To state it again: the example in this article is for research and exploration only, not for malicious attacks.

To share:

Cultivating the mind is itself a form of cultivation. Favorable circumstances build strength; adversity builds the mind. Neither can be spared. — Sword of Coming (《剑来》)

If this article was useful to you, don't be stingy with your likes. Thank you.
