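Scraping and Processing China's COVID-19 Data

Project name:

Scraping and Processing China's COVID-19 Data

Purpose:

Use Python to scrape COVID-19 data for China, store it in Excel, analyze and visualize it, and build a GUI for querying the epidemic situation in China.

Details:

This is a Python crawler project that scrapes and processes China's COVID-19 data. It fetches the latest COVID-19 figures for each province from a website and saves them to an Excel file. Processing that data yields a pie chart of the fifteen provinces with the most cumulative confirmed cases. The project also includes a GUI for looking up the figures for each province and each of its regions.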

Excel sheet design
Pie chart
GUI
Output design

Implementation:

The project uses these Python libraries: requests, xlwt, json, matplotlib, tkinter, os, re, time.

import requests
import os
import re
import xlwt
import time
import json
import matplotlib.pyplot as plt
import tkinter
from tkinter import scrolledtext
from tkinter import ttk
from tkinter import *

1. Scraping the data

The site scraped in this project is 丁香园 ▪ 丁香医生 (DXY):

https://ncov.dxy.cn/ncovh5/view/pneumonia?from=timeline&isappinstalled=0

Code:

def get_data_html():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1'}
    response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia?from=timeline&isappinstalled=0', headers=headers, timeout=3)  # request the page
    response = str(response.content, 'utf-8')  # decode the bytes as UTF-8 so the Chinese text is readable
    return response  # the page HTML


def get_data_dictype():
    # raw string, so that \( and \) reach the regex engine unchanged
    areas_type_dic_raw = re.findall(r'try { window.getAreaStat = (.*?)}catch\(e\)', get_data_html())
    areas_type_dic = json.loads(areas_type_dic_raw[0])
    return areas_type_dic  # the JSON-decoded data (a list of dicts)
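The extraction step in get_data_dictype() can be tried without any network access. The following is a minimal sketch under assumptions: SAMPLE_HTML is a made-up miniature of the page source (the real page is far larger, and its numbers differ), and extract_area_stat is a hypothetical helper name that is not part of the project.

```python
import json
import re

# Hypothetical miniature of the page source; the counts are invented.
SAMPLE_HTML = (
    '<script id="getAreaStat">try { window.getAreaStat = '
    '[{"provinceName": "湖北省", "confirmedCount": 120, "cities": []}]'
    '}catch(e){}</script>'
)

# Raw string, with the literal braces, dot and parentheses escaped.
AREA_STAT_PATTERN = re.compile(
    r'try \{ window\.getAreaStat = (.*?)\}catch\(e\)', re.S
)


def extract_area_stat(html):
    """Return the parsed getAreaStat list, or an empty list if it is absent."""
    match = AREA_STAT_PATTERN.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


provinces = extract_area_stat(SAMPLE_HTML)
print(provinces[0]["provinceName"], provinces[0]["confirmedCount"])
```

Returning an empty list when the pattern is missing is one possible design choice. The original indexes the findall() result with [0], which raises IndexError if the page layout changes.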

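Part of the resulting areas_type_dic is shown in the figure below:

The figure shows that:
• get_data_dictype() returns a list
• Each element of the list is a dictionary holding the overall figures for one province
• In each dictionary, provinceName is the province name, currentConfirmedCount is the province's current confirmed count, confirmedCount is its cumulative confirmed count, and so on
• The value of each dictionary's cities key is a list whose elements are dictionaries, each holding the figures for one region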
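The nested structure described above can be walked with ordinary loops and comprehensions. This is a small sketch, assuming a hand-written sample_areas list that only mimics the shape of the real data. Every number in it is invented, and city_counts is a hypothetical helper.

```python
# Hypothetical sample shaped like the scraped list; all numbers are invented.
sample_areas = [
    {
        "provinceName": "湖北省",
        "currentConfirmedCount": 5,
        "confirmedCount": 120,
        "cities": [
            {"cityName": "武汉", "currentConfirmedCount": 3, "confirmedCount": 90},
            {"cityName": "孝感", "currentConfirmedCount": 2, "confirmedCount": 30},
        ],
    },
    {
        "provinceName": "广东省",
        "currentConfirmedCount": 1,
        "confirmedCount": 40,
        "cities": [
            {"cityName": "深圳", "currentConfirmedCount": 1, "confirmedCount": 40},
        ],
    },
]


def city_counts(areas, province_name):
    """Map each city of one province to its cumulative confirmed count."""
    for province in areas:
        if province["provinceName"] == province_name:
            return {c["cityName"]: c["confirmedCount"] for c in province["cities"]}
    return {}  # unknown province


# Province-level totals come from the outer dictionaries.
province_totals = {p["provinceName"]: p["confirmedCount"] for p in sample_areas}
print(province_totals)
print(city_counts(sample_areas, "湖北省"))
```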

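2. Choosing and creating the save directory

If the save directory does not exist, the program prints “数据文件夹不存在” (the data folder does not exist), creates the directory, and prints its path. Once the data has been saved, it prints a message saying the scrape succeeded.

Code:

def make_dir():  # check for the data directory and create it if needed
    file_path = 'F:/中国疫情情况采集/'
    if not os.path.exists(file_path):
        print('数据文件夹不存在')
        os.makedirs(file_path)
        print('数据文件夹创建成功,创建目录为%s' % (file_path))
    else:
        print('数据保存在目录：%s' % (file_path))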
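The same check-then-create logic can also be written with the exist_ok flag of os.makedirs. The sketch below is only an illustration: ensure_dir is a hypothetical name, and a temporary directory with the made-up subfolder covid_data stands in for the F:/ path so that the example runs on any machine.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path if needed; return True when it had to be created."""
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return not existed


# A temporary directory stands in for the author's F:/ path.
with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "covid_data")
    first = ensure_dir(target)   # directory is missing, so it gets created
    second = ensure_dir(target)  # directory is already there
    print(first, second)
```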

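Result:

3. Writing the data to Excel

Excel layout:

The workbook contains several worksheets. The first is named “中国各省” (China's provinces) and holds the COVID-19 figures for every province (total current confirmed, total cumulative confirmed, total suspected, total cured, total deaths, and location ID). The remaining worksheets are named after the provinces in the order they are read. Each one holds the figures for the regions of that province (current confirmed, cumulative confirmed, suspected, cured, deaths, and location ID). Its second row holds the province's totals, separated from the region rows by one blank row for readability.

Code: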

def createxcelsheet(workbook, name):
    worksheet = workbook.add_sheet(name, cell_overwrite_ok=True)
    for i in range(0, 9):
        worksheet.col(i).width = 256 * 15
    al = xlwt.Alignment()
    al.horz = 0x02  # center horizontally
    style = xlwt.XFStyle()
    style.alignment = al
    # write the column headers
    worksheet.write(0, 2, '城市名称', style)
    worksheet.write(0, 3, '现存确诊人数', style)
    worksheet.write(0, 4, '累计确诊人数', style)
    worksheet.write(0, 5, '疑似人数', style)
    worksheet.write(0, 6, '治愈人数', style)
    worksheet.write(0, 7, '死亡人数', style)
    worksheet.write(0, 8, '地区ID编码', style)
    return worksheet, style


# province names and cumulative confirmed counts, filled in by save_data_to_excle()
label = []
vvalues = []


def save_data_to_excle():  # scrape the data and save it to Excel
    make_dir()  # make sure the data directory exists, creating it if needed
    newworkbook = xlwt.Workbook()  # create the workbook and its first sheet
    sheet, style = createxcelsheet(newworkbook, '中国各省')
    count = 1  # row counter for the all-provinces sheet
    for province_data in get_data_dictype():
        c_count = 1  # row counter for this province's sheet
        provincename = province_data['provinceName']
        provinceshortName = province_data['provinceShortName']
        p_currentconfirmedCount = province_data['currentConfirmedCount']
        p_confirmedcount = province_data['confirmedCount']
        p_suspectedcount = province_data['suspectedCount']
        p_curedcount = province_data['curedCount']
        p_deadcount = province_data['deadCount']
        p_locationid = province_data['locationId']
        # remember the name and cumulative count for the pie chart
        label.append(provincename)
        vvalues.append(p_confirmedcount)
        sheet.write(count, 2, provincename, style)
        sheet.write(count, 3, p_currentconfirmedCount, style)
        sheet.write(count, 4, p_confirmedcount, style)
        sheet.write(count, 5, p_suspectedcount, style)
        sheet.write(count, 6, p_curedcount, style)
        sheet.write(count, 7, p_deadcount, style)
        sheet.write(count, 8, p_locationid, style)
        count += 1
        # write the province totals into the province's own sheet
        worksheet, style = createxcelsheet(newworkbook, provincename)
        worksheet.write(c_count, 2, provinceshortName, style)
        worksheet.write(c_count, 3, p_currentconfirmedCount, style)
        worksheet.write(c_count, 4, p_confirmedcount, style)
        worksheet.write(c_count, 5, p_suspectedcount, style)
        worksheet.write(c_count, 6, p_curedcount, style)
        worksheet.write(c_count, 7, p_deadcount, style)
        worksheet.write(c_count, 8, p_locationid, style)
        c_count += 2  # leave one blank row between the province and its cities
        for citiy_data in province_data['cities']:
            # figures for one city of this province
            cityname = citiy_data['cityName']
            c_currentconfirmedCount = citiy_data['currentConfirmedCount']
            c_confirmedcount = citiy_data['confirmedCount']
            c_suspectedcount = citiy_data['suspectedCount']
            c_curedcount = citiy_data['curedCount']
            c_deadcount = citiy_data['deadCount']
            c_locationid = citiy_data['locationId']
            # write each value into its column
            worksheet.write(c_count, 2, cityname, style)
            worksheet.write(c_count, 3, c_currentconfirmedCount, style)
            worksheet.write(c_count, 4, c_confirmedcount, style)
            worksheet.write(c_count, 5, c_suspectedcount, style)
            worksheet.write(c_count, 6, c_curedcount, style)
            worksheet.write(c_count, 7, c_deadcount, style)
            worksheet.write(c_count, 8, c_locationid, style)
            c_count += 1  # next row, once per city
    current_time = time.strftime("%Y年%m月%d日%H：%M：%S", time.localtime())
    # forward slashes avoid accidental backslash escapes in the path
    newworkbook.save('F:/中国疫情情况采集/实时采集-%s.xls' % (current_time))
    print('******数据爬取成功******')
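The row arithmetic of a province sheet (headers in row 0, province totals in row 1, a blank row 2, cities from row 3 onward) can be checked without xlwt. The sketch below is an illustration under assumptions: province_sheet_cells, COLUMNS and FIRST_COL are hypothetical names, the English column keys replace the Chinese headers, and every number in the sample is invented.

```python
COLUMNS = ["name", "current", "confirmed", "suspected", "cured", "dead", "location_id"]
FIRST_COL = 2  # the listing above writes its first column at index 2


def province_sheet_cells(province_row, city_rows):
    """Return {(row, col): value} for one province sheet.

    Row 0 holds the headers, row 1 the province totals, row 2 stays
    empty, and the cities start at row 3.
    """
    cells = {}
    for offset, header in enumerate(COLUMNS):
        cells[(0, FIRST_COL + offset)] = header
    for offset, value in enumerate(province_row):
        cells[(1, FIRST_COL + offset)] = value
    for row, city in enumerate(city_rows, start=3):
        for offset, value in enumerate(city):
            cells[(row, FIRST_COL + offset)] = value
    return cells


# Invented sample values, in the same column order as COLUMNS.
cells = province_sheet_cells(
    ["湖北", 5, 120, 0, 110, 5, 1],
    [["武汉", 3, 90, 0, 84, 3, 2], ["孝感", 2, 30, 0, 26, 2, 3]],
)
print(cells[(1, 2)], cells[(3, 2)], cells[(4, 2)])
```

Each key of the resulting dictionary corresponds to one worksheet.write(row, col, value, style) call in the original function.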

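Result:

4. Drawing the pie chart

The cumulative confirmed counts of the provinces are taken out separately and sorted. The fifteen provinces with the highest counts are drawn as a pie chart, which is saved in the data folder.

Notes on the code:
In save_data_to_excle, the province names are appended to the label list in order, and the cumulative confirmed counts are appended to the vvalues list in the same order. zip() pairs each name in label with its count in vvalues. The pairs are sorted by count in descending order, and a list slice keeps the first fifteen, which are then converted to a dictionary with the names as keys and the counts as values.
Finally, pie() from matplotlib.pyplot draws the chart, savefig() saves the image, and show() displays it. title() can be used to set a title, although the listing below does not call it.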

def sjvisual():
    plt.rcParams['font.sans-serif'] = ['SimHei']  # a font with Chinese glyphs, so the labels render
    plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
    plt.figure(figsize=(7, 6))
    z = zip(label, vvalues)
    s = sorted(z, key=lambda x: x[1], reverse=True)
    top15 = s[0:15:]
    # lists rather than dict views, so they convert cleanly to arrays
    labels = list(dict(top15).keys())
    values = list(dict(top15).values())
    plt.pie(values, labels=labels, radius=1.2, pctdistance=0.8, autopct='%1.2f%%')  # draw the pie chart
    plt.savefig('F:/中国疫情情况采集/2020年中国疫情确诊人数前十五省饼图.png')  # save the image
    plt.show()  # display the image
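The ranking step can be isolated from the plotting. This is a minimal sketch, assuming invented names and counts. top_n is a hypothetical helper, and n is set to 3 only to keep the sample short.

```python
def top_n(names, counts, n=15):
    """Pair names with counts and return the n largest pairs, biggest first."""
    ranked = sorted(zip(names, counts), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Invented sample data standing in for label and vvalues.
names = ["A", "B", "C", "D"]
counts = [30, 120, 5, 40]

top3 = top_n(names, counts, n=3)
top_labels = [name for name, _ in top3]
top_values = [count for _, count in top3]

# Shares relative to the plotted slices only, which is what a pie chart shows.
total = sum(top_values)
shares = [round(100 * value / total, 2) for value in top_values]
print(top_labels, shares)
```

Note that the percentages printed by autopct are shares of the plotted slices, so in the original chart they describe each province's share among the top fifteen rather than of the national total.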

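5. GUI

The GUI shows the overall figures for a province and the detailed figures for the regions within it. Choose the province from the drop-down list. To see the figures for one of its regions, click the button for that region.
To look up several provinces, choose another province from the drop-down list after finishing with the first. Its regions are again available through their buttons.
Code: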

def GUI():  # build the GUI
    global win
    win = tkinter.Tk()  # create the window
    win.minsize(800, 660)
    win.title('中国各省疫情情况查询')
    tkinter.Label(win, text='请选择省份：', height=1, width=10).place(x=200, y=0)
    tkinter.Label(win, text='省份：').place(x=240, y=40)
    tkinter.Label(win, text='现存确诊人数：').place(x=240, y=70)
    tkinter.Label(win, text='累计确诊人数：').place(x=240, y=100)
    tkinter.Label(win, text='疑似人数：').place(x=240, y=130)
    tkinter.Label(win, text='治愈人数：').place(x=240, y=160)
    tkinter.Label(win, text='死亡人数：').place(x=240, y=190)
    global comboxlist
    comvalue = tkinter.StringVar()  # variable holding the combobox text
    comboxlist = ttk.Combobox(win, textvariable=comvalue)  # create the drop-down list
    comboxlist["values"] = label
    comboxlist.current(0)  # show the first entry by default
    comboxlist.bind("<<ComboboxSelected>>", go)  # run go() when a province is selected
    comboxlist.pack()
    win.mainloop()  # enter the event loop


tst = []  # city buttons created so far, kept so they can be hidden again


def go(*args):  # event handler; *args receives the event object
    for i in tst:  # hide the buttons of the previously selected province
        i.place_forget()
    c = comboxlist.get()  # the selected province
    for province_data in get_data_dictype():
        provincename = province_data['provinceName']
        if c == provincename:
            tkinter.Label(win, text=provincename, height=1, width=15).place(x=400, y=40)
            tkinter.Label(win, text=province_data['currentConfirmedCount'], height=1, width=10).place(x=400, y=70)
            tkinter.Label(win, text=province_data['confirmedCount'], height=1, width=10).place(x=400, y=100)
            tkinter.Label(win, text=province_data['suspectedCount'], height=1, width=10).place(x=400, y=130)
            tkinter.Label(win, text=province_data['curedCount'], height=1, width=10).place(x=400, y=160)
            tkinter.Label(win, text=province_data['deadCount'], height=1, width=10).place(x=400, y=190)
            tkinter.Label(win, text=' 请选择' + provincename + '下地区：', height=1, width=20).place(x=50, y=220)
            # positions of the city buttons: 5 rows of 8, i.e. 40 slots
            lt = [(x, y) for y in range(240, 401, 40) for x in range(15, 716, 100)]
            ct = 0  # index of the next button position
            for city_data in province_data['cities']:  # one button per city
                b = Button(win, text=city_data['cityName'], height=1, width=10)
                tst.append(b)  # remember the button so it can be hidden later
                b.place(x=lt[ct][0], y=lt[ct][1])
                b.bind('<Button-1>', show)  # run show() when the button is clicked
                ct += 1
            tkinter.Label(win, text='地区：').place(x=240, y=435)
            tkinter.Label(win, text='现存确诊人数：').place(x=240, y=465)
            tkinter.Label(win, text='累计确诊人数：').place(x=240, y=495)
            tkinter.Label(win, text='疑似人数：').place(x=240, y=525)
            tkinter.Label(win, text='治愈人数：').place(x=240, y=555)
            tkinter.Label(win, text='死亡人数：').place(x=240, y=585)


def show(event):  # display the figures for the clicked city
    c = comboxlist.get()  # the selected province
    for province_data in get_data_dictype():
        provincename = province_data['provinceName']
        if c == provincename:
            for city_data in province_data['cities']:
                if city_data['cityName'] == event.widget['text']:  # the clicked button's city
                    tkinter.Label(win, text=city_data['cityName'], height=1, width=15).place(x=400, y=435)
                    tkinter.Label(win, text=city_data['currentConfirmedCount'], height=1, width=15).place(x=400, y=465)
                    tkinter.Label(win, text=city_data['confirmedCount'], height=1, width=15).place(x=400, y=495)
                    tkinter.Label(win, text=city_data['suspectedCount'], height=1, width=15).place(x=400, y=525)
                    tkinter.Label(win, text=city_data['curedCount'], height=1, width=15).place(x=400, y=555)
                    tkinter.Label(win, text=city_data['deadCount'], height=1, width=15).place(x=400, y=585)

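Notes on the code:
• The drop-down list is bound to go, so go runs whenever a province is selected
• The drop-down list is a global variable so that go() can read the selection and match it against the data
• comboxlist.get() returns the province name selected in the drop-down list, that is, the province to look up
• event.widget['text'] returns the text of the clicked button, that is, the region to look up within the province
• The lt list defines where the buttons (region names) are placed. Each button is bound to show(event)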
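The button positions held in lt follow a regular grid, so they can also be computed from an index. The sketch below assumes the spacing used in the listing (8 columns, 100 pixels apart horizontally and 40 vertically, starting at x=15, y=240). button_positions is a hypothetical helper, not part of the project.

```python
def button_positions(count, columns=8, x0=15, y0=240, dx=100, dy=40):
    """Return (x, y) pixel positions for count buttons laid out row by row."""
    return [
        (x0 + (i % columns) * dx, y0 + (i // columns) * dy)
        for i in range(count)
    ]


positions = button_positions(10)
print(positions[0], positions[7], positions[8])
```

Computing the positions on demand means the layout is not capped at a fixed number of slots, although a province with many regions would still need a taller window.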

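6. Running the program

The script ends by calling save_data_to_excle(), sjvisual(), and GUI().
If sjvisual() runs before GUI(), the GUI appears only after the pie chart window is closed, because plt.show() blocks until then. In the reverse order, the pie chart appears only after the GUI window is closed.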

save_data_to_excle()
sjvisual()
GUI()