Crawling China's 2020 Administrative Division Data

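On the national division website, the administrative division data is laid out like a set of Russian nesting dolls,

nested layer by layer, from the province and city level all the way down to the village (or residents' committee) level.

The data set has more than 670,000 rows (PS: that is the scale of our country's administrative divisions; a populous country indeed).

I found a very good blogger online who used the same nested approach and crawled the data down to the county level, so I borrowed that code and modified it to get what I wanted. (I could not find that blogger again afterwards; if you know who it is, please post the link in the comments. Thanks.)

What I wanted: all the data for province, city, county, town, and village, stored in separate tables, one table per province.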
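The nesting also shows up in the division codes themselves: the script below keeps the first 2, 4, 6, and 9 digits of a code for the province, city, county, and town levels, and the full 12 digits for a village. A minimal sketch of that slicing, assuming a hypothetical helper `split_division_code` that is not part of the original script:

```python
def split_division_code(code):
    """Split a 12-digit division code into its five nested prefixes."""
    code = code.strip()
    if len(code) != 12 or not code.isdigit():
        raise ValueError("expected a 12-digit code, got %r" % code)
    return {
        "province": code[:2],   # first 2 digits
        "city": code[:4],       # first 4 digits
        "county": code[:6],     # first 6 digits
        "town": code[:9],       # first 9 digits
        "village": code,        # all 12 digits
    }
```

Each level's code is a prefix of the level below it, which is what makes it possible to rebuild the parent chain of any row from its code alone.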

After repeated optimization and improvement, I ended up with the code below.

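Notes on the code:

1. The program uses a for loop and runs through everything in one go (less effort, but it takes a long time). You can pull the provinces out individually and start several processes, which is much faster (I do not know how to use multiprocessing in code; if you do, give it a try).

2. Run it on a reasonably well-specced Linux server if you can. It was very slow for me on Windows, and in the end I had to move to Linux.

3. With this much data, some pages will inevitably fail to crawl for all sorts of reasons (mostly failed network requests), so I added a lot of key log output. Whatever gets printed is what you failed to crawl. Code for re-crawling the failures is given further down, and you will definitely need it.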
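For note 1, one possible way to run one province per process is sketched below with the standard library's `concurrent.futures`. This is an assumption about how it could be done, not something the original script does. `output_names`, `crawl_province`, and `crawl_all` are hypothetical names, and the worker here is only a placeholder: the script's functions write to the module-level `fp` and `log_file`, so a real worker would have to open its own two files inside the process before calling `province(index)`.

```python
from concurrent.futures import ProcessPoolExecutor


def output_names(index):
    """File names the original main loop uses for province `index`."""
    return "quhua_%d.csv" % index, "error%d.txt" % index


def crawl_province(index):
    """Hypothetical worker: one province per process.

    In the real script this would open both files, assign them to the
    module-level `fp` and `log_file`, call province(index), and close them.
    Here it only returns the file names so the sketch stays self-contained.
    """
    return output_names(index)


def crawl_all(indices, max_workers=4):
    """Run crawl_province for every index in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(crawl_province, indices))
```

Call `crawl_all(range(32))` from under an `if __name__ == '__main__':` guard. Keeping `max_workers` small and leaving the `time.sleep` calls in place avoids hammering the site.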

# -*- coding:utf-8 -*-
import time

import requests
from lxml import etree

BASE_URL = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.204 Safari/537.36',
    'Cookie': 'AD_RS_COOKIE=20080918; _trs_uv=kahvgie3_6_fc6v',
}


def get_html(url):
    # Fetch one page and parse it; the pages are decoded as GBK.
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    return etree.HTML(response.text)


def log_error(*fields):
    # Print a failure and append it, comma-separated, to the error log.
    print(*fields)
    log_file.write(','.join(fields) + '\n')


# Province level
def province(index):
    html = get_html(BASE_URL + 'index.html')
    try:
        td = html.xpath('//tr[@class="provincetr"]/td')[index]
        name = td.xpath('./a/text()')[0]
        page = td.xpath('./a/@href')[0]
        province_code = page[:2]
        city_url = BASE_URL + page
        fp.write('%s,%s\t,%s\n' % (name, province_code, name))
    except Exception:
        # Missing cells, or cells without a link, are logged and skipped.
        log_error('province write failed', str(index))
        return
    try:
        time.sleep(0.2)
        city(city_url)
    except Exception:
        log_error('city failed', city_url)


# City level
def city(province_url):
    time.sleep(5)
    html = get_html(province_url)
    for tr in html.xpath('//tr[@class="citytr"]'):
        try:
            page = tr.xpath('./td[1]/a/@href')[0]   # e.g. '42/4201.html'
            city_code = page.split('/')[0]          # first path segment of the link
            country_url = BASE_URL + page
            city_id = tr.xpath('./td[1]/a/text()')[0]
            name = tr.xpath('./td[2]/a/text()')[0]
            fp.write('%s,%s\t,%s\n' % (name, city_id[:4], name))
        except Exception:
            log_error('city write failed', province_url)
            continue
        try:
            time.sleep(0.2)
            country(country_url, city_code)
        except Exception:
            log_error('country failed', country_url, city_code)


# County level (the function keeps its original name, `country`)
def country(country_url, city_code):
    print(country_url)
    time.sleep(5)
    html = get_html(country_url)
    trs = html.xpath('//tr[@class="countytr"]')
    if not trs:
        # No county rows on this page: write a placeholder county row
        # and parse the same page as a town page.
        fp.write('%s,%s\t,%s\n' % ('市辖区', country_url[-9:-5] + '00', '市辖区'))
        town(country_url, '00', city_code)
        return
    for tr in trs:
        try:
            page = tr.xpath('./td[1]/a/@href')[0]
            country_code = page.split('/')[0]
            town_url = BASE_URL + city_code + '/' + page
            country_id = tr.xpath('./td[1]/a/text()')[0]
            name = tr.xpath('./td[2]/a/text()')[0]
            print(country_id, name)
            fp.write('%s,%s\t,%s\n' % (name, country_id[:6], name))
        except Exception:
            # Rows without a link have no lower levels; read the plain cell text.
            try:
                country_id = tr.xpath('./td[1]/text()')[0]
                name = tr.xpath('./td[2]/text()')[0]
                fp.write('%s,%s\t,%s\n' % (name, country_id[:6], name))
                log_error('no lower level:', name)
            except Exception:
                log_error('country write failed', country_url)
            continue
        try:
            time.sleep(0.2)
            town(town_url, country_code, city_code)
        except Exception:
            log_error('town failed', town_url, country_code, city_code)


# Town level
def town(town_url, country_code, city_code):
    time.sleep(5)
    html = get_html(town_url)
    # Village links are relative to the directory of the town page.
    if country_code == '00':
        prefix = BASE_URL + city_code + '/'
    else:
        prefix = BASE_URL + city_code + '/' + country_code + '/'
    for tr in html.xpath('//tr[@class="towntr"]'):
        try:
            page = tr.xpath('./td[1]/a/@href')[0]
            village_url = prefix + page
            town_id = tr.xpath('./td[1]/a/text()')[0]
            name = tr.xpath('./td[2]/a/text()')[0]
            fp.write('%s,%s\t,%s\n' % (name, town_id[:9], name))
        except Exception:
            try:
                town_id = tr.xpath('./td[1]/text()')[0]
                name = tr.xpath('./td[2]/text()')[0]
                fp.write('%s,%s\t,%s\n' % (name, town_id[:9], name))
                log_error('no lower level:', name)
            except Exception:
                log_error('town write failed', town_url)
            continue
        try:
            time.sleep(0.2)
            village(village_url)
        except Exception:
            log_error('village failed', village_url)


# Village level
def village(village_url):
    time.sleep(5)
    html = get_html(village_url)
    for tr in html.xpath('//tr[@class="villagetr"]'):
        try:
            village_id = tr.xpath('./td[1]/text()')[0]
            name = tr.xpath('./td[3]/text()')[0]
            fp.write('%s,%s\t\n' % (name, village_id))
            time.sleep(0.2)
        except Exception:
            log_error('village write failed', village_url)


if __name__ == '__main__':
    for i in range(0, 32):
        csv_name = "quhua_" + str(i) + ".csv"
        err_name = "error" + str(i) + ".txt"
        log_file = open(err_name, 'a', encoding='utf-8')
        log_file.write('%s\n' % csv_name)
        print(csv_name)
        fp = open(csv_name, 'a', encoding='utf-8')
        fp.write('%s,%s\n' % ('区划名称', '区划代码'))  # header row: division name, division code
        province(i)
        time.sleep(30)
        fp.close()
        time.sleep(5)
        log_file.close()
        time.sleep(30)
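Since most failures are network request failures, one option is to retry a request a few times before giving up and logging it. The following is a minimal sketch of that idea, not part of the original script: `with_retries` is a hypothetical helper, and the retry count and delays are arbitrary assumptions.

```python
import time


def with_retries(func, retries=3, backoff=1.0, sleep=time.sleep):
    """Call func() up to `retries` times, doubling the wait after each failure.

    Re-raises the last exception once every attempt has failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            # Wait backoff, 2*backoff, 4*backoff, ... before the next attempt.
            sleep(backoff * 2 ** (attempt - 1))
```

In the script, the page fetch could be wrapped as `with_retries(lambda: requests.get(url, headers=headers, timeout=10))`, so that only requests that keep failing end up in the error log.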

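If some data for a province fails to crawl, I print a log entry for it. (On Windows the printout is a tuple, and on Linux it is several strings; you need to turn it into the list form used in the code below. Different provinces can go in the same run, as long as each province sits in its own list; the nesting in my code makes this clear.)

Add a dispatch function and modify the main block. The code for re-crawling the failures is as follows. (I recommend running one province at a time; otherwise, if something fails again later, you may not be able to tell which failure belongs to which province.)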
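Rather than retyping the printed output by hand, the error files can be converted into that list form directly, because the script writes each failure as one comma-separated line. A minimal sketch, assuming a hypothetical helper `parse_error_log` that is not part of the original code; it keeps only the `country`, `town`, and `village` failures, since those are the levels the re-crawl code handles:

```python
RETRYABLE_LEVELS = ("country", "town", "village")


def parse_error_log(lines):
    """Turn error-log lines into [message, url, code, ...] lists."""
    jobs = []
    for line in lines:
        # Strip the tabs, spaces, and newline the script writes around fields.
        fields = [field.strip() for field in line.split(",")]
        words = fields[0].split()
        if len(fields) < 2 or len(words) != 2:
            continue  # file-name line, 'write failed' line, or other note
        level, status = words
        # Accept both spellings that appear in the log messages.
        if level in RETRYABLE_LEVELS and status in ("failed", "faild"):
            jobs.append(fields)
    return jobs
```

For example, `parse_error_log(open('error16.txt'))` would return the entries for one province (the file name here is only an illustration), ready to be placed in the outer list.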

def single(item):
    # item is [message, url, code, ...]; the first word of the message
    # names the level to crawl again.
    level = item[0].split(' ')[0]
    print(level)
    if level == 'country':
        country(item[1], item[2])
    elif level == 'village':
        village(item[1])
    elif level == 'town':
        town(item[1], item[2], item[3])
    else:
        print('bad record')


if __name__ == '__main__':
    # One inner list per province.
    list_a = [
        [['country failed', 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/42/4201.html', '42'],
         ['country failed', 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/42/4213.html', '42']],
        [['country failed', 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/43/4301.html', '43'],
         ['country failed', 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/43/4313.html', '43']],
    ]
    for i, list_all in enumerate(list_a):
        print(i)
        print(time.time())
        csv_name = "quhua_" + str(i) + "_bu.csv"
        print(csv_name)
        fp = open(csv_name, 'a', encoding='utf-8')
        # The crawl functions also write to log_file, so it must exist here too.
        log_file = open("error" + str(i) + "_bu.txt", 'a', encoding='utf-8')
        fp.write('%s,%s\n' % ('区划名称', '补充内容'))  # header row: division name, supplementary content
        for list_single in list_all:
            single(list_single)
            time.sleep(40)
        fp.close()
        log_file.close()
        print(time.time())

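The crawled data includes Jinmen (Kinmen) County, Quanzhou City, Fujian Province. The national division site lists no town-level or village-level entries for it; I looked it up afterwards, and it is reportedly administered by Taiwan.

Keep this special case in mind when working with the data.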
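Cases like this can be found in the output by listing county rows (6-digit codes) that have no town rows (9-digit codes) beneath them. The sketch below assumes the `name,code` row layout that the script writes; `counties_without_towns` is a hypothetical helper, not part of the original code. Placeholder rows such as "市辖区" that have no link on the site will show up in the result as well.

```python
def counties_without_towns(lines):
    """Return (code, name) for county rows that have no town rows below them."""
    counties = []            # county rows in file order
    town_prefixes = set()    # 6-digit prefixes of every town code seen
    for line in lines:
        parts = line.rstrip("\n").split(",")
        if len(parts) < 2:
            continue
        name, code = parts[0].strip(), parts[1].strip()
        if not code.isdigit():
            continue         # header row or malformed line
        if len(code) == 6:
            counties.append((code, name))
        elif len(code) == 9:
            town_prefixes.add(code[:6])
    return [(code, name) for code, name in counties if code not in town_prefixes]
```

A county that appears here only because its town page failed to download will also be in the error log, so cross-checking the two separates genuine gaps in the source data from crawl failures.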

If you want the data I crawled, you can download it from my resources.
