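Background: our business needs the national province/city/district divisions with a three-level cascading selector. I remembered that back in 2018 I had crawled the national statistical division codes and urban-rural classification codes from the National Bureau of Statistics website. The original resource: 2018年全国统计用区划代码和城乡划分代码.sql (MySQL document resource, CSDN download).

The 2021 edition has since been published, so I took the old code to see whether it still runs.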

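Testing the code: 1. The site's encoding has changed from GBK to UTF-8.

2. The connection fails frequently during crawling.

3. After a failure, the crawl cannot resume from the point where it failed.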
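To illustrate the first finding, here is a minimal sketch of decoding page bytes that tolerates both encodings. The helper name `decode_html` and the order of candidate encodings are my own assumptions, not part of the original code.

```python
def decode_html(raw: bytes, encodings=("utf-8", "gbk")) -> str:
    """Decode page bytes, trying each candidate encoding in order.

    Hypothetical helper: the candidate list is an assumption based on the
    site having moved from GBK to UTF-8.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of crashing.
    return raw.decode(encodings[0], errors="replace")


sample = "北京市"
from_utf8 = decode_html(sample.encode("utf-8"))
from_gbk = decode_html(sample.encode("gbk"))
print(from_utf8, from_gbk)
```

With `requests`, the same helper could be applied to `response.content` instead of relying on `response.text`.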

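So, to get the job done, the original code needs to be reworked and optimized.

The steps are as follows:

1. Target URL: the 2021 statistical division codes and urban-rural classification codes page (http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2021/index.html)

2. MySQL table structure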

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for tab_citys
-- ----------------------------
DROP TABLE IF EXISTS `tab_citys`;
CREATE TABLE `tab_citys` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'auto-increment id',
  `parent_id` int(11) DEFAULT NULL COMMENT 'parent id',
  `city_name_zh` varchar(20) NOT NULL COMMENT 'name',
  `vcode` varchar(20) DEFAULT NULL COMMENT 'urban-rural classification code',
  `city_level` int(11) NOT NULL COMMENT 'level, five in total: 1 province, 2 city, 3 district, 4 subdistrict, 5 neighborhood committee',
  `city_code` char(12) NOT NULL COMMENT 'division code',
  `next_url` char(200) NOT NULL COMMENT 'URL of the next level',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
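Since the goal is a three-level cascade, the rows of this table can be folded into a nested structure by following `parent_id`. The sketch below assumes rows arrive as `(id, parent_id, name, level, code)` tuples and that provinces are stored with `parent_id = 0`, as the crawler does; the function name `build_cascade` and the sample rows are illustrative only.

```python
from collections import defaultdict


def build_cascade(rows, max_level=3):
    """Fold flat rows into a nested list of dicts, keeping levels 1..max_level.

    rows: iterable of (id, parent_id, name, level, code) tuples (assumed shape).
    """
    children = defaultdict(list)
    for row_id, parent_id, name, level, code in rows:
        if level <= max_level:
            children[parent_id].append((row_id, name, code))

    def subtree(parent_id):
        return [
            {"code": code, "name": name, "children": subtree(row_id)}
            for row_id, name, code in children.get(parent_id, [])
        ]

    return subtree(0)  # provinces are assumed to have parent_id = 0


# Illustrative sample rows; the level-4 row is dropped by max_level=3.
rows = [
    (1, 0, "北京市", 1, "0"),
    (2, 1, "市辖区", 2, "110100000000"),
    (3, 2, "东城区", 3, "110101000000"),
    (4, 3, "东华门街道", 4, "110101001000"),
]
tree = build_cascade(rows)
print(tree)
```

The nested result can be serialized with `json.dumps` and handed to a front-end cascading selector.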

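3. Principles:

1) Start at level 1 and crawl down level by level to level 5, iterating until everything is done.

2) If the run fails midway, read the last record written to MySQL and continue from there.

3) Avoid making so many rapid connections that the server identifies the client as a crawler and bans the IP.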
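Principle 3, together with the frequent connection failures, suggests wrapping each request in a retry loop with pauses and a rotating User-Agent. The following is only a sketch under my own assumptions: the name `fetch_with_retry`, the delay values, and the injected `fetch` callable are hypothetical and are not taken from the original code.

```python
import random
import time


def fetch_with_retry(fetch, url, user_agents, retries=5, base_delay=1.0,
                     pause=(1.0, 3.0), sleep=time.sleep):
    """Call fetch(url, headers) with retries, backoff, and a random User-Agent.

    fetch is any callable that returns the page text or raises OSError
    (requests' exceptions are OSError subclasses). All defaults are assumptions.
    """
    last_error = None
    for attempt in range(retries):
        headers = {"User-Agent": random.choice(user_agents)}
        try:
            text = fetch(url, headers)
            sleep(random.uniform(*pause))  # polite pause after each success
            return text
        except OSError as exc:
            last_error = exc
            # Exponential backoff with a little jitter before the next attempt.
            sleep(base_delay * 2 ** attempt + random.random())
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Demo with a fake fetcher that fails twice and then succeeds (no network).
calls = {"n": 0}


def flaky_fetch(url, headers):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = fetch_with_retry(flaky_fetch, "http://example.invalid/",
                          ["UA-1", "UA-2"], sleep=lambda seconds: None)
print(result, calls["n"])
```

In a real run, `fetch` could be a small function around `requests.get(url, headers=headers, timeout=10)` that sets `response.encoding = 'utf-8'` and returns `response.text`.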

4. Core Python code

import os
import random
import time

import MySQLdb
import requests
import lxml.etree as etree

# Pool of browser User-Agent strings
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
class chinese_city:
    # Initialization
    def __init__(self):
        self.baseUrl = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2021/index.html'
        self.base = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2021/'
        self.conn = MySQLdb.connect(host="127.0.0.1", port=3306, user="root",
                                    passwd="***", db="test", charset='utf8')
        self.cur = self.conn.cursor()
        # XPath of the table rows for each level
        self.trdic = {
            1: '//tr[@class="provincetr"]',
            2: '//tr[@class="citytr"]',
            3: '//tr[@class="countytr"]',
            4: '//tr[@class="towntr"]',
            5: '//tr[@class="villagetr"]',
        }

    def __del__(self):
        if self.cur:
            self.cur.close()
        if self.conn:
            self.conn.close()

    @staticmethod
    def log(log_str):
        t = time.strftime(r"%Y-%m-%d %H:%M:%S", time.localtime())
        print("[%s]%s" % (t, log_str))

    def get_now_time(self):
        """
        Get the current date and time.
        :return: current date and time
        """
        now = time.localtime()
        now_time = time.strftime("%Y-%m-%d %H:%M:%S", now)
        return now_time

    def crawl_page(self, url):
        """Fetch one page of the published administrative division codes."""
        self.log(f"crawling...{url}")
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        }
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'  # line added: the site now serves UTF-8 instead of GBK
        text = response.text
        time.sleep(2)  # pause between requests
        return text

    # Parse the province page; returns a list
    def parseProvince(self):
        html = self.crawl_page(self.baseUrl)
        tree = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
        nodes = tree.xpath('//tr[@class="provincetr"]')
        id = 1
        values = []
        for node in nodes:
            items = node.xpath('./td')
            for item in items:
                value = {}
                nexturl = item.xpath('./a/@href')
                province = item.xpath('./a/text()')
                self.log(province)
                value['url'] = self.base + "".join(nexturl)
                value['name'] = "".join(province)
                value['vcode'] = ""
                value['code'] = 0
                value['pid'] = 0
                value['id'] = id
                value['level'] = 1
                self.log(repr(value['name']))
                id = id + 1
                last_id = self.insert_to_db(value)
                value['id'] = last_id  # use the database id as the parent id of the next level
                values.append(value)
                self.log(value)
        return values

    # Parse a child page according to trid (the level)
    def parse(self, trid, pid, url):
        if url.strip() == '':
            return None
        # url_prefix+url
        html = self.crawl_page(url)
        tree = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
        nodes = tree.xpath(self.trdic.get(trid))
        path = os.path.basename(url)
        base_url = url.replace(path, '')
        id = 1
        values = []
        # Multiple cities
        for node in nodes:
            value = {}
            nexturl = node.xpath('./td[1]/a/@href')
            if len(nexturl) == 0:
                nexturl = ''
            # Code and name are plain text when the row has no link
            code = node.xpath('./td[1]/a/text()')
            if len(code) == 0:
                code = node.xpath('./td[1]/text()')
            name = node.xpath('./td[2]/a/text()')
            if len(name) == 0:
                name = node.xpath('./td[2]/text()')
            value['code'] = "".join(code)
            urltemp = "".join(nexturl)
            if len(urltemp) != 0:
                value['url'] = base_url + "".join(nexturl)
            else:
                value['url'] = ''
            value['name'] = "".join(name)
            value['vcode'] = ""
            self.log(repr(value['name']))
            self.log(value['url'])
            value['id'] = id
            value['pid'] = pid
            value['level'] = trid
            id = id + 1
            last_id = self.insert_to_db(value)
            value['id'] = last_id
            values.append(value)
            self.log(value)
        return values

    # Parse a community (level 5) page
    def parseVillager(self, trid, pid, url):
        html = self.crawl_page(url)
        tree = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
        nodes = tree.xpath(self.trdic.get(trid))
        id = 1
        values = []
        # Multiple communities
        for node in nodes:
            value = {}
            nexturl = node.xpath('./td[1]/a/@href')
            code = node.xpath('./td[1]/text()')
            vcode = node.xpath('./td[2]/text()')
            name = node.xpath('./td[3]/text()')
            value['code'] = "".join(code)
            value['url'] = "".join(nexturl)
            value['name'] = "".join(name)
            value['vcode'] = "".join(vcode)
            self.log(repr(value['name']))
            value['id'] = id
            value['pid'] = pid
            value['level'] = trid
            id = id + 1
            last_id = self.insert_to_db(value)
            value['id'] = last_id
            values.append(value)  # append once (the original appended each row twice)
            self.log(value)
        return values

    # Insert one row into the database; returns the new row id (0 on failure)
    def insert_to_db(self, row):
        lastid = 0
        try:
            # Table name matches the CREATE TABLE statement above
            sql = 'INSERT INTO tab_citys values(%s,%s,%s,%s,%s, %s,%s)'
            param = (0, row.get("pid"), row.get("name"), row.get("vcode"),
                     row.get("level"), row.get("code"), row.get("url"))
            self.cur.execute(sql, param)
            lastid = self.cur.lastrowid
            self.conn.commit()
        except Exception as e:
            self.log(e)
            self.conn.rollback()
        return lastid

    # Run the whole crawl from the top
    def parseChineseCity(self):
        flag = 1
        city_flag = 0
        count_flag = 0
        town_flag = 0
        # Fetch the province-level data first
        values = self.parseProvince()
        # ... (the rest is omitted in the original post)


if __name__ == '__main__':
    crawler = chinese_city()
    crawler.parseChineseCity()
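The traversal in `parseChineseCity` is omitted above, so the resume logic from principle 2 is not shown. One possible way to find where to pick up again, offered only as a sketch and not as the author's method, is to ask the table which rows have a `next_url` but no children yet. The example uses an in-memory `sqlite3` database as a stand-in for MySQL so that it runs on its own; the helper name `pending_parents` and the sample rows are hypothetical, and it assumes that all rows of a page are written before the next page is fetched.

```python
import sqlite3

# Rows that point to a next-level page but have no child rows yet.
PENDING_SQL = """
SELECT p.id, p.city_level, p.next_url
FROM tab_citys AS p
LEFT JOIN tab_citys AS c ON c.parent_id = p.id
WHERE p.next_url <> '' AND p.city_level < 5 AND c.id IS NULL
ORDER BY p.id
"""


def pending_parents(conn):
    """Return (id, level, next_url) for every row whose children are missing."""
    return conn.execute(PENDING_SQL).fetchall()


# Stand-in database with a reduced column set (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab_citys (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "city_name_zh TEXT, city_level INTEGER, next_url TEXT)"
)
conn.executemany(
    "INSERT INTO tab_citys VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0, "province A", 1, "11.html"),
        (2, 0, "province B", 1, "12.html"),
        (3, 1, "city A1", 2, "11/1101.html"),
    ],
)
todo = pending_parents(conn)
print(todo)
```

Each returned row could then be passed to `parse(level + 1, id, next_url)`, or to `parseVillager` for the last level. If a run can die halfway through inserting a page, the most recently written parent would need to be re-checked as well.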

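If you need the complete code or the MySQL import script, send me a private message.

Here is the download link for the MySQL data:

Link: https://pan.baidu.com/s/1JX0sd6Gq2bivp2wXNYeJSA?pwd=YYDS
Extraction code: YYDS

Feel free to download it.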
