Scraping Yunnan Bureau of Statistics data reports with Python

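This is the first crawler I wrote entirely on my own, without consulting anyone else's code or running a single search. The site has no robots.txt: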

import requests
import time
import re
from bs4 import BeautifulSoup
from random import random


# Fetch the HTML text of a page; return '' if the request fails.
def get_html_text(url, code='utf-8'):
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        res = requests.get(url, timeout=60, headers=kv)
        res.raise_for_status()
        res.encoding = code
        return res.text
    except requests.RequestException:
        return ''


# Download the Excel files linked from a detail page and save them.
def get_excel_store(url, filepath='d:/'):
    try:
        time.sleep(random())  # random pause of up to 1 s between requests
        r = get_html_text(url)
        soup = BeautifulSoup(r, 'html.parser')
        # On these pages the download links sit in a <span class="STYLE5">.
        span = soup.find_all('span', attrs={'class': 'STYLE5'})[0]
        # Each <a> in the span gives the file name (its text) and the URL (href).
        # find_all('a') skips bare text nodes, which have no href.
        for a in span.find_all('a'):
            file_path = filepath + a.text
            xls = requests.get(a.attrs.get('href'), timeout=30)
            xls.raise_for_status()
            with open(file_path, 'wb') as newfile:
                newfile.write(xls.content)
    except Exception:
        # Skip pages without a STYLE5 span or with a failed download.
        return ''


# Collect the detail-page URLs linked from one index page into urllist2.
def get_all_html(url, urllist2):
    r = get_html_text(url)
    soup = BeautifulSoup(r, 'html.parser')
    a = soup.find_all('a', attrs={'target': '_blank'})
    # Detail links look like /201901/t20190125_12345
    re_text = r'/\d{6}/t\d{8}_\d{2,8}'
    for i in a:
        href = i.attrs.get('href', '')
        for path in re.findall(re_text, href):
            urllist2.append('http://www.stats.yn.gov.cn/tjsj/jdsj' + path + '.html')


def main():
    filepath = 'D:/'
    url = 'http://www.stats.yn.gov.cn/tjsj/jdsj/index'
    # Build the list of paginated index URLs.
    urllist = []
    depth = 2
    for i in range(depth):
        if i == 0:
            urllist.append(url + '.html')
        else:
            urllist.append(url + '_' + str(i) + '.html')
    # Process each index page.
    for url in urllist:
        urllist2 = []
        # Collect all detail-page URLs found on the current index page.
        get_all_html(url, urllist2)
        # Download the xls files from each detail page.
        for url1 in urllist2:
            get_excel_store(url1, filepath)


if __name__ == '__main__':
    main()
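The two pieces of URL handling in the crawler (building the paginated index URLs and pulling detail-page paths out of hrefs with the regular expression) can be tried without touching the network. The following is a minimal sketch, not the original code: the helper names `build_index_urls` and `detail_urls` are introduced here for illustration, and the sample hrefs are invented, so the real pages may use a different href form.

```python
import re

# Same pattern as in get_all_html: "/YYYYMM/tYYYYMMDD_<id>"
DETAIL_LINK = re.compile(r'/\d{6}/t\d{8}_\d{2,8}')
BASE = 'http://www.stats.yn.gov.cn/tjsj/jdsj'


def build_index_urls(depth):
    """Return index.html, index_1.html, ... for the first `depth` pages."""
    stem = BASE + '/index'
    return [stem + ('.html' if i == 0 else '_%d.html' % i)
            for i in range(depth)]


def detail_urls(hrefs):
    """Turn every href that matches DETAIL_LINK into an absolute URL."""
    found = []
    for href in hrefs:
        for path in DETAIL_LINK.findall(href):
            found.append(BASE + path + '.html')
    return found


# Hypothetical hrefs, made up for illustration only.
sample_hrefs = ['./201901/t20190125_12345.html', '../about/index.html']

index_pages = build_index_urls(2)
details = detail_urls(sample_hrefs)
print(index_pages)
print(details)
```

Only the first sample href matches, because the pattern requires a six-digit folder followed by a file name starting with `t` and eight digits. Raising `depth` adds `index_2.html`, `index_3.html`, and so on, assuming the site keeps that naming for later pages.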

Writing it took about an hour, which is still too slow. The result is shown in the figure below.
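On the robots.txt remark at the top: the standard library's `urllib.robotparser` can answer whether a URL may be fetched under a given set of rules. The sketch below feeds the parser lines directly so it runs offline; the helper name `allowed_by_robots` and the `Disallow` rule are assumptions made up for illustration, not rules published by the site.

```python
from urllib import robotparser


def allowed_by_robots(robots_lines, url, user_agent='*'):
    """Check `url` against the given robots.txt lines."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


page = 'http://www.stats.yn.gov.cn/tjsj/jdsj/index.html'

# An empty rule set, which is how a site without rules is treated here.
no_rules = allowed_by_robots([], page)

# A hypothetical rule that would block the statistics section.
blocked = allowed_by_robots(['User-agent: *', 'Disallow: /tjsj/'], page)

print(no_rules, blocked)
```

Against a live site, the same parser can load the file itself with `set_url()` followed by `read()`; a 404 response for robots.txt is then treated as "everything allowed".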
