python3 Learning (3): ID Iteration Crawler

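As python3 Learning (2) showed, the URLs of all the crawled pages differ only at the end, so we can exploit this weakness to iterate over and visit every URL.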
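To make that observation concrete, here is a minimal sketch that pulls the trailing numeric ID out of URLs of this shape. The helper name `extract_id` and its regular expression are illustrative assumptions, not part of the original post.

```python
import re

# Hypothetical helper (not from the original post): return the trailing
# numeric ID of a URL such as '.../view/Afghanistan-1', or None if the
# URL does not end in '-<digits>'.
def extract_id(url):
    match = re.search(r'-(\d+)$', url)
    return int(match.group(1)) if match else None

urls = [
    'http://example.webscraping.com/places/default/view/Afghanistan-1',
    'http://example.webscraping.com/places/default/view/Aland-Islands-2',
    'http://example.webscraping.com/places/default/view/Albania-3',
]
ids = [extract_id(u) for u in urls]
print(ids)
```

Only the number at the end identifies the page; the country name in front of it varies, but the IDs simply count upward.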

### II. ID iteration crawler: exploit a weakness in the site's structure to reach all of its content easily.
# Downloading: http://example.webscraping.com/places/default/view/Afghanistan-1
# Downloading: http://example.webscraping.com/places/default/view/Aland-Islands-2
# Downloading: http://example.webscraping.com/places/default/view/Albania-3
# Downloading: http://example.webscraping.com/places/default/view/Algeria-4
# Downloading: http://example.webscraping.com/places/default/view/American-Samoa-5
# Downloading: http://example.webscraping.com/places/default/view/Andorra-6
# Downloading: http://example.webscraping.com/places/default/view/Angola-7
## As shown above, these URLs differ only at the end.
import itertools
import re
import urllib.error
import urllib.request  ## -- written by LiSongbo


def Rocky_dnload(url, user_agent='wswp', num_retries=2):
    """Download url and return its content as bytes, or None on failure."""
    print('Downloading:', url)
    LiSongbo_he = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=LiSongbo_he)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                ## retry 5xx HTTP errors
                return Rocky_dnload(url, user_agent, num_retries - 1)
    return html


def Rocky_crawl_sitemap(url):
    sitemap = Rocky_dnload(url)  ## download the sitemap file
    if sitemap is None:  ## nothing to parse if the download failed
        return
    sitemap = sitemap.decode('utf-8')
    ## extract the sitemap links from the <loc> tags
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    for link in links:  ## download each link
        html = Rocky_dnload(link)  ## scrape html here


max_errors = 5  ## stop after this many consecutive download errors
n_errors = 0    ## current number of consecutive download errors
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = Rocky_dnload(url)
    if html is None:
        ## this ID failed to download; give up after max_errors in a row
        n_errors += 1
        if n_errors == max_errors:
            break
    else:
        ## success: reset the consecutive-error counter
        n_errors = 0

The output is as follows:

Downloading: http://example.webscraping.com/view/-1
Downloading: http://example.webscraping.com/view/-2
Downloading: http://example.webscraping.com/view/-3
Downloading: http://example.webscraping.com/view/-4
Downloading: http://example.webscraping.com/view/-5
Downloading: http://example.webscraping.com/view/-6
Downloading: http://example.webscraping.com/view/-7
Downloading: http://example.webscraping.com/view/-8
Downloading: http://example.webscraping.com/view/-9

……
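The crawl loop stops only after `max_errors` downloads fail in a row, so a few missing IDs in the middle of the sequence do not end the crawl early. The sketch below isolates that logic so it can be tried offline. `crawl_by_id`, `fake_download`, and the set of existing IDs are hypothetical stand-ins introduced for illustration; `fake_download` takes the place of the real `Rocky_dnload`.

```python
import itertools

def crawl_by_id(download, url_template, max_errors=5):
    """Request IDs 1, 2, 3, ... until max_errors downloads fail in a row."""
    n_errors = 0
    fetched = []
    for page in itertools.count(1):
        url = url_template % page
        html = download(url)
        if html is None:
            n_errors += 1
            if n_errors == max_errors:
                break  # assume the last valid ID has been passed
        else:
            n_errors = 0  # a success resets the consecutive-error count
            fetched.append(page)
    return fetched

# Hypothetical site: IDs 1-3 and 6 exist, IDs 4 and 5 were deleted.
existing = {1, 2, 3, 6}

def fake_download(url):
    """Stand-in downloader: returns bytes for existing IDs, None otherwise."""
    page = int(url.rsplit('-', 1)[1])
    return b'<html></html>' if page in existing else None

pages = crawl_by_id(fake_download,
                    'http://example.webscraping.com/view/-%d',
                    max_errors=3)
print(pages)
```

With `max_errors=3`, the gap at IDs 4 and 5 is skipped and ID 6 is still reached; the loop ends only after IDs 7, 8, and 9 all fail. If the counter were never reset, the crawl would stop as soon as the total number of failures reached the limit, even when those failures were scattered.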


Reposted from: https://www.cnblogs.com/LiSongbo/p/9245585.html
