[Python] "Web Scraping with Python" (《Python网络爬虫权威指南》), Chapter 3 exercise: verifying the six degrees of separation theory

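Task description

Starting from one wiki page and following only its internal links, can we reach another wiki page in at most six jumps? For the book, the task is to get from https://en.wikipedia.org/wiki/Eric_Idle to https://en.wikipedia.org/wiki/Kevin_Bacon.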

Approach

The book already covers the approach, so I will not repeat it here.

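Process log

With the pandemic keeping me at home and nothing better to do, I left my laptop running for three days. The final results:

Crawled more than 80,000 pages and saved them locally, 10 GB+ in total;
Analyzed more than 200,000 internal links;
Found more than a dozen viable paths;
Did not actually find every viable path; in the end I did not feel like letting it run any longer.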
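The crawler in the code section is a recursive, depth-limited search, so it works through whole branches before it looks at pages close to the start. If a single shortest path were enough, breadth-first search is a common alternative, because the first path it reaches is a shortest one. The sketch below is not the code used for this post; it runs on a hypothetical in-memory link graph (`toy_links`), and `shortest_path` is an illustrative name.

```python
from collections import deque

# Hypothetical toy link graph; in the real task each key would be a wiki page
# and each value the list of article links found on that page.
toy_links = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D', 'E'],
    'D': ['F'],
    'E': ['F'],
    'F': [],
}


def shortest_path(links, start, target, max_jumps=6):
    """Breadth-first search; returns a shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        if len(path) - 1 >= max_jumps:
            continue  # jump budget used up, do not expand this page
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path(toy_links, 'A', 'F'))               # ['A', 'B', 'D', 'F']
print(shortest_path(toy_links, 'A', 'F', max_jumps=2))  # None
```

The trade-off is memory: the queue holds every page on the current frontier, which for Wikipedia grows quickly with each extra jump.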

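Code

Fetch a wiki page and save it locally (there is the firewall to deal with, after all, so a local copy makes it easy to rerun after a failure)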

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from http.client import HTTPResponse
from typing import Tuple
import hashlib
import time

# Local directory where downloaded pages are cached
storage_directory = 'D:/MyResources/爬虫数据/Wiki Pages'


def process_filename(filename: str) -> str:
    """Strip characters that are illegal in Windows file names and return the full path."""
    # hashlib instead of the built-in hash(): str hashes are salted per process,
    # so hash() would produce a different fallback name on every run
    hash_res = hashlib.md5(filename.encode('utf-8')).hexdigest()
    for ch in '"?*<>:/\\|':
        filename = filename.replace(ch, '')
    # Fall back to the hash if nothing usable is left (empty, or only dots)
    if len(filename) == 0 or len(filename) == filename.count('.'):
        filename = hash_res
    return storage_directory + '/' + filename


def get_and_store_page(url: str, filename: str) -> bool:
    """Download url and save it under filename; return True on success."""
    try:
        response = urlopen(url)  # type: HTTPResponse
    except HTTPError as e:
        print(f'HTTPError: {e}')
        return False
    except URLError as e:
        print(f'URLError: {e}')
        return False
    html = response.read().decode(encoding='utf-8')
    try:
        with open(file=process_filename(filename), mode='w', encoding='utf-8') as f:
            f.write(html)
    except FileNotFoundError as e:
        print(f'check your file name: {e}')
        return False
    time.sleep(1)  # pause between requests to go easy on the server
    return True


def load_stored_html(filename: str) -> Tuple[str, bool]:
    """Read a cached page; return (html, True), or ('', False) if it is not cached."""
    try:
        with open(file=process_filename(filename), mode='r', encoding='utf-8') as f:
            return f.read(), True
    except FileNotFoundError as e:
        print(f'check your filename: {e}')
        return '', False


if __name__ == '__main__':
    for name in ('Kevin_Bacon', 'Eric_Idle'):
        url = 'https://en.wikipedia.org/wiki/' + name
        if get_and_store_page(url=url, filename=name + '.html'):
            print(f'success: {url}')
        else:
            print(f'fail: {url}')

Verifying the six degrees of separation theory

from bs4 import BeautifulSoup
from bs4.element import Tag
from CH3_GetWikipedia import load_stored_html, get_and_store_page
import re
import time

host = 'https://en.wikipedia.org'
visited_url = dict()  # path -> smallest number of jumps at which it was reached
jump_path = ['', '', '', '', '', '', '']  # current path, one slot per depth 0..6
results = []


def find_kevin_bacon(path: str, jumps: int) -> None:
    jump_path[jumps] = host + path
    if path.split('/')[-1] == 'Kevin_Bacon':
        print('!!!! it\'s found!')
        # Keep only the slots up to the current depth; deeper slots may still
        # hold stale entries from an earlier branch
        found = jump_path[:jumps + 1]
        results.append(found)
        with open(file='./result.txt', mode='a', encoding='utf-8') as f:
            for u in found:
                print(u)
                f.write(u + '\n')
            f.write('--------------------\n')
        return
    # Skip pages already reached with the same or fewer jumps
    if path in visited_url and visited_url[path] <= jumps:
        return
    visited_url[path] = jumps
    print(f'---> {time.strftime("%H:%M:%S")} jump time: {jumps}, visited: {len(visited_url)}, now visit: {path}.')
    if jumps >= 6:
        return
    filename = path.split('/')[-1] + '.html'
    html, success = load_stored_html(filename=filename)
    if not success:
        # Not cached yet: download the page, then read it back from disk
        if not get_and_store_page(url=host + path, filename=filename):
            return
        html, success = load_stored_html(filename=filename)
        if not success:
            return
    bs = BeautifulSoup(markup=html, features='html.parser')
    body = bs.find(name='div', attrs={'id': 'bodyContent'})
    if body is None:
        return
    # Article links only: they start with /wiki/ and contain no colon
    links = body.find_all(name='a', attrs={'href': re.compile('^(/wiki/)((?!:).)*$')})
    for link in links:  # type: Tag
        find_kevin_bacon(path=link['href'], jumps=jumps + 1)


if __name__ == '__main__':
    find_kevin_bacon(path='/wiki/Eric_Idle', jumps=0)
    print(f'Found {len(results)} paths in total:')
    for res in results:
        # Each stored path already ends with the Kevin_Bacon URL
        print(' -> '.join(res))
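The regular expression passed to find_all in the crawler above, `^(/wiki/)((?!:).)*$`, is what separates article links from everything else on a page. A minimal sketch of that filter in isolation follows. To stay dependency-free it uses the standard library's HTMLParser rather than BeautifulSoup, it does not restrict itself to the bodyContent div, and `ArticleLinkParser`, `extract_article_links` and the sample HTML are illustrative assumptions rather than part of the original code.

```python
import re
from html.parser import HTMLParser

# Same pattern as the crawler: paths under /wiki/ that contain no colon, which
# excludes namespace pages such as /wiki/Category:... or /wiki/File:...
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')


class ArticleLinkParser(HTMLParser):
    """Collects hrefs of <a> tags that look like article links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and ARTICLE_LINK.match(href):
            self.links.append(href)


def extract_article_links(html: str) -> list:
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


sample = (
    '<div id="bodyContent">'
    '<a href="/wiki/Monty_Python">Monty Python</a>'
    '<a href="/wiki/Category:English_comedians">category</a>'
    '<a href="https://example.org/">external</a>'
    '<a href="#cite_note-1">[1]</a>'
    '</div>'
)
print(extract_article_links(sample))  # ['/wiki/Monty_Python']
```

The negative lookahead `(?!:)` is applied before every character after `/wiki/`, so a single colon anywhere in the path rejects the link.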

Viable paths I found

--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Time_zone
https://en.wikipedia.org/wiki/Nome,_Alaska
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Michael_Caine
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Gary_Oldman
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Daniel_Day-Lewis
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/New_town
https://en.wikipedia.org/wiki/Edmund_Bacon_(architect)
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Stoke-on-Trent
https://en.wikipedia.org/wiki/Hugh_Dancy
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Coventry
https://en.wikipedia.org/wiki/Bon_Jovi
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Pleasure_Beach_Blackpool
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Blackpool_Pleasure_Beach
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Frasier
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Brighton_and_Hove
https://en.wikipedia.org/wiki/Lewes
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Isle_of_Wight
https://en.wikipedia.org/wiki/Jeremy_Irons
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/South_Gloucestershire
https://en.wikipedia.org/wiki/EE_Limited
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Conservative_Party_(UK)
https://en.wikipedia.org/wiki/Early_1990s
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Margaret_Thatcher
https://en.wikipedia.org/wiki/Meryl_Streep
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/History_of_local_government_in_England
https://en.wikipedia.org/wiki/Cleveland
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Urban_area
https://en.wikipedia.org/wiki/Empire_State_Building
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
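The listing above follows the format the crawler writes to result.txt: one URL per line, with a line of dashes between paths. A small sketch of how such a listing could be checked afterwards, confirming that every path starts at Eric_Idle, ends at Kevin_Bacon and uses at most six jumps. `parse_paths` and the embedded sample are illustrative; the sample is one path copied from the listing.

```python
def parse_paths(text: str) -> list:
    """Split result-file content into lists of URLs, one list per path."""
    paths, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {'-'}:  # separator line made only of dashes
            if current:
                paths.append(current)
            current = []
        elif line:
            current.append(line)
    if current:  # tolerate a missing trailing separator
        paths.append(current)
    return paths


sample = """--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Margaret_Thatcher
https://en.wikipedia.org/wiki/Meryl_Streep
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
"""

for path in parse_paths(sample):
    jumps = len(path) - 1
    ok = (path[0].endswith('/Eric_Idle')
          and path[-1].endswith('/Kevin_Bacon')
          and jumps <= 6)
    print(jumps, ok)  # 6 True
```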

Thoughts

2020: witnessing history!
