[Python] "Web Scraping with Python" (《Python网络爬虫权威指南》), Chapter 3 task: verifying the six degrees of separation theory

Task description

Can we get from one wiki page to another by following internal links only, in at most six jumps? For this book, the task is to get from https://en.wikipedia.org/wiki/Eric_Idle to https://en.wikipedia.org/wiki/Kevin_Bacon.
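To make the task concrete, here is a minimal sketch of a depth-limited search over a tiny, made-up link graph. The graph `toy_links` and the function `find_path` are hypothetical illustrations, not the book's code and not the crawler used further down (which recurses depth-first). Breadth-first search is swapped in here only because it returns a shortest path first.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical toy link graph: page -> pages it links to.
toy_links: Dict[str, List[str]] = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/D'],
    '/wiki/C': ['/wiki/D', '/wiki/E'],
    '/wiki/D': ['/wiki/F'],
    '/wiki/E': [],
    '/wiki/F': [],
}


def find_path(links: Dict[str, List[str]], start: str, target: str,
              max_jumps: int = 6) -> Optional[List[str]]:
    """Return a shortest path from start to target using at most
    max_jumps links, or None if no such path exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        if len(path) - 1 >= max_jumps:
            continue  # jump budget used up, do not expand further
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_path(toy_links, '/wiki/A', '/wiki/F'))
print(find_path(toy_links, '/wiki/A', '/wiki/F', max_jumps=2))
```

On real Wikipedia data the `links` dictionary would have to be filled by crawling, which is where nearly all of the running time goes.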

Approach

The book already covers it, so I won't repeat it here.

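Process log

Stuck at home during the pandemic with nothing better to do, I left my laptop running for three days. The final results:

  • Crawled more than 80,000 pages and saved them locally, 10 GB+ in total;
  • Analyzed more than 200,000 internal links;
  • Found more than a dozen workable paths;
  • Did not actually find every workable path; in the end I did not feel like letting it run any longer.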

Code

Fetch a wiki page and save it locally (there is the wall to deal with, after all, so this makes it easy to rerun after an error)

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from http.client import HTTPResponse
from typing import Tuple
import hashlib
import time

storage_directory = 'D:/MyResources/爬虫数据/Wiki Pages'


def process_filename(filename: str) -> str:
    """Strip characters Windows forbids in file names and prepend the storage directory."""
    # hashlib instead of the built-in hash(): str hashes are salted per process,
    # so hash() would produce a different fallback name on every run.
    hash_res = hashlib.md5(filename.encode('utf-8')).hexdigest()
    for ch in '"?*<>:/\\|':
        filename = filename.replace(ch, '')
    # Nothing usable left (empty, or only dots): fall back to the hash.
    if len(filename) == 0 or len(filename) == filename.count('.'):
        filename = hash_res
    return storage_directory + '/' + filename


def get_and_store_page(url: str, filename: str) -> bool:
    """Download url and save it under filename; return True on success."""
    try:
        response = urlopen(url)  # type: HTTPResponse
    except HTTPError as e:
        print(f'HTTPError: {e}')
        return False
    except URLError as e:
        print(f'URLError: {e}')
        return False
    html = response.read().decode(encoding='utf-8')
    try:
        with open(file=process_filename(filename), mode='w', encoding='utf-8') as f:
            f.write(html)
    except OSError as e:  # missing directory, or a name the OS rejects
        print(f'check your file name: {e}')
        return False
    time.sleep(1)  # pause between requests to go easy on the server
    return True


def load_stored_html(filename: str) -> Tuple[str, bool]:
    """Return (html, True) if the page is stored locally, else ('', False)."""
    try:
        with open(file=process_filename(filename), mode='r', encoding='utf-8') as f:
            return f.read(), True
    except FileNotFoundError as e:
        print(f'check your filename: {e}')
        return '', False


if __name__ == '__main__':
    for name in ('Kevin_Bacon', 'Eric_Idle'):
        url = 'https://en.wikipedia.org/wiki/' + name
        if get_and_store_page(url=url, filename=name + '.html'):
            print(f'success: {url}')
        else:
            print(f'fail: {url}')
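One caveat with deleting forbidden characters is that two different titles can end up with the same file name (for example `AC/DC` and `ACDC`). A possible alternative, sketched below, is to percent-encode the title instead, which is reversible. The helper names `encode_filename` and `decode_filename` are hypothetical and are not part of the code above; only `urllib.parse.quote` and `unquote` come from the standard library.

```python
from urllib.parse import quote, unquote

SUFFIX = '.html'


def encode_filename(title: str) -> str:
    """Percent-encode every character outside letters, digits and '_.-~'."""
    return quote(title, safe='') + SUFFIX


def decode_filename(filename: str) -> str:
    """Invert encode_filename."""
    return unquote(filename[:-len(SUFFIX)])


print(encode_filename('AC/DC'))         # slash is encoded, not dropped
print(encode_filename('Nome,_Alaska'))
print(decode_filename(encode_filename('AC/DC')))
```

The encoded names contain `%`, which Windows accepts in file names, and none of the characters `" ? * < > : / \ |`.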

Verifying the six degrees of separation theory

from bs4 import BeautifulSoup
from bs4.element import Tag
from CH3_GetWikipedia import load_stored_html, get_and_store_page
import re
import time
import copy

host = 'https://en.wikipedia.org'
visited_url = dict()  # path -> smallest jump count at which it was seen
jump_path = ['', '', '', '', '', '', '']  # current path, one slot per jump (0..6)
results = []


def find_kevin_bacon(path: str, jumps: int) -> None:
    """Depth-first search from path, currently jumps links away from the start page."""
    jump_path[jumps] = host + path
    if path.split('/')[-1] == 'Kevin_Bacon':
        print('!!!! it\'s found!')
        # Keep only the slots used so far; later slots may hold stale entries.
        found = copy.deepcopy(jump_path[:jumps + 1])
        results.append(found)
        with open(file='./result.txt', mode='a', encoding='utf-8') as f:
            for u in found:
                print(u)
                f.write(u + '\n')
            f.write('--------------------\n')
        return
    # Skip pages already reached with the same or a smaller number of jumps.
    if path in visited_url:
        if visited_url[path] > jumps:
            visited_url[path] = jumps
        else:
            return
    else:
        visited_url[path] = jumps
    now = time.strftime('%H:%M:%S')
    print(f'---> {now} jump time: {jumps}, visited: {len(visited_url)}, now visit: {path}.')
    if jumps >= 6:
        return
    # Use the local copy if there is one; otherwise download it, then load it.
    filename = path.split('/')[-1] + '.html'
    html, success = load_stored_html(filename=filename)
    if not success:
        if not get_and_store_page(url=host + path, filename=filename):
            return
        html, success = load_stored_html(filename=filename)
    bs = BeautifulSoup(markup=html, features='html.parser')
    body = bs.find(name='div', attrs={'id': 'bodyContent'})
    if body is None:  # page without the usual article body
        return
    # Article links only: start with /wiki/ and contain no colon.
    links = body.find_all(name='a', attrs={'href': re.compile('^(/wiki/)((?!:).)*$')})
    for link in links:  # type: Tag
        find_kevin_bacon(path=link['href'], jumps=jumps + 1)


if __name__ == '__main__':
    find_kevin_bacon(path='/wiki/Eric_Idle', jumps=0)
    print(f'Found {len(results)} paths in total:')
    for res in results:
        print(' -> '.join(res))
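The link filter is the least obvious part of the crawler, so here is a small standalone check of the same regular expression. The sample hrefs are made up for illustration and the variable names are hypothetical. The pattern keeps hrefs that start with `/wiki/` and contain no colon, which excludes namespace pages such as `Category:` or `File:`.

```python
import re

# Same pattern as in the crawler: starts with /wiki/, no ':' anywhere after it.
ARTICLE_LINK = re.compile('^(/wiki/)((?!:).)*$')

# Made-up sample hrefs of the kinds found on a wiki page.
candidates = [
    '/wiki/Kevin_Bacon',
    '/wiki/Category:Example',
    '/wiki/File:Example.jpg',
    '//en.wikipedia.org/wiki/Main_Page',
    '#cite_note-1',
    '/wiki/Nome,_Alaska',
]

article_links = [href for href in candidates if ARTICLE_LINK.match(href)]
print(article_links)
```

Note that the pattern still accepts hrefs with a `#fragment`, so the same article can show up under more than one href.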

The workable paths I found

--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Time_zone
https://en.wikipedia.org/wiki/Nome,_Alaska
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Michael_Caine
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Gary_Oldman
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Daniel_Day-Lewis
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/New_town
https://en.wikipedia.org/wiki/Edmund_Bacon_(architect)
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Stoke-on-Trent
https://en.wikipedia.org/wiki/Hugh_Dancy
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Coventry
https://en.wikipedia.org/wiki/Bon_Jovi
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Pleasure_Beach_Blackpool
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Blackpool_Pleasure_Beach
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Frasier
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Brighton_and_Hove
https://en.wikipedia.org/wiki/Lewes
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Isle_of_Wight
https://en.wikipedia.org/wiki/Jeremy_Irons
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/South_Gloucestershire
https://en.wikipedia.org/wiki/EE_Limited
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Conservative_Party_(UK)
https://en.wikipedia.org/wiki/Early_1990s
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Margaret_Thatcher
https://en.wikipedia.org/wiki/Meryl_Streep
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/History_of_local_government_in_England
https://en.wikipedia.org/wiki/Cleveland
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Urban_area
https://en.wikipedia.org/wiki/Empire_State_Building
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
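For anyone who wants to post-process a listing like the one above, the following is a rough sketch of reading it back into lists of URLs. The function `parse_paths` is a hypothetical helper, and the sample text reuses two of the paths from the listing. It assumes the layout shown here: one URL per line, with paths separated by a line of dashes.

```python
from typing import List

SEPARATOR = '--------------------'


def parse_paths(text: str) -> List[List[str]]:
    """Split a separator-delimited listing into one list of URLs per path."""
    paths: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line == SEPARATOR:
            if current:
                paths.append(current)
            current = []
        elif line:
            current.append(line)
    if current:  # listing did not end with a separator
        paths.append(current)
    return paths


sample = """--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Time_zone
https://en.wikipedia.org/wiki/Nome,_Alaska
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Conservative_Party_(UK)
https://en.wikipedia.org/wiki/Early_1990s
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
"""

paths = parse_paths(sample)
jumps = [len(p) - 1 for p in paths]  # number of links followed in each path
print(len(paths), jumps)
```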

Reflections

2020: witnessing history!
