Scraping hyperlinks from a web page with Python
Parse the page with BeautifulSoup from the bs4 package:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://blog.csdn.net/zzc15806/')  # fetch the page
bs = BeautifulSoup(html, 'html.parser')            # parse the page
hyperlink = bs.find_all('a')                       # collect all <a> tags

for h in hyperlink:
    hh = h.get('href')
    print(hh)
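The same extraction logic can be tried offline against a small inline HTML string (a hypothetical snippet, not the CSDN page), which makes the behavior of `find_all('a')` and `get('href')` easy to see, including the `None` you get for anchor tags without an href:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only to demonstrate the extraction logic.
html = '''
<html><body>
<a href="https://example.com/a">A</a>
<a href="https://example.com/b">B</a>
<a>no href here</a>
</body></html>
'''

bs = BeautifulSoup(html, 'html.parser')
# Tags without an href attribute yield None, matching the output below.
links = [a.get('href') for a in bs.find_all('a')]
print(links)
```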
The output looks like this:
https://blog.csdn.net/zzc15806
javascript:void(0);
https://blog.csdn.net/zzc15806?orderby=UpdateTime
https://blog.csdn.net/zzc15806?orderby=ViewCount
https://blog.csdn.net/zzc15806/rss/list
https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83447006
https://blog.csdn.net/zzc15806/article/details/83447006
https://me.csdn.net/zzc15806
https://me.csdn.net/zzc15806
None
https://blog.csdn.net/zzc15806?t=1
https://blog.csdn.net/zzc15806?t=1
https://blog.csdn.net/home/help.html#level
https://blog.csdn.net/zzc15806/column/info/25194
https://blog.csdn.net/zzc15806/column/info/25194
https://blog.csdn.net/zzc15806/column/info/30921
https://blog.csdn.net/zzc15806/column/info/30921
https://blog.csdn.net/zzc15806/column/info/30926
https://blog.csdn.net/zzc15806/column/info/30926
https://blog.csdn.net/zzc15806/article/category/6989201
https://blog.csdn.net/zzc15806/article/category/7255220
https://blog.csdn.net/zzc15806/article/category/7422481
https://blog.csdn.net/zzc15806/article/category/7515657
https://blog.csdn.net/zzc15806/article/category/7534232
https://blog.csdn.net/zzc15806/article/category/7548654
https://blog.csdn.net/zzc15806/article/category/7549573
https://blog.csdn.net/zzc15806/article/category/7731524
https://blog.csdn.net/zzc15806/article/category/7732152
https://blog.csdn.net/zzc15806/article/category/7740409
https://blog.csdn.net/zzc15806/article/category/7749247
https://blog.csdn.net/zzc15806/article/category/7776199
https://blog.csdn.net/zzc15806/article/category/7830103
https://blog.csdn.net/zzc15806/article/category/7842074
https://blog.csdn.net/zzc15806/article/category/7936547
https://blog.csdn.net/zzc15806/article/category/8489572
None
https://blog.csdn.net/zzc15806/article/month/2018/12
https://blog.csdn.net/zzc15806/article/month/2018/11
https://blog.csdn.net/zzc15806/article/month/2018/10
https://blog.csdn.net/zzc15806/article/month/2018/09
https://blog.csdn.net/zzc15806/article/month/2018/08
https://blog.csdn.net/zzc15806/article/month/2018/07
https://blog.csdn.net/zzc15806/article/month/2018/06
https://blog.csdn.net/zzc15806/article/month/2018/05
https://blog.csdn.net/zzc15806/article/month/2018/04
https://blog.csdn.net/zzc15806/article/month/2018/03
https://blog.csdn.net/zzc15806/article/month/2018/02
https://blog.csdn.net/zzc15806/article/month/2018/01
https://blog.csdn.net/zzc15806/article/month/2017/10
https://blog.csdn.net/zzc15806/article/month/2017/06
None
https://blog.csdn.net/zzc15806/article/details/73662491
https://blog.csdn.net/zzc15806/article/details/79711114
https://blog.csdn.net/zzc15806/article/details/79603994
https://blog.csdn.net/zzc15806/article/details/79246716
https://blog.csdn.net/zzc15806/article/details/79615426
https://blog.csdn.net/zzc15806/article/details/79592577#comments
https://my.csdn.net/qq_35300611
https://blog.csdn.net/zzc15806/article/details/79592577#comments
https://my.csdn.net/zzc15806
https://blog.csdn.net/zzc15806/article/details/79592577#comments
https://my.csdn.net/qq_35300611
https://blog.csdn.net/zzc15806/article/details/79615426#comments
https://my.csdn.net/qq254271304
https://blog.csdn.net/zzc15806/article/details/80712320#comments
https://my.csdn.net/zzc15806
None
None
None
None
None
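Beyond `None` entries and `javascript:` pseudo-links, a page may also contain relative hrefs. A minimal cleanup sketch (the `hrefs` list here is a hypothetical mix, not taken verbatim from the output above) skips the junk and resolves relative paths against the page URL with the standard-library `urljoin`:

```python
from urllib.parse import urljoin

base = 'https://blog.csdn.net/zzc15806/'
# Hypothetical sample: absolute link, relative path, javascript pseudo-link, missing href.
hrefs = ['https://me.csdn.net/zzc15806',
         '/zzc15806/article/details/84996039',
         'javascript:void(0);',
         None]

cleaned = []
for hh in hrefs:
    if not hh or hh.startswith('javascript:'):  # drop missing and javascript: entries
        continue
    cleaned.append(urljoin(base, hh))  # absolute URLs pass through unchanged
print(cleaned)
```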
As the output above shows, the scraped links are quite messy, so it helps to filter them. For example, to collect only the blog-post links and save them to a 'blog.txt' file:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://blog.csdn.net/zzc15806/')  # fetch the page
bs = BeautifulSoup(html, 'html.parser')            # parse the page
hyperlink = bs.find_all('a')                       # collect all <a> tags

file = open('./blog.txt', 'w')
for h in hyperlink:
    hh = h.get('href')
    if hh and '/article/details/' in hh and '#comments' not in hh:  # keep only blog-post links
        print(hh)
        file.write(hh)  # write to blog.txt
        file.write('\n')
file.close()
The output looks like this:
https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83447006
https://blog.csdn.net/zzc15806/article/details/83447006
https://blog.csdn.net/zzc15806/article/details/73662491
https://blog.csdn.net/zzc15806/article/details/79711114
https://blog.csdn.net/zzc15806/article/details/79603994
https://blog.csdn.net/zzc15806/article/details/79246716
https://blog.csdn.net/zzc15806/article/details/79615426
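Notice that most articles still appear twice (once for the title link and once for the read-more link). A small sketch of order-preserving deduplication using `dict.fromkeys` (the sample list is a short excerpt of the output above):

```python
# Duplicate pairs, as in the filtered output above.
links = [
    'https://blog.csdn.net/zzc15806/article/details/84996039',
    'https://blog.csdn.net/zzc15806/article/details/84996039',
    'https://blog.csdn.net/zzc15806/article/details/84975709',
    'https://blog.csdn.net/zzc15806/article/details/84975709',
]

# dict keys are unique and preserve first-seen insertion order (Python 3.7+).
unique = list(dict.fromkeys(links))
print(unique)
```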