Parsing a web page with BeautifulSoup (bs4)

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://blog.csdn.net/zzc15806/')  # fetch the page
bs = BeautifulSoup(html, 'html.parser')            # parse the page
hyperlink = bs.find_all('a')                       # get all <a> tags

for h in hyperlink:
    hh = h.get('href')  # returns None if the tag has no href attribute
    print(hh)

The output:

https://blog.csdn.net/zzc15806
javascript:void(0);
https://blog.csdn.net/zzc15806?orderby=UpdateTime
https://blog.csdn.net/zzc15806?orderby=ViewCount
https://blog.csdn.net/zzc15806/rss/list
https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83447006
https://blog.csdn.net/zzc15806/article/details/83447006
https://me.csdn.net/zzc15806
https://me.csdn.net/zzc15806
None
https://blog.csdn.net/zzc15806?t=1
https://blog.csdn.net/zzc15806?t=1
https://blog.csdn.net/home/help.html#level
https://blog.csdn.net/zzc15806/column/info/25194
https://blog.csdn.net/zzc15806/column/info/25194
https://blog.csdn.net/zzc15806/column/info/30921
https://blog.csdn.net/zzc15806/column/info/30921
https://blog.csdn.net/zzc15806/column/info/30926
https://blog.csdn.net/zzc15806/column/info/30926
https://blog.csdn.net/zzc15806/article/category/6989201
https://blog.csdn.net/zzc15806/article/category/7255220
https://blog.csdn.net/zzc15806/article/category/7422481
https://blog.csdn.net/zzc15806/article/category/7515657
https://blog.csdn.net/zzc15806/article/category/7534232
https://blog.csdn.net/zzc15806/article/category/7548654
https://blog.csdn.net/zzc15806/article/category/7549573
https://blog.csdn.net/zzc15806/article/category/7731524
https://blog.csdn.net/zzc15806/article/category/7732152
https://blog.csdn.net/zzc15806/article/category/7740409
https://blog.csdn.net/zzc15806/article/category/7749247
https://blog.csdn.net/zzc15806/article/category/7776199
https://blog.csdn.net/zzc15806/article/category/7830103
https://blog.csdn.net/zzc15806/article/category/7842074
https://blog.csdn.net/zzc15806/article/category/7936547
https://blog.csdn.net/zzc15806/article/category/8489572
None
https://blog.csdn.net/zzc15806/article/month/2018/12
https://blog.csdn.net/zzc15806/article/month/2018/11
https://blog.csdn.net/zzc15806/article/month/2018/10
https://blog.csdn.net/zzc15806/article/month/2018/09
https://blog.csdn.net/zzc15806/article/month/2018/08
https://blog.csdn.net/zzc15806/article/month/2018/07
https://blog.csdn.net/zzc15806/article/month/2018/06
https://blog.csdn.net/zzc15806/article/month/2018/05
https://blog.csdn.net/zzc15806/article/month/2018/04
https://blog.csdn.net/zzc15806/article/month/2018/03
https://blog.csdn.net/zzc15806/article/month/2018/02
https://blog.csdn.net/zzc15806/article/month/2018/01
https://blog.csdn.net/zzc15806/article/month/2017/10
https://blog.csdn.net/zzc15806/article/month/2017/06
None
https://blog.csdn.net/zzc15806/article/details/73662491
https://blog.csdn.net/zzc15806/article/details/79711114
https://blog.csdn.net/zzc15806/article/details/79603994
https://blog.csdn.net/zzc15806/article/details/79246716
https://blog.csdn.net/zzc15806/article/details/79615426
https://blog.csdn.net/zzc15806/article/details/79592577#comments
https://my.csdn.net/qq_35300611
https://blog.csdn.net/zzc15806/article/details/79592577#comments
https://my.csdn.net/zzc15806
https://blog.csdn.net/zzc15806/article/details/79592577#comments
https://my.csdn.net/qq_35300611
https://blog.csdn.net/zzc15806/article/details/79615426#comments
https://my.csdn.net/qq254271304
https://blog.csdn.net/zzc15806/article/details/80712320#comments
https://my.csdn.net/zzc15806
None
None
None
None
None
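The None entries above come from <a> tags that have no href attribute, for which .get('href') returns None, and entries like javascript:void(0); are pseudo-links rather than real URLs. A minimal sketch (plain Python, no network access; the sample list below is a hypothetical slice of the output above) of dropping both before further processing:

```python
# Hypothetical sample of raw href values, mirroring the mixed output above.
raw_hrefs = [
    'https://blog.csdn.net/zzc15806',
    'javascript:void(0);',
    None,
    'https://blog.csdn.net/zzc15806/article/details/84996039',
]

# Keep only real http(s) links: drop None and javascript: pseudo-links.
links = [h for h in raw_hrefs if h and h.startswith('http')]
print(links)
```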

As you can see, the scraped links are quite messy, so we can filter them. For example, to collect only the blog-post links and save them to a 'blog.txt' file:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://blog.csdn.net/zzc15806/')  # fetch the page
bs = BeautifulSoup(html, 'html.parser')            # parse the page
hyperlink = bs.find_all('a')                       # get all <a> tags

with open('./blog.txt', 'w') as file:
    for h in hyperlink:
        hh = h.get('href')
        # Keep only blog-post links, skipping comment anchors.
        if hh and '/article/details/' in hh and '#comments' not in hh:
            print(hh)
            file.write(hh)    # write to blog.txt
            file.write('\n')

The output:

https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/yoyo_liyy/article/details/82762601
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84996039
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975709
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975539
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84975137
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84974458
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84973370
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84972108
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84971215
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84875070
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84779131
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84137013
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/84067017
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999940
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83999668
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83540661
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83504130
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83474661
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83472329
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83448761
https://blog.csdn.net/zzc15806/article/details/83447006
https://blog.csdn.net/zzc15806/article/details/83447006
https://blog.csdn.net/zzc15806/article/details/73662491
https://blog.csdn.net/zzc15806/article/details/79711114
https://blog.csdn.net/zzc15806/article/details/79603994
https://blog.csdn.net/zzc15806/article/details/79246716
https://blog.csdn.net/zzc15806/article/details/79615426
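Most posts still appear twice, because on the page both the title and the summary link to the same URL. An order-preserving de-duplication pass can be sketched in plain Python (the sample list below is a hypothetical slice of the output above):

```python
# Hypothetical duplicated list, as produced by the filtering loop above.
links = [
    'https://blog.csdn.net/zzc15806/article/details/84996039',
    'https://blog.csdn.net/zzc15806/article/details/84996039',
    'https://blog.csdn.net/zzc15806/article/details/84975709',
]

# dict.fromkeys preserves insertion order (Python 3.7+), so this
# removes duplicates while keeping the original ordering.
unique_links = list(dict.fromkeys(links))
print(unique_links)
```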
