BeautifulSoup is an excellent Python extension library for extracting data of interest from HTML or XML documents, and it lets you choose among several different parsers. Since BeautifulSoup 3 is no longer maintained, new projects should use BeautifulSoup 4 (at the time of writing, the latest version is 4.5.0). Install it with pip install beautifulsoup4, then import it with from bs4 import BeautifulSoup. Below is a quick tour of BeautifulSoup 4's main features; for complete documentation see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
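As a quick sanity check after installing, the import path and installed version can be verified; a minimal sketch (the exact version string will vary with your installation):

```python
import bs4
from bs4 import BeautifulSoup

# The installed version; 4.x is what new projects should see
print(bs4.__version__)  # e.g. '4.5.0'; varies by installation

# 'html.parser' is Python's built-in parser and needs no extra install
soup = BeautifulSoup('<p>ok</p>', 'html.parser')
print(soup.p.text)
```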

>>> from bs4 import BeautifulSoup

>>> BeautifulSoup('hello world!', 'lxml')  # automatically adds and completes missing tags

<html><body><p>hello world!</p></body></html>

>>> html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

>>> soup = BeautifulSoup(html_doc, 'html.parser')  # lxml or another parser also works

>>> print(soup.prettify())  # pretty-print the parse tree

<html>

<head>

<title>

The Dormouse's story

</title>

</head>

<body>

<p class="title">

<b>

The Dormouse's story

</b>

</p>

<p class="story">

Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">

Elsie

</a>

,

<a class="sister" href="http://example.com/lacie" id="link2">

Lacie

</a>

and

<a class="sister" href="http://example.com/tillie" id="link3">

Tillie

</a>

;

and they lived at the bottom of a well.

</p>

<p class="story">

...

</p>

</body>

</html>

>>> soup.title  # access a specific tag

<title>The Dormouse's story</title>

>>> soup.title.name  # tag name

'title'

>>> soup.title.text  # tag text

"The Dormouse's story"

>>> soup.title.string

"The Dormouse's story"

>>> soup.title.parent  # the enclosing (parent) tag

<head><title>The Dormouse's story</title></head>

>>> soup.head

<head><title>The Dormouse's story</title></head>

>>> soup.b

<b>The Dormouse's story</b>

>>> soup.body.b

<b>The Dormouse's story</b>

>>> soup.name   # the BeautifulSoup object itself behaves like a tag

'[document]'

>>> soup.body

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

</body>

>>> soup.p

<p class="title"><b>The Dormouse's story</b></p>

>>> soup.p['class']  # tag attribute

['title']

>>> soup.p.get('class')  # another way to read an attribute

['title']

>>> soup.p.text

"The Dormouse's story"

>>> soup.p.contents

[<b>The Dormouse's story</b>]

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

>>> soup.a.attrs  # all attributes of a tag

{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}

>>> soup.find_all('a')  # find all <a> tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> soup.find_all(['a', 'b'])   # find <a> and <b> tags at the same time

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> import re

>>> soup.find_all(href=re.compile("elsie"))  # tags whose href matches a pattern

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

>>> soup.find(id='link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

>>> soup.find_all('a', id='link3')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> for link in soup.find_all('a'):

    print(link.text, ':', link.get('href'))

Elsie : http://example.com/elsie

Lacie : http://example.com/lacie

Tillie : http://example.com/tillie

>>> print(soup.get_text())  # all text in the document

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie,

Lacie and

Tillie;

and they lived at the bottom of a well.

...

>>> soup.a['id'] = 'test_link1'  # modify an attribute value

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>

>>> soup.a.string.replace_with('test_Elsie')  # replace the tag's text

'Elsie'

>>> soup.a.string

'test_Elsie'

>>> print(soup.prettify())

<html>

<head>

<title>

The Dormouse's story

</title>

</head>

<body>

<p class="title">

<b>

The Dormouse's story

</b>

</p>

<p class="story">

Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="test_link1">

test_Elsie

</a>

,

<a class="sister" href="http://example.com/lacie" id="link2">

Lacie

</a>

and

<a class="sister" href="http://example.com/tillie" id="link3">

Tillie

</a>

;

and they lived at the bottom of a well.

</p>

<p class="story">

...

</p>

</body>

</html>

>>> for child in soup.body.children:   # iterate over direct children

    print(child)

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

>>> for string in soup.strings:  # iterate over every string in the document; output omitted

    print(string)

>>> test_doc = '<html><head></head><body><p></p><p></p></body></html>'

>>> s = BeautifulSoup(test_doc, 'lxml')

>>> for child in s.html.children:   # iterate over direct children

    print(child)

<head></head>

<body><p></p><p></p></body>

>>> for child in s.html.descendants: # iterate over all descendants

    print(child)

<head></head>

<body><p></p><p></p></body>

<p></p>

<p></p>
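The interactive session above can be condensed into a plain script. A minimal sketch that parses the same document and collects each sister's name and link (the `links` variable name is illustrative):

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Map each link's text to its href, mirroring the find_all() loop above
links = {a.text: a.get('href') for a in soup.find_all('a')}
print(links['Elsie'])     # http://example.com/elsie

# The title text, via attribute-style tag access
print(soup.title.string)  # The Dormouse's story
```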
