Python Web Scraping: A Guide to the Beautiful Soup Module

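Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/bruce_6/article/details/80764000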
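The typical workflow for scraping a web page is as follows:

Choose the URL to scrape
Request that URL from Python (urlopen, requests, etc.)
Read the page content (for example with read())
Pass the content to BeautifulSoup
Use BeautifulSoup to select tags and other information
As you can see, fetching the page is not hard; the hard part is filtering the data, that is, extracting exactly what you want. This article walks through how to use BeautifulSoup.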
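
The steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the helper names fetch_html and extract_links are hypothetical, the URL mentioned in the comment is a placeholder, and the built-in html.parser is assumed so that no third-party parser is needed.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Steps 1-3: open the URL and read the raw page bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_links(html):
    """Steps 4-5: load the page into BeautifulSoup and pick out tag data."""
    soup = BeautifulSoup(html, "html.parser")
    # Collect the href of every <a> tag that actually has one.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Offline demonstration on an inline page. For a real site, pass the result
# of fetch_html("https://example.com/") instead (placeholder URL).
sample = b'<html><body><a href="/one">1</a><a name="x">x</a><a href="/two">2</a></body></html>'
print(extract_links(sample))
```

Keeping the download and the parsing in separate functions makes the parsing step easy to try on saved or inline HTML without touching the network.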

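The official Beautiful Soup documentation introduces it as follows:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.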

1 Installation
You can install it directly with pip:

$ pip install beautifulsoup4
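In addition to the HTML parser in Python's standard library, BeautifulSoup supports third-party parsers such as lxml (for both HTML and XML) and html5lib, but the corresponding libraries must be installed first. If none of them is installed, BeautifulSoup uses Python's built-in html.parser. The lxml parser is faster, so installing it is recommended.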

$ pip install html5lib
$ pip install lxml
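
To illustrate how the parser choice plays out in code, here is a small sketch that tries the parsers in order of preference and falls back to the built-in one. The helper name make_soup and the preference order are assumptions made for illustration, not part of the Beautiful Soup API.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print(used, soup.p.string)
```

Naming the parser explicitly, as done here, also keeps results consistent between machines that have different parser libraries installed.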
2 Basic usage of BeautifulSoup
First, create a string that the rest of this article uses to demonstrate BeautifulSoup.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Parsing this document with BeautifulSoup produces a BeautifulSoup object, which can print the document as a properly indented structure:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "lxml")
>>> print(soup.prettify())
For brevity, the output is not shown here.

Here are a few simple ways to navigate the structured data:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id='link1')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
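3 Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.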
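
The four kinds can be seen side by side in a short sketch. The one-line HTML snippet is made up for illustration, and html.parser is assumed so that nothing beyond beautifulsoup4 is required.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

tag = soup.p                 # a Tag
text = soup.p.contents[0]    # a NavigableString
note = soup.p.contents[1]    # a Comment

for obj in (soup, tag, text, note):
    print(type(obj).__name__)

# BeautifulSoup is a special Tag, and Comment is a special NavigableString.
print(isinstance(soup, Tag), isinstance(note, NavigableString))
```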

3.1 Tag
Put simply, a Tag corresponds to a tag in the HTML document, such as the title and a tags above. For example:

<title>The Dormouse's story</title>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
You can get these tags easily by writing soup followed by the tag name.

>>> print(soup.p)
<p class="title"><b>The Dormouse's story</b></p>
>>> print(soup.title)
<title>The Dormouse's story</title>
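Note that this finds only the first matching tag in the whole document. Querying all matching tags is covered later.

Every Tag has two important attributes, name and attrs. name is the name of the tag, and attrs is a dictionary of all the tag's attributes (for example class, id, and href), not only its class.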

>>> print(soup.p.name)
p
>>> print(soup.p.attrs)
{'class': ['title']}
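
Since attrs holds every attribute, individual values can be read much like dictionary entries. The sketch below uses one link from the sample document and assumes html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

print(link.attrs)         # every attribute, as a dictionary
print(link["href"])       # dictionary-style access; raises KeyError if missing
print(link.get("title"))  # .get() returns None when the attribute is absent
print(link["class"])      # class is multi-valued, so it comes back as a list
```

Using .get() is the safer choice when scraping real pages, where an attribute may be missing on some tags.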
3.2 NavigableString
NavigableString: the text inside a tag, obtained with, for example, soup.p.string.

>>> print(soup.p.string)
The Dormouse's story
3.3 BeautifulSoup
BeautifulSoup: represents the document as a whole. Most of the time you can treat it as a Tag object; it is a special kind of Tag.

3.4 Comment
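Comment: a Comment object is a special type of NavigableString. Its output does not include the comment delimiters, so if it is not handled carefully it can cause unexpected trouble in text processing.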

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup = BeautifulSoup(markup, "lxml")
>>> comment = soup.b.string
>>> print(comment)
Hey, buddy. Want to buy a used parser?
>>> type(comment)
<class 'bs4.element.Comment'>
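The content of the b tag is actually a comment, but when we print it through .string the comment delimiters have already been stripped, so it looks like ordinary text. This can cause unnecessary trouble.

In such cases, first check whether the object's type is bs4.element.Comment, and only then do anything else with it, such as printing it.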
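
The type check described above might look like this. The helper name visible_string is hypothetical, and html.parser is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b><i>Plain text</i>"
soup = BeautifulSoup(markup, "html.parser")


def visible_string(tag):
    """Return tag.string, or None when that string is really a comment."""
    value = tag.string
    if isinstance(value, Comment):
        return None
    return value


print(visible_string(soup.b))  # the comment is filtered out
print(visible_string(soup.i))  # ordinary text passes through
```

isinstance() is used instead of comparing type() directly so that any subclass of Comment is filtered as well.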

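4 Searching the tree
BeautifulSoup provides many methods for navigating the tree, such as getting child, parent, and sibling nodes, but in practice these are used less often. What we mostly need is to search the tree for our targets.

Accessing a tag as an attribute, for example soup.a, returns only the first such tag in the document. To get all the <a> tags you need find_all(). The find_all() method looks through all descendants of the current tag and keeps those that match the filters. It accepts the following parameters: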

find_all( name , attrs , recursive , string , limit , **kwargs )
4.1 Searching by name
Pass a value for name to find all tags with that name. Text strings are ignored automatically.

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
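
Besides a plain string, name also accepts a list, a regular expression, or True. The sketch below tries each on a small made-up document, assuming html.parser.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b> and <a href='#'>link</a></p></body></html>",
    "html.parser",
)

# A list matches any of the listed tag names.
by_list = [t.name for t in soup.find_all(["a", "b"])]
# A compiled regular expression is searched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# True matches every tag, but no text strings.
by_true = [t.name for t in soup.find_all(True)]

print(by_list)
print(by_regex)
print(by_true)
```

Results always come back in document order, regardless of the order of the names in the list.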
4.2 Searching by id
If you pass a keyword argument named id, Beautiful Soup treats it as a filter on each tag's id attribute:

>>> soup.find_all(id='link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
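4.3 Searching by attrs
Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes, because they are not valid Python identifiers. You can still search for tags with such attributes by passing a dictionary to the attrs parameter of find_all().

In fact, id is just another attribute: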

>>> soup.find_all(attrs={'id':'link1'})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
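
The data-* case mentioned above can be sketched as follows. The div snippet is illustrative only, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# data_soup.find_all(data-foo="value") would be a SyntaxError, because
# data-foo is not a valid Python keyword argument name.
matches = data_soup.find_all(attrs={"data-foo": "value"})
print(matches)
```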
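4.4 Searching by CSS class
Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. For this reason, starting with Beautiful Soup 4.1.1, you can search by CSS class with the class_ keyword argument: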

>>> soup.find_all(class_='sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
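
A tag can carry several CSS classes at once, and class_ matches against each of them. The following sketch shows that behavior on a made-up paragraph, assuming html.parser.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

# class is multi-valued, so matching either single class finds the tag.
print(len(css_soup.find_all("p", class_="strikeout")))
print(len(css_soup.find_all("p", class_="body")))
# The exact class string also matches, but a different order does not.
print(len(css_soup.find_all("p", class_="body strikeout")))
print(len(css_soup.find_all("p", class_="strikeout body")))
```

When the order of the classes is not known in advance, matching on a single class name is the more robust choice.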
4.5 The string parameter
The string parameter searches the text content of the document. Like name, it accepts a string, a regular expression, a list, or True.

>>> soup.find_all('a', string='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
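
The other value types for string can be sketched like this, using a shortened version of the sample paragraph and assuming html.parser.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>',
    "html.parser",
)

# Without a tag name, string= returns the matching text nodes themselves.
exact = soup.find_all(string="Elsie")
from_list = soup.find_all(string=["Tillie", "Elsie"])
by_regex = soup.find_all(string=re.compile("ie$"))
# Combined with a tag name, it returns the tags whose .string matches.
ids = [a["id"] for a in soup.find_all("a", string=re.compile("^L"))]

print(exact, from_list, by_regex, ids)
```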
4.6 The recursive parameter
When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
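
A small sketch of the difference, on a trimmed-down document and assuming html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser",
)

# <title> is a grandchild of <html>, so the default recursive search finds it.
deep = soup.html.find_all("title")
# With recursive=False only the direct children (<head>, <body>) are examined.
shallow = soup.html.find_all("title", recursive=False)
children = [t.name for t in soup.html.find_all(True, recursive=False)]

print(deep, shallow, children)
```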

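4.7 The find() method
The only difference from find_all() is the return value: find_all() returns a list of all matches (an empty list when nothing matches), while find() returns only the first match (None when nothing matches).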
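
The two return types can be compared directly. This sketch uses a two-link snippet, assumes html.parser, and the variable caption is only there to illustrate the None check.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">Elsie</a><a id="link2">Lacie</a>', "html.parser")

first = soup.find("a")                    # first match only
first_in_list = soup.find_all("a", limit=1)  # same tag, wrapped in a list
missing = soup.find("table")              # None when nothing matches
missing_list = soup.find_all("table")     # empty list when nothing matches

# Because find() can return None, guard before using the result.
caption = missing.get_text() if missing is not None else ""

print(first, first_in_list, missing, missing_list)
```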

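4.8 The get_text() method
If you only want the text inside a tag, use get_text(). It returns all the text contained in the tag, including the text of its descendants, as a single string.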

>>> soup.find_all('a', string='Elsie')[0].get_text()
'Elsie'
>>> soup.find_all('a', string='Elsie')[0].string
'Elsie'
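
For a tag with a single text child, .string and get_text() agree, as above. They differ once a tag has several children, which the following sketch shows on a small made-up link, assuming html.parser.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# .string is None here because <a> has more than one child node.
print(soup.a.string)
# get_text() joins all the text below the tag.
print(repr(soup.a.get_text()))
# A separator plus strip=True gives cleaner output.
print(soup.a.get_text("|", strip=True))
```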
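That covers the most common ways of using Beautiful Soup. To learn more, see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Summary
This article introduced Beautiful Soup through a few small examples, so it should no longer feel unfamiliar. The next article will use Beautiful Soup in a hands-on scraping project. Stay tuned!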


--------
Copyright notice: This is an original article by CSDN blogger "hoxis", released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
