BeautfulSoup详解

Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式。这是官方文档的解释，我们在爬虫领域主要用来解析我们爬取的界面，得到我们需要的文字，图片等信息。

一、安装Beautiful Soup

pip install beautifulsoup4

如果您用的是Ubuntu

apt-get install Python-bs4

在PyPi中还有一个名字是 BeautifulSoup 的包,但那可能不是你想要的,那是 Beautiful Soup3 的发布版本,因为很多项目还在使用BS3, 所以 BeautifulSoup 包依然有效.但是如果你在编写新项目,那么你应该安装的 beautifulsoup4

如果上述安装方法都行不通,Beautiful Soup的发布协议允许你将BS4的代码打包在你的项目中,这样无须安装即可使用.

Beautiful Soup支持Python标准库中的HTML解析器,还支持一些第三方的解析器,其中一个是 lxml .根据操作系统不同,可以选择下列方法来安装lxml:

apt-get install Python-lxmleasy_install lxmlpip install lxml

二、快速使用

将一段文档传入BeautifulSoup 的构造方法,就能得到一个文档的对象, 可以传入一段字符串或一个文件句柄。

from bs4 import BeautifulSoupsoup = BeautifulSoup(open("./index.html"))soup = BeautifulSoup("<html>data</html>")

首先,文档被转换成Unicode,并且HTML的实例都被转换成Unicode编码。

然后,Beautiful Soup选择最合适的解析器来解析这段文档,如果手动指定解析器那么Beautiful Soup会选择指定的解析器来解析文档。

2.1、对象的种类

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种: Tag , NavigableString , BeautifulSoup , Comment 。

2.1.1、Tag

Tag 对象与XML或HTML原生文档中的tag相同:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

每个tag都有自己的名字，我么通过.name获取

tag.name
# 'b'

一个tag可能有很多个属性. tag <b class="boldest"> 有一个 “class” 的属性,值为 “boldest” . tag的属性的操作方法与字典相同:

tag['class']
# ['boldest']

也可以直接”点”取属性，这里我们注意到我们取得是一个字典，对应的是一个列表，列表中是可以有多个值的，熟悉HTML的可能知道，因为class等值在HTML4之后是支持多值属性的，但是XML是不支持的。

tag.attrs
# {'class': ['boldest']}
tag.attrs.get('class')
# ['boldest']

2.1.2、NavigableString

字符串常被包含在tag内.Beautiful Soup用 NavigableString 类来包装tag中的字符串

tag.string
# 返回值是字符串
# 'Extremely bold'

2.1.3、BeautifulSoup

BeautifulSoup 对象表示的是一个文档的全部内容.大部分时候,可以把它当作 Tag 对象,它支持遍历文档树和搜索文档树中描述的大部分的方法.

因为 BeautifulSoup 对象并不是真正的HTML或XML的tag,所以它没有name和attribute属性.但有时查看它的 .name 属性是很方便的,所以 BeautifulSoup 对象包含了一个值为 “[document]” 的特殊属性 .name

soup.name
# u'[document]'

2.1.4、Comment

Tag , NavigableString , BeautifulSoup 几乎覆盖了html和xml中的所有内容,但是还有一些特殊对象.容易让人担心的内容是文档的注释部分:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

Comment 对象是一个特殊类型的 NavigableString 对象:

comment
# u'Hey, buddy. Want to buy a used parser'

但是当它出现在HTML文档中时, Comment 对象会使用特殊的格式输出:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>

2.2、遍历文档树

其实我们在爬虫中用的文档树不多，我们更多地是使用搜索树

我们以官网上的一个例子为例

html_doc = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

2.2.1、子节点

soup.head
# <head><title>The Dormouse's story</title></head>soup.title
# <title>The Dormouse's story</title>

但是用过.属性的方式只能获取第一个tag，比如.a可以匹配多个属性，但是只能输出找到的第一个，想要找到多个就需要用到搜索文档树中描述的方法,比如: find_all()

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

tag的 .contents 属性可以将tag的子节点以列表的方式输出，但是该属性仅包含tag的直接子节点:

head_tag = soup.head
# <head><title>The Dormouse's story</title></head>head_tag.contents
# [<title>The Dormouse's story</title>]title_tag = head_tag.contents[0]
# <title>The Dormouse's story</title>title_tag.contents
# [u'The Dormouse's story']for child in title_tag.children:print(child)
# The Dormouse's story

tag的.descendants属性可以对所有tag的子孙节点进行递归循环

for child in head_tag.descendants:print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

tag的 .string属性，如果tag只有一个 NavigableString 类型子节点,那么这个tag可以使用 .string 得到子节点，如果一个tag仅有一个子节点,那么这个tag也可以使用 .string 方法,输出结果与当前唯一子节点的 .string 结果相同，如果tag包含了多个子节点,tag就无法确定 .string 方法应该调用哪个子节点的内容, .string 的输出结果是 None 。

title_tag.string
# u'The Dormouse's story'head_tag.contents
# [<title>The Dormouse's story</title>]head_tag.string
# u'The Dormouse's story'soup.html.string
# None

tag的 .strings属性。如果tag中包含多个字符串 ,可以使用 .strings 来循环获取:

tag的.stripped_strings属性可以去除多余空白内容:

for string in soup.strings:print(repr(string))# "The Dormouse's story"# '\n\n'# "The Dormouse's story"# '\n\n'# 'Once upon a time there were three little sisters; and their names were\n'# 'Elsie'# ',\n'# 'Lacie'# ' and\n'# 'Tillie'# ';\nand they lived at the bottom of a well.'# '\n\n'# '...'# '\n'for string in soup.stripped_strings:print(repr(string))# "The Dormouse's story"# "The Dormouse's story"# 'Once upon a time there were three little sisters; and their names were'# 'Elsie'# ','# 'Lacie'# 'and'# 'Tillie'# ';\nand they lived at the bottom of a well.'# '...'

2.2.2、父节点

tag的 .parent 属性来获取某个元素的父节点.在例子“爱丽丝”的文档中

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

tag的 .parents 属性可以递归得到元素的所有父辈节点,下面的例子使用了 .parents 方法遍历了a标签到根节点的所有节点.

link = soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>for parent in link.parents:if parent is None:print(parent)else:print(parent.name)
# p
# body
# html
# [document]
# None

2.2.3、兄弟结点

在文档树中,使用 .next_sibling 和 .previous_sibling 属性来查询兄弟节点:

sibling_soup.b.next_sibling
# <c>text2</c>sibling_soup.c.previous_sibling
# <b>text1</b>

b标签有 .next_sibling 属性,但是没有 .previous_sibling 属性,因为b标签在同级节点中是第一个.同理c标签有 .previous_sibling 属性,却没有 .next_sibling 属性:

通过 .next_siblings 和 .previous_siblings 属性可以对当前节点的兄弟节点迭代输出，其实和上面类似的道理

2.3、搜索文档树

Beautiful Soup定义了很多搜索方法,这里着重介绍2个: find() 和 find_all() .其它方法的参数和用法类似,请读者举一反三操作文档树最简单的方法就是告诉它你想获取的tag的name.如果想获取head标签,只要用 soup.head :

再以“爱丽丝”文档作为例子:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

使用 find_all() 类似的方法可以查找到想要查找的文档内容

2.3.1、过滤器

字符串

最简单的过滤器是字符串.在搜索方法中传入一个字符串参数,Beautiful Soup会查找与字符串完整匹配的内容,下面的例子用于查找文档中所有的b标签:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

如果传入字节码参数,Beautiful Soup会当作UTF-8编码,可以传入一段Unicode 编码来避免Beautiful Soup解析编码出错

正则表达式

如果传入正则表达式作为参数,Beautiful Soup会通过正则表达式的 match() 来匹配内容.下面例子中找出所有以b开头的标签,这表示和标签都应该被找到:

import re
for tag in soup.find_all(re.compile("^b")):print(tag.name)
# body
# b

列表

如果传入列表参数,Beautiful Soup会将与列表中任一元素匹配的内容返回.下面代码找到文档中所有标签和标签:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

True 可以匹配任何值,下面代码查找到所有的tag,但是不会返回字符串节点

for tag in soup.find_all(True):print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

方法

如果没有合适过滤器,那么还可以定义一个方法,方法只接受一个元素参数,如果这个方法返回 True 表示当前元素匹配并且被找到,如果不是则反回 False

下面方法校验了当前元素,如果包含 class 属性却不包含 id 属性,那么将返回 True:

def has_class_but_no_id(tag):return tag.has_attr('class') and not tag.has_attr('id')

将这个方法作为参数传入 find_all() 方法,将得到所有

标签:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

通过一个方法来过滤一类标签属性的时候, 这个方法的参数是要被过滤的属性的值, 而不是这个标签. 下面的例子是找出 href 属性不符合指定正则的 a 标签.

def not_lacie(href):return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

标签过滤方法可以使用复杂方法. 下面的例子可以过滤出前后都有文字的标签.

from bs4 import NavigableString
def surrounded_by_strings(tag):return (isinstance(tag.next_element, NavigableString)and isinstance(tag.previous_element, NavigableString))for tag in soup.find_all(surrounded_by_strings):print tag.name
# p
# a
# a
# a
# p

2.3.2、find_all()

find_all(name,attrs,recursive,string,**kwargs)

name 参数可以查找所有名字为 name 的tag,字符串对象会被自动忽略掉.

soup.find_all("title")
# [<title>The Dormouse's story</title>]

keywords参数如果一个指定名字的参数不是搜索内置的参数名,搜索时会把该参数当作指定名字tag的属性来搜索,如果包含一个名字为 id的参数,Beautiful Soup会搜索每个tag的”id”属性.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]# 只要有id
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]# class是python的关键字，所以我们使用class_
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

有些tag属性在搜索不能使用,比如HTML5中的 data-* 属性:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

attrs 参数定义一个字典参数来搜索包含特殊属性的tag:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

limit参数，当搜索到的结果数量达到 limit 的限制时,就停止搜索返回结果.

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

recursive参数，调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False .

2.3.3、find()

fand 与 fandall的区别就在于，find只返回找到的第一个结果，findall返回找到的所有结果

2.3.4、按照CSS搜索

Beautiful Soup支持大部分的CSS选择器，在 Tag 或 BeautifulSoup 对象的 .select() 方法中传入字符串参数, 即可使用CSS选择器的语法找到tag。

soup.select("title")
# [<title>The Dormouse's story</title>]soup.select("p nth-of-type(3)")
# [<p class="story">...</p>]

通过tag标签逐层查找:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select("html head title")
# [<title>The Dormouse's story</title>]

找到某个tag标签下的直接子标签 [6] :

soup.select("head > title")
# [<title>The Dormouse's story</title>]soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]soup.select("body > a")
# []

找到兄弟节点标签:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

通过CSS的类名查找:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

通过tag的id查找:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

同时用多种CSS选择器查询元素:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

通过是否存在某个属性来查找:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

通过属性的值来查找:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

通过语言设置来查找:

multilingual_markup = """<p lang="en">Hello</p><p lang="en-us">Howdy, y'all</p><p lang="en-gb">Pip-pip, old fruit</p><p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]

返回查找到的元素的第一个

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

对于熟悉CSS选择器语法的人来说这是个非常方便的方法