Python: Parsing with the bs4 Module

I. Introduction

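Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document, and it can save hours or even days of work.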

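Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of them is lxml, which is commonly used together with Beautiful Soup when parsing markup.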

Installation:

Install bs4: python -m pip install bs4
Install the lxml parser: python -m pip install lxml

II. Basic Usage

Import the module and create a BeautifulSoup object:

from bs4 import BeautifulSoup

html_doc = '<html><head><title>demo</title></head><body></body></html>'  # placeholder markup for illustration
soup = BeautifulSoup(html_doc, 'lxml')

When handing HTML to BeautifulSoup, you can pass the markup directly as a string, or you can pass an open file handle, for example:

with open('index.html') as fp:      # the file is closed automatically after parsing
    soup = BeautifulSoup(fp, 'lxml')
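
As a minimal sketch of both input styles, the snippet below parses a markup string and then a file-like object. The sample markup and the use of io.StringIO as a stand-in for an opened file are assumptions made for illustration, and 'html.parser' (the standard library parser) is used so the sketch runs even without lxml installed.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
markup = '<html><head><title>Demo</title></head><body><p>Hi</p></body></html>'

# 1) Pass the markup as a string.
soup_from_string = BeautifulSoup(markup, 'html.parser')

# 2) Pass a file-like object; io.StringIO stands in for an opened file.
with io.StringIO(markup) as handle:
    soup_from_file = BeautifulSoup(handle, 'html.parser')

print(soup_from_string.title.string)
print(soup_from_file.p.string)
```

Either way the result is the same kind of BeautifulSoup object, so the rest of the code does not need to know where the markup came from.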

III. Object Types

1. Tag

A tag has many methods and attributes. The two most important are name and attributes.

name:

Every tag has a name, which you read through the .name attribute. Assigning to .name renames the tag. For example:

soup = BeautifulSoup('<b class="boldest">hello,word!!!</b>', 'lxml')
tag = soup.b
print(tag.name)         # b
tag.name = 'testTag'
print(tag)              # <testTag class="boldest">hello,word!!!</testTag>

Attributes:

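A tag may have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest". Tag attributes are read and written the same way as dictionary entries. For example: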

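print(tag['class'])            # ['boldest'] (class is multi-valued, so the value is a list)
print(tag.attrs)               # as a dictionary: {'class': ['boldest']}
tag['id'] = 1                  # adds the attribute if it is missing, otherwise updates its value
del tag['id']                  # deletes an existing attribute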
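
The dictionary-style access above can be sketched a little further. The snippet below is an illustration on made-up markup; it uses Tag.has_attr() and Tag.get() to check for an attribute without risking a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, for illustration only.
soup = BeautifulSoup('<b class="boldest">hello</b>', 'html.parser')
tag = soup.b

tag['id'] = 'greeting'            # add a new attribute
had_id = tag.has_attr('id')       # True while the attribute exists
del tag['id']                     # remove it again

# tag['id'] would now raise KeyError; .get() returns None or a default instead.
missing = tag.get('id')
fallback = tag.get('id', 'no-id')

print(had_id, missing, fallback)
print(tag.attrs)
```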

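Multi-valued attributes:

HTML 4 defines a number of attributes that can hold multiple values. HTML5 removes some of them and adds more. The most common multi-valued attribute is class (a tag can have several CSS classes); others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns the value of a multi-valued attribute as a list. For example: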

soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(soup.p['class'])           # ['body', 'strikeout']
soup1 = BeautifulSoup('<p class="body"></p>', 'lxml')
print(soup1.p['class'])          # ['body']

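If an attribute looks like it has multiple values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string. For example: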

soup = BeautifulSoup('<p id="is my id"></p>', 'lxml')
print(soup.p['id'])                 # is my id

If the document is parsed as XML, there are no multi-valued attributes, and every attribute value is treated as a plain string.
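
A small sketch of the list-versus-string behaviour, using made-up markup and the standard library parser. Tag.get_attribute_list() is used here as a way to always receive a list, whether or not the attribute is multi-valued.

```python
from bs4 import BeautifulSoup

# Hypothetical markup combining a multi-valued and a single-valued attribute.
soup = BeautifulSoup('<p class="body strikeout" id="is my id"></p>', 'html.parser')
p = soup.p

classes = p['class']                      # multi-valued: a list
raw_id = p['id']                          # single-valued: a plain string
id_as_list = p.get_attribute_list('id')   # always a list, even for one value

# Assigning a list to a multi-valued attribute joins the values on output.
p['class'] = ['body', 'highlight']
rendered = str(p)

print(classes, raw_id, id_as_list)
print(rendered)
```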

2. NavigableString

Strings are usually contained inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class:

soup = BeautifulSoup('<b class="boldest">hello,word!!!</b>', 'lxml')
tag = soup.b
print(tag.string)               # hello,word!!!
print(type(tag.string))         # <class 'bs4.element.NavigableString'>

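A NavigableString behaves like a Python string, and it additionally supports some of the features described in the sections on navigating and searching the tree. In Python 3, str() converts a NavigableString into a plain string (Python 2 used unicode() for this):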

soup = BeautifulSoup('<b class="boldest">hello,word!!!</b>', 'lxml')
tag = soup.b
print(type(str(tag.string)))      # <class 'str'>
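
The relationship between NavigableString and the built-in str can be sketched as follows. The markup is made up for illustration; replace_with() is shown as one way to change the text, since the string cannot be edited in place.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical one-tag document, for illustration only.
soup = BeautifulSoup('<b class="boldest">hello</b>', 'html.parser')
text = soup.b.string

is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)    # NavigableString subclasses str
plain = str(text)                 # plain copy with no link back to the tree
has_parent = text.parent is soup.b

# Swap the string in the tree for a new one.
text.replace_with('goodbye')
rendered = str(soup.b)

print(is_navigable, is_str, type(plain), has_parent)
print(rendered)
```

Converting to a plain str is useful when the text should outlive the parse tree, because a NavigableString keeps a reference to the tree it came from.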

IV. Navigating the Tree

Example code: fetch a personal navigation page and parse it.

import re          # used by the regular-expression searches later on
import requests
from bs4 import BeautifulSoup

url = 'http://xx.xx.xx.128'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
req = requests.get(url, headers=headers)
# Create a BeautifulSoup object
soup = BeautifulSoup(req.content.decode('utf-8'), 'lxml')
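
The page address above is elided, so the examples cannot be reproduced against the real site. As a rough offline substitute, the sketch below builds a small stand-in document. Its markup is an assumption reconstructed from the outputs shown in this article, not the actual page, and it uses the standard library parser instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the navigation page, reconstructed from the
# outputs shown in this article; the real page is larger.
html_doc = """
<html>
<head>
<title>个性化安全导航</title>
<meta charset="utf-8"/>
</head>
<body>
<div class="title_card"><h2 class="title">论坛社区</h2></div>
<div>
<a class="link-tooltip" href="https://www.okx.com/" target="_blank" title="https://www.okx.com/"><span class="card-icon"><img src="./img/ouyi.ico"/></span><span class="card-title">欧易交易所</span></a>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)
print(soup.h2['class'])
print(soup.find(class_='card-title').string)
```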

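1. Navigating child nodes

1) Navigating by tag name

A tag may contain strings and other tags. These are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over them.

The simplest way to navigate the tree is to name the tag you want. To get the <head> tag, just use soup.head: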

soup.head
# <head>
# <title>个性化安全导航</title>
# <meta charset="utf-8"/>
# <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>
# </head>
soup.title
# <title>个性化安全导航</title>

To reach a particular tag further down the tree, chain the tag names. For example, to get the h2 tag under body:

soup.body.h2
# <h2 class="title">论坛社区</h2>

When several children share the same tag name, attribute-style access returns only the first one. To get all of them, use one of the methods described under Searching the tree, such as find_all():

soup.body.find_all('h2')
#[<h2 class="title">论坛社区</h2>,  <h2 class="title">杂七杂八</h2>]
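
The difference between attribute-style access and find_all() can be sketched on a small made-up document. The markup below is an assumption modeled on the page used in this article. Note that asking for a tag that does not exist returns None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the page used in this article.
html_doc = '<body><h2 class="title">论坛社区</h2><h2 class="title">杂七杂八</h2></body>'
soup = BeautifulSoup(html_doc, 'html.parser')

first_h2 = soup.body.h2             # only the first <h2>
all_h2 = soup.body.find_all('h2')   # every <h2>, as a list
no_table = soup.body.table          # None: there is no <table>

print(first_h2.string)
print([h2.string for h2 in all_h2])
print(no_table)
```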

2) Navigating with .contents and .children

A tag's .contents attribute returns its children as a list:

soup.head.contents
# [<title>个性化安全导航</title>, <meta charset="utf-8"/>, <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>]

A tag's .children generator lets you loop over its children:

for children in soup.head.children:
    print(children)
# <title>个性化安全导航</title>
# <meta charset="utf-8"/>
# <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>
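
When the markup has line breaks between tags, those line breaks show up among the children as whitespace-only strings. The sketch below, on made-up markup, shows one way to separate them from the tags by checking the node type.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup; note the line breaks between the tags.
html_doc = '<head>\n<title>Demo</title>\n<meta charset="utf-8"/>\n</head>'
head = BeautifulSoup(html_doc, 'html.parser').head

# .contents is a list, so it supports len() and indexing.
whitespace_nodes = [c for c in head.contents
                    if isinstance(c, NavigableString) and not c.strip()]

# .children iterates over the same nodes; keep only the tags.
child_tag_names = [c.name for c in head.children if isinstance(c, Tag)]

print(len(head.contents), len(whitespace_nodes))
print(child_tag_names)
```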

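3) Navigating descendants with .descendants

The .contents and .children attributes cover only a tag's direct children. For example, the <head> tag has just three direct children: <title>, <meta>, and <link>.

The <title> tag, however, has a child of its own: the string "个性化安全导航". That string is therefore a descendant of the <head> tag as well. The .descendants attribute iterates recursively over all of a tag's children, grandchildren, and so on: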

for children in soup.head.descendants:
    print(children)
# <title>个性化安全导航</title>
# 个性化安全导航
# <meta charset="utf-8"/>
# <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>
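
A minimal sketch contrasting the two iterators on a made-up one-child document: .children stops at the first level, while .descendants also yields the string inside the title.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
head = BeautifulSoup('<head><title>Demo</title></head>', 'html.parser').head

children = list(head.children)         # direct children only
descendants = list(head.descendants)   # children, grandchildren, and so on

print(len(children), len(descendants))
print(descendants)
```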

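4) Iterating over strings with .stripped_strings

If a tag contains more than one string, you can loop over them with .strings. However, .strings also yields whitespace such as line breaks as separate strings. .stripped_strings removes the extra whitespace: strings that consist entirely of whitespace are skipped, and leading and trailing whitespace is stripped from the rest:

for children in soup.body.div.find_all('div')[1].stripped_strings:
    print(children)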
# FreBuff 安全论坛
# 先知安全社区
# 安全客
# CSDN博客
# 百度超级链
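
To make the difference concrete, the sketch below runs both iterators over the same made-up fragment. It also shows get_text() with strip=True, which joins the stripped strings with a separator in a single call.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and padded text.
html_doc = '<div>\n  <span>先知安全社区</span>\n  <span> 安全客 </span>\n</div>'
div = BeautifulSoup(html_doc, 'html.parser').div

raw = list(div.strings)              # includes the whitespace-only strings
clean = list(div.stripped_strings)   # whitespace-only strings dropped, rest stripped
joined = div.get_text(',', strip=True)

print(raw)
print(clean)
print(joined)
```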

2. Navigating parent nodes

1) Getting the parent with .parent

The .parent attribute returns an element's parent. For example, to get the parent of the first h2 tag:

soup_h2 = soup.h2
print(soup_h2)
#<h2 class="title">论坛社区</h2>
print(soup_h2.parent)
#<div class="title_card"><h2 class="title">论坛社区</h2></div>

2) Getting all ancestors with .parents

The .parents attribute iterates over all of an element's ancestors, from the nearest outward. Continuing with the h2 tag:

soup_h2 = soup.h2
print(soup_h2)
for parent in soup_h2.parents:
    print(parent.name)
'''
<h2 class="title">论坛社区</h2>
div
div
body
html
[document]
'''
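
The ancestor chain can also be collected into a list. The sketch below uses made-up markup shaped like the output above, and adds find_parent() as a way to jump straight to a named ancestor instead of looping.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page used in this article.
html_doc = '<html><body><div><div><h2 class="title">论坛社区</h2></div></div></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')
h2 = soup.h2

ancestor_names = [parent.name for parent in h2.parents]
nearest_body = h2.find_parent('body')   # first ancestor named body
top_parent = soup.parent                # the document itself has no parent

print(ancestor_names)
print(nearest_body.name, top_parent)
```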

3. Sibling nodes

1) Finding preceding siblings

(1) .previous_sibling

Gets the sibling immediately before the element.

soup_span = soup.find(class_ = 'card-title')
print(soup_span)
#<span class="card-title">FreBuff 安全论坛</span>
print(soup_span.previous_sibling)
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>

(2) .previous_siblings

Gets all siblings before the element.

soup_span = soup.find(class_ = 'card-title')
print(soup_span)
#<span class="card-title">FreBuff 安全论坛</span>
for span in soup_span.previous_siblings:
    print(span)
'''There is only one preceding sibling, so only one is found.
<span class="card-icon"><img src="./img/frebuff.ico"/></span>
'''

2) Finding following siblings

(1) .next_sibling

Gets the sibling immediately after the element.

soup_span = soup.find(class_ = 'card-icon')
print(soup_span)
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>
print(soup_span.next_sibling)
#<span class="card-title">FreBuff 安全论坛</span>

(2) .next_siblings

Gets all siblings after the element.

soup_span = soup.find(class_ = 'card-icon')
print(soup_span)
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>
for span in soup_span.next_siblings:
    print(span)
'''There is only one following sibling, so only one is printed.
<span class="card-title">FreBuff 安全论坛</span>
'''
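
One caveat with the sibling attributes: when the markup contains line breaks between tags, .next_sibling and .previous_sibling can return a whitespace string rather than a tag. The sketch below, on made-up markup, shows this and uses find_next_sibling() and find_previous_sibling() to skip to the nearest sibling tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with line breaks between the two spans.
html_doc = ('<a>\n'
            '<span class="card-icon"></span>\n'
            '<span class="card-title">安全客</span>\n'
            '</a>')
soup = BeautifulSoup(html_doc, 'html.parser')
icon = soup.find(class_='card-icon')
title = soup.find(class_='card-title')

raw_next = icon.next_sibling                      # the line break between the spans
next_span = icon.find_next_sibling('span')        # skips the whitespace string
prev_span = title.find_previous_sibling('span')

print(repr(raw_next))
print(next_span['class'], prev_span['class'])
```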

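V. Searching the Tree

There are two main methods for searching the tree: find() and find_all(). find() returns the first match in the document, while find_all() searches the whole document and returns a list of all matches.

1. find()

1) Searching with a string filter

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. For example, to find the first <img> tag in the document: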

tag = soup.find('img')
print(tag)
#<img src="./img/frebuff.ico"/>

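2) Searching with a regular expression

If you pass a regular expression object, Beautiful Soup filters tag names with its search() method. The example below finds the first tag whose name starts with t: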

tag = soup.find(re.compile("^t"))
print(tag)
#<title>个性化安全导航</title>
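
Because the pattern is applied to the tag name, anchoring matters. The sketch below, on made-up markup, compares an anchored pattern with an unanchored one and shows that find() returns None when nothing matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html_doc = '<html><head><title>Demo</title></head><body><table></table></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')

starts_with_t = soup.find(re.compile('^t'))   # anchored: the name must start with t
contains_t = soup.find(re.compile('t'))       # unanchored: any name containing t
no_match = soup.find(re.compile('^x'))        # None when nothing matches

print(starts_with_t.name, contains_t.name, no_match)
```

Checking the result against None before using it avoids an AttributeError on pages where the expected tag is absent.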

2. find_all()

1) Filtering by string

Example: get all h2 tags.

tag = soup.find_all('h2')
print(tag)
'''
[<h2 class="title">论坛社区</h2>, <h2 class="title">区 块 链</h2>, <h2 class="title">在线办公</h2>, <h2 class="title">知识学习</h2>, <h2 class="title">编码解码</h2>, <h2 class="title">杂七杂八</h2>]
'''

2) Filtering by regular expression

Example: get all tags whose names start with h.

for tag in soup.find_all(re.compile("^h")):
    print(tag.name, end=",")
#html,head,h2,h2,h2,h2,h2,h2,

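3) Searching with a list

Passing a list matches any tag whose name appears in the list. Results are returned in document order:

for tag in soup.find_all(['body', 'h2']):
    print(tag.name, end=",")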
#body,h2,h2,h2,h2,h2,h2,
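
The three name filters covered so far (string, regular expression, list) can be compared side by side. The markup below is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html_doc = ('<html><head><title>Demo</title></head>'
            '<body><h2>a</h2><p>b</p><h2>c</h2></body></html>')
soup = BeautifulSoup(html_doc, 'html.parser')

by_string = [t.name for t in soup.find_all('h2')]
by_regex = [t.name for t in soup.find_all(re.compile('^h'))]
by_list = [t.name for t in soup.find_all(['p', 'h2'])]

print(by_string)
print(by_regex)
print(by_list)   # document order, not the order of the list passed in
```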

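4) Searching by keyword arguments

Example: search by class. Because class is a reserved word in Python, the keyword argument is spelled class_ with a trailing underscore.

for tag in soup.find_all(class_='card-title'):
    print(tag.text, end=',')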
#FreBuff 安全论坛,先知安全社区,安全客 ,CSDN博客,百度超级链 , 巴 比 特 ,金 色 财 经,火 币 网,欧易交易所, 非 小 号 ,Ethplorer交易浏览器......

A keyword argument can also take a regular expression compiled with the re module. For example:

for tag in soup.find_all(class_=re.compile('^card')):
    print(tag, end=',')
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>,<span class="card-title">FreBuff 安全论坛</span>,<span class="card-icon"><img src="./img/xz.ico"/></span>,<span class="card-title">先知安全社区</span>，.......

The examples above use a single keyword argument. You can also combine several:

for tag in soup.find_all(class_="link-tooltip", title="https://www.okx.com/"):
    print(tag, end=',')
'''
<a class="link-tooltip" href="https://www.okx.com/" target="_blank" title="https://www.okx.com/">
<span class="card-icon"><img src="./img/ouyi.ico"/></span>
<span class="card-title">欧易交易所</span>
</a>,
'''
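
A short sketch of keyword-based filtering on made-up markup. The attribute data-rank and the second link are hypothetical and serve only to show the attrs dictionary, which is needed for attribute names that are not valid Python keyword arguments.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; data-rank and the second link are made up.
html_doc = (
    '<a class="link-tooltip external" href="https://www.okx.com/" data-rank="1">'
    '<span class="card-title">欧易交易所</span></a>'
    '<a class="link-tooltip" href="./local.html">'
    '<span class="card-title">本地</span></a>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# class_ matches if any one of the tag's classes equals the given string.
by_class = soup.find_all(class_='link-tooltip')

# Several filters together: every one of them must match.
by_two = soup.find_all('a', class_='link-tooltip', href=re.compile('^https://'))

# data-rank contains a hyphen, so it goes through the attrs dictionary.
by_attrs = soup.find_all(attrs={'data-rank': '1'})

print(len(by_class), len(by_two), len(by_attrs))
```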

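5) Limiting the number of results with limit

find_all() returns every match. On a large document that search can be slow. If you do not need all the results, pass limit to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of matches reaches limit, the search stops and the results are returned.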

tag =  soup.find_all('span',limit= 4)
print(tag)
#[<span class="card-icon"><img src="./img/frebuff.ico"/></span>, <span class="card-title">FreBuff 安全论坛</span>, <span class="card-icon"><img src="./img/xz.ico"/></span>, <span class="card-title">先知安全社区</span>]
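
A minimal sketch of limit on generated markup (the six numbered spans are made up). It also illustrates that find() gives the same first element that find_all() with limit=1 would put in its list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: six numbered <span> tags inside one <div>.
html_doc = '<div>' + ''.join(f'<span>{i}</span>' for i in range(6)) + '</div>'
soup = BeautifulSoup(html_doc, 'html.parser')

first_four = soup.find_all('span', limit=4)   # stops after four matches
only_one = soup.find_all('span', limit=1)     # a list with one element
same_as_find = soup.find('span')              # the element itself, not a list

print([span.string for span in first_four])
print(only_one[0] is same_as_find)
```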
