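The BeautifulSoup library supports several document parsers and lets you navigate a document, search for elements, and modify its content.

Compared with lxml, BeautifulSoup is generally easier to use.

BeautifulSoup has to be installed separately. The examples below use the lxml parser, which is also a separate package:

pip install beautifulsoup4
pip install lxml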

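I. Creating a BeautifulSoup object

To parse a document with BeautifulSoup, first create a BeautifulSoup object and specify a parser.

# Imports required at the top of the file
import re  # used by the regular-expression searches later on
import requests
from bs4 import BeautifulSoup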

(1) Open an HTML file directly

with open("***.html", encoding="UTF-8") as f:  # the file is closed once parsing is done
    soup1 = BeautifulSoup(f, "lxml")
print(soup1.contents)

(2) Parse an HTML fragment directly

soup2 = BeautifulSoup("<html>***</html>","lxml")
print(soup2.contents)
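
The parser named in the second argument decides how a fragment is completed. The following is a minimal sketch, assuming only Python's built-in "html.parser" is available (it needs no extra install). That parser keeps the fragment as written, whereas lxml wraps it in html and body tags. The sample markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

fragment = "<p>hello</p>"

# "html.parser" ships with Python; it does not add <html> or <body> around a fragment.
soup_builtin = BeautifulSoup(fragment, "html.parser")
rendered = str(soup_builtin)
first_child_name = soup_builtin.contents[0].name
print(rendered)
```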

(3) Parse a page fetched by a crawler

url = "https://www.baidu.com/"
r = requests.get(url)
soup3 = BeautifulSoup(r.text, 'lxml')
print(soup3.contents)

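II. Objects

In BeautifulSoup, the node at every level of the document is treated as an object. There are four kinds:

(1) Tag: corresponds to a tag in the document; accessing an element by name returns the matching content directly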

soup = BeautifulSoup('<div class="test1">****</div>', "lxml")
tag = soup.div
print(tag)

1. Every Tag object has a name attribute, and it can be modified.

print(tag.name) #div
tag.name='p'
print(tag) #<p class="test1">****</p>

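2. When a tag has one or more attributes, they are stored in a dictionary. Use tag[attribute_name] to get the value of an attribute directly.

print(tag['class']) # ['test1']  a single attribute; class comes back as a list
print(tag.attrs) # {'class': ['test1']}  all attributes, as a dictionary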

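3. When the attribute can hold several values (class is the usual case; rel, rev, headers, accesskey and accept-charset are others), a list of values is returned. For attributes that hold a single value, such as id, style, or a custom attribute, the value is returned as one string, even if it contains spaces.

soup4 = BeautifulSoup('<div class="test1 test2">****</div>', "lxml")
tag = soup4.div
print(tag['class']) # ['test1', 'test2']

id_soup = BeautifulSoup('<div id="my id">***</div>', "lxml")
print(id_soup.div['id']) # my id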
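
Because the attributes behave like a dictionary, they can also be read safely, changed, and deleted. The following is a small sketch on made-up markup, using the built-in "html.parser" so it runs without lxml; the tag and attribute values are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="my id" class="btn primary" href="/home">Home</a>', "html.parser")
link = soup.a

classes = link["class"]        # multi-valued attribute: a list
link_id = link["id"]           # single-valued attribute: one string
missing = link.get("title")    # .get() returns None instead of raising KeyError

link["href"] = "/index"        # modify an attribute in place
del link["id"]                 # remove an attribute
print(link.attrs)
```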

(2) NavigableString: a navigable string. Strings usually sit inside a tag, and NavigableString is the class that wraps the text within a tag.

html = '<div class="test1 test2">****</div>'
soup = BeautifulSoup(html, "lxml")
tag = soup.div
print(tag.string) #   ****
print(type(tag.string)) #   <class 'bs4.element.NavigableString'>

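tag.string is only useful when the element has exactly one child: either a single string, or a single tag that itself has a .string. If the element contains several children, it returns None.

html = '<div class="test1 test2">****<p rel="test">**inner**</p></div>'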
soup = BeautifulSoup(html, "lxml")
tag = soup.div
print(tag.string) # None
print(type(tag.string)) # <class 'NoneType'>
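
When .string returns None because the element has mixed content, get_text() is one way to collect the text anyway. The following is a minimal sketch on similar markup, parsed with the built-in "html.parser" so it runs without lxml; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<div class="test1 test2">****<p rel="test">inner</p></div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.div

only_string = tag.string                      # None: the div has two children
all_text = tag.get_text()                     # text of every descendant, concatenated
spaced_text = tag.get_text(" ", strip=True)   # pieces stripped, then joined with a space
print(all_text, "|", spaced_text)
```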

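(3) BeautifulSoup: calling the BeautifulSoup constructor returns a soup object that represents the whole document; with the lxml parser, a fragment is completed into a full document (wrapped in html and body). In most cases the soup object can be used like a Tag object, but because it stands for the entire document, its parent and its previous element are None.

html = '<div class="test1 test2">****<p rel="test">**inner**</p></div>'
soup = BeautifulSoup(html, "lxml")
tag = soup.div
print("Contents:", soup.contents)  # Contents: [<html><body><div class="test1 test2">****<p rel="test">**inner**</p></div></body></html>]
print("Parent:", soup.parent)  # Parent: None
print("Previous element:", soup.previous_element)  # Previous element: None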

(4) Comment: a comment, which is a special type of NavigableString

html = '<div class="test1 test2"><!--hello comment--></div>'
soup = BeautifulSoup(html, "lxml")
tag = soup.div
print(tag.string) # hello comment
print(type(tag.string)) # <class 'bs4.element.Comment'>
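
Since a Comment is also a NavigableString, code that collects text may want to tell the two apart. One possible way, sketched here on made-up markup with the built-in "html.parser", is an isinstance check:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = "<div><!--hidden note--><p>visible text</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Collect only the comments in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Collect the ordinary strings, skipping comments.
visible = [
    node for node in soup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(comments, visible)
```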

III. Traversing the document tree

(1) Traversing child nodes

html_doc='''
<html>
<head><title>Title</title>
</head>
<body>
<div class="box"><span>这是第一个div内的span</span></div>
<div  class="detail"><span class="story">这是第二个div内的第一个span</span><span class="item">这是第二个div内的第二个span<a id="item1" class="item1" href="item1">item1</a><a id="item2" class="item2" href="item2">item2</a><a id="item3" class="item3" href="item3">item3</a><a id="item4" class="item4" href="item4">item4</a></span>
</div>
</body>
</html>
'''

1. Getting a child node: after creating the soup object, use soup.element_name to get the corresponding node.

soup = BeautifulSoup(html_doc, "lxml")
tag = soup.head
print('head', tag.prettify())  # prettify() works on the BeautifulSoup object and on any Tag object.
tag = soup.body
print('body', tag.prettify())
tag = soup.html
print('html', tag.prettify())

2. Step-by-step access: an element name returns only the first matching element at that level; the find_all method returns all matching elements.

soup = BeautifulSoup(html_doc, "lxml")
tag = soup.body.div.span
print(tag.prettify())
tag = soup.body.div.find_all('span')
print(tag)

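Tag attributes for traversing child nodes

.contents	Returns a list of the direct children (newline strings "\n" also count as child nodes)
.children	Returns an iterator over the direct children
.descendants	Returns a generator over all descendants, at every depth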
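
The difference between the three attributes is easiest to see on a small tree. The following is a sketch on made-up markup without whitespace between tags, parsed with the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p>one<b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = div.contents                                   # list of direct children
child_names = [child.name for child in div.children]    # same nodes, via an iterator
# .descendants walks every level and yields both tags and strings.
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
descendant_count = len(list(div.descendants))
print(child_names, descendant_tags, descendant_count)
```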

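Attributes for getting text content

.string	Returns the text of the node (only works when it contains a single tag or a single string)
.strings	Returns a generator of all strings, including whitespace-only ones
.stripped_strings	Returns a generator of strings with surrounding whitespace removed and whitespace-only strings skipped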
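
As an illustration of how .strings and .stripped_strings differ, here is a sketch on made-up, indented markup, parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup

html = "<div>\n  <p> a </p>\n  <p>b</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

raw = list(div.strings)             # keeps newlines and indentation as separate strings
clean = list(div.stripped_strings)  # strips each string and drops the empty ones
print(raw)
print(clean)
```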

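Other attributes

.parent	The parent node
.next_sibling	The node immediately after the current element at the same level
.next_siblings	All nodes after the current element at the same level
.next_element	The node parsed immediately after the current element: its string content or child element (forward)
.next_elements	All elements and child elements that follow (forward)
.previous_sibling	The node immediately before the current element at the same level
.previous_siblings	All nodes before the current element at the same level
.previous_element	The node parsed immediately before the current element (backward)
.previous_elements	All elements and child elements that precede (backward)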
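
The sibling and element attributes are easy to mix up, so here is a short sketch on a made-up list, written without whitespace between tags and parsed with the built-in "html.parser". It shows that siblings stay on the same level while next_element and previous_element follow parse order.

```python
from bs4 import BeautifulSoup

html = '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

parent_name = b.parent.name                       # the enclosing <ul>
next_id = b.next_sibling["id"]                    # same level, one step forward
prev_id = b.previous_sibling["id"]                # same level, one step back
later_ids = [li["id"] for li in b.next_siblings]  # every later sibling
inner = b.next_element                            # first thing parsed after <li id="b">: its text
before = b.previous_element                       # last thing parsed before it: the text of <li id="a">
last_next = soup.find(id="c").next_sibling        # no later sibling
print(parent_name, next_id, prev_id, later_ids, inner, before, last_next)
```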

IV. Searching for elements

(1) The find_all function: returns a list of all elements that match the conditions

1. Search by element name

tags=soup.find_all('div')
print(tags)

2. Passing True: returns every tag in the document

for tag in soup.find_all(True):
    print(tag.name)

3. Searching for several elements: returns every element whose name appears in the list

tags=soup.find_all(['div','span'])
for i in tags:print(i.name)

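4. Search by regular expression

import re

for tag in soup.find_all(re.compile('^d')):  # tag names that start with "d"
    print(tag.name)
for tag in soup.find_all(re.compile('sp')):  # tag names that contain "sp"
    print(tag.name)
contents = soup.find_all(text=re.compile('第一'))  # strings that contain '第一' ("first")
for content in contents:
    print(content)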

5. Passing a function as the filter

def has_class_and_id(tag):
    # has_attr() checks whether the tag has the given attribute
    return tag.has_attr('class') and tag.has_attr('id')

tags = soup.find_all(has_class_and_id)
for i in tags:print(i)

(2) Search by keyword argument: attribute_name=attribute_value

tags = soup.find_all(id=re.compile('item'))
for i in tags:print(i)
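
Keyword arguments do not work for every attribute: a name such as data-role is not a valid Python identifier, and name clashes with the first parameter of find_all. In those cases the attrs dictionary can be used instead. The following is a sketch on made-up markup, parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

html = '<div data-role="card" name="first">A</div><div data-role="menu">B</div>'
soup = BeautifulSoup(html, "html.parser")

# "data-role" cannot be written as a keyword argument, so pass it through attrs.
cards = soup.find_all(attrs={"data-role": "card"})
# The HTML attribute "name" would collide with find_all's own name parameter.
named = soup.find_all("div", attrs={"name": "first"})
print(cards, named)
```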

(3) Search by CSS class

tags = soup.find_all('div',class_='box')
for i in tags:print(i)

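(4) The text, recursive, and limit parameters

text	Searches the string content of the document; accepts a string, a regular expression, a list, or True (used the same way as the name argument). Newer versions of BeautifulSoup call this argument string and keep text as the older name.
recursive	By default find_all examines every descendant of the current tag; set recursive=False to examine only its direct children
limit	Limits the number of elements returned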

1.text

tags = soup.find_all(text=['item1','item2'])
for i in tags:print(i)

2.recursive

tags = soup.body.find_all('div',recursive=False)
for i in tags:print(i)

3.limit

tags = soup.find_all('div',limit=1)
for i in tags:print(i)
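
When only the first match is needed, find() is a common alternative to find_all(..., limit=1): it returns the element itself, or None when nothing matches. The following is a sketch on made-up nested markup, parsed with the built-in "html.parser", that also contrasts recursive=False with the default.

```python
from bs4 import BeautifulSoup

html = '<body><div id="outer"><div id="inner"></div></div><div id="second"></div></body>'
soup = BeautifulSoup(html, "html.parser")

all_ids = [d["id"] for d in soup.find_all("div")]                        # every level
top_ids = [d["id"] for d in soup.body.find_all("div", recursive=False)]  # direct children only
limited_ids = [d["id"] for d in soup.find_all("div", limit=2)]           # stop after two matches
first = soup.find("div")     # a single Tag, not a list
missing = soup.find("table")  # None when there is no match
print(all_ids, top_ids, limited_ids, first["id"], missing)
```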

(5) Selectors: call the select method and pass a CSS selector

tags = soup.select('head >title') # the title that is a direct child of head
for i in tags:print(i)
print('---------------')
tags = soup.select('.box') # elements whose class is box
for i in tags:print(i)
print('----------------')
tags = soup.select('a[href]') # a elements that have an href attribute
for i in tags:print(i)
print('------------------')
tags = soup.select('#item3') # the element whose id is item3
for i in tags:print(i)
print('--------------------')
tags = soup.select(".detail .story")  # descendant selector: levels are separated by a space
for i in tags:print(i)
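
select always returns a list. For a single element, select_one returns the first match or None. The following sketch uses a shortened, made-up version of the sample markup, parsed with the built-in "html.parser", and shows how attribute values might be pulled out of the selected tags.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="detail"><span class="item">'
    '<a id="item1" href="item1">item1</a>'
    '<a id="item3" href="item3">item3</a>'
    '</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Collect the href of every <a> that is a direct child of span.item.
hrefs = [a["href"] for a in soup.select("span.item > a[href]")]
third = soup.select_one("#item3")   # first match as a Tag
absent = soup.select_one("#nope")   # None when nothing matches
print(hrefs, third.get_text(), absent)
```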

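Tag object methods

find_parent	Returns the nearest ancestor of the current object that matches the arguments; with no arguments, returns its direct parent
find_parents	Returns all ancestors of the current object that match the arguments; with no arguments, returns all of its ancestors
find_next_sibling	Returns the first following sibling that matches the arguments; with no arguments, returns the sibling element immediately after the current object (forward)
find_previous_sibling	Same function and arguments as find_next_sibling (backward)
find_next_siblings	Returns all following siblings that match the arguments; with no arguments, returns all following siblings (forward)
find_previous_siblings	Same function and arguments as find_next_siblings (backward)
find_next	Finds the first matching element after the current object; with no arguments, returns the nearest following element (forward)
find_previous	Same as find_next (backward)
find_all_next	Finds all following elements, including nested child elements (forward)
find_all_previous	Same as find_all_next (backward)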
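
A few of these methods in use, as a sketch on a shortened, made-up version of the sample markup, parsed with the built-in "html.parser". Each method accepts the same filters as find_all.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="detail"><span class="item">'
    '<a id="item1">1</a><a id="item2">2</a>'
    '</span></div>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="item1")
second = soup.find(id="item2")

enclosing_div = first.find_parent("div")                               # nearest matching ancestor
ancestor_names = [p.name for p in first.find_parents(["span", "div"])]  # nearest first
after_id = first.find_next_sibling("a")["id"]                          # forward, same level
before_id = second.find_previous_sibling("a")["id"]                    # backward, same level
next_link_id = first.find_next("a")["id"]                              # forward, in parse order
print(enclosing_div["class"], ancestor_names, after_id, before_id, next_link_id)
```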
