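The Beautiful Soup Library

Run pip install beautifulsoup4 to install the Beautiful Soup library.

Introduction to Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files.

It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document.

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.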
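
The encoding behavior described above can be illustrated with a minimal sketch. The sample markup, the Latin-1 input encoding, and the variable names are assumptions chosen for illustration; from_encoding tells Beautiful Soup how the input bytes are encoded.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a byte string encoded as Latin-1 rather than UTF-8
raw = "<p>caf\u00e9</p>".encode("latin-1")

soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

# Inside the tree, text is ordinary Unicode
text = soup.p.string
print(text)

# Serializing to bytes uses UTF-8 unless another encoding is requested
utf8_bytes = soup.p.encode()
print(utf8_bytes)
```

When the input encoding is not known in advance, from_encoding can be left out and Beautiful Soup will try to detect it.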

Importing Beautiful Soup

from bs4 import BeautifulSoup
import bs4

The BeautifulSoup class is the one used most often.

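Parsers supported by Beautiful Soup

Beautiful Soup supports the HTML parser in Python's standard library, as well as several third-party parsers.

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk, 'html.parser')	Install bs4
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib

If no parser is specified, Beautiful Soup picks the best parser among those installed; if a parser is specified explicitly, Beautiful Soup uses that one.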

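from bs4 import BeautifulSoup
# Parse a local file with the built-in parser
soup1 = BeautifulSoup(open("E://index.html"), "html.parser")
# Parse a string with lxml (requires pip install lxml)
soup2 = BeautifulSoup("<html>data</html>", "lxml")
# No parser specified: Beautiful Soup chooses one itself
soup3 = BeautifulSoup("<html>data</html>")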
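
When parsing from a file, it may be cleaner to open the file in a with block so it is closed once parsing finishes. The sketch below is one way to do that; the temporary file, its contents, and the variable names are assumptions made so the example is self-contained.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a throwaway HTML file so the example does not depend on a local path
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>demo</title></head><body><p>data</p></body></html>")

    # The with block closes the file as soon as the document has been parsed
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

# The parsed tree stays usable after the file is closed
title = soup.title.string
print(title)
```

Naming the parser explicitly, as done here with "html.parser", keeps the result the same regardless of which third-party parsers happen to be installed.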

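Beautiful Soup objects

Tag, NavigableString, BeautifulSoup, Comment

Object	Description
Tag	A tag, the most basic unit of information; it opens with <> and closes with </>
NavigableString	A subclass of Python's str; in most uses it behaves like an ordinary string
BeautifulSoup	Represents the entire content of a document; most of the time it can be treated as a Tag object
Comment	The comment portion of the document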
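
The four object types in the table can be checked directly with type() and isinstance(). This is a small sketch; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p><b>bold text</b><!--a note--></p>", "html.parser")

bold = soup.b              # a Tag
text = soup.b.string       # a NavigableString
note = soup.p.contents[1]  # a Comment (second child of <p>)

print(type(soup).__name__, type(bold).__name__, type(text).__name__, type(note).__name__)

# The relationships described in the table
is_soup_a_tag = isinstance(soup, Tag)                           # BeautifulSoup behaves like a Tag
is_text_a_str = isinstance(text, str)                           # NavigableString subclasses str
is_note_a_navigable_string = isinstance(note, NavigableString)  # Comment subclasses NavigableString
```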

Tag

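Tag attributes:

Name:

The name of the tag; for example, the name of <p>…</p> is 'p'. Format: <tag>.name.

If you change a tag's name, the change shows up in all HTML markup generated from the current Beautiful Soup object.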

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.a.name
'a'
>>> soup.a.name = 'aaa'
>>> soup.aaa.name
'aaa'
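
The effect of renaming on generated markup can be seen without a network request. The following is a minimal sketch that assumes a small inline document; the URL and the new tag name are placeholders.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="http://example.com/">link</a></p>', "html.parser")

link = soup.a
link.name = "span"  # rename the tag in place

# The serialized document now uses the new name for both the opening and closing tag
renamed = str(soup)
print(renamed)
```

After the rename, soup.a no longer finds the element, while soup.span returns the same object that link refers to.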

Attributes:

The tag's attributes, organized as a dictionary. Format: <tag>.attrs.

>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>

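Tag operations:

Tag attributes are manipulated the same way as dictionary entries.

tag = soup.a
# print(tag)
print(tag['class'])
print(tag['id'])
print(tag['href'])
# Tag attributes can be added, deleted, and modified, just like a dict
tag['class'] = 'xiaodeng'
tag['id'] = 123
# Delete an attribute
del tag['class']
print(tag.get('class'))  # None, because the attribute was deleted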
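
The dictionary-style operations above can be tried on an inline document. This sketch uses made-up markup and variable names; it also shows that class is returned as a list and that get() avoids a KeyError for a missing attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="py1 external" href="http://example.com/" id="link1">x</a>',
    "html.parser",
)
tag = soup.a

classes = tag["class"]              # class is multi-valued, so it comes back as a list
missing = tag.get("title")          # None instead of raising KeyError
fallback = tag.get("title", "n/a")  # dict-style default value
has_id = tag.has_attr("id")

tag["class"] = ["py2"]  # modify an attribute
del tag["id"]           # delete an attribute

result = str(tag)
print(result)
```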

NavigableString

Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap the non-attribute string inside a tag, that is, the text between <> and </>.

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
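
One detail worth knowing is that .string only returns a value when a tag has a single string to offer. A minimal sketch, with markup and names invented for illustration, compares it with get_text():

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>Only child</b></p>'
    '<p class="course">Learn <a>Basic</a> and <a>Advanced</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title, course = soup.find_all("p")

# <p> has exactly one child (<b>), so .string reaches through to its text
single = title.string

# <p> has several children (strings and tags), so .string is None
multiple = course.string

# get_text() concatenates every string below the tag
full_text = course.get_text()
print(single, multiple, full_text)
```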

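BeautifulSoup

A BeautifulSoup object is not a real HTML or XML tag, so it has no name or attributes of its own. It is sometimes convenient to look at its .name, however, so the BeautifulSoup object carries a special .name attribute whose value is "[document]".

>>> soup.name
'[document]'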

Comment

The comment portion of the document. A Comment object is a special type of NavigableString. Because .string returns a comment's text without the <!-- --> markers, check the type to tell a comment from ordinary text.

>>> nsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> nsoup.b.string
'This is a comment'
>>> type(nsoup.b.string)
<class 'bs4.element.Comment'>
>>> nsoup.p.string
'This is not a comment'
>>> type(nsoup.p.string)
<class 'bs4.element.NavigableString'>
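
Since a Comment is also a NavigableString, code that collects text usually needs to test for Comment first. The sketch below separates the two; the list names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

comments = []
visible_text = []
for node in soup.descendants:
    # Check Comment first: every Comment is also a NavigableString
    if isinstance(node, Comment):
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        visible_text.append(str(node))

print(comments)
print(visible_text)
```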

Pretty-printing with Beautiful Soup

prettify()

prettify() returns the markup as a string with each tag and each string on its own line, indented by nesting depth.

>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.p.next_sibling.next_sibling
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>> print(soup.p.next_sibling.next_sibling.prettify())
<p class="course">
 Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
  Basic Python
 </a>
 and
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
  Advanced Python
 </a>
 .
</p>
>>>
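
A smaller document makes the difference between the compact and the prettified form easier to see. This is a sketch with made-up markup; the exact indentation width is left to Beautiful Soup's defaults.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hi</b></p>", "html.parser")

# str() gives the compact, single-line markup
compact = str(soup)

# prettify() puts each tag and string on its own line
pretty = soup.prettify()

print(compact)
print(pretty)
```

Note that prettify() adds whitespace around strings, so it is meant for reading the structure rather than for producing markup that renders identically.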

Traversing HTML content with Beautiful Soup

Basic HTML structure

graph TD
html-->head
html-->body
head-->title
body-->p1
body-->p2
p1-->b
p2-->a1
p2-->a2


Downward traversal of the tag tree

graph LR
html-->head
head-->title

Attribute	Description
.contents	List of child nodes; all direct children are stored in a list
.children	Iterator over child nodes; like .contents, used to loop over direct children
.descendants	Iterator over all descendant nodes, used to loop over the whole subtree

>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.head.contents)
1
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

Iterating over children

>>> for child in soup.body.children:
...     print(child)

Iterating over descendants

>>> for child in soup.body.descendants:
...     print(child)
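
The difference between children and descendants, and the whitespace strings that appear among them, can be sketched on a small inline document. The markup and variable names below are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<p><b>bold</b></p>\n<p>text <a>link</a></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children include the newline strings between the <p> tags
child_count = len(body.contents)

# Keep only Tag objects to skip the whitespace strings
child_tags = [child.name for child in body.children if isinstance(child, Tag)]

# Descendants walk the whole subtree in document order
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]

print(child_count, child_tags, descendant_tags)
```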

Upward traversal of the tag tree

graph LR
b-->p1
p1-->body
body-->html

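Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over ancestors

Iterating over all ancestors, including soup itself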

>>> soup = BeautifulSoup(demo,"html.parser")
>>> for parent in soup.a.parents:
...     if parent is None:
...             print(parent)
...     else:
...             print(parent.name)
...
p
body
html
[document]
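
The same upward walk can be written as a list comprehension. This is a minimal sketch on an inline document with invented markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")

# Names of every ancestor of <a>, from nearest to farthest
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)

# The BeautifulSoup object is the root of the tree, so it has no parent
top_parent = soup.parent
html_parent_is_soup = soup.html.parent is soup
```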

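Sibling traversal of the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order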

graph LR
p1-->p2
a1-->a2

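Iterating over following siblings

>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.

Iterating over preceding siblings

>>> for sibling in soup.a.previous_siblings:
...     print(sibling)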
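
Siblings are not always tags: the text between two tags is a sibling too. The sketch below, using made-up markup, compares the single-node and iterator attributes and uses find_next_sibling() to jump straight to the next tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>Intro <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>',
    "html.parser",
)
first = soup.a

# .next_sibling is a single node, here the string between the two links
next_node = first.next_sibling

# .next_siblings and .previous_siblings iterate over every sibling on that side
following = [str(node) for node in first.next_siblings]
preceding = [str(node) for node in first.previous_siblings]

# find_next_sibling("a") skips the string and returns the next <a> tag
next_link = first.find_next_sibling("a")

print(repr(next_node), following, preceding, next_link["id"])
```

Iterating over .next_sibling (singular) when it is a string would loop over its characters, which is why the plural attribute is the one to use in a for loop.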

