Web Crawlers: Organizing and Extracting Information with the Beautiful Soup Library


Copyright: Jingmin Wei, Pattern Recognition and Intelligent System, School of Artificial and Intelligence, Huazhong University of Science and Technology


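Contents

- References
- I. Installing the Beautiful Soup Library
- II. Basic Elements of the BeautifulSoup Library (the HTML Tag Tree)
- III. Traversing HTML Content with bs4
- IV. Formatted HTML Output with bs4
- V. bs4 Summary
- VI. Three Forms of Information Markup
- VII. Comparing the Three Forms of Information Markup
- VIII. General Methods of Information Extraction
- IX. Searching HTML Content with bs4
- X. Summary of Information Markup and Extraction Methods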

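This tutorial mainly follows the China University MOOC course "Python Web Crawlers and Information Extraction" and serves as my personal study notes.

Along the way I ran into some problems; I recorded and corrected each of them by hand so that all the code here works. I also drew on other blogs to summarize solutions to some common problems.

This tutorial is not for commercial use and is intended only as a study reference. Please contact me before reposting.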

References

Web crawler MOOC

Data analysis MOOC

Liao Xuefeng's Python tutorial

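I. Installing the Beautiful Soup Library

Unlike IDLE's actual output, I have manually added a blank line after each output. If any input or output looks doubtful, open a Python shell and verify it yourself.

Windows: open cmd with "Run as administrator"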

pip3 install beautifulsoup4

Quick installation check:

Demo HTML page: http://python123.io/ws/demo.html

File name: demo.html

Page source: HTML5 code

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>

Fetch the page source with the Requests library:

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text

Continuing from the example above, parse the source of http://python123.io/ws/demo.html with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> print(soup.prettify())

Output:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Summary of the BeautifulSoup call format

The first argument is the HTML-formatted content to parse; the second names the parser used to work through "this pot of soup".

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
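The call format above can also be tried without network access. The following is a minimal sketch, assuming an offline copy of the demo page stored in a string; the name `DEMO_HTML` is introduced here only for illustration and is not part of the original notes.

```python
from bs4 import BeautifulSoup

# Offline copy of the demo page (assumption: same markup as demo.html above).
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    'You can learn Python from novice to professional by tracking the following courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)

# First argument: the markup to parse; second argument: the parser name.
soup = BeautifulSoup(DEMO_HTML, "html.parser")

title_text = soup.title.string   # text inside <title>
first_link = soup.a["href"]      # href of the first <a> tag

print(title_text)
print(first_link)
```

If both values print as expected, the installation works and the later examples can be followed offline in the same way.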

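II. Basic Elements of the BeautifulSoup Library (the HTML Tag Tree)

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".

[Figure: how to read the HTML format]

The Beautiful Soup library is also known as beautifulsoup4 or bs4. By convention it is imported as follows; the BeautifulSoup class is what you will mainly use.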

from bs4 import BeautifulSoup
import bs4

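In essence, a BeautifulSoup object corresponds to the entire content of one HTML/XML document.

Parsers available to Beautiful Soup:

Parser	Usage	Requirement
Python's built-in HTML parser (used via bs4)	BeautifulSoup(mk, 'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib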
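Because the lxml and html5lib parsers are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that; the helper name `make_soup` and its `preferred` parameter are hypothetical, while `FeatureNotFound` is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib")):
    """Parse markup with the first available preferred parser.

    Falls back to Python's built-in 'html.parser', which needs no extra install.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Different parsers may build slightly different trees for malformed markup, so pinning one parser explicitly is the safer choice when results must be reproducible.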

Basic elements of the BeautifulSoup class

<p class="title"> … </p>

Basic element	Description
Tag	A tag, the basic unit of information organization; <> and </> mark its start and end
Name	The tag's name; the name of <p>…</p> is 'p'. Form: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Form: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Form: <tag>.string
Comment	The comment part of the string inside a tag; a special Comment type

The examples below all operate on the HTML source of the demo page shown in the installation section.

1. Tag

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

2. Tag name

Continuing from the code above:

>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'

3. Tag attrs (attributes)

Continuing from the code above:

>>> tag.attrs   # the tag's attributes: a mapping from attribute names to attribute values
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']   # value of the class attribute
['py1']
>>> tag.attrs['href']   # the link (href) attribute of the <a> tag
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)  # type of the attribute collection
<class 'dict'>
>>> type(tag)    # type of the tag
<class 'bs4.element.Tag'>
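Since `.attrs` is an ordinary dictionary, indexing a missing attribute raises `KeyError`. The following sketch shows a few safer access patterns on a single `<a>` tag copied from the demo page; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]            # shorthand for tag.attrs["href"]; KeyError if absent
missing = tag.get("title")    # .get() returns None instead of raising
classes = tag["class"]        # class is multi-valued, so bs4 returns a list
has_id = tag.has_attr("id")   # explicit existence check

print(href, missing, classes, has_id)
```

Using `.get()` is usually the more robust choice when crawling real pages, where attributes are often absent.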

4. Tag NavigableString (non-attribute string)

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

5. Tag Comment (comments)

HTML comment syntax: <!--This is a comment-->

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

.string returns both strings without any comment markers, so check the type to tell whether a string is a comment or an ordinary non-attribute string.
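The type check described above can be sketched as a small helper. The function name `split_strings` is hypothetical; `Comment` and `find_all(string=True)` are part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def split_strings(soup):
    """Return (comments, texts): comment strings and ordinary strings, in document order."""
    comments, texts = [], []
    for node in soup.find_all(string=True):
        # Comment is a subclass of NavigableString, so test for it first.
        if isinstance(node, Comment):
            comments.append(str(node))
        else:
            texts.append(str(node))
    return comments, texts


newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)
comments, texts = split_strings(newsoup)
print(comments)
print(texts)
```

This kind of filter helps when extracting visible text, since comments are normally not meant for the reader.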


III. Traversing HTML Content with bs4

[Figure: the HTML tree structure]

[Figure: the three traversal directions: downward, upward, and sideways]

HTML source:

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

1. Downward traversal of the tag tree

Attribute	Description
.contents	List of child nodes: all of <tag>'s children stored in a list
.children	Iterator over child nodes; like .contents, used to loop over the children
.descendants	Iterator over all descendant nodes, used to loop over them

The BeautifulSoup object is the root node of the tag tree.

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents   # list of child nodes
[<title>This is a python demo page</title>]
>>> soup.body
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)   # number of children
5
>>> soup.body.contents[1]   # index into the list
<p class="title"><b>The demo python introduces several python courses.</b></p>

Iterating over child nodes (whitespace-only string nodes such as '\n' print as blank lines and are omitted from the output below):

>>> for child in soup.body.children:
...     print(child)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>> for child in soup.body.descendants:
...     print(child)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
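Downward traversal yields string nodes (including bare newlines) alongside tags. A common follow-up is to keep only the tags; the sketch below does so with an `isinstance` check, using an abridged offline copy of the demo page (the `HTML` constant is an assumption made for illustration).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Abridged offline copy of the demo page (assumption for illustration).
HTML = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# .children: direct children only; keep Tag nodes and skip the '\n' strings.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants: every node below <body>, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```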

2. Upward traversal of the tag tree

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over ancestors

>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent    # the parent of <html> is the BeautifulSoup object (the document itself), so the whole document is printed
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent    # the BeautifulSoup object has no parent: this returns None, so nothing is printed

Continuing from the code above:

>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
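The same upward walk can be turned into a root-to-node path, which is handy for debugging where a tag sits in the tree. This is a sketch only; the helper name `tag_path` and the `HTML` constant (an abridged offline copy of the demo page) are assumptions introduced here.

```python
from bs4 import BeautifulSoup

# Abridged offline copy of the demo page (assumption for illustration).
HTML = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="course">Python courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    '</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")


def tag_path(tag):
    """Return the tag names from the document root down to `tag`."""
    names = [ancestor.name for ancestor in tag.parents]  # nearest ancestor first
    names.reverse()                                      # root first
    names.append(tag.name)
    return names


path = tag_path(soup.a)
print(" > ".join(path))
```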

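3. Sideways (sibling) traversal of the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML text order
.previous_sibling	The previous sibling node in HTML text order
.next_siblings	Iterator over all following sibling nodes in HTML text order
.previous_siblings	Iterator over all preceding sibling nodes in HTML text order

The tag tree is organized by tags, but the non-attribute strings between tags are nodes of the tree as well, so the sibling of a tag is not necessarily a tag.

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling  # no such node: returns None
>>> soup.a.parent  # the parent is a <p> node
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

>>> # Note the plural: looping over .next_sibling (singular) would iterate over the characters of the string ' and '
>>> for sibling in soup.a.next_siblings:   # iterate over the following siblings
...     print(sibling)
...
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
>>> for sibling in soup.a.previous_siblings:   # iterate over the preceding siblings
...     print(sibling)
...
# The output is long; verify it yourself in IDLE, PyCharm, Eclipse, or a similar tool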
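Because siblings mix tags and strings, it helps to inspect node types while iterating. The following sketch lists the node types after the first `<a>` tag and then jumps directly to the next sibling tag with `find_next_sibling`; the `HTML` constant is an abridged offline copy of the demo page assumed for illustration.

```python
from bs4 import BeautifulSoup

# Abridged offline copy of the demo page (assumption for illustration).
HTML = (
    '<p class="course">Python courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(HTML, "html.parser")

# All nodes that follow the first <a> inside the same parent.
siblings = list(soup.a.next_siblings)
kinds = [type(node).__name__ for node in siblings]

# Skip over string nodes and go straight to the next <a> sibling.
next_link = soup.a.find_next_sibling("a")

print(kinds)
print(next_link["id"])
```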


IV. Formatted HTML Output with bs4

Can HTML content be displayed in a more "friendly" way?

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

.prettify() adds a '\n' after each tag and each piece of content in the HTML text, indenting by nesting depth. It can also be called on a single tag: <tag>.prettify()

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

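Encoding in bs4

bs4 converts any HTML input to Unicode and writes output as UTF-8 by default. Python 3.x strings are Unicode and its default encoding is UTF-8, so parsing works without friction.

>>> soup = BeautifulSoup("<p>中文</p>", "html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>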
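The conversion can be observed by feeding bs4 raw bytes in a non-UTF-8 encoding. The sketch below assumes the input is known to be GBK and passes that to the constructor through `from_encoding`; the sample bytes are constructed here purely for illustration.

```python
from bs4 import BeautifulSoup

# Raw bytes as they might arrive from a GBK-encoded page (illustrative input).
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells bs4 how to decode the bytes into Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                 # a Unicode string, independent of the input encoding
utf8_bytes = soup.p.encode("utf-8")  # serialize the tag back to UTF-8 bytes

print(text)
print(utf8_bytes)
```

When the encoding is not known in advance, bs4 tries to detect it on its own, but detection on short inputs can be unreliable, so passing the encoding explicitly is the more predictable option.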

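V. bs4 Summary

VI. Three Forms of Information Markup

Information markup:

Marked-up information forms an organizational structure, which adds a dimension to the information

The structure of the markup is as valuable as the information itself

Marked-up information can be used for communication, storage, or display

Marked-up information is easier for programs to understand and use

HTML information markup:

HTML (HyperText Markup Language) can embed hypertext information into text

The three forms of information markup:

1. XML (eXtensible Markup Language)

A general-purpose way of expressing information that uses tags in the same style as HTML (both descend from SGML).

<img src="china.jpg" size="10"> … </img>
<img src="china.jpg" size="10" />
<!-- This is a comment, very useful -->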

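2. JSON (JavaScript Object Notation)

An expression form built from typed key-value pairs: it gives the information and also defines its type.

"name": "Beijing Institute of Technology"
"name": ["Beijing Institute of Technology", "Yan'an Academy of Natural Sciences"]
"name": {
    "newName": "Beijing Institute of Technology",
    "oldName": "Yan'an Academy of Natural Sciences"
}

3. YAML (YAML Ain't Markup Language)

Key-value pairs with no explicit type markers, written key: value

name: Beijing Institute of Technology
name:
    newname: Beijing Institute of Technology
    oldname: Yan'an Academy of Natural Sciences
name:
- Beijing Institute of Technology
- Yan'an Academy of Natural Sciences
text: |  # introduction to the university
    Beijing Institute of Technology was founded in 1940. Its predecessor was the Yan'an Academy of Natural
    Sciences, the first science and engineering university founded by the Communist Party of China. Comrade
    Mao Zedong personally inscribed the school's name, and veteran proletarian revolutionaries such as Li
    Fuchun, Xu Teli, and Li Qiang successively served as its principal leaders. Since the founding of New
    China the university has been among the key institutions supported by the state in every round, and it
    was among the first to enter the national "211 Project" and "985 Project". In the globally influential
    UK QS "World University Top 500" ranking, it places 15th among the listed mainland Chinese universities.
    The university is currently under the Ministry of Industry and Information Technology.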

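VII. Comparing the Three Forms of Information Markup

XML:

<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <streetAddr>No. 5 Zhongguancun South Street</streetAddr>
        <city>Beijing</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof>
    <prof>Security</prof>
</person>

JSON:

{
    "firstName" : "Tian" ,
    "lastName" : "Song" ,
    "address" : {
        "streetAddr" : "No. 5 Zhongguancun South Street" ,
        "city" : "Beijing" ,
        "zipcode" : "100081"
    },
    "prof" : [ "Computer System" , "Security" ]
}

YAML:

firstName : Tian
lastName : Song
address :
    streetAddr : No. 5 Zhongguancun South Street
    city : Beijing
    zipcode : 100081
prof :
- Computer System
- Security

Characteristics:

XML: the earliest general-purpose markup language; highly extensible but verbose

JSON: typed information, well suited to processing by programs (e.g. JavaScript); more concise than XML

YAML: no explicit type markers, the highest proportion of actual text, very readable

Typical uses:

XML: information exchange and transfer on the Internet (including HTML, a closely related special case)

JSON: communication between mobile apps, the cloud, and nodes; no comments (suited to program-to-interface exchanges)

YAML: configuration files for all kinds of systems; supports comments and is easy to read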
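To make the comparison concrete, the sketch below parses XML and JSON renderings of the same person record using only Python's standard library (`xml.etree.ElementTree` and `json`) and pulls out the same fields from each. YAML is left out because parsing it requires a third-party package. The constant names and the English field values are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

XML_TEXT = """<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <city>Beijing</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof>
    <prof>Security</prof>
</person>"""

JSON_TEXT = """{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {"city": "Beijing", "zipcode": "100081"},
    "prof": ["Computer System", "Security"]
}"""

# XML: navigate by tag path; every value comes back as text.
root = ET.fromstring(XML_TEXT)
from_xml = {
    "city": root.findtext("address/city"),
    "prof": [p.text for p in root.findall("prof")],
}

# JSON: values arrive already typed (dict, list, str).
data = json.loads(JSON_TEXT)
from_json = {"city": data["address"]["city"], "prof": data["prof"]}

print(from_xml)
print(from_json)
```

The JSON version needs less navigation code because its structure maps directly onto Python dictionaries and lists, which matches the "well suited to programs" point above.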

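VIII. General Methods of Information Extraction

Goal: extract the content of interest from marked-up information.

Method 1: fully parse the markup form, then extract the key information

Forms: XML, JSON, YAML

Requires a markup parser, for example tag-tree traversal with bs4

Advantage: parsing is accurate

Disadvantage: extraction is tedious and slow

Method 2: ignore the markup form and search for the key information directly

Approach: search

A text-search function over the information is enough

Advantage: extraction is simple and fairly fast

Disadvantage: the accuracy of the result depends on the content

Combined method: combine form parsing with search to extract the key information

Forms: XML, JSON, YAML, plus search

Requires both a markup parser and a text-search function

Example: extract all URL links from an HTML page

Approach: 1) find all <a> tags

2) parse each <a> tag and extract the link that follows href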

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
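For contrast, Method 2 (search without parsing) can be sketched with a regular expression from the standard library. The pattern and the commented-out link below are assumptions chosen for illustration; `old.example` is a hypothetical address.

```python
import re

# Abridged offline copy of the demo page's links (assumption for illustration).
HTML = (
    '<p class="course">Python courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)

# Search the raw text for href="..." without building a tag tree.
HREF_PATTERN = re.compile(r'href="([^"]+)"')
hrefs = HREF_PATTERN.findall(HTML)
print(hrefs)

# The weakness of pure search: it cannot tell live markup from a comment.
commented = '<!-- <a href="http://old.example/">retired link</a> -->'
false_hits = HREF_PATTERN.findall(commented)
print(false_hits)
```

The second result shows why the accuracy of pure search depends on the content: the link inside the comment is reported even though a parser would not treat it as an `<a>` tag.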

IX. Searching HTML Content with bs4

<>.find_all(name, attrs, recursive, string, **kwargs)

Returns a list that holds the search results.

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

∙ name: search string for tag names

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a', 'b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b

∙ attrs: search string for tag attribute values; the attribute can be named explicitly

>>> soup.find_all('p', 'course')    # <p> tags whose class attribute contains 'course'
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')    # exact match
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')    # exact match: no tag has an id of exactly 'link'
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))    # partial match via a regular expression
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
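Attribute filters can also be written with the `class_` keyword (since `class` is a reserved word in Python) or with an explicit `attrs` dictionary. The sketch below shows both, plus a regular expression applied to `href`; the `HTML` constant is an abridged offline copy of the demo page assumed for illustration.

```python
import re

from bs4 import BeautifulSoup

# Abridged offline copy of the demo page (assumption for illustration).
HTML = (
    '<p class="course">Python courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(HTML, "html.parser")

by_class = soup.find_all("a", class_="py1")                    # keyword form for class
by_attrs = soup.find_all(attrs={"id": "link2"})                # dictionary form
by_regex = soup.find_all("a", href=re.compile(r"BIT-\d+$"))    # regex on an attribute value

print([t["id"] for t in by_class])
print([t.string for t in by_attrs])
print(len(by_regex))
```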

∙ recursive: whether to search all descendants; defaults to True

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)  # empty: the <a> tags are deeper descendants, not direct children of the root
[]
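With `recursive=False` the search is limited to direct children of the node it is called on. The sketch below makes that visible at several levels of the tree; the `HTML` constant is an abridged offline copy of the demo page assumed for illustration.

```python
from bs4 import BeautifulSoup

# Abridged offline copy of the demo page (assumption for illustration).
HTML = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# Direct children of the document root: only <html>.
top_level = [t.name for t in soup.find_all(True, recursive=False)]

# Direct children of <body>: the two <p> tags, but no <a> tags.
direct_paragraphs = soup.body.find_all("p", recursive=False)
direct_links = soup.body.find_all("a", recursive=False)

# Default (recursive=True): the <a> tags nested inside <p> are found.
all_links = soup.body.find_all("a")

print(top_level, len(direct_paragraphs), len(direct_links), len(all_links))
```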

∙ string: search string for the text region between <> and </>

>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>> import re
>>> soup.find_all(string = re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']

<tag>(…) is equivalent to <tag>.find_all(…)

soup(…) is equivalent to soup.find_all(…)

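Related methods

Method	Description
<>.find()	Searches and returns only one result; same parameters as .find_all()
<>.find_parents()	Searches among ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()	Returns one result from among the ancestor nodes; same parameters as .find()
<>.find_next_siblings()	Searches among following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()	Returns one result from among the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()	Searches among preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()	Returns one result from among the preceding sibling nodes; same parameters as .find()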
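A short sketch of how several of these methods behave on the demo markup follows. The `HTML` constant is an abridged offline copy of the demo page assumed for illustration; note that `find()` returns `None` when nothing matches, so results should be checked before use.

```python
from bs4 import BeautifulSoup

# Abridged offline copy of the demo page (assumption for illustration).
HTML = (
    '<html><body>\n'
    '<p class="course">Python courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.find("a")                        # first match only
container = first_link.find_parent("p")            # nearest <p> ancestor
later = first_link.find_next_siblings("a")         # following <a> siblings, as a list

second_link = soup.find(id="link2")
earlier = second_link.find_previous_sibling("a")   # nearest preceding <a> sibling

missing = soup.find("table")                       # no match: None, not an empty list

print(container["class"], [t["id"] for t in later], earlier["id"], missing)
```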

X. Summary of Information Markup and Extraction Methods