Lecture 2: The Beautiful Soup Library

I. Beautiful Soup Basics

1. An Introductory Example

# First, fetch the page
>>> import requests
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> r.status_code
200
>>> demo = r.text
>>> print(demo)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
# Then parse it with BeautifulSoup
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

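Using the BeautifulSoup library mainly comes down to two lines of code:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')  # parse with BeautifulSoup(); it takes two arguments
# The argument '<p>data</p>' is the HTML content to parse
# The argument 'html.parser' names the parser to use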
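
As a minimal sketch of those two lines at work, the snippet below parses an abbreviated, hard-coded copy of the demo page instead of downloading it, so it runs offline. The variable names (demo_html, title_text, first_p_class) are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the demo page, embedded so no network access is needed.
demo_html = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body><p class="title"><b>The demo python introduces several python courses.</b></p>'
    '</body></html>'
)

soup = BeautifulSoup(demo_html, 'html.parser')

title_text = soup.title.string   # the text inside <title>
first_p_class = soup.p['class']  # class is multi-valued, so a list comes back

print(title_text)
print(first_p_class)
```

Indexing a tag with ['class'] is shorthand for looking the key up in its attribute dictionary.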

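2. Basic Elements of BeautifulSoup

(1) HTML and BeautifulSoup

A BeautifulSoup object corresponds to the entire content of an HTML/XML document. There are two ways to create one:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')  # from a string of markup
soup2 = BeautifulSoup(open('D:/demo.html'), 'html.parser')  # from an open file object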
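
The two construction routes can be compared side by side. The sketch below is only illustrative: it writes the markup to a temporary file (standing in for a real path such as D:/demo.html) and opens it in a with-block so the file handle is closed after parsing. The temporary path and the variable names are assumptions.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = '<p>data</p>'

# Way 1: pass the markup as a string.
soup_from_string = BeautifulSoup(markup, 'html.parser')

# Way 2: pass an open file object. A temporary file stands in for a real
# path such as D:/demo.html (the temporary path is an illustrative assumption).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(markup)
    # The with-block closes the file once parsing is done.
    with open(path, encoding='utf-8') as f:
        soup_from_file = BeautifulSoup(f, 'html.parser')

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Both routes should yield equivalent trees; stating the encoding explicitly avoids depending on the platform default.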

Four parsers are available:

Parser	Usage	Requirement
bs4's HTML parser (Python's built-in html.parser)	BeautifulSoup(mk, 'html.parser')	Install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
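
Because lxml and html5lib are optional installs, a script may want to fall back to 'html.parser' when they are missing. The following is a rough sketch of that idea; make_soup is a hypothetical helper written for these notes, not part of bs4. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns a (soup, parser_name) pair.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser's package is not installed; try the next one.
            continue
    raise RuntimeError('no usable parser found')


soup, parser_used = make_soup('<p>data</p>')
print(parser_used)
print(soup.p.string)
```

Which parser gets picked depends on what is installed, so the first printed line varies from machine to machine.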

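HTML tag reference manual

Once a document has been parsed by BeautifulSoup, every kind of HTML tag can be reached through a matching attribute of the form soup.<tag>.

When the document contains several tags of the same kind, soup.<tag> returns only the first one.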
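
To see the "first tag only" behaviour concretely, the sketch below parses the two links from the demo page and compares soup.a with find_all('a'). find_all is a bs4 method these notes have not introduced yet; it is used here only to show that the second link is still in the tree.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.a              # only the first <a> tag
all_links = soup.find_all('a')   # every <a> tag, in document order

print(first_link['id'])
print([a['id'] for a in all_links])
```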

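(2) The Five Basic Elements of the BeautifulSoup Class

Basic element	Description
Tag	A tag, the most basic unit of information; it opens with <> and closes with </>. soup.<tag> extracts the corresponding tag and its content
Name	The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary; every tag has zero or more attributes. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string
Comment	The comment part of a string inside a tag; a special type, the Comment class

You can check an element's type with type(soup.<tag>).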
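
A quick way to get a feel for these elements is to print the type of each one. The sketch below uses a one-line piece of markup invented for illustration; the class names imported from bs4.element are the real ones.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<p class="title"><b>The demo python</b></p>', 'html.parser')

tag = soup.p
print(type(tag))           # the Tag itself
print(type(tag.name))      # the name is a plain str
print(type(tag.attrs))     # attributes are kept in a dict
print(type(tag.string))    # the text is a NavigableString
```

isinstance checks against Tag or NavigableString are usually more convenient than comparing type() results directly.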

[Figure: diagram of the basic elements]

# Tag: returns the whole tag, including its content
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
# Name: returns the tag's name
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
# Attributes: returned as a dict, so it can be indexed further
>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> soup.a.attrs['class']
['py1']
>>> type(soup.a.attrs)
<class 'dict'>
# NavigableString: returns the string inside the tag
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
# Comment: the comment type. <tag>.string does not filter comments out; a comment is returned as well, as an object of type Comment
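
The Comment case has no example above, so here is a small sketch of it. The markup and the variable names (newsoup, b_string, p_string) are made up for illustration. The point is that .string returns a comment's text without the <!-- --> markers, so the value looks like ordinary text unless its type is checked.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    '<b><!--This is a comment--></b><p>This is not a comment</p>',
    'html.parser',
)

b_string = newsoup.b.string   # the content of <b> is a comment
p_string = newsoup.p.string   # the content of <p> is ordinary text

print(repr(b_string), type(b_string))
print(repr(p_string), type(p_string))

# Both values look like plain strings, so check the type before trusting the text.
is_comment = isinstance(b_string, Comment)
print(is_comment)
```

Comment is a subclass of NavigableString, so an isinstance check against Comment is the reliable way to tell the two apart.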

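3. Three Ways to Traverse HTML Content

HTML is fundamentally organized as a tree.

The HTML tree can be traversed in three ways, which differ in the order in which nodes are visited:

Downward traversal
Upward traversal
Sibling (parallel) traversal

(1) Downward Traversal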

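Attribute	Description
.contents	A list of child nodes; all direct children are stored in the list
.children	An iterator over child nodes; like .contents, but intended for looping over the direct children
.descendants	An iterator over all descendant nodes (children, grandchildren, and so on), intended for looping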
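
The three attributes can be tried on a small tree. The sketch below uses a simplified stand-in for the demo page's body, written with no whitespace between tags; on the real page the newlines between tags also show up as text nodes among the children. Variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<body><p class="title"><b>Title</b></p>'
    '<p class="course">Text <a id="link1">Basic</a></p></body>',
    'html.parser',
)

body = soup.body

# .contents: a list of direct children, so len() and indexing work.
child_count = len(body.contents)

# .children: an iterator over the same direct children.
child_names = [child.name for child in body.children]

# .descendants: an iterator over every node below body, at any depth.
# Text nodes are included, so keep only Tag objects here.
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]

print(child_count)
print(child_names)
print(descendant_tags)
```

Since .children and .descendants are iterators, they are meant to be consumed in a for loop or a comprehension; use .contents when an index or a length is needed.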
