[Python Crawler] Learning BeautifulSoup

Beautiful Soup

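Introduction

Beautiful Soup is an HTML/XML parser, used mainly to parse HTML/XML documents and extract data from them.

It is built on the HTML DOM: it loads the whole document and builds the complete DOM tree, so its time and memory costs are noticeably higher and its performance is lower than that of lxml (XPath), which only traverses part of the document.

Parsing HTML with BeautifulSoup is simple and the API is very approachable. It supports CSS selectors and the HTML parser in the Python standard library, and it can also use lxml's XML parser.

Although BeautifulSoup4 is easy to pick up, its matching speed is far below that of regular expressions or XPath, so it is generally not recommended when speed is the priority; regular expressions are the recommended choice in that case.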

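Installation and usage

Install Beautiful Soup: pip install beautifulsoup4

Import it in your code: from bs4 import BeautifulSoup

Create a Beautiful Soup object: soup = BeautifulSoup(html, 'html.parser')

Here html is the document to parse and 'html.parser' selects the parser. The second argument can name either the type of markup to parse (currently "html", "xml", and "html5") or a specific parser library (currently "lxml", "html5lib", and "html.parser").

If you only need to parse an HTML document, you can create the BeautifulSoup object from the document alone and Beautiful Soup picks a parser automatically, in this order of preference: lxml, html5lib, then Python's built-in html.parser. Naming the parser explicitly is still the safer habit, because the result then does not depend on which libraries happen to be installed.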

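The main parsers and their trade-offs are listed below:

Python standard library, BeautifulSoup(markup, "html.parser"): no extra dependency, decent speed; slower than lxml and less lenient than html5lib.
lxml HTML parser, BeautifulSoup(markup, "lxml"): very fast; needs an external C dependency.
lxml XML parser, BeautifulSoup(markup, "xml"): very fast and the only supported XML parser; needs an external C dependency.
html5lib, BeautifulSoup(markup, "html5lib"): extremely lenient, parses pages the way a browser does and produces valid HTML5; very slow and needs an external Python dependency.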

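The older documentation says that Beautiful Soup picks another parser when the one you name is not installed; current releases instead raise bs4.FeatureNotFound, so install the parser or name a different one. At present lxml is the only supported XML parser: without lxml installed, an XML document cannot be parsed whether or not you ask for lxml explicitly.

If an HTML or XML document is malformed, different parsers may return different results. See the section on differences between parsers in the official documentation for details.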
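To make the parser discussion concrete, here is a minimal sketch. It parses a malformed snippet with the built-in html.parser and shows one way to fall back when a preferred parser is missing. The helper name make_soup and its parameters are hypothetical and not part of bs4; the only library names used are BeautifulSoup and FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # a stray closing tag: not well-formed HTML

# html.parser ships with Python, so this call needs no extra install.
repaired = str(BeautifulSoup(broken, "html.parser"))
print(repaired)


def make_soup(markup, preferred="lxml", fallback="html.parser"):
    """Hypothetical helper: try one parser, fall back if it is not installed."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when no installed parser provides the requested feature.
        return BeautifulSoup(markup, fallback)


doc = make_soup("<p>hello</p>")
print(doc.p.string)
```

Other parsers may repair the same broken snippet differently, which is why pinning the parser name keeps results reproducible across machines.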

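Basic elements of the Beautiful Soup library

How to think about the library: Beautiful Soup parses, traverses, and maintains a "tag tree", and one BeautifulSoup object corresponds to the entire content of one HTML/XML document.

Basic elements of the BeautifulSoup class:

Tag: a tag, the most basic unit of information, opened with <tag> and closed with </tag>.

Name: the name of a tag; the name of <p>…</p> is 'p'. Access: <tag>.name

Attributes: the attributes of a tag, organised as a dict. Access: <tag>.attrs

NavigableString: the non-attribute string inside a tag, that is, the text between <tag> and </tag>. Access: <tag>.string

Comment: the comment part of the string inside a tag, a special type of NavigableString.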
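The five element types can be seen side by side on a small inline snippet, without any network access. This is a minimal sketch; the markup below is made up for illustration and is not the demo page used in the rest of the article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag with attributes, a nested tag, and a comment.
html = '<p class="course" id="intro">Learn <b>Python</b><!-- draft --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag
name = tag.name              # Name: 'p'
attrs = tag.attrs            # Attributes: dict, class is a list of values
text = soup.b.string         # NavigableString inside <b>
comment = tag.contents[-1]   # Comment is the last child of <p>

print(type(tag), name, attrs)
print(type(text), text)
print(type(comment), comment)
```

Because Comment is a subclass of NavigableString, code that must ignore comments needs an explicit isinstance(node, Comment) check.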

Now let's try this out in code.

# import the bs4 library
from bs4 import BeautifulSoup
import requests  # fetches the page

r = requests.get('https://python123.io/ws/demo.html')  # demo URL
demo = r.text  # the fetched page source
demo

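'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'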

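# parse the HTML page
soup = BeautifulSoup(demo, 'html.parser')  # arguments: the fetched page source, the parser to use

# print the parsed HTML page with indentation that shows its structure
print(soup.prettify())

Output: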

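<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
    Basic Python
   </a>
   and
   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>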

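1) Tag: any tag can be reached as soup.<tag>.

When the document contains several tags with the same name, soup.<tag> returns the first one.

soup.a  # access the <a> tag
# <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>

soup.title
# <title>This is a python demo page</title>

2) Name: every tag has a name, available as soup.<tag>.name, of type str.

soup.a.name, soup.a.parent.name, soup.p.parent.name
# ('a', 'p', 'body')

3) Attributes: a tag can have zero or more attributes, held in a dict at soup.<tag>.attrs.

tag = soup.a
print(tag.attrs)
# two ways to read an attribute value
print(tag.attrs['class'], tag['class'])
print(type(tag.attrs))

# Output
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1'] ['py1']
<class 'dict'>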

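4) NavigableString: the non-attribute string inside a tag, available as soup.<tag>.string. .string can reach through several levels of nesting, as long as each level has a single child.

print(soup.a.string)
print(type(soup.a.string))

# Output
Basic Python
<class 'bs4.element.NavigableString'>

5) Comment: the comment part of the string inside a tag. Comment is a special type of NavigableString (written as <!-- ... --> in the markup).

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup2 = BeautifulSoup(markup, 'html.parser')
comment = soup2.b.string
print(type(comment))
print(comment)

# Output
<class 'bs4.element.Comment'>
Hey, buddy. Want to buy a used parser?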

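6) .prettify() adds a '\n' after each tag and its content, so the HTML prints with visible structure.

.prettify() also works on a single tag: <tag>.prettify()

print(soup.a.prettify())

# Output
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
 Basic Python
</a>

7) bs4 converts every HTML input to Unicode and writes it out as UTF-8 by default.

Python 3 strings are Unicode and its default source encoding is UTF-8, so parsing Chinese text needs no extra work.

newsoup = BeautifulSoup('中文', 'html.parser')
print(newsoup.prettify())

# Output
中文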
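The encoding behaviour is easier to see with bytes that are not UTF-8 to begin with. The sketch below assumes a page that arrives GBK-encoded and passes from_encoding so that bs4 does not have to guess; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Bytes as a GBK-encoded page might arrive from the network (hypothetical sample).
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells bs4 which encoding to decode the bytes with.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # already a Unicode str
utf8_bytes = soup.encode("utf-8")  # serialise the document as UTF-8 bytes

print(text)
print(utf8_bytes)
```

When from_encoding is omitted, bs4 tries to detect the encoding itself, which can be unreliable on very short inputs like this one.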

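Traversing HTML content with bs4

Basic HTML structure: <tag>…</tag> pairs nest inside one another, which gives the tags a tree structure.

Downward traversal of the tag tree
.contents: list of child nodes; all children of the tag are stored in a list
.children: iterator over child nodes, similar to .contents, used to loop over the children
.descendants: iterator over all descendant nodes, used to loop over them

Upward traversal of the tag tree
.parent: the node's parent tag
.parents: iterator over the node's ancestor tags, used to loop over them

Sibling traversal of the tag tree
.next_sibling: the next sibling node in HTML text order
.previous_sibling: the previous sibling node in HTML text order
.next_siblings: iterator over all following sibling nodes in HTML text order
.previous_siblings: iterator over all preceding sibling nodes in HTML text order

Moving backward and forward through the tag tree
.next_element: the object (string or tag) parsed immediately after this one
.previous_element: the object (string or tag) parsed immediately before this one
.next_elements: iterator over all objects (strings or tags) parsed after this one
.previous_elements: iterator over all objects (strings or tags) parsed before this one

See the "Navigating the tree" section of the official documentation for details.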
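The attributes listed above can be tried offline on a tiny document. This is a minimal sketch with made-up markup; it keeps only Tag nodes when walking downward, since children and descendants also yield text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, small enough to follow by eye.
html = ("<html><head><title>t</title></head>"
        "<body><p>one <a>link</a></p><p>two</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Downward: direct children vs. all descendants (tags only).
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

# Upward: every ancestor of <a>, nearest first.
ancestor_names = [p.name for p in soup.a.parents]

print(child_names)
print(descendant_names)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.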

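Downward traversal of the tag tree

import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.contents)  # children of the whole tag tree

# Output
[<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>]

print(soup.body.contents)  # children of the <body> tag

# Output
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>, '\n']

print(soup.head)  # the <head> tag

# Output
<head><title>This is a python demo page</title></head>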

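for child in soup.body.children:  # loop over the children
    print(child)

# Output (blank lines printed for whitespace-only strings are omitted)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>

for child in soup.body.descendants:  # loop over all descendants
    print(child)

# Output (blank lines printed for whitespace-only strings are omitted)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
Basic Python
 and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>
Advanced Python
.

As the output shows, .descendants walks the tree depth-first: each tag is followed by everything inside it before the next sibling is visited.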

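Upward traversal of the tag tree

print((soup.title.parent, type(soup.html.parent), soup.parent))

# (<head><title>This is a python demo page</title></head>, <class 'bs4.BeautifulSoup'>, None)

The parent of <title> is the <head> tag, the parent of the document's top-level node is the BeautifulSoup object, and the .parent of the BeautifulSoup object is None.

for parent in soup.a.parents:  # loop over the ancestors
    if parent is None:
        print('parent:', parent)
    else:
        print('parent name:', parent.name)

# Output
parent name: p
parent name: body
parent name: html
parent name: [document]

The ancestor chain of the <a> tag is a -> p -> body -> html -> [document]; the last entry appears because soup.name is '[document]'.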

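Sibling traversal of the tag tree

Note: sibling traversal comes with two conditions.

Sibling traversal only happens among nodes that share the same parent.

The text inside a tag also counts as a node.

Here is the document structure again as a reminder:

print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
    Basic Python
   </a>
   and
   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

The string "Python is a wonderful...", the first <a> tag, the string " and ", the second <a> tag, and the string "." are siblings, and their parent is the <p> tag. Therefore: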

print(soup.a.next_sibling)  # the node after the <a> tag
#  and

print(soup.a.next_sibling.next_sibling)  # the node after that one
# <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>

print(soup.a.previous_sibling)  # the node before the <a> tag
# Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

print(soup.a.previous_sibling.previous_sibling)  # the node before that one
# None

for sibling in soup.a.next_siblings:  # loop over the following siblings
    print(sibling)

# Output
 and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>
.

for sibling in soup.a.previous_siblings:  # loop over the preceding siblings
    print(sibling)

# Output
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
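Because text nodes count as siblings, code that wants "the next tag" usually has to filter them out. The sketch below shows two ways to do that on a made-up paragraph: an isinstance check and find_next_sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with the same shape as the demo paragraph.
html = '<p>Courses: <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

all_after = list(first.next_siblings)  # strings and tags, in document order
tags_after = [s for s in first.next_siblings if isinstance(s, Tag)]
next_link = first.find_next_sibling("a")  # skips the " and " string for us

print(all_after)
print(tags_after)
print(next_link)
```

find_next_sibling is the shorter option when only tags with a known name matter.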

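Moving backward and forward

Consider this document:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

The HTML parser turns this string into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for replaying the order in which the document was parsed.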

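soup.a.next_element, soup.a.previous_element

# Output
('Basic Python',
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n')

for element in soup.a.next_elements:
    print(element)

# Output (blank lines printed for whitespace-only strings are omitted)
Basic Python
 and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>
Advanced Python
.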
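The difference between next_element and next_sibling is easy to miss, so here is a minimal sketch on a two-node snippet (made up for illustration): next_element steps into the tag, while next_sibling steps over it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <b> has one string inside it and one string after it.
soup = BeautifulSoup("<p><b>bold</b>tail</p>", "html.parser")

inside = soup.b.next_element         # first thing parsed after the <b> start tag
after = soup.b.next_sibling          # next node at the same level as <b>
replay = list(soup.b.next_elements)  # everything parsed after <b>, in order

print(repr(inside), repr(after))
print(replay)
```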

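Searching HTML content with bs4

<tag>.find_all(name, attrs, recursive, string, **kwargs)

name: filter on the tag name. It can be any kind of filter: a string, a regular expression, a list, a function, or True. True matches every tag.

attrs: filter on attribute values; a specific attribute can be named.

recursive: whether to search all descendants; defaults to True.

string: filter on the text between <tag> and </tag>.

Shorthand: <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..).

Related methods:

<tag>.find(): searches and returns only the first result; same parameters as .find_all()

<tag>.find_parents(): searches the ancestors and returns a list; same parameters as .find_all()

<tag>.find_parent(): returns one result from the ancestors; same parameters as .find()

<tag>.find_next_siblings(): searches the following siblings and returns a list; same parameters as .find_all()

<tag>.find_next_sibling(): returns one result from the following siblings; same parameters as .find()

<tag>.find_previous_siblings(): searches the preceding siblings and returns a list; same parameters as .find_all()

<tag>.find_previous_sibling(): returns one result from the preceding siblings; same parameters as .find()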
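The related methods listed above follow one pattern: the singular form returns one element or None, the plural form returns a list. A minimal sketch on made-up markup (the tag names and ids are only for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the demo page body.
html = ('<div id="box"><p class="title"><b>Intro</b></p>'
        '<p class="course"><a id="link1">Basic</a> and '
        '<a id="link2">Advanced</a></p></div>')
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")            # first match only
limited = soup.find_all("a", limit=1)  # a list, capped at one result
shorthand = soup("a")                  # same as soup.find_all("a")
missing = soup.find("table")           # None when nothing matches

parent_p = first_link.find_parent("p")
ancestors = [t.name for t in first_link.find_parents(["p", "div"])]
prev_p = parent_p.find_previous_sibling("p")

print(first_link, limited, len(shorthand), missing)
print(parent_p["class"], ancestors, prev_p["class"])
```

Since find() can return None, checking the result before chaining another call avoids an AttributeError on pages that lack the expected tag.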

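import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')

First, the kinds of filters.

String

# name: filter on the tag name
soup.find_all('a')

# Output
[<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>,
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]

# attrs: filter on attribute values; a plain string in this position is matched against the CSS class
soup.find_all("p", "course")

# Output
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>]

# recursive: whether to search all descendants; defaults to True
# with recursive=False only the direct children of soup (just <html>) are checked
soup.find_all('p', recursive=False)
# []

# string: filter on the text between <tag> and </tag>
soup.find_all(string="Basic Python")  # only an exact match is found
# ['Basic Python']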

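Regular expression

import re

# find every tag whose name starts with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# find every tag whose name contains "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

List

# find every <a> tag and every <p> tag in the document
soup.find_all(['a', 'p'])

# Output
[<p class="title"><b>The demo python introduces several python courses.</b></p>,
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>,
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>,
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]

True

True matches anything. The code below finds every tag but returns no string nodes.

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a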

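Function

A filter function takes a single element as its argument and returns True if the element matches, and False otherwise.

The function below returns True when the element has a class attribute but no id attribute. Passing it to find_all() returns all the <p> tags:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

# Output
[<p class="title"><b>The demo python introduces several python courses.</b></p>,
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>]

Note that the <a> tags appear here only as part of the <p> tag that contains them. They are not returned on their own, because each of them has an id attribute.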

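When a function is used to filter on a specific attribute, its argument is the value of that attribute, not the tag. The example below finds the <a> tags whose href does not match the given regular expression.

def not_lacie(href):
    return href and not re.compile("BIT-268001").search(href)

soup.find_all(href=not_lacie)

# Output: only the second link, whose href does not contain "BIT-268001"
[<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]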

Filter out the tags that have text both immediately before and immediately after them:

from bs4 import NavigableString

def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# body
# p
# a
# a

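Keyword arguments

If a named argument is not one of the built-in parameter names, the search treats it as the name of a tag attribute, for example the "id" or "href" attribute of a tag. The value can be a string, a regular expression, a list, or True.

You can also search by CSS class, for example for tags whose class contains "tit". Because class is a reserved word in Python, the keyword is class_. It accepts the same kinds of filters: a string, a regular expression, a function, or True.

import re

soup.find_all(id="link")  # only an exact match is found
# []

soup.find_all(class_=re.compile("tit"))
# [<p class="title"><b>The demo python introduces several python courses.</b></p>]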
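The introduction mentions CSS selector support, and the same lookups can be written that way with select() and select_one(). This is a sketch on a local copy of the demo paragraph; treat the attribute values in the markup as assumptions. In current bs4 releases select() relies on the soupsieve package, which pip normally installs alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Local copy of part of the demo page; attribute values are assumptions.
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

course_links = soup.select("p.course > a")        # <a> directly inside p.course
second = soup.select_one("#link2")                # by id, first match or None
by_href = soup.select('a[href$="BIT-268001"]')    # href ends with the given text

print([a["id"] for a in course_links])
print(second.string)
print(by_href)
```

The find_all() keyword form and the selector form return the same tags here; which one reads better is largely a matter of taste.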

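Hands-on: a targeted crawl of the Chinese university ranking

Plan:

Fetch the university ranking page from the web.

Extract the information from the page into a suitable data structure (a two-dimensional list): rank, university name, total score.

Use that data structure to display and print the result.

# import the libraries
import requests
from bs4 import BeautifulSoup
import bs4

1. Fetch the university ranking page from the web

def gethtml(url):
    try:
        res = requests.get(url, timeout=10)
        # raise_for_status() raises an exception for 4xx/5xx responses, so a failed
        # request jumps to the except branch instead of breaking the rest of the run.
        res.raise_for_status()
        # res.encoding: the response encoding guessed from the HTTP headers.
        # res.apparent_encoding: the encoding inferred from the page content.
        # Whatever the headers say, use the encoding detected from the content
        # so that the text decodes cleanly.
        res.encoding = res.apparent_encoding
        content = res.text
        return content
    except requests.RequestException:
        return ""

content = gethtml('http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html')
soup = BeautifulSoup(content, 'html.parser')
soup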

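2. Extract the information from the page into a suitable data structure (a two-dimensional list)

Look at the page source and locate the tags that hold the content to crawl.

Use the bs4 search methods to extract the required information: rank, university name, total score.

The page source shows that the rank, university name, total score, and other data sit in a table. tbody is the body of the table, and each tr tag is one table row and also a child of tbody. To get the data we therefore need to parse every tr tag.

Using what we have covered so far, we can first find the table body tbody with find (the page has only one table), then find all the tr children under tbody, and finally read the rank, university name, and total score from the children of each tr.

Method 1: use find and find_all:

need_list = []
for tr in soup.find('tbody').find_all('tr'):
    tds = tr('td')  # same as tr.find_all('td')
    need_list.append([tds[0].string, tds[1].string, tds[3].string])
need_list

Method 2: use children, with a type check, because the children also include text nodes of type bs4.element.NavigableString:

need_list = []
for tr in soup.find('tbody').children:
    if isinstance(tr, bs4.element.Tag):
        # alternatively: tds = list(tr.children)
        tds = tr('td')  # same as tr.find_all('td')
        need_list.append([tds[0].string, tds[1].string, tds[3].string])
need_list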
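The extraction step can be rehearsed offline before pointing it at the live page. The sketch below uses a made-up two-row table with the same column order (rank, name, region, total score); the helper name extract_rows and the row data are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical rows; the real page has more columns and many more rows.
html = """
<table><tbody>
<tr><td>1</td><td>Univ A</td><td>Region X</td><td>94.6</td></tr>
<tr><td>2</td><td>Univ B</td><td>Region Y</td><td>76.5</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_rows(tbody):
    """Return [rank, name, score] for each <tr> that has enough cells."""
    rows = []
    for tr in tbody.children:
        if not isinstance(tr, Tag):  # skip whitespace strings between rows
            continue
        tds = tr.find_all("td")
        if len(tds) < 4:             # guard against short or header rows
            continue
        rows.append([tds[0].get_text(strip=True),
                     tds[1].get_text(strip=True),
                     tds[3].get_text(strip=True)])
    return rows


rows = extract_rows(soup.find("tbody"))
print(rows)
```

get_text(strip=True) is used here instead of .string because .string is None whenever a cell contains more than one child node.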

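3. Use the data structure to display and print the result

# References: https://www.cnblogs.com/zhz-8919/p/9767357.html
# https://python3-cookbook.readthedocs.io/zh_CN/latest/c02/p13_aligning_text_strings.html

def printUnivList(ulist, num):
    tplt = "{0:{3}^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

printUnivList(need_list, 30)

When printing with .format, you can set the width of each output field: an integer after the ':' guarantees that the field is at least that many characters wide. This is handy for lining up a table.

With several groups of Chinese text, however, the strings are not all the same width. When a Chinese string is too short, the program pads it with Western (half-width) spaces, and because Chinese and Western spaces differ in width, the output columns do not line up.

Fix: pad with the Chinese full-width space when the string is too short. Its character is chr(12288).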
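The padding trick can be checked in isolation. This is a minimal sketch; format_row is a hypothetical helper and the sample row is made up. Only the name column uses the full-width fill character, and the rank and score columns keep the default space.

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, the Chinese full-width space


def format_row(rank, name, score, fill=FULL_WIDTH_SPACE):
    """Centre the name in a 10-character field padded with the fill character."""
    # {3} inside the format spec is replaced by the fourth argument (the fill).
    return "{0:^6}\t{1:{3}^10}\t{2:^8}".format(rank, name, score, fill)


row = format_row("1", "清华大学", "94.6")
name_field = row.split("\t")[1]

print(row)
print(len(name_field))
```

The field width counts characters, not display columns, so padding with a character as wide as a Chinese glyph keeps every name column the same visual width.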

References

Beautiful Soup documentation (Chinese translation)
