I. Introduction to the Beautiful Soup module

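Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves hours or even days of work. If you are looking for the Beautiful Soup 3 documentation: Beautiful Soup 3 is no longer being developed, and the official site recommends using Beautiful Soup 4 (BS4) in current projects and porting existing code to it.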

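# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in the Python standard library, and it also supports several third-party parsers. One of them is lxml. Depending on your operating system, you can install lxml with one of the following commands:

$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. You can install html5lib with one of the following commands:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib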

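The table below lists the main parsers with their advantages and disadvantages. The official documentation recommends lxml because it is faster. On Python 2 releases before 2.7.3 and Python 3 releases before 3.2.2 you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.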

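Parser	Typical usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed markup	Less lenient in Python versions before 2.7.3 (Python 2) or 3.2.2 (Python 3)
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed markup	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package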
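To see how the parsers in the table differ in practice, the sketch below feeds the same invalid fragment to each of them and skips any parser that is not installed. The fragment `<a></p>` and the helper name `parse_with` are chosen here for illustration and are not part of the original text; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # illustrative fragment: an unclosed <a> and a stray </p>


def parse_with(parser, markup=BROKEN):
    """Return the repaired markup, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# Try the three parser names from the table, one after another.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, repaired in results.items():
    print(name, "->", repaired if repaired is not None else "not installed")
```

Each parser repairs the fragment in its own way, so code that depends on the exact tree shape should name its parser explicitly rather than rely on whichever one happens to be installed.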

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

II. Basic usage

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
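and they lived at the bottom of a well.</p><p class="story">...</p>
"""

# Basic usage: fault tolerance. A parser's fault tolerance is its ability to recognize and cope with incomplete HTML.
# The markup above never closes <body> or <html>, yet it still parses.
# Parsing it yields a BeautifulSoup object, which can print the document with standard indentation.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # tolerates the incomplete markup
res = soup.prettify()  # fix up the indentation for a structured display
print(res)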
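The fault tolerance described above can also be checked on a much smaller input. This sketch uses the built-in html.parser so that it runs without lxml; the truncated fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: neither <p> nor <b> is ever closed.
truncated = '<p class="title"><b>The Dormouse\'s story'

soup = BeautifulSoup(truncated, "html.parser")
repaired = str(soup)      # the tree is serialized with the missing closing tags
print(repaired)
print(soup.prettify())    # same tree, indented one node per line
```

The repaired string ends with `</b></p>`, which shows that the tags were closed when the tree was built, not when it was printed.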

III. Traversing the document tree

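Traversing the document tree means selecting elements directly by tag name. It is fast, but when several tags share the same name only the first one is returned.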

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story <span>lqz</span></b><span>egon</span></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# res = soup.prettify()  # pretty-print
# print(res)

1. Usage
# html = soup.html
# title = soup.html.head.title
# title = soup.title
# print(title)

2. Get a tag's name ---> tag_object.name
# a = soup.body.a
# a = soup.a.name
# print(a)
# print(soup.body.name)

3. Get a tag's attributes ---> tag_object['attribute name']
# href = soup.body.a['href']
# attrs = soup.body.a.attrs  # all attributes, as a dict
# href = soup.body.a.attrs['href']
# print(attrs['class'])
# c = soup.p.attrs['class']
# print(c)

4. Get a tag's text content
# res = soup.b.text  # the text of every descendant of the tag, joined together
# res = soup.p.text
# res = soup.p.string  # only works when the tag holds exactly one piece of text; otherwise None
# res = soup.b.string  # only works when the tag holds exactly one piece of text; otherwise None
# res = soup.p.strings  # a generator over the text of every descendant
#
# print(list(res))

5. Nested selection
# res = soup.html.body.p
# print(type(res))  # bs4.element.Tag
from bs4.element import Tag  #### for reference
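The commented lines in items 1 to 5 can be tried out with a runnable sketch such as the one below. It uses a shortened copy of the sample document, and the variable name `first_a` is introduced here for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story <span>lqz</span></b><span>egon</span></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_a = soup.a                 # two <a> tags exist, only the first is returned
print(first_a.name)              # a
print(first_a["href"])           # http://example.com/elsie
print(first_a.attrs["class"])    # ['sister'] (class can hold several values, so it is a list)
print(soup.b.text)               # The Dormouse's story lqz
print(soup.b.string)             # None, because <b> has two children
print(first_a.string)            # Elsie
print(list(soup.b.strings))      # ["The Dormouse's story ", 'lqz']
```

The difference between `.text` and `.string` is the part that most often surprises: `.string` gives up and returns None as soon as a tag has more than one child.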
6. Child nodes and descendant nodes
# print(soup.p.contents)  # all direct children of p, as a list
# print(soup.p.children)  # an iterator over the direct children of p
# for i, child in enumerate(soup.p.children):
#     print(i, child)
# print(soup.p.descendants)  # a generator over every descendant of p, nested tags and text alike
# for i, child in enumerate(soup.p.descendants):
#     print(i, child)

7. Parent node and ancestor nodes
# print(soup.a.parent)  # the parent of the a tag
# print(soup.body.parent)
# print(soup.a.parents)  # all ancestors of the a tag: its parent, its parent's parent, and so on
# print(list(soup.a.parents))
# print(len(list(soup.a.parents)))

8. Sibling nodes
# print(soup.a.next_sibling)  # the next sibling
# print(soup.a.previous_sibling)  # the previous sibling
#
# print(list(soup.a.next_siblings))  # all following siblings (a generator)
# print(list(soup.a.previous_siblings))  # all preceding siblings (a generator)
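Items 6 to 8 are easier to follow on a small input where every node can be counted by hand. The one-line snippet below is made up for illustration; `find_next_sibling` is a Beautiful Soup method that the original does not mention, shown here because `next_sibling` often returns a text node and not the next tag.

```python
from bs4 import BeautifulSoup

snippet = ('<p id="story"><a id="link1">Elsie</a>, '
           '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

children = list(p.children)          # direct children: 3 tags and 3 text nodes
descendants = list(p.descendants)    # the children plus the text inside each <a>
print(len(children), len(descendants))        # 6 9

first = soup.a
print(first.parent.name)                      # p
print([t.name for t in first.parents])        # ['p', '[document]']
print(repr(first.next_sibling))               # ', ' (a text node, not the next <a>)
print(first.find_next_sibling("a")["id"])     # link2
print(first.previous_sibling)                 # None, <a id="link1"> is the first child
```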

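IV. Searching the document tree

1. Five kinds of filters

The five kinds of filters are: a string, a regular expression, a list, True, and a function.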

String
# res = soup.find(name='body')
# res = soup.find(name='p', class_='story')

# Find the tag whose id is link2
# res = soup.find(id='link2', name='a', class_='sister', href='http://example.com/lacie')
# res = soup.find(href='http://example.com/lacie')
# print(res)
# res = soup.find(attrs={'class': ['sister']})
# print(res)

Regular expression
import re
# res = soup.find_all(name=re.compile('^b'))  # tags whose name starts with b: body and b
# res = soup.find(name=re.compile('^b'))
# res = soup.find_all(class_=re.compile('^s'))
# res = soup.find_all(href=re.compile('^http'))
# res = soup.find_all(id=re.compile('^l'))
# print(res)

List
# res = soup.find_all(name=['body', 'b'])
# res = soup.find_all(id=['link1', 'link2'])
# res = soup.find_all(attrs={'id': ['link1', 'link2']})
#
# print(res)

True
# links = soup.find_all(href=True)
# print(links)
# res = soup.find_all(name=True)
# res = soup.find_all(id=True)
# print(res)

Function
# def has_class_but_no_id(tag):
#     return tag.has_attr('class') and not tag.has_attr('id')
#
# print(len(soup.find_all(name=has_class_but_no_id)))

# Get every image on the current page (an img tag carries its URL in src, not href)
soup.find_all(name='img', src=True)

## Recommendation: mix tree traversal with tree searching
# soup.body.div.find

Other parameters of find and find_all
# limit
# soup.find()
# res = soup.find_all(name='a', href=True, limit=2)  # cap the number of results
# print(res)

# recursive: whether to search all descendants (True, the default) or direct children only (False)
# res = soup.find_all(name='a', recursive=False)
# res = soup.find_all(name='html', recursive=False)
# print(res)
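The five filters and the two extra parameters can be exercised together on the sample document. The sketch below uses a slightly shortened copy of it; the result variable names (`by_string`, `by_regex`, and so on) are introduced here for illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = soup.find_all("p", class_="story")                      # string
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]        # regular expression
by_list = [t["id"] for t in soup.find_all(id=["link1", "link2"])]   # list
by_true = soup.find_all(href=True)                                  # True
by_func = soup.find_all(has_class_but_no_id)                        # function

limited = soup.find_all("a", limit=2)             # stop after two matches
top_level = soup.find_all("a", recursive=False)   # direct children of the document only

print(len(by_string), by_regex, by_list, len(by_true), len(by_func))
print(len(limited), top_level)
```

With `recursive=False` the search starts from the document itself, whose only direct tag child is `<html>`, so the last result is an empty list.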

2. CSS selectors

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story  <p>asdfasdf</p></b>Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;<div class='panel-1'><ul class='list' id='list-1'><li class='element'>Foo</li><li class='element'>Bar</li><li class='element'>Jay</li></ul><ul class='list list-small' id='list-2'><li class='element'><h1 class='yyyy'>Foo</h1></li><li class='element xxx'>Bar</li><li class='element'>Jay</li></ul></div>and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
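soup = BeautifulSoup(html_doc, 'html.parser')
'''
#id          select by id
.class-name  select by class
tag          select by tag name
tag>tag      direct child
tag tag      descendant at any depth
'''
# res = soup.p.select('.sister')  # use a CSS selector
# res = soup.p.select('#link1')  # use a CSS selector
# res = soup.select('body>p')  # CSS selector: p tags that are direct children of body
res = soup.select('body p')  # CSS selector: p tags that are descendants of body, at any depth
print(len(res))

### CSS selectors are universal: they work in bs4, and lxml parsing can use CSS selectors too

## What if you cannot write the CSS selector yourself? One can be copied from the browser's developer tools, for example:
'#maincontent > div:nth-child(3) > table > tbody > tr:nth-child(13) > td:nth-child(3)'

## An XPath expression copied the same way:
'//*[@id="maincontent"]/div[2]/table/tbody/tr[18]/td[2]'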
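A short runnable sketch of the selector forms listed above. It uses a smaller, well-formed document than the one in this section so that the counts are easy to verify by hand; `select_one`, which returns only the first match, is a Beautiful Soup method the original does not mention.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
<p class="story"><a class="sister" id="link1"><span>Elsie</span></a></p>
<div class="panel-1">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element xxx">Jay</li></ul>
  <p>inside the div</p>
</div>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.select_one("#link1").get_text())            # Elsie
print(len(soup.select(".element")))                    # 3
print(len(soup.select("body > p")))                    # 1, direct children of body only
print(len(soup.select("body p")))                      # 2, descendants at any depth
print(soup.select("ul.list-small > li")[0]["class"])   # ['element', 'xxx']
```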

V. Modifying the document tree

Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id40

VI. Summary

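# Summary:
# 1. lxml is the recommended parser.
# 2. Three kinds of selectors were covered: tag selectors, find and find_all, and CSS selectors.
#    (1) Tag selectors offer weak filtering but are fast.
#    (2) Use find and find_all to match a single result or multiple results.
#    (3) If you are comfortable with CSS selectors, use select.
# 3. Remember the usual ways to read attributes (attrs) and text (get_text()).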
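As a recap, the three selector styles can be placed side by side, together with `attrs` and `get_text()`. The one-line document and the variable names `via_tag`, `via_find`, and `via_css` are made up for this illustration.

```python
from bs4 import BeautifulSoup

html_doc = ('<p class="story">Sisters: '
            '<a href="http://example.com/elsie" id="link1">Elsie</a> and '
            '<a href="http://example.com/lacie" id="link2">Lacie</a></p>')
soup = BeautifulSoup(html_doc, "html.parser")

via_tag = soup.p.a                       # tag selector: first match only
via_find = soup.find("a", id="link2")    # find: first tag matching the filters
via_css = soup.select("p.story > a")     # CSS selector: list of all matches

print(via_tag.attrs)                     # every attribute of the tag, as a dict
print(via_find["href"])                  # a single attribute
print([a.get_text() for a in via_css])   # text of each match
print(soup.p.get_text())                 # text of the whole paragraph
```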

VII. Using selenium

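# With the requests module the data in the response can be incomplete, because requests cannot execute JavaScript.
# selenium: lets code drive a browser the way a person would.
#
# Driving a given browser requires that browser's driver.
# http://npm.taobao.org/mirrors/chromedriver/  Taobao mirror of the Chrome driver
# The driver version must match the Chrome version.
#
# For example, for Chrome 92.0.4515.131, download the matching driver and put it in the project directory.
# pip3 install selenium

# from selenium import webdriver
# import time
# # Open a Chrome browser
# # (Selenium 4 deprecates executable_path in favor of passing a Service object)
# bro = webdriver.Chrome(executable_path='chromedriver.exe')
#
# # Load the cnblogs home page, as if typing the URL into the address bar
# bro.get('https://www.cnblogs.com/')
#
# time.sleep(2)
#
# print(bro.page_source)  # HTML of the current page
#
# bro.close()  # close the browser

# import requests
#
# res = requests.get('https://dig.chouti.com/', headers={
#     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
# })
# print(res.text)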
