Advanced BeautifulSoup: find and findAll

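BeautifulSoup is an important part of learning Python. It helps parse HTML/XML content and is especially useful when scraping specific information from web pages, where it parses and inspects the messy, non-standard HTML found online. For installing the BeautifulSoup module, see the related blog post.

For how to fetch web page content, see the blog's summary post.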

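The search methods come in singular/plural pairs: the singular form returns the first match, while its plural counterpart finds every tag that meets the criteria and returns them as a list. The pairs are: find -> findAll, findParent -> findParents, findNextSibling -> findNextSiblings…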
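A minimal sketch of these pairs, assuming bs4 is installed. The markup is a made-up sample, and the snake_case names (find_all, find_parents, find_next_siblings) are the bs4 spellings of the camelCase methods listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = '<div id="box"><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')        # singular: one Tag (or None if nothing matches)
all_p = soup.find_all('p')      # plural: a list of every match

next_one = first_p.find_next_sibling('p')    # nearest following sibling <p>
next_all = first_p.find_next_siblings('p')   # every following sibling <p>

parent = first_p.find_parent('div')          # nearest enclosing <div>
parents = first_p.find_parents()             # every ancestor, innermost first

print(first_p['id'], [p['id'] for p in all_p])
print(next_one['id'], [p['id'] for p in next_all])
print(parent['id'], parents[0].name)
```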
Here is a simple example that scrapes the Baidu home page:

>from urllib.request import urlopen
>from bs4 import BeautifulSoup
>url = "http://www.baidu.com"  # the page to scrape
>content = urlopen(url).read()  # open the url with urlopen, read the page content, and assign it to content
>soup = BeautifulSoup(content, 'html.parser')  # parse content into a BeautifulSoup object
>print(content)  # print the raw HTML of the page (as bytes)

The official site describes Beautiful Soup as follows: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
BeautifulSoup(content) turns a complex HTML document into a complex tree structure in which every node is a Python object.
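As a rough illustration of "every node is a Python object", the sketch below parses a one-line fragment (chosen for this example) and inspects the node types. Tag and NavigableString are the bs4 classes for elements and text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p              # the <p> element is a Tag object
b = p.contents[0]       # its only child, the <b> element, is also a Tag
text = b.contents[0]    # the text inside <b> is a NavigableString

print(type(p).__name__, p.name, p.attrs)
print(isinstance(b, Tag), isinstance(text, NavigableString))
```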

Using BeautifulSoup's main functions

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="hehe"><b>The Dormouse's story</b></p>
<p class="story" id="firstpara">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup=BeautifulSoup(html,'html.parser')
print(soup.prettify())

This example parses the HTML and prints it with prettify(). There are several ways to output the HTML held by the soup object:
1 soup.prettify()
2 soup.html
3 soup.contents
4 soup
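In addition, soup.<tag name> returns the first matching tag in the document. For example:
print(soup.p) outputs: <p class="title" id="hehe"><b>The Dormouse's story</b></p>
print(soup.p.string) outputs the tag's text: The Dormouse's story
The tag's text can also be obtained with get_text():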

import re
pid = soup.find(href=re.compile("^http:"))  # regex matching with re, covered later
p1 = soup.p.get_text()
print(p1)
# The Dormouse's story
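The two accessors differ once a tag has more than one child. The sketch below, on a shortened fragment picked for this illustration, shows .string returning None for a mixed-content tag while get_text() joins all the text inside it.

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time: <a href="http://example.com/elsie">Elsie</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
mixed = p.string         # None: <p> contains text plus an <a> tag
joined = p.get_text()    # all text inside <p>, concatenated
single = soup.a.string   # <a> has exactly one text child, so .string works

print(mixed)
print(joined)
print(single)
```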

Use get() to read a tag's attributes:

soup = BeautifulSoup(html, 'html.parser')
pid = soup.findAll('a', {'class': 'sister'})
for i in pid:
    print(i.get('href'))  # get() returns the value of the tag's attribute
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

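The same works for other tags, and the result is always the first matching object in the document. To find other tags, use find and findAll.
BeautifulSoup provides two powerful search functions, find and findAll. These two methods are available only on Tag objects and on the top-level parser object.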

findAll(name, attrs, recursive, text, limit, **kwargs)

for link in soup.find_all('a'):  # soup.find_all returns a list
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
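A small sketch of two further arguments from the signature above, run on a trimmed-down document chosen for this example. It assumes a bs4 release where the text argument is spelled string (4.4.0 and later), and it also shows find_all called on a Tag rather than on the whole soup.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story"><a id="link1">Elsie</a> <a id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html, 'html.parser')

# Search inside one Tag instead of the whole document.
story = soup.find('p')
links_in_story = story.find_all('a')

# recursive=False inspects direct children only; <title> sits under <head>,
# so it is not a direct child of <html>.
deep = soup.html.find_all('title')
shallow = soup.html.find_all('title', recursive=False)

# string= matches text nodes rather than tags.
texts = soup.find_all(string='Elsie')

print(len(links_in_story), len(deep), len(shallow), texts)
```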

findAll can also search by a tag's attributes. The following finds the p tag with id="hehe" and returns a result set:

> pid = soup.findAll('p', id='hehe')  # search by the tag's id attribute
> print(pid)
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]
>pid = soup.findAll('p', {'id': 'hehe'})  # search with a dict of attributes; the result is a list []
>print(pid)
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]

Searching for tags with a regular expression:

>pid = soup.findAll(id=re.compile("he$"))  # match ids ending in "he" with a regular expression
>print(pid)
[<p class="title" id="hehe"><b>The Dormouse's story</b></p>]
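Regular expressions can be used for the tag name as well as for attribute values. The sketch below assumes a short fragment made up for this example; each compiled pattern is applied as a search, so anchors such as ^ and $ matter.

```python
import re
from bs4 import BeautifulSoup

html = ('<p id="hehe"><b>bold</b></p><p id="other">x</p>'
        '<a id="link1" href="http://example.com/elsie">Elsie</a>')
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id=re.compile('he$'))         # ids ending in "he"
by_name = soup.find_all(re.compile('^b'))           # tag names starting with "b"
by_href = soup.find_all(href=re.compile('^http:'))  # hrefs starting with "http:"

print([t['id'] for t in by_id])
print([t.name for t in by_name])
print([t['href'] for t in by_href])
```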

Searching by several attribute values at once:

pp = soup.findAll('a', attrs={'href': re.compile('^http'), 'id': 'link1'})  # search on several attributes, passed as the attrs dict; the tag name 'a' is optional and only narrows the search to that tag
print(pp)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]  # the result is a list
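As a sketch of equivalent spellings, the same multi-attribute filter can be written with keyword arguments instead of an attrs dict. Because class is a Python keyword, bs4 accepts class_ for it. The markup is the three-link sample, shortened for this illustration.

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>')
soup = BeautifulSoup(html, 'html.parser')

# Every condition must hold for a tag to match.
with_dict = soup.find_all('a', attrs={'href': re.compile('^http'), 'id': 'link1'})
with_kwargs = soup.find_all('a', href=re.compile('^http'), id='link1')

# class_ (with a trailing underscore) filters on the class attribute.
by_class = soup.find_all('a', class_='sister')

print([a['id'] for a in with_dict], [a['id'] for a in with_kwargs], len(by_class))
```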

Limiting the number of results: limit=n

pid = soup.findAll('a', limit=2)  # return only the first two matches
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
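A brief sketch of how limit behaves at the edges, on a three-link fragment made up for this example: the search stops after n matches, and a limit larger than the number of matches simply returns them all.

```python
from bs4 import BeautifulSoup

html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, 'html.parser')

first_two = soup.find_all('a', limit=2)     # stops after two matches
up_to_ten = soup.find_all('a', limit=10)    # only three exist, so three come back

print([a['id'] for a in first_two])
print([a['id'] for a in up_to_ten])
```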

Searching for several tag names at once with find_all, which returns a list:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Here find_all is given a list containing the two tag names a and b, so it returns every tag that matches either name, as a list.
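One detail worth checking by hand is ordering: results follow document order, not the order of names in the list. A minimal sketch on a shortened fragment chosen for this example:

```python
from bs4 import BeautifulSoup

html = "<p><b>The Dormouse's story</b></p><a id=\"link1\">Elsie</a><a id=\"link2\">Lacie</a>"
soup = BeautifulSoup(html, 'html.parser')

found = soup.find_all(['a', 'b'])      # match <a> or <b>
swapped = soup.find_all(['b', 'a'])    # same names, listed the other way round

# Both calls list <b> first, because it appears first in the document.
print([t.name for t in found])
print([t.name for t in swapped])
```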

Reading and modifying attributes:

> p1 = soup.p
> p1  # show p1
<p class="title" id="hehe"><b>The Dormouse's story</b></p>
> p1['id']  # read p1's id attribute
hehe
>p1['id'] = 'haha'  # change the value of p1's id attribute
>print(p1['id'])
haha
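A Tag's attributes behave much like a dictionary, so they can also be added and deleted. The sketch below uses a one-line fragment and an align attribute picked purely for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" id="hehe"><b>x</b></p>', 'html.parser')
p1 = soup.p

p1['id'] = 'haha'           # overwrite an existing attribute
p1['align'] = 'center'      # add a new attribute (illustrative name)
del p1['class']             # remove an attribute

missing = p1.get('class')   # get() returns None for a missing attribute,
                            # whereas p1['class'] would raise KeyError
print(p1.attrs, missing)
```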

find and findAll are used in the same way. The difference is that find returns only the first of the results findAll would return. For example:

>soup = BeautifulSoup(html, 'html.parser')
>pid = soup.find(href=re.compile("^http:"))  # again matching with an re regular expression
>print(pid)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
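The difference also shows when nothing matches. In the sketch below (fragment and the table tag name are chosen for illustration), find returns None while find_all returns an empty list, so code that chains onto find should check for None first.

```python
from bs4 import BeautifulSoup

html = '<a id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')
same = soup.find_all('a', limit=1)[0]   # find behaves like find_all with limit=1

missing = soup.find('table')            # no <table> in the fragment: None
missing_all = soup.find_all('table')    # no matches: empty list

if missing is None:
    print('no table found')
print(first['id'], same['id'], len(missing_all))
```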
