Python Web Scraping: Traversing the Document Tree

1. Direct children: the .contents and .children attributes

.contents

A Tag's .contents attribute returns the Tag's direct children as a list.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .contents returns the direct children as a list
print(soup.head.contents)
print(soup.head.contents[0])

Output

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
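
The <head> tag above has a single child, so it does not show that .contents also includes text nodes. The following is a minimal sketch on a shortened, made-up fragment of the same story paragraph. It assumes the built-in "html.parser" instead of lxml so that it runs without extra dependencies; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment: a <p> that mixes text and <a> tags
html = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'

# Assumption: html.parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .contents lists text nodes (NavigableString) and tags side by side
kinds = [type(node).__name__ for node in p.contents]
print(len(p.contents))  # 5
print(kinds)

# Keep only the child tags by filtering on the node type
link_ids = [node["id"] for node in p.contents if isinstance(node, Tag)]
print(link_ids)  # ['link1', 'link2']
```

Because .contents is an ordinary list, indexing, slicing and len() all work on it directly.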

.children

It does not return a list but an iterator; we can loop over it to get all the direct children.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .children is an iterator, so printing it shows the iterator object itself
print(soup.head.children)

# Loop over it to get every direct child
for child in soup.head.children:
    print(child)

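Output (the address changes from run to run, and the iterator's type name may differ between Beautiful Soup versions)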

<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>
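
To make the difference between .children and .contents concrete, here is a small sketch on a made-up list fragment. It assumes the built-in "html.parser" instead of lxml, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three direct children
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"

# Assumption: html.parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")

# .children yields the same nodes as .contents, one at a time
children = soup.ul.children
first_pass = [child.get_text() for child in children]
print(first_pass)  # ['one', 'two', 'three']

# An iterator can be consumed only once; a second pass yields nothing
second_pass = list(children)
print(second_pass)  # []

# Asking for .children again gives a fresh iterator over the same nodes
same_nodes = list(soup.ul.children) == soup.ul.contents
print(same_nodes)  # True
```

When random access or a length is needed, .contents is the more convenient choice; .children is enough for a single loop.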

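2. All descendants: the .descendants attribute

The .contents and .children attributes described above cover only a Tag's direct children. The .descendants attribute iterates recursively over all of a Tag's descendants: its children, their children, and so on. As with .children, we have to loop over it to get at the contents.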

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .descendants is a generator, so printing it shows the generator object itself
print(soup.head.descendants)

# Loop over it to get every descendant node
for child in soup.head.descendants:
    print(child)

Output

<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
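
The contrast between direct children and descendants is easier to see on a tag with some nesting. Below is a minimal sketch on a made-up fragment. It assumes the built-in "html.parser" instead of lxml, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical fragment with two levels of nesting
html = "<div><p><b>bold</b> text</p><span>tail</span></div>"

# Assumption: html.parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Direct children only: <p> and <span>
direct = [child.name for child in div.children]
print(direct)  # ['p', 'span']

# Descendants are visited depth-first, in document order
tag_names = [n.name for n in div.descendants if isinstance(n, Tag)]
texts = [str(n) for n in div.descendants if isinstance(n, NavigableString)]
print(tag_names)  # ['p', 'b', 'span']
print(texts)      # ['bold', ' text', 'tail']

# Text nodes count as descendants too: 3 tags + 3 strings
total = len(list(div.descendants))
print(total)  # 6
```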

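3. Node content: the .string attribute

If a Tag has exactly one child and that child is a NavigableString, .string returns that child. If a Tag has exactly one child and that child is another Tag, .string also works, and it returns the same value as the .string of that only child. If a Tag has more than one child, .string returns None, because it is unclear which child is meant.

Put simply: if a tag contains only text and no nested tags, .string returns that text. If a tag contains exactly one nested tag and nothing else, .string returns the content of that nested tag. For example: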

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# <head> has a single child, <title>, so both lines print the title text
print(soup.head.string)
print(soup.head.title.string)

Output

The Dormouse's story
The Dormouse's story
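
The example above covers only the cases where .string succeeds. The sketch below adds the case where a tag has several children and .string returns None, and shows .stripped_strings and get_text() as possible ways to collect the text anyway. It runs on a shortened, made-up fragment, assumes the built-in "html.parser" instead of lxml, and uses find_all() only to pick out the two paragraphs.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first <p> has one child, the second has two
html = (
    "<p><b>The Dormouse's story</b></p>"
    "<p>Once upon a time <a>Elsie</a></p>"
)

# Assumption: html.parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# Single child <b>: .string is passed through from the only child
print(title_p.string)  # The Dormouse's story

# Two children (a text node and <a>): .string is None
print(story_p.string)  # None

# Collect every text fragment instead
fragments = list(story_p.stripped_strings)
full_text = story_p.get_text()
print(fragments)  # ['Once upon a time', 'Elsie']
print(full_text)  # Once upon a time Elsie
```

Checking for None before using .string avoids errors when the page structure turns out to be richer than expected.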
