A Simple Introduction to Python's BeautifulSoup

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Reference: https://www.cnblogs.com/yupeng/p/3362031.html

What is BeautifulSoup?

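BeautifulSoup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree.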
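To make the three operations concrete, here is a minimal sketch. The markup is made up for illustration, and it assumes the `bs4` package is installed. It uses Python's built-in `html.parser` so that no extra parser is needed; other parsers may repair broken markup slightly differently.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: the <p> and <b> tags are never closed.
broken_html = "<p class='intro'>Hello <b>world"

# "html.parser" ships with Python; it is an assumption made here for portability.
soup = BeautifulSoup(broken_html, "html.parser")

# Navigating: step from the <p> tag down to its <b> child.
bold = soup.p.b
bold_text = bold.get_text()

# Searching: find the first <p> tag whose class is "intro".
intro = soup.find("p", class_="intro")

# Modifying: rename the tag and replace its text in place.
bold.name = "strong"
bold.string = "BeautifulSoup"

# Serializing the tree writes out the closing tags the input was missing.
repaired = str(soup)
print(repaired)
```

The parser accepts the unclosed tags without complaint, and the serialized tree contains the missing `</b>` (now `</strong>`) and `</p>`.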

The test example below briefly illustrates how BeautifulSoup is used.

    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <div  class="text" id="div1">测试</div>
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    def beautifulSoup_test():
        # soup is the parsed document built from html_doc (requires the lxml package)
        soup = BeautifulSoup(html_doc, 'lxml')

        # Get the <title> tag
        print(soup.title)
        # Output: <title>The Dormouse's story</title>

        # soup.p returns only the first <p> tag in the document; use find_all()
        # to get every match. find_all() returns a list-like sequence that you
        # can loop over to pick out what you need.
        print(soup.p)
        print(soup.find_all('p'))
        print(soup.find(id='link3'))

        # get_text() returns only the text. It works on the soup itself and on
        # any tag obtained from it.
        print(soup.get_text())

        # Get the link and id of each <a> tag
        aitems = soup.find_all('a')
        for item in aitems:
            print(item["href"], item["id"])

        # 1. Search by CSS class
        print(soup.find_all("a", class_="sister"))
        # Output (attribute order may vary between versions):
        # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
        # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
        print(soup.select("p.title"))
        # Output: [<p class="title"><b>The Dormouse's story</b></p>]

        # 2. Search by attribute
        print(soup.find_all("a", attrs={"class": "sister"}))
        # Output: the same three <a> tags as in step 1

        # 3. Search by text (string= is the current name of the older text= argument)
        print(soup.find_all(string="Elsie"))
        # Output: ['Elsie']
        print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
        # Output: ['Elsie', 'Lacie', 'Tillie']

        # 4. Limit the number of results
        print(soup.find_all("a", limit=2))
        # Output: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

        # Search by id; id=True matches every tag that has an id attribute
        print(soup.find_all(id="link2"))
        # Output: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
        print(soup.find_all(id=True))
        # Output: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
        # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
        # <div class="text" id="div1">测试</div>]

    if __name__ == "__main__":
        beautifulSoup_test()
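The lookups above print whole tags. In practice the matched tags are usually turned into plain records. The following is a minimal sketch of that step, not part of the original example: `extract_links` is a hypothetical helper name, the markup is a shortened copy in which the third link's `href` has been removed on purpose, and `html.parser` is used instead of `lxml` so the snippet runs without an extra parser installed.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for illustration);
# the third <a> tag intentionally has no href attribute.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def extract_links(soup):
    """Return one dict per <a class="sister"> tag (hypothetical helper)."""
    records = []
    for tag in soup.select("a.sister"):
        records.append({
            "id": tag.get("id"),
            # tag.get() returns None for a missing attribute,
            # whereas tag["href"] would raise KeyError.
            "href": tag.get("href"),
            "text": tag.get_text(strip=True),
        })
    return records


links = extract_links(soup)
for record in links:
    print(record)
```

Using `tag.get("href")` rather than `tag["href"]` keeps the loop from failing on tags that lack the attribute, which is common on real pages.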

Reposted from: https://www.cnblogs.com/shaosks/p/10300054.html
