Notes on Beautiful Soup select() syntax

Suppose we have an HTML document like this:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
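<p class="story">...</p>
"""

from bs4 import BeautifulSoup

# Parse the sample document; the select() calls on this document use this soup object.
soup = BeautifulSoup(html, "html.parser")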

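When writing CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements. The method is soup.select(), and it returns a list.

(1) Selecting by tag name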

print(soup.select('title'))
# [<title>The Dormouse's story</title>]
print(soup.select('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('b'))
# [<b>The Dormouse's story</b>]
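
Since select() returns a list, the result can be indexed, iterated, and checked for emptiness like any other list. Below is a minimal sketch of that, assuming a shortened copy of the sample document; the variable names (links, names, first_href, missing) are my own and only for illustration.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document, so the block runs on its own.
html = """
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("a")                         # list-like result
names = [a.get_text() for a in links]            # text of every match
first_href = links[0]["href"] if links else None # guard against no matches
missing = soup.select("table")                   # no match: empty list, not None

print(names, first_href, len(missing))
```

Checking the list before indexing avoids an IndexError when a selector matches nothing.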

(2) Selecting by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Selecting by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

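(4) Selecting child elements

1. Direct children

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]

2. Indirect children (descendants)

No > symbol is needed; separate the selectors with a space instead.

For example, given this markup: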

<div class="right-content">
<ul class="news-1" data-sudaclick="news_p">
<li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ihkhfqnt7246449.shtml" target="_blank">瑞典电视台播辱华节目 我大使馆：挑战人性良知</a></li>
<li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5895667.shtml" target="_blank">日本侦察机四天五次绕飞雪龙号 所为何事？</a></li>
<li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ifxeuwwr7253238.shtml" target="_blank">山西“吕梁头号官霸”敛财10亿 离任时有人送花圈</a></li>
<li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5836797.shtml" target="_blank">“中国人素质全球倒数”谣言又来 联合国亲自打脸</a></li>
</ul>

Run the following code:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://news.sina.com.cn/china/")
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, "html.parser")
# ">" requires each li to be a direct child of .right-content
for newslink in soup.select(".right-content>li a"):
    print("新闻链接", newslink["href"])  # label means "news link"
    print("新闻标题", newslink.text)     # label means "news title"

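This prints nothing: ".right-content>li" only matches li elements that are direct children of the .right-content element, but in this markup every li sits inside a ul.

By contrast, the following code: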

import requests
from bs4 import BeautifulSoup

res = requests.get("https://news.sina.com.cn/china/")
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, "html.parser")
# A space matches li at any depth below .right-content
for newslink in soup.select(".right-content li a"):
    print("新闻链接", newslink["href"])  # label means "news link"
    print("新闻标题", newslink.text)     # label means "news title"

Output:

新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ihkhfqnt7246449.shtml
新闻标题 瑞典电视台播辱华节目 我大使馆：挑战人性良知
新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5895667.shtml
新闻标题 日本侦察机四天五次绕飞雪龙号 所为何事？
新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ifxeuwwr7253238.shtml
新闻标题 山西“吕梁头号官霸”敛财10亿 离任时有人送花圈
新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5836797.shtml
新闻标题 “中国人素质全球倒数”谣言又来 联合国亲自打脸
新闻链接 http://news.sina.com.cn/c/2018-09-24/doc-ihkmwytn7611790.shtml
新闻标题 美媒：用AI和无人机修复 中国长城还能存在几百年
新闻链接 http://news.sina.com.cn/c/2018-09-24/doc-ihkmwytn7608791.shtml
新闻标题 中国民用航空飞行学院:没有学姐为学妹立规矩之说
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7553959.shtml
新闻标题 人民日报评瑞典播放辱华节目：辱华者须付出代价
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ifxeuwwr7543016.shtml
新闻标题 香港各界对高锟逝世深表哀悼：为人为学皆为楷模
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7518406.shtml
新闻标题 瑞典电视台就辱华节目不道歉 我们该置多少气？
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7508396.shtml
新闻标题 这个省会市长空缺近8个月后 迎来新任市委副书记
新闻链接 http://news.sina.com.cn/s/2018-09-23/doc-ihkmwytn7488367.shtml
新闻标题 家长晒官职全网疯传背后 是一件大事已发生质变
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ifxeuwwr7531642.shtml
新闻标题 民调称柯文哲满意度6成 竞选对手：政治骗子
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7446955.shtml
新闻标题 辽宁铁岭调兵山市发生2.9级地震 震源深度0千米
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7436720.shtml
新闻标题 环球谈瑞典辱华节目:认为"洋大人"没错
新闻链接 http://news.sina.com.cn/o/2018-09-23/doc-ihkmwytn7417416.shtml
新闻标题 越南将为已故国家主席陈大光举行国葬 降半旗致哀

The two selectors differ by a single character: the > combinator versus a space.
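
The same behavior can be reproduced without hitting the network. This is a small sketch, assuming a made-up fragment with the same nesting as the markup above (div, then ul, then li, then a); the example.com URLs and link texts are placeholders, not real data.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same nesting as the news list.
fragment = """
<div class="right-content">
  <ul class="news-1">
    <li><a href="http://example.com/a">First</a></li>
    <li><a href="http://example.com/b">Second</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(fragment, "html.parser")

# li is not a direct child of .right-content (ul is in between): no match.
direct = soup.select(".right-content > li a")
# A space matches li at any depth below .right-content.
descendant = soup.select(".right-content li a")
# Spelling out every level with ">" also works.
explicit = soup.select(".right-content > ul > li > a")

print(len(direct), len(descendant), len(explicit))  # 0 2 2
```

If the intermediate levels of a page are unknown or likely to change, the space form is the more forgiving choice; the > form is useful when matches at deeper levels must be excluded.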

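(5) Combined selectors

Combined selection follows the same rules as combining tag names with class names and ids in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]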
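
To see why the space matters, here is a short sketch that compares selectors for the same node (no space) with selectors for nested nodes (space). It assumes a trimmed copy of the sample document; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag, class and id on the SAME node: written together, no spaces.
same_node = soup.select("a.sister#link2")
# A .sister element somewhere INSIDE a p.story: separated by a space.
nested = soup.select("p.story .sister")
# Without the space this asks for a p whose own id is link1: no match.
no_space = soup.select("p#link1")
with_space = soup.select("p #link1")

print([a["id"] for a in same_node], len(nested), no_space, [a["id"] for a in with_space])
```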

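(6) Selecting by attribute

An attribute condition can also be added to a selector. The attribute goes in square brackets. Because the attribute and the tag belong to the same node, there must be no space between them; otherwise nothing will match.

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]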

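Likewise, attribute selectors can be combined with the forms above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without a space.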
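
A few such combinations, as a sketch under assumptions: the document below is a trimmed copy of the sample with one extra link (id "outside") that I added so the prefix match has something to exclude. The ^= operator is the standard CSS "starts with" attribute selector.

```python
from bs4 import BeautifulSoup

# Trimmed sample plus one hypothetical link outside the paragraph.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
<a href="http://other.example.org/x" id="outside">Outside</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute on the same node as the tag: no space.
exact = soup.select('a[href="http://example.com/elsie"]')
# Attribute selector combined with a descendant selector: space after p.
inside_p = soup.select('p a[href="http://example.com/elsie"]')
# With a space, this looks for a matching element INSIDE an a tag: no match.
with_space = soup.select('a [href="http://example.com/elsie"]')
# Class and attribute on the same node.
class_and_attr = soup.select('a.sister[id="link2"]')
# Prefix match on the attribute value.
prefix = soup.select('a[href^="http://example.com/"]')

print(len(exact), len(inside_p), len(with_space), len(class_and_attr), len(prefix))
```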
