Crawler Libraries: Learning BeautifulSoup (Part 5)

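CSS selectors:

When writing CSS, a tag name takes no prefix, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements through the soup.select() method, which returns a list.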
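
To make the three prefix rules concrete, here is a minimal sketch that builds a soup object and runs one selector of each kind. The html string (a shortened "Dormouse's story" page) and the use of Python's built-in html.parser are assumptions for illustration; the post itself uses the lxml parser.

```python
from bs4 import BeautifulSoup

# Assumed sample document (a shortened "Dormouse's story" page).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>
"""

# html.parser ships with Python; 'lxml' also works if it is installed.
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("title")      # tag name: no prefix
by_class = soup.select(".sister")  # class name: leading dot
by_id = soup.select("#link1")      # id: leading '#'

print(isinstance(by_tag, list))    # the result behaves like a list
print(len(by_class))               # number of tags with class "sister"
print(by_id[0]["href"])            # attributes are read like dict keys
```

Even a selector that can match only one element, such as an id, still returns a list, so index it or loop over it before reading attributes.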

1) Find by tag name

print(soup.select('title'))

#[<title>The Dormouse's story</title>]

print(soup.select('a'))

#[<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2) Find by class name

print(soup.select('.sister'))

3) Find by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

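4) Combined lookup

Combined lookup follows the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space.

print(soup.select('p #link1'))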
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

Direct child lookup

print(soup.select("head>title"))

#[<title>The Dormouse's story</title>]
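
The difference between a space (any descendant) and > (direct child only) is easy to miss, so here is a small sketch comparing the two. The HTML fragment and its ids are made up for illustration and are not part of the sample page used in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <a> directly under the <div>, one nested in a <p>.
html = """
<div id="menu">
  <a id="top" href="/home">Home</a>
  <p><a id="nested" href="/about">About</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("div a")   # any <a> inside the <div>, at any depth
children = soup.select("div > a")    # only <a> tags that are direct children
same_node = soup.select("div#menu")  # no space: the <div> itself has id="menu"

print([a["id"] for a in descendants])
print([a["id"] for a in children])
print([d.name for d in same_node])
```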

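5) Find by attribute

A selector can also include attributes, written inside square brackets. Because the attribute and the tag belong to the same node, there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))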
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

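Attribute selectors can likewise be combined with the lookups above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are not.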

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]
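
As a quick check of the spacing rule, the sketch below runs the same attribute selector with and without a space after the tag name. The HTML fragment is a trimmed, assumed version of the sample page.

```python
from bs4 import BeautifulSoup

# Assumed fragment with two links inside one paragraph.
html = """
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# No space: the <a> tag itself must carry the href attribute.
same_node = soup.select('a[href="http://example.com/elsie"]')
# With a space: looks for an element *inside* an <a> that carries the href.
with_space = soup.select('a [href="http://example.com/elsie"]')
# Different nodes (p, then a) are separated by a space.
combined = soup.select('p.story a[href="http://example.com/elsie"]')

print(len(same_node), len(with_space), len(combined))
print(same_node[0]["id"])
```

The version with the space returns an empty list here because the links contain only text, not child elements.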

Every select call above returns a list, so you can iterate over the results and call get_text() on each item to retrieve its text.

from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())

So that is another lookup method, one that achieves much the same result as find_all. Convenient, isn't it?
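
To illustrate the comparison with find_all, here is a small sketch that runs the same query both ways. The HTML fragment is assumed for illustration; select_one is a related bs4 method that returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Assumed fragment: two "sister" links and one unrelated link.
html = """
<p class="story">
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
  <a class="other" href="http://example.com/x" id="link4">Other</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# The same query expressed as a CSS selector and as find_all arguments.
via_select = soup.select("a.sister")
via_find_all = soup.find_all("a", class_="sister")

ids_select = [a["id"] for a in via_select]
ids_find_all = [a["id"] for a in via_find_all]
print(ids_select == ids_find_all)

# select_one returns a single Tag (or None), so no indexing is needed.
first = soup.select_one("#link3")
print(first.get_text())
```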

[Hands-on practice]:

Scrape the rank, singer, song title, and duration from the Kugou Top 500 chart

# -*- coding: utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 '
                  'XiaoBai/10.0.1708.542 (XBCEF)'
}


def get_info(url):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    ranks = soup.select('span.pc_temp_num')  # rank
    titles = soup.select('div.pc_temp_songlist > ul > li > a')  # song title
    durations = soup.select('span.pc_temp_tips_r > span')  # song duration
    # The loop variable is named duration so it does not shadow the time module
    for rank, title, duration in zip(ranks, titles, durations):
        data = {
            'rank': rank.get_text().strip(),
            'singer': title.get_text().split('-')[0].strip(),
            'song': title.get_text().split('-')[-1].strip(),
            'time': duration.get_text().strip()
        }
        print(data)


if __name__ == '__main__':
    # Build the URLs for pages 1 to 23
    urls = ['http://www.kugou.com/yy/rank/home/{}-8888.html'.format(number) for number in range(1, 24)]
    for url in urls:
        get_info(url)  # scrape each page in turn
        time.sleep(1)  # pause between requests
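
The selector logic in the script above can be tried offline, without sending any requests. The sketch below is only an illustration: the HTML fragment is a hypothetical imitation of the structure those selectors expect (the real Kugou markup may differ or change), parse_rank_page is a made-up helper name, and it splits titles with partition(" - ") rather than the split('-') used in the script.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure the selectors expect.
html = """
<div class="pc_temp_songlist">
  <ul>
    <li>
      <span class="pc_temp_num"> 1 </span>
      <a href="#">Singer A - Song One</a>
      <span class="pc_temp_tips_r"><span> 3:45 </span></span>
    </li>
    <li>
      <span class="pc_temp_num"> 2 </span>
      <a href="#">Singer B - Song Two</a>
      <span class="pc_temp_tips_r"><span> 4:02 </span></span>
    </li>
  </ul>
</div>
"""


def parse_rank_page(page_html):
    """Return one dict per song found in page_html."""
    soup = BeautifulSoup(page_html, "html.parser")
    ranks = soup.select("span.pc_temp_num")
    titles = soup.select("div.pc_temp_songlist > ul > li > a")
    durations = soup.select("span.pc_temp_tips_r > span")
    rows = []
    # zip pairs the three lists by position and stops at the shortest one.
    for rank, title, duration in zip(ranks, titles, durations):
        # Split "singer - song" at the first " - " separator.
        singer, _, song = title.get_text().partition(" - ")
        rows.append({
            "rank": rank.get_text().strip(),
            "singer": singer.strip(),
            "song": song.strip(),
            "time": duration.get_text().strip(),
        })
    return rows


rows = parse_rank_page(html)
for row in rows:
    print(row)
```

Keeping the parsing in a function that takes an HTML string makes it possible to test the selectors against a saved page before running the full crawl.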

Reposted from: https://www.cnblogs.com/yu2000/p/6861620.html
