Python Web Scraping with XPath (Scraping Qiushibaike, Shanbay Word Lists, and NetEase Cloud Music)

1. Introduction to XML

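You might say: my regular expressions are shaky, and picking HTML documents apart with them is exhausting. Is there another way? There is: XPath. The approach is:
(1) first parse the HTML into an XML-style document tree,
(2) then use XPath to locate nodes or elements in that tree.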
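The two steps above can be sketched with lxml, the library used in the code later in this article. The HTML fragment, its id, and its links are made up for illustration.

```python
from lxml import etree

# Hypothetical HTML fragment (not taken from any real site).
html_text = """
<div id="news">
  <ul>
    <li><a href="/a/1">First story</a></li>
    <li><a href="/a/2">Second story</a></li>
  </ul>
</div>
"""

# Step 1: parse the HTML text into an element tree.
tree = etree.HTML(html_text)

# Step 2: query the tree with XPath expressions.
titles = tree.xpath('//div[@id="news"]//a/text()')
links = tree.xpath('//div[@id="news"]//a/@href')

print(titles)  # ['First story', 'Second story']
print(links)   # ['/a/1', '/a/2']
```

Each call to xpath() returns a plain Python list, so the results can be indexed, sliced, or looped over directly.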

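What is XML?

XML stands for Extensible Markup Language.
XML is a markup language, much like HTML.
XML is designed to carry data, not to display it.
XML tags are not predefined; you define your own.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.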
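As a small illustration of self-defined, self-descriptive tags, the sketch below parses a made-up note document with lxml. The tag names (note, to, from, body) are invented for this example; nothing in XML itself prescribes them.

```python
from lxml import etree

# A hypothetical XML document; every tag name here is our own choice.
xml_text = """
<note>
  <to>Alice</to>
  <from>Bob</from>
  <body>Lunch at noon?</body>
</note>
""".strip()

root = etree.fromstring(xml_text)

# The tag names describe the data they carry.
fields = {child.tag: child.text for child in root}

print(root.tag)  # note
print(fields)
```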

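2. What is XPath?

XPath (XML Path Language) is a language for finding information in an XML document. It can be used to navigate through the elements and attributes of the document.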

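2.1 Selecting nodes

The most commonly used path expressions:

Expression   Description
nodename     Selects the child elements of the current node that are named nodename.
/            Selects from the root node.
//           Selects matching nodes at any depth below the starting point, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.
text()       Selects the text content of an element.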

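Examples:

Path expression    Result
bookstore          Selects the bookstore elements that are children of the current node.
/bookstore         Selects the root element bookstore. Note: a path that starts with a slash (/) is always an absolute path to an element.
bookstore/book     Selects all book elements that are children of bookstore.
//book             Selects all book elements, wherever they are in the document.
bookstore//book    Selects all book elements that are descendants of bookstore, however deep below bookstore they sit.
//@lang            Selects all attributes named lang.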
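The path expressions above can be tried out with lxml on a small bookstore document. This is a minimal sketch; the two books and their prices are sample data made up for illustration.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring("""
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
""".strip())

root_elements = root.xpath('/bookstore')     # absolute path to the root element
child_books = root.xpath('book')             # relative: book children of the context node
all_books = root.xpath('//book')             # every book in the document
titles = root.xpath('//title/text()')        # text content of every title
langs = root.xpath('//@lang')                # every lang attribute value

first_book = all_books[0]
own_title = first_book.xpath('./title/text()')  # "." starts from the current node
parent_tag = first_book.xpath('..')[0].tag      # ".." moves to the parent node

print(titles)      # ['Harry Potter', 'Learning XML']
print(langs)       # ['eng', 'eng']
print(own_title)   # ['Harry Potter']
print(parent_tag)  # bookstore
```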

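2.2 Predicates

A predicate finds a specific node, or a node that contains a specific value. Predicates are written inside square brackets.

The table below lists some path expressions with predicates, together with their results:

Path expression                       Result
/bookstore/book[1]                    Selects the first book element that is a child of bookstore.
/bookstore/book[last()]               Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]             Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]         Selects the first two book elements that are children of bookstore.
//title[@lang]                        Selects all title elements that have an attribute named lang.
//title[@lang='eng']                  Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]          Selects all book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title    Selects the title elements of all book elements under bookstore whose price element has a value greater than 35.00.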
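The predicates in the table can be checked against a four-book sample with lxml. The books, languages, and prices below are made up for illustration. Note that XPath positions start at 1, not 0 as in Python lists.

```python
from lxml import etree

# Hypothetical sample data.
root = etree.fromstring(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="eng">Learning XML</title><price>39.95</price></book>'
    '<book><title lang="fr">Le Petit Prince</title><price>12.50</price></book>'
    '<book><title>Untitled Draft</title><price>45.00</price></book>'
    '</bookstore>'
)

first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_to_last = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]/title/text()')
with_lang = root.xpath('//title[@lang]/text()')
english = root.xpath("//title[@lang='eng']/text()")
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first)           # ['Harry Potter']
print(last)            # ['Untitled Draft']
print(second_to_last)  # ['Le Petit Prince']
print(first_two)       # ['Harry Potter', 'Learning XML']
print(with_lang)       # ['Harry Potter', 'Learning XML', 'Le Petit Prince']
print(english)         # ['Harry Potter', 'Learning XML']
print(expensive)       # ['Learning XML', 'Untitled Draft']
```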

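2.3 Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard   Description
*          Matches any element node.
@*         Matches any attribute node.
node()     Matches any node of any kind.

Some path expressions and their results:

Path expression    Result
/bookstore/*       Selects all child elements of the bookstore element.
//*                Selects all elements in the document.
//title[@*]        Selects all title elements that have at least one attribute.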
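A minimal lxml sketch of the wildcards follows, using made-up sample data. The first book contains a comment so that the difference between * (elements only) and node() (any node) becomes visible.

```python
from lxml import etree

# Hypothetical sample data, written without whitespace between tags
# so that node() does not also pick up whitespace-only text nodes.
root = etree.fromstring(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title><!-- bestseller --><price>29.99</price></book>'
    '<book><title>Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)

child_tags = [el.tag for el in root.xpath('/bookstore/*')]
element_count = len(root.xpath('//*'))
titles_with_attrs = root.xpath('//title[@*]/text()')
attr_values = root.xpath('//title/@*')

# "*" matches element nodes only; node() also matches the comment.
elements_in_first = len(root.xpath('/bookstore/book[1]/*'))
nodes_in_first = len(root.xpath('/bookstore/book[1]/node()'))

print(child_tags)         # ['book', 'book']
print(element_count)      # 7
print(titles_with_attrs)  # ['Harry Potter']
print(attr_values)        # ['eng']
print(elements_in_first, nodes_in_first)  # 2 3
```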

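2.4 Selecting several paths

The "|" operator in a path expression selects several paths at once. The table below lists some path expressions and their results:

Path expression                     Result
//book/title | //book/price         Selects all title and price elements of all book elements.
//title | //price                   Selects all title and price elements in the document.
/bookstore/book/title | //price     Selects all title elements of the book elements under bookstore, plus all price elements in the document.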
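The "|" operator can be sketched with lxml as follows. The sample document is made up, and it includes a cd element so that the scope of each path makes a difference.

```python
from lxml import etree

# Hypothetical sample data: two books and one cd.
root = etree.fromstring(
    '<bookstore>'
    '<book><title>Harry Potter</title><price>29.99</price></book>'
    '<book><title>Learning XML</title><price>39.95</price></book>'
    '<cd><title>Greatest Hits</title><price>9.90</price></cd>'
    '</bookstore>'
)

# title and price elements of book elements only
book_fields = root.xpath('//book/title | //book/price')

# every title and price element in the document, including those under cd
all_fields = root.xpath('//title | //price')

# book titles, plus every price in the document
mixed = root.xpath('/bookstore/book/title/text() | //price/text()')

print([el.tag for el in book_fields])
print(len(all_fields))
print(mixed)
```

A node matched by both sides of the "|" appears only once in the result, because the result is a set of nodes.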

2.5 XPath operators

Operator   Description                Example                      Result
|          Union of two node-sets     //book | //cd                A node-set containing all book and cd elements
+          Addition                   6 + 4                        10
-          Subtraction                6 - 4                        2
*          Multiplication             6 * 4                        24
div        Division                   8 div 4                      2
=          Equal                      price=9.80                   true if price is 9.80; false if price is 9.90
!=         Not equal                  price!=9.80                  true if price is 9.90; false if price is 9.80
<          Less than                  price<9.80                   true if price is 9.00; false if price is 9.90
<=         Less than or equal to      price<=9.80                  true if price is 9.00; false if price is 9.90
>          Greater than               price>9.80                   true if price is 9.90; false if price is 9.80
>=         Greater than or equal to   price>=9.80                  true if price is 9.90; false if price is 9.70
or         Logical or                 price=9.80 or price=9.70     true if price is 9.80; false if price is 9.50
and        Logical and                price>9.00 and price<9.90    true if price is 9.80; false if price is 8.50
mod        Modulus (remainder)        5 mod 2                      1
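The operators can be evaluated directly with lxml, either as stand-alone expressions or inside predicates. This is a minimal sketch; the book titles A, B, C and their prices are made up to mirror the values in the table.

```python
from lxml import etree

# Hypothetical sample data.
root = etree.fromstring(
    '<bookstore>'
    '<book><title>A</title><price>9.80</price></book>'
    '<book><title>B</title><price>9.90</price></book>'
    '<book><title>C</title><price>8.50</price></book>'
    '</bookstore>'
)

# Arithmetic expressions come back as Python floats.
total = root.xpath('6 + 4')
quotient = root.xpath('8 div 4')
remainder = root.xpath('5 mod 2')

# A top-level comparison comes back as a Python bool.
first_is_980 = root.xpath('/bookstore/book[1]/price = 9.80')

# Inside a predicate, comparisons and and/or filter the node-set.
equal = root.xpath('//book[price=9.80]/title/text()')
not_equal = root.xpath('//book[price!=9.80]/title/text()')
in_range = root.xpath('//book[price>9.00 and price<9.90]/title/text()')
either = root.xpath('//book[price=9.80 or price=8.50]/title/text()')

print(total, quotient, remainder)  # 10.0 2.0 1.0
print(first_is_980)                # True
print(equal, not_equal)            # ['A'] ['B', 'C']
print(in_range, either)            # ['A'] ['A', 'C']
```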

3. Code

Scraping Qiushibaike:

import json

import requests
from lxml import etree

# 1. Base URL
base_url = 'https://www.qiushibaike.com/'

# 2. Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}


def first(results, default=''):
    """Return the first XPath result, stripped, or a default if the list is empty."""
    return results[0].strip() if results else default


response = requests.get(base_url, headers=headers)
# Parse the page into an element tree that supports XPath queries
html = etree.HTML(response.text)
max_page = html.xpath('//span[@class="page-numbers"]/text()')[-1].strip()

items = []  # not named `list`, so the built-in is not shadowed
for page in range(1, int(max_page) + 1):
    # e.g. https://www.qiushibaike.com/8hr/page/1/
    url = 'https://www.qiushibaike.com/8hr/page/{}'.format(page)
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)
    li_list = html.xpath('//div[@class="recommend-article"]/ul/li')
    for site in li_list:
        title = first(site.xpath('.//div[@class="recmd-right"]/a/text()'))
        pic = first(site.xpath('.//a[contains(@class,"recmd-left")]/img/@src'))
        if pic:
            pic = 'http:' + pic
        comments = first(site.xpath('.//div[@class="recmd-num"]/span[last()-1]/text()'))
        funny_num = first(site.xpath('.//div[@class="recmd-num"]/span[1]/text()'))
        # e.g. https://www.qiushibaike.com/article/121184311
        detail_url = first(site.xpath('.//div[@class="recmd-right"]/a/@href'))
        if detail_url:
            detail_url = 'https://www.qiushibaike.com' + detail_url
        item = {
            'title': title,
            'pic': pic,
            'comments': comments,
            'funny_num': funny_num,
            'detail_url': detail_url,
        }
        items.append(item)
        print(item)

# Write all items once, and close the file when done
with open('qiushi.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False)
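The selectors used in the Qiushibaike scraper can be tried offline. The markup below is a hypothetical fragment modeled on those selectors, not a copy of the real page. It shows why the image link is matched with contains(@class, ...) rather than an exact comparison, and how span[1] and span[last()-1] pick out the counters.

```python
from lxml import etree

# Hypothetical markup; class names follow the scraper's selectors,
# everything else (URLs, numbers, the extra "video" class) is made up.
li = etree.HTML("""
<ul><li>
  <a class="recmd-left video" href="/article/1"><img src="//pic.example.com/1.jpg"></a>
  <div class="recmd-right">
    <a href="/article/1">First post</a>
    <div class="recmd-num"><span>120</span><span>funny</span><span>8</span><span>comments</span></div>
  </div>
</li></ul>
""").xpath('//li')[0]

# An exact match fails when the class attribute holds more than one class name.
exact = li.xpath('.//a[@class="recmd-left"]/img/@src')
# contains() matches as long as the substring is present.
partial = li.xpath('.//a[contains(@class,"recmd-left")]/img/@src')

funny_num = li.xpath('.//div[@class="recmd-num"]/span[1]/text()')
comments = li.xpath('.//div[@class="recmd-num"]/span[last()-1]/text()')

print(exact)      # []
print(partial)    # ['//pic.example.com/1.jpg']
print(funny_num)  # ['120']
print(comments)   # ['8']
```

Because a query that matches nothing returns an empty list, indexing the result with [0] raises IndexError; checking the list first avoids a crash on items that lack an image or a counter.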

Scraping Shanbay word lists:

import requests
from lxml import etree


class Shanbei():
    def __init__(self, base_url):
        self.base_url = base_url
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        self.parse()

    def request_xpath(self, url):
        """Fetch a page and return it as an element tree for XPath queries."""
        response = requests.get(url, headers=self.headers)
        return etree.HTML(response.text)

    def get_text(self, results):
        """Return the first XPath result, stripped, or '' if the list is empty."""
        return results[0].strip() if results else ''

    def parse(self):
        word_dict = {}
        for page in range(1, 4):
            # e.g. https://www.shanbay.com/wordlist/110521/232414/?page=2
            url = self.base_url + '?page=%s' % page
            html = self.request_xpath(url)
            # `//tr` matches the rows whether or not the markup contains <tbody>
            word_list = html.xpath('//table[@class="table table-bordered table-striped"]//tr')
            for site in word_list:
                # The leading `.` keeps each query inside the current row;
                # a leading `//` would search the whole document on every iteration.
                word_en = self.get_text(site.xpath('.//td[@class="span2"]/strong/text()'))
                word_zh = self.get_text(site.xpath('.//td[@class="span10"]/text()'))
                if word_en:
                    word_dict[word_en] = word_zh
        print(word_dict)


if __name__ == '__main__':
    base_url = 'https://www.shanbay.com/wordlist/110521/232414/'
    Shanbei(base_url)
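When looping over table rows, it matters whether a query starts with // or with .//. The sketch below uses a made-up two-row table (the class name "words" and the entries are hypothetical) to show the difference: // searches the whole document even when called on a single row, while .// stays inside that row.

```python
from lxml import etree

# Hypothetical word table.
html = etree.HTML("""
<table class="words">
  <tr><td class="span2"><strong>apple</strong></td><td class="span10">a fruit</td></tr>
  <tr><td class="span2"><strong>book</strong></td><td class="span10">pages bound together</td></tr>
</table>
""")

rows = html.xpath('//table[@class="words"]//tr')
first_row = rows[0]

# Absolute: ignores the row and searches the entire document.
absolute = first_row.xpath('//td[@class="span2"]/strong/text()')
# Relative: searches only below the current row.
relative = first_row.xpath('.//td[@class="span2"]/strong/text()')

# Building the dictionary row by row with relative paths.
word_dict = {}
for row in rows:
    en = row.xpath('.//td[@class="span2"]/strong/text()')
    zh = row.xpath('.//td[@class="span10"]/text()')
    if en:
        word_dict[en[0]] = zh[0] if zh else ''

print(absolute)   # ['apple', 'book']
print(relative)   # ['apple']
print(word_dict)  # {'apple': 'a fruit', 'book': 'pages bound together'}
```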

Scraping all artists from NetEase Cloud Music:

import requests
from lxml import etree


class Wangyi():
    def __init__(self, url):
        self.base_url = url
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        self.html = self.request_xpath(self.base_url)
        self.parse()

    def request_xpath(self, url):
        """Fetch a page and return it as an element tree for XPath queries."""
        response = requests.get(url=url, headers=self.headers)
        return etree.HTML(response.text)

    def parse(self):
        # Links to the artist categories; the first entry is dropped
        singer_list_url = self.html.xpath('//a[@class="cat-flag"]/@href')
        del singer_list_url[0]
        for site in singer_list_url:
            url = 'https://music.163.com{}'.format(site)
            html = self.request_xpath(url)
            # Links for each initial letter; position()>1 skips the first <li>
            # e.g. /discover/artist/cat?id=6001&initial=65
            li_url_list = html.xpath('//ul[@class="n-ltlst f-cb"]/li[position()>1]/a/@href')
            for li_url in li_url_list:
                self.parse_detail(li_url)

    def parse_detail(self, url):
        base_url = 'https://music.163.com{}'.format(url)
        html = self.request_xpath(base_url)
        for link in html.xpath('//a[@class="nm nm-icn f-thide s-fc0"]'):
            item = {
                'singer_name': (link.text or '').strip(),
                # Assumption: the artist link's href is a site-relative path
                'detail_url': 'https://music.163.com' + link.get('href', '').strip(),
            }
            print(item)


if __name__ == '__main__':
    base_url = 'https://music.163.com/discover/artist'
    Wangyi(base_url)
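The li[position()>1] predicate used for the initial-letter links can be tried offline. The list below is a hypothetical stand-in for the real page: the class name and URL pattern follow the code above, while the link labels are made up. Joining the paths with urljoin from the standard library is one alternative to string formatting, shown here as a suggestion rather than as what the scraper does.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical fragment; the first <li> plays the role of the entry to skip.
html = etree.HTML("""
<ul class="n-ltlst f-cb">
  <li><a href="/discover/artist/cat?id=6001&amp;initial=-1">Hot</a></li>
  <li><a href="/discover/artist/cat?id=6001&amp;initial=65">A</a></li>
  <li><a href="/discover/artist/cat?id=6001&amp;initial=66">B</a></li>
</ul>
""")

# position()>1 skips the first <li> inside the query itself,
# so no `del result[0]` is needed afterwards.
hrefs = html.xpath('//ul[@class="n-ltlst f-cb"]/li[position()>1]/a/@href')

# Turn the site-relative paths into absolute URLs.
urls = [urljoin('https://music.163.com', href) for href in hrefs]

print(hrefs)
print(urls)
```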
