python的xpath用法介绍_python爬虫之xpath的基本使用详解

本篇文章主要介绍了python爬虫之xpath的基本使用详解，现在分享给大家，也给大家做个参考。一起过来看看吧

一、简介

XPath 是一门在 XML 文档中查找信息的语言。XPath 可用来在 XML 文档中对元素和属性进行遍历。XPath 是 W3C XSLT 标准的主要元素，并且 XQuery 和 XPointer 都构建于 XPath 表达之上。

二、安装

pip3 install lxml

三、使用

1、导入

from lxml import etree

2、基本使用

from lxml import etree

wb_data = """

first item
second item
third item
fourth item
fifth item

"""

html = etree.HTML(wb_data)

print(html)

result = etree.tostring(html)

print(result.decode("utf-8"))

从下面的结果来看，我们打印机html其实就是一个python对象，etree.tostring(html)则是不全里html的基本写法，补全了缺胳膊少腿的标签。

first item
second item
third item
fourth item
fifth item

3、获取某个标签的内容(基本使用)，注意，获取a标签的所有内容，a后面就不用再加正斜杠，否则报错。

写法一

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/p/ul/li/a')

print(html)

for i in html_data:

print(i.text)

first item

second item

third item

fourth item

fifth item

写法二(直接在需要查找内容的标签后面加一个/text()就行)

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/p/ul/li/a/text()')

print(html)

for i in html_data:

print(i)

first item

second item

third item

fourth item

fifth item

4、打开读取html文件

#使用parse打开html的文件

html = etree.parse('test.html')

html_data = html.xpath('//*')
#打印是一个列表，需要遍历

print(html_data)

for i in html_data:

print(i.text)

html = etree.parse('test.html')

html_data = etree.tostring(html,pretty_print=True)

res = html_data.decode('utf-8')

print(res)

打印：

first item
second item
third item
fourth item
fifth item

5、打印指定路径下a标签的属性(可以通过遍历拿到某个属性的值，查找标签的内容)

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/p/ul/li/a/@href')

for i in html_data:

print(i)

打印：link1.html

link2.html

link3.html

link4.html

link5.html

6、我们知道我们使用xpath拿到得都是一个个的ElementTree对象，所以如果需要查找内容的话，还需要遍历拿到数据的列表。

查到绝对路径下a标签属性等于link2.html的内容。

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/p/ul/li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]/text()')

print(html_data)

for i in html_data:

print(i)

打印：['second item']

second item

7、上面我们找到全部都是绝对路径(每一个都是从根开始查找)，下面我们查找相对路径，例如，查找所有li标签下的a标签内容。

html = etree.HTML(wb_data)

html_data = html.xpath('//li/a/text()')

print(html_data)

for i in html_data:

print(i)

打印：['first item', 'second item', 'third item', 'fourth item', 'fifth item']

first item

second item

third item

fourth item

fifth item

8、上面我们使用绝对路径，查找了所有a标签的属性等于href属性值，利用的是/---绝对路径，下面我们使用相对路径，查找一下l相对路径下li标签下的a标签下的href属性的值，注意，a标签后面需要双//。

html = etree.HTML(wb_data)

html_data = html.xpath('//li/a//@href')

print(html_data)

for i in html_data:

print(i)

打印：['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

link1.html

link2.html

link3.html

link4.html

link5.html

9、相对路径下跟绝对路径下查特定属性的方法类似，也可以说相同。

html = etree.HTML(wb_data)

html_data = html.xpath('//li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]')

print(html_data)

for i in html_data:

print(i.text)

打印：[]

second item

10、查找最后一个li标签里的a标签的href属性

html = etree.HTML(wb_data)

html_data = html.xpath('//li[last()]/a/text()')

print(html_data)

for i in html_data:

print(i)

打印：['fifth item']

fifth item

11、查找倒数第二个li标签里的a标签的href属性

html = etree.HTML(wb_data)

html_data = html.xpath('//li[last()-1]/a/text()')

print(html_data)

for i in html_data:

print(i)

打印：['fourth item']

fourth item

12、如果在提取某个页面的某个标签的xpath路径的话，可以如下图：

//*[@id="kw"]

解释：使用相对路径查找所有的标签，属性id等于kw的标签。

常用

#!/usr/bin/env python

# -*- coding:utf-8 -*-

from scrapy.selector import Selector, HtmlXPathSelector

from scrapy.http import HtmlResponse

html = """

first item
first item
second itemvv

second item

"""

response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')

# hxs = HtmlXPathSelector(response)

# print(hxs)

# hxs = Selector(response=response).xpath('//a')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[2]')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[@id]')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[@id="i1"]')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[@href="link.html" rel="external nofollow" rel="external nofollow" ][@id="i1"]')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')

# print(hxs)

# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()

# print(hxs)

# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()

# print(hxs)

# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()

# print(hxs)

# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()

# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')

# for item in ul_list:

# v = item.xpath('./a/span')

# # 或

# # v = item.xpath('a/span')

# # 或

# # v = item.xpath('*/a/span')

# print(v)

python的xpath用法介绍_python爬虫之xpath的基本使用详解相关推荐

python爬虫常见报错_Python爬虫常见HTTP响应状态码详解
在使用Python进行网页数据抓取时,经常会遇到无数据返还或错误等异常,这个时候可以通过status_code命令来查看获得http请求返回的状态码,以便查找原因并制定相应的解决方案.import r ...
python多线程读取数据库数据_Python基于多线程操作数据库相关知识点详解
Python基于多线程操作数据库相关问题分析本文实例分析了Python多线程操作数据库相关问题.分享给大家供大家参考,具体如下: python多线程并发操作数据库,会存在链接数据库超时.数据库连接丢 ...
python在统计专业的应用_Python统计学一数据的概括性度量详解
一.数据的概括性度量 1.统计学概括: 统计学是应用数学的一个分支,主要通过利用概率论建立数学模型,收集所观察系统的数据,进行量化的分析.总结,并进而进行推断和预测,为相关决策提供依据和参考.统计学主 ...
python中4j什么意思_Python学习：4.数据类型以及运算符详解
运算符一.算数运算: 二.比较运算: 三.赋值运算四.逻辑运算五.成员运算基本数据类型一.Number(数字) Python3中支持int.float.bool.complex. 使用内置的 ...
python的常量和变量_python中的常量和变量代码详解
局部和全局变量: # name='lhf' # def change_name(): # # global name # name='帅了一比' # print('change_name',name) ...
python传入参数加星号_Python 带星号(* 或 **)的函数参数详解
1. 带默认值的参数在了解带星号(*)的参数之前,先看下带有默认值的参数,函数定义如下: >> def defaultValueArgs(common, defaultStr = &qu ...
python输入参数改变图形_Python基于Tensor FLow的图像处理操作详解
本文实例讲述了Python基于Tensor FLow的图像处理操作.分享给大家供大家参考,具体如下: 在对图像进行深度学习时,有时可能图片的数量不足,或者希望网络进行更多的学习,这时可以对现有的图片数 ...
python内置序列类型_Python序列内置类型之元组类型详解
Python序列内置类型之元组类型详解 1.元祖的概念 Python中的元组与列表类似,都是一个序列,不同的是元组的元素不能修改而已. 2.元组的创建元组使用小括号,列表使用方括号. tup = ( ...
python生成二维码_python生成二维码的实例详解
python生成二维码的实例详解版本相关操作系统:Mac OS X EI Caption Python版本:2.7 IDE:Sublime Text 3 依赖库 Python生成二维码需要的依赖库 ...

python的xpath用法介绍_python爬虫之xpath的基本使用详解

python的xpath用法介绍_python爬虫之xpath的基本使用详解相关推荐

最新文章

热门文章