Parsing Web Pages with XPath

Author: 金良 (golden1314521@gmail.com). CSDN blog: http://blog.csdn.net/u012176591

lxml documentation: http://lxml.de/index.html

1. The following example is taken from the blog post "Parsing HTML with lxml"

In [1]:

from lxml import etree

The text to parse:

In [4]:

html = '''<html>
　　<head>
　　　　<meta name="content-type" content="text/html; charset=utf-8" />
　　　　<title>友情链接查询 - 站长工具</title>
　　　　<!-- uRj0Ak8VLEPhjWhg3m9z4EjXJwc -->
　　　　<meta name="Keywords" content="友情链接查询" />
　　　　<meta name="Description" content="友情链接查询" />　　</head>
　　<body>
　　　　<h1 class="heading">Top News</h1>
　　　　<p style="font-size: 200%">World News only on this page</p>
　　　　Ah, and here's some more text, by the way.
　　　　<p>... and this is a parsed fragment ...</p>　　　　<a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">青少年发展基金会</a>
　　　　<a href="http://www.4399.com/flash/32979.htm" target="_blank">洛克王国</a>
　　　　<a href="http://www.4399.com/flash/35538.htm" target="_blank">奥拉星</a>
　　　　<a href="http://game.3533.com/game/" target="_blank">手机游戏</a>
　　　　<a href="http://game.3533.com/tupian/" target="_blank">手机壁纸</a>
　　　　<a href="http://www.4399.com/" target="_blank">4399小游戏</a>
　　　　<a href="http://www.91wan.com/" target="_blank">91wan游戏</a>　　</body>
</html>'''

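A note before using lxml: make sure the HTML has first been decoded from UTF-8 into a unicode string, i.e. code = html.decode('utf-8', 'ignore'); otherwise the Chinese text may come out garbled. Given a raw byte string, the parser has to guess the encoding, and a wrong guess mangles non-ASCII characters such as Chinese.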
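
The point about decoding can be sketched as follows. This is a minimal illustration written for Python 3 (the notebook cells themselves are Python 2); the sample markup and the variable names are made up for the example. It shows two ways to avoid leaving the encoding to guesswork: decode the bytes yourself, or keep the bytes and state the encoding on the parser.

```python
from lxml import etree

# Bytes as they might arrive from a file or a socket (illustrative sample).
raw = '<html><head><title>友情链接查询</title></head><body></body></html>'.encode('utf-8')

# Option 1: decode to text first, then parse the text.
page_from_text = etree.HTML(raw.decode('utf-8', 'ignore'))

# Option 2: keep the bytes and state the encoding explicitly on the parser.
parser = etree.HTMLParser(encoding='utf-8')
page_from_bytes = etree.HTML(raw, parser=parser)

title_a = page_from_text.xpath('//title')[0].text
title_b = page_from_bytes.xpath('//title')[0].text
print(title_a == title_b)
```

Either way the parser no longer has to infer the encoding from the byte stream.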

In [47]:

page = etree.HTML(html.decode('utf-8'))
hrefs = page.xpath(u"//a")  # finds every <a> element in the whole HTML document
for href in hrefs:
    print href.attrib['href']
for href in hrefs:
    print href.text

http://www.cydf.org.cn/
http://www.4399.com/flash/32979.htm
http://www.4399.com/flash/35538.htm
http://game.3533.com/game/
http://game.3533.com/tupian/
http://www.4399.com/
http://www.91wan.com/
青少年发展基金会
洛克王国
奥拉星
手机游戏
手机壁纸
4399小游戏
91wan游戏
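
As a variation on the cell above, XPath can return the attribute values and the link text directly instead of element objects. The following is a small sketch written for Python 3 (an assumption, since the notebook runs on Python 2); the shortened snippet and the variable names are illustrative only.

```python
from lxml import etree

snippet = '''<html><body>
<a href="http://www.4399.com/" target="_blank">4399小游戏</a>
<a href="http://www.91wan.com/" target="_blank">91wan游戏</a>
</body></html>'''

page = etree.HTML(snippet)

urls = page.xpath('//a/@href')      # attribute values, returned as strings
labels = page.xpath('//a/text()')   # text nodes, returned as strings

for url, label in zip(urls, labels):
    print(url, label)
```

The strings lxml returns here are subclasses of the built-in string type, so they can be compared and printed like ordinary strings.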

The types of the objects that appeared while parsing the HTML above:

In [30]:

print type(hrefs)
print type(href)
print type(href.text)
print type(href.attrib)

<type 'list'>
<type 'lxml.etree._Element'>
<type 'unicode'>
<type 'lxml.etree._Attrib'>

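To filter the results, add a condition in square brackets, using "@" to refer to an attribute, as in [@style='...']. The same works for @name, @id, @value, @href, @src, @class and so on.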

In [48]:

p = page.xpath(u"/html/body/p[@style='font-size: 200%']")
# "/" separates the levels of the hierarchy; the leading "/" denotes the document root.
print p[0].values()
print p[0].text

['font-size: 200%']
World News only on this page

Or:

In [49]:

p = page.xpath(u"//p[@style='font-size: 200%']")
print p[0].values()
print p[0].text

['font-size: 200%']
World News only on this page
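
Beyond exact matches, attribute predicates can also test for the presence of an attribute or for part of its value. The sketch below is written for Python 3 (an assumption; the notebook is Python 2) and uses a shortened, made-up snippet. contains() and starts-with() are standard XPath 1.0 functions.

```python
from lxml import etree

snippet = '''<html><body>
<a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">cydf</a>
<a href="http://www.4399.com/flash/32979.htm" target="_blank">roco</a>
<a href="http://game.3533.com/game/" target="_blank">games</a>
</body></html>'''
page = etree.HTML(snippet)

# Exact match on an attribute value.
nofollow = page.xpath("//a[@rel='nofollow']/@href")
# The attribute merely has to be present, whatever its value.
with_rel = page.xpath("//a[@rel]")
# Substring and prefix tests on an attribute value.
from_4399 = page.xpath("//a[contains(@href, '4399.com')]/text()")
game_sub = page.xpath("//a[starts-with(@href, 'http://game.')]/text()")

print(nofollow, len(with_rel), from_4399, game_sub)
```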

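Positional predicates: note that the index starts at 1, not 0. A step such as a[3] selects each a element that is the third a child of its parent; here all the links are siblings under body, so it yields the third link on the page.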

In [53]:

hrefs = page.xpath(u"//a[3]")  # the index is 1-based
print hrefs[0].attrib

{'href': 'http://www.4399.com/flash/35538.htm', 'target': '_blank'}
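
One subtlety of positional predicates is that the position is counted among siblings, not across the whole document. The following sketch, written for Python 3 with a made-up two-list snippet, contrasts //li[1] with (//li)[1] and shows last().

```python
from lxml import etree

snippet = '''<html><body>
<ul><li>a1</li><li>a2</li></ul>
<ul><li>b1</li><li>b2</li></ul>
</body></html>'''
page = etree.HTML(snippet)

# First li under each parent: one hit per list.
first_per_list = page.xpath('//li[1]/text()')
# Parentheses collect all li elements first, then pick the first overall.
first_overall = page.xpath('(//li)[1]/text()')
# last() gives the position of the final sibling.
last_per_list = page.xpath('//li[last()]/text()')

print(first_per_list, first_overall, last_per_list)
```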

The asterisk * stands in for any element name.

In [56]:

metas = page.xpath(u"/html/*/meta")
for meta in metas:print meta.attrib
for meta in metas:print meta.attrib['name']

{'content': 'text/html; charset=utf-8', 'name': 'content-type'}
{'content': u'\u53cb\u60c5\u94fe\u63a5\u67e5\u8be2', 'name': 'Keywords'}
{'content': u'\u53cb\u60c5\u94fe\u63a5\u67e5\u8be2', 'name': 'Description'}
content-type
Keywords
Description
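
The wildcard also combines with predicates, which is handy when the tag name is unknown but an attribute is not. A small Python 3 sketch on a made-up snippet (names are illustrative):

```python
from lxml import etree

snippet = '''<html>
<head><meta name="Keywords" content="k" /><title>t</title></head>
<body><h1 class="heading">Top News</h1><p>text</p></body>
</html>'''
page = etree.HTML(snippet)

# Every element child of body, whatever its tag.
body_children = [el.tag for el in page.xpath('/html/body/*')]
# Any element anywhere that carries a name attribute.
named = [el.tag for el in page.xpath('//*[@name]')]
# Any element anywhere that carries a class attribute.
classed = [el.tag for el in page.xpath('//*[@class]')]

print(body_children, named, classed)
```

Note that * matches elements only; text nodes are selected with text() instead.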

2. The following example is taken from the blog post "python lxml xpath usage examples"

In [57]:

import lxml.html

In [59]:

html='''
<html>
<body>
<bookstore position="cn">
    <book category="A">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="B">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
</bookstore>
<bookstore position="pk">
    <book category="A">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>
<bookstore position="jp">
    <book category="C">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
    </book>
</bookstore>
</body>
</html>
'''

In [61]:

doc = lxml.html.document_fromstring(html)

In [62]:

print "There are %d books in total" % (len(doc.xpath('/html/body/bookstore/book')))

There are 4 books in total

In [63]:

print "%d books were published in 2005" % (len(doc.xpath('/html/body/bookstore/book[year=2005]')))

2 books were published in 2005

In [66]:

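print "Books published in 2003 are in: %s" % (" ".join([i.get("position") for i in doc.xpath('/html/body/bookstore/book[year=2003]/parent::*')]))
# get("position") returns the value of the position attribute.
# parent::* selects the parent element, whatever its tag name.

Books published in 2003 are in: pk jp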
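
The parent axis used above has two common equivalents: the ".." abbreviation, and a predicate on the parent itself. The sketch below is written for Python 3 and, as a simplification, parses a reduced made-up document as XML with etree.fromstring rather than through lxml.html.

```python
from lxml import etree

xml = '''<root>
<bookstore position="cn"><book><year>2005</year></book></bookstore>
<bookstore position="pk"><book><year>2003</year></book></bookstore>
<bookstore position="jp"><book><year>2003</year></book></bookstore>
</root>'''
doc = etree.fromstring(xml)

# Walk down to the matching books, then step back up to the parent.
via_axis = [b.get('position') for b in doc.xpath('//book[year=2003]/parent::*')]
# ".." is shorthand for parent::node().
via_dots = [b.get('position') for b in doc.xpath('//book[year=2003]/..')]
# Or never leave the bookstore: filter it by what it contains.
via_pred = [b.get('position') for b in doc.xpath('//bookstore[book/year=2003]')]

print(via_axis, via_dots, via_pred)
```

The predicate form is often the easiest to read, since the selected node is the last step of the path.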

In [67]:

price = doc.xpath("//bookstore/book[title='Harry Potter']/price")
print(price[0].text)

29.99
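
Child values can be compared numerically as well as by string equality, and an XPath expression may return a number rather than a node list. A Python 3 sketch on a reduced made-up bookstore, parsed as XML with etree.fromstring for simplicity:

```python
from lxml import etree

xml = '''<bookstore>
<book><title>Everyday Italian</title><price>30.00</price></book>
<book><title>Harry Potter</title><price>29.99</price></book>
<book><title>Learning XML</title><price>39.95</price></book>
</bookstore>'''
doc = etree.fromstring(xml)

# Numeric comparison: the price text is converted to a number by XPath.
pricey = doc.xpath('//book[price>35]/title/text()')
# count() returns a float, not a list of elements.
n_books = doc.xpath('count(//book)')
# .text is always a string, so convert it before doing arithmetic in Python.
hp_price = float(doc.xpath("//book[title='Harry Potter']/price")[0].text)

print(pricey, n_books, hp_price)
```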

3. Parsing a live web page

In [71]:

import requests

r = requests.get('https://www.python.org')

In [78]:

doc = lxml.html.document_fromstring(r.content)

In [83]:

ps = doc.xpath('/html/body/div/div/nav/ul/li/a')

In [85]:

for p in ps:print p.text

Python
PSF
Docs
PyPI
Jobs
Community

4. Parsing follower relationships on cnblogs (博客园)

In [ ]: