Common XPath operations

1. Get all the text under a node:

data = response.xpath('//div[@id="zoomcon"]')
content = ''.join(data.xpath('string(.)').extract())

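This code collects all the text inside the div with the given id (zoomcon here), including the text of its nested elements. An example page: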

http://www.nhfpc.gov.cn/fzs/s3576/200804/cdbda975a377456a82337dfe1cf176a1.shtml
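To make concrete what string(.) returns, here is a small sketch of the same idea using only the Python 3 standard library. It assumes a made-up HTML fragment (SAMPLE) and a hypothetical helper all_text. ElementTree's XPath subset has no string() function, so itertext() stands in for it: both concatenate every descendant text node in document order.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; only the id "zoomcon" comes from the post.
SAMPLE = """<html><body>
<div id="zoomcon"><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></div>
<div id="other">ignored</div>
</body></html>"""


def all_text(root, element_id):
    """Return the concatenated text of the div with the given id, or ""."""
    node = root.find(".//div[@id='%s']" % element_id)
    if node is None:
        return ""
    # itertext() walks the node and all descendants in document order,
    # which is what XPath's string(.) does on the context node.
    return "".join(node.itertext())


root = ET.fromstring(SAMPLE)
content = all_text(root, "zoomcon")
print(content)
```

Note that the text of the nested b element is included and that nothing is inserted between the two paragraphs, so any whitespace cleanup has to happen afterwards.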

2. Get the value of an HTML node's attribute

>>> response.xpath("//div[@id='zoomtime']").extract()
[u'<div class="content_subtitle" id="zoomtime" title="\u53d1\u5e03\u65e5\u671f\uff1a2010-10-26"><span>\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd\u56fd\u5bb6\u536b\u751f\u548c\u8ba1\u5212\u751f\u80b2\u59d4\u5458\u4f1a</span><span class="wzurl_tt" style="margin-left:10px;"></span><span style="margin-left:10px;">2010-10-26</span>\r\n                <span style="margin-left:30px;"></span> </div>']
>>> response.xpath("//div[@id='zoomtime']/@title").extract()
[u'\u53d1\u5e03\u65e5\u671f\uff1a2010-10-26']

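Here we want the value of the title attribute of the element with a given id. Appending /@title to the path selects the attribute instead of the element.

Structure of the Scrapy project: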
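The title value above still carries a label before the date ("发布日期：" means "publication date:"), separated by a full-width colon. The following is a sketch, not the original spider code, of reading the attribute and turning it into a date with the Python 3 standard library. The helper name publish_date and the trimmed SAMPLE markup are assumptions; ElementTree has no /@title step, so the attribute is read with get().

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed, invented version of the markup shown in the shell session above.
SAMPLE = (
    '<div><div class="content_subtitle" id="zoomtime" '
    'title="发布日期：2010-10-26"><span>2010-10-26</span></div></div>'
)


def publish_date(root):
    """Return the date found in the zoomtime title attribute, or None."""
    node = root.find(".//div[@id='zoomtime']")
    if node is None:
        return None
    title = node.get("title")  # same value that /@title selects in XPath
    if not title:
        return None
    # Keep only what follows the full-width colon.
    _, _, date_part = title.partition("：")
    try:
        return datetime.strptime(date_part.strip(), "%Y-%m-%d")
    except ValueError:
        return None


print(publish_date(ET.fromstring(SAMPLE)))
```

Using partition() rather than split(...)[1] avoids an IndexError when a page's title attribute has no colon.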

nhfpc.py

# -*- coding: utf-8 -*-
# Written for Python 2 and an early Scrapy release (scrapy.contrib.* import paths).
import scrapy
import sys
import hashlib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from datetime import datetime
from common_lib import insertDB

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('utf-8')


class NhfpcItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    size = scrapy.Field()
    dateTime = scrapy.Field()


class NhfpcSpider(CrawlSpider):
    name = "nhfpc"
    allowed_domains = ["nhfpc.gov.cn"]
    start_urls = (
        'http://www.nhfpc.gov.cn/fzs/pzcfg/list.shtml',
        'http://www.nhfpc.gov.cn/fzs/pzcfg/list_2.shtml',
        'http://www.nhfpc.gov.cn/fzs/pzcfg/list_3.shtml',
        'http://www.nhfpc.gov.cn/fzs/pzcfg/list_4.shtml',
        'http://www.nhfpc.gov.cn/fzs/pzcfg/list_5.shtml',
        'http://www.nhfpc.gov.cn/fzs/pzcfg/list_6.shtml',
        'http://www.nhfpc.gov.cn/fzs/pzcfg/list_7.shtml',
    )
    rules = (
        # Article URLs contain a six-digit segment such as /200804/.
        Rule(LinkExtractor(allow=r'.*\d{6}/.*'), callback='parse_item'),
        Rule(LinkExtractor(allow='.*201307.*'), follow=True),
    )

    def parse_item(self, response):
        # Title: try the usual id first, then the second id the original code queries.
        retList = response.xpath("//div[@id='zoomtitle']/*/text()").extract()
        if len(retList) == 0:
            retList = response.xpath("//div[@id='zoomtitl']/*/text()").extract()
        title = retList[0].strip() if retList else ""

        # Body: all text under the content div (two known ids).
        data = response.xpath('//div[@id="zoomcon"]')
        if len(data) == 0:
            data = response.xpath('//div[@id="contentzoom"]')
        content = ''.join(data.xpath('string(.)').extract())

        # Publication time: title attribute first, <ucmspubtime> as fallback.
        pubTime = "1970-01-01 00:00:00"
        time = response.xpath("//div[@id='zoomtime']/@title").extract()
        if len(time) == 0:
            time = response.xpath("//ucmspubtime/text()").extract()
        else:
            time = ''.join(time).split(u"：")[1]
        if time:
            pubTime = ''.join(time) + " 00:00:00"

        insertTime = datetime.now()
        webSite = "nhfpc.gov.cn"
        md5Url = hashlib.md5(response.url.encode('utf-8')).hexdigest()

        # Order matches the non-key columns of health_policy.
        values = [title, md5Url, pubTime, insertTime, webSite, content, response.url]
        insertDB(values)
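Two details of the spider are easy to check outside Scrapy: which URLs the first rule's allow pattern accepts, and what the MD5 key stored for each page looks like. The sketch below (Python 3, standard library only) assumes that LinkExtractor applies allow patterns as a regex search on the absolute URL, which is how I understand it to behave. The helper names is_article_url and url_key are mine, not part of the project.

```python
import hashlib
import re

# Same pattern as the first Rule in the spider.
ARTICLE_PATTERN = re.compile(r".*\d{6}/.*")


def is_article_url(url):
    """True if the URL contains a six-digit segment followed by a slash."""
    return ARTICLE_PATTERN.search(url) is not None


def url_key(url):
    """Hex MD5 digest of the URL, as stored in the md5url column."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


article = ("http://www.nhfpc.gov.cn/fzs/s3576/200804/"
           "cdbda975a377456a82337dfe1cf176a1.shtml")
listing = "http://www.nhfpc.gov.cn/fzs/pzcfg/list.shtml"

print(is_article_url(article), is_article_url(listing))
print(len(url_key(article)))
```

The digest is always 32 hexadecimal characters and the same URL always yields the same key, which makes it usable for spotting pages that were already stored.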

common_lib.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
This file includes all the common routines needed by
the crawler project.
Author: Justnzhang @(uestczhangchao@qq.com)
Time: 2014-07-28 15:03:44
'''
import os
import sys
import MySQLdb
from urllib import quote, unquote
import uuid

# Python 2 only: make utf-8 the default codec.
reload(sys)
sys.setdefaultencoding('utf-8')


def insertDB(dictData):
    # dictData is a list of seven values in column order:
    # title, md5url, pub_time, inser_time, website, content, url
    print "insertDB"
    print dictData
    try:
        conn_local = MySQLdb.connect(host='192.168.30.7', user='xxx',
                                     passwd='xxx', db='xxx', port=3306)
        conn_local.set_character_set('utf8')
        cur_local = conn_local.cursor()
        cur_local.execute('SET NAMES utf8;')
        cur_local.execute('SET CHARACTER SET utf8;')
        cur_local.execute('SET character_set_connection=utf8;')
        # NULL lets MySQL assign the AUTO_INCREMENT hid; the posted version
        # inserted six hard-coded test values here instead of dictData.
        cur_local.execute(
            "insert into health_policy values(NULL,%s,%s,%s,%s,%s,%s,%s)",
            dictData)
        conn_local.commit()
        cur_local.close()
        conn_local.close()
    except MySQLdb.Error as e:
        print "Mysql Error %d: %s" % (e.args[0], e.args[1])


if __name__ == '__main__':
    # Dummy row for a quick manual test.
    values = ["2", "3", "2014-04-11 00:00:00", "2014-04-11 00:00:00",
              "6", "7", "8"]
    insertDB(values)

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for health_policy
-- ----------------------------
DROP TABLE IF EXISTS `health_policy`;
CREATE TABLE `health_policy` (
  `hid` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(1000) DEFAULT NULL COMMENT 'Policy title',
  `md5url` varchar(1000) NOT NULL COMMENT 'MD5 hash of the URL',
  `pub_time` datetime DEFAULT NULL COMMENT 'Publication time',
  `inser_time` datetime NOT NULL COMMENT 'Insertion time',
  `website` varchar(1000) DEFAULT NULL COMMENT 'Source website',
  `content` longtext COMMENT 'Policy content',
  `url` varchar(1000) DEFAULT NULL,
  PRIMARY KEY (`hid`)
) ENGINE=InnoDB AUTO_INCREMENT=594 DEFAULT CHARSET=utf8;
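The seven values the spider passes to insertDB line up with the seven non-key columns of health_policy. To try the insert without a MySQL server, here is a rough sketch that swaps in Python 3's built-in sqlite3 module. This is a substitution, not the project's code: the column types are simplified, sqlite3 uses ? placeholders where MySQLdb uses %s, and create_table and insert_policy are hypothetical helper names.

```python
import sqlite3
from datetime import datetime

# Non-key columns of health_policy, in the order the spider fills them.
COLUMNS = ("title", "md5url", "pub_time", "inser_time",
           "website", "content", "url")


def create_table(conn):
    """Create a simplified SQLite version of the health_policy table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS health_policy (
               hid INTEGER PRIMARY KEY AUTOINCREMENT,
               title TEXT, md5url TEXT NOT NULL, pub_time TEXT,
               inser_time TEXT NOT NULL, website TEXT,
               content TEXT, url TEXT)""")


def insert_policy(conn, values):
    """Insert one row with bound parameters and return its hid."""
    if len(values) != len(COLUMNS):
        raise ValueError("expected %d values, got %d"
                         % (len(COLUMNS), len(values)))
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = "INSERT INTO health_policy (%s) VALUES (%s)" % (
        ", ".join(COLUMNS), placeholders)
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(sql, values)
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
create_table(conn)
row_id = insert_policy(conn, [
    "Sample title", "0" * 32, "2010-10-26 00:00:00",
    datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "nhfpc.gov.cn", "Sample content", "http://www.nhfpc.gov.cn/",
])
print(row_id)
```

Naming the columns in the INSERT statement means a wrong number of values fails immediately with a clear error, instead of depending on the table's column order.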

Reposted from: https://www.cnblogs.com/justinzhang/p/4482170.html
