Python crawler ---- (4. The Scrapy framework, official documentation and examples)


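Official documentation: http://doc.scrapy.org/en/latest/

GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy

I'll tidy up the rest later... off to buy dinner... -- 2014-08-20 19:29:20

Uh... my Sogou input method just broke, so I logged out and back in, and everything I had written was gone. It seems the draft box isn't all that reliable either!!!

Let me write it up again.

-- 2014-08-21 04:02:37

(I) Basics -- scrapy.spider.Spider

(1) Using the interactive shell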

dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
[s]   item       {}
[s]   request    <GET http://www.baidu.com/>
[s]   response   <200 http://www.baidu.com/>
[s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
[s]   spider     <Spider 'default' at 0xa78086c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> # response.body returns the full body of the response
>>> # response.xpath('//ul/li') lets you try out any XPath expression

More important, if you type response.selector you will access a selector object you can use to
query the response, and convenient shortcuts like response.xpath() and response.css() mapping to
response.selector.xpath() and response.selector.css()

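In other words, the shell makes it easy to check interactively whether an XPath expression selects the right thing. I used to pick elements with Firefox's F12 developer tools, but that does not guarantee the right content is selected every time.

You can also use:

scrapy shell 'http://scrapy.org' --nolog
# the --nolog option suppresses log output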

(2) Example

from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # each <li> inside a <ul> is treated as one entry
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

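(3) Saving to a file

Scraped items can be written to a file with the -o option; the format (-t) can be json, xml, or csv. The spider name (dmoz here) has to be given as well:

scrapy crawl dmoz -o a.json -t json

(4) Creating a spider from a template

scrapy genspider baidu baidu.com

This generates:

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass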

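That's it for this part. I remember having five items here before, but now I can only recall four. :-(

Always remember to hit the save button as you go. Otherwise it really ruins your mood (⊙o⊙)!

(II) Advanced -- scrapy.contrib.spiders.CrawlSpider

(1) CrawlSpider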

class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

rules
Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.

This spider also exposes an overrideable method:

parse_start_url(response)
This method is called for the start_urls responses. It allows to parse the initial responses and must return either a Item object, a Request object, or an iterable containing any of them.
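To make the "first matching rule wins" behaviour concrete, here is a small pure-Python sketch. It is not Scrapy's implementation: the RULES table, the first_matching_rule helper and the URLs are all made up for illustration, and each rule is reduced to an allow pattern, an optional deny pattern and a callback name.

```python
import re

# (allow pattern, deny pattern or None, callback name or None), in priority order
RULES = [
    (r'category\.php', r'subsection\.php', None),
    (r'item\.php', None, 'parse_item'),
]


def first_matching_rule(url, rules=RULES):
    """Return the index of the first rule that accepts url, or None."""
    for index, (allow, deny, _callback) in enumerate(rules):
        allowed = re.search(allow, url) is not None
        denied = deny is not None and re.search(deny, url) is not None
        if allowed and not denied:
            return index
    return None


both = first_matching_rule('http://www.example.com/category.php?next=item.php')
item_only = first_matching_rule('http://www.example.com/item.php?id=1')
nothing = first_matching_rule('http://www.example.com/about.php')
```

The first URL satisfies both allow patterns, yet only the rule listed first is chosen, which is why the order of the rules matters.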

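(2) Example

#coding=utf-8
# import paths as of Scrapy 0.24; later releases moved them to scrapy.spiders and scrapy.linkextractors
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class TestItem(scrapy.Item):
    # a bare scrapy.Item() declares no fields, so item['id'] = ... would raise KeyError
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']
    rules = (  # a tuple of Rule objects
        # follow links matching category.php but not subsection.php
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # pass pages matching item.php to parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page : %s' % response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item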

(3) Others

There are other spiders as well, such as XMLFeedSpider. I'll look into them when I have time.

class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider

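(III) Selectors

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

.css() and .xpath() can be combined freely to pick out the target data quickly.

!!! Selectors deserve careful study: xpath() and css(), plus more practice with regular expressions.

When selecting by class, prefer css() for the selection, then use xpath() to pick out the element's attributes.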

(IV) Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

Typical use for item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
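Pipeline components only run once they are enabled in the project's ITEM_PIPELINES setting, where the number attached to each class decides the order (lower values run first). A sketch of such a setting, assuming the pipeline classes from this section live in a scrapy_test/pipelines.py module (the module path and the order values are assumptions):

```python
# settings.py (sketch): values are conventionally chosen in the 0-1000 range
ITEM_PIPELINES = {
    'scrapy_test.pipelines.PricePipeline': 300,
    'scrapy_test.pipelines.Duplicates': 500,
    'scrapy_test.pipelines.JsonWriterPipeline': 800,
}

# the order in which the components would process each item
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Putting validation and duplicate checks before the writer means dropped items never reach the output file.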

(1) Validating data

from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            # return the item so the next pipeline component receives it
            return item
        else:
            raise DropItem('Missing price in %s' % item)

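(2) Writing a JSON file

import json


class JsonWriterPipeline(object):
    def __init__(self):
        # text mode, so the string from json.dumps can be written directly
        self.file = open('json.jl', 'w')

    def process_item(self, item, spider):
        # one JSON object per line
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()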
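The .jl file written by this pipeline holds one JSON object per line, so it can be read back line by line. A small sketch using only the standard library; read_json_lines is a hypothetical helper and the sample data is made up:

```python
import io
import json


def read_json_lines(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# stands in for open('json.jl') so the sketch runs without a crawl
sample = io.StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
items = list(read_json_lines(sample))
```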

(3) Checking for duplicates

from scrapy.exceptions import DropItem


class Duplicates(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item

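Writing the data to a database should be simple too: just store the item inside the process_item function.

I read all night and got to page 85. That more or less covers the basics.

-- 2014-08-21 06:39:41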
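As a rough sketch of that idea, here is a pipeline that stores items in SQLite through the standard-library sqlite3 module. The class name, the items.db file, the table layout and the id/name fields are all assumptions made for illustration.

```python
import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline that stores each item's id and name in SQLite."""

    def __init__(self, db_path='items.db'):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # called once when the spider starts
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)')

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps the latest row when an id repeats
        self.conn.execute(
            'INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)',
            (item['id'], item['name']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.conn.close()
```

Opening the connection in open_spider and closing it in close_spider keeps one connection for the whole crawl instead of reconnecting for every item.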

(V)

Reposted from: https://my.oschina.net/lpe234/blog/304880
