
Official documentation: http://doc.scrapy.org/en/latest/

GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy

I'll organize the rest later... off to buy dinner...       -- 2014-08-20 19:29:20



-- 2014-08-21 04:02:37

(1) Basics -- scrapy.spider.Spider


dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
[s]   item       {}
[s]   request    <GET http://www.baidu.com/>
[s]   response   <200 http://www.baidu.com/>
[s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
[s]   spider     <Spider 'default' at 0xa78086c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> # response.body returns the entire body of the response
>>> # response.xpath('//ul/li') lets you try out any XPath expression

More important, if you type response.selector you will access a selector object you can use to
query the response, and convenient shortcuts like response.xpath() and response.css() mapping to
response.selector.xpath() and response.selector.css()



scrapy shell 'http://scrapy.org' --nolog
# The --nolog option suppresses log output


from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # Each <li> inside a <ul> is treated as one entry.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            # extract() returns a list of the matched strings.
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item


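The command below saves the scraped items to a file. The format can be json, xml, or csv. The spider name (dmoz here) has to be given.

scrapy crawl dmoz -o 'a.json' -t 'json'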
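
Since every field in the spider above is filled with extract(), which returns a list, each field in the exported JSON is a list too. Below is a small sketch of reading such a file back and flattening it. The SAMPLE string is made up and stands in for the contents of a.json, and first_or_none is a helper name invented for this note.

```python
import json

# Made-up sample standing in for the contents of a.json.
SAMPLE = '''[
  {"title": ["Book A"], "link": ["/a"], "desc": ["  first book \\n"]},
  {"title": [], "link": ["/b"], "desc": []}
]'''


def first_or_none(values):
    """Return the first non-blank string in a list, stripped, or None."""
    for value in values:
        if value.strip():
            return value.strip()
    return None


# Turn every list-valued field into a single string (or None).
records = [
    dict((key, first_or_none(values)) for key, values in row.items())
    for row in json.loads(SAMPLE)
]
```

For a real export, json.load on the opened file would replace json.loads(SAMPLE).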


scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass

That's it for this part for now. I remember there being 5 of these before, but now I can only recall 4. :-(


(2) Advanced -- scrapy.contrib.spiders.CrawlSpider


class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

rules
    Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.

This spider also exposes an overrideable method:

parse_start_url(response)
    This method is called for the start_urls responses. It allows to parse the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import scrapy


class TestItem(scrapy.Item):
    # Fields must be declared; a bare scrapy.Item() rejects unknown keys.
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']
    rules = (  # tuple
        # Follow category pages, except subsection pages (no callback).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # Parse item pages with parse_item; the name must match the method.
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page : %s' % response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID:(\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
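
To convince myself how the allow/deny patterns and the "first matching rule wins" behaviour fit together, here is a rough stand-in written with plain re. It is not Scrapy's implementation, only an imitation of the idea. RULES and match_rule are names made up for this note, and the URLs in the usage lines are hypothetical.

```python
import re

# (allow pattern, deny pattern or None, callback name), in priority order.
# These mirror the two rules of the spider above.
RULES = [
    (r"category\.php", r"subsection\.php", None),
    (r"item\.php", None, "parse_item"),
]


def match_rule(url):
    """Return the index of the first rule that accepts the URL, else None."""
    for index, (allow, deny, _callback) in enumerate(RULES):
        if not re.search(allow, url):
            continue  # the allow pattern does not occur in the URL
        if deny and re.search(deny, url):
            continue  # the deny pattern vetoes the link for this rule
        return index
    return None


category = match_rule("http://www.example.com/category.php?id=1")
denied = match_rule("http://www.example.com/subsection.php?from=category.php")
item_page = match_rule("http://www.example.com/item.php?id=7")
other = match_rule("http://www.example.com/about.html")
```

The patterns are searched anywhere in the URL, which is why the second URL is allowed and then denied by the same rule.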


There are other spiders as well, such as XMLFeedSpider; I'll look into those when I have time.

class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider

(3) Selectors

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

.css() and .xpath() can be combined freely to pick out the target data quickly.

!!! Selectors deserve careful study: xpath() and css(), and I also need to keep getting more familiar with regular expressions.

When selecting by class, prefer css() for the selection, then use xpath() to pick out the element's attributes.

(4) Item Pipeline
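
One reason for the advice about classes: an XPath test like @class="book" is an exact string comparison, while a CSS class selector matches on whitespace-separated class names. The sketch below shows the difference using the standard library's ElementTree instead of Scrapy's Selector (a deliberate swap so it runs without Scrapy). The markup is made up, and has_class is a helper invented for this note that imitates what a selector such as li.book does.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, made up for this illustration.
HTML = """
<ul>
  <li class="book"><a href="/a">Book A</a></li>
  <li class="book featured"><a href="/b">Book B</a></li>
  <li class="ad"><a href="/x">Advert</a></li>
</ul>
""".strip()

root = ET.fromstring(HTML)

# Exact attribute match: misses the element with two class names.
exact = root.findall(".//li[@class='book']")


def has_class(element, name):
    """Imitate a CSS class selector: match on whitespace-separated tokens."""
    return name in element.get("class", "").split()


by_token = [li for li in root.iter("li") if has_class(li, "book")]

# Once the elements are selected, reading an attribute is the easy part.
exact_links = [li.find("a").get("href") for li in exact]
token_links = [li.find("a").get("href") for li in by_token]
```

In Scrapy itself the same two steps would be a css() call for the class followed by an xpath() call for the attribute.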

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

Typical uses for item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
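
The first two uses in the list (cleansing and validating) could look roughly like the sketch below. The class name, required_fields, and the tag-stripping regex are my own assumptions, not something from the Scrapy docs. The fallback DropItem class is only there so the sketch also runs where Scrapy is not installed, and the regex is a crude cleanup, not a real HTML parser.

```python
import re

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        pass

# Crude tag stripper; good enough for simple inline markup only.
TAG_RE = re.compile(r"<[^>]+>")


class CleanAndValidatePipeline(object):
    # Hypothetical list of fields every item must carry.
    required_fields = ("title", "link")

    def process_item(self, item, spider):
        # Validation: drop items with a missing or empty required field.
        for field in self.required_fields:
            if not item.get(field):
                raise DropItem("Missing %s in %s" % (field, item))
        # Cleansing: remove tags and surrounding whitespace from the title.
        item["title"] = TAG_RE.sub("", item["title"]).strip()
        return item


pipeline = CleanAndValidatePipeline()
cleaned = pipeline.process_item(
    {"title": " <b>Python</b> Books ", "link": "http://example.com/"}, None)

try:
    pipeline.process_item({"title": "", "link": "http://example.com/"}, None)
    dropped = False
except DropItem:
    dropped = True
```

Plain dicts are used as items here only to keep the usage lines short.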


from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            # Hand the item on to the next pipeline component.
            return item
        else:
            raise DropItem('Missing price in %s' % item)


import json


class JsonWriterPipeline(object):
    def __init__(self):
        # Text mode, since json.dumps returns a string.
        self.file = open('json.jl', 'w')

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines).
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item


from scrapy.exceptions import DropItem


class Duplicates(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item

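Writing the data to a database should be just as simple: store the item from inside the process_item function.

I read all night and got to page 85. That more or less covers the basics.

-- 2014-08-21 06:39:41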
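
A minimal sketch of that idea, using sqlite3 from the standard library. The class name, the table layout, and the id/name fields are all assumptions made for this note, and an in-memory database is the default only so the sketch leaves no file behind. open_spider and close_spider are the optional pipeline hooks Scrapy calls when a spider starts and finishes.

```python
import sqlite3


class SQLitePipeline(object):
    """Store each item as one row; the schema here is hypothetical."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: connect and create the table.
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)")

    def close_spider(self, spider):
        # Called when the spider finishes: flush and release the database.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Parameterized query; a repeated id overwrites the earlier row.
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)",
            (item["id"], item["name"]))
        return item


pipeline = SQLitePipeline()
pipeline.open_spider(None)
pipeline.process_item({"id": "1", "name": "first"}, None)
pipeline.process_item({"id": "1", "name": "updated"}, None)
pipeline.process_item({"id": "2", "name": "second"}, None)
rows = pipeline.conn.execute(
    "SELECT id, name FROM items ORDER BY id").fetchall()
pipeline.close_spider(None)
```

In a real project the pipeline would also have to be listed in the ITEM_PIPELINES setting, and the path would point at a file instead of memory.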



