[Crawling in Practice] A First Run with the Scrapy Framework: Scraping Douban Movie Data
This write-up is about my first real use of the Scrapy framework. Most of the time, when I only need a modest amount of experimental data, a few calls with simple libraries such as requests are enough to fetch it.
This time I wanted to go through Scrapy properly: working through it once is good preparation for later experiments, and along the way I can record the key points and the places where it is easy to go wrong.
| Environment | Version |
| ----------- | ------- |
| Python      | 3.6.3   |
| Scrapy      | 1.5.1   |
Those are the only environment requirements. For this walkthrough I work inside PyCharm rather than the bare black cmd window, which is simply too much hassle; besides, once you have some basics down, you can do everything from the IDE, saving time and effort without losing anything. An extra convenience: if you create the new Scrapy project inside your current project, then when you open PyCharm's built-in terminal, the current path is already your project path!
Step 1: create and start a new Scrapy project!
1. Open PyCharm and open the terminal at the bottom, as shown in the figure.
2. In the current path, type the following command:
scrapy startproject <scrapyProjectName>
Here scrapyProjectName is the name you give the project. It matters, because you must use the same name later when launching the crawler. Since the target data set is Douban movies, let's name this project douban:
scrapy startproject douban
This generates the following project structure:
Step 2: following the friendly hint from step 1, create a Spider!
This step is simple too. See the ./douban/douban/spiders path? Notice that the folder is named spiders, plural: a project can in fact hold many spiders. Since all we want right now is movie data, let's create a single spider named movie.
The command syntax was shown in the second figure of step 1.
Again, type this in the terminal:
cd douban
scrapy genspider movie movie.douban.com --template=crawl
In the command above, movie is the name of our spider, and the second argument (shown as example.com in the usage hint) is the entry URL, or rather the entry domain, that the spider will crawl.
The result is a movie.py file generated under the spiders folder: our custom spider based on the crawl template.
Once the movie spider file exists, you will find there are really only three things in it to pay attention to: start_urls, rules, and the parse_item function.
start_urls lists the pages where crawling begins; here we start from Douban's Top 250.
rules defines which links found on those pages should be followed.
parse_item does the job of extracting the data from each crawled page.
So the code in movie.py looks like this:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from douban.items import DoubanItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    rules = (
        # List pages: follow them, but extract nothing (no callback)
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*',))),
        # Detail pages: hand each one to parse_item
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+',)), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        item['score'] = sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
        item['director'] = sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
        return item
The crucial part above is the rules setting. Only the second rule is given a callback. Look at the URL of any individual Douban movie entry and you will see why parse_item is set as the callback on the second rule but not the first:
the first pattern matches a whole list page,
while the second pattern matches one specific entry, and only through it can we reach all the information about a single title!
(Both allow patterns are raw-string regular expressions, by the way. Regex really is powerful, isn't it?)
Notice that the code imports items from the Scrapy project. An item defines the fields of the data we produce: several named attributes, much like a JSON record. So we also need to edit douban/douban/items.py:
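To see what the two patterns actually distinguish, they can be checked in isolation with Python's re module (the URLs below are taken from the crawl log later in this article):

```python
import re

# The two allow patterns from the rules above
list_page = re.compile(r'https://movie.douban.com/top250\?start=\d+.*')
detail_page = re.compile(r'https://movie.douban.com/subject/\d+')

print(bool(list_page.match('https://movie.douban.com/top250?start=25&filter=')))    # True
print(bool(detail_page.match('https://movie.douban.com/subject/1292052/')))         # True
# A list page does not match the detail pattern, so parse_item is never called on it
print(bool(detail_page.match('https://movie.douban.com/top250?start=25&filter=')))  # False
```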
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DoubanItem(scrapy.Item):
    name = scrapy.Field()
    year = scrapy.Field()
    score = scrapy.Field()
    director = scrapy.Field()
    classification = scrapy.Field()
    actor = scrapy.Field()
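A DoubanItem behaves like a dict with a fixed set of keys. Because Selector.extract() and .re() always return lists of strings, every field of a scraped item holds a list. A plain dict can mimic the shape of one record (the values below are illustrative, not real scraped data):

```python
# Hypothetical record in the shape that parse_item produces: every field is a
# list, because Selector.extract() / .re() return lists of strings.
record = {
    "name": ["肖申克的救赎 The Shawshank Redemption"],
    "year": ["1994"],
    "score": ["9.7"],
    "director": ["弗兰克·德拉邦特"],
    "classification": ["剧情", "犯罪"],
    "actor": ["蒂姆·罗宾斯"],
}

# Take element [0] when a scalar is wanted, as the pipeline later does
print(record["name"][0])
print(record["year"][0])
```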
A note on paths. I did not originally plan to dwell on this, but it is the thing we beginners most often get wrong: not knowing which directory to launch or generate from is a real headache, so let me spell it out.
When Scrapy generates a new project, the first-level directory has the same name as the second-level directory, which is exactly what causes the confusion about where to launch things. From my own experience, plus pointers shared by more experienced people: the first level is basically just a wrapper that keeps projects apart; the second level is the real package. So when launching the Scrapy project, go into the second-level directory.
One look at settings.py confirms it:
See? If the first-level directory really counted, SPIDER_MODULES would have to read douban.douban.spiders!
That was a digression; take it as a bit of color, and feel free to disagree and discuss.
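The two-level layout in question looks roughly like this (the standard scrapy startproject output; only the files used in this article are shown):

```
douban/                  <- level 1: outer wrapper created by startproject
├── scrapy.cfg           <- project config; scrapy locates it by searching upward
└── douban/              <- level 2: the actual Python package (douban.spiders, ...)
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── movie.py
```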
Step 3: at this point we can already give the Scrapy project a full run
With the spider from step 2 in place (movie.py under the spiders folder), we can try running movie.
Again, type it in the terminal! (Every command in a Scrapy project is executed from the terminal.)
Use the following command:
scrapy crawl movie -o result.json
The -o (lower-case letter o) means output: the scraped items are written to result.json in the current path.
The generated result.json file lands in the current first-level douban folder.
This step is really a test of whether the spider works, laying the groundwork for persisting the data next. If you do not need a full crawl, you could even stop right here!
======================================================================
Just kidding; a good kid finishes what it starts. Onward!
The intermediate output above is not the main point. Let's look instead at what each component in the overall Scrapy framework does.
Components
1. Engine: controls the data-processing flow of the whole system.
2. Scheduler: accepts requests from the engine, orders them into a queue, and hands them back when the engine asks for them.
3. Downloader: fetches web pages and returns their content to the spiders.
4. Spiders: classes defined by the Scrapy user to parse pages and extract content returned for specific URLs. Each spider can handle one domain or a group of domains; put simply, spiders define the crawling and parsing rules for particular sites.
5. Item Pipeline: responsible for processing the items the spiders extract from pages; its main tasks are cleaning, validating, and storing the data. After a page is parsed by a spider, its items are sent to the pipeline and processed through several stages in a defined order. Each pipeline component is a Python class that receives items, runs its processing on them, and decides whether each item continues to the next stage or is dropped. Typical pipeline tasks include cleaning HTML data, validating parsed data (checking that items contain the required fields), checking for duplicates (and dropping them), and storing the parsed data in a database (relational or NoSQL).
6. Middlewares: hook frameworks sitting between the engine and the other components, there to extend Scrapy's functionality with custom code; they come as downloader middlewares and spider middlewares.
We have set up the item and the movie spider, but don't forget one more piece: the pipeline. Only then is the Scrapy project truly complete!
Step 4: set up pipelines.py, and persist the data in it!
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from scrapy.exceptions import DropItem


class DoubanPipeline(object):

    def open_spider(self, spider):
        self.file = open('resultself.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Remove invalid data: drop any item missing the scalar fields we rely on
        for field in ('name', 'year'):
            if not item.get(field):
                raise DropItem("Missing %s in %s" % (field, item))
        # Flatten the item and append it to the output file as one JSON line
        new_movie = [{
            "name": item['name'][0],
            "year": item['year'][0],
            "score": item['score'],
            "director": item['director'],
            "classification": item['classification'],
            "actor": item['actor'],
        }]
        self.file.write(json.dumps(new_movie) + "\n")
        return item
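The validation-and-flatten logic in process_item can be checked outside Scrapy with a stand-in DropItem (a quick sketch, not the real scrapy.exceptions class; to_record is a hypothetical helper mirroring the pipeline):

```python
import json

# Stand-in for scrapy.exceptions.DropItem, so the logic runs without Scrapy
class DropItem(Exception):
    pass

def to_record(item):
    # Mirror of process_item: require the scalar fields, then flatten them
    for field in ('name', 'year'):
        if not item.get(field):
            raise DropItem("Missing %s in %s" % (field, item))
    return {"name": item['name'][0], "year": item['year'][0],
            "score": item.get('score', []), "actor": item.get('actor', [])}

print(json.dumps(to_record({"name": ["无间道"], "year": ["2002"]}), ensure_ascii=False))

try:
    to_record({"name": ["触不可及"], "year": []})   # empty year -> dropped
except DropItem:
    print("dropped")
```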
Then type in the terminal:
scrapy crawl movie
You can watch the crawl progress in the terminal while also checking the generated text file. Part of the terminal output follows:
F:\PyCharm\MyCodes\Python100Days\douban\douban>scrapy crawl movie
2019-06-04 17:45:13 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: douban)
2019-06-04 17:45:13 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (
AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17134-SP0
2019-06-04 17:45:13 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'douban', 'DOWNLOAD_DELAY': 3, 'HTTPCACHE_ENABLED': True, 'NEWSPIDER_MODULE': 'douban.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['douban.spid
ers'], 'USER_AGENT': 'safari 5.1 – MAC User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
2019-06-04 17:45:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.logstats.LogStats']
2019-06-04 17:45:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware','scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats','scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2019-06-04 17:45:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-06-04 17:45:13 [py.warnings] WARNING: F:\PyCharm\MyCodes\Python100Days\douban\douban\pipelines.py:16: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log
2019-06-04 17:45:13 [scrapy.middleware] INFO: Enabled item pipelines:
['douban.pipelines.DoubanPipeline']
2019-06-04 17:45:13 [scrapy.core.engine] INFO: Spider opened
2019-06-04 17:45:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-04 17:45:13 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in F:\PyCharm\MyCodes\Python100Days\douban\.scrapy\httpcache
2019-06-04 17:45:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/robots.txt> (referer: None) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250> (referer: None) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=25&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=225&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=200&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=175&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=150&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=125&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=100&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=75&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/top250?start=50&filter=> (referer: https://movie.douban.com/top250) ['cached']
2019-06-04 17:45:13 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://movie.douban.com/top250?start=25&filter=#more> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-06-04 17:45:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/subject/1292052/> (referer: https://movie.douban.com/top250)
2019-06-04 17:45:16 [scrapy.core.scraper] ERROR: Error processing {'actor': [], 'classification': ['剧情', '犯罪'], 'director': [], 'name': ['肖申克的救赎 The Shawshank Redemption'], 'score': [], 'year': ['1994']}
2019-06-04 17:45:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/subject/6786002/> (referer: https://movie.douban.com/top250)
2019-06-04 17:45:17 [scrapy.core.scraper] ERROR: Error processing {'actor': [], 'classification': ['剧情', '喜剧'], 'director': [], 'name': ['触不可及 Intouchables'], 'score': [], 'year': ['2011']}
2019-06-04 17:45:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/subject/3319755/> (referer: https://movie.douban.com/top250)
2019-06-04 17:45:18 [scrapy.core.scraper] ERROR: Error processing {'actor': [], 'classification': ['剧情', '喜剧', '爱情'], 'director': [], 'name': ['怦然心动 Flipped'], 'score': [], 'year': ['2010']}
2019-06-04 17:45:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/subject/1849031/> (referer: https://movie.douban.com/top250)
2019-06-04 17:45:23 [scrapy.core.scraper] ERROR: Error processing {'actor': [], 'classification': ['剧情', '家庭', '传记'], 'director': [], 'name': ['当幸福来敲门 The Pursuit of Happyness'], 'score': [], 'year': ['2006']}
2019-06-04 17:45:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/subject/25662329/> (referer: https://movie.douban.com/top250)
2019-06-04 17:45:24 [scrapy.core.scraper] ERROR: Error processing {'actor': [], 'classification': ['喜剧', '动画', '冒险'], 'director': [], 'name': ['疯狂动物城 Zootopia'], 'score': [], 'year': ['2016']}
2019-06-04 17:45:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/subject/1307914/> (referer: https://movie.douban.com/top250)
2019-06-04 17:45:29 [scrapy.core.scraper] ERROR: Error processing {'actor': [], 'classification': ['剧情', '悬疑', '犯罪'], 'director': [], 'name': ['无间道 無間道'], 'score': [], 'year': ['2002']}
.......
.......
Below is the content of the generated resultself.json file. I did not decode the output here, so... awkward: the Chinese shows up as \uXXXX escapes! In your own runs you can add one extra step (for example, pass ensure_ascii=False to json.dumps), or write to a txt file instead, and the Chinese characters come out readable.
[{"name": "\u8096\u7533\u514b\u7684\u6551\u8d4e The Shawshank Redemption", "year": "1994", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u72af\u7f6a"], "actor": []}]
[{"name": "\u89e6\u4e0d\u53ef\u53ca Intouchables", "year": "2011", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u559c\u5267"], "actor": []}]
[{"name": "\u6026\u7136\u5fc3\u52a8 Flipped", "year": "2010", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u559c\u5267", "\u7231\u60c5"], "actor": []}]
[{"name": "\u5f53\u5e78\u798f\u6765\u6572\u95e8 The Pursuit of Happyness", "year": "2006", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u5bb6\u5ead", "\u4f20\u8bb0"], "actor": []}]
[{"name": "\u75af\u72c2\u52a8\u7269\u57ce Zootopia", "year": "2016", "score": [], "director": [], "classification": ["\u559c\u5267", "\u52a8\u753b", "\u5192\u9669"], "actor": []}]
[{"name": "\u65e0\u95f4\u9053 \u7121\u9593\u9053", "year": "2002", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u60ac\u7591", "\u72af\u7f6a"], "actor": []}]
[{"name": "\u7194\u7089 \ub3c4\uac00\ub2c8", "year": "2011", "score": [], "director": [], "classification": ["\u5267\u60c5"], "actor": []}]
[{"name": "\u6559\u7236 The Godfather", "year": "1972", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u72af\u7f6a"], "actor": []}]
[{"name": "\u9f99\u732b \u3068\u306a\u308a\u306e\u30c8\u30c8\u30ed", "year": "1988", "score": [], "director": [], "classification": ["\u52a8\u753b", "\u5947\u5e7b", "\u5192\u9669"], "actor": []}]
[{"name": "\u661f\u9645\u7a7f\u8d8a Interstellar", "year": "2014", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u79d1\u5e7b", "\u5192\u9669"], "actor": []}]
[{"name": "\u5927\u8bdd\u897f\u6e38\u4e4b\u5927\u5723\u5a36\u4eb2 \u897f\u904a\u8a18\u5927\u7d50\u5c40\u4e4b\u4ed9\u5c65\u5947\u7de3", "year": "1995", "score": [], "director": [], "classification": ["\u559c\u5267", "\u7231\u60c5", "\u5947\u5e7b", "\u53e4\u88c5"], "actor": []}]
[{"name": "\u695a\u95e8\u7684\u4e16\u754c The Truman Show", "year": "1998", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u79d1\u5e7b"], "actor": []}]
[{"name": "\u653e\u725b\u73ed\u7684\u6625\u5929 Les choristes", "year": "2004", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u97f3\u4e50"], "actor": []}]
[{"name": "\u6d77\u4e0a\u94a2\u7434\u5e08 La leggenda del pianista sull'oceano", "year": "1998", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u97f3\u4e50"], "actor": []}]
[{"name": "\u4e09\u50bb\u5927\u95f9\u5b9d\u83b1\u575e 3 Idiots", "year": "2009", "score": [], "director": [], "classification": ["\u5267\u60c5", "\u559c\u5267", "\u7231\u60c5", "\u6b4c\u821e"], "actor": []}]
A few points to note!
First: the rules setting inside your custom spider is the most critical part; it determines how your target data set is located.
Second: the whole crawl is launched with scrapy crawl <spidername>, where spidername is the name of your custom spider. You can of course open several terminals and crawl in parallel. Mind the launch path: go into the second-level project directory.
Third: because of network conditions and other real-world issues, take care to configure your settings.py before an actual crawl!
This article's settings.py file reads as follows:
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
USER_AGENT = 'safari 5.1 – MAC User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.DoubanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'douban.pipelines.DoubanPipeline': 300,
#}
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
And that's all for today's share. Thanks for reading!
Please do not repost!