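[TimLinux] Installing Scrapy on Windows
1. Install Python
I won't go into detail here: download the installer from the official site and run it. I used Python 3.6.5, the 64-bit (x86-64) Windows build.
2. Configure PATH
I'm on Windows 10. Right-click 'This PC' and choose 'Properties'; in the panel that opens, choose 'Advanced system settings'; in the new window click the 'Advanced' tab, then click 'Environment Variables'. Under the user variables, select Path (create it if it doesn't exist) and add C:\Python365\Scripts;C:\Python365; to its value. Command windows that are already open do not pick up the change, so open a new one afterwards.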
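One way to confirm that the PATH change took effect is to check it from Python in a freshly opened terminal. This is only a minimal sketch: the helper name `missing_path_entries` is made up for this illustration, and the two directories are the install locations used in this post, so adjust them to your own.

```python
import ntpath
import os


def missing_path_entries(path_value, required, sep=";"):
    """Return the entries of `required` that are absent from a Windows-style PATH string."""
    def norm(p):
        # Windows paths are case-insensitive; normpath also drops a trailing backslash.
        return ntpath.normcase(ntpath.normpath(p.strip()))

    present = {norm(p) for p in path_value.split(sep) if p.strip()}
    return [r for r in required if norm(r) not in present]


if __name__ == "__main__":
    # Directories from this post (an assumption about your install location).
    wanted = [r"C:\Python365\Scripts", r"C:\Python365"]
    print("Missing from PATH:", missing_path_entries(os.environ.get("PATH", ""), wanted))
```

An empty list means both directories are visible to the current process.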
3. Install Scrapy
I installed it with pip. In a cmd window, run: pip install scrapy. At this point you will probably hit the following error:
    building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    ----------------------------------------
    Command "C:\Python365\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\admin\\AppData\\Local\\Temp\\pip-install-fkvobf_0\\Twisted\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\admin\AppData\Local\Temp\pip-record-6z5m4wfj\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\admin\AppData\Local\Temp\pip-install-fkvobf_0\Twisted\
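This happens because Visual C++ 2015 (version 14.0) is not installed, so pip cannot compile Twisted's C extension. We don't actually need the compiler, though, and the link given in the error message no longer works anyway. Instead, download a prebuilt Twisted .whl file from the following site and install it directly: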
https://www.lfd.uci.edu/~gohlke/pythonlibs/
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
On that page, pick the package that matches your Python version and architecture (I chose Twisted-18.7.0-cp36-cp36m-win_amd64.whl):
Twisted, an event-driven networking engine.
    Twisted-18.7.0-cp27-cp27m-win32.whl
    Twisted-18.7.0-cp27-cp27m-win_amd64.whl
    Twisted-18.7.0-cp34-cp34m-win32.whl
    Twisted-18.7.0-cp34-cp34m-win_amd64.whl
    Twisted-18.7.0-cp35-cp35m-win32.whl
    Twisted-18.7.0-cp35-cp35m-win_amd64.whl
    Twisted-18.7.0-cp36-cp36m-win32.whl
    Twisted-18.7.0-cp36-cp36m-win_amd64.whl
    Twisted-18.7.0-cp37-cp37m-win32.whl
    Twisted-18.7.0-cp37-cp37m-win_amd64.whl
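Choosing from this list comes down to matching two parts of the file name: the Python tag (cp36 for CPython 3.6) and the platform tag (win_amd64 for a 64-bit interpreter, win32 for a 32-bit one). The sketch below shows one way to work that out; `interpreter_tags` and `pick_wheel` are hypothetical helper names, and `interpreter_tags` assumes CPython on Windows.

```python
import struct
import sys


def interpreter_tags():
    """Return (python_tag, platform_tag) for the running interpreter.

    Assumes CPython on Windows, e.g. ('cp36', 'win_amd64').
    """
    py_tag = "cp%d%d" % sys.version_info[:2]
    bits = struct.calcsize("P") * 8  # pointer size gives the interpreter's bitness
    return py_tag, ("win_amd64" if bits == 64 else "win32")


def pick_wheel(filenames, py_tag, platform_tag):
    """Return the first wheel whose Python and platform tags match, or None."""
    for name in filenames:
        if not name.endswith(".whl"):
            continue
        # Wheel names follow: {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
        parts = name[:-len(".whl")].split("-")
        if len(parts) >= 5 and parts[-3] == py_tag and parts[-1] == platform_tag:
            return name
    return None


wheels = [
    "Twisted-18.7.0-cp36-cp36m-win32.whl",
    "Twisted-18.7.0-cp36-cp36m-win_amd64.whl",
    "Twisted-18.7.0-cp37-cp37m-win_amd64.whl",
]
print(pick_wheel(wheels, "cp36", "win_amd64"))
```

On your own machine, `pick_wheel(wheels, *interpreter_tags())` would select the file for the interpreter that runs the script, if one is listed.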
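Once the download finishes, run pip install Twisted-18.7.0-cp36-cp36m-win_amd64.whl from the directory that contains the file. After that, run pip install scrapy again and the rest of the installation completes: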
    Installing collected packages: scrapy
    Successfully installed scrapy-1.5.1
4. GitHub repository
It is best to keep a record of what you study and work on, so I created a GitHub repository for this:
https://github.com/timscm/myscrapy
5. Example
The official documentation provides a simple example. I won't explain it here; the goal is only to check that it runs.
https://docs.scrapy.org/en/latest/intro/tutorial.html
    PS D:\pycharm\labs> scrapy
    Scrapy 1.5.1 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command
5.1. Create the project
    PS D:\pycharm\labs> scrapy startproject tutorial .
    New Scrapy project 'tutorial', using template directory 'c:\\python365\\lib\\site-packages\\scrapy\\templates\\project', created in:
        D:\pycharm\labs

    You can start your first spider with:
        cd .
        scrapy genspider example example.com
    PS D:\pycharm\labs> dir

        Directory: D:\pycharm\labs

    Mode                LastWriteTime         Length Name
    ----                -------------         ------ ----
    d-----        2018/9/23     10:58                .idea
    d-----        2018/9/23     11:46                tutorial
    -a----        2018/9/23     11:05           1307 .gitignore
    -a----        2018/9/23     11:05          11558 LICENSE
    -a----        2018/9/23     11:05             24 README.md
    -a----        2018/9/23     11:46            259 scrapy.cfg
5.2. Create the spider
The file structure is as follows: [figure: project directory tree]
tutorial/spiders/quotes_spider.py contains:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)
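The file name in parse comes from splitting the URL on "/": because each URL ends with a slash, the last element of the split is an empty string and the page number is the second-to-last one. Here is the same logic in isolation (the function name `page_filename` exists only for this illustration and is not part of the spider):

```python
def page_filename(url):
    """Mirror the spider's naming logic: use the second-to-last '/'-separated segment."""
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(page_filename('http://quotes.toscrape.com/page/2/'))  # quotes-2.html
```

Note that a URL without the trailing slash would yield "page" instead of the number, so this naming scheme relies on the URL format used here.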
5.3. Run
Run the spider from a command-line window:
    PS D:\pycharm\labs> scrapy crawl quotes
    2018-09-23 11:51:41 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
    2018-09-23 11:51:41 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
    2018-09-23 11:51:41 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
    2018-09-23 11:51:41 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    Unhandled error in Deferred:
    2018-09-23 11:51:41 [twisted] CRITICAL: Unhandled error in Deferred:

    2018-09-23 11:51:41 [twisted] CRITICAL:
    Traceback (most recent call last):
      File "c:\python365\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "c:\python365\lib\site-packages\scrapy\crawler.py", line 80, in crawl
        self.engine = self._create_engine()
      File "c:\python365\lib\site-packages\scrapy\crawler.py", line 105, in _create_engine
        return ExecutionEngine(self, lambda _: self.stop())
      File "c:\python365\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
        self.downloader = downloader_cls(crawler)
      File "c:\python365\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
        self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
      File "c:\python365\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
        return cls.from_settings(crawler.settings, crawler)
      File "c:\python365\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
        mwcls = load_object(clspath)
      File "c:\python365\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
        mod = import_module(module)
      File "c:\python365\lib\importlib\__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 994, in _gcd_import
      File "<frozen importlib._bootstrap>", line 971, in _find_and_load
      File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 678, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "c:\python365\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
        from twisted.web.client import ResponseFailed
      File "c:\python365\lib\site-packages\twisted\web\client.py", line 41, in <module>
        from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
      File "c:\python365\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
        from twisted.internet.stdio import StandardIO, PipeAddress
      File "c:\python365\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
        from twisted.internet import _win32stdio
      File "c:\python365\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
        import win32api
    ModuleNotFoundError: No module named 'win32api'
    PS D:\pycharm\labs>
It failed: the error says that the win32api module is missing, so we need to install the pypiwin32 package:
    PS D:\pycharm\labs> pip install pypiwin32
    Collecting pypiwin32
      Downloading https://files.pythonhosted.org/packages/d0/1b/2f292bbd742e369a100c91faa0483172cd91a1a422a6692055ac920946c5/pypiwin32-223-py3-none-any.whl
    Collecting pywin32>=223 (from pypiwin32)
      Downloading https://files.pythonhosted.org/packages/9f/9d/f4b2170e8ff5d825cd4398856fee88f6c70c60bce0aa8411ed17c1e1b21f/pywin32-223-cp36-cp36m-win_amd64.whl (9.0MB)
        100% |████████████████████████████████| 9.0MB 1.1MB/s
    Installing collected packages: pywin32, pypiwin32
    Successfully installed pypiwin32-223 pywin32-223
    PS D:\pycharm\labs>
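Rather than discovering a missing dependency from a long traceback, you can ask Python whether the relevant modules are importable before running the spider. A small sketch follows; `missing_modules` is a hypothetical helper name, and the module list is simply the three names that appear in the traceback and the fix above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # An empty list means all three modules can be located.
    print("Missing:", missing_modules(["scrapy", "twisted", "win32api"]))
```

This only checks that the modules can be located; it does not import them, so it will not catch errors raised while a module loads.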
Then run it again:
    PS D:\pycharm\labs> scrapy crawl quotes
    2018-09-23 11:53:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
    2018-09-23 11:53:05 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
    2018-09-23 11:53:05 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-09-23 11:53:06 [scrapy.core.engine] INFO: Spider opened
    2018-09-23 11:53:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-09-23 11:53:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-09-23 11:53:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2018-09-23 11:53:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    2018-09-23 11:53:08 [quotes] DEBUG: Saved file quotes-1.html    # Timlinux: the file was saved
    2018-09-23 11:53:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2018-09-23 11:53:08 [quotes] DEBUG: Saved file quotes-2.html    # Timlinux: the file was saved
    2018-09-23 11:53:08 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-09-23 11:53:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 678,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 5976,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 2,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 9, 23, 3, 53, 8, 822749),
     'log_count/DEBUG': 6,
     'log_count/INFO': 7,
     'response_received_count': 3,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2018, 9, 23, 3, 53, 6, 381170)}
    2018-09-23 11:53:08 [scrapy.core.engine] INFO: Spider closed (finished)
    PS D:\pycharm\labs>
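The "Crawled (status)" lines in a log like this one are regular enough to summarize with a short script, which helps once a crawl covers more than a few pages. The sketch below assumes the log format shown in this run; the regular expression and the name `crawled_responses` are illustrative and not part of Scrapy.

```python
import re
from collections import Counter

# Matches e.g.: Crawled (200) <GET http://quotes.toscrape.com/page/1/>
CRAWLED = re.compile(r"Crawled \((\d{3})\) <(\w+) ([^>\s]+)>")


def crawled_responses(log_text):
    """Return (status, method, url) tuples for every 'Crawled' line in a Scrapy log."""
    return [(int(s), m, u) for s, m, u in CRAWLED.findall(log_text)]


sample = """\
2018-09-23 11:53:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-09-23 11:53:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2018-09-23 11:53:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
"""

counts = Counter(status for status, _, _ in crawled_responses(sample))
print(dict(counts))
```

For these three lines the counts are two 200 responses and one 404, which agrees with the response_status_count entries in the stats dump. The 404 is the robots.txt request that Scrapy makes because ROBOTSTXT_OBEY is True.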
Let's look at the saved files (quotes-1.html and quotes-2.html).
Contents:
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Quotes to Scrape</title>
        <link rel="stylesheet" href="/static/bootstrap.min.css">
        <link rel="stylesheet" href="/static/main.css">
    </head>
    <body>
        <div class="container">
            <div class="row header-box">
The file is long, so this short excerpt will do.
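A quick sanity check on a saved page is to read back its title element. The sketch below uses only the standard library's html.parser rather than Scrapy's own selectors; the class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html_text):
    """Return the page title, or None if the document has no <title>."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title


excerpt = '<!DOCTYPE html> <html lang="en"> <head><meta charset="UTF-8"><title>Quotes to Scrape</title></head>'
print(page_title(excerpt))  # Quotes to Scrape
```

To check a real file, pass in the text of quotes-1.html, for example the result of reading it with open("quotes-1.html", encoding="utf-8").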
5.4. Upload the example code
    $ git commit -m "init scrapy tutorial."
    [master b1d6e1d] init scrapy tutorial.
     9 files changed, 259 insertions(+)
     create mode 100644 .idea/vcs.xml
     create mode 100644 scrapy.cfg
     create mode 100644 tutorial/__init__.py
     create mode 100644 tutorial/items.py
     create mode 100644 tutorial/middlewares.py
     create mode 100644 tutorial/pipelines.py
     create mode 100644 tutorial/settings.py
     create mode 100644 tutorial/spiders/__init__.py
     create mode 100644 tutorial/spiders/quotes_spider.py
    $ git push
    Counting objects: 14, done.
    Delta compression using up to 4 threads.
    Compressing objects: 100% (12/12), done.
    Writing objects: 100% (14/14), 4.02 KiB | 293.00 KiB/s, done.
    Total 14 (delta 0), reused 0 (delta 0)
    To https://github.com/timscm/myscrapy.git
       c7e93fc..b1d6e1d  master -> master
Reposted from: https://www.cnblogs.com/timlinux/p/9692319.html