Scrapy Splash: Notes on Learning to Crawl Images

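1. Acknowledgments

Special thanks to the Zhihu author 晚来天欲雪, whose article supplied much of the knowledge used here. The content below builds on that original article.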

2. Environment setup (Python 3.7)

2.1 Prepare the image

Following the official documentation, pull the Splash image.

docker pull scrapinghub/splash

2.2 Start the Splash container

Map port 8050 on the host to port 8050 in the container.

docker run -p 8050:8050 scrapinghub/splash
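
Before moving on, it can help to confirm that the container is actually answering on the mapped port. The sketch below assumes Splash's `_ping` HTTP endpoint and the address http://localhost:8050 used in this post. The helper names `ping_url` and `splash_is_up` are hypothetical and are not part of the original project.

```python
import json
import urllib.request


def ping_url(base_url="http://localhost:8050"):
    """Build the URL of Splash's _ping endpoint for a given base address."""
    return base_url.rstrip("/") + "/_ping"


def splash_is_up(base_url="http://localhost:8050", timeout=5):
    """Return True if Splash answers the ping with status "ok"."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: treat as "not up".
        return False
```

Calling `splash_is_up()` from a Python shell after `docker run` should return True once the container has finished starting. A False result usually points to the port mapping or to a container that has not started yet.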

2.3 Install scrapy-splash

pip install scrapy-splash

2.4 Install Pillow (image processing)

pip install Pillow

3. Generate the project

scrapy startproject netbian

4. settings.py configuration

# Scrapy settings for spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider'

SPIDER_MODULES = ['spider.spiders']
NEWSPIDER_MODULE = 'spider.spiders'

FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider.middlewares.SpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'spider.middlewares.SpiderDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# List the pipelines you want to run here. Values conventionally range from
# 0 to 1000, and pipelines with lower values run first.
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,         # custom project pipeline
    'scrapy.pipelines.images.ImagesPipeline': 1,    # built into Scrapy
}
IMAGES_STORE = 'images'
IMAGES_URLS_FIELD = 'img_url'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Splash
# Address of the Splash server
SPLASH_URL = 'http://localhost:8050'

# Enable the Splash downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable SplashDeduplicateArgsMiddleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Use Splash's own duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# If you use Splash's HTTP cache, also set a custom cache storage backend
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
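
Scrapy passes each item through the pipelines in ascending order of the integer assigned in ITEM_PIPELINES, so the entry with the lowest value runs first. The sketch below only illustrates that ordering with the two entries from this settings file. The helper `pipeline_order` is hypothetical and does not reproduce Scrapy's internal code.

```python
# The same mapping as in settings.py above.
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by their value, lowest (earliest) first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for position, path in enumerate(pipeline_order(ITEM_PIPELINES), start=1):
    print(position, path)
```

With these values the built-in ImagesPipeline (1) handles each item before the custom SpiderPipeline (300), so the custom pipeline sees the item after the image download step.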

5. items.py configuration

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    img_url = scrapy.Field()
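
Because IMAGES_URLS_FIELD points at `img_url`, the ImagesPipeline reads that field as a list of absolute image URLs. The sketch below shows one way to prepare such a list from the `src` values found on a page. The function `build_image_item` and the example.com URLs are hypothetical, and a plain dict stands in for SpiderItem so the snippet runs without Scrapy installed.

```python
from urllib.parse import urljoin


def build_image_item(page_url, srcs):
    """Resolve image src values against the page URL and drop duplicates.

    Returns a dict shaped like SpiderItem: {"img_url": [absolute URLs]}.
    """
    seen = set()
    urls = []
    for src in srcs:
        absolute = urljoin(page_url, src.strip())
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return {"img_url": urls}


item = build_image_item(
    "https://example.com/a/b.html",
    ["/img/1.jpg", "2.jpg", "//cdn.example.com/3.jpg", "/img/1.jpg"],
)
print(item["img_url"])
```

In a spider callback the same dict could be yielded directly, or its contents passed to `SpiderItem(img_url=...)`.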

6. Write the spider

import scrapy
from scrapy_splash import SplashRequest

lua_script = '''
function main(splash)
    splash:go(splash.args.url)    -- open the page
    splash:wait(2)                -- wait for it to load
    return splash:html()          -- return the rendered page HTML
end
'''


class NetbianSpider(scrapy.Spider):
    name = 'netbian'
    allowed_domains = ['jd.com']
    start_urls = ['https://item.jd.com/34637635130.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                endpoint='execute',
                args={
                    'lua_source': lua_script,
                    # Timeout in seconds. Some pages load slowly and cause a 504;
                    # a larger value helps avoid that.
                    'timeout': 90,
                    'wait': 0.5,
                },
                cache_args=['lua_source'],
                callback=self.parse,
            )

    def parse(self, response):
        price = response.xpath('//span[@class="price J-p-34637635130"]/text()').extract_first()
        print("Price:", price)
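
When the spider returns nothing, it can be useful to send the same Lua script to Splash directly and look at the raw response, without Scrapy in between. The sketch below builds a POST request for Splash's `execute` endpoint with a JSON body. The function name `build_execute_request` is hypothetical, the Splash address is the one assumed throughout this post, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Same Lua script as in the spider: open the page, wait, return the HTML.
LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(2)
    return splash:html()
end
"""


def build_execute_request(page_url, splash_url="http://localhost:8050", timeout=90):
    """Build (but do not send) a POST request for Splash's execute endpoint."""
    payload = {"lua_source": LUA_SCRIPT, "url": page_url, "timeout": timeout}
    return urllib.request.Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_execute_request("https://item.jd.com/34637635130.html")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` while the container is running should return the rendered HTML, which can then be searched for the price element by hand.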

7. Run the spider

scrapy crawl netbian

8. Pitfalls encountered

WARNING: /xxx…/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)

Fix:
In /xxx…/scrapy_splash/request.py, add

from scrapy.utils.python import to_unicode

Then, on line 41, change

url = to_native_str(url)

to

url = to_unicode(url)
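
Editing a file inside an installed package is lost whenever the package is reinstalled. As an alternative, the warning can be hidden from the project side with Python's standard `warnings` module. The sketch below is an assumption rather than the fix used in this post: it matches the warning by the start of its message text, the function name `silence_to_native_str_warning` is hypothetical, and it has not been verified against a live crawl.

```python
import warnings

# Start of the message printed by the deprecation warning shown above.
DEPRECATION_PATTERN = r"Call to deprecated function to_native_str"


def silence_to_native_str_warning():
    """Ignore only warnings whose message starts with the pattern above."""
    warnings.filterwarnings("ignore", message=DEPRECATION_PATTERN)
```

One place to call it would be near the top of settings.py, so the filter is installed before any request is created. This hides the message only; the deprecated call in scrapy_splash/request.py is still made.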

9. Project repository

Git repository link
