Table of Contents

  • 1. Acknowledgments
  • 2. Environment setup (based on Python 3.7)
    • 2.1 Pull the Splash image
    • 2.2 Start the Splash container
    • 2.3 Install scrapy-splash
    • 2.4 Install Pillow (image processing)
  • 3. Create the project
  • 4. settings.py configuration
  • 5. items.py configuration
  • 6. Writing the spider
  • 7. Running the spider
  • 8. Pitfalls encountered
  • 9. Project repository

1. Acknowledgments

First of all, special thanks to the Zhihu author 晚来天欲雪, whose article provided the knowledge behind this post; everything below builds on that original article.

2. Environment setup (based on Python 3.7)

2.1 Pull the Splash image

Following the official documentation, pull the Splash image:

docker pull scrapinghub/splash

2.2 Start the Splash container

Map port 8050 on the host to port 8050 in the container:

docker run -p 8050:8050 scrapinghub/splash
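
To confirm the container is reachable before wiring it into Scrapy, you can hit Splash's render.html endpoint. This is only a sanity check; the target URL and wait value below are illustrative:

# Quick check that Splash answers on port 8050.
# render.html is Splash's HTTP endpoint that returns the rendered page;
# https://example.com is only a placeholder target.
import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 2},
    timeout=30,
)
print(resp.status_code)   # 200 means Splash rendered the page
print(resp.text[:200])    # beginning of the rendered HTML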

2.3 Install scrapy-splash

pip install scrapy-splash

2.4 Install Pillow (image processing)

Pillow is required by Scrapy's built-in ImagesPipeline, which is used later to download the images.

pip install Pillow

3. Create the project

scrapy startproject netbian
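
This produces the standard Scrapy project skeleton (shown here for orientation). Note that the snippets in the following sections use "spider" as the package name, so adjust the module paths to match your own project:

netbian/
    scrapy.cfg
    netbian/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py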

4. settings.py configuration

# Scrapy settings for spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider'

SPIDER_MODULES = ['spider.spiders']
NEWSPIDER_MODULE = 'spider.spiders'

FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider.middlewares.SpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider.middlewares.SpiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the pipelines you want to run here; lower values run earlier (higher priority).
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,        # project-specific pipeline
    'scrapy.pipelines.images.ImagesPipeline': 1,   # built into Scrapy
}
IMAGES_STORE = 'images'
IMAGES_URLS_FIELD = 'img_url'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Splash
# Address of the Splash server
SPLASH_URL = 'http://localhost:8050'

# Add the Splash downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable SplashDeduplicateArgsMiddleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Use Splash's own duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# If you use Splash's HTTP cache, also specify a custom cache storage backend
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
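
ITEM_PIPELINES above references spider.pipelines.SpiderPipeline, whose contents are not shown in this post. A minimal sketch of pipelines.py, with only a placeholder body, could look like this:

# pipelines.py -- minimal sketch of the custom pipeline referenced in
# ITEM_PIPELINES; the logging below is only a placeholder for whatever
# post-processing the project actually needs.
class SpiderPipeline:
    def process_item(self, item, spider):
        spider.logger.info("Scraped item: %s", dict(item))
        return item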

5. items.py configuration

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    img_url = scrapy.Field()

6. Writing the spider

import scrapy
from scrapy_splash import SplashRequest

lua_script = '''
function main(splash)
    splash:go(splash.args.url)   -- open the page
    splash:wait(2)               -- wait for the page to load
    return splash:html()         -- return the rendered page HTML
end
'''


class NetbianSpider(scrapy.Spider):
    name = 'netbian'
    allowed_domains = ['jd.com']
    start_urls = ['https://item.jd.com/34637635130.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                endpoint='execute',
                args={
                    'lua_source': lua_script,
                    'timeout': 90,  # some pages load slowly and cause 504s; a larger timeout helps avoid them
                    'wait': 0.5,
                },
                cache_args=['lua_source'],
                callback=self.parse,
            )

    def parse(self, response):
        price = response.xpath('//span[@class="price J-p-34637635130"]/text()').extract_first()
        print("Price:", price)
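
The parse method above only prints the product price, while settings.py configures ImagesPipeline with IMAGES_URLS_FIELD = 'img_url'. A minimal sketch of a parse method that feeds that pipeline is shown below; the //img/@src selector is only a placeholder, and the package name "spider" is assumed from settings.py:

import scrapy

from spider.items import SpiderItem  # package name assumed from settings.py


class NetbianSpider(scrapy.Spider):
    # ... name, allowed_domains, start_urls and start_requests as above ...

    def parse(self, response):
        # Placeholder selector; replace it with one that matches the target page.
        img_urls = response.xpath('//img/@src').getall()
        # ImagesPipeline expects a list of absolute URLs in the field named
        # by IMAGES_URLS_FIELD ('img_url').
        yield SpiderItem(img_url=[response.urljoin(u) for u in img_urls])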

7. Running the spider

scrapy crawl netbian
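
Since FEED_EXPORT_ENCODING = 'utf-8' is already set in settings.py, the scraped items can also be exported while crawling, for example:

scrapy crawl netbian -o items.json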

8. Pitfalls encountered

WARNING: /xxx…/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)

Solution:
In /xxx…/scrapy_splash/request.py, add

from scrapy.utils.python import to_unicode

and then, on line 41, change

url = to_native_str(url)

to

url = to_unicode(url)

9. Project repository

Git repository
