1. Create the project

scrapy startproject scrapytest20230130

2. Project structure
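For reference, scrapy startproject generates the layout below; dgtleSpider.py is the spider file we add by hand in step 3, inside the spiders/ directory:

scrapytest20230130/
    scrapy.cfg
    scrapytest20230130/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dgtleSpider.py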

3. dgtleSpider.py

import scrapy
from scrapytest20230130.items import Scrapytest20230130Item


# The class name is a leftover from the official Scrapy tutorial;
# Scrapy identifies the spider by its `name` attribute.
class QuotesSpider(scrapy.Spider):
    name = "dgtle"

    def start_requests(self):
        urls = ['https://www.dgtle.com/inst-1858189-1.html']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Collect the image URLs from the data-src attribute of each div.bg-img
        res = response.css('div.bg-img::attr(data-src)').extract()
        count = 1
        for i in res:
            item = Scrapytest20230130Item()
            item['img_src'] = i
            item['img_name'] = str(count) + '.jpg'
            count += 1
            yield item
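Before running the full crawl, the CSS selector can be sanity-checked interactively with scrapy shell; this assumes the page is still reachable and its markup unchanged:

scrapy shell 'https://www.dgtle.com/inst-1858189-1.html'
>>> response.css('div.bg-img::attr(data-src)').extract()   # should list the image URLs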

4. items.py

import scrapy


class Scrapytest20230130Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_src = scrapy.Field()
    img_name = scrapy.Field()
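A scrapy.Item behaves like a dict restricted to its declared fields, which is why the spider can assign item['img_src'] directly; touching an undeclared key raises KeyError. A quick illustration (not from the original post):

>>> item = Scrapytest20230130Item(img_src='https://example.com/a.jpg', img_name='1.jpg')
>>> item['img_name']
'1.jpg'
>>> item['title']   # not declared above, so this raises KeyError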

5. pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class Scrapytest20230130Pipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Pass the item along via meta so file_path can read the file name
        yield scrapy.Request(url=item['img_src'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        # Returning just the file name is enough; ImagesPipeline
        # saves it relative to IMAGES_STORE
        item = request.meta['item']
        filePath = item['img_name']
        return filePath

    def item_completed(self, results, item, info):
        return item
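Two notes on the pipeline. First, ImagesPipeline depends on Pillow, so it must be installed (pip install Pillow) or Scrapy will refuse to load the pipeline. Second, on Scrapy 2.4 and later, file_path receives the item as a keyword argument, which makes the meta round-trip unnecessary; a minimal sketch of that variant:

    def file_path(self, request, response=None, info=None, *, item=None):
        # On Scrapy >= 2.4 the item is passed in directly
        return item['img_name']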

6. settings.py

# Scrapy settings for scrapytest20230130 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import random

BOT_NAME = 'scrapytest20230130'

SPIDER_MODULES = ['scrapytest20230130.spiders']
NEWSPIDER_MODULE = 'scrapytest20230130.spiders'

IMAGES_STORE = '/Users/liuxiawei1996/Code/scrapytest20230130/img'   # directory where downloaded images are saved
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapytest20230130 (+http://www.yourdomain.com)'
USER_AGENTS_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
USER_AGENT = random.choice(USER_AGENTS_LIST)

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # 'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    'User-Agent': USER_AGENT,
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapytest20230130.middlewares.Scrapytest20230130SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapytest20230130.middlewares.Scrapytest20230130DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapytest20230130.pipelines.Scrapytest20230130Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
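One thing to be aware of: USER_AGENT = random.choice(...) runs once, when the settings module is imported, so a single crawl uses one user agent for all of its requests. Rotating per request takes a downloader middleware instead; a minimal sketch (a hypothetical class for middlewares.py, registered via DOWNLOADER_MIDDLEWARES):

import random

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a fresh user agent for every outgoing request
        ua_list = spider.settings.getlist('USER_AGENTS_LIST')
        request.headers['User-Agent'] = random.choice(ua_list)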

7. Run the spider

scrapy crawl dgtle
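This is run from the project root (the directory containing scrapy.cfg). Because settings.py sets LOG_LEVEL to WARNING, the run is quiet by default; a single run can be made more verbose with the -s flag:

scrapy crawl dgtle -s LOG_LEVEL=INFO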

8. Result

When the crawl completes, the downloaded images sit in the IMAGES_STORE directory, named 1.jpg, 2.jpg, and so on in page order, exactly as assigned in the spider.
