Getting Started with Scrapy: Scraping Images from a Specific Web Page
1. Create the project
scrapy startproject scrapytest20230130
2. Project structure
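For reference, `scrapy startproject scrapytest20230130` generates a layout along these lines (the standard Scrapy project template; `dgtleSpider.py` under `spiders/` is added by hand afterwards):

```text
scrapytest20230130/
├── scrapy.cfg              # deploy configuration
└── scrapytest20230130/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── middlewares.py      # spider/downloader middlewares
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        ├── __init__.py
        └── dgtleSpider.py  # our spider (created manually)
```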
3. dgtleSpider.py
import scrapy
from scrapytest20230130.items import Scrapytest20230130Item


class QuotesSpider(scrapy.Spider):
    name = "dgtle"

    def start_requests(self):
        urls = ['https://www.dgtle.com/inst-1858189-1.html']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Grab the data-src attribute of every <div class="bg-img">
        res = response.css('div.bg-img::attr(data-src)').extract()
        count = 1
        for i in res:
            item = Scrapytest20230130Item()
            item['img_src'] = i
            item['img_name'] = str(count) + '.jpg'  # name images 1.jpg, 2.jpg, ...
            count += 1
            yield item
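The CSS selector `div.bg-img::attr(data-src)` pulls the `data-src` attribute from every matching `<div>`. As a rough stdlib-only illustration of the same extraction (the sample HTML below is made up for demonstration, not taken from dgtle.com):

```python
from html.parser import HTMLParser


class DataSrcCollector(HTMLParser):
    """Collect data-src attributes from <div class="bg-img"> tags."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        classes = (d.get('class') or '').split()
        if tag == 'div' and 'bg-img' in classes and 'data-src' in d:
            self.srcs.append(d['data-src'])


sample = '''
<div class="bg-img" data-src="https://example.com/a.jpg"></div>
<div class="bg-img" data-src="https://example.com/b.jpg"></div>
'''
parser = DataSrcCollector()
parser.feed(sample)
print(parser.srcs)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

In the real spider, Scrapy's selector engine does this work; the point is only that each image URL comes from a `data-src` attribute rather than a plain `src`.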
4. items.py
import scrapy


class Scrapytest20230130Item(scrapy.Item):
    # Fields populated by the spider and consumed by the image pipeline
    img_src = scrapy.Field()
    img_name = scrapy.Field()
5. pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class Scrapytest20230130Pipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Download each image, passing the item along in request.meta
        yield scrapy.Request(url=item['img_src'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        # Just return the image file name; it is saved under IMAGES_STORE
        item = request.meta['item']
        return item['img_name']

    def item_completed(self, results, item, info):
        return item
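Per the Scrapy docs, `item_completed` receives `results` as a list of `(success, info)` tuples, where `info` is a dict with `url`, `path`, and `checksum` keys for successful downloads. A small stdlib sketch of how one might pull out only the files that actually downloaded (the sample `results` data below is fabricated):

```python
def completed_paths(results):
    """Return the saved file paths of successful downloads only."""
    return [info['path'] for ok, info in results if ok]


# Fabricated example of the shape ImagesPipeline passes to item_completed:
results = [
    (True, {'url': 'https://example.com/1.jpg', 'path': '1.jpg', 'checksum': 'abc'}),
    (False, Exception('download failed')),
]
print(completed_paths(results))  # ['1.jpg']
```

The pipeline above simply returns the item unchanged, but a check like this is where you would drop items whose images failed to download.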
6. settings.py
# Scrapy settings for scrapytest20230130 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random

BOT_NAME = 'scrapytest20230130'

SPIDER_MODULES = ['scrapytest20230130.spiders']
NEWSPIDER_MODULE = 'scrapytest20230130.spiders'

IMAGES_STORE = '/Users/liuxiawei1996/Code/scrapytest20230130/img'  # where downloaded images are saved
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapytest20230130 (+http://www.yourdomain.com)'
USER_AGENTS_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
USER_AGENT = random.choice(USER_AGENTS_LIST)

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': USER_AGENT,
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapytest20230130.middlewares.Scrapytest20230130SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapytest20230130.middlewares.Scrapytest20230130DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapytest20230130.pipelines.Scrapytest20230130Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
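One caveat about the settings above: `USER_AGENT = random.choice(USER_AGENTS_LIST)` picks a single User-Agent once, when settings are loaded, so every request in the run shares it. To rotate the User-Agent per request, the usual approach is a downloader middleware. A minimal sketch, assuming the class name and its registration key are our own naming (not part of the original project), with the list abbreviated to two entries:

```python
import random

# Abbreviated copy of the USER_AGENTS_LIST from settings.py
USER_AGENTS_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


class RandomUserAgentMiddleware:
    """Assign a freshly chosen User-Agent to every outgoing request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS_LIST)
        return None  # let Scrapy continue handling the request


# Registered in settings.py (the 400 priority is an arbitrary choice):
# DOWNLOADER_MIDDLEWARES = {
#     'scrapytest20230130.middlewares.RandomUserAgentMiddleware': 400,
# }
```

Returning `None` from `process_request` tells Scrapy to keep processing the request through the remaining middlewares, which is what we want here.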
7. Run the spider
scrapy crawl dgtle
8. Result