Crawling HD Wallpapers from the LOL Skin Site with the Scrapy Framework

Lan   2020-03-06 21:22

Packaged build: click here

Code:

Spider file

# -*- coding: utf-8 -*-
import scrapy
from practice.items import PracticeItem
from urllib import parse


class LolskinSpider(scrapy.Spider):
    name = 'lolskin'
    allowed_domains = ['lolskin.cn']
    start_urls = ['https://lolskin.cn/champions.html']
    csurl = 'https://lolskin.cn'  # base URL for resolving relative links

    # Collect the link to every champion's page
    def parse(self, response):
        item = PracticeItem()
        item['urls'] = response.xpath('//div[2]/div[1]/div/ul/li/a/@href').extract()
        for url in item['urls']:
            yield scrapy.Request(url=parse.urljoin(self.csurl, url),
                                 dont_filter=True, callback=self.bizhi)

    # Collect the skin-page links on each champion's page
    def bizhi(self, response):
        skins = response.xpath('//td/a/@href').extract()
        for skin in skins:
            yield scrapy.Request(url=parse.urljoin(self.csurl, skin),
                                 dont_filter=True, callback=self.get_bzurl)

    # On each skin page, extract the wallpaper URLs and the skin name
    def get_bzurl(self, response):
        image_urls = response.xpath('//body/div[1]/div/a/@href').extract()
        image_name = response.xpath('//h1/text()').extract()
        yield {'image_urls': image_urls, 'image_name': image_name}
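The spider chains three callbacks: parse() gathers the champion links, bizhi() gathers each champion's skin pages, and get_bzurl() hands the wallpaper URLs and skin name to the pipeline. Site markup changes over time, so it may be worth sanity-checking the first XPath before a full crawl. Here is a minimal sketch using requests plus parsel (the selector library Scrapy is built on), assuming the champions page is still reachable:

import requests
from parsel import Selector

# Fetch the champions page and try the spider's first XPath on it.
html = requests.get('https://lolskin.cn/champions.html').text
selector = Selector(text=html)
print(selector.xpath('//div[2]/div[1]/div/ul/li/a/@href').getall()[:5])  # first few champion links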

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PracticeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # titles = scrapy.Field()
    # yxpngs = scrapy.Field()
    urls = scrapy.Field()
    skin_name = scrapy.Field()   # skin name
    image_urls = scrapy.Field()  # wallpaper URLs for a skin
    images = scrapy.Field()      # populated by ImagesPipeline after download
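For context on the field names: Scrapy's stock ImagesPipeline reads download URLs from an item's image_urls field and writes download results into images, which is why both are declared even though only image_urls is filled by hand. (The spider above actually yields a plain dict carrying image_name; Scrapy accepts dicts and Items interchangeably here.) A small sketch of how the item could be populated, with a made-up URL:

from practice.items import PracticeItem

item = PracticeItem()
item['skin_name'] = ['Annie']                                 # hypothetical skin name
item['image_urls'] = ['https://lolskin.cn/skin/annie/1.jpg']  # hypothetical wallpaper URL
# After a successful download, ImagesPipeline records path/checksum
# metadata for each image in item['images'].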

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
import re
from scrapy.pipelines.images import ImagesPipeline
import scrapy


# class PracticePipeline(object):
#     def __init__(self):
#         self.file = open('text.csv', 'a+')
#
#     def process_item(self, item, spider):
#         # os.chdir('lolskin')
#         # for title in item['titles']:
#         #     os.makedirs(title)
#         skin_name = item['skin_name']
#         skin_jpg = item['skin_jpg']
#         for i in range(len(skin_name)):
#             self.file.write(f'{skin_name[i]},{skin_jpg}\n')
#         self.file.flush()
#         return item
#
#     def down_bizhi(self, item, spider):
#         self.file.close()


class LoLPipeline(ImagesPipeline):
    # Build one download request per wallpaper URL, carrying the skin name
    # along in request.meta so file_path() can use it when saving.
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'image_name': item['image_name']})

    # Customize the storage path and file name of each downloaded image:
    # <champion>/<skin name>.jpg, relative to IMAGES_STORE.
    def file_path(self, request, response=None, info=None):
        image_name = re.findall('/skin/(.*?)/', request.url)[0] + '/' + request.meta['image_name'][0] + '.jpg'
        return image_name
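To make the renaming logic concrete, here is a minimal sketch of what file_path() returns, using made-up values and assuming wallpaper URLs follow the pattern https://lolskin.cn/skin/<champion>/<n>.jpg:

import re

url = 'https://lolskin.cn/skin/annie/1.jpg'  # hypothetical wallpaper URL
image_name = ['Annie']                       # as extracted from the page's <h1>
path = re.findall('/skin/(.*?)/', url)[0] + '/' + image_name[0] + '.jpg'
print(path)  # annie/Annie.jpg -> saved under IMAGES_STORE/annie/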

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for practice project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import os

BOT_NAME = 'practice'

SPIDER_MODULES = ['practice.spiders']
NEWSPIDER_MODULE = 'practice.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'practice (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Set a delay between requests
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'practice.middlewares.PracticeSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'practice.middlewares.PracticeDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'practice.pipelines.PracticePipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'practice.pipelines.LoLPipeline': 1,
}
# Set the folder where downloaded wallpapers are stored
IMAGES_STORE = r'E:\Python\scrapy\practice\practice\LOLskin'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
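A hedged note on the two active settings: ITEM_PIPELINES is what switches LoLPipeline on, and IMAGES_STORE is the root folder ImagesPipeline writes into. The hard-coded E:\ path only works on one machine; since settings.py already imports os, a portable variant (my assumption, not part of the original project) could be:

import os

# Store wallpapers in a LOLskin folder next to settings.py instead of a
# fixed Windows path (assumes the default Scrapy project layout).
IMAGES_STORE = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'LOLskin')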

main.py

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'lolskin'])
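Running python main.py from the project root is equivalent to typing scrapy crawl lolskin on the command line; it exists mainly so the crawl can be started from an IDE. An alternative programmatic entry point, sketched with Scrapy's CrawlerProcess (the spider's import path is my assumption about the project layout):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from practice.spiders.lolskin import LolskinSpider  # assumed module path

process = CrawlerProcess(get_project_settings())
process.crawl(LolskinSpider)
process.start()  # blocks until the crawl finishes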

Permalink: https://www.lanol.cn/post/24.html
Copyright notice: this is an original article by Lan. Feel free to share it; please keep this attribution when reposting.