Contents

  • Prerequisites
  • Analysis
    • A. Goal
    • B. Sub-goals
    • C. Sub-goal analysis
  • manhua.py
  • items.py
  • pipelines.py
  • settings.py
  • middlewares.py
  • main.py

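Prerequisites

This article does not cover installing or configuring Scrapy, nor Scrapy's architecture. It assumes basic familiarity with Scrapy and some general knowledge of web scraping.

The analysis uses the manga 咒术回战 (Jujutsu Kaisen) on 古风漫画网 (gufengmh8.com) as its example; in practice the code can crawl any manga on the site.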

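Analysis

A. Goal

https://www.gufengmh8.com/manhua/zhoushuhuizhan/

B. Sub-goals

C. Sub-goal analysis

https://www.gufengmh8.com/manhua/zhoushuhuizhan/1325415.html

Taking a single image as the example: at this point an ordinary crawler approach would already get the job done, but there is a more convenient way.

A script tag on the page contains every image of the current sub-goal. Extract the image names with a regular expression and join them with the base URL. That concludes the analysis.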
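As a minimal sketch of the extraction step described above, the snippet below pulls the image list and the chapter path out of a script string with regular expressions and joins them with a base URL. The sample script text, the file names, and the base URL `https://img.example.com/` are made up for illustration; `chapterImages` and `chapterPath` are the variable names the spider in this article looks for.

```python
import re
from urllib.parse import urljoin

# Hypothetical script content; real pages will differ.
SAMPLE_SCRIPT = (
    '<script>var chapterImages = ["0001.jpg","0002.jpg"];'
    'var chapterPath = "images/comic/1/2/";</script>'
)


def extract_image_urls(script_text, base_url):
    """Return absolute image URLs found in script_text, or [] if absent."""
    images = re.search(r'var chapterImages = \[(.*?)\]', script_text, re.S)
    path = re.search(r'var chapterPath = "(.*?)"', script_text)
    if images is None or path is None:
        return []
    # Split the array body on commas and drop the surrounding quotes.
    names = [n.strip().strip('"') for n in images.group(1).split(",") if n.strip()]
    prefix = urljoin(base_url, path.group(1))
    return [urljoin(prefix, name) for name in names]


urls = extract_image_urls(SAMPLE_SCRIPT, "https://img.example.com/")
print(urls)
```

Returning an empty list when either variable is missing keeps a changed page layout from raising an `AttributeError` on `None`.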

manhua.py

The main logic.

# -*- coding: utf-8 -*-
import re
import scrapy
from urllib import parse
from scrapy import Request
from PicSpider.items import PicItem


class ManhuaSpider(scrapy.Spider):
    name = 'manhua'
    start_urls = ['https://www.gufengmh8.com/manhua/zhoushuhuizhan/']

    def parse(self, response):
        # Each <li> in the chapter list links to one chapter (a sub-goal).
        li_list = response.xpath('//*[@id="chapter-list-54"]//li')
        for li in li_list:
            child_url = li.xpath('./a/@href').extract_first("")
            dir_name = str(li.xpath('./a/span/text()').extract_first(""))
            yield Request(url=parse.urljoin(response.url, child_url),
                          callback=self.parse_detail,
                          meta={'dir_name': dir_name})

    def parse_detail(self, response):
        # The first <script> in <body> defines chapterImages and chapterPath.
        raw_data = response.xpath('/html/body/script[1]').extract_first("")
        urls = re.search(r'var chapterImages = \[(.*?)]', raw_data).group(1)
        urls = re.sub(r'"', r'', urls)
        url_list = urls.split(",")
        chapter_path = re.search(r'var chapterPath = "(.*?)"', raw_data).group(1)
        chapter_path = "https://res.xiaoqinre.com/" + chapter_path
        url_list = [chapter_path + url for url in url_list]
        pic_item = PicItem()
        pic_item["pic_url"] = url_list
        pic_item["dir_name"] = response.meta["dir_name"]
        yield pic_item
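The `parse.urljoin` call in `parse` turns each chapter link into an absolute URL. Here is a small sketch of how it behaves for a root-relative and a bare `href`. Both sample values are hypothetical, since the actual form of the links on the site is not shown in this article.

```python
from urllib import parse

base = "https://www.gufengmh8.com/manhua/zhoushuhuizhan/"

# A root-relative href replaces the whole path of the base URL.
absolute = parse.urljoin(base, "/manhua/zhoushuhuizhan/1325415.html")

# A bare file name is resolved against the base URL's directory.
relative = parse.urljoin(base, "1325415.html")

print(absolute)
print(relative)
```

Because `urljoin` handles both forms, the spider does not need to know which one the page uses.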

items.py

Defines a simple PicItem class.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PicspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class PicItem(scrapy.Item):
    pic_url = scrapy.Field()
    dir_name = scrapy.Field()

pipelines.py

To keep the output organised, renaming the images is essential.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import re

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class PicspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class ImagesRenamePipeline(ImagesPipeline):

    # Override get_media_requests
    def get_media_requests(self, item, info):
        for idx, image_url in enumerate(item['pic_url']):
            # The meta values come from the spider's item and are passed on
            # to file_path below
            yield Request(image_url, meta={'dir_name': item['dir_name'], 'idx': str(idx + 1)})

    # Override file_path to sort images into directories and rename them
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each image by its index
        image_name = request.meta['idx']
        # Directory name passed in through meta above
        dir_name = request.meta['dir_name']
        # Strip characters that are illegal in Windows file names
        dir_name = re.sub(r'[?\\*|"<>:/]', '', dir_name)
        # Key to per-directory storage: {0} is dir_name, {1} is image_name
        filename = u'{0}/{1}.png'.format(dir_name, image_name)
        return filename
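The renaming logic in `file_path` can be tried outside Scrapy. The sketch below isolates it in a plain function; the function name `build_image_path`, the `ext` parameter, and the `untitled` fallback for an empty name are assumptions added for illustration and are not part of the original pipeline.

```python
import re

# Characters that Windows does not allow in file or directory names.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def build_image_path(dir_name, idx, ext="png"):
    """Return '<clean dir>/<idx>.<ext>'; fall back to 'untitled' if empty."""
    clean = ILLEGAL_CHARS.sub("", dir_name).strip()
    return "{0}/{1}.{2}".format(clean or "untitled", idx, ext)


print(build_image_path('Chapter 1: "Start"?', 3))
print(build_image_path('???', 1))
```

The fallback matters because a chapter title made up only of illegal characters would otherwise produce a path that starts with `/`.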

settings.py

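A few settings deserve attention: the user agent USER_AGENT, ROBOTSTXT_OBEY, DOWNLOAD_DELAY, ITEM_PIPELINES, and the image download settings at the end of the file. You also need to create an images folder in the project directory to store the downloaded images. You can of course choose another location, which requires changing the corresponding configuration in the code.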

# -*- coding: utf-8 -*-
import sys
import os
from fake_useragent import UserAgent

ua = UserAgent()

# Scrapy settings for PicSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'PicSpider'

SPIDER_MODULES = ['PicSpider.spiders']
NEWSPIDER_MODULE = 'PicSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = ua.random

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.01
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'PicSpider.middlewares.PicspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'PicSpider.middlewares.PicspiderDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'PicSpider.pipelines.ImagesRenamePipeline': 1,
    'PicSpider.pipelines.PicspiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Image download settings
IMAGES_URLS_FIELD = "pic_url"
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
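If you would rather not create the images folder by hand, one option is to create it from Python before the crawl starts. This is a sketch under that assumption, not something the original settings file does. The helper name `ensure_images_dir` is hypothetical, and the demonstration runs in a temporary directory instead of a real project.

```python
import os
import tempfile


def ensure_images_dir(project_dir, folder="images"):
    """Create <project_dir>/<folder> if missing and return its path."""
    path = os.path.join(project_dir, folder)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path


# Demonstration in a temporary directory rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    store = ensure_images_dir(tmp)
    created = os.path.isdir(store)
    print(store, created)
```

In the settings file the same call could take `project_dir` as its argument, with the return value assigned to `IMAGES_STORE`.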

middlewares.py

If you do not use IP proxies, the default middlewares.py is sufficient.
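If you do want to route requests through a proxy, one common approach is a downloader middleware that sets `request.meta['proxy']`, the key Scrapy uses to choose the proxy for a request. The sketch below is an assumption about how that could look and is not part of the original project: the class name `RandomProxyMiddleware`, the proxy addresses, and the stand-in `FakeRequest` used for the demonstration are all hypothetical.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that assigns a proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy reads the proxy for a request from request.meta['proxy'].
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
        return None  # None tells Scrapy to continue processing the request


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self):
        self.meta = {}


proxies = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]  # placeholders
request = FakeRequest()
result = RandomProxyMiddleware(proxies).process_request(request, spider=None)
print(request.meta['proxy'])
```

In a real project the class would also need a way to receive its proxy list (for example from the settings) and an entry in DOWNLOADER_MIDDLEWARES.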

main.py

Used for debugging in PyCharm.

from scrapy.cmdline import execute
import sys
import os

# Add the project directory to the module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Run the crawl command
execute(["scrapy", "crawl", "manhua"])
