2021: Scraping an entire Biquge novel with Scrapy in under a minute and saving it as txt

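I'm an avid novel reader and like to read on an e-ink reader, but txt versions of web novels are hard to find. Having recently learned how to scrape web pages with Scrapy, I decided to use the framework to crawl a novel from Biquge (笔趣阁) and save it as a txt file. The tricky part is putting the chapters back in order when the requests are issued concurrently; this code solves that with a dictionary. Without further ado, here are the approach and the code.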

First, take a look at the novel's web page. The novel crawled today is the recently popular 大奉打更人.

Start by creating a Scrapy project in the terminal:

scrapy startproject dagengren

Following the hint that Scrapy prints in the terminal, enter:

cd dagengren
scrapy genspider quge quge7.com

Open the project in PyCharm and edit quge.py first:

import re
import urllib.parse

import scrapy


class QugeSpider(scrapy.Spider):
    name = 'quge'
    allowed_domains = ['quge7.com']
    # start_urls holds the URL of the book's index page
    start_urls = ['https://www.quge7.com/book/1472/']

    def parse(self, response):
        items = {}
        # 'cur' counts the chapters crawled so far
        items['cur'] = 0
        # Extract the novel title with XPath
        items['title'] = response.xpath("//div[@class='info']/h1/text()").extract_first()
        # Extract the author with XPath
        items['author'] = response.xpath("//div[@class='small']/span[1]/text()").extract_first()
        # Extract the table of contents with XPath
        dd_list = response.xpath('//div[@class="listmain"]/dl/dd')
        # max_page_href is the link to the last chapter
        max_page = dd_list[-1]
        max_page_href = max_page.xpath('./a/@href').extract_first()
        # The href is incomplete, e.g. /book/1472/1.html;
        # urllib.parse.urljoin() turns it into a full URL
        max_page_href = urllib.parse.urljoin('https://www.quge7.com/book/1472/', max_page_href)
        # Extract the highest chapter number
        max_number = int(re.findall(r'https://www\.quge7\.com/book/1472/(.*?)\.html', max_page_href)[-1])
        # Store the chapter count before yielding any request,
        # so that it is already set when the first response is handled
        items['max_page'] = max_number
        # Build the list of chapter URLs to crawl
        url_list = ["https://www.quge7.com/book/1472/" + str(i) + ".html" for i in range(1, max_number + 1)]
        # Crawl every URL in the list
        for url in url_list:
            # Create the request and let parse_content handle the response
            yield scrapy.Request(url, callback=self.parse_content, meta={'items': items})

    def parse_content(self, response):
        # Take out the shared items dict
        items = response.meta['items']
        # One more chapter has been crawled
        items['cur'] += 1
        # current_page is the number of the current chapter
        current_page = int(re.findall(r'https://www\.quge7\.com/book/1472/(.*?)\.html', response.url)[-1])
        # Extract the chapter title
        small_title = response.xpath("//span[@class='title']/text()").extract_first()
        # Extract the chapter text
        texts = response.xpath("//div[@id='chaptercontent']/text()").extract()
        # Clean the text: the two full-width spaces that indent a paragraph become a newline
        texts = [re.sub(r'\u3000\u3000', r'\n', text) for text in texts]
        # Join the list of text fragments into one string
        items[current_page] = "".join(texts)
        # Put the chapter title in front of the text
        items[current_page] = small_title + '\n' + items[current_page]
        # Hand the items dict to the pipeline
        yield items
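The URL handling in the spider can be tried out without running a crawl. The following is a minimal sketch, assuming (as the spider does) that chapters are numbered consecutively from 1 and live at `<book URL>/<n>.html`; the helper names `chapter_number`, `chapter_urls` and `clean_paragraphs` are made up for illustration and are not part of the project.

```python
import re
from urllib.parse import urljoin

# Book index page used in this post
BOOK_URL = "https://www.quge7.com/book/1472/"


def chapter_number(url):
    """Return the chapter index encoded in a URL such as .../1472/12.html."""
    match = re.search(r"/(\d+)\.html$", url)
    if match is None:
        raise ValueError(f"not a chapter URL: {url}")
    return int(match.group(1))


def chapter_urls(last_href, base=BOOK_URL):
    """Build every chapter URL from 1 up to the number found in the last link."""
    # The href taken from the page is relative, so complete it first
    last_url = urljoin(base, last_href)
    last = chapter_number(last_url)
    return [urljoin(base, f"{n}.html") for n in range(1, last + 1)]


def clean_paragraphs(texts):
    """Replace each paragraph indent (two full-width spaces) with a newline."""
    return "".join(re.sub("\u3000\u3000", "\n", t) for t in texts)


urls = chapter_urls("/book/1472/3.html")
```

Matching only digits (`\d+`) makes the `int()` conversion safe, which the broader `(.*?)` pattern does not guarantee if the site ever adds a differently named page.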

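One neat trick here is that the chapter number and the chapter text are stored as a key-value pair:

items[current_page] = small_title+'\n'+items[current_page]

When the text is written out later, the file can be filled in chapter order by walking through these keys, which solves the chapter-ordering problem.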
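To see why the dictionary is enough, here is a small self-contained sketch with no Scrapy involved. It simulates responses arriving in random order (the shuffle stands in for concurrent downloads) and then reads the dictionary back by chapter number; `assemble` and the sample chapter strings are made up for illustration.

```python
import random


def assemble(chapters, total):
    """Concatenate chapters 1..total in numeric order, whatever order they arrived in."""
    return "".join(chapters[n] for n in range(1, total + 1))


total = 5
arrival_order = list(range(1, total + 1))
random.shuffle(arrival_order)  # simulate responses coming back out of order

chapters = {}
for n in arrival_order:
    # Key: chapter number, value: chapter title plus text
    chapters[n] = f"Chapter {n}\n"

book = assemble(chapters, total)
```

The insertion order of the dictionary never matters, because the final loop looks chapters up by number rather than iterating over the dictionary itself.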

The code in pipelines.py:

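class DagengrenPipeline:
    def process_item(self, item, spider):
        # Show how many chapters have been crawled so far
        print(item['cur'])
        # Once the crawled count equals the total, write the novel to the file
        if item['cur'] == item['max_page']:
            # utf-8 so the Chinese text is written whatever the platform default is
            with open("novel.txt", 'w', encoding='utf-8') as file_object:
                # Write the novel title
                file_object.write(item['title'] + '\n')
                # Write the author's name
                file_object.write(item['author'] + '\n')
                # Write the chapters in order
                for i in range(1, item['max_page'] + 1):
                    file_object.write(item[i])
        # Pipelines are expected to return the item
        return item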

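Note that this code writes the novel to the file only once the number of crawled chapters equals the total number of chapters, so that the loop can then write every chapter in order.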
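One consequence of waiting for the count to match is that nothing is written at all if a single chapter request fails. As a hedged alternative, not the approach used in this post, the file could be written when the spider closes, using the pipeline's `close_spider` hook. The class name `NovelWriterPipeline`, the `path` argument and the placeholder text for a missing chapter are assumptions made for this sketch.

```python
class NovelWriterPipeline:
    """Hypothetical variant: write the file once, when the spider closes."""

    def __init__(self, path="novel.txt"):
        self.path = path
        self.book = None

    def process_item(self, item, spider):
        # The spider yields the same shared dict every time, so a reference is enough
        self.book = item
        return item

    def close_spider(self, spider):
        # Nothing was crawled, so there is nothing to write
        if self.book is None:
            return
        with open(self.path, "w", encoding="utf-8") as f:
            f.write(f"{self.book['title']}\n{self.book['author']}\n")
            for n in range(1, self.book["max_page"] + 1):
                # A chapter whose request failed is simply absent from the dict
                f.write(self.book.get(n, f"[chapter {n} missing]\n"))
```

To try it, the class would have to be registered in ITEM_PIPELINES in place of DagengrenPipeline. The trade-off is that the output may contain gaps, which are at least visible in the file.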

The settings.py configuration:

BOT_NAME = 'dagengren'

SPIDER_MODULES = ['dagengren.spiders']
NEWSPIDER_MODULE = 'dagengren.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
LOG_LEVEL = 'WARNING'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'dagengren.middlewares.DagengrenSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'dagengren.middlewares.DagengrenDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'dagengren.pipelines.DagengrenPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

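Crawl speed:

The download is already quite fast, and this is with Scrapy's default concurrency only (CONCURRENT_REQUESTS defaults to 16; these are concurrent asynchronous requests, not threads). To go faster, raise the concurrency settings in settings.py.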
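As a rough sketch of where those knobs live, the fragment below could go into settings.py. The values are illustrative assumptions, not measurements from this crawl, and the defaults quoted in the comments are the ones given in the Scrapy settings reference as far as I know. Because every request here goes to the same site, the per-domain limit is the one that actually caps the crawl.

```python
# settings.py (fragment): illustrative values only
CONCURRENT_REQUESTS = 32             # global cap, Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-site cap, Scrapy's default is 8
DOWNLOAD_DELAY = 0                   # seconds between requests to one site (default)
```

Higher concurrency puts more load on the site, so it is worth raising these numbers gradually rather than all at once.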

Final result: the whole novel is saved as novel.txt in the directory the crawl was started from.
