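Scraping a novel with the Scrapy framework

First create the project files. Open cmd and enter the following command:

scrapy startproject <project_name>

Then change into the project directory:

cd <project_name>

Generate a spider for the site to be crawled:

scrapy genspider <spider_name> <start_url_domain>

The whole process looks like this: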

C:\Users\Administrator\Desktop\scrapy>scrapy startproject xiaoshuo_text
New Scrapy project 'xiaoshuo_text', using template directory 'c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text

You can start your first spider with:
    cd xiaoshuo_text
    scrapy genspider example example.com

C:\Users\Administrator\Desktop\scrapy>cd xiaoshuo_text
C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text>scrapy genspider xiaoshuo 81zw.com
Created spider 'xiaoshuo' using template 'basic' in module:
  xiaoshuo_text.spiders.xiaoshuo

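The generated project contains scrapy.cfg and the xiaoshuo_text package, which holds items.py, middlewares.py, pipelines.py, settings.py and a spiders directory with xiaoshuo.py.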

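With the project files in place, the first step is to configure settings.py. The main settings are:

(1) LOG_LEVEL = "WARNING": log only messages at WARNING level or above;
(2) USER_AGENT sets the crawler's User-Agent header, and DEFAULT_REQUEST_HEADERS holds the other default headers, such as the cookies needed for login;
(3) ITEM_PIPELINES: enable the item pipeline;
(4) LOG_FILE = "./log.log": write the run log to a file;
(5) DOWNLOAD_DELAY: wait between downloads to lower the request rate.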

# Scrapy settings for xiaoshuo_text project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent

BOT_NAME = 'xiaoshuo_text'

SPIDER_MODULES = ['xiaoshuo_text.spiders']
NEWSPIDER_MODULE = 'xiaoshuo_text.spiders'

# Log only messages at WARNING level or above
LOG_LEVEL = "WARNING"
# Write the run log to a file
LOG_FILE = "./log.log"

ua = UserAgent()
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT must be a string, so pick one random user agent at startup
USER_AGENT = ua.random

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Wait between downloads to lower the request rate
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xiaoshuo_text.middlewares.XiaoshuoTextSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'xiaoshuo_text.middlewares.XiaoshuoTextDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
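
The settings list mentions ITEM_PIPELINES, but the excerpt above stops before that entry. A minimal sketch of what it would probably look like, assuming the pipeline class is the XiaoshuoTextPipeline defined in pipelines.py and assuming a priority of 300 (the value the project template normally suggests; any integer from 0 to 1000 is accepted, and lower values run first):

```python
# Assumed fragment of settings.py: register the pipeline so that
# items yielded by the spider are passed to it.
ITEM_PIPELINES = {
    'xiaoshuo_text.pipelines.XiaoshuoTextPipeline': 300,
}
```

Without an entry like this, the spider would still crawl, but the items it yields would never reach pipelines.py.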

The second step is to fetch the data by editing xiaoshuo.py. For example, if the page to crawl is xxxx.html, set start_urls to that URL. The code is as follows:

import scrapy


class XiaoshuoSpider(scrapy.Spider):
    name = 'xiaoshuo'
    allowed_domains = ['xxxx.com']
    start_urls = ['xxxxx.html']

    def parse(self, response):
        # Extract the chapter title and body text
        chapter_title = response.xpath("//h1/text()").extract_first()
        chapter_content = "".join(response.xpath("//*[@id='content']/text()").extract()).replace("\u3000\u3000", "\n     ")
        # Build a dict and yield it so Scrapy hands it to the item pipeline
        data = {}
        data["title"] = chapter_title
        data["content"] = chapter_content
        # print(data)
        yield data
        # Get the link to the next chapter and keep crawling
        next_chapter = response.xpath('//div[@class="bookname"]/div[1]/a[3]/@href').extract_first()
        # base_url = "https://www.81zw.com/{}".format(next_chapter)
        # Follow the link only if it exists and points to a chapter page (.html)
        if next_chapter and ".html" in next_chapter:
            yield scrapy.Request(response.urljoin(next_chapter), callback=self.parse)
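
The two pure-Python pieces of parse, the text cleanup and the next-chapter check, can be tried without running a crawl. The sketch below is only an illustration: clean_content and next_chapter_url are hypothetical helper names, the URLs are made up, and urllib.parse.urljoin stands in for response.urljoin, which resolves relative links in a similar way.

```python
from urllib.parse import urljoin


def clean_content(fragments):
    """Join the text nodes and start a new indented paragraph at each
    pair of ideographic spaces (U+3000)."""
    return "".join(fragments).replace("\u3000\u3000", "\n     ")


def next_chapter_url(page_url, href):
    """Return the absolute URL of the next chapter, or None to stop."""
    # str.find returns -1 (not 1) when the substring is missing, so a
    # membership test is the less error-prone way to write this check.
    if not href or ".html" not in href:
        return None
    return urljoin(page_url, href)


# Hypothetical page URL and hrefs, for illustration only
page = "https://example.com/book/1/100.html"
print(next_chapter_url(page, "101.html"))  # relative link to a chapter
print(next_chapter_url(page, "/book/1/"))  # link back to the index page
print(next_chapter_url(page, None))        # no link found on the page
print(repr(clean_content(["\u3000\u3000First line.", "\u3000\u3000Second line."])))
```

Only the first call returns a URL; the other two return None, which is the point where the spider would stop following links.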

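The third step is to process the data and save it locally. Edit pipelines.py so that it receives the data passed from xiaoshuo.py and writes it to a file. The code is as follows: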

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class XiaoshuoTextPipeline:
    def open_spider(self, spider):
        self.file = open("wddf.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Fall back to an empty string when the XPath matched nothing
        title = item.get("title") or ""
        content = item.get("content") or ""
        print(title)
        info = title + "\n" + "     " + content + "\n"
        self.file.write(info)
        # Flush the write buffer so the chapter reaches the file immediately
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
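
Scrapy calls the three pipeline hooks in a fixed order: open_spider once at the start, process_item for every yielded item, and close_spider once at the end. The sketch below drives that sequence by hand, without Scrapy, to show what ends up in the file. ChapterFilePipeline, its path parameter and the sample items are hypothetical and exist only for this illustration.

```python
import os
import tempfile


class ChapterFilePipeline:
    """Standalone copy of the pipeline hooks; `path` is a hypothetical parameter."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields
        title = item.get("title") or ""
        content = item.get("content") or ""
        self.file.write(title + "\n" + "     " + content + "\n")
        self.file.flush()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()


# Drive the hooks in the same order Scrapy would
out_path = os.path.join(tempfile.mkdtemp(), "novel.txt")
pipeline = ChapterFilePipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Chapter 1", "content": "Text of chapter one."}, spider=None)
pipeline.process_item({"title": None, "content": "Untitled text."}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

The second item has no title, which is what the spider yields when the h1 XPath matches nothing; concatenating None with a string would raise a TypeError, so the sketch falls back to an empty string.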

Finally, run the crawl. There are two ways to do this. The first is to open cmd in the project directory and enter:

scrapy crawl xiaoshuo

The second is to create a new file, for example main.py, and run the following code:

from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "xiaoshuo"])

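That is all the code needed to scrape a novel with Scrapy. The approach was broken into steps along the way; if you have questions, feel free to leave a comment or message the author.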
