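Complete tutorial: scraping a novel website with Python (full code included)

This tutorial scrapes website data with Python, using the Scrapy framework to write the spider.

1. Install the Scrapy framework

Run on the command line: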

 pip install scrapy

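If Scrapy's dependencies conflict with other Python packages you already have installed, installing it inside a virtualenv is recommended.

Once it is installed, create the crawler project in any folder:

scrapy startproject <your project name>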

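Directory layout of the generated project:

The spider rules go in the spiders directory.

items.py: the data fields to scrape

pipelines.py: saves the scraped data

settings.py: project configuration

middlewares.py: downloader and spider middlewares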

Below is the source code for scraping a novel website.

First, define the fields to collect in items.py:

# author: 小白 <QQ group: 810735403>
import scrapy


class BookspiderItem(scrapy.Item):
    # define the fields for your item here like:
    i = scrapy.Field()
    book_name = scrapy.Field()
    book_img = scrapy.Field()
    book_author = scrapy.Field()
    book_last_chapter = scrapy.Field()
    book_last_time = scrapy.Field()
    book_list_name = scrapy.Field()
    book_content = scrapy.Field()

Next, write the scraping rules in a spider file under the spiders directory:

# author: 小白 <QQ group: 810735403>
import scrapy

from ..items import BookspiderItem


class Book(scrapy.Spider):
    name = "BookSpider"
    start_urls = ['http://www.xbiquge.la/xiaoshuodaquan/']

    def parse(self, response):
        # Each <li> in the first .novellist block links to one book's index page.
        book_all_list = response.css('.novellist:first-child>ul>li')
        for book in book_all_list:
            book_url = book.css('a::attr(href)').extract_first()
            if book_url:  # skip list entries that contain no link
                yield scrapy.Request(response.urljoin(book_url), callback=self.parse_book)

    def parse_book(self, response):
        # Book-level metadata, passed along to every chapter request through meta.
        book_info = {
            'book_name': response.css('#info>h1::text').extract_first(),
            'book_img': response.css('#fmimg>img::attr(src)').extract_first(),
            'book_author': response.css('#info p:nth-child(2)::text').extract_first(),
            'book_last_chapter': response.css('#info p:last-child::text').extract_first(),
            'book_last_time': response.css('#info p:nth-last-child(2)::text').extract_first(),
        }
        chapter_urls = response.css('#list>dl>dd>a::attr(href)').extract()
        # Record each chapter's position (starting at 1) so the files can be
        # saved in reading order.
        for i, chapter_url in enumerate(chapter_urls, start=1):
            yield scrapy.Request(
                response.urljoin(chapter_url),  # chapter links are site-relative
                meta={**book_info, 'i': i},
                callback=self.parse_chapter,
            )

    def parse_chapter(self, response):
        self.log(response.meta['book_name'])
        content = response.css('#content::text').extract()
        item = BookspiderItem()
        item['i'] = response.meta['i']
        item['book_name'] = response.meta['book_name']
        item['book_img'] = response.meta['book_img']
        item['book_author'] = response.meta['book_author']
        item['book_last_chapter'] = response.meta['book_last_chapter']
        item['book_last_time'] = response.meta['book_last_time']
        item['book_list_name'] = response.css('.bookname h1::text').extract_first()
        item['book_content'] = ''.join(content)
        yield item
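The chapter links collected from a book page are site-relative paths, so the spider has to turn them into absolute URLs (by prefixing the site root) before it can request them. The following is a minimal sketch of that resolution step using only the standard library's `urllib.parse.urljoin`. The book and chapter paths are made-up placeholders, not real pages on the site, and `to_absolute` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Base URL of a book's index page (hypothetical path, for illustration only).
book_page = "http://www.xbiquge.la/1/1234/"

# Hypothetical href values in the three shapes a chapter list might contain:
# root-relative, page-relative, and already absolute.
hrefs = [
    "/1/1234/5678.html",
    "5679.html",
    "http://www.xbiquge.la/1/1234/5680.html",
]


def to_absolute(base, href):
    """Resolve an href found on the page `base` into an absolute URL."""
    return urljoin(base, href)


chapter_urls = [to_absolute(book_page, h) for h in hrefs]
for url in chapter_urls:
    print(url)
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same kind of resolution against the URL of the current response, which avoids hard-coding the site root.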

Save the data in pipelines.py:

import os


class BookspiderPipeline(object):
    def process_item(self, item, spider):
        # Root output folder; change it to a location that suits your machine.
        cur_path = 'E:/小说/'
        # One sub-folder per book.
        target_path = os.path.join(cur_path, str(item['book_name']))
        os.makedirs(target_path, exist_ok=True)
        # Prefix the chapter title with its index so the files keep the book's order.
        book_list_name = str(item['i']) + str(item['book_list_name'])
        filename_path = os.path.join(target_path, book_list_name + '.txt')
        print('------------')
        print(filename_path)
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
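Scrapy only runs a pipeline that has been enabled in the project's settings.py, a step the walkthrough does not show. A sketch of that entry follows, assuming the project was created under the name `bookspider`. That module path is an assumption, so replace it with your own project name.

```python
# settings.py (fragment)
# Assumption: the project package is named "bookspider"; use your own project name.
ITEM_PIPELINES = {
    # The integer (conventionally 0-1000) sets the run order when several
    # pipelines are enabled; lower values run first.
    "bookspider.pipelines.BookspiderPipeline": 300,
}
```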

Run the spider:

scrapy crawl BookSpider

That completes a working crawler for a novel site.
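Because Scrapy downloads pages concurrently, chapter files are written in whatever order the responses arrive. The numeric index that the pipeline puts at the front of each file name is what preserves the reading order. Below is a rough sketch of sorting saved files by that prefix, for instance before merging them into a single file. The helper names and the sample file names are hypothetical.

```python
import re


def chapter_index(filename):
    """Return the leading integer of a name like '12Chapter.txt', or -1 if none."""
    match = re.match(r"\d+", filename)
    return int(match.group()) if match else -1


def sort_chapters(filenames):
    """Sort chapter file names by their numeric prefix rather than alphabetically."""
    return sorted(filenames, key=chapter_index)


# Hypothetical names in the shape the pipeline writes: index + chapter title.
names = ["10Chapter Ten.txt", "2Chapter Two.txt", "1Chapter One.txt"]
ordered = sort_chapters(names)
print(ordered)
```

A plain alphabetical sort would place "10..." before "2...", which is why the prefix is parsed as an integer. If a chapter title itself begins with a digit the prefix becomes ambiguous, so putting a separator such as an underscore between index and title when saving would make this more robust.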

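While writing the rules, it helps to use the Scrapy shell:

scrapy shell <URL of the page to scrape>

Then run response.css('...') with your selector, for example response.css('#info>h1::text').extract_first(), to check that the rule matches what you expect.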

