Scraping Qidian Novel Data with Scrapy and Importing It into MongoDB

This post walks through using Scrapy to scrape data and store it in a MongoDB database. First, the data we want to scrape:

All works listed on Qidian, at: https://www.qidian.com/all
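The listing is paginated, and the spider later in this post requests it one page at a time through the `page` query parameter. A minimal sketch of building those page URLs with the standard library follows. It assumes the query parameters used by the spider are still what the site expects, and the helper name `build_page_urls` is made up for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.qidian.com/all"


def build_page_urls(first_page, last_page):
    """Return listing URLs for pages first_page..last_page, inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        # Same parameters, in the same order, as the URL used by the spider.
        query = urlencode({
            "orderId": "",
            "style": 1,
            "pageSize": 20,
            "siteid": 1,
            "pubflag": 0,
            "hiddenField": 0,
            "page": page,
        })
        urls.append(BASE_URL + "?" + query)
    return urls
```

Calling `build_page_urls(1, 5)` yields five URLs, one per listing page, which could be assigned to a spider's `start_urls`.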

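For downloading and installing Scrapy, see the previous post, "Scrapy简单案例" (a simple Scrapy example).

For downloading and installing MongoDB, see the post "MongoDB安装" (installing MongoDB).

The relevant code follows.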

(1) Item class, items.py

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    import scrapy


    class NovelItem(scrapy.Item):
        link = scrapy.Field()      # URL of the book page
        category = scrapy.Field()
        bookname = scrapy.Field()
        author = scrapy.Field()
        content = scrapy.Field()   # introduction text

(2) Spider

    # -*- coding: utf-8 -*-
    import scrapy
    from novel.items import NovelItem


    class SolveSpider(scrapy.Spider):
        name = "solve"
        allowed_domains = ["qidian.com"]
        # Crawl only 5 listing pages; range(1, 6) covers pages 1 through 5.
        start_urls = [
            "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1"
            "&pubflag=0&hiddenField=0&page=" + str(x)
            for x in range(1, 6)
        ]

        def parse(self, response):
            # Each <li> in the listing holds one novel.
            novels = response.xpath('//ul[@class="all-img-list cf"]/li')
            for each in novels:
                item = NovelItem()
                part = each.xpath('./div[@class="book-mid-info"]')
                item['bookname'] = part.xpath('./h4/a/text()').extract()[0]
                item['link'] = part.xpath('./h4/a/@href').extract()[0]
                item['author'] = part.xpath('./p[@class="author"]/a[@class="name"]/text()').extract()[0]
                # The second link in the author line is the category.
                item['category'] = part.xpath('./p[@class="author"]/a/text()').extract()[1]
                item['content'] = part.xpath('./p[@class="intro"]/text()').extract()[0]
                yield item

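To see what the XPath expressions in `parse` select, the sketch below runs equivalent paths over a small hand-written fragment. It uses `xml.etree.ElementTree` from the standard library rather than Scrapy's selectors, so it runs without a crawl. The markup is hypothetical: it mirrors only the class names the spider relies on, not the real page. ElementTree supports only a subset of XPath, so text and attributes are read from the matched elements instead of through `text()` and `@href`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one entry of the listing page.
SAMPLE = """
<html><body>
<ul class="all-img-list cf">
  <li>
    <div class="book-mid-info">
      <h4><a href="//book.example/1">Sample Book</a></h4>
      <p class="author"><a class="name">Sample Author</a><a>Fantasy</a></p>
      <p class="intro">A short introduction.</p>
    </div>
  </li>
</ul>
</body></html>
"""


def parse_listing(markup):
    """Extract the same five fields the spider stores, one dict per <li>."""
    root = ET.fromstring(markup.strip())
    items = []
    for li in root.findall('.//ul[@class="all-img-list cf"]/li'):
        part = li.find('./div[@class="book-mid-info"]')
        title_link = part.find('./h4/a')
        author_links = part.findall('./p[@class="author"]/a')
        items.append({
            "bookname": title_link.text,
            "link": title_link.get("href"),
            "author": part.find('./p[@class="author"]/a[@class="name"]').text,
            # Second link in the author line, as in the spider.
            "category": author_links[1].text,
            "content": part.find('./p[@class="intro"]').text,
        })
    return items
```

Note that the category is taken by position (the second link), so it depends on the order of the links in the author line.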
(3) Pipeline, pipelines.py

    import pymongo


    class MongoDBPipeline(object):

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # Read the connection settings from the configuration file.
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DB'),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
            self.collection = self.db["novel"]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # insert() was removed in PyMongo 4; insert_one() replaces it.
            self.collection.insert_one(dict(item))
            print("Inserted successfully")
            return item

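The pipeline above inserts a new document for every item, so running the crawl twice stores each book twice. One possible refinement, sketched under the assumption that `link` uniquely identifies a book, is to upsert on that field with PyMongo's `update_one`. The helper name `upsert_item` is invented for this example, and `collection` is expected to be a PyMongo collection such as `self.collection` in the pipeline.

```python
def upsert_item(collection, item):
    """Insert the item, or update the stored copy that has the same link."""
    doc = dict(item)
    collection.update_one(
        {"link": doc["link"]},  # match on the book URL (assumed unique)
        {"$set": doc},          # overwrite the stored fields
        upsert=True,            # insert when no document matches
    )
    return doc
```

Inside `process_item`, a call such as `upsert_item(self.collection, item)` could take the place of the plain insert.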
(4) Configuration file

    BOT_NAME = 'novel'

    SPIDER_MODULES = ['novel.spiders']
    NEWSPIDER_MODULE = 'novel.spiders'

    # The path must name the pipeline class defined above (MongoDBPipeline).
    ITEM_PIPELINES = {
        'novel.pipelines.MongoDBPipeline': 100,
    }

    MONGO_URI = "192.168.177.13"
    MONGO_DB = "novels"
    MONGO_COLLECTION = "novel"

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'novel (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Do not verify SSL certificates
    DOWNLOAD_HANDLERS_BASE = {
        'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
        'http': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
        'https': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
        's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    }

(5) Query results
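One way to inspect what the crawl stored is to read a few documents back with PyMongo. The sketch below assumes the host, database and collection names used in this post (`192.168.177.13`, `novels`, `novel`) and a reachable MongoDB server. `format_novel` and `show_sample` are hypothetical helper names.

```python
def format_novel(doc):
    """Render one stored document as a single summary line."""
    return "{bookname} | {author} | {category}".format(**doc)


def show_sample(limit=5):
    """Print a few stored novels; needs pymongo and a running server."""
    import pymongo  # imported here so format_novel works without it

    client = pymongo.MongoClient("192.168.177.13")
    try:
        collection = client["novels"]["novel"]
        # Exclude _id so only the scraped fields are returned.
        for doc in collection.find({}, {"_id": 0}).limit(limit):
            print(format_novel(doc))
    finally:
        client.close()
```

Running `show_sample()` after a crawl prints one line per stored book, up to `limit` lines.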

Original post: https://blog.csdn.net/qq_16669583/article/details/91611823
