How to crawl three levels of links on the Zongheng novel site with Scrapy in Python and store the results in a database

Results
settings.py

```python
# Scrapy settings for zongheng project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zongheng'

SPIDER_MODULES = ['zongheng.spiders']
NEWSPIDER_MODULE = 'zongheng.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zongheng (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zongheng.middlewares.ZonghengSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zongheng.middlewares.ZonghengDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zongheng.pipelines.ZonghengPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Custom setting: database connection info, read by the pipeline in open_spider()
DATABASE_CONFIG = {
    "type": "mysql",
    "config": {
        "host": "127.0.0.1",
        "port": 3306,
        "user": "root",
        "password": "123456",
        "db": "xiao",
        "charset": "utf8",
    },
}

LOG_FILE = 'aa.log'  # write the crawl log to this file
```
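`DATABASE_CONFIG` is not a built-in Scrapy setting; the pipeline reads it through `spider.settings` and expands the inner `"config"` dict into keyword arguments with `pymysql.connect(**data_config["config"])`. A minimal, dependency-free sketch of that unpacking (`fake_connect` is a hypothetical stand-in for `pymysql.connect`):

```python
# Hypothetical stand-in for pymysql.connect, used only to illustrate
# how the inner "config" dict is expanded into keyword arguments.
def fake_connect(host=None, port=0, user=None, password="", db=None, charset="utf8"):
    return f"mysql://{user}@{host}:{port}/{db}?charset={charset}"

DATABASE_CONFIG = {
    "type": "mysql",
    "config": {
        "host": "127.0.0.1",
        "port": 3306,
        "user": "root",
        "password": "123456",
        "db": "xiao",
        "charset": "utf8",
    },
}

# The ** operator maps each key of the inner dict onto a parameter name,
# exactly as pymysql.connect(**data_config["config"]) does in the pipeline.
dsn = fake_connect(**DATABASE_CONFIG["config"])
print(dsn)  # mysql://root@127.0.0.1:3306/xiao?charset=utf8
```

Because the keys must match the connect function's parameter names, a typo in the settings dict (e.g. `"database"` instead of `"db"`) would surface as a `TypeError` at `open_spider` time.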
zh.py

```python
# -*- coding: utf-8 -*-
import datetime

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import NovelItem, ChapterItem, ContentItem


class ZhSpider(CrawlSpider):
    name = 'zh'
    allowed_domains = ['book.zongheng.com']
    # Starting url: the store's book list page
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']

    # Crawl rules: 1. extract urls (LinkExtractor)  2. build requests  3. handle responses
    rules = (
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/book/\d+.html',
                           restrict_xpaths='//div[@class="bookname"]'),
             callback='parse_book', follow=True, process_links="process_booklink"),
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/showchapter/\d+.html'),
             callback='parse_catalog', follow=True),
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/chapter/\d+/\d+.html',
                           restrict_xpaths='//ul[@class="chapter-list clearfix"]'),
             callback='get_content', follow=False, process_links="process_chpter"),
    )

    def process_booklink(self, links):
        # Post-process the urls the LinkExtractor found: keep only the first three books
        for index, link in enumerate(links):
            if index <= 2:
                yield link
            else:
                return

    def process_chpter(self, links):
        # Keep only the first six chapters of each book
        for index, link in enumerate(links):
            if index <= 5:
                yield link
            else:
                return

    def parse_book(self, response):
        category = response.xpath('//div[@class="book-label"]/a/text()').extract()[1]
        book_name = response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()
        author = response.xpath('//div[@class="au-name"]/a/text()').extract()[0]
        status = response.xpath('//div[@class="book-label"]/a/text()').extract()[0]
        book_nums = response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]
        description = ''.join(
            response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').re(r"\S+"))
        c_time = datetime.datetime.now()
        book_url = response.url
        catalog_url = response.css("a").re(r'http://book.zongheng.com/showchapter/\d+.html')[0]

        item = NovelItem()
        item["category"] = category
        item["book_name"] = book_name
        item["author"] = author
        item["status"] = status
        item["book_nums"] = book_nums
        item["description"] = description
        item["c_time"] = c_time
        item["book_url"] = book_url
        item["catalog_url"] = catalog_url
        yield item

    def parse_catalog(self, response):
        a_tags = response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list = []
        catalog_url = response.url
        for a in a_tags:
            title = a.xpath("./text()").extract()[0]
            chapter_url = a.xpath("./@href").extract()[0]
            chapter_list.append((title, chapter_url, catalog_url))
        item = ChapterItem()
        item["chapter_list"] = chapter_list
        yield item

    def get_content(self, response):
        chapter_url = response.url
        content = ''.join(response.xpath('//div[@class="content"]/p/text()').extract())
        # Pass the data on to the pipeline
        item = ContentItem()
        item["chapter_url"] = chapter_url
        item["content"] = content
        yield item
```
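The spider walks three levels: the store list page yields book detail pages (`/book/<id>.html`), each book page yields its chapter catalog (`/showchapter/<id>.html`), and the catalog yields individual chapters (`/chapter/<bookid>/<chapterid>.html`). The `allow=` patterns of the three Rules can be sanity-checked in isolation with `re` (the sample IDs below are made up):

```python
import re

# The same allow= patterns used in the spider's three Rules
BOOK = r'http://book.zongheng.com/book/\d+.html'
CATALOG = r'http://book.zongheng.com/showchapter/\d+.html'
CHAPTER = r'http://book.zongheng.com/chapter/\d+/\d+.html'

# Hypothetical sample URLs, one per crawl level
samples = {
    BOOK: 'http://book.zongheng.com/book/123456.html',
    CATALOG: 'http://book.zongheng.com/showchapter/123456.html',
    CHAPTER: 'http://book.zongheng.com/chapter/123456/7891011.html',
}

# Each pattern matches its own level...
for pattern, url in samples.items():
    assert re.match(pattern, url), (pattern, url)

# ...and a chapter URL is not mistaken for a book URL, so each Rule
# fires only on its own level of the site.
assert not re.match(BOOK, samples[CHAPTER])
```

Since the patterns are disjoint, the order of the Rules does not matter here; each response is routed to exactly one callback.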
items.py

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ZonghengItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class NovelItem(scrapy.Item):
    category = scrapy.Field()
    book_name = scrapy.Field()
    author = scrapy.Field()
    status = scrapy.Field()
    book_nums = scrapy.Field()
    description = scrapy.Field()
    c_time = scrapy.Field()
    book_url = scrapy.Field()
    catalog_url = scrapy.Field()


class ChapterItem(scrapy.Item):
    chapter_list = scrapy.Field()
    catalog_url = scrapy.Field()


class ContentItem(scrapy.Item):
    content = scrapy.Field()
    chapter_url = scrapy.Field()
```
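`ChapterItem` carries the whole catalog as one field: a list of `(title, chapter_url, catalog_url)` tuples built in `parse_catalog`. The pipeline later expands that list into numbered rows for a bulk insert. A minimal sketch of that expansion, using made-up URLs:

```python
import datetime

# ChapterItem["chapter_list"] holds (title, chapter_url, catalog_url) tuples;
# these sample values are hypothetical.
chapter_list = [
    ("Chapter 1", "http://book.zongheng.com/chapter/1/11.html",
     "http://book.zongheng.com/showchapter/1.html"),
    ("Chapter 2", "http://book.zongheng.com/chapter/1/12.html",
     "http://book.zongheng.com/showchapter/1.html"),
]

rows = []
for index, (title, chapter_url, catalog_url) in enumerate(chapter_list):
    ordernum = index + 1  # 1-based chapter order within the book
    c_time = datetime.datetime.now()
    rows.append((title, ordernum, c_time, chapter_url, catalog_url))

# rows now has exactly the shape executemany() consumes in the pipeline:
# one 5-tuple per chapter row.
assert [r[1] for r in rows] == [1, 2]
```

Packing the whole catalog into one item means one `executemany()` call per book instead of one `INSERT` per chapter.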
pipelines.py

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import datetime

import pymysql
from scrapy.exceptions import DropItem

from .items import NovelItem, ChapterItem, ContentItem


class ZonghengPipeline(object):
    # Open the database connection when the spider starts
    def open_spider(self, spider):
        data_config = spider.settings["DATABASE_CONFIG"]
        if data_config["type"] == "mysql":
            self.conn = pymysql.connect(**data_config["config"])
            self.cursor = self.conn.cursor()
            spider.conn = self.conn
            spider.cursor = self.cursor

    # Store the scraped data
    def process_item(self, item, spider):
        # 1. Novel metadata
        if isinstance(item, NovelItem):
            # Make sure this novel is not already in the table
            sql = "select id from novel where book_name=%s and author=%s"
            self.cursor.execute(sql, (item["book_name"], item["author"]))
            if not self.cursor.fetchone():
                # Not found, so insert the novel's data
                sql = ("insert into novel(category,book_name,author,status,book_nums,"
                       "description,c_time,book_url,catalog_url)"
                       "values (%s,%s,%s,%s,%s,%s,%s,%s,%s)")
                self.cursor.execute(sql, (
                    item["category"], item["book_name"], item["author"],
                    item["status"], item["book_nums"], item["description"],
                    item["c_time"], item["book_url"], item["catalog_url"],
                ))
                self.conn.commit()
            return item
        # 2. Chapter catalog
        elif isinstance(item, ChapterItem):
            sql = ("insert into chapter(title,ordernum,c_time,chapter_url,catalog_url)"
                   " values(%s,%s,%s,%s,%s)")
            data_list = []
            for index, chapter in enumerate(item["chapter_list"]):
                c_time = datetime.datetime.now()
                ordernum = index + 1
                title, chapter_url, catalog_url = chapter
                data_list.append((title, ordernum, c_time, chapter_url, catalog_url))
            # Bulk insert: one call for the whole list of row tuples
            self.cursor.executemany(sql, data_list)
            self.conn.commit()
            return item
        # 3. Chapter content: fill in the content column of the existing chapter row
        elif isinstance(item, ContentItem):
            sql = "update chapter set content=%s where chapter_url=%s"
            self.cursor.execute(sql, (item["content"], item["chapter_url"]))
            self.conn.commit()
            return item
        else:
            raise DropItem(item)

    # Close the database connection when the spider finishes
    def close_spider(self, spider):
        data_config = spider.settings["DATABASE_CONFIG"]
        if data_config["type"] == "mysql":
            self.cursor.close()
            self.conn.close()
```
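The pipeline assumes that a `novel` table and a `chapter` table already exist in the `xiao` database. The `INSERT` and `UPDATE` statements above imply roughly the following schema; this is a sketch in which only the column names come from the code, while the types, lengths, and keys are assumptions:

```sql
CREATE TABLE novel (
    id INT PRIMARY KEY AUTO_INCREMENT,
    category VARCHAR(255),
    book_name VARCHAR(255),
    author VARCHAR(255),
    status VARCHAR(255),
    book_nums VARCHAR(255),
    description TEXT,
    c_time DATETIME,
    book_url VARCHAR(255),
    catalog_url VARCHAR(255)
) DEFAULT CHARSET = utf8;

CREATE TABLE chapter (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    ordernum INT,
    c_time DATETIME,
    chapter_url VARCHAR(255),
    catalog_url VARCHAR(255),
    -- filled in later by the ContentItem UPDATE
    content LONGTEXT
) DEFAULT CHARSET = utf8;
```

Because `ContentItem` rows are matched by `chapter_url`, an index (or unique key) on `chapter.chapter_url` would speed up the `UPDATE` step considerably once the table grows.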