Results

settings.py

# Scrapy settings for zongheng project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zongheng'

SPIDER_MODULES = ['zongheng.spiders']
NEWSPIDER_MODULE = 'zongheng.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zongheng (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zongheng.middlewares.ZonghengSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zongheng.middlewares.ZonghengDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zongheng.pipelines.ZonghengPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

DATABASE_CONFIG = {
    "type": "mysql",
    "config": {
        "host": "127.0.0.1",
        "port": 3306,
        "user": "root",
        "password": "123456",
        "db": "xiao",
        "charset": "utf8"
    }
}
# Write the crawl log to this file
LOG_FILE = 'aa.log'
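DATABASE_CONFIG is not a built-in Scrapy setting; it is a custom entry that the pipeline reads at runtime through spider.settings, as pipelines.py shows further below. A pipeline can also receive it through the from_crawler hook; the following is a minimal sketch of that alternative (the class name SettingsAwarePipeline is only an illustration, not part of this project):

# Sketch: reading the custom DATABASE_CONFIG setting via from_crawler,
# an alternative to the spider.settings lookup used in pipelines.py below.
class SettingsAwarePipeline:
    def __init__(self, db_config):
        self.db_config = db_config

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py,
        # including custom keys such as DATABASE_CONFIG
        return cls(crawler.settings.get("DATABASE_CONFIG"))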
zh.py

# -*- coding: utf-8 -*-
import datetime

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import NovelItem, ChapterItem, ContentItem


class ZhSpider(CrawlSpider):
    name = 'zh'
    allowed_domains = ['book.zongheng.com']
    # Starting URL: the first page of the novel store list
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']

    # Crawl rules: 1. extract URLs (LinkExtractor)  2. build requests  3. handle the responses
    rules = (
        # Book detail pages
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/book/\d+.html',
                           restrict_xpaths='//div[@class="bookname"]'),
             callback='parse_book', follow=True, process_links="process_booklink"),
        # Chapter catalog pages
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/showchapter/\d+.html'),
             callback='parse_catalog', follow=True),
        # Chapter content pages
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/chapter/\d+/\d+.html',
                           restrict_xpaths='//ul[@class="chapter-list clearfix"]'),
             callback='get_content', follow=False, process_links="process_chpter"),
    )

    def process_booklink(self, links):
        # Filter the URLs extracted by the LinkExtractor: keep only the first three books
        for index, link in enumerate(links):
            if index <= 2:
                yield link
            else:
                return

    def process_chpter(self, links):
        # Keep only the first six chapter links per book
        for index, link in enumerate(links):
            if index <= 5:
                yield link
            else:
                return

    def parse_book(self, response):
        category = response.xpath('//div[@class="book-label"]/a/text()').extract()[1]
        book_name = response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()
        author = response.xpath('//div[@class="au-name"]/a/text()').extract()[0]
        status = response.xpath('//div[@class="book-label"]/a/text()').extract()[0]
        book_nums = response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]
        description = ''.join(response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').re(r"\S+"))
        c_time = datetime.datetime.now()
        book_url = response.url
        catalog_url = response.css("a").re(r'http://book.zongheng.com/showchapter/\d+.html')[0]

        item = NovelItem()
        item["category"] = category
        item["book_name"] = book_name
        item["author"] = author
        item["status"] = status
        item["book_nums"] = book_nums
        item["description"] = description
        item["c_time"] = c_time
        item["book_url"] = book_url
        item["catalog_url"] = catalog_url
        yield item

    def parse_catalog(self, response):
        a_tags = response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list = []
        catalog_url = response.url
        for a in a_tags:
            title = a.xpath("./text()").extract()[0]
            chapter_url = a.xpath("./@href").extract()[0]
            chapter_list.append((title, chapter_url, catalog_url))

        item = ChapterItem()
        item["chapter_list"] = chapter_list
        yield item

    def get_content(self, response):
        chapter_url = response.url
        content = ''.join(response.xpath('//div[@class="content"]/p/text()').extract())
        c_time = datetime.datetime.now()

        # Pass the data to the pipeline
        item = ContentItem()
        item["chapter_url"] = chapter_url
        item["content"] = content
        yield item
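To run the spider, execute scrapy crawl zh from the project root, or use a small runner script like the sketch below (the file name run.py and the import path zongheng.spiders.zh are assumptions based on the project layout shown in settings.py):

# run.py -- minimal runner sketch; equivalent to "scrapy crawl zh".
# Run it from the project root so get_project_settings() can find scrapy.cfg.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zongheng.spiders.zh import ZhSpider  # import path assumed from SPIDER_MODULES

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())  # loads settings.py
    process.crawl(ZhSpider)
    process.start()  # blocks until the crawl finishes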
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ZonghengItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class NovelItem(scrapy.Item):
    # Book-level information scraped from a book's detail page
    category = scrapy.Field()
    book_name = scrapy.Field()
    author = scrapy.Field()
    status = scrapy.Field()
    book_nums = scrapy.Field()
    description = scrapy.Field()
    c_time = scrapy.Field()
    book_url = scrapy.Field()
    catalog_url = scrapy.Field()


class ChapterItem(scrapy.Item):
    # Catalog information: a list of (title, chapter_url, catalog_url) tuples
    chapter_list = scrapy.Field()
    catalog_url = scrapy.Field()


class ContentItem(scrapy.Item):
    # Body text of a single chapter
    content = scrapy.Field()
    chapter_url = scrapy.Field()
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import datetime

import pymysql
from scrapy.exceptions import DropItem

from .items import NovelItem, ChapterItem, ContentItem


class ZonghengPipeline(object):

    # Open the database connection when the spider starts
    def open_spider(self, spider):
        data_config = spider.settings["DATABASE_CONFIG"]
        print("database config:", data_config)
        if data_config["type"] == "mysql":
            self.conn = pymysql.connect(**data_config["config"])
            self.cursor = self.conn.cursor()
            spider.conn = self.conn
            spider.cursor = self.cursor

    # Store the scraped data
    def process_item(self, item, spider):
        # 1. Novel information
        if isinstance(item, NovelItem):
            # Make sure the book is not already in the database
            sql = "select id from novel where book_name=%s and author=%s"
            self.cursor.execute(sql, (item["book_name"], item["author"]))
            if not self.cursor.fetchone():
                # Not found, so insert the novel record
                sql = ("insert into novel(category,book_name,author,status,book_nums,"
                       "description,c_time,book_url,catalog_url) "
                       "values (%s,%s,%s,%s,%s,%s,%s,%s,%s)")
                self.cursor.execute(sql, (
                    item["category"], item["book_name"], item["author"], item["status"],
                    item["book_nums"], item["description"], item["c_time"],
                    item["book_url"], item["catalog_url"],
                ))
                self.conn.commit()
            return item

        # 2. Chapter (catalog) information
        elif isinstance(item, ChapterItem):
            sql = ("insert into chapter(title,ordernum,c_time,chapter_url,catalog_url) "
                   "values(%s,%s,%s,%s,%s)")
            data_list = []
            for index, chapter in enumerate(item["chapter_list"]):
                c_time = datetime.datetime.now()
                ordernum = index + 1
                title, chapter_url, catalog_url = chapter
                data_list.append((title, ordernum, c_time, chapter_url, catalog_url))
            self.cursor.executemany(sql, data_list)
            self.conn.commit()
            return item

        # 3. Chapter content
        elif isinstance(item, ContentItem):
            sql = "update chapter set content=%s where chapter_url=%s"
            self.cursor.execute(sql, (item["content"], item["chapter_url"]))
            self.conn.commit()
            return item

        else:
            # Unknown item type: drop it
            raise DropItem("unexpected item type: %r" % item)

    # Close the database connection when the spider finishes
    def close_spider(self, spider):
        data_config = spider.settings["DATABASE_CONFIG"]  # database settings from settings.py
        if data_config["type"] == "mysql":
            self.cursor.close()
            self.conn.close()
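The pipeline assumes the xiao database already contains novel and chapter tables; the original post does not show their DDL. The script below is a sketch reconstructed only from the columns referenced in the SQL above, with column types and lengths guessed:

# init_db.py -- sketch: create the tables the pipeline writes to.
# Column names come from the INSERT/UPDATE statements above; types are assumptions.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                       password="123456", db="xiao", charset="utf8")
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS novel (
            id INT PRIMARY KEY AUTO_INCREMENT,
            category VARCHAR(50),
            book_name VARCHAR(255),
            author VARCHAR(100),
            status VARCHAR(50),
            book_nums VARCHAR(50),
            description TEXT,
            c_time DATETIME,
            book_url VARCHAR(255),
            catalog_url VARCHAR(255)
        )""")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS chapter (
            id INT PRIMARY KEY AUTO_INCREMENT,
            title VARCHAR(255),
            ordernum INT,
            c_time DATETIME,
            chapter_url VARCHAR(255),
            catalog_url VARCHAR(255),
            content LONGTEXT
        )""")
conn.commit()
conn.close()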
