With some free time on my hands, I used Python's Scrapy framework for practice and crawled the detailed information of every novel on the Dingdian novel site (顶点小说网).

First, take a look at how the page is structured:

The td cells inside each tr tag hold the information we want to crawl.
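For illustration, this is roughly how such a row looks to Scrapy's selector. The markup below is a mock-up I inferred from the XPath expressions used in the spider further down (td[1] holds the links, td[3] the author, and so on); the real page on www.x23us.com may differ in its details.

from scrapy.selector import Selector

# Mock-up of one list-page row, shaped the way the spider's XPaths expect it.
row_html = """
<table>
  <tr>
    <td><a href="https://www.x23us.com/book/1234">cover</a><a href="https://www.x23us.com/book/1234">Some Novel</a></td>
    <td><a href="https://www.x23us.com/book/1234/chapter">Chapter 100</a></td>
    <td>Some Author</td>
    <td>1024K</td>
    <td>2018-11-24</td>
    <td>ongoing</td>
  </tr>
</table>
"""

row = Selector(text=row_html).xpath('//tr')[0]
print(row.xpath('td[1]/a/@href').extract_first())      # novel URL
print(row.xpath('td[1]/a[2]/text()').extract_first())  # novel name
print(row.xpath('td[3]/text()').extract_first())       # author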

Below is the second-level page we also want to crawl, which carries each novel's introduction:

Here is the code:

mydingdian.py

import scrapy
from scrapy.http import Request
from ..items import DingdianItem


class MydingdianSpider(scrapy.Spider):
    name = 'mydingdian'
    allowed_domains = ['www.x23us.com']
    start_url = ['https://www.x23us.com/class/']
    starturl = ['.html']

    def start_requests(self):
        # for i in range(1, 11):
        for i in range(5, 6):
            url_con = str(i) + '_1'
            url1 = self.start_url + list(url_con) + self.starturl
            url = ''
            for j in url1:
                url += j
            yield Request(url, self.parse)

    def parse(self, response):
        baseurl = response.url  # the actual URL that was fetched
        max_num = response.xpath('//*[@id="pagelink"]/a[14]/text()').extract_first()  # largest page number for this category
        baseurl = baseurl[:-7]
        for num in range(1, int(max_num) + 1):
            # for num in range(1, 5):
            newurl1 = list(baseurl) + list("_" + str(num)) + self.starturl
            newurl = ''
            for j in newurl1:
                newurl += j
            print(newurl)
            # The behaviour differs with and without dont_filter: with it the first page is
            # crawled as well, without it that page is lost.
            # Scrapy deduplicates request URLs (RFPDupeFilter); dont_filter tells it that this
            # URL should not take part in deduplication.
            yield Request(newurl, dont_filter=True, callback=self.get_name)  # hand the new page URL to get_name

    def get_name(self, response):
        item = DingdianItem()
        for nameinfo in response.xpath('//tr'):
            novelurl = nameinfo.xpath('td[1]/a/@href').extract_first()        # novel URL
            name = nameinfo.xpath('td[1]/a[2]/text()').extract_first()        # novel name
            newchapter = nameinfo.xpath('td[2]/a/text()').extract_first()     # latest chapter
            date = nameinfo.xpath('td[5]/text()').extract_first()             # update date
            author = nameinfo.xpath('td[3]/text()').extract_first()           # author
            serialstatus = nameinfo.xpath('td[6]/text()').extract_first()     # serialization status
            serialsize = nameinfo.xpath('td[4]/text()').extract_first()       # novel size
            if novelurl:
                item['novel_name'] = name
                item['author'] = author
                item['novelurl'] = novelurl
                item['serialstatus'] = serialstatus
                item['serialsize'] = serialsize
                item['date'] = date
                item['newchapter'] = newchapter
                print('Novel name:', item['novel_name'])
                print('Author:', item['author'])
                print('Novel URL:', item['novelurl'])
                print('Status:', item['serialstatus'])
                print('Size:', item['serialsize'])
                print('Update date:', item['date'])
                print('Latest chapter:', item['newchapter'])
                print('====' * 5)
                # yield Request(novelurl, dont_filter=True, callback=self.get_novelcontent, meta={'item': item})
                yield item

    '''
    def get_novelcontent(self, response):
        item = response.meta['item']
        novelurl = response.url
        serialnumber = response.xpath('//tr[2]/td[2]/text()').extract_first()       # word count
        category = response.xpath('//tr[1]/td[1]/a/text()').extract_first()         # category
        collect_num_total = response.xpath('//tr[2]/td[1]/text()').extract_first()  # total bookmarks
        click_num_total = response.xpath('//tr[3]/td[1]/text()').extract_first()    # total clicks
        novel_breif = response.xpath('//dd[2]/p[2]').extract_first()                # novel synopsis
        # item['serialnumber'] = serialnumber
        # item['category'] = category
        # item['collect_num_total'] = collect_num_total
        # item['click_num_total'] = click_num_total
        # item['novel_breif'] = novel_breif
        yield item
    '''
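As a side note, building the page URLs by converting strings to lists and concatenating them character by character (as start_requests and parse do above) works but is clumsy. A simpler equivalent, assuming the same /class/<category>_<page>.html pattern the spider already relies on, would be:

# Sketch: the same URLs as above, built with plain string formatting.
def class_page_url(category, page):
    return 'https://www.x23us.com/class/{}_{}.html'.format(category, page)

# class_page_url(5, 1) -> 'https://www.x23us.com/class/5_1.html'
# which is exactly what start_requests yields for i = 5.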

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DingdianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    novel_name = scrapy.Field()         # novel name
    author = scrapy.Field()             # author
    novelurl = scrapy.Field()           # novel URL
    serialstatus = scrapy.Field()       # serialization status
    serialsize = scrapy.Field()         # size
    date = scrapy.Field()               # update date
    newchapter = scrapy.Field()         # latest chapter

    serialnumber = scrapy.Field()       # word count
    category = scrapy.Field()           # category
    collect_num_total = scrapy.Field()  # total bookmarks
    click_num_total = scrapy.Field()    # total clicks
    novel_breif = scrapy.Field()        # novel synopsis
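Scrapy Item objects behave like dictionaries, which is exactly what the spider (item['novel_name'] = name) and the database pipeline below rely on. A quick sanity check in a Python shell, assuming the project layout above:

from dingdian.items import DingdianItem

item = DingdianItem(novel_name='test', author='someone')
print(item['novel_name'])  # field access works like a dict
print(dict(item))          # {'novel_name': 'test', 'author': 'someone'}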

The pipeline that inserts into the database: iopipelines.py

from 爬虫大全.dingdian.dingdian import dbutil

# Exercise: a custom pipeline that saves the complete crawled data into a MySQL database
class DingdianPipeline(object):
    def process_item(self, item, spider):
        dbu = dbutil.MYSQLdbUtil()
        dbu.getConnection()  # open the connection / start a transaction
        try:
            sql = ("insert into ebook (novel_name,author,novelurl,serialstatus,"
                   "serialsize,ebookdate,newchapter) values (%s,%s,%s,%s,%s,%s,%s)")
            dbu.execute(sql, (item['novel_name'], item['author'], item['novelurl'],
                              item['serialstatus'], item['serialsize'],
                              item['date'], item['newchapter']), True)
            dbu.commit()
            print('Inserted into the database successfully!!')
        except Exception:
            dbu.rollback()
            dbu.commit()  # commit after the rollback
        finally:
            dbu.close()
        return item
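The dbutil helper imported above is specific to the author's project and is not shown in this post. For reference, a rough equivalent written directly against pymysql might look like the sketch below; the class name, connection parameters and database name are placeholders, not part of the original code.

import pymysql


class DingdianMySQLPipeline(object):
    def open_spider(self, spider):
        # Placeholder credentials -- adjust to your own MySQL setup.
        self.conn = pymysql.connect(host='localhost', user='root', password='root',
                                    database='spiderdb', charset='utf8mb4')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        sql = ("insert into ebook (novel_name,author,novelurl,serialstatus,"
               "serialsize,ebookdate,newchapter) values (%s,%s,%s,%s,%s,%s,%s)")
        try:
            with self.conn.cursor() as cursor:
                cursor.execute(sql, (item['novel_name'], item['author'], item['novelurl'],
                                     item['serialstatus'], item['serialsize'],
                                     item['date'], item['newchapter']))
            self.conn.commit()
        except Exception:
            self.conn.rollback()
        return item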

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for dingdian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'dingdian'

SPIDER_MODULES = ['dingdian.spiders']
NEWSPIDER_MODULE = 'dingdian.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'dingdian (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 2

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'dingdian.middlewares.DingdianSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'dingdian.middlewares.DingdianDownloaderMiddleware': 543,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
   'dingdian.rotate_useragent.RotateUserAgentMiddleware': 400,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   #'dingdian.pipelines.DingdianPipeline': 300,
   'dingdian.iopipelines.DingdianPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

LOG_LEVEL = 'INFO'
LOG_FILE = 'dingdian.log'
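Note that DOWNLOADER_MIDDLEWARES references a custom dingdian.rotate_useragent.RotateUserAgentMiddleware that is not shown in this post. As a rough idea of what such a user-agent-rotating middleware typically looks like (only the module and class names come from the settings above; the body is my own sketch):

# rotate_useragent.py -- minimal sketch of a UA-rotating downloader middleware.
import random


class RotateUserAgentMiddleware(object):
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        request.headers['User-Agent'] = random.choice(self.user_agents)

With everything in place, the crawl is started the usual way with scrapy crawl mydingdian; the log then goes to dingdian.log as configured above.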

While inserting data into the database I ran into

pymysql.err.InterfaceError: (0, '')

and it took quite a bit of searching before I solved it.

The cause is that Scrapy stores items asynchronously: items arrive at the pipeline faster than the single MySQL connection can handle.

Fix: simply slow the crawl down, e.g. set DOWNLOAD_DELAY = 2 in settings.py.
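Lowering the crawl speed works, but if you would rather keep the speed up, another common workaround is to verify the MySQL connection before every insert. The fragment below is a variation on the pymysql pipeline sketched earlier (not code from the original post): pymysql raises InterfaceError (0, '') when a statement is executed on a dead connection, and ping(reconnect=True) transparently re-opens it.

    # Replacement for process_item in the pymysql pipeline sketch above.
    def process_item(self, item, spider):
        self.conn.ping(reconnect=True)  # re-open the connection if MySQL dropped it
        sql = ("insert into ebook (novel_name,author,novelurl,serialstatus,"
               "serialsize,ebookdate,newchapter) values (%s,%s,%s,%s,%s,%s,%s)")
        with self.conn.cursor() as cursor:
            cursor.execute(sql, (item['novel_name'], item['author'], item['novelurl'],
                                 item['serialstatus'], item['serialsize'],
                                 item['date'], item['newchapter']))
        self.conn.commit()
        return item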

 

The full code is on GitHub:

tyutltf/dingdianbook - crawls the information of all novels on the Dingdian novel site: https://github.com/tyutltf/dingdianbook

Reposted from: https://www.cnblogs.com/yuxuanlian/p/10000968.html
