Python Crawler, Part 1 (2): Scraping a Web Novel (圣墟)

Extension:
Scrape the recent novel 圣墟.

Code:

#coding=utf-8
# Note: this is Python 2 code (print statement, reload(sys)).
import os
import sys
reload(sys)
sys.setdefaultencoding('utf8')
# Spider is a local module that is not shown in this post;
# getHtmlCode(url) is used here as the source of the page HTML.
from Spider import getHtmlCode
from bs4 import BeautifulSoup
import re

# URL of the first chapter
url = 'https://www.biquge5200.com/52_52542/20380548.html'

def getTree(url):
    temp = getHtmlCode(url)
    soup = BeautifulSoup(temp,'html.parser')
    return soup

# Input: URL of a chapter page
# Output: (chapter title, content)
def getAll(url):
    temp = getTree(url)
    chaptername = temp.h1.string
    print u'Chapter title:',chaptername
    content = temp.find_all('div',id='content')
    content = str(content[0])
    content = content.replace('<br />','\n')
    # Remove the remaining tags from the chapter text
    pattern = re.compile('<(.*)>')
    list_line = pattern.findall(content)
    for line in list_line:
        line = '<' + line +'>'
        content = content.replace(line,'')
    # print u'Content:',content,'\n'
    return (chaptername,content)

# Input: URL of a chapter
# Output: creates a txt file named after the chapter title
def creatFile(url):
    (fileName,txt) = getAll(url)
    fileName = fileName + '.txt'
    f = open(fileName,'a+')
    f.write(txt)
    f.close()

def nextUrl(url):
    tree = getTree(url)
    aSpan = tree('a',href=re.compile('.*52_52542'))
    # Default to '' so the function still returns when no link matches
    pathUrl = ''
    for nextChapter in aSpan:
        # print type(nextChapter.string)
        # u'下一章' is the text of the "next chapter" link on the page
        if u'下一章' == nextChapter.string:
            pathUrl = nextChapter['href']
            print pathUrl
            break
    return pathUrl

# nextUrl(url)
# Input: URL of a chapter
# Output: the whole novel (one folder for every hundred chapters)
def main(url):
    count = 1
    flag = True
    # Windows shell command: quietly delete .txt files, including those in subfolders
    cmd = 'del /q /s *.txt'
    os.system(cmd)
    while flag:
        creatFile(url)
        print 'address = ',url
        url = nextUrl(url)
        count = count + 1
        # Every hundred chapters, move the txt files into a numbered folder
        if 0 == (count % 100) :
            filename = count / 100
            cmd_md = 'md ' + str(filename)
            cmd_mv = 'move *.txt ' + str(filename)
            os.system(cmd_md)
            os.system(cmd_mv)
        # No further chapter page: move the remaining files and stop
        if -1 == url.find('.html'):
            filename = count / 100 + 1
            cmd_md = 'md ' + str(filename)
            cmd_mv = 'move *.txt ' + str(filename)
            os.system(cmd_md)
            os.system(cmd_mv)
            flag = False

main(url)
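
A note on the tag-stripping step in getAll: the pattern '<(.*)>' is greedy, so when two tags share a line it matches from the first '<' to the last '>' and the text between them is removed as well. Below is a minimal Python 3 sketch of a narrower approach that uses only the standard re and html modules. The helper name strip_tags and the sample HTML are hypothetical and do not come from the original script.

```python
import html
import re

# Hypothetical helper, not part of the original script.
BR_TAG = re.compile(r'<br\s*/?>', re.IGNORECASE)  # matches <br>, <br/> and <br />
ANY_TAG = re.compile(r'<[^>]+>')                   # matches one tag at a time


def strip_tags(fragment):
    """Turn an HTML fragment into plain text, keeping <br> as line breaks."""
    text = BR_TAG.sub('\n', fragment)   # line breaks first, so they survive
    text = ANY_TAG.sub('', text)        # then drop every remaining tag
    return html.unescape(text)          # decode entities such as &amp;


# Made-up sample input for illustration.
sample = '<div id="content">First line<br/>Second &amp; last line</div>'
print(strip_tags(sample))
```

If the line-break handling is not needed, BeautifulSoup's get_text() method on the content tag is another option.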

Result screenshot:

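While running it, I found that this approach tends to raise errors, so I later changed it to merge all chapters into a single book.
Code: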

#coding=utf-8
# Note: this is Python 2 code (print statement, reload(sys)).
import os
import sys
reload(sys)
sys.setdefaultencoding('utf8')
# Spider is a local module that is not shown in this post;
# getHtmlCode(url) is used here as the source of the page HTML.
from Spider import getHtmlCode
from bs4 import BeautifulSoup
import re

# URL of the first chapter
url = 'https://www.biquge5200.com/52_52542/20380548.html'

def getTree(url):
    temp = getHtmlCode(url)
    soup = BeautifulSoup(temp,'html.parser')
    return soup

# Input: URL of a chapter page
# Output: (chapter title, content)
def getAll(url):
    temp = getTree(url)
    chaptername = temp.h1.string
    print u'Chapter title:',chaptername
    content = temp.find_all('div',id='content')
    content = str(content[0])
    content = content.replace('<br />','\n')
    # Remove the remaining tags from the chapter text
    pattern = re.compile('<(.*)>')
    list_line = pattern.findall(content)
    for line in list_line:
        line = '<' + line +'>'
        content = content.replace(line,'')
    # print u'Content:',content,'\n'
    return (chaptername,content)

# Input: URL of a chapter
# Output: appends the chapter (title plus text) to 圣墟.txt
def creatFile(url):
    (fileName,txt) = getAll(url)
    txt = fileName + '\n' + txt
    storyFileName = u'圣墟.txt'
    f = open(storyFileName,'a+')
    f.write(txt)
    f.close()

def nextUrl(url):
    tree = getTree(url)
    aSpan = tree('a',href=re.compile('.*52_52542'))
    # Default to '' so the function still returns when no link matches
    pathUrl = ''
    for nextChapter in aSpan:
        # print type(nextChapter.string)
        # u'下一章' is the text of the "next chapter" link on the page
        if u'下一章' == nextChapter.string:
            pathUrl = nextChapter['href']
            print pathUrl
            break
    return pathUrl

# nextUrl(url)
# Input: URL of a chapter
# Output: the whole novel
def main(url):
    flag = True
    # Windows shell command: quietly delete .txt files, including those in subfolders
    cmd = 'del /q /s *.txt'
    os.system(cmd)
    while flag:
        creatFile(url)
        print 'address = ',url
        url = nextUrl(url)
        # No further chapter page: stop
        if -1 == url.find('.html'):
            flag = False

main(url)
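
The loop in main can also be arranged so that it runs without network access. The following is a minimal Python 3 sketch, not the author's code: crawl_book, get_chapter, get_next_url and max_chapters are hypothetical names, and the two callables stand in for getAll and nextUrl. It opens the output file once with an explicit UTF-8 encoding, which in Python 3 takes the place of the reload(sys) and setdefaultencoding workaround, and it adds a chapter cap (an assumption, not in the original) so that a cycle of links cannot loop forever.

```python
import os
import tempfile


def crawl_book(first_url, get_chapter, get_next_url, out_path, max_chapters=10000):
    """Follow next-chapter links from first_url and write every chapter to out_path.

    get_chapter(url) -> (title, text) and get_next_url(url) -> str are supplied
    by the caller (hypothetical stand-ins for getAll and nextUrl).
    Returns the number of chapters written.
    """
    written = 0
    url = first_url
    # Mode 'w' truncates an old book file, similar to deleting it before a run.
    with open(out_path, 'w', encoding='utf-8') as book:
        # Same stop condition as the original: a URL without '.html' ends the crawl.
        while '.html' in url and written < max_chapters:
            title, text = get_chapter(url)
            book.write(title + '\n' + text + '\n')
            written += 1
            url = get_next_url(url)
    return written


# Offline demo with a fake three-chapter site (made-up data).
fake_site = {
    '1.html': ('Chapter 1', 'text one', '2.html'),
    '2.html': ('Chapter 2', 'text two', '3.html'),
    '3.html': ('Chapter 3', 'text three', ''),
}
out_path = os.path.join(tempfile.mkdtemp(), 'book.txt')
count = crawl_book(
    '1.html',
    get_chapter=lambda u: fake_site[u][:2],
    get_next_url=lambda u: fake_site[u][2],
    out_path=out_path,
)
print(count, 'chapters written to', out_path)
```

Passing the two page functions in as arguments keeps the loop independent of the site, so the stop condition can be checked against fake data before crawling real pages.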

Result:

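(Important) All of the code and the novel are available in my download resources; if you have no points, message me privately on QQ.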
