pyhton——爬小说网站（顶点最强国防生）

一直想做个爬虫，用.net尝试过。一直在.net平台上转悠。没有尝试过其他语言。因为懂得越来越觉得自己懂得少。最近遇到了瓶颈。就尝试使用别得语言

来扩展自己得思路。希望能突破瓶颈，达到另一个高度把。作为程序员，永远都是说得再多看得再多，不如敲几行代码来的实在。

我是看小说的，应该是老了吧，现在很少的小说符合自己了。我是穷人看的是盗版。（^_^但是我不喷作者，因为我不看正版）言归正传。

1：首先在python官网上下载python3.X版本。（选择操作系统，选择要下载的版本）。（安装有问题请自行百度）

2：编辑器，作为.net 死忠。肯定首选Visual studio Code。安装python 插件。新建项目文件 xiaoshuo

3：作为爬下来的东西，要保存到txt文件中（或者别的你可以喜欢的）。所以新建个文件夹FileManager

4: 在FileManager文件夹下新建FileWrite.py 代码如下：

1 def CreatFile(filename):
2     with open(filename +".txt",'w',encoding='utf-8',errors='ignore') as f:
3         f.writelines("           " + filename + "        " + '\r\n')
4 def appointchapter(filename, context):
5     with open(filename +".txt",'a',encoding='utf-8',errors='ignore') as f:
6         f.writelines(context + '\r\n')

View Code

5：爬下来的东西解析，这里没有使用第三方插件。（本来python 有个很好的html 解析插件，但是公司的网络好像屏蔽掉了，一直下载不下来，所以改成正则。原理应该是相同的）

新建RegularExpression 文件夹，新建Regular.py文件代码如下：

 1 import re
 2 # 获取小说名字
 3 reName = re.compile(r'<h1>.+?(?:</h1>)',re.DOTALL)
 4 #获取小说章节目录
 5 rechapterLi = re.compile(r'<table.+?(?:</table>)',re.DOTALL)
 6 #获取小说单个章节目录
 7 rePage = re.compile(r'<a.+?(?:</a>)',re.MULTILINE)
 8 #获取小说单个章节链接
 9 rePageUrl = re.compile(r'[\d]+.+?(?:html)',re.DOTALL)
10 #获取小说章节名字
11 reTitle = re.compile(r'<dd><h1>.+?(?:</h1></dd>)',re.DOTALL)
12 #获取原始的章节内容
13 reContent =re.compile(r'<dd id="contents".+?(?:</dd>)',re.DOTALL)
14
15 def DoHtmlToTxt(htmlstr):
16     txt = re.sub('&nbsp;',' ',htmlstr)
17     txt = re.sub('<br />','\r\n',txt)
18     txt = re.sub(r'<.+?(?:>)','',txt)
19     return txt
20 def DelHtmlCoding(htmlstr):
21     txt = re.sub(r'<.+?(?:>)','',htmlstr)
22     txt = re.sub(r'最新.+?(?:列表)','',txt)
23     txt = re.sub(' ','',txt)
24     return txt

View Code

6：新建文件夹WebManager。在WebManager 中新建HttpContext.py 用来管理和处理http请求。代码如下：

 1 import urllib
 2 import urllib.request
 3 def GetHtmlContent(html,endcoding):
 4     try:
 5         request =  urllib.request.Request(html)
 6         response = urllib.request.urlopen(request)
 7         data = response.read()
 8         data = data.decode(endcoding)
 9         return data
10     except Exception as e:
11         with open('erro.txt','a',encoding='utf-8') as f:
12             f.writelines('抓取报错' + html)
13             f.writelines('报错原因' + e)
14             return '0'
15     except URLError as e:
16         with open('erro.txt','a',encoding='utf-8') as f:
17             f.writelines('抓取报错' + html)
18             f.writelines('报错原因' + e)
19             return '0'
20

View Code

7：好了准备工作好了，现在在xiaoshuo 文件夹下，新建Main.py。代码如下：

 1 import RegularExpression.Regular
 2 import FileManager.FileWrite
 3 import WebManager.HttpContext
 4 #htmlurl = 'https://www.x23us.com/html/69/69720/'
 5 htmlurl = 'https://www.23us.cc/html/173/173244/'
 6 htmltxt = WebManager.HttpContext.GetHtmlContent(htmlurl,'gbk')
 7 if htmltxt != '0':
 8     titlehtml = RegularExpression.Regular.reName.search(htmltxt).group()
 9     title =  RegularExpression.Regular.DelHtmlCoding(titlehtml)
10     FileManager.FileWrite.CreatFile(title)
11     chaptercontent = RegularExpression.Regular.rechapterLi.search(htmltxt).group()
12     if chaptercontent:
13         chapterli = RegularExpression.Regular.rePage.finditer(chaptercontent)
14         if chapterli:
15             for m in chapterli:
16                 pageurl = RegularExpression.Regular.rePageUrl.search(m.group()).group()
17                 htmlpagetxt =  WebManager.HttpContext.GetHtmlContent(htmlurl + pageurl,'gbk')
18                 if htmlpagetxt != '0':
19                     pageName = RegularExpression.Regular.reTitle.search(htmlpagetxt).group()
20                     pageContent = RegularExpression.Regular.reContent.search(htmlpagetxt).group()
21                     pageName = RegularExpression.Regular.DelHtmlCoding(pageName)
22                     pageContent = RegularExpression.Regular.DoHtmlToTxt(pageContent)
23                     txt = '\r\n'+ pageName +'\r\n' + pageContent
24                     FileManager.FileWrite.appointchapter(title,txt)
25                 else:
26                     continue
27             print('下载成功!')

View Code

备注：由于只是尝试，所以有很多可以扩展的地方。大家多多包涵

转载于:https://www.cnblogs.com/duchyaiai/p/8392194.html

pyhton——爬小说网站（顶点最强国防生）相关推荐

NodeJs ES6 写简单爬虫爬小说网站《我当方士那些年》
个人网站地址 www.wenghaoping.com Vue + express + Mongodb构建请大家多支持一下. 先上代码,然后在慢慢逼逼 Git地址,有需要的Clone ==先从小说网站 ...
爬取顶点小说网站小说
简单的爬取小说的脚本 1 ''' 2 爬取网站顶点小说 3 网站地址 https://www.booktxt.net 4 本脚本只为学习 5 ''' 6 import requests 7 from ...
python爬小说代码_中文编程，用python编写小说网站爬虫
原标题:中文编程,用python编写小说网站爬虫作者:乘风龙王原文:https://zhuanlan.zhihu.com/p/51309019 为保持源码格式, 转载时使用了截图. 原文中的源码块 ...
python爬虫小说代码示例-中文编程，用python编写小说网站爬虫
原标题:中文编程,用python编写小说网站爬虫作者:乘风龙王原文:https://zhuanlan.zhihu.com/p/51309019 为保持源码格式, 转载时使用了截图. 原文中的源码块 ...
Python网络爬虫（九）：爬取顶点小说网站全部小说，并存入MongoDB
前言:本篇博客将爬取顶点小说网站全部小说.涉及到的问题有:Scrapy架构.断点续传问题.Mongodb数据库相关操作. 背景: Python版本:Anaconda3 运行平台:Windows IDE ...
用Python爬取顶点小说网站中的《庆余年》思路参考——记一次不成功的抓取
目的:用python爬虫抓取顶点小说网站中的<庆余年>小说内容个,并保存为txt格式文件. 环境:Win10系统,Anaconda3 + PyCharm, python3.6版本思路:( ...
JAVA爬虫进阶之springboot+webmagic抓取顶点小说网站小说
闲来无事最近写了一个全新的爬虫框架WebMagic整合springboot的爬虫程序,不清楚WebMagic的童鞋可以先查看官网了解什么是Webmagic,顺便说说用springboot时遇到的一些坑 ...
Python——爬取小说网站的整本小说
编译环境:pycharm 需要的库:requests,lxml,bs4,BeautifulSoup,os 思路如下: 首先可以先建立一个文件,使用os库中的os.makedirs("文件名: ...
类似零基础学python的小说_零基础小白十分钟用Python搭建小说网站！Python真的强！...
零基础小白十分钟用Python搭建小说网站!Python真的强!-1.jpg (128.29 KB, 下载次数: 0) 2018-10-8 18:51 上传 Python 和放大镜的二进制代码人生苦 ...
网站爬取工具_Python项目：结合Django和爬虫开发小说网站，免安装，无广告
前言很多喜欢看小说的小伙伴都是是两袖清风的学生党,沉迷小说,不能自拔.奈何囊中甚是羞涩,没有money去看正版小说,但是往往这些免费的小说网站或者小说软件,随之而来的是大量的广告. Python嘛, ...

pyhton——爬小说网站（顶点最强国防生）

pyhton——爬小说网站（顶点最强国防生）相关推荐

最新文章

热门文章