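I've had some free time lately and wanted to learn something new. I happen to be getting hooked on the novel 我师兄, so I decided to write my own Python crawler to download it and make it easier to read later.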

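I. Basic steps of a crawler

No matter how complex a crawler is, its workflow always breaks down into the following parts:
1. Send the request
Use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.
2. Get the response
If the server responds normally, you get a Response whose body is the page content you are after. It may be HTML, a JSON string, or binary data (such as an image or a video).
3. Parse the content
HTML can be parsed with regular expressions or an HTML parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.
4. Save the data
The data can be saved in many forms: as plain text, in a database, or as a file in some specific format.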
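The four steps above can be sketched as one small pipeline. This is only an illustration, not the code used later in this post: the names `crawl` and `parse_chapters` are made up for the example, and the fetch and save steps are passed in as functions so the flow can be tried without touching the network.

```python
import re
from typing import Callable, Iterable, List, Tuple

Chapter = Tuple[str, str]  # (relative URL, link text)


def parse_chapters(html: str) -> List[Chapter]:
    """Step 3: pull (href, text) pairs out of the <a> tags of a page."""
    return re.findall(r'<a[^>]*href="([^"]+)"[^>]*>([^<]+)</a>', html)


def crawl(url: str,
          fetch: Callable[[str], str],
          parse: Callable[[str], List[Chapter]],
          save: Callable[[Iterable[Chapter]], None]) -> int:
    """Run the four steps in order and return how many items were saved."""
    html = fetch(url)      # steps 1 and 2: send the request, read the response
    items = parse(html)    # step 3: parse the content
    save(items)            # step 4: save the data
    return len(items)
```

With requests, the fetch step could be something like `lambda u: requests.get(u, headers=headers).text`, and the save step could write to a file instead of a list.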

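II. Writing the crawler

Since this is my first crawler and my skills are limited, it only works with one hard-coded site address: http://www.paoshu8.com/131_131325/
Downloaded content is saved to: D:\novel\

1. Send the request: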

import re
import requests
from bs4 import BeautifulSoup

link = "http://www.paoshu8.com/131_131325/"
headers = {
    'Cookie': 'td_cookie = 405542359;width = 85 % 25;Hm_lpvt_9352f2494d8aed671d970e0551ae3758 = 1596682842;Hm_lvt_9352f2494d8aed671d970e0551ae3758 = 1596677570',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363'
}

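To find the headers, open the novel site in a browser, press F12 to open the developer tools, click Network, and select one of the recorded requests to the target site. After you click it, the request headers are shown on the left side of the screen, and the values needed in the code above are among them. If the request you clicked does not contain them, pick another recorded request to the target site.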
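Typing the headers into a dict by hand is error-prone, so here is a small helper that converts header text copied from the developer tools into a dict. It is a sketch under assumptions: `parse_raw_headers` is a name I made up, the cookie values are placeholders, and it assumes the browser shows one `Name: value` pair per line.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, HTTP/2 pseudo-headers such as ':authority',
        # and anything that is not a 'Name: value' pair.
        if not line or line.startswith(":") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Placeholder text standing in for what was copied from the browser.
raw = """
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Cookie: key1=value1; key2=value2
"""
headers = parse_raw_headers(raw)
```

The resulting dict can be passed directly as `headers=headers` to `requests.get`.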

2. Get the response:

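response = requests.get(link, headers=headers)  # send the request and get the response
print(response.encoding)  # the encoding requests inferred from the response headers
response.encoding = 'utf-8'  # decode as UTF-8 (some sites use GBK instead)
html = response.text  # the response body decoded into a string
soup = BeautifulSoup(html, 'lxml')  # parse the HTML
soup = str(soup)  # convert back to a string; the regular expressions below fail on a BeautifulSoup object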

The code above decodes and converts the response into a form we can read and work with.
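Because some sites serve GBK rather than UTF-8, hard-coding the encoding can produce garbled text. A possible alternative, sketched here with a made-up helper `decode_html`, is to try each candidate encoding strictly on the raw bytes and keep the first one that fits. This is a heuristic: a short GBK byte sequence can happen to be valid UTF-8, although for a whole page that is unlikely.

```python
def decode_html(raw: bytes, candidates=("utf-8", "gbk")) -> str:
    """Return raw decoded with the first candidate encoding that fits."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(candidates[0], errors="replace")
```

With requests it would be called on the undecoded body, for example `html = decode_html(response.content)`. requests also exposes `response.apparent_encoding`, which guesses the encoding from the body itself.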

3. Parse and download the content:

title = re.findall(r'<dt>.*?</dt>', soup)  # regular expression matching everything from <dt> to </dt>, which includes the novel's title
print(title)
title = title[1]  # pick the entry holding the title, based on the printed list
title = title.replace('<dt>', '')  # clean up the title
title = title.replace('</dt>', '')
title = title.replace('正文', '')
print(title)

(Figure: output of the code above.)
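The title extraction can also be written with a capture group, so the `<dt>` tags never need to be stripped afterwards. This is a sketch only: `extract_title` is a made-up name, and the sample HTML is a guess at the page structure (two `<dt>` elements, the second ending in 正文), not a copy of the real page.

```python
import re


def extract_title(html: str, index: int = 1) -> str:
    """Return the text of the <dt> element at `index`, without '正文'."""
    matches = re.findall(r"<dt>(.*?)</dt>", html, re.S)
    if len(matches) <= index:
        raise ValueError(
            "expected at least %d <dt> elements, found %d"
            % (index + 1, len(matches))
        )
    return matches[index].replace("正文", "").strip()


# Hypothetical sample, standing in for the downloaded index page.
sample = "<dl><dt>《示例小说》最新章节</dt><dd>...</dd><dt>《示例小说》正文</dt></dl>"
title = extract_title(sample)
```

Raising an error when the expected `<dt>` is missing makes a change in the site's layout fail loudly instead of silently producing a wrong file name.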

filename = 'D:/novel/' + '%s.txt' % (title)  # TXT file named after the novel
soup = re.findall(r'<dl>.*?</dl>', soup, re.S)[0]  # narrow the search: the chapter list sits between <dl> and </dl>, which makes chapter names and URLs easier to extract
# print(soup)
zhangjie = re.findall(r'href="(.*?)">(.*?)<', soup)  # collect (URL, chapter name) pairs into a list
del zhangjie[0:9]  # the site lists the 9 most recently updated chapters before the main text; leaving them in would scramble the reading order
# print(zhangjie)
for info in zhangjie:  # clean and download the novel chapter by chapter
    url, name = info
    # print(info)
    url = 'http://www.paoshu8.com%s' % url  # address of this chapter
    print('Downloading %s......' % name)
    with open(filename, 'a+', encoding='gbk') as f:
        chapter_request = requests.get(url)  # request this chapter's page
        chapter_request.encoding = 'utf-8'  # decode as UTF-8
        chapter_html = chapter_request.text  # the chapter page as text
        # print(chapter_html)
        chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html)  # extract the chapter body
        chapter_content = str(chapter_content)  # convert the resulting list to a string
        chapter_content = chapter_content.replace('&nbsp;', '')  # the following lines clean up the chapter text
        chapter_content = chapter_content.replace('<p>', '      ')
        chapter_content = chapter_content.replace('</p>', '\n')
        chapter_content = chapter_content.replace(r'\u3000', '')
        chapter_content = chapter_content.replace(r"'", "")
        chapter_content = chapter_content.replace('[', '')
        chapter_content = chapter_content.replace(']', '')
        chapter_content = "".join([s for s in chapter_content.splitlines(True) if s.strip()])  # remove blank lines
        # write the chapter to the file
        f.write(name.encode("gbk", 'ignore').decode("gbk", "ignore"))  # write the chapter name, dropping characters GBK cannot encode (same below)
        f.write('\n')
        f.write(chapter_content.encode("gbk", 'ignore').decode("gbk", "ignore"))
        f.write('\n\n')
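The cleaning step above converts a list to a string and then strips the brackets and quotes that conversion adds, which also removes any brackets and quotes in the story itself. A possible alternative is sketched below; `clean_chapter` is a made-up name, and it assumes the chapter body sits in a single `<div id="content">` with no nested `<div>`. Unlike the loop above, it does not indent paragraphs.

```python
import html
import re


def clean_chapter(page: str) -> str:
    """Extract the text inside <div id="content"> as plain paragraphs."""
    match = re.search(r'<div id="content">(.*?)</div>', page, re.S)
    if match is None:
        return ""
    body = match.group(1)
    body = re.sub(r"</p>|<br\s*/?>", "\n", body)  # paragraph ends become newlines
    body = re.sub(r"<[^>]+>", "", body)           # drop any remaining tags
    body = html.unescape(body)                    # &nbsp; -> U+00A0, &amp; -> &
    # strip() also removes U+00A0 and U+3000, which count as whitespace.
    lines = (line.strip() for line in body.splitlines())
    return "\n".join(line for line in lines if line)
```

Working on the matched string directly, instead of on the list's string form, means `\u3000` and quotes no longer need special handling.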

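One last note: without the step that drops characters GBK cannot encode at write time, the script raises an error partway through downloading longer novels. That is the whole process; feedback from more experienced readers is very welcome.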
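To show what that step does, here is a minimal sketch; `gbk_safe` and `append_chapter` are names made up for the example. The first function reproduces the encode/decode round trip used above. The second is an alternative I have not used in this post: writing the file as UTF-8, which can store every character and so needs no filtering.

```python
def gbk_safe(text: str) -> str:
    """Drop every character that GBK cannot represent."""
    return text.encode("gbk", "ignore").decode("gbk")


def append_chapter(path: str, name: str, content: str) -> None:
    """Append one chapter as UTF-8, which can store any character."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(name + "\n" + content + "\n\n")
```

The trade-off is that `gbk_safe` silently loses characters, while a UTF-8 file keeps them but may need to be opened explicitly as UTF-8 in some Windows text editors.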

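III. Shortcomings of the program

1. The download cannot be paused: after an interruption, the next run starts downloading from the beginning again;
2. There is no search: it only works with the one hard-coded novel page;
3. It is slow: a novel of about 600 chapters takes seven or eight minutes to download.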
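The first and third shortcomings could be addressed roughly as follows. This is an untested-against-the-real-site sketch, not part of my program: `download_all` and the JSON cache file are assumptions of the example, and `fetch` stands for any function that takes a URL and returns the chapter text. Chapters already in the cache are skipped, and the missing ones are fetched in a thread pool.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def download_all(chapters: List[Tuple[str, str]],
                 fetch: Callable[[str], str],
                 cache_path: str,
                 workers: int = 8) -> List[str]:
    """Fetch every chapter, reusing results saved by an earlier run."""
    cache: Dict[str, str] = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)

    missing = [url for url, _ in chapters if url not in cache]
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() yields results in input order, even though the
            # requests themselves run concurrently.
            for url, text in zip(missing, pool.map(fetch, missing)):
                cache[url] = text
    finally:
        # Persist whatever was collected, so the next run can skip it.
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f, ensure_ascii=False)

    return [cache[url] for url, _ in chapters]
```

The returned list is in chapter order, so it can be written to the TXT file in one pass afterwards. A large `workers` value puts more load on the site, so a small number is probably the polite choice.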

Appendix: full code

import requests
import re
from bs4 import BeautifulSoup

title = []
link = "http://www.paoshu8.com/131_131325/"
headers = {
    'Cookie': 'td_cookie = 405542359;width = 85 % 25;Hm_lpvt_9352f2494d8aed671d970e0551ae3758 = 1596682842;Hm_lvt_9352f2494d8aed671d970e0551ae3758 = 1596677570',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363'
}
response = requests.get(link, headers=headers)
# print(response.encoding)
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'lxml')
soup = str(soup)
# print(html)
# print(soup)
title = re.findall(r'<dt>.*?</dt>', soup)
# print(title)
title = title[1]
title = title.replace('<dt>', '')
title = title.replace('</dt>', '')
title = title.replace('正文', '')
# print(title)
filename = 'D:/novel/' + '%s.txt' % (title)
soup = re.findall(r'<dl>.*?</dl>', soup, re.S)[0]
# print(soup)
zhangjie = re.findall(r'href="(.*?)">(.*?)<', soup)
del zhangjie[0:9]
# print(zhangjie)
for info in zhangjie:
    url, name = info
    # print(info)
    url = 'http://www.paoshu8.com%s' % url
    print('Downloading %s......' % name)
    with open(filename, 'a+', encoding='gbk') as f:
        chapter_request = requests.get(url)
        chapter_request.encoding = 'utf-8'
        chapter_html = chapter_request.text
        # print(chapter_html)
        chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html)
        chapter_content = str(chapter_content)
        chapter_content = chapter_content.replace('&nbsp;', '')
        chapter_content = chapter_content.replace('<p>', '      ')
        chapter_content = chapter_content.replace('</p>', '\n')
        # chapter_content = chapter_content.replace('\n', '')
        chapter_content = chapter_content.replace(r'\u3000', '')
        chapter_content = chapter_content.replace(r"'", "")
        chapter_content = chapter_content.replace('[', '')
        chapter_content = chapter_content.replace(']', '')
        chapter_content = "".join([s for s in chapter_content.splitlines(True) if s.strip()])  # remove blank lines
        f.write(name.encode("gbk", 'ignore').decode("gbk", "ignore"))
        f.write('\n')
        f.write(chapter_content.encode("gbk", 'ignore').decode("gbk", "ignore"))
        f.write('\n\n')
