今天做一个爬虫练手的小实战：爬取顶点小说网的小说，实现下载到本地（虽然网站上本来就可以下载，不过还是自己写代码来有成就感嘛！）

爬取网站

进入官网后，点击元尊，就爬取这本书了。
我们先把整个网页爬下来吧！

import requestsurl = r'https://www.booktxt.net/6_6453/' # 网站路径
# 伪装请求头
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
res = requests.get(url, headers=headers) # 爬取网站
res.encoding = 'utf-8' # 设置编码
print(res.text)

运行上面代码，看到网站已经成功爬取下来的，但里面的内容却是乱码，我们进入网页按 F12 分析一下。
按 F12 或鼠标右键选择检查后就会出现开发者工具，默认是出现在浏览器右边的，我喜欢设置出现在下面。选中Elements这个选项卡，下面的内容就是网页的源码，打开header之后我发现网站设置的编码 charset是gbk格式的，所以我们设置网页的编码的时候也要设置为gbk格式的才能正常显示。

res.encoding = 'gbk'

解析元素

得到网站源码以后就该分析一下网站，然后解析出书名和章节。
在网站的检查工具中，单击左上角的有一个箭头的图标然后再选中页面元素，可以快速定位元素源码。

点击所有章节列表所在的区域，可以发现下面源码中的某个div被标蓝了，打开这个div分析，发现所有的章节都在它里面的dd标签里面，我们用BeautifulSoup4来解析页面得到想要的内容。

import requests
from bs4 import BeautifulSoupurl = r'https://www.booktxt.net/6_6453/' # 网站路径
# 伪装请求头
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
res = requests.get(url, headers=headers)
res.encoding = 'gbk' # 设置编码
# 使用BeautifulSoup解析页面
soup = BeautifulSoup(res.text, 'html.parser')
# BeautifuSoup支持css选择器，借此得到指定标签
chapters = soup.select('.box_con #list dl dd')
# 循环打印
for chapter in chapters:print(chapter)

我们用同样的方法得到书名和作者。

import requests
from bs4 import BeautifulSoupurl = r'https://www.booktxt.net/6_6453/' # 网站路径
# 伪装请求头
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
res = requests.get(url, headers=headers)
res.encoding = 'gbk' # 设置编码
# 使用BeautifulSoup解析页面
soup = BeautifulSoup(res.text, 'html.parser')
# BeautifuSoup支持css选择器，借此得到指定标签
chapters = soup.select('.box_con #list dl dd')
# 得到作者
author = soup.select('.box_con #maininfo #info h1, .box_con #maininfo #info p')
print(author[0]) # 分析发现取得的标签里面第一项为书名
print(author[1]) # 第二项为作者
# 循环打印
for chapter in chapters:print(chapter)

得到上面内容后，差的就是得到每章的内容了。我们先点进某一张看一下。
我们发现进入某一章后，上面的路由后面多了该章节所在的a标签的的href，而章节的内容在id为content的div里面。那我们就得到这两部分内容，在使用furl这个库把url链接起来。

import requests
from bs4 import BeautifulSoup
from furl import furlurl = r'https://www.booktxt.net/6_6453/' # 网站路径# 伪装请求头
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
res = requests.get(url, headers=headers)
res.encoding = 'gbk' # 设置编码
# 使用BeautifulSoup解析页面
soup = BeautifulSoup(res.text, 'html.parser')
# BeautifuSoup支持css选择器，借此得到指定标签# 得到作者
author = soup.select('.box_con #maininfo #info h1, .box_con #maininfo #info p')
print(author[0]) # 分析发现取得的标签里面第一项为书名
print(author[1]) # 第二项为作者
# 取得章节
chapters = soup.select('.box_con #list dl dd a')
# 循环打印
# for chapter in chapters:
#     print(chapter.text, chapter['href']) # 打印章节名和连接地址
# 定义一个方法得到具体章节的内容
def content(url_last):# 当前章节的url等于书籍的url加上章节a标签的hrefglobal urlurl_now = url + '/' + url_last# 爬取具体章节内容res_chap = requests.get(url_now, headers=headers)res_chap.encoding = 'gbk'soup = BeautifulSoup(res_chap.text, 'html.parser')cont = soup.select('.content_read .box_con #content')print(cont[0])
# 爬取第一章的内容
content(chapters[0]['href'])

通过定义的函数，接收指定章节的连接，然后爬取本章的内容。
虽然内容爬取到了，但我们看见里面还有很多html标签，所以我们需要把内容处理一下，并加上文章标题返回成一个字符串。

import requests
from bs4 import BeautifulSoup
from furl import furlurl = r'https://www.booktxt.net/6_6453/' # 网站路径# 伪装请求头
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
res = requests.get(url, headers=headers)
res.encoding = 'gbk' # 设置编码
# 使用BeautifulSoup解析页面
soup = BeautifulSoup(res.text, 'html.parser')
# BeautifuSoup支持css选择器，借此得到指定标签# 得到作者
author = soup.select('.box_con #maininfo #info h1, .box_con #maininfo #info p')
print(author[0]) # 分析发现取得的标签里面第一项为书名
print(author[1]) # 第二项为作者
# 取得章节
chapters = soup.select('.box_con #list dl dd a')
# 循环打印
# for chapter in chapters:
#     print(chapter.text, chapter['href']) # 打印章节名和连接地址
# 定义一个方法得到具体章节的内容
def content(url_last):# 当前章节的url等于书籍的url加上章节a标签的hrefglobal urlurl_now = url + '/' + url_last['href']# 爬取具体章节内容res_chap = requests.get(url_now, headers=headers)res_chap.encoding = 'gbk'soup = BeautifulSoup(res_chap.text, 'html.parser')cont = soup.select('.content_read .box_con #content')content = '\n' + url_last.text + '\n' # 使用一个变量来储存单章内容，第一行是文章标题# 处理问斩内容for text in cont[0]: # soup的select方法返回的是一个列表，所以cont[0]才是我们想要的具体内容，使用for循环得到每一行# 去除换行标签if str(text) == '<br/>':content += '\n'else:content += str(text)return content
# 爬取第一章的内容
print(content(chapters[0]))

文章的内容准确的爬取出来了，只是我明明爬取的是第一章，但实际得到的却不是。原来文章列表并不是从第一章开始的，前面还有几章最近更新的章节，可以用一个判断忽略掉。

# 定义一个方法得到具体章节的内容
# 定义一个变量判断是爬取此章节
isreq = True
def content(url_last):# 当前章节的url等于书籍的url加上章节a标签的hrefglobal url, isreq# 判断是否该爬取if isreq:# 判断是否到第一章了if '第一章' in str(url_last.text):isreq = Falseelse:returnurl_now = url + '/' + url_last['href']# 爬取具体章节内容res_chap = requests.get(url_now, headers=headers)res_chap.encoding = 'gbk'soup = BeautifulSoup(res_chap.text, 'html.parser')cont = soup.select('.content_read .box_con #content')content = '\n' + url_last.text + '\n' # 使用一个变量来储存单章内容，第一行是文章标题# 处理问斩内容for text in cont[0]: # soup的select方法返回的是一个列表，所以cont[0]才是我们想要的具体内容，使用for循环得到每一行# 去除换行标签if str(text) == '<br/>':content += '\n'else:content += str(text)return content
# 爬取第一章的内容
print(content(chapters[0]))

保存爬取内容

想得到的内容得到后，我们把爬取到的小说保存在一个txt文件里面

import requests
import os
from bs4 import BeautifulSoup
from furl import furlurl = r'https://www.booktxt.net/6_6453/' # 网站路径# 伪装请求头
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
res = requests.get(url, headers=headers)
res.encoding = 'gbk' # 设置编码
# 使用BeautifulSoup解析页面
soup = BeautifulSoup(res.text, 'html.parser')
# BeautifuSoup支持css选择器，借此得到指定标签# 得到作者
book_info = soup.select('.box_con #maininfo #info h1, .box_con #maininfo #info p')
book_name = book_info[0].text
book_author = book_info[1].text
# 设置保存路径
root_path = os.path.abspath(os.path.dirname(__file__)) # 得到当前文件路径
save_path = os.path.join(root_path, book_name+'.txt')# 取得章节
chapters = soup.select('.box_con #list dl dd a')
# 循环打印
# for chapter in chapters:
#     print(chapter.text, chapter['href']) # 打印章节名和连接地址
# 定义一个方法得到具体章节的内容
# 定义一个变量判断是爬取此章节
isreq = True
def content(url_last):# 当前章节的url等于书籍的url加上章节a标签的hrefglobal url, isreq, headers# 判断是否该爬取if isreq:# 判断是否到第一章了if '第一章' in str(url_last.text):isreq = Falseelse:returnurl_now = url + '/' + url_last['href']print(url_last.text)# 爬取具体章节内容res_chap = requests.get(url_now, headers=headers)res_chap.encoding = 'gbk'soup = BeautifulSoup(res_chap.text, 'html.parser')cont = soup.select('.content_read .box_con #content')content = '\n' + url_last.text + '\n' # 使用一个变量来储存单章内容，第一行是文章标题# 处理问斩内容for text in cont[0]: # soup的select方法返回的是一个列表，所以cont[0]才是我们想要的具体内容，使用for循环得到每一行# 去除换行标签if str(text) == '<br/>':content += '\n'else:content += str(text)return content
# 定义保存文件的方法
# 首先写入书名和作者
with open(save_path, 'w', encoding='utf-8') as w:w.write(book_name+'\n'+book_author)
def save_book(content):if content == None:returnwith open(save_path, 'a', encoding='utf-8') as f:f.write(content)# 循环章节列表爬取并保存
for chapter in chapters:cont = content(chapter)save_book(cont)

上面的代码已经是完成版了，运行了之后就会在python文件同目录生成一个元尊.txt文件，有必要注意的是代码最好保存到一个文件，然后在自己能找到的目录中运行，在jupyter中运行的话，可能会有一些麻烦。
还有就是处理章节的时候涉及到大量循环，所以爬取速度会比较慢，但这同时也避免了爬取速度过快被网站封ip的情况出现。

@快乐是一切

python爬虫实战-爬取小说相关推荐

Python爬虫实战爬取租房网站2w+数据-链家上海区域信息（超详细）
Python爬虫实战爬取租房网站-链家上海区域信息(过程超详细) 内容可能有点啰嗦大佬们请见谅后面会贴代码带火们有需求的话就用吧正好这几天做的实验报告就直接拿过来了,我想后面应该会有人用的到吧 ...
python爬虫实战---爬取大众点评评论
python爬虫实战-爬取大众点评评论(加密字体) 1.首先打开一个店铺找到评论很多人学习python,不知道从何学起. 很多人学习python,掌握了基本语法过后,不知道在哪里寻找案例上手. 很多 ...
python爬虫实战-爬取视频网站下载视频至本地(selenium)
#python爬虫实战-爬取视频网站下载视频至本地(selenium) import requests from lxml import etree import json from selenium ...
python爬虫实战-爬取微信公众号所有历史文章 - (00) 概述
http://efonfighting.imwork.net 欢迎关注微信公众号"一番码客"获取免费下载服务与源码,并及时接收最新文章推送. 最近几年随着人工智能和大数据的兴起,p ...
初次尝试python爬虫，爬取小说网站的小说。
本次是小阿鹏,第一次通过python爬虫去爬一个小说网站的小说. 下面直接上菜. 1.首先我需要导入相应的包,这里我采用了第三方模块的架包,requests.requests是python实现的简单易 ...
python爬虫实战--爬取猫眼专业版-实时票房
小白级别的爬虫入门最近闲来无事,发现了猫眼专业版-实时票房,可以看到在猫眼上映电影的票房数据,便验证自己之前学的python爬虫,爬取数据,做成.svg文件. 爬虫开始之前我们先来看看猫眼专业版- ...
Python爬虫之爬取小说
(^_−)☆本喵的放松方式是看小说,而且类型不限,属于偏好成谜的那一种.所以从爬取完天气预报开始,我就开始想着爬取小说,编写了一个还不算完善的爬取小说程序,期待你们的完善. 小说来源: 努努书坊:ht ...
Python爬虫实战 | 抓取小说网完结小说斗罗大陆
储备知识应有:Python语言程序设计 Python网络爬虫与信息提取两门课程都是中国大学MOOC的精彩课程,特别推荐初学者.环境Python3 本文整体思路是:1.获取小说目录页面,解析目录页面, ...
Python爬虫实战- 爬取整个网站112G-8000本pdf epub格式电子书下载
(整个代码附在最后) 目录: 爬虫准备 - 某电子书网站内容架构分析爬虫前奏 - 网站Html代码分析,如何获取需要的链接? 爬虫高潮 - 测试是否有反爬虫措施,测试是否能正常下载一个sample ...

python爬虫实战-爬取小说

爬取网站

解析元素

保存爬取内容

python爬虫实战-爬取小说相关推荐

最新文章

热门文章