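A Simple Python Web Crawler Example: Scraping a Novel from 17K小说网

What is a web crawler?

A web crawler (also called a web spider) is a program or script that automatically fetches information from websites according to a set of rules.

Crawler workflow

Fetch the URL with requests to get the page's HTML document
Open the page source in a browser and inspect the element nodes
Extract the wanted data with Beautiful Soup or regular expressions
Store the data on local disk or in a database (fetch, analyse, store)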
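
The analyse and store steps above can be sketched without touching the network. This is only an illustration under assumptions: the HTML string, the helper names extract_paragraphs and store, and the output file name are all hypothetical, and the extraction uses a regular expression (one of the two options named in the list) rather than Beautiful Soup.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical page source standing in for the HTML a real request would return.
SAMPLE_HTML = """
<html><body>
<div class="p">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
</body></html>
"""


def extract_paragraphs(html):
    """Analyse step: pull the text of every <p>...</p> pair out of the HTML."""
    return re.findall(r"<p>(.*?)</p>", html, flags=re.S)


def store(paragraphs, path):
    """Store step: write one paragraph per line to a UTF-8 text file."""
    Path(path).write_text("\n".join(paragraphs) + "\n", encoding="utf-8")


if __name__ == "__main__":
    out_path = Path(tempfile.gettempdir()) / "crawler_sketch.txt"
    store(extract_paragraphs(SAMPLE_HTML), out_path)
    print(out_path.read_text(encoding="utf-8"))
```

A regular expression is fine for a fixed, simple pattern like this one; for real pages with nested tags or attributes, an HTML parser is usually the more robust choice.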

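A simple example

Scrape the novel 《斩月》 from 17K小说网 (https://www.17k.com/).

Getting the chapter content

Start with the code:

import requests

if __name__ == '__main__':
    target = 'https://www.17k.com/chapter/3062292/39084147.html'
    req = requests.get(target)
    # Decode the body with the encoding guessed from its content
    req.encoding = req.apparent_encoding
    html = req.text
    print(html)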

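requests.get fetches the URL and returns the page's HTML document. The apparent_encoding attribute is the encoding that requests guesses by analysing the response body; assigning it to req.encoding makes req.text decode the page with that encoding, so the printed HTML is not garbled.

However, apparent_encoding is only a guess. Sometimes it differs from the page's real encoding, and the text comes out garbled anyway. In that case, first find out which encoding the page declares (for example, the charset in its <meta> tag), then assign it to req.encoding directly, e.g. req.encoding = 'utf-8'.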
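
The effect of a wrong encoding can be reproduced with plain bytes, no network needed. A small illustration, with the sample string chosen here only for the demo: decoding UTF-8 bytes as ISO-8859-1 (the encoding requests falls back to for text responses whose headers declare no charset) yields garbage, while decoding them as UTF-8 restores the text.

```python
# Raw bytes as a server would send them for a UTF-8 page.
raw = "斩月".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently produces mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the page's real encoding restores the original text.
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```

This is the same thing that happens inside req.text: the bytes never change, only the codec chosen through req.encoding does.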

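With the page document in hand, find the div that holds the chapter text in the page source; here it is the div whose class is p.

Then extract the wanted content with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    target = 'https://www.17k.com/chapter/3062292/39084147.html'
    req = requests.get(target)
    req.encoding = req.apparent_encoding
    html = req.text
    # Name the parser explicitly to avoid the "no parser specified" warning
    bf = BeautifulSoup(html, 'html.parser')
    # The chapter text sits in <div class="p">
    texts = bf.find_all('div', class_='p')
    print(texts[0].text)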

texts[0].text drops the <p> tags and keeps only the text inside them, so the chapter body is printed as plain text.
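
The difference between the markup and its .text can be checked on a small fragment. A minimal sketch, assuming a hypothetical fragment shaped like the chapter page (a div with class p wrapping a few paragraphs); the real page contains much more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the chapter page: a div with class "p"
# wrapping several <p> paragraphs.
SAMPLE_HTML = """
<div class="p">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find_all("div", class_="p")

# str() keeps the markup; .text keeps only the text nodes.
with_tags = str(divs[0])
text_only = divs[0].text

print(with_tags)
print(text_only)
```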

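Getting chapter titles and links

A chapter page only gives us that chapter's content, not the links to the other chapters, so we need to go back to the novel's catalog page for that information. The method is the same as for the chapter content: fetch the page with requests, then extract what we want with BeautifulSoup.

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    www = 'https://www.17k.com'
    target = 'https://www.17k.com/list/3062292.html'
    req = requests.get(target)
    req.encoding = req.apparent_encoding
    html = req.text
    bf = BeautifulSoup(html, 'html.parser')
    # The chapter list sits in <dl class="Volume">
    texts = bf.find_all('dl', class_='Volume')
    # Re-parse the first volume and collect its links
    a_bf = BeautifulSoup(str(texts[0]), 'html.parser')
    a_text = a_bf.find_all('a')
    print(a_text[0].text)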

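However, the chapter title is wrapped in a <span> inside the link, so .text on the <a> returns everything nested under the link rather than a clean title.
One way around this is to take the <span> itself, convert it to a string, and slice the title out by position:

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    www = 'https://www.17k.com'
    target = 'https://www.17k.com/list/3062292.html'
    req = requests.get(target)
    req.encoding = req.apparent_encoding
    html = req.text
    bf = BeautifulSoup(html, 'html.parser')
    texts = bf.find_all('dl', class_='Volume')
    a_bf = BeautifulSoup(str(texts[0]), 'html.parser')
    a_text = a_bf.find_all('a')
    for each in a_text[2:]:  # skip the leading links that are not needed
        name = str(each.find_all('span', class_='ellipsis'))
        href = each.get('href')
        # The slice offsets are hard-coded for this page's markup
        print(name[43:len(name) - 24], www + href)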

Running this prints each chapter title followed by its full URL.
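
Slicing by fixed offsets breaks as soon as the markup shifts by a character. As a possible alternative (not the approach used in this article), the title can be read with get_text(strip=True) and the link joined with urljoin from the standard library. A sketch under assumptions: the catalog fragment below is hypothetical and the real page's markup may differ, list_chapters is an illustrative helper name, and unlike the code above it walks every Volume block instead of only the first.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical catalog fragment; the real page's markup may differ.
SAMPLE_HTML = """
<dl class="Volume">
<dd>
<a href="/chapter/1/101.html"><span class="ellipsis">
    Chapter 1
</span></a>
<a href="/chapter/1/102.html"><span class="ellipsis">
    Chapter 2
</span></a>
</dd>
</dl>
"""

BASE_URL = "https://www.17k.com"


def list_chapters(html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for every chapter link found."""
    soup = BeautifulSoup(html, "html.parser")
    chapters = []
    for volume in soup.find_all("dl", class_="Volume"):
        for link in volume.find_all("a"):
            span = link.find("span", class_="ellipsis")
            href = link.get("href")
            if span is None or href is None:
                continue  # not a chapter link
            # strip=True removes the whitespace around the title
            title = span.get_text(strip=True)
            chapters.append((title, urljoin(base_url, href)))
    return chapters


if __name__ == "__main__":
    for title, url in list_chapters(SAMPLE_HTML):
        print(title, url)
```

Skipping links that have no ellipsis span also removes the need to guess how many leading links to drop.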

Putting it together

Now that we can fetch chapter content as well as chapter titles and links, the next step is to download the whole novel. The code below follows the same logic as above, just wrapped in a class with a few methods.

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests
import sys


class download(object):
    """Download the novel 《斩月》 from 17K小说网."""

    def __init__(self):
        self.server = 'https://www.17k.com'
        self.target = 'https://www.17k.com/list/3062292.html'
        self.names = []  # chapter titles
        self.urls = []   # chapter links
        self.nums = 0    # number of chapters

    def get_download_url(self):
        """Collect the chapter titles and links from the catalog page."""
        req = requests.get(url=self.target)
        req.encoding = 'utf-8'
        html = req.text
        bf = BeautifulSoup(html, 'html.parser')
        div = bf.find_all('dl', class_='Volume')
        a_bf = BeautifulSoup(str(div[0]), 'html.parser')
        a = a_bf.find_all('a')
        self.nums = len(a[1:])  # drop the unneeded leading link and count the chapters
        for each in a[1:]:
            name = str(each.find_all('span', class_='ellipsis'))  # chapter title
            href = each.get('href')  # chapter link
            self.names.append(name[43:len(name) - 24])
            self.urls.append(self.server + href)

    def get_contents(self, target):
        """Return the text of one chapter.

        Parameters:
            target - chapter URL (String)
        Returns:
            texts - chapter content (String)
        """
        req = requests.get(url=target)
        req.encoding = 'utf-8'
        html = req.text
        bf = BeautifulSoup(html, 'html.parser')
        texts = bf.find_all('div', class_='p')
        texts = texts[0].text
        return texts[:len(texts) - 90]  # trim the last 90 characters of the extracted text

    def write(self, name, path, text):
        """Append one chapter to the output file.

        Parameters:
            name - chapter title (String)
            path - output file name, relative to the current directory (String)
            text - chapter content (String)
        """
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.write(text)
            f.write('\n\n')


if __name__ == "__main__":
    dl = download()  # use a separate name so the class is not shadowed
    dl.get_download_url()
    print("Downloading 《斩月》:")
    for i in range(dl.nums):
        dl.write(dl.names[i], "斩月.txt", dl.get_contents(dl.urls[i]))
        # Scale to a percentage; the raw ratio alone would never exceed 1%
        sys.stdout.write("  Downloaded: %.3f%%" % (100 * (i + 1) / dl.nums) + '\r')
        sys.stdout.flush()
    print("《斩月》 download complete!")
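
A full download sends one request per chapter in a tight loop, so a single failed request stops the whole run. One possible refinement, offered only as a sketch: pause between requests and retry a failed chapter a couple of times. The helper name download_all, its parameters, and the injected fetch callable are all hypothetical; in a real run fetch could be the class's get_contents method.

```python
import time


def download_all(urls, fetch, delay=1.0, retries=2):
    """Fetch every URL in order, pausing between requests.

    fetch is any callable that takes a URL and returns text; it is
    injected so the loop can be tried without network access.
    Returns a list of (url, text_or_None) pairs.
    """
    results = []
    for index, url in enumerate(urls):
        text = None
        for _attempt in range(retries + 1):
            try:
                text = fetch(url)
                break
            except Exception:
                time.sleep(delay)  # back off briefly, then retry
        results.append((url, text))
        percent = 100 * (index + 1) / len(urls)
        print("  downloaded: %.1f%%" % percent, end="\r")
        time.sleep(delay)  # pause before the next request
    print()
    return results


if __name__ == "__main__":
    def fake_fetch(url):  # hypothetical stand-in for a real request
        return "content of " + url

    pages = download_all(["a", "b"], fake_fetch, delay=0)
    print(pages)
```

A chapter that still fails after the retries is recorded as None, so the caller can decide whether to skip it or try again later.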

All done!