Python Crawler: Scraping Biquge Novels, Saving Them to Local Files, and Storing Them in a Database

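After learning Python I ran into web crawling, and since I also enjoy reading novels, I wrote a small crawler that scrapes novels from Biquge.

The program imports the following libraries: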

import requests
import mysql.connector
import os
import time
from bs4 import BeautifulSoup
import urllib.parse

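The site cannot list every book on a single page. After looking at how the site is built, I concluded that finding a book through the site's search box is the best approach.

The search URL ends in searchkey=+, and appending the book title to it leads to the detail page for that book. This is where the URL rules come in: by the standard, a URL may contain only a subset of ASCII characters, and other characters (such as Chinese characters) are not allowed, so the Chinese text has to be encoded. For example, suppose we search for 三寸人间.

First encode the entered novel title as GBK, then call urllib.parse.quote() to turn the Chinese text into its percent-encoded URL form. Concatenating that with the search address gives url_search. At this point we have the search URL we want and can start crawling.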
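The encoding step can be tried on its own, without touching the network. The sketch below is a minimal illustration that assumes the search endpoint is the one used in the article's code; the helper name build_search_url is hypothetical and only there for the example.

```python
from urllib.parse import quote, unquote_to_bytes

# Search endpoint copied from the article's code.
SEARCH_URL = "http://www.biquge.com.tw/modules/article/soshu.php?searchkey=+"


def build_search_url(title: str) -> str:
    """Percent-encode the GBK bytes of `title` and append them to SEARCH_URL.

    build_search_url is a hypothetical helper name, not part of the original program.
    """
    return SEARCH_URL + quote(title.encode("gbk"))


if __name__ == "__main__":
    url = build_search_url("三寸人间")
    print(url)
    # Decoding the query part again recovers the original title.
    encoded = url[len(SEARCH_URL):]
    print(unquote_to_bytes(encoded).decode("gbk"))
```

The title is encoded as GBK rather than UTF-8 because the program treats the site's pages as GBK (getHtmltext sets r.encoding = "gbk"), so the search parameter follows the same encoding.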

Here is the complete code first, followed by a short explanation.

import requests
import mysql.connector
import os
import time
from bs4 import BeautifulSoup
import urllib.parse

url_base='http://www.biquge.com.tw'
lastrowid=0  # id of the most recently inserted novel row; read by Download()

class Sql(object):
    # Get the database connection:
    conn = mysql.connector.connect(host='localhost',port=3306,user='root',passwd='123456789',database='novel',charset='utf8')

    def addnovels(self,novelname):
        # Insert a novel into the novel table and return its id
        cur = self.conn.cursor()
        # Parameterized query, so quotes in a title cannot break the SQL
        cur.execute("insert into novel(novelname) values(%s)",(novelname,))
        lastrowid = cur.lastrowid
        cur.close()
        self.conn.commit()
        return lastrowid

    def addchapters(self,novelid,chaptername,content):
        # Insert one chapter of a novel into the chapter table
        cur = self.conn.cursor()
        cur.execute("insert into chapter(novelid,chaptername,content) values(%s,%s,%s)",(novelid,chaptername,content))
        cur.close()
        self.conn.commit()

def getHtmltext(url):
    try:
        r = requests.get(url, timeout=60)
        r.raise_for_status()
        r.encoding = "gbk"
        return r.text
    except requests.RequestException:
        return ""

def Download(soup_search, path1):
    second=0
    datas=soup_search.find("div",{"id":"list"}).find("dl").find_all("dd")
    if not os.path.exists(path1):
        os.makedirs(path1)
    for i in datas:
        second+=1
        link = i.a.attrs.get("href")  # URL of one chapter
        numSection=i.a.string  # chapter title
        print(numSection)
        path2=path1+"\\"+numSection+'.txt'
        download_url=url_base+link
        html=getHtmltext(download_url)
        soup=BeautifulSoup(html,"html.parser")
        content=soup.find("div",{"id":"content"}).text  # chapter text
        mysql.addchapters(lastrowid,numSection,content)
        with open(path2,"w",encoding='utf-8') as f:  # open for writing: created if missing, overwritten if present
            if second%5==0:
                time.sleep(10)  # sleep 10 seconds to avoid triggering anti-crawler measures
            f.write(content)

def Tuijian(soup):
    global lastrowid  # Download() reads the module-level value
    All_li=soup.find("div",{"class":"r"}).find("ul").find_all("li")
    #print(All_li)
    for i in All_li:
        link = i.a.attrs.get("href")
        name1=i.a.text
        span=i.find("span",{"class":"s5"}).text
        print(link+"*****"+name1+"*****"+span)
    novel_name =input("Want to download one of these novels? Enter its title: ")
    path = input("Enter the drive letter to store the novel on: ")
    path1 = path + ":" + "\\" + novel_name
    url_name = urllib.parse.quote(novel_name.encode("gbk"))
    url_search = 'http://www.biquge.com.tw/modules/article/soshu.php?searchkey=+' + url_name  # build the novel's URL
    html_search = getHtmltext(url_search)
    soup_search = BeautifulSoup(html_search, "html.parser")
    lastrowid = mysql.addnovels(novel_name)
    Download(soup_search,path1)
    print("Download complete!!! Look for the novel on drive {}...".format(path))

mysql=Sql()  # note: from here on the name mysql refers to this instance, not the package

def main():
    global lastrowid  # Download() reads the module-level value
    novel_name=input("This program downloads novels from Biquge. Enter the exact novel title: ")
    path=input("Enter the drive letter to store the novel on: ")
    path1=path+":"+"\\"+novel_name
    # By the standard, a URL may contain only a subset of ASCII characters;
    # other characters (such as Chinese characters) must be encoded.
    # The URL needs Chinese text, so the input is first encoded as GBK.
    url_name=urllib.parse.quote(novel_name.encode("gbk"))
    url_search='http://www.biquge.com.tw/modules/article/soshu.php?searchkey=+'+url_name  # build the novel's URL
    html_search=getHtmltext(url_search)
    soup_search=BeautifulSoup(html_search,"html.parser")
    book=soup_search.select('#info > h1')
    if book==[]:
        print("Sorry, the book was not found!!!")
    else:
        print("Found {}, downloading......".format(novel_name))
        lastrowid = mysql.addnovels(novel_name)
        Download(soup_search,path1)
        print("Download complete!!! Look for the novel on drive {}...".format(path))
    url=[ "https://www.biquge5200.cc/xuanhuanxiaoshuo/","https://www.biquge5200.cc/xiuzhenxiaoshuo/","https://www.biquge5200.cc/dushixiaoshuo/","https://www.biquge5200.cc/chuanyuexiaoshuo/","https://www.biquge5200.cc/wangyouxiaoshuo/","https://www.biquge5200.cc/kehuanxiaoshuo/","https://www.biquge5200.cc/yanqingxiaoshuo/","https://www.biquge5200.cc/tongrenxiaoshuo/"]
    # Category names as shown on the site: fantasy, cultivation, urban,
    # time travel, online games, science fiction, romance, fan fiction
    url_n=["玄幻","修真","都市","穿越","网游","科幻","言情","同人"]
    print("Want more books but not sure what to read???")
    share=input("Browse the categories below and enter one you like (玄幻, 修真, 都市, 穿越, 网游, 科幻, 言情, 同人): ")
    for i in range(0,8):
        if share==url_n[i]:
            html = getHtmltext(url[i])
            print(url[i])
            soup = BeautifulSoup(html, "html.parser")
            Tuijian(soup)
            break

main()

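Execution starts in main(). It first encodes the novel title that was entered, then concatenates it onto the search address, which takes us to the search result page for that title: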

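    novel_name=input("This program downloads novels from Biquge. Enter the exact novel title: ")
    path=input("Enter the drive letter to store the novel on: ")
    path1=path+":"+"\\"+novel_name
    # By the standard, a URL may contain only a subset of ASCII characters;
    # other characters (such as Chinese characters) must be encoded.
    # The URL needs Chinese text, so the input is first encoded as GBK.
    url_name=urllib.parse.quote(novel_name.encode("gbk"))
    url_search='http://www.biquge.com.tw/modules/article/soshu.php?searchkey=+'+url_name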

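On the page returned by the search, we look for the actual book title. This uses the select() method with the CSS selector "#info > h1", because inspecting the page source shows that the title sits in an h1 tag directly inside the element with id info.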

    html_search=getHtmltext(url_search)
    soup_search=BeautifulSoup(html_search,"html.parser")
    book=soup_search.select('#info > h1')
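To see what select() returns in the found and not-found cases, here is a small offline sketch. The two HTML fragments are made-up, simplified stand-ins for the real pages, and find_title is a hypothetical helper name; only the selector "#info > h1" comes from the program.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragments; the real Biquge pages contain much more markup.
BOOK_PAGE = '<div id="info"><h1>三寸人间</h1><p>intro</p></div>'
NOT_FOUND_PAGE = '<div id="main"><p>no result</p></div>'


def find_title(html):
    """Return the text of the h1 directly under #info, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select("#info > h1")  # always a list, possibly empty
    return matches[0].get_text(strip=True) if matches else None


if __name__ == "__main__":
    print(find_title(BOOK_PAGE))
    print(find_title(NOT_FOUND_PAGE))
```

Because select() always returns a list, an empty list is the signal that no book page came back. That also covers the case where getHtmltext() returns an empty string after a failed request.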

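If the book is not found, the program says so. If it is found, the program calls the download function, which saves the chapters locally and stores them in the database.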

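    if book==[]:
        print("Sorry, the book was not found!!!")
    else:
        print("Found {}, downloading......".format(novel_name))
        # Download() reads the module-level lastrowid, so main() has to
        # declare "global lastrowid" before this assignment
        lastrowid = mysql.addnovels(novel_name)
        Download(soup_search,path1)
        print("Download complete!!! Look for the novel on drive {}...".format(path))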

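Download() both saves the novel to disk and stores its content in the database. The file is written inside a with open block, so it is closed automatically and we never have to close it by hand.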

def Download(soup_search, path1):
    second=0
    datas=soup_search.find("div",{"id":"list"}).find("dl").find_all("dd")
    #print(datas)
    #Path=path1
    if not os.path.exists(path1):
        os.makedirs(path1)
    for i in datas:
        second+=1
        link = i.a.attrs.get("href")  # URL of one chapter
        numSection=i.a.string  # chapter title
        #print(numSection)
        path2=path1+"\\"+numSection+'.txt'  # build the storage path; one file per chapter title
        download_url=url_base+link  # URL of the reading page, where the chapter text comes from
        html=getHtmltext(download_url)
        soup=BeautifulSoup(html,"html.parser")
        content=soup.find("div",{"id":"content"}).text  # chapter text
        mysql.addchapters(lastrowid,numSection,content)  # add the chapter to the database
        #print(content)
        with open(path2,"w",encoding='utf-8') as f:
            if second%5==0:
                time.sleep(10)  # sleep 10 seconds to avoid triggering anti-crawler measures
            f.write(content)  # save the chapter text locally
        #print(i)
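One thing Download() does not handle is a chapter title containing characters that Windows forbids in file names (for example ? or :), which would make open() fail. The sketch below shows one possible way to write a chapter file with a cleaned-up name. The helpers safe_filename and save_chapter are hypothetical and not part of the original program, and replacing the characters with an underscore is just one choice.

```python
import os
import re
import tempfile


def safe_filename(chapter_name: str) -> str:
    """Replace characters that Windows forbids in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", chapter_name).strip()


def save_chapter(folder: str, chapter_name: str, content: str) -> str:
    """Write one chapter to <folder>/<cleaned chapter name>.txt and return the path."""
    os.makedirs(folder, exist_ok=True)  # create the novel folder if it is missing
    path = os.path.join(folder, safe_filename(chapter_name) + ".txt")
    # The with block closes the file automatically, even if write() raises.
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_chapter(tmp, "Chapter 1: Who?", "chapter text")
        print(os.path.basename(saved))
```

os.path.join is used here instead of concatenating "\\" by hand, so the same code also works outside Windows.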

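Storing to the database means issuing database operations, so I wrote a class for the database reads and writes. mysql=Sql() then creates an instance, which makes it convenient to call its methods.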

class Sql(object):
    conn = mysql.connector.connect(host='localhost',port=3306,user='root',passwd='123456789',database='novel',charset='utf8')

    def addnovels(self,novelname):
        cur = self.conn.cursor()
        # Parameterized query, so quotes in a title cannot break the SQL
        cur.execute("insert into novel(novelname) values(%s)",(novelname,))
        lastrowid = cur.lastrowid
        cur.close()
        self.conn.commit()
        return lastrowid

    def addchapters(self,novelid,chaptername,content):
        cur = self.conn.cursor()
        cur.execute("insert into chapter(novelid,chaptername,content) values(%s,%s,%s)",(novelid,chaptername,content))
        cur.close()
        self.conn.commit()
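Chapter text is full of quotation marks, so it is safer to pass values to execute() as parameters than to paste them into the SQL string. The sketch below shows the pattern with Python's built-in sqlite3 module as an in-memory stand-in, so it runs without a MySQL server. That substitution is an assumption made for the example: the column types are guesses, the function names are hypothetical, and sqlite3 uses ? placeholders where mysql.connector uses %s.

```python
import sqlite3


def open_db():
    """Create an in-memory database with the article's two tables (types are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
    conn.execute(
        "CREATE TABLE novel (novelid INTEGER PRIMARY KEY AUTOINCREMENT, novelname TEXT)"
    )
    conn.execute(
        "CREATE TABLE chapter ("
        "chapterid INTEGER PRIMARY KEY AUTOINCREMENT, "
        "novelid INTEGER REFERENCES novel(novelid), "
        "chaptername TEXT, content TEXT)"
    )
    return conn


def add_novel(conn, novelname):
    """Insert a novel and return its auto-generated id."""
    cur = conn.execute("INSERT INTO novel(novelname) VALUES (?)", (novelname,))
    conn.commit()
    return cur.lastrowid


def add_chapter(conn, novelid, chaptername, content):
    """Insert one chapter; values are bound as parameters, never formatted into the SQL."""
    conn.execute(
        "INSERT INTO chapter(novelid, chaptername, content) VALUES (?, ?, ?)",
        (novelid, chaptername, content),
    )
    conn.commit()


if __name__ == "__main__":
    db = open_db()
    novel_id = add_novel(db, "It's a title")
    add_chapter(db, novel_id, "Chapter 1", "He said 'hello'.")
    print(db.execute("SELECT novelid, chaptername, content FROM chapter").fetchall())
```

The id returned by add_novel plays the same role as lastrowid in the program: it is the value that ties each chapter row to its novel row.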

At this point the local download and the database storage both work.

I then added a feature that recommends books.

    url=[ "https://www.biquge5200.cc/xuanhuanxiaoshuo/","https://www.biquge5200.cc/xiuzhenxiaoshuo/","https://www.biquge5200.cc/dushixiaoshuo/","https://www.biquge5200.cc/chuanyuexiaoshuo/","https://www.biquge5200.cc/wangyouxiaoshuo/","https://www.biquge5200.cc/kehuanxiaoshuo/","https://www.biquge5200.cc/yanqingxiaoshuo/","https://www.biquge5200.cc/tongrenxiaoshuo/"]
    # Category names as shown on the site: fantasy, cultivation, urban,
    # time travel, online games, science fiction, romance, fan fiction
    url_n=["玄幻","修真","都市","穿越","网游","科幻","言情","同人"]
    print("Want more books but not sure what to read???")
    share=input("Browse the categories below and enter one you like (玄幻, 修真, 都市, 穿越, 网游, 科幻, 言情, 同人): ")
    for i in range(0,8):
        if share==url_n[i]:
            html = getHtmltext(url[i])
            soup = BeautifulSoup(html, "html.parser")
            Tuijian(soup)
            break

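The entries of url and url_n correspond one to one. The loop compares the category name that was entered against each entry of url_n; on a match it fetches the page at the same index in url and calls Tuijian() on it.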
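Since the two lists are only ever used as pairs, the same lookup can also be written with a dictionary, which removes the index loop. This is an alternative sketch, not the code the program uses; the names CATEGORY_URLS and category_url are hypothetical, while the URLs and category names are the ones from the lists above.

```python
BASE = "https://www.biquge5200.cc/"

# Category name (as shown on the site) mapped to its listing page.
CATEGORY_URLS = {
    "玄幻": BASE + "xuanhuanxiaoshuo/",
    "修真": BASE + "xiuzhenxiaoshuo/",
    "都市": BASE + "dushixiaoshuo/",
    "穿越": BASE + "chuanyuexiaoshuo/",
    "网游": BASE + "wangyouxiaoshuo/",
    "科幻": BASE + "kehuanxiaoshuo/",
    "言情": BASE + "yanqingxiaoshuo/",
    "同人": BASE + "tongrenxiaoshuo/",
}


def category_url(name):
    """Return the listing URL for a category name, or None if it is unknown."""
    return CATEGORY_URLS.get(name.strip())


if __name__ == "__main__":
    print(category_url("玄幻"))
    print(category_url("no such category"))
```

A dictionary also makes the unknown-category case explicit: the original loop simply does nothing when the input matches no entry.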

def Tuijian(soup):
    global lastrowid  # Download() reads the module-level value
    All_li=soup.find("div",{"class":"r"}).find("ul").find_all("li")
    #print(All_li)
    for i in All_li:
        link = i.a.attrs.get("href")
        name1=i.a.text
        span=i.find("span",{"class":"s5"}).text
        print(link+"*****"+name1+"*****"+span)
    novel_name =input("Want to download one of these novels? Enter its title: ")
    path = input("Enter the drive letter to store the novel on: ")
    path1 = path + ":" + "\\" + novel_name
    url_name = urllib.parse.quote(novel_name.encode("gbk"))
    url_search = 'http://www.biquge.com.tw/modules/article/soshu.php?searchkey=+' + url_name  # build the novel's URL
    #print(url_search)
    html_search = getHtmltext(url_search)
    soup_search = BeautifulSoup(html_search, "html.parser")
    #print(soup_search)
    lastrowid = mysql.addnovels(novel_name)
    Download(soup_search,path1)
    print("Download complete!!! Look for the novel on drive {}...".format(path))

Tuijian() first gets the list of books in the category, then loops over it and prints each book's link, title, and author.

All_li=soup.find("div",{"class":"r"}).find("ul").find_all("li")
for i in All_li:
    link = i.a.attrs.get("href")
    name1=i.a.text
    span=i.find("span",{"class":"s5"}).text
    print(link+"*****"+name1+"*****"+span)

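Pick a title you like from the ones printed and type it in. The rest of the code repeats the steps already described.

Here is the database design:

novel table: novelid is the primary key, set to auto-increment.

chapter table: chapterid is the primary key, set to auto-increment. novelid is a foreign key that references the novel table.

Corrections and suggestions are welcome.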
