Python example: scraping a novel from a website and downloading it

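Goal:
Enter the URL of a novel's table-of-contents page in the console, press Enter, and download the whole book.
Analysis:
1. The page structure of the target website
2. Whether the data on the website is usable
Requirements:
1. The URL of the table-of-contents page
2. The URL of each chapter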

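The script imitates a browser and makes two kinds of request:
1. The first request fetches the novel's table-of-contents page. The tags in the response are parsed to find and collect the chapter URLs.
2. The second request is sent to each collected chapter URL and retrieves the text of that chapter.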

The code is as follows

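import os
import re

import requests
from bs4 import BeautifulSoup


# Fetch the table-of-contents page and download every chapter listed on it
def get_xiaoshou_mulu(xiaoshuo_mulu, header):
    # Send the request
    response = requests.get(url=xiaoshuo_mulu, headers=header)
    if response.status_code == 200:
        # Set the encoding to match the page being scraped
        response.encoding = 'gbk'
        # response.text is decoded with the encoding set above
        # (response.content would be raw bytes and ignore it)
        html = response.text
        # Parse the page source with the lxml parser
        soup = BeautifulSoup(html, 'lxml')
        # Find the <dl> tag that holds the chapter list
        tag_dl = soup.find('dl')
        print(tag_dl)
        start_flag = False
        for tag_dd in tag_dl:
            # Skip the newline nodes between tags
            if tag_dd == '\n':
                continue
            elif tag_dd.string == '《' + xiaoshuo_name + '》正文卷':
                # Heading of the main text: chapters start after it
                start_flag = True
            elif start_flag:
                # Build the chapter URL and download the chapter
                content_name = tag_dd.a.string
                content_src = url + tag_dd.a['href']
                get_xiaoshuo_mulu_content(content_name, content_src)
                print(content_name, ':', content_src, '------------download finished!')
    else:
        print('Could not access the page!')


# Fetch one chapter by its URL and save it to disk
def get_xiaoshuo_mulu_content(content_name, content_src):
    # Use the two arguments to send a second request and download the chapter
    response = requests.get(url=content_src, headers=header, verify=True)
    if response.status_code == 200:
        response.encoding = 'gbk'
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        # Get the tag that contains the chapter text
        div = soup.find(attrs={'id': 'content', 'class': 'showtxt'})
        # re.sub() replaces every non-breaking space with a blank line
        lines = re.sub('[\xa0]', '\n\n', div.text)
        # Create the output folder if it does not exist yet
        path = os.path.join('笔趣小说', xiaoshuo_name)
        if not os.path.exists(path):
            os.makedirs(path)
            print('Folder created!')
        # Write the chapter into the folder
        file_path = os.path.join(path, content_name + '.txt')
        with open(file_path, 'w', encoding='utf-8', newline='') as file:
            file.write(lines)
    else:
        print('Could not access the chapter!')


if __name__ == '__main__':
    print('====================Novel download helper===============')
    print('Instructions: 1. enter the URL of the table-of-contents page  2. enter the name of the novel')
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    xiaoshuo_mulu = input('Enter the URL of the table-of-contents page: ')
    # Use a regex to cut the input down to the site root
    # url = xxxxxx.com
    url = xiaoshuo_mulu[:re.search(r'\.com', xiaoshuo_mulu).span()[1]]
    xiaoshuo_name = input('Enter the name of the novel: ')
    get_xiaoshou_mulu(xiaoshuo_mulu, header)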
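The clean-up and save step of get_xiaoshuo_mulu_content can be tried on its own, without any network access. The following is only a minimal sketch: the helper name save_chapter, the sample text, and the temporary output directory (used here in place of the 笔趣小说 folder) are all assumptions made for illustration. It uses pathlib instead of joining paths with backslashes, so it should also run outside Windows.

```python
import re
import tempfile
from pathlib import Path


def save_chapter(base_dir, novel_name, chapter_name, raw_text):
    """Hypothetical helper: clean the chapter text and write it to
    <base_dir>/<novel_name>/<chapter_name>.txt, returning the file path."""
    # Same substitution as in the article: each non-breaking space
    # becomes a blank line.
    text = re.sub('[\xa0]', '\n\n', raw_text)
    folder = Path(base_dir) / novel_name
    # Create the folder (and any missing parents) if it is not there yet.
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / (chapter_name + '.txt')
    with open(target, 'w', encoding='utf-8', newline='') as f:
        f.write(text)
    return target


# Usage with made-up data, written into a throwaway directory.
demo_dir = tempfile.mkdtemp()
saved = save_chapter(demo_dir, 'demo_novel', 'chapter_1', 'first\xa0second')
saved_text = saved.read_text(encoding='utf-8')
print(saved)
```

Chapter titles taken from a real page may contain characters that are not allowed in file names, which this sketch does not handle.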

Analysis

The key snippets from the code above are explained below.

# Imitate a browser by sending a User-Agent header with the requests call
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

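The User-Agent string can be found in Google Chrome's developer tools: press F12, refresh the page, and look at the request headers.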

# Set the encoding to match the page being scraped
response.encoding = 'gbk'
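Why the encoding has to match can be shown without sending a request. In requests, response.encoding controls how response.text is decoded, while response.content is always the raw bytes. A minimal sketch, assuming a made-up sample string in place of a real page body:

```python
# Bytes as a GBK-encoded page would send them
# (the sample string is an illustrative assumption).
raw = '第一章 开始'.encode('gbk')

# Decoding with the page's real encoding recovers the text.
decoded_ok = raw.decode('gbk')

# Decoding with the wrong codec does not; errors='replace' keeps the
# call from raising so the damaged result can be inspected.
decoded_bad = raw.decode('utf-8', errors='replace')

print(decoded_ok)
print(decoded_ok == decoded_bad)
```

If the downloaded chapters come out as unreadable characters, a mismatch between the declared and the real page encoding is a likely cause.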

# Find the <dl> tag that holds the chapter list
tag_dl = soup.find('dl')
print(tag_dl)
start_flag = False
for tag_dd in tag_dl:
    # Skip the newline nodes between tags
    if tag_dd == '\n':
        continue
    elif tag_dd.string == '《' + xiaoshuo_name + '》正文卷':
        # Heading of the main text: chapters start after it
        start_flag = True
    elif start_flag:
        # Build the chapter URL and download the chapter
        content_name = tag_dd.a.string
        content_src = url + tag_dd.a['href']
        get_xiaoshuo_mulu_content(content_name, content_src)
        print(content_name, ':', content_src, '------------download finished!')
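The start-flag logic can be tested on a small HTML string instead of a live page. The sketch below is not the article's implementation: it swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages, and it swaps the `url + href` concatenation for urllib.parse.urljoin, which also works for sites that do not end in .com. The class name ChapterListParser, the sample markup, and the example.com address are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterListParser(HTMLParser):
    """Collect (title, absolute URL) pairs for links after a given <dt> heading."""

    def __init__(self, base_url, heading):
        super().__init__()
        self.base_url = base_url
        self.heading = heading
        self.started = False      # True once the heading <dt> has been seen
        self.in_dt = False        # True while inside a <dt> element
        self.current_href = None  # href of the <a> being read, if collecting
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        if tag == 'dt':
            self.in_dt = True
        elif tag == 'a' and self.started:
            self.current_href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'dt':
            self.in_dt = False
        elif tag == 'a':
            self.current_href = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self.in_dt:
            if text == self.heading:
                self.started = True
        elif self.current_href:
            # Resolve the (possibly relative) href against the page URL.
            full_url = urljoin(self.base_url, self.current_href)
            self.chapters.append((text, full_url))


# Assumed markup: a "latest chapters" block, then the main-text heading.
sample_html = """
<dl>
  <dt>《demo》最新章节列表</dt>
  <dd><a href="/1_1/99.html">Chapter 99</a></dd>
  <dt>《demo》正文卷</dt>
  <dd><a href="/1_1/1.html">Chapter 1</a></dd>
  <dd><a href="/1_1/2.html">Chapter 2</a></dd>
</dl>
"""

parser = ChapterListParser('https://www.example.com/1_1/', '《demo》正文卷')
parser.feed(sample_html)
chapters = parser.chapters
for title, link in chapters:
    print(title, link)
```

Links that appear before the heading are skipped, which is the same job start_flag does in the loop above.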
