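Scraping a Novel with the requests Module

Scrape a novel from a website with procedural Python and save it as a text file

The code is fairly simple. The steps are as follows: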

1. Import requests and re (the regular expression calls in the later steps need re)

import requests
import re

2. Send an HTTP request the way a browser would and get the HTML source of the novel's main page

novel_url = 'http://www.xs4.cc/book/9/3802/'
response = requests.get(novel_url)
response.encoding = 'utf-8'
html = response.text
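
One thing to be aware of: requests.get called with no extra arguments sends the library's default User-Agent, not a browser's. A minimal sketch of a more browser-like request follows. The helper name fetch_html and the User-Agent string are my own illustrative choices, not part of the original code, and the session is passed in as a parameter so the helper can be tried with a stand-in object.

```python
# Illustrative header value (an assumption); any realistic browser string works.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def fetch_html(session, url, encoding="utf-8", timeout=10):
    """Fetch url with browser-like headers and return the decoded page text.

    session is expected to be a requests.Session, or any object with a
    compatible get(url, headers=..., timeout=...) method.
    """
    response = session.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    response.raise_for_status()   # raise an exception on 4xx/5xx responses
    response.encoding = encoding  # same explicit decoding as in step 2
    return response.text
```

With requests imported, this would be called as html = fetch_html(requests.Session(), novel_url).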

3. Use regular expressions to get each chapter's title and URL

div = re.findall(r'<DIV class="clearfix dirconone">.*?</div>',html,re.S)[0]
chapter_list = re.findall(r'<a href="(.*?)" title=".*?">(.*?)</a>',div)
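
To see what these two patterns return, here is a small sketch run against a made-up HTML fragment. The fragment is only my guess at the shape of the chapter index, not copied from the site. It also checks that the container was found before indexing with [0], which otherwise raises IndexError when the page layout changes.

```python
import re

# Hypothetical fragment shaped like the chapter index the patterns expect.
sample_html = '''
<DIV class="clearfix dirconone">
<li><a href="/book/9/3802/1.html" title="Chapter 1">Chapter 1</a></li>
<li><a href="/book/9/3802/2.html" title="Chapter 2">Chapter 2</a></li>
</div>
'''

# re.S lets "." match newlines, so the container can span several lines.
blocks = re.findall(r'<DIV class="clearfix dirconone">.*?</div>', sample_html, re.S)
if not blocks:
    raise ValueError("chapter list container not found")

# Each match is a (href, link text) tuple because the pattern has two groups.
chapter_list = re.findall(r'<a href="(.*?)" title=".*?">(.*?)</a>', blocks[0])
print(chapter_list)
# [('/book/9/3802/1.html', 'Chapter 1'), ('/book/9/3802/2.html', 'Chapter 2')]
```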

4. Loop over the chapter list

for chapter_content in chapter_list:
    chapter_title = chapter_content[1]
    chapter_url = chapter_content[0]
    chapter_url = 'http://www.xs4.cc%s' % chapter_url
    print(chapter_url, chapter_title)
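
The string formatting above works as long as every href starts with a slash. As an alternative, the standard library's urljoin resolves both site-relative and page-relative links against the index URL. This is a sketch only: the helper name absolute_chapter_urls and the sample hrefs are invented for illustration.

```python
from urllib.parse import urljoin

NOVEL_URL = 'http://www.xs4.cc/book/9/3802/'


def absolute_chapter_urls(chapter_list, base_url=NOVEL_URL):
    """Turn (href, title) pairs into (absolute_url, title) pairs."""
    return [(urljoin(base_url, href), title) for href, title in chapter_list]


# Hypothetical hrefs: one site-relative, one relative to the index page.
sample = [('/book/9/3802/1.html', 'Chapter 1'), ('2.html', 'Chapter 2')]
for chapter_url, chapter_title in absolute_chapter_urls(sample):
    print(chapter_url, chapter_title)
# http://www.xs4.cc/book/9/3802/1.html Chapter 1
# http://www.xs4.cc/book/9/3802/2.html Chapter 2
```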

5. Extract the novel's title and create a text file named after it, which will hold the chapter content later

title = re.findall(r'<strong>(.*?)</strong>',html)[0]
fb = open('%s.text' %title,'w',encoding = 'utf-8')
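
A title taken straight from a web page can contain characters that are not valid in file names, and a file opened with a bare open() stays open until it is closed explicitly. Here is a rough sketch of one way to handle both. The helpers safe_filename and append_chapter are hypothetical names of mine, not from the original code.

```python
import re


def safe_filename(title, extension=".text"):
    """Replace characters that Windows forbids in file names (this also covers '/')."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + extension


def append_chapter(path, chapter_title, chapter_text):
    """Append one chapter to the novel file; the with-block closes the file."""
    with open(path, "a", encoding="utf-8") as fb:
        fb.write(chapter_title + "\n")
        fb.write(chapter_text + "\n")


print(safe_filename("A/B: C?"))  # A_B_ C_.text
```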

6. Extract the chapter content (this runs inside the loop from step 4, once per chapter_url)

chapter_download = requests.get(chapter_url)
chapter_download.encoding = 'utf-8'
chapter_html = chapter_download.text
chapter_content_download = re.findall(r'id=\"content\">(.*?)<div class=\"backs\">',chapter_html,re.S)[0]
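
Indexing the findall result with [0] raises IndexError whenever a page does not match the pattern. A small sketch of a more forgiving version, using the same pattern with re.search, could look like this. The function name extract_content and the sample HTML are assumptions made for illustration.

```python
import re

# Same pattern as above; re.S lets the content span multiple lines.
CONTENT_PATTERN = re.compile(r'id="content">(.*?)<div class="backs">', re.S)


def extract_content(chapter_html):
    """Return the raw chapter body, or None if the page has no match."""
    match = CONTENT_PATTERN.search(chapter_html)
    return match.group(1) if match else None


# Hypothetical page fragment.
sample = '<div id="content">Line one<br/>Line two<div class="backs">back</div></div>'
print(extract_content(sample))          # Line one<br/>Line two
print(extract_content('<html></html>'))  # None
```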

7. Save the novel content

print('Saving  %s' % chapter_title)
fb.write(chapter_title)
fb.write('\n')
fb.write(chapter_content_download)
fb.write('\n')

8. Clean the data (do this before the write in step 7, otherwise the uncleaned markup is what gets saved)

chapter_content_download = chapter_content_download.replace(' ','')
chapter_content_download = chapter_content_download.replace(' ','')
chapter_content_download = chapter_content_download.replace('<br/>','')
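
For comparison, here is a sketch of a slightly different clean-up. Instead of deleting every space and every line break tag, it turns break tags into newlines, decodes HTML entities, and drops HTML comments. This is my own variation, not the author's method. The name clean_chapter_text and the sample input are hypothetical, and which whitespace characters the site actually uses is an assumption.

```python
import html
import re


def clean_chapter_text(raw):
    """Strip comments and break tags, decode entities, and trim each line."""
    text = re.sub(r'<!--.*?-->', '', raw, flags=re.S)     # drop HTML comments
    text = re.sub(r'<br\s*/?>', '\n', text, flags=re.I)   # break tags become newlines
    text = html.unescape(text)                            # e.g. &nbsp; becomes \xa0
    text = text.replace('\xa0', ' ').replace('\u3000', ' ')  # normalize odd spaces
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)      # skip empty lines


# Hypothetical raw chapter body.
raw = ('&nbsp;&nbsp;First line<br/><br/>\u3000\u3000Second line'
       '<!-- <script src=x.js></script> -->')
print(clean_chapter_text(raw))
# First line
# Second line
```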

----------------------------------------------Function-based implementation------------------------------------------------

import requests
import re


# Get each chapter's URL and title
def get_chapter_list():
    response = requests.get('http://www.xs4.cc/book/9/3802/')
    response.encoding = 'utf-8'
    html = response.text
    div = re.findall(r'<DIV class="clearfix dirconone">.*?</div>', html, re.S)[0]
    chapter_list = re.findall(r'<a href="(.*?)" title=".*?">(.*?)</a>', div)
    return chapter_list


# Get the content of one chapter
def chapter_download(chapter_url):
    chapter_dl = requests.get(chapter_url)
    chapter_dl.encoding = 'utf-8'
    chapter_html = chapter_dl.text
    chapter_content_download = re.findall(r'id=\"content\">(.*?)<div class=\"backs\">', chapter_html, re.S)[0]
    # Clean the data
    chapter_content_download = chapter_content_download.replace(' ', '')
    chapter_content_download = chapter_content_download.replace(' ', '')
    chapter_content_download = chapter_content_download.replace('<br/>', '')
    chapter_content_download = chapter_content_download.replace('<!--<divstyle="margin:1px1px6px1px;"><scriptsrc=/d/js/acmsd/thea16.js></script></div>-->', '')
    return chapter_content_download


# Loop over the chapters, creating one text file per chapter to hold its content
for chapter_url, chapter_title in get_chapter_list():
    chapter_url = 'http://www.xs4.cc%s' % chapter_url
    print(chapter_url, chapter_title)
    # Persist the data
    print('Saving %s' % chapter_title)
    # The with-block closes the file even if the download raises an exception
    with open('%s.text' % chapter_title, 'a+', encoding='utf-8') as fn:
        fn.write(chapter_download(chapter_url))
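
The final loop can also be wrapped in a function that takes the download step as a parameter, which makes it possible to add a pause between requests and to try the saving logic without touching the network. This is only a sketch under my own assumptions. save_chapters, its parameters, and the one-second default delay are hypothetical, and unlike the 'a+' mode above it overwrites an existing chapter file rather than appending to it.

```python
import time
from pathlib import Path


def save_chapters(chapters, fetch_chapter, out_dir, delay_seconds=1.0):
    """Write each chapter to its own .text file under out_dir.

    chapters: iterable of (chapter_url, chapter_title) pairs
    fetch_chapter: callable that takes a URL and returns the chapter text
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, (chapter_url, chapter_title) in enumerate(chapters):
        if index and delay_seconds:
            time.sleep(delay_seconds)  # pause between requests, not before the first
        text = fetch_chapter(chapter_url)
        target = out_path / f"{chapter_title}.text"
        target.write_text(text, encoding="utf-8")  # overwrites any existing file
        saved.append(target)
    return saved
```

With the functions above, the absolute URLs would be built first and then passed in, for example save_chapters(pairs, chapter_download, 'novel'), where pairs holds (absolute_url, title) tuples.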

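Each chapter is saved to its own text file, named after its chapter_title

The novel is here --> 龙之禁锢

Summary:

There isn't much code, but I still hit a few snags along the way. From here on I'll shore up the basics and keep learning!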
