Fetching the HTML

# -*- coding:UTF-8 -*-
import requests

if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    print(req.text)

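Parsing the HTML

There are many ways to extract the content, for example regular expressions, XPath, or Beautiful Soup.

Beautiful Soup is installed the same way as requests, using either one of the following commands: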

pip install beautifulsoup4
easy_install beautifulsoup4

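A close look at the target site reveals one useful fact: there is exactly one div tag whose class attribute is showtxt, and it holds the chapter text we care about.

Knowing this, we can use Beautiful Soup to extract the content we want. The code is as follows: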

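# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    html = req.text
    # Name the parser explicitly so the result does not depend on what is installed.
    bf = BeautifulSoup(html, 'html.parser')
    # Match every <div> whose class attribute is showtxt.
    texts = bf.find_all('div', class_='showtxt')
    print(texts)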

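Before parsing the HTML, we need to create a Beautiful Soup object. The first argument to BeautifulSoup is the HTML we have already fetched. We then call find_all to get every div tag in the HTML whose class attribute is showtxt. The first argument to find_all is the tag name to match; the second, class_, is the tag's class attribute. Why class_ with a trailing underscore rather than class? Because class is a reserved keyword in Python, so Beautiful Soup uses class_ to avoid the conflict. The value passed to class_, showtxt, is the attribute value to match. Here is the format of the tag we want to match: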

<div id="content", class="showtxt">
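To see what find_all with class_ returns without touching the network, here is a minimal sketch on an inline HTML snippet. The snippet, its header and footer divs, and the name SAMPLE_HTML are made up for illustration; only the showtxt div loosely mirrors the format above. It assumes Python's built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a chapter page (not the real site's markup).
SAMPLE_HTML = """
<html><body>
<div class="header">site navigation</div>
<div id="content" class="showtxt">First line.<br/><br/>Second line.</div>
<div class="footer">copyright</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ (with the trailing underscore) filters on the HTML class attribute.
matches = soup.find_all("div", class_="showtxt")

print(len(matches))  # how many div tags matched
print(matches[0])    # the whole tag, markup included
```

Printing the match shows the complete tag, so the div wrapper and the br tags are still part of the output at this stage.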

The matched result still contains the div tag itself, br tags, and assorted whitespace. How do we get rid of these?

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, 'html.parser')
    texts = bf.find_all('div', class_='showtxt')
    # .text drops the markup; each run of eight '\xa0' becomes a blank line.
    print(texts[0].text.replace('\xa0'*8, '\n\n'))

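find_all returns its matches as a list. After taking the first match, we use its text attribute to extract the text content, which drops the br tags. We then call replace to strip the spaces and substitute line breaks that separate the paragraphs. In HTML, &nbsp; represents a non-breaking space, which shows up as \xa0 in the extracted text. replace('\xa0'*8, '\n\n') swaps each run of eight of these space characters for a blank line.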
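The effect of the replace call can be checked on a plain string. This is a minimal sketch that assumes the extracted text has each paragraph preceded by eight non-breaking spaces; the sample sentences and the names INDENT, raw_text, and cleaned are invented for illustration.

```python
# Assumed shape of texts[0].text: every paragraph is preceded by eight
# non-breaking spaces (U+00A0, written '\xa0' in Python).
INDENT = '\xa0' * 8
raw_text = INDENT + 'First paragraph.' + INDENT + 'Second paragraph.'

# Replace every eight-character indent with a blank line.
cleaned = raw_text.replace(INDENT, '\n\n')

# repr() makes the newline characters visible.
print(repr(cleaned))
```

Because the replacement targets exactly eight consecutive characters, shorter runs of \xa0 elsewhere in the text would be left alone.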

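The link to each chapter of the novel sits in an <a> tag under the <div> tag whose class attribute is listmain. Specifically, the link is the href attribute at html->body->div->dl->dd->a. We first match the <div> tag with class listmain, then match the <a> tags. The code is as follows: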

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'http://www.biqukan.com/1_1094/'
    req = requests.get(url=target)
    html = req.text
    div_bf = BeautifulSoup(html, 'html.parser')
    # The chapter list lives in the <div> with class listmain.
    div = div_bf.find_all('div', class_='listmain')
    print(div[0])

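Next, we match each <a> tag and extract the chapter name and the chapter link.

For a match a returned by Beautiful Soup, a.get('href') returns the value of the href attribute and a.string returns the chapter name. The code is as follows: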

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    server = 'http://www.biqukan.com/'
    target = 'http://www.biqukan.com/1_1094/'
    req = requests.get(url=target)
    html = req.text
    div_bf = BeautifulSoup(html, 'html.parser')
    div = div_bf.find_all('div', class_='listmain')
    # Re-parse the matched <div> and collect every <a> tag inside it.
    a_bf = BeautifulSoup(str(div[0]), 'html.parser')
    a = a_bf.find_all('a')
    for each in a:
        # Chapter name, followed by the full chapter URL.
        print(each.string, server + each.get('href'))

Because find_all returns a list holding many <a> tags, we use a for loop to visit each <a> tag and print it.
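The same extraction can be tried offline on a small fragment. The sketch below is a variation, not the code above: it calls find_all directly on the matched div instead of re-parsing its string form, and it builds the full URL with urllib.parse.urljoin from the standard library rather than string concatenation. The fragment, chapter titles, and paths are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical table-of-contents fragment; titles and paths are made up.
SAMPLE_HTML = """
<div class="listmain">
<dl>
<dd><a href="/1_1094/0001.html">Chapter 1</a></dd>
<dd><a href="/1_1094/0002.html">Chapter 2</a></dd>
</dl>
</div>
"""

SERVER = "http://www.biqukan.com/"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listmain = soup.find_all("div", class_="listmain")[0]

chapters = []
for a in listmain.find_all("a"):
    # a.string is the link text; a.get("href") is the raw attribute value.
    # urljoin resolves the relative path against the server root.
    chapters.append((a.string, urljoin(SERVER, a.get("href"))))

for name, url in chapters:
    print(name, url)
```

One reason to consider urljoin here is that it avoids a doubled slash when the server string ends with "/" and the href begins with one.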

Putting It All Together

To finish, we combine the pieces and write the scraped content to a text file.

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests, sys


class downloader(object):

    def __init__(self):
        self.server = 'http://www.biqukan.com/'
        self.target = 'http://www.biqukan.com/1_1094/'
        self.names = []  # chapter names
        self.urls = []   # chapter links
        self.nums = 0    # number of chapters

    def get_download_url(self):
        """Collect chapter names and links from the table of contents."""
        req = requests.get(url=self.target)
        html = req.text
        div_bf = BeautifulSoup(html, 'html.parser')
        div = div_bf.find_all('div', class_='listmain')
        a_bf = BeautifulSoup(str(div[0]), 'html.parser')
        a = a_bf.find_all('a')
        # Skip the first 15 links, which are not needed, and count the rest.
        self.nums = len(a[15:])
        for each in a[15:]:
            self.names.append(each.string)
            self.urls.append(self.server + each.get('href'))

    def get_contents(self, target):
        """Fetch one chapter page and return its text, split into paragraphs."""
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, 'html.parser')
        texts = bf.find_all('div', class_='showtxt')
        texts = texts[0].text.replace('\xa0'*8, '\n\n')
        return texts

    def writer(self, name, path, text):
        """Write the scraped chapter content to a file.

        Parameters:
            name - chapter name (string)
            path - file name, in the current directory, the novel is saved under (string)
            text - chapter content (string)
        Returns:
            None
        """
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.writelines(text)
            f.write('\n\n')


if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
    print('《一念永恒》 download started:')
    for i in range(dl.nums):
        dl.writer(dl.names[i], '一念永恒.txt', dl.get_contents(dl.urls[i]))
        # Multiply by 100 so the printed value matches the percent sign.
        sys.stdout.write("  Downloaded: %.3f%%" % (100 * (i + 1) / dl.nums) + '\r')
        sys.stdout.flush()
    print('《一念永恒》 download finished')
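The file-writing and progress steps can be exercised on their own, without any scraping. The following is a minimal sketch under stated assumptions: append_chapter is a hypothetical standalone version of the writer method, the chapter data is invented, and the output goes to a temporary directory rather than the real novel file.

```python
import os
import tempfile


def append_chapter(path, name, text):
    """Append one chapter (title, body, blank line) to the file at path."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.write(text)
        f.write('\n\n')


# Made-up chapters standing in for scraped content.
chapters = [('Chapter 1', 'First body.'), ('Chapter 2', 'Second body.')]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'novel.txt')
    for index, (name, text) in enumerate(chapters, start=1):
        append_chapter(out_path, name, text)
        # Progress as a percentage of chapters written so far.
        percent = 100 * index / len(chapters)
        print('Downloaded: %.3f%%' % percent)
    # Read the file back before the temporary directory is removed.
    with open(out_path, encoding='utf-8') as f:
        saved = f.read()

print(saved)
```

Opening the file in append mode ('a') is what lets each chapter be added in turn. It also means that rerunning a download against an existing file would add the chapters a second time.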
