Python Web Scraping: Using requests and BeautifulSoup to Scrape a Novel's Volume Titles and Chapters

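This walkthrough scrapes the volume titles and chapter names of the novel 雪鹰领主 as its example.
Start by viewing the page source to see how the page is structured. As the selectors used in the code indicate, each volume title is a <b> tag inside an element with class "volumn", and each chapter is an <a> tag inside an element with class "chapter".
Fetching the HTML content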

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'}
response = requests.get('https://quanxiaoshuo.com/177913/', headers=headers)
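
The request above assumes the server answers promptly and successfully. As a minimal sketch of a more defensive fetch, the helper below adds a timeout, raises on HTTP error codes, and falls back to the detected encoding when the server does not declare one. The name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original code.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
}


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the decoded page text or raise on failure."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    # requests falls back to ISO-8859-1 for text responses without a declared
    # charset, which would garble Chinese text; use the detected encoding instead.
    if response.encoding is None or response.encoding.lower() == 'iso-8859-1':
        response.encoding = response.apparent_encoding
    return response.text


# Usage (performs a real network request):
# html = fetch_html('https://quanxiaoshuo.com/177913/')
```

The returned string can be handed to BeautifulSoup directly, in which case no from_encoding argument is needed.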

Extracting the volume titles

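from bs4 import BeautifulSoup
# Pass the raw bytes (response.content) so that from_encoding takes effect;
# BeautifulSoup ignores it, with a warning, when the markup is already a str.
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')  # 'html.parser' or 'lxml'
title = []
for volumn in soup.find_all(class_="volumn"):
    b = volumn.find('b')
    if b is not None:
        b_title = b.string
        title.append({'volumn': b_title})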
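
To see what this loop collects without touching the network, the same logic can be run against a small inline snippet. The HTML below is a made-up stand-in for the real page (only the "volumn" class and the <b> tag come from the code above), and extract_volume_titles is a hypothetical wrapper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the scraper expects.
SAMPLE_HTML = """
<div class="volumn"><b>Volume 1</b></div>
<div class="chapter"><a href="/177913/1.html" title="Chapter 1">Chapter 1</a></div>
<div class="volumn"></div>
<div class="volumn"><b>Volume 2</b></div>
"""


def extract_volume_titles(html):
    """Collect the text of the <b> tag inside every element with class 'volumn'."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for volumn in soup.find_all(class_='volumn'):
        b = volumn.find('b')
        if b is not None:  # skip volume blocks that have no <b> heading
            titles.append({'volumn': b.get_text(strip=True)})
    return titles


print(extract_volume_titles(SAMPLE_HTML))
# [{'volumn': 'Volume 1'}, {'volumn': 'Volume 2'}]
```

The sketch uses get_text() rather than .string because .string returns None when a tag has more than one child, for example when the heading contains a nested tag.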

Extracting the chapters

chapters = []
for chapter in soup.find_all(class_='chapter'):
    # Read the chapter name from the title attribute of each a tag
    a = chapter.find('a')
    if a is not None:
        chapter_title = a.get('title')
        chapters.append({'chapter_title': chapter_title})
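
The comments in this post mention chapter URLs, but the loop stores only the title attribute. One possible way to keep the link as well is sketched below. The sample markup and its href values are invented for illustration, and extract_chapters is a hypothetical name; urljoin from the standard library resolves relative links against the page address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://quanxiaoshuo.com/177913/'

# Hypothetical markup; the real page's href values may look different.
SAMPLE_HTML = """
<div class="chapter"><a href="/177913/1.html" title="Chapter 1">Chapter 1</a></div>
<div class="chapter"><a href="2.html" title="Chapter 2">Chapter 2</a></div>
<div class="chapter"></div>
"""


def extract_chapters(html, base_url=BASE_URL):
    """Return a title and an absolute URL for every chapter link found."""
    soup = BeautifulSoup(html, 'html.parser')
    chapters = []
    for chapter in soup.find_all(class_='chapter'):
        a = chapter.find('a')
        if a is None:  # skip chapter blocks without a link
            continue
        chapters.append({
            'chapter_title': a.get('title'),
            # Resolve site-relative and page-relative links alike.
            'url': urljoin(base_url, a.get('href', '')),
        })
    return chapters


for item in extract_chapters(SAMPLE_HTML):
    print(item['chapter_title'], item['url'])
# Chapter 1 https://quanxiaoshuo.com/177913/1.html
# Chapter 2 https://quanxiaoshuo.com/177913/2.html
```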

Saving the data as JSON

import json
with open('xylz_title.json', 'w') as fp:
    json.dump(title, fp=fp, indent=4)
with open('xylz_chapters.json', 'w') as fp:
    json.dump(chapters, fp=fp, indent=4)
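
With its default settings, json.dump writes Chinese characters as \uXXXX escapes. The data is intact, but the file is hard to read by eye. A minimal sketch of a human-readable alternative follows: it passes ensure_ascii=False, opens the file as UTF-8, and reads the file back to confirm the round trip. The helper names and the sample record are assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample record standing in for the scraped data.
chapters = [{'chapter_title': '第1章 雪鹰领'}]


def save_json(data, path):
    """Write data as readable UTF-8 JSON instead of ASCII escapes."""
    with open(path, 'w', encoding='utf-8') as fp:
        json.dump(data, fp, ensure_ascii=False, indent=4)


def load_json(path):
    """Read the JSON file back into Python objects."""
    with open(path, encoding='utf-8') as fp:
        return json.load(fp)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'xylz_chapters.json')
    save_json(chapters, path)
    restored = load_json(path)
    with open(path, encoding='utf-8') as fp:
        raw_text = fp.read()

print(restored == chapters)   # True: the data survives the round trip
print('雪鹰' in raw_text)      # True: the characters are stored unescaped
```

Setting the encoding explicitly matters once ensure_ascii=False is used, because the platform default encoding may not be able to represent Chinese text.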

The complete code is as follows:

import requests
from bs4 import BeautifulSoup
import json

# Fetch the HTML content
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'}
response = requests.get('https://quanxiaoshuo.com/177913/', headers=headers)

# Analyze the structure and pick out the tags to extract: volume titles and chapters.
# Pass the raw bytes so that from_encoding takes effect (it is ignored for str input).
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')  # 'html.parser' or 'lxml'
title = []
for volumn in soup.find_all(class_="volumn"):
    b = volumn.find('b')
    if b is not None:
        b_title = b.string  # volume title
        title.append({'volumn': b_title})

chapters = []
for chapter in soup.find_all(class_='chapter'):
    # Read the chapter name from the title attribute of each a tag
    a = chapter.find('a')
    if a is not None:
        chapter_title = a.get('title')
        chapters.append({'chapter_title': chapter_title})

# Store the volume titles and chapters as JSON
with open('xylz_title.json', 'w') as fp:
    json.dump(title, fp=fp, indent=4)
with open('xylz_chapters.json', 'w') as fp:
    json.dump(chapters, fp=fp, indent=4)
