Scraping Qidian novels with a Python crawler: a Python 3 crawler using requests

import os
import time
from urllib import parse

import requests
from lxml import etree

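def get_page_html(url):
    '''Send a GET request to url; return the response, or None on failure.'''
    try:
        response = session.get(url, headers=headers, timeout=timeout)
        if response.status_code == 200:
            return response
    except requests.RequestException:
        # Covers connection errors and timeouts.
        pass
    return None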

def get_next_url(response):
    '''Return the URL of the next chapter, or None if there is none.'''
    if response:
        try:
            selector = etree.HTML(response.text)
            # XPath attribute values are case-sensitive.
            url = selector.xpath("//a[@id='j_chapterNext']/@href")[0]
            next_url = parse.urljoin(response.url, url)
            return next_url
        except IndexError:
            return None
    return None
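`get_next_url` relies on `parse.urljoin` to turn whatever sits in the `href` attribute into an absolute URL. The following is a minimal sketch of how it resolves the common forms of href; the page URL and chapter paths are made-up placeholders, not real Qidian addresses, and `resolve` is a hypothetical helper.

```python
from urllib import parse

# Hypothetical page URL and href values, for illustration only.
page_url = "https://read.qidian.com/chapter/book-id/chapter-1"


def resolve(href):
    """Resolve an href found on page_url into an absolute URL."""
    return parse.urljoin(page_url, href)


# A protocol-relative href inherits the scheme of the page.
protocol_relative = resolve("//read.qidian.com/chapter/book-id/chapter-2")
# A root-relative href inherits the scheme and the host.
root_relative = resolve("/chapter/book-id/chapter-2")
# An absolute href is returned unchanged.
already_absolute = resolve("https://example.com/other")

print(protocol_relative)
print(root_relative)
print(already_absolute)
```

Because all three forms come out as absolute URLs, the crawler can pass the result straight to the next request without checking which form the page used.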

def xs_content(response):
    '''Return (chapter title, list of paragraph strings), or None.'''
    if not response:
        return None
    selector = etree.HTML(response.text)
    titles = selector.xpath("//h3[@class='j_chapterName']/text()")
    if not titles:
        return None
    content_xpath = selector.xpath(
        "//div[contains(@class,'read-content') and contains(@class,'j_readContent')]//p/text()")
    return titles[0], content_xpath
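The `//p/text()` query returns raw text nodes, which may carry leading indentation or consist of whitespace only. Below is a small sketch of a cleanup step that could run before the text is written to disk. `clean_paragraphs` is a hypothetical helper that is not part of the original script, and the sample strings are invented.

```python
def clean_paragraphs(paragraphs):
    """Strip surrounding whitespace from each text node and drop empty ones."""
    cleaned = []
    for text in paragraphs:
        # str.strip() also removes the full-width space U+3000,
        # which is commonly used to indent Chinese paragraphs.
        stripped = text.strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


# Invented sample input: two indented paragraphs and one whitespace-only node.
sample = ["\u3000\u3000First paragraph. ", "\n", "\u3000\u3000Second paragraph."]
print(clean_paragraphs(sample))
```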

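def write_to_txt(info_tuple: tuple):
    '''Write one chapter to <base_path>/<title>.txt, skipping files that already exist.'''
    if not info_tuple:
        return
    os.makedirs(base_path, exist_ok=True)
    # Check the same path that is written, including the ".txt" suffix.
    path = os.path.join(base_path, info_tuple[0] + ".txt")
    if not os.path.exists(path):
        with open(path, "wt", encoding="utf-8") as f:
            for line in info_tuple[1]:
                f.write(line + "\n")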
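`write_to_txt` uses the chapter title directly as the file name, so a title containing a character such as `?` or `:` would make `open` fail on Windows. One possible guard is sketched below; `safe_filename` is a hypothetical helper that is not part of the original script, and replacing with an underscore is only one of several reasonable choices.

```python
import re

# Characters reserved in Windows file names.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement="_"):
    """Return title with reserved file-name characters replaced."""
    cleaned = _ILLEGAL.sub(replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


print(safe_filename("Chapter 1: What?"))
```

With such a helper, the path could be built from `safe_filename(info_tuple[0])` instead of the raw title.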

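def run(url):
    '''Crawl chapter by chapter, starting from url.'''
    # A loop rather than recursion, so a long novel cannot hit the recursion limit.
    while url:
        response = get_page_html(url)
        info_tuple = xs_content(response)
        if not info_tuple:
            break
        print("Writing %s" % info_tuple[0])
        write_to_txt(info_tuple)
        url = get_next_url(response)
        if url:
            print("Next chapter: %s" % url)
            time.sleep(sleep_time)  # Delay between requests to reduce load on the server.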

if __name__ == '__main__':
    session = requests.Session()
    sleep_time = 5  # Seconds to wait between requests.
    timeout = 5  # Request timeout in seconds.
    base_path = r"d:\图片\lszj"  # Directory where the chapter files are stored.
    # URL of the first chapter of the novel to crawl; this one is chapter 1 of 斗破苍穹 (Doupo Cangqiong).
    url = "https://read.qidian.com/chapter/8iw8dkb_ztxrzk4x-cujuw2/fwjwroiobhn4p8iew--ppw2"
    headers = {
        "referer": "read.qidian.com",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
    }
    print('Starting the crawler')
    run(url)
