Python 3 Crawler in Practice: Downloading Novels from the Dingdian Site with requests + BeautifulSoup

Python 3 crawler in practice, part 1

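This script downloads a novel from the Dingdian novel site. It offers a single-threaded mode and a multithreaded mode; try both and compare the speed for yourself.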
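To get a feel for the difference before hitting the real site, here is a minimal sketch that compares a plain loop with a 4-thread pool. It assumes nothing about the site: `fake_fetch` is a hypothetical stand-in that only sleeps to simulate network latency, and the chapter count and delay are made-up numbers.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool


def fake_fetch(chapter_no):
    """Hypothetical stand-in for one chapter download: just sleeps."""
    time.sleep(0.02)  # simulated network latency (assumed value)
    return chapter_no


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_pool(items, workers=4):
    """Map fake_fetch over items with a pool of worker threads."""
    pool = ThreadPool(workers)
    try:
        return pool.map(fake_fetch, items)  # map keeps the input order
    finally:
        pool.close()
        pool.join()


chapters = list(range(1, 21))
sequential, t_seq = timed(lambda: [fake_fetch(c) for c in chapters])
pooled, t_pool = timed(lambda: run_pool(chapters))
print(f"sequential: {t_seq:.2f}s, 4 threads: {t_pool:.2f}s")
```

Because the work is I/O-bound (waiting, not computing), the threads overlap their waiting time, which is why a thread pool helps here even with Python's GIL.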

Set up the environment first by installing the requests and BeautifulSoup libraries (the script also uses the lxml parser).
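A quick way to check whether those packages are already installed is sketched below. The `REQUIRED` mapping and the `missing_packages` helper are names made up for this illustration; the module names (`requests`, `bs4`, `lxml`) are the ones the script imports or relies on.

```python
import importlib.util

# import name -> name to pass to pip (mapping assumed for this sketch)
REQUIRED = {"requests": "requests", "bs4": "BeautifulSoup4", "lxml": "lxml"}


def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing, install with: pip3 install " + " ".join(missing))
else:
    print("All dependencies are available.")
```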

Depending on my mood, I may add a detailed step-by-step walkthrough at some point, if I can find the motivation = =

dingdian_novel_download.py

# coding: utf-8
## Prerequisites: Python 3 and pip
##   pip3 install requests
##   pip3 install BeautifulSoup4
##   sudo apt-get install python3-lxml   (or: pip3 install lxml)
## Dingdian novel crawler
## Enter the URL of a novel to download it
## example: https://www.booktxt.net/8_8937/
## Run directly: python3 dingdian_novel_download.py
## Tested on win10 with python3
## author: lxfhahaha
## date: 2018-05-27 13:59:21

import io
import sys
import threading
from multiprocessing.dummy import Pool as threadpool

import requests
from bs4 import BeautifulSoup


class Novel(object):
    """A novel: title, author and chapters."""

    def __init__(self):
        # Instance attributes, so that two Novel objects never share one list
        self.title = ''       # novel title
        self.author = ''      # novel author
        self.content = []     # chapters: [number, chapter title, text or url]
        self.url = ''         # novel url
        self.allLength = 0    # chapters still to fetch

    def set_url(self, headers):
        """Ask for the url until it is on booktxt.net and answers with 200."""
        url = ''
        while True:
            url = input('Your url:')
            if url.startswith('https://www.booktxt.net/'):
                r1 = requests.get(url, headers=headers)
                if r1.status_code == 200:
                    break
            print('Error!Input again~')
            sys.stdout.flush()
        self.url = url

    def add_section(self, num, titleThis, content):
        """Add one chapter."""
        self.content.append([num, titleThis, content])

    def get_details(self, headers):
        """Fetch the novel title, the author and the url of every chapter."""
        req = requests.get(self.url, headers=headers)
        so1 = BeautifulSoup(req.content.decode('gbk'), 'lxml')
        self.title = so1.select('#maininfo #info h1')[0].get_text()   # novel title
        self.author = so1.select('#maininfo #info p')[0].get_text()   # author name
        # Chapters are the <dd> items after the second <dt> of the list
        startTag = so1.select('#list dl dt')[1]
        for index, one in enumerate(startTag.find_all_next("dd")):
            self.add_section(num=index + 1,
                             titleThis=one.a.get_text(),
                             content='https://www.booktxt.net' + one.find('a').get('href'))

    def get_content_all(self, headers):
        """Crawl the text of every chapter, single-threaded."""
        def make_great(one):
            # Send the same headers as the other requests
            r2 = requests.get(one[2], headers=headers)
            if r2.status_code == 200:
                so2 = BeautifulSoup(r2.content.decode('gbk'), 'lxml')
                one[2] = so2.select('#content')[0].get_text()
            else:
                one[2] = 'Content error!'
            self.allLength = self.allLength - 1
            print(str(one[0]) + ' is ok! ' + str(self.allLength) + ' left!')
            sys.stdout.flush()
            return one

        self.allLength = len(self.content)
        print('all is ' + str(self.allLength))
        sys.stdout.flush()
        self.content = [make_great(one) for one in self.content]

    def get_content_all_pool(self, headers):
        """Crawl the text of every chapter with a pool of 4 threads."""
        self.allLength = len(self.content)
        print('all is ' + str(self.allLength))
        sys.stdout.flush()
        lock = threading.Lock()

        def getAll(one):
            r2 = requests.get(one[2], headers=headers)
            if r2.status_code == 200:
                so2 = BeautifulSoup(r2.content.decode('gbk'), 'lxml')
                one[2] = so2.select('#content')[0].get_text()
            else:
                one[2] = 'Content error!'
            with lock:   # the counter is shared by all worker threads
                self.allLength = self.allLength - 1
                print(str(one[0]) + ' is ok! ' + str(self.allLength) + ' left!')
                sys.stdout.flush()
            return one

        pool = threadpool(4)
        result = pool.map(getAll, self.content)
        pool.close()
        pool.join()
        self.content = result

    def downloadToTxt(self):
        """Write all chapters to '<title>.txt'."""
        with open(self.title + '.txt', 'w+', encoding='utf-8') as file:
            file.write('*****' + self.title + '-' + self.author + '*****\n\n\n')
            self.content.sort(key=lambda x: x[0])   # sort by chapter number to keep the text in order
            for one in self.content:
                file.write('### ' + one[1] + '\n\n')
                file.write(one[2] + '\n\n')
        print('End Downloads and start enjoying!!')
        sys.stdout.flush()


if __name__ == '__main__':   # main entry point
    # Change the default encoding of standard output (tested on the win10 console)
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gbk')
    # Request headers, which help keep the script stable
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'Referer': 'https://www.booktxt.net/',
    }
    # Create the novel object
    novel = Novel()
    # Confirm the novel url
    novel.set_url(headers=headers)
    # Fetch the details (title, author, url of every chapter)
    novel.get_details(headers=headers)
    # Fetch the content of every chapter, single-threaded or multithreaded
    danDuo = input('Y/n to choose whether use multithreading:')
    if danDuo == 'Y' or danDuo == 'y':
        novel.get_content_all_pool(headers=headers)   # multithreaded
    else:
        novel.get_content_all(headers=headers)        # single-threaded
    # Save the novel
    novel.downloadToTxt()
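One fragile spot in the script is that get_details builds each chapter url by gluing 'https://www.booktxt.net' onto the href, which only works while the site emits hrefs starting with '/'. A possible alternative, sketched here with the standard library's urljoin, resolves the href against the novel's own url and so copes with both absolute and relative hrefs. The `chapter_url` helper and the `123.html` chapter file name are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin


def chapter_url(index_url, href):
    """Resolve a chapter href against the novel's index page url."""
    return urljoin(index_url, href)


index_url = 'https://www.booktxt.net/8_8937/'

# href that starts with '/', the form the script's string concatenation expects
print(chapter_url(index_url, '/8_8937/123.html'))
# href relative to the index page, which plain concatenation would get wrong
print(chapter_url(index_url, '123.html'))
```

Both calls resolve to the same chapter address, so the rest of the crawler would not need to know which form the page used.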
