Python Crawler Example: Scraping a Novel with Multiple Threads

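I previously wrote a post about scraping a novel, but the single-threaded version was too slow. One novel took more than 700 seconds, about two chapters per second, which is hard to live with.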

So I built a multithreaded crawler.

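The approach is different this time. Before, I scraped one chapter at a time and wrote each chapter to the file as soon as it was fetched. This time I added a dict that holds the content of every scraped chapter, and only after all threads have finished do I write everything to the file.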

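I used a dict because the chapters have to be put back in order once scraping is done, and a dict keyed by chapter index is easy to sort.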
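The collect-then-sort idea can be sketched in isolation. This is a minimal sketch, not the crawler itself: `fake_fetch` is a hypothetical stand-in for downloading one chapter, and the names `results` and `worker` are assumptions introduced only for illustration. Each thread stores its result under the chapter index, and sorting the keys afterwards restores reading order no matter which thread finished first.

```python
import threading

results = {}  # chapter index -> chapter text, filled by the worker threads


def fake_fetch(index):
    # Hypothetical stand-in for downloading and parsing one chapter.
    return f"text of chapter {index}"


def worker(index):
    # Each thread writes to its own key, so no two threads touch the same entry.
    results[index] = fake_fetch(index)


# Start the threads in reverse order to show that start order does not matter.
threads = [threading.Thread(target=worker, args=(i,)) for i in reversed(range(5))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# sorted() on a dict returns its keys in ascending order.
ordered = [results[i] for i in sorted(results)]
print(ordered[0])
```

Keying by index rather than by chapter title keeps the sort numeric, which avoids surprises when titles do not sort in reading order.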

To make the comparison fair, I chose the same novel as in the earlier post. If you have not read it, take a look:
Python Crawler Example: A Novel Scraper

Here is the freshly written code:

import threading
import time
from bs4 import BeautifulSoup
import codecs
import requests

# time.clock() was removed in Python 3.8; perf_counter() is the replacement
begin = time.perf_counter()


# Worker thread class
class myTread(threading.Thread):
    def __init__(self, threadID, name, st):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.st = st

    def run(self):
        print('start ', str(self.name))
        threadget(self.st)
        print('end ', str(self.name))


txtcontent = {}  # text of every chapter, keyed by chapter index
novellist = {}   # novel title -> URL


def getnovels(html):
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find('div', id='main').find_all('a')
    baseurl = 'http://www.paoshu8.com'
    for l in links:
        novellist[l.string] = baseurl + str(l['href']).replace('http:', '')


# Fetch the HTML source of a page
def getpage(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    # Pass the headers along; the original defined them but never used them
    page = requests.get(url, headers=headers).content.decode('utf-8')
    return page


chaptername = []     # chapter titles
chapteraddress = []  # chapter URLs


# Collect every chapter title and URL of the novel
def getchapter(html):
    soup = BeautifulSoup(html, 'lxml')
    try:
        alist = soup.find('div', id='list').find_all('a')
        for a in alist:
            chaptername.append(a.string)
            href = 'http://www.paoshu8.com' + a['href']
            chapteraddress.append(href)
        return True
    except Exception:
        print('No chapters found')
        return False


# Extract the text of one chapter
def getdetail(html):
    soup = BeautifulSoup(html, 'lxml')
    try:
        content = '     '
        pstring = soup.find('div', id='content').find_all('p')
        for p in pstring:
            # get_text() also works when a <p> has nested tags, where p.string is None
            content += p.get_text()
            content += '\n      '
        return content
    except Exception:
        print('Error')
        return 'Error'


def threadget(st):
    total = len(chaptername)
    # The thread that starts at st handles chapters st, st + thread_count, ...
    while st < total:
        url = str(chapteraddress[st])
        html = getpage(url)
        content = getdetail(html)
        txtcontent[st] = content
        print('Downloaded ' + str(chaptername[st]))
        st += thread_count


url = 'http://www.paoshu8.com/xiaoshuodaquan/'  # index page listing all novels
html = getpage(url)
getnovels(html)  # build the novel list

name = input('Enter the title of the novel to download:\n')
if name in novellist:
    print('Starting download')
    url = str(novellist[name])
    html = getpage(url)
    getchapter(html)
    thread_list = []
    thread_count = int(input('Enter the number of threads to start: '))
    for i in range(thread_count):
        thread1 = myTread(i, str(i), i)
        thread_list.append(thread1)
    for t in thread_list:
        t.daemon = False  # replaces the deprecated setDaemon(False)
        t.start()
    for t in thread_list:
        t.join()
    print('\nAll worker threads finished')
    txtcontent1 = sorted(txtcontent)  # chapter indices in ascending order
    file = codecs.open('C:/Users/Lenovo/Desktop/novellist/' + name + '.txt', 'w', 'utf-8')  # local output path
    chaptercount = len(chaptername)
    # Write everything to the file
    for ch in range(chaptercount):
        # Heading in the form "第N章  <chapter title>"
        title = '\n           第' + str(ch + 1) + '章  ' + str(chaptername[ch]) + '         \n\n'
        content = str(txtcontent[txtcontent1[ch]])
        file.write(title + content)
    file.close()
    end = time.perf_counter()
    print('Download finished, total time', end - begin, 'seconds')
else:
    print('Novel not found')
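The way `threadget` divides the chapters is worth seeing on its own: the thread that starts at index `st` steps forward by `thread_count`, so together the threads cover every chapter exactly once without any coordination. A small sketch, where `indices_for` is a hypothetical helper that is not part of the script above:

```python
def indices_for(st, thread_count, total):
    # Chapters handled by the thread that starts at index st:
    # st, st + thread_count, st + 2 * thread_count, ...
    return list(range(st, total, thread_count))


total = 10
thread_count = 3
assignment = {st: indices_for(st, thread_count, total) for st in range(thread_count)}
for st, chapters in assignment.items():
    print(st, chapters)
# 0 [0, 3, 6, 9]
# 1 [1, 4, 7]
# 2 [2, 5, 8]
```

Interleaving the indices this way means each thread gets a similar number of chapters, although a thread that happens to draw slow pages can still finish later than the others.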

I ran a test with 100 threads.

It was much faster than the single-threaded version.

A single-threaded run during the same time period took more than 1,200 seconds, so the 100-thread run was more than 20 times as fast.
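The gain comes from the work being I/O-bound: while one thread waits for a response, the others can send their own requests. The sketch below illustrates this with simulated latency rather than real requests. `fake_download`, the 0.05-second delay, and the use of `concurrent.futures.ThreadPoolExecutor` (instead of the `threading.Thread` subclass used in this post) are all assumptions chosen for illustration, so the ratio it prints says nothing about the real site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(index):
    # Simulate network latency; a real crawler would call requests.get here.
    time.sleep(0.05)
    return index


def timed(func):
    # Return (result, elapsed seconds) for a zero-argument callable.
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


chapters = range(20)

# One download after another
sequential, t_seq = timed(lambda: [fake_download(i) for i in chapters])

with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map yields results in input order, even if tasks finish out of order.
    threaded, t_thr = timed(lambda: list(pool.map(fake_download, chapters)))

print(f"sequential: {t_seq:.2f}s, threaded: {t_thr:.2f}s")
```

Because `pool.map` preserves input order, this variant would not need the sort step, which is one reason to consider it over hand-written thread classes.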
