Crawling a whole novel in Python with a single thread, multiple threads, and async coroutines: as fast as 5 s

Contents

Single-threaded crawl
Multithreaded crawl
Async coroutine crawl

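This article crawls an entire novel in three ways: with a single thread, with a thread pool, and with asynchronous coroutines.
Novel URL: `http://www.doupo321.com/doupocangqiong/`

The page is simple and needs little analysis. All of the content is in the page source, so this is just a multi-level link crawler: first fetch the chapter links from the index page, then follow each link to fetch that chapter's text.
Because the page source is very regular, we match elements with XPath; if you are more comfortable with regular expressions or bs4, those work just as well. Now let's write the code.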

Single-threaded crawl

# @Time: 2022/1/13 12:04
# @Author: 中意灬
# @File: 斗破2.py
# @ps: tutu qqnum:2117472285
import os
import time
import requests
from lxml import etree


def download(url, title):
    # Download one chapter and save it as a text file
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    body = tree.xpath("/html/body/div/div/div[4]/p/text()")
    body = '\n'.join(body)
    with open(f'斗破2/{title}.txt', mode='w', encoding='utf-8') as f:
        f.write(body)


def geturl(url):
    # Fetch the index page, then download every chapter it links to
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    for li in lis:
        # Strip the slashes around the href, then prepend the scheme
        href = li.xpath("./a/@href")[0].strip('/')
        href = "http://" + href
        title = li.xpath("./a/text()")[0]
        download(href, title)


if __name__ == '__main__':
    url = "http://www.doupo321.com/doupocangqiong/"
    os.makedirs('斗破2', exist_ok=True)  # open() fails if the output directory is missing
    t1 = time.time()
    geturl(url)
    t2 = time.time()
    print("Elapsed (s):", t2 - t1)

Result: the full crawl took a little over 9 minutes.
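The listings in this article build each chapter URL by stripping the slashes from the `href` and prepending `http://`, which only works when the index page uses protocol-relative links. As a hedged alternative, the sketch below resolves the `href` against the index URL with the standard library's `urljoin`. The helper name `absolute_url` and the sample `href` values are made up for illustration; they are not taken from the site.

```python
from urllib.parse import urljoin

# Index page used in the article.
INDEX_URL = "http://www.doupo321.com/doupocangqiong/"


def absolute_url(href: str, base: str = INDEX_URL) -> str:
    """Resolve an href found on the index page against the page's own URL."""
    return urljoin(base, href.strip())


# Hypothetical href values, for illustration only.
examples = [
    "//www.doupo321.com/doupocangqiong/1.html",  # protocol-relative
    "/doupocangqiong/2.html",                     # root-relative
    "3.html",                                     # relative to the index page
]
for href in examples:
    print(absolute_url(href))
```

All three forms resolve to a full `http://www.doupo321.com/doupocangqiong/N.html` URL, so the crawler would keep working if the site changed how it writes its links.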

Multithreaded crawl

# @Time: 2022/1/13 11:42
# @Author: 中意灬
# @File: 斗破1.py
# @ps: tutu qqnum:2117472285
import os
import time
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor


def download(url, title):
    # Download one chapter and save it as a text file
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    body = tree.xpath("/html/body/div/div/div[4]/p/text()")
    body = '\n'.join(body)
    with open(f'斗破1/{title}.txt', mode='w', encoding='utf-8') as f:
        f.write(body)


def geturl(url):
    # Fetch the index page and return the <li> elements that hold the chapter links
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    return lis


if __name__ == '__main__':
    url = "http://www.doupo321.com/doupocangqiong/"
    os.makedirs('斗破1', exist_ok=True)  # open() fails if the output directory is missing
    t1 = time.time()
    lis = geturl(url)
    with ThreadPoolExecutor(1000) as t:  # thread pool with up to 1000 threads
        for li in lis:
            href = li.xpath("./a/@href")[0].strip('/')
            href = "http://" + href
            title = li.xpath("./a/text()")[0]
            t.submit(download, url=href, title=title)
    # Leaving the with block waits for every submitted task to finish
    t2 = time.time()
    print("Elapsed (s):", t2 - t1)

Result: the full crawl took about 5 seconds.
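One thing to watch in the thread-pool version: `t.submit(...)` returns a `Future`, and the listing discards it, so a chapter that fails to download raises its exception inside the worker and nothing is reported. The sketch below shows one way to keep the futures and collect failures with `as_completed`. It is an illustration only: `fake_download` is a hypothetical stand-in that sleeps instead of making a request, the `example.invalid` URLs are placeholders, and the pool size of 32 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url: str, title: str) -> str:
    """Hypothetical stand-in for download(); sleeps instead of doing network I/O."""
    if not title:
        raise ValueError(f"empty title for {url}")
    time.sleep(0.01)
    return title


def crawl(chapters, max_workers=32):
    """Run fake_download over (url, title) pairs; return (done, failed)."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported.
        futures = {pool.submit(fake_download, url, title): url
                   for url, title in chapters}
        for future in as_completed(futures):
            try:
                done.append(future.result())  # re-raises the worker's exception
            except Exception as exc:
                failed.append((futures[future], exc))
    return done, failed


# Placeholder chapter list: 50 valid entries plus one that will fail.
chapters = [(f"http://example.invalid/{i}.html", f"Chapter {i}") for i in range(1, 51)]
chapters.append(("http://example.invalid/bad.html", ""))
done, failed = crawl(chapters)
print(len(done), "ok,", len(failed), "failed")
```

The failed list can then be retried or logged, which matters for a crawl of this size where a handful of requests are likely to time out.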

Async coroutine crawl

# @Time: 2022/1/13 10:30
# @Author: 中意灬
# @File: 斗破.py
# @ps: tutu qqnum:2117472285
import os
import requests
import aiohttp
import asyncio
import aiofiles
from lxml import etree
import time


async def download(url, title, session):
    async with session.get(url) as resp:  # async counterpart of resp = requests.get()
        html = await resp.text()
        tree = etree.HTML(html)
        body = tree.xpath("/html/body/div/div/div[4]/p/text()")
        body = '\n'.join(body)
    # Save the downloaded chapter
    async with aiofiles.open(f'斗破/{title}.txt', mode='w', encoding='utf-8') as f:
        await f.write(body)


async def geturl(url):
    # The index page is fetched once with blocking requests before any task starts
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    tasks = []
    async with aiohttp.ClientSession() as session:  # async counterpart of requests
        for li in lis:
            href = li.xpath("./a/@href")[0].strip('/')
            href = "http://" + href
            title = li.xpath("./a/text()")[0]
            # Schedule one download task per chapter
            tasks.append(asyncio.create_task(download(href, title, session)))
        # Wait inside the session block so the session stays open for every task
        await asyncio.wait(tasks)


if __name__ == '__main__':
    url = "http://www.doupo321.com/doupocangqiong/"
    os.makedirs('斗破', exist_ok=True)  # the output directory must exist
    t1 = time.time()
    asyncio.run(geturl(url))  # replaces the deprecated get_event_loop() pattern
    t2 = time.time()
    print("Elapsed (s):", t2 - t1)

Result: the full crawl took a little over 20 seconds.
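The coroutine version starts one task per chapter, so every request is in flight at once. If the site starts refusing connections, a common remedy is to cap concurrency with `asyncio.Semaphore`. The sketch below shows the pattern only: `fake_download` is a hypothetical stand-in that uses `asyncio.sleep` in place of the `aiohttp` request, and the limit of 20 is an arbitrary assumption rather than a value measured against this site.

```python
import asyncio


async def fake_download(title: str, limit: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for download(); asyncio.sleep replaces the HTTP request."""
    async with limit:  # at most max_concurrency coroutines run this block at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
    return title


async def crawl(titles, max_concurrency=20):
    """Download all titles with bounded concurrency; return (results, peak)."""
    limit = asyncio.Semaphore(max_concurrency)
    stats = {"active": 0, "peak": 0}
    # gather returns results in the same order as the input titles.
    results = await asyncio.gather(
        *(fake_download(t, limit, stats) for t in titles)
    )
    return results, stats["peak"]


titles = [f"Chapter {i}" for i in range(1, 101)]
results, peak = asyncio.run(crawl(titles))
print(len(results), "chapters, peak concurrency", peak)
```

In the real crawler the `async with limit:` block would wrap the `session.get(url)` call, leaving the rest of `download` unchanged.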

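Because nothing sorts the output, the chapter files come out in no particular order. When you write your own crawler you can build the file name yourself, for example by prefixing the chapter's position in the index, so that the files sort in reading order.
As the results show, the multithreaded version finished a novel of more than 1,600 chapters in only about 5 seconds, but at the cost of higher system overhead. The coroutine version was somewhat slower, at a little over 20 seconds, with lower overhead, so it is the approach I recommend. The single-threaded version is far slower, taking a little over 9 minutes for the whole novel, and is not really recommended.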
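As a rough illustration of the ordering fix, the sketch below prefixes each title with a zero-padded index so that an alphabetical file listing matches the order on the index page. The helper `chapter_filename`, the padding width of 4, and the sample titles are all assumptions made for this example; the character replacement is an extra precaution for titles containing characters that file systems reject.

```python
import re


def chapter_filename(index: int, title: str, width: int = 4) -> str:
    """Build a file name that sorts in table-of-contents order."""
    # Replace characters that are not allowed in file names on common systems.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{index:0{width}d}_{safe_title}.txt"


# Hypothetical titles, in the order they would appear on the index page.
titles = ["Chapter 1", "Chapter 2", "Chapter 10: a/b"]
names = [chapter_filename(i, t) for i, t in enumerate(titles, start=1)]
print(names)
```

In the crawlers above, this would mean iterating with `enumerate(lis, start=1)` and passing the index along to `download` together with the title.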
