【Python核心】揭秘Python协程

首先要明白什么是协程？

协程是实现并发编程的一种方式。一说并发肯定想到了多线程/多进程模型，多线程/多进程正是解决并发问题的经典模型之一

先从一个爬虫实例出发，用清晰的思路并且结合实战来搞懂这个不算特别容易理解的概念。之后，再由浅入深直击协程的核心

一、从一个爬虫说起

爬虫，就是互联网的蜘蛛，在搜索引擎诞生时与其一同来到世上
爬虫每秒钟都会爬取大量的网页，提取关键信息后存储在数据库中以便日后分析。爬虫有非常简单的Python十行代码实现，也有Google那样的全球分布式爬虫的上百万行代码，分布在内部上万台服务器上对全世界的信息进行嗅探

先看一个简单的爬虫例子：

import timedef crawl_page(url):print('crawling {}'.format(url))sleep_time = int(url.split('_')[-1])time.sleep(sleep_time)print('OK {}'.format(url))def main(urls):for url in urls:crawl_page(url)%time main(['url_1', 'url_2', 'url_3', 'url_4'])########## 输出 ##########crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
Wall time: 10 s

注意：
本节的主要目的是协程的基础概念，因此简化爬虫的scrawl_page函数为休眠数秒，休眠时间取决于url最后的那个数字

这是一个很简单的爬虫，main()函数执行时调取crawl_page()函数进行网络通信，经过若干秒等待后收到结果，然后执行下一个

看起来很简单，但仔细一算它也占用了不少时间，五个页面分别用了1秒到4秒的时间，加起来一共用了10秒。这显然效率低下，该怎么优化呢？

二、简单的协程示例

于是，一个很简单的思路出现了——这种爬取操作完全可以并发化。接下来看看使用协程怎么写

import asyncioasync def crawl_page(url):print('crawling {}'.format(url))sleep_time = int(url.split('_')[-1])await asyncio.sleep(sleep_time)print('OK {}'.format(url))async def main(urls):for url in urls:await crawl_page(url)%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))########## 输出 ##########crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
Wall time: 10 s

看到这段代码，可以发现在Python 3.7以上版本中，使用协程写异步程序非常简单

首先来看import asyncio，这个库包含了大部分我们实现协程所需的魔法工具
sync修饰词声明异步函数，于是，这里的crawl_page和main都变成了异步函数。而调用异步函数，便可得到一个协程对象(coroutine object)

举个例子，如果print(crawl_page(''))便会输出提示这是一个 Python的协程对象，而并不会真正执行这个函数

三、协程的执行

再来说说协程的执行
执行协程有多种方法，这里介绍一下常用的三种

首先，可以通过await来调用
await执行的效果和Python正常执行是一样的，也就是说程序会阻塞在这里进入被调用的协程函数，执行完毕返回后再继续，而这也是await的字面意思。代码中await asyncio.sleep(sleep_time)会在这里休息若干秒，await crawl_page(url)则会执行crawl_page()函数

其次，可以通过asyncio.create_task()来创建任务

最后，需要asyncio.run来触发运行
asyncio.run这个函数是Python 3.7之后才有的特性，可以让Python的协程接口变得非常简单，不用去理会事件循环怎么定义和怎么使用的问题
一个非常好的编程规范是，asyncio.run(main())作为主程序的入口函数，在程序运行周期内只调用一次asyncio.run

这样，大概看懂了协程是怎么用的吧。不妨试着跑一下代码，怎么还是10秒？

10 秒就对了，如上面所说await是同步调用，因此，crawl_page(url)在当前的调用结束之前是不会触发下一次调用的。于是，这个代码效果就和上面完全一样，相当于用异步接口写了个同步代码

现在又该怎么办呢？

其实很简单，也正是接下来要讲的协程中的一个重要概念，任务(Task)。继续看下面的代码

import asyncioasync def crawl_page(url):print('crawling {}'.format(url))sleep_time = int(url.split('_')[-1])await asyncio.sleep(sleep_time)print('OK {}'.format(url))async def main(urls):tasks = [asyncio.create_task(crawl_page(url)) for url in urls]for task in tasks:await task%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))########## 输出 ##########crawling url_1
crawling url_2
crawling url_3
crawling url_4
OK url_1
OK url_2
OK url_3
OK url_4
Wall time: 3.99 s

可以看到，有了协程对象后便可以通过asyncio.create_task来创建任务，任务创建后很快就会被调度执行，这样，代码也不会阻塞在任务这里。所以，要等所有任务都结束才行，用for task in tasks: await task即可

可以看到执行的效果，结果显示运行总时长等于运行时间最长的爬虫

当然，也可以想一想这里用多线程应该怎么写？而如果需要爬取的页面有上万个又该怎么办呢？再对比下协程的写法，谁更清晰自是一目了然

其实，对于执行tasks还有另一种做法：

import asyncioasync def crawl_page(url):print('crawling {}'.format(url))sleep_time = int(url.split('_')[-1])await asyncio.sleep(sleep_time)print('OK {}'.format(url))async def main(urls):tasks = [asyncio.create_task(crawl_page(url)) for url in urls]await asyncio.gather(*tasks)%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))########## 输出 ##########crawling url_1
crawling url_2
crawling url_3
crawling url_4
OK url_1
OK url_2
OK url_3
OK url_4
Wall time: 4.01 s

这里的代码也很好理解，唯一要注意的是*tasks解包列表，将列表变成了函数的参数，与之对应的是**dict将字典变成了函数的参数

注意：
asyncio.create_task，asyncio.run这些函数都是Python 3.7以上的版本才提供的

四、解密协程运行时

不妨来深入代码底层看看。有了前面的知识做基础应该很容易理解这两段代码

import asyncioasync def worker_1():print('worker_1 start')await asyncio.sleep(1)print('worker_1 done')async def worker_2():print('worker_2 start')await asyncio.sleep(2)print('worker_2 done')async def main():print('before await')await worker_1()print('awaited worker_1')await worker_2()print('awaited worker_2')%time asyncio.run(main())########## 输出 ##########before await
worker_1 start
worker_1 done
awaited worker_1
worker_2 start
worker_2 done
awaited worker_2
Wall time: 3 s

import asyncioasync def worker_1():print('worker_1 start')await asyncio.sleep(1)print('worker_1 done')async def worker_2():print('worker_2 start')await asyncio.sleep(2)print('worker_2 done')async def main():task1 = asyncio.create_task(worker_1())task2 = asyncio.create_task(worker_2())print('before await')await task1print('awaited worker_1')await task2print('awaited worker_2')%time asyncio.run(main())########## 输出 ##########before await
worker_1 start
worker_2 start
worker_1 done
awaited worker_1
worker_2 done
awaited worker_2
Wall time: 2.01 s

不过，第二个代码到底发生了什么呢？为了更详细了解到协程和线程的具体区别，这里详细地分析了整个过程

asyncio.run(main())，程序进入main()函数，事件循环开启
task1和task2任务被创建，并进入事件循环等待运行，运行到print时输出before await
await task1执行，用户选择从当前的主任务中切出，事件调度器开始调度worker_1
worker_1开始运行，运行print输出worker_1 start，然后运行到await asyncio.sleep(1)，从当前任务切出，事件调度器开始调度worker_2
worker_2开始运行，运行print输出worker_2 start，然后运行到await asyncio.sleep(2)，从当前任务切出
以上所有事件的运行时间，都应该在1ms到10ms之间，甚至可能更短，事件调度器从这个时候开始暂停调度
一秒钟后，worker_1的sleep完成，事件调度器将控制权重新传给task_1，输出worker_1 done，task_1完成任务，从事件循环中退出
await task1完成，事件调度器将控制器传给主任务，输出awaited worker_1，·然后在await task2处继续等待
两秒钟后，worker_2的sleep完成，事件调度器将控制权重新传给task_2，输出worker_2 done，task_2完成任务，从事件循环中退出
主任务输出awaited worker_2，协程全任务结束，事件循环结束

接下来，进阶一下。如果想给某些协程任务限定运行时间，一旦超时就取消，又该怎么做呢？再进一步，如果某些协程运行时出现错误，又该怎么处理呢？同样的，来看代码


import asyncioasync def worker_1():await asyncio.sleep(1)return 1async def worker_2():await asyncio.sleep(2)return 2 / 0async def worker_3():await asyncio.sleep(3)return 3async def main():task_1 = asyncio.create_task(worker_1())task_2 = asyncio.create_task(worker_2())task_3 = asyncio.create_task(worker_3())await asyncio.sleep(2)task_3.cancel()res = await asyncio.gather(task_1, task_2, task_3, return_exceptions=True)print(res)%time asyncio.run(main())########## 输出 ##########[1, ZeroDivisionError('division by zero'), CancelledError()]
Wall time: 2 s

可以看到，worker_1正常运行，worker_2运行中出现错误，worker_3执行时间过长被cancel，这些信息全部体现在最终的返回结果res中

不过要注意return_exceptions=True这行代码。如果不设置这个参数错误就会完整地throw到执行层，从而需要try except来捕捉，这也就意味着其他还没被执行的任务会被全部取消掉。为了避免这个局面，将return_exceptions设置为True即可

到这里，可以发现线程能实现的，协程都能做到。接下来用协程来实现一个经典的生产者消费者模型

import asyncio
import randomasync def consumer(queue, id):while True:val = await queue.get()print('{} get a val: {}'.format(id, val))await asyncio.sleep(1)async def producer(queue, id):for i in range(5):val = random.randint(1, 10)await queue.put(val)print('{} put a val: {}'.format(id, val))await asyncio.sleep(1)async def main():queue = asyncio.Queue()consumer_1 = asyncio.create_task(consumer(queue, 'consumer_1'))consumer_2 = asyncio.create_task(consumer(queue, 'consumer_2'))producer_1 = asyncio.create_task(producer(queue, 'producer_1'))producer_2 = asyncio.create_task(producer(queue, 'producer_2'))await asyncio.sleep(10)consumer_1.cancel()consumer_2.cancel()await asyncio.gather(consumer_1, consumer_2, producer_1, producer_2, return_exceptions=True)%time asyncio.run(main())########## 输出 ##########producer_1 put a val: 5
producer_2 put a val: 3
consumer_1 get a val: 5
consumer_2 get a val: 3
producer_1 put a val: 1
producer_2 put a val: 3
consumer_2 get a val: 1
consumer_1 get a val: 3
producer_1 put a val: 6
producer_2 put a val: 10
consumer_1 get a val: 6
consumer_2 get a val: 10
producer_1 put a val: 4
producer_2 put a val: 5
consumer_2 get a val: 4
consumer_1 get a val: 5
producer_1 put a val: 2
producer_2 put a val: 8
consumer_1 get a val: 2
consumer_2 get a val: 8
Wall time: 10 s

五、实战豆瓣近日推荐电影爬虫

接下来进入实战环节，实现一个完整的协程爬虫

任务描述：https://movie.douban.com/cinema/later/beijing/这个页面描述了北京最近上映的电影，通过Python得到这些电影的名称、上映时间和海报？这个页面的海报是缩小版的，希望从具体的电影描述页面中抓取到海报

下面给出了同步版本的代码和协程版本的代码，通过运行时间和代码写法的对比，希望能对协程有更深的了解(注意：为了突出重点和简化代码，因此省略了异常处理)

import requests
from bs4 import BeautifulSoupdef main():url = "https://movie.douban.com/cinema/later/beijing/"init_page = requests.get(url).contentinit_soup = BeautifulSoup(init_page, 'lxml')all_movies = init_soup.find('div', id="showing-soon")for each_movie in all_movies.find_all('div', class_="item"):all_a_tag = each_movie.find_all('a')all_li_tag = each_movie.find_all('li')movie_name = all_a_tag[1].texturl_to_fetch = all_a_tag[1]['href']movie_date = all_li_tag[0].textresponse_item = requests.get(url_to_fetch).contentsoup_item = BeautifulSoup(response_item, 'lxml')img_tag = soup_item.find('img')print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))%time main()########## 输出 ##########阿拉丁 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2553992741.jpg
龙珠超：布罗利 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2557371503.jpg
五月天人生无限公司 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2554324453.jpg
... ...
直播攻略 06月04日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555957974.jpg
Wall time: 56.6 s

import asyncio
import aiohttpfrom bs4 import BeautifulSoupasync def fetch_content(url):async with aiohttp.ClientSession(headers=header, connector=aiohttp.TCPConnector(ssl=False)) as session:async with session.get(url) as response:return await response.text()async def main():url = "https://movie.douban.com/cinema/later/beijing/"init_page = await fetch_content(url)init_soup = BeautifulSoup(init_page, 'lxml')movie_names, urls_to_fetch, movie_dates = [], [], []all_movies = init_soup.find('div', id="showing-soon")for each_movie in all_movies.find_all('div', class_="item"):all_a_tag = each_movie.find_all('a')all_li_tag = each_movie.find_all('li')movie_names.append(all_a_tag[1].text)urls_to_fetch.append(all_a_tag[1]['href'])movie_dates.append(all_li_tag[0].text)tasks = [fetch_content(url) for url in urls_to_fetch]pages = await asyncio.gather(*tasks)for movie_name, movie_date, page in zip(movie_names, movie_dates, pages):soup_item = BeautifulSoup(page, 'lxml')img_tag = soup_item.find('img')print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))%time asyncio.run(main())########## 输出 ##########阿拉丁 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2553992741.jpg
龙珠超：布罗利 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2557371503.jpg
五月天人生无限公司 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2554324453.jpg
... ...
直播攻略 06月04日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555957974.jpg
Wall time: 4.98 s

六、总结

用了较长的篇幅，从一个简单的爬虫开始，到一个真正的爬虫结束，在中间穿插讲解了Python协程最新的基本概念和用法

协程和多线程的区别
主要在于两点，一是协程为单线程，二是协程由用户决定，在哪些地方交出控制权，切换到下一个任务
协程的写法更加简洁清晰
把async/await语法和create_task结合来用，对于中小级别的并发需求毫无压力
清晰的事件循环概念
写协程程序的时候，要有清晰的事件循环概念，知道程序在什么时候需要暂停、等待 I/O，什么时候需要一并执行到底

最后的最后，请一定不要轻易炫技
多线程模型也一定有其优点，一个真正牛逼的程序员应该懂得，在什么时候用什么模型能达到工程上的最优