Scraping Qiushibaike with a gevent coroutine pool


import gevent.monkey
gevent.monkey.patch_all()
from gevent.pool import Pool
import requests
from lxml import etree
from queue import Queue
from pprint import pprint
import time


class Qiubai:
    def __init__(self):
        self.temp_url = "https://www.qiushibaike.com/hot/page/{}"
        self.headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.68 Safari/537.36"}
        self.queue = Queue()
        self.pool = Pool()
        self.is_running = True
        self.total_request_num = 0
        self.total_response_num = 0

    def get_url_list(self):
        for i in range(1, 14):
            self.queue.put(self.temp_url.format(i))
            # print(self.temp_url.format(i))
            self.total_request_num += 1

    def parse_url(self, url):
        response = requests.get(url, headers=self.headers)
        # print(response)
        return response.content.decode()

    def get_content_list(self, html_str):
        html = etree.HTML(html_str)
        div_list = html.xpath("//div[@class='col1 old-style-col1']/div")
        content_list = []
        for div in div_list:
            item = {}
            item['user_name'] = div.xpath(".//h2/text()")[0].strip()
            item['content'] = [i.strip() for i in div.xpath(".//div[@class='content']/span/text()")]
            content_list.append(item)
        return content_list

    def save_content_list(self, content_list):
        for content in content_list:
            pprint(content)

    def _execute_request_content_save(self):  # perform one URL request, extract the content, and save it
        url = self.queue.get()
        html_str = self.parse_url(url)
        content_list = self.get_content_list(html_str)
        self.save_content_list(content_list)
        self.total_response_num += 1

    def _callback(self, temp):
        # Resubmit the task after each completion so the pool keeps draining the queue
        if self.is_running:
            self.pool.apply_async(self._execute_request_content_save, callback=self._callback)

    def run(self):
        # 1. Prepare the URL list
        self.get_url_list()
        # 2. Start three tasks; each one resubmits itself through the callback
        for i in range(3):
            self.pool.apply_async(self._execute_request_content_save, callback=self._callback)
        # 3. Poll until every queued URL has produced a response
        while True:
            time.sleep(0.0001)
            if self.total_response_num >= self.total_request_num:
                self.is_running = False
                break


if __name__ == "__main__":
    qiubai = Qiubai()
    qiubai.run()
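
The crawler above keeps the pool busy by resubmitting a task from its callback and polling two counters until they match. As a point of comparison, here is a minimal sketch of a bounded-pool variant that waits with `pool.join()` instead of polling. It is not the author's method: `crawl`, `fake_fetch`, and `pool_size` are hypothetical names introduced for illustration, and `fake_fetch` stands in for `requests.get()` so the example runs without network access. A real crawler would still need `gevent.monkey.patch_all()` at the top, as in the original, for `requests` to yield to other greenlets.

```python
import gevent
from gevent.pool import Pool


def crawl(urls, fetch, pool_size=3):
    """Run fetch(url) for every URL with at most pool_size greenlets at once."""
    pool = Pool(pool_size)
    results = {}

    def worker(url):
        # Store each result under its URL; completion order may vary.
        results[url] = fetch(url)

    for url in urls:
        # spawn() waits cooperatively while the pool is full.
        pool.spawn(worker, url)
    pool.join()  # block until every spawned greenlet has finished
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a network request; yields to other greenlets.
    gevent.sleep(0.01)
    return "<html>{}</html>".format(url)


if __name__ == "__main__":
    urls = ["https://www.qiushibaike.com/hot/page/{}".format(i) for i in range(1, 14)]
    pages = crawl(urls, fake_fetch)
    print(len(pages))
```

Because `pool.join()` returns once all greenlets have ended, this variant does not depend on a response counter reaching the request count, so a task that raises should not leave the main loop waiting.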
