requests 200, Scrapy timeout: how do you use Selenium together with Scrapy?

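Anti-scraping measures keep getting tougher~ Some sites run heartbeat functions, and cracking their JS is more trouble than just pulling the data with Selenium~

But if most of your other sites still go through Scrapy pipelines, going pure Selenium means cutting those pipelines up and pasting them into Selenium scripts...

Way too much hassle. I refuse!

Instead, a Scrapy downloader middleware can wrap the page source that Selenium fetched and hand it back to Scrapy for extraction~

In other words, Selenium is just a tool used inside the middleware~

Enough talk, here's the code~

First, write the script that opens a page in the browser~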

1. selenium_chrome.py, which drives the browser

#!/usr/bin/env python3
# coding=utf-8
import time

from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait  # mind the capitalization of WebDriverWait


class Chrome_selenium(object):
    '''Start a headless Chrome browser.'''

    def __init__(self):
        self.chrome_options = webdriver.ChromeOptions()
        # Show the window or run headless as needed. When debugging locally
        # you can comment out --headless and watch the browser~
        # On a Linux server without a display, headless is the only option.
        self.chrome_options.add_argument('--headless')
        self.chrome_options.add_argument('--no-sandbox')
        self.chrome_options.add_argument('--disable-gpu')
        # Ignore HTTPS certificate errors.
        self.chrome_options.add_argument('--ignore-certificate-errors')
        # self.chrome_options.debugger_address = "127.0.0.1:9222"
        self.chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
        # Set a random User-Agent (Chrome expects the --user-agent=<value> form).
        self.chrome_options.add_argument('--user-agent={}'.format(UserAgent().random))
        # If you don't need images or JS, uncomment the block below~
        # prefs = {
        #     'profile.default_content_setting_values': {
        #         'images': 2,         # block images
        #         'javascript': 2,     # block JS
        #         'notifications': 2,  # block push notifications
        #     }
        # }
        # self.chrome_options.add_experimental_option("prefs", prefs)
        self.client = webdriver.Chrome(options=self.chrome_options)
        # Tune both timeouts to how slowly the target pages load~
        # Implicit wait, in seconds. Selenium's docs advise against mixing
        # implicit and explicit waits, so consider lowering this if you rely
        # on the WebDriverWait calls below.
        self.client.implicitly_wait(120)
        # Page-load timeout, in seconds.
        self.client.set_page_load_timeout(160)
        # Maximize the window.
        self.client.maximize_window()

    def _wait_for_xpath(self, xpath, timeout=10):
        '''Wait until an element matching xpath is present; log on failure.'''
        try:
            WebDriverWait(self.client, timeout).until(
                EC.presence_of_element_located((By.XPATH, xpath)))
        except Exception as message:
            print('Element lookup failed: %s' % message)

    def get_one_page(self, one_html_url):
        '''Return the page source of one page.'''
        try:
            self.client.get(one_html_url)
            # If a site needs scrolling, clicking, dragging and so on,
            # branch on the URL pattern and do it here.
            if "www.zzzzz.com" in one_html_url:
                # Returns the current total height of the page.
                js = "return document.body.scrollHeight"
                # The scroll position starts at 0.
                height = 0
                new_height = self.client.execute_script(js)
                while height < new_height:
                    # Scroll toward the bottom in 100px steps.
                    for i in range(height, new_height, 100):
                        self.client.execute_script('window.scrollTo(0, {})'.format(i))
                        time.sleep(0.5)
                    height = new_height
                    time.sleep(2)
                    new_height = self.client.execute_script(js)
            elif 'www.xxxxxx.com' in one_html_url:
                # Wait for a specific element to load before reading the source.
                if '/cat' in one_html_url:
                    self._wait_for_xpath('//div[@data-pagination-start]')
                elif 'xxxxx.desktop.json' not in one_html_url:
                    # Alternative: //h2[contains(text(),"Details")]
                    self._wait_for_xpath('//div[contains(@class,"horizontalSlider")]')
        except Exception as e:
            # On a timeout (or any other error), return whatever has loaded so far.
            print('client error:{}'.format(e))
            # self.client.refresh()
            return self.client.page_source
        return self.client.page_source
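
The scroll-to-bottom loop in get_one_page can be pulled out and tried without a real browser. Below is a minimal sketch of that idea, not the author's code: `scroll_to_bottom` and `FakeDriver` are hypothetical names, and the fake driver only imitates a page that grows twice and then stops.

```python
import time


def scroll_to_bottom(driver, step=100, pause=0.5, settle=2.0):
    """Scroll down in `step`-pixel increments until the page stops growing.

    `driver` only needs an execute_script(js) method, as a Selenium
    WebDriver has. Returns the last page height that was measured.
    """
    get_height = "return document.body.scrollHeight"
    height = 0
    new_height = driver.execute_script(get_height)
    while height < new_height:
        for y in range(height, new_height, step):
            driver.execute_script("window.scrollTo(0, {})".format(y))
            time.sleep(pause)
        height = new_height
        time.sleep(settle)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script(get_height)
    return height


class FakeDriver:
    """Hypothetical stand-in for a browser: heights 300, then 500, then stable."""

    def __init__(self):
        self.heights = [300, 500, 500]
        self.scrolled_to = []

    def execute_script(self, js):
        if js.startswith("return"):
            # Report the next height; keep repeating the last one.
            if len(self.heights) > 1:
                return self.heights.pop(0)
            return self.heights[0]
        self.scrolled_to.append(js)
        return None


fake = FakeDriver()
final_height = scroll_to_bottom(fake, pause=0, settle=0)
```

With a real browser you would pass `self.client` instead of the fake and keep the default pauses, since lazy-loading pages need time to append content.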

2. middlewares.py in the Scrapy project

...
import requests

# Import the browser-driving script
from selenium_chrome import Chrome_selenium
# HtmlResponse wraps the page source
from scrapy.http import HtmlResponse
...
...


class Seleniummiddleware(object):

    def __init__(self):
        self.chrome_selenium = Chrome_selenium()
        self.request_session = requests.Session()

    def process_request(self, request, spider):
        if 'www.zzzz.com' in request.url:
            one_list_page_source = self.chrome_selenium.get_one_page(request.url)
            # time.sleep(30)  # wait for loading; an explicit wait is the better option
            # print(f'one_list_page_source:{one_list_page_source}')
            # url: the URL the browser visited; request.url works here too.
            # body: the page source to wrap as the response body.
            # encoding: needed because body is a str; request is optional.
            return HtmlResponse(url=request.url, body=one_list_page_source, encoding='utf-8',
                                request=request)
        # Any other URL falls through and returns None, so Scrapy downloads it as usual.
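
The `'www.zzzz.com' in request.url` test is a substring match, so it would also fire for a URL that merely mentions that host in its query string. A stricter check might compare the parsed hostname instead. This is only a sketch: `needs_selenium` and `SELENIUM_HOSTS` are hypothetical names that do not appear in the original code.

```python
from urllib.parse import urlparse

# Hypothetical host list; www.zzzz.com is the placeholder used in the middleware.
SELENIUM_HOSTS = {"www.zzzz.com"}


def needs_selenium(url, hosts=SELENIUM_HOSTS):
    """Return True when the URL's hostname is one we render with Selenium."""
    hostname = urlparse(url).hostname or ""
    return hostname in hosts


# The host itself matches; a URL that only mentions it in the query does not.
on_target_host = needs_selenium("https://www.zzzz.com/list?page=2")
only_in_query = needs_selenium("https://example.com/?ref=www.zzzz.com")
```

Inside process_request the condition would then read `if needs_selenium(request.url):`.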

With that in place, the spider code can call xpath() on the response to pull out values, just as with any ordinary Scrapy response~

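Don't forget to enable the Selenium middleware for the spider (through the DOWNLOADER_MIDDLEWARES setting), otherwise requests never pass through it~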
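
Enabling it might look like the fragment below. This is a sketch under assumptions: `myproject` is a hypothetical package name to be replaced with your own, and 543 is just an arbitrary priority among the downloader middlewares.

```python
# settings.py
# "myproject" is a hypothetical project package; use your own module path.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.Seleniummiddleware": 543,
}
```

To switch it on for a single spider only, the same dictionary can go under the "DOWNLOADER_MIDDLEWARES" key of that spider's custom_settings attribute.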

All done, time to pack up~
