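scrapy-selenium: crawling a site with time-sensitive cookies using Google Chrome

Update: the cookie problem is solved in the latest version. If you need the specific method, leave a comment or send me a private message. -- March 20, 2020

Below is the original version, which collects the data with selenium.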

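1. Target site:

Heilongjiang Government Procurement Network (黑龙江政府采购网)

Anyone who has tried to crawl this site knows the pain: its cookies are time-sensitive, so after a while you have to re-enter them by hand, and the cookies returned by a script did not work either.

At least for a beginner like me, crawling it with the framework alone was not possible, and asking my front-end and back-end colleagues did not solve it. With no other option, I fell back on the automated testing tool selenium.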

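2. Analyzing the site:

No matter where you enter, the first request always lands back on this page, so we start directly from this URL: http://www.ccgp-heilongj.gov.cn/index.jsp

In the project's spider.py: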

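# -*- coding: utf-8 -*-
import re
import time

import scrapy
from scrapy import Request
from selenium import webdriver

from CCGPHLJ.items import CcgphljItem


class HljccgpSpider(scrapy.Spider):
    name = 'hlj_tender_11'
    allowed_domains = ['']
    start_urls = ['']
    bash_url = 'http://www.ccgp-heilongj.gov.cn'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Location of my chromedriver; change it if yours differs.
        # Raw string so the backslashes are not treated as escapes.
        self.chromedriverPath = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
        # executable_path worked with the selenium of 2020; newer
        # selenium 4 releases expect a Service object instead.
        self.browser = webdriver.Chrome(executable_path=self.chromedriverPath)
        self.browser.set_page_load_timeout(30)
        # True: the middleware drives the request through selenium.
        # False: the middleware does nothing and scrapy downloads the page itself.
        self.flag = True

    def closed(self, reason):
        # Scrapy calls closed() when the spider exits; quit Chrome here.
        # (The original named this spider_closed but never connected it to the signal.)
        print('Closing the page opened by selenium')
        self.browser.quit()

    def start_requests(self):
        for page in range(1, 3):
            full_url = "http://www.ccgp-heilongj.gov.cn/index.jsp"
            yield scrapy.Request(full_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        second_url = response.xpath(r'//span[@class="lbej"]/a/@onclick').extract()
        for j, url in enumerate(second_url):
            if "http" in url:
                full_url = url
            else:
                full_url = self.bash_url + re.findall(r"href='(.*?)'", url)[0]
            self.flag = False  # With False, the middleware passes the request straight through
            # The original yielded an undefined `request`; building it this way
            # (callback and dont_filter) is an assumption.
            yield Request(full_url, callback=self.parse_content, dont_filter=True)

    def parse_content(self, response):
        # Omitted; do your own processing here.
        print("Got a detail page", response.text)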
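The URL handling in parse() can be pulled out into a small helper that is easy to test on its own. This is only a sketch, not the author's code: the function name url_from_onclick is made up for illustration, and the sample onclick string is borrowed from the JavaScript the middleware executes, on the assumption that the list links use the same location.href='...' form.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.ccgp-heilongj.gov.cn"
# Same pattern the spider uses to pull the target out of the onclick attribute.
HREF_RE = re.compile(r"href='(.*?)'")


def url_from_onclick(onclick, base_url=BASE_URL):
    """Return an absolute URL taken from an onclick string, or None if none is found.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    match = HREF_RE.search(onclick)
    if match is None:
        return None
    # urljoin leaves absolute URLs alone and resolves relative ones against base_url.
    return urljoin(base_url, match.group(1))


if __name__ == "__main__":
    sample = "location.href='/welcome.jsp?dq=23';return false;"
    print(url_from_onclick(sample))
```

Unlike the indexing with [0] in parse(), this returns None for an onclick value without an href instead of raising IndexError, so the caller can simply skip such links.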

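In the middlewares.py middleware:

import time

from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException

    # Method of the downloader middleware class:
    def process_request(self, request, spider):
        if spider.name == 'hlj_tender_11':  # Only use selenium for this specific spider
            if spider.flag:
                try:
                    # The site navigates through JS, so no clicking is used.
                    spider.browser.get(request.url)
                    # Run JS to enter the home page.
                    spider.browser.execute_script("location.href='/welcome.jsp?dq=23';return false;")
                    # Run JS to enter the "more" page.
                    spider.browser.execute_script("onedet('4','');return false;")
                except TimeoutException as e:
                    print('Timed out', e)
                    spider.browser.execute_script('window.stop()')
                time.sleep(2)
                # Hand the rendered page back to scrapy as the response.
                return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                    encoding="utf-8", request=request)
            else:
                time.sleep(1)
                return None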
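One thing to watch in this design: spider.flag is a single value shared by every request, so once parse() sets it to False, any entry request that has not yet reached the middleware may skip the browser as well. A possible alternative, shown here only as a sketch, is to carry the decision on each request through request.meta. The key name use_selenium and the helper needs_browser are hypothetical, and plain dicts stand in for request.meta so the snippet runs without scrapy.

```python
from typing import Any, Mapping

# Hypothetical meta key; it does not appear in the original post.
SELENIUM_KEY = "use_selenium"


def needs_browser(meta: Mapping[str, Any]) -> bool:
    """Return True when this particular request asked to be rendered by the browser."""
    return bool(meta.get(SELENIUM_KEY, False))


if __name__ == "__main__":
    entry_meta = {SELENIUM_KEY: True}  # the index.jsp entry request
    detail_meta = {}                   # detail pages: normal scrapy download
    print(needs_browser(entry_meta), needs_browser(detail_meta))
```

In the spider this would mean passing meta={"use_selenium": True} when building the entry request, and in process_request checking needs_browser(request.meta) instead of spider.flag.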

Screenshots of part of the code and its effect:

Final crawl result:

The list pages and their contents were fetched successfully.
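To reproduce this result, the middleware also has to be enabled in the project's settings.py, which the post does not show. A minimal fragment might look like the following. The class path is an assumption based on the CCGPHLJ project name and the default name scrapy generates, so adjust it to the class that actually holds process_request.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical class path; replace with the real middleware class.
    "CCGPHLJ.middlewares.CcgphljDownloaderMiddleware": 543,
}
```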
