Automatically downloading files from Cninfo (巨潮) with a Python crawler on macOS

Environment setup

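I chose Python + selenium + wget + Safari to download the files. I originally planned to use PhantomJS, but with it the pages opened by clicking a link came up blank, so the files could not be downloaded.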

Error encountered when using Safari: selenium.common.exceptions.WebDriverException: Message: Could not create a session: You must enable the 'Allow Remote Automation' option in Safari's Develop menu to control Safari via WebDriver. To fix it, open Safari's Develop menu and check "Allow Remote Automation".
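The error above can be turned into a readable hint when the script starts. The following is a minimal sketch, not part of the original script: the names hint_for_session_error, open_safari and REMOTE_AUTOMATION_HINT are hypothetical, and the check assumes the error text contains the phrase quoted above.

```python
REMOTE_AUTOMATION_HINT = (
    "Enable Safari > Develop > Allow Remote Automation, then run the script again."
)


def hint_for_session_error(message):
    """Return a fix hint if the message is Safari's remote-automation error, else None."""
    if "Allow Remote Automation" in message:
        return REMOTE_AUTOMATION_HINT
    return None


def open_safari():
    """Start a Safari session, replacing the known setup error with a short hint."""
    # Imported here so hint_for_session_error can be used without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException

    try:
        return webdriver.Safari()
    except WebDriverException as exc:
        hint = hint_for_session_error(str(exc))
        if hint is not None:
            raise SystemExit(hint)
        raise
```

Calling open_safari() in place of webdriver.Safari() would stop the script with the hint instead of a traceback when the option is still disabled; any other WebDriverException is re-raised unchanged.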

Original code

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import re
import sys
import time

import wget
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class DownloadFromCninfo(object):
    """Download the announcement PDFs of one stock from cninfo.com.cn."""

    def __init__(self, stockNumberStr, maxNumber=10000):
        self.stockNumber = stockNumberStr
        self.RecordDownloadIndex = 1
        self.maxDownloadNumber = maxNumber
        # Choose the browser.
        self.driver = webdriver.Safari()
        #self.driver = webdriver.PhantomJS(executable_path='/usr/local/phantomjs/bin/phantomjs')
        # Codes from 600000 upward use the SSE page, all others the SZSE page.
        if int(stockNumberStr) >= 600000:
            self.dst_url = 'http://www.cninfo.com.cn/cninfo-new/disclosure/sse'
        else:
            self.dst_url = 'http://www.cninfo.com.cn/cninfo-new/disclosure/szse'
        # Create the download directory (and ./download/ itself if it is missing).
        prefixpath = "./download/"
        self.prefixpathname = prefixpath + self.stockNumber + "/"
        os.makedirs(self.prefixpathname, exist_ok=True)

    def switchToWindow(self, keyword):
        """Switch to the first window whose URL contains keyword (case-insensitive)."""
        for handle in self.driver.window_handles:
            self.driver.switch_to.window(handle)
            print('current url:%s' % self.driver.current_url)
            if keyword in self.driver.current_url.lower():
                return True
        return False

    def readCounts(self):
        """Return the numbers shown in the status text of the announcement list."""
        urldata = self.driver.find_element(
            By.XPATH, "//div[@id='con-div-his-fulltext']/div[@class='stat-right']")
        print('%s' % urldata.text)
        return re.findall(r'\d+', urldata.text)

    def downloadPDF(self):
        self.driver.quit()
        #self.driver = webdriver.PhantomJS(executable_path='/usr/local/phantomjs/bin/phantomjs')
        self.driver = webdriver.Safari()
        # Set a page-load timeout; without one the timeout may be unlimited
        # and the script hangs when the page cannot be reached.
        self.driver.set_page_load_timeout(10)
        self.driver.get(self.dst_url)
        self.driver.maximize_window()
        time.sleep(2)
        print('%s' % self.driver.current_url)
        self.driver.find_element(By.CLASS_NAME, "input-stock").send_keys(self.stockNumber)
        # Enter is sent instead of click(), which sometimes has no effect.
        #self.driver.find_element(By.XPATH, "//ul[@id='stock_list']/li[1]/a").click()
        self.driver.find_element(By.XPATH, "//ul[@id='stock_list']/li[1]/a").send_keys(Keys.ENTER)
        # Switch to the newly opened window.
        time.sleep(5)
        self.switchToWindow("show")
        time.sleep(1)
        rslt = self.readCounts()
        name = self.driver.find_element(By.XPATH, "//div[@id='plus-tag-div']/a/span").text
        print('%s' % name)
        # Do not maximize the window here; doing so changes the order of the handles.
        # Click "show more" until both numbers in the status text match
        # or the download limit is reached.
        while rslt[0] != rslt[1]:
            if int(rslt[1]) >= self.maxDownloadNumber:
                break
            self.driver.find_element(
                By.XPATH, "//div[@id='con-div-his-fulltext']/div[@class='show-more']/a").click()
            # Give the page time to respond.
            time.sleep(1)
            rslt = self.readCounts()
        listNum = int(rslt[1])
        for indexValue in range(1, listNum + 1):
            self.switchToWindow("show")
            time.sleep(1)
            # Date of the announcement.
            findXpathStr = "//ul[@id='ul_his_fulltext']/li[%d]/div[@class='t3']/dd/span[@class='d3']" % indexValue
            tmpTimeStr = self.driver.find_element(By.XPATH, findXpathStr).text
            print('timestr %s' % tmpTimeStr)
            # Title and link of the announcement.
            findXpathStr = "//ul[@id='ul_his_fulltext']/li[%d]/div[@class='t3']/dd/span/a" % indexValue
            print('%s' % findXpathStr)
            urlTextGet = self.driver.find_element(By.XPATH, findXpathStr)
            tmpName = urlTextGet.text
            print('%s' % tmpName)
            #if re.search('澄清公告', tmpName):
            #    print('Clarification announcement ignored: %s' % tmpName)
            #    continue
            urlTextGet.click()
            time.sleep(5)
            # Look for the window that shows the PDF.
            findlinkSuccess = self.switchToWindow("pdf")
            time.sleep(1)
            if findlinkSuccess:
                wgetURL = self.driver.current_url
                downloadfilename = '%s%s%s.pdf' % (self.prefixpathname, tmpTimeStr.strip(), tmpName)
                wget.download(wgetURL, downloadfilename)
                # Close the PDF window only; the list window stays open.
                self.driver.close()
            else:
                print('Invalid link, ignored')
            # Point the driver back at the most recent remaining window.
            for handle in self.driver.window_handles:
                self.driver.switch_to.window(handle)
            time.sleep(1)
        self.driver.close()
        self.driver.quit()


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Input stock number error!")
        print(sys.argv[0])
        sys.exit()
    downloadHandle = DownloadFromCninfo(sys.argv[1], 20)
    downloadHandle.downloadPDF()
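The script relies on fixed time.sleep() calls before it searches the window handles, so a slow page can leave it on the wrong window. As an alternative, here is a hedged sketch of a polling helper; switch_to_window_containing and its parameters are hypothetical names, and it only assumes a driver object that exposes window_handles, current_url and switch_to.window() as Selenium drivers do.

```python
import time


def switch_to_window_containing(driver, fragments, timeout=10.0, poll=0.5):
    """Switch to the first window whose URL contains any fragment.

    The comparison is case-insensitive. Returns True once such a window is
    found, or False if none shows up before the timeout expires.
    """
    wanted = [fragment.lower() for fragment in fragments]
    deadline = time.monotonic() + timeout
    while True:
        for handle in driver.window_handles:
            driver.switch_to.window(handle)
            url = driver.current_url.lower()
            if any(fragment in url for fragment in wanted):
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

For example, switch_to_window_containing(self.driver, ["pdf"]) could replace the time.sleep(5) and the handle loop that follow the click on an announcement link; a False result then marks the link as invalid.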

Open issues

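I do not know the specific reason PhantomJS does not work. I also found that when switching between PhantomJS and Safari, a click sometimes has no effect, and the element has to be activated by sending Enter instead.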
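One way to cope with the unreliable click is to try it first and fall back to Enter only when nothing happened. The sketch below is an assumption-laden illustration, not the original approach: click_with_enter_fallback, took_effect and settle are hypothetical names, and the caller has to supply the check that decides whether the click worked (for example, whether a new window handle appeared).

```python
import time

# Same code point as selenium.webdriver.common.keys.Keys.ENTER.
ENTER = "\ue007"


def click_with_enter_fallback(element, took_effect, settle=1.0):
    """Click the element; if took_effect() is still False, send Enter instead.

    Returns "click" or "enter" depending on which action worked,
    or None if neither did.
    """
    element.click()
    time.sleep(settle)  # give the page a moment to react
    if took_effect():
        return "click"
    element.send_keys(ENTER)
    time.sleep(settle)
    return "enter" if took_effect() else None
```

With a real driver, took_effect could be a lambda that compares len(driver.window_handles) with the count taken before the call.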
