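Selenium configuration for file downloads

Opening a PDF URL may show Chrome's built-in PDF preview instead of downloading the file, so Chrome needs to be configured as follows: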

from selenium import webdriver

download_dir = r"C:\Users\xxx\Desktop"
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    "download.default_directory": download_dir,  # Change default directory for downloads
    "download.prompt_for_download": False,  # To auto download the file
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True  # It will not show PDF directly in chrome
})
driver = webdriver.Chrome(r'C:\Users\HenryFox\Downloads\chromedriver.exe', options=options)  # Optional argument, if not specified will search path.

The code above is adapted from https://stackoverflow.com/a/54427220
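A minimal sketch of the same configuration packaged as two functions, assuming chromedriver can be found on the PATH so that no driver path has to be passed. `build_download_prefs` and `make_download_driver` are hypothetical helper names, not part of Selenium; the preference keys are the ones used above.

```python
def build_download_prefs(download_dir):
    """Return Chrome preferences that save PDFs to download_dir instead of previewing them."""
    return {
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "plugins.always_open_pdf_externally": True,
    }


def make_download_driver(download_dir):
    """Start Chrome with the download preferences applied.

    Assumption: chromedriver is discoverable on the PATH.
    """
    # Imported here so the preference helper can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", build_download_prefs(download_dir))
    return webdriver.Chrome(options=options)
```

Keeping the preferences in a plain dictionary makes it easy to inspect or log them before a browser is started.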

Sanitizing file names

Characters that are not allowed in file and path names have to be removed. The code below is taken from https://www.polarxiong.com/archives/Python-%E6%9B%BF%E6%8D%A2%E6%88%96%E5%8E%BB%E9%99%A4%E4%B8%8D%E8%83%BD%E7%94%A8%E4%BA%8E%E6%96%87%E4%BB%B6%E5%90%8D%E7%9A%84%E5%AD%97%E7%AC%A6.html

import re

def validateTitle(title):
    rstr = r"[\/\\\:\*\?\"\<\>\|]"  # '/ \ : * ? " < > |'
    new_title = re.sub(rstr, "_", title)  # replace each invalid character with an underscore
    return new_title
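To illustrate what the substitution does, here is a hedged variant. `sanitize_filename` and its `max_len` parameter are hypothetical names introduced for this sketch. Beyond the original regex, it also caps the length and trims trailing spaces and dots (which Windows does not accept at the end of a name); both are assumptions about what may be useful, not part of the quoted code.

```python
import re

# Characters that cannot appear in Windows file names: / \ : * ? " < > |
_INVALID_CHARS = re.compile(r'[/\\:*?"<>|]')


def sanitize_filename(title, max_len=150):
    """Replace invalid characters with underscores, then cap length and trim the end."""
    cleaned = _INVALID_CHARS.sub("_", title)
    # Truncate first, then strip, so the result never ends in a space or dot.
    return cleaned[:max_len].rstrip(" .")


print(sanitize_filename('Joins: fast/slow? "A" vs <B>'))
```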

Button click errors

Error message: ElementClickInterceptedException: element click intercepted

See https://www.cnblogs.com/xiaoguo-/p/12143912.html for details.

Fix: use webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
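One way to apply this fix only when it is needed is to try a normal click first and fall back to `ActionChains` when the click is intercepted. This is a sketch, not the linked article's code; `safe_click` is a hypothetical helper name.

```python
def safe_click(driver, element):
    """Click element, falling back to ActionChains if the plain click is intercepted."""
    # Imported lazily so that defining this helper does not require Selenium.
    from selenium.common.exceptions import ElementClickInterceptedException
    from selenium.webdriver.common.action_chains import ActionChains

    try:
        element.click()
    except ElementClickInterceptedException:
        # Move the pointer onto the element first, then click it.
        ActionChains(driver).move_to_element(element).click(element).perform()
```

Catching only `ElementClickInterceptedException` keeps other failures, such as a stale element, visible instead of silently retrying them.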

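Example

Goal: fetch all PDF files and the related metadata from https://dl.acm.org/doi/proceedings/10.1145/3448016

Getting content by `XPath`

First, get the elements that contain the papers (one per session):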

from selenium import webdriver
import time

driver = webdriver.Chrome(r'C:\Users\xxx\Downloads\chromedriver.exe')
driver.get("https://dl.acm.org/doi/proceedings/10.1145/3448016")
tabs = driver.find_elements_by_xpath('//*[@id="pb-page-content"]/div/main/div[4]/div/div[2]/div[1]/div/div[2]/div/div/div')

Expanding all collapsed session tabs

for i in tabs:
    if 'js--open' not in i.get_attribute('class'):
        i.find_element_by_tag_name('a').click()
        print("GETTING", i.find_element_by_tag_name('a').text)
        while i.get_attribute("data-ajaxloaded") != 'true':
            time.sleep(5)
print("FINISHED!")
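The loop above waits indefinitely if a session never finishes loading. A minimal sketch of a bounded wait using only the standard library follows; `wait_until`, `timeout`, and `interval` are hypothetical names, and the default values are arbitrary assumptions.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll predicate() until it returns True or timeout seconds have passed.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the loop above, the inner `while` could become `wait_until(lambda: i.get_attribute("data-ajaxloaded") == 'true')`, with a `False` return value signalling a session that did not load in time.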

Getting the paper list of every session

from selenium.webdriver.common.action_chains import ActionChains

result = []
for i in tabs:
    TAB_NAME = i.find_element_by_tag_name('a').text
    papers = i.find_elements_by_class_name('issue-item-container')
    for p in papers:
        PAPER_TYPE = p.find_element_by_class_name('issue-item__citation').text
        title = p.find_element_by_class_name('issue-item__title').find_element_by_tag_name('a')
        PAPER_TITLE = title.text
        PAPER_URL = title.get_attribute('href')
        # Author information
        more_author_count = p.find_elements_by_class_name('count-list')
        if len(more_author_count) > 0:
            # more_author_count[0].click()  # cannot be clicked directly
            ActionChains(driver).move_to_element(more_author_count[0]).click().perform()
            time.sleep(0.5)
        AUTHORS = [[author.text, author.get_attribute('href')] for author in p.find_element_by_tag_name('ul').find_elements_by_tag_name('a') if author.text != '(Less)']
        # Month, pages, website
        ISSUE_DETAIL = [e.text for e in p.find_element_by_class_name('issue-item__detail').find_elements_by_tag_name('span')]
        # Abstract
        # abstract_more = p.find_element_by_class_name('issue-item__abstract').find_elements_by_tag_name('a')
        # if len(abstract_more) > 0:
        #     ActionChains(driver).move_to_element(abstract_more[0]).click().perform()
        #     time.sleep(0.5)
        ABSTRACT = p.find_element_by_class_name('issue-item__abstract').text
        if ABSTRACT.endswith("(Less)") or ABSTRACT.endswith("(More)"):
            ABSTRACT = ABSTRACT[:-6]
        DOWNLOAD = ""
        for a in p.find_elements_by_tag_name('a'):
            if a.get_attribute('data-title') == 'PDF':
                DOWNLOAD = a.get_attribute('href')
                break
        result.append([TAB_NAME, PAPER_TYPE, PAPER_TITLE, PAPER_URL, AUTHORS, ISSUE_DETAIL, ABSTRACT, DOWNLOAD])
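Each row appended to `result` is a positional list, which is why later steps index fields by number (0 for the session, 2 for the title, 7 for the PDF link). A hedged sketch of giving those positions names, together with the "(More)"/"(Less)" stripping as a standalone function, is shown below. `Paper`, `strip_toggle_suffix`, and the lowercase field names are hypothetical, the sample row is made up, and unlike the slice above this version also trims trailing whitespace.

```python
from collections import namedtuple

# Field order mirrors the list appended to `result` above.
Paper = namedtuple(
    "Paper",
    ["session", "paper_type", "title", "url",
     "authors", "issue_detail", "abstract", "pdf_url"],
)


def strip_toggle_suffix(text, suffixes=("(Less)", "(More)")):
    """Remove a trailing expand/collapse label from scraped text, if present."""
    for suffix in suffixes:
        if text.endswith(suffix):
            return text[: -len(suffix)].rstrip()
    return text


# Made-up sample row, only to show the shape of the data.
row = ["Session 1", "research-article", "A Title", "https://example.org/doi",
       [["A. Author", "https://example.org/author"]], ["June 2021"],
       "An abstract. (More)", ""]
paper = Paper(*row)._replace(abstract=strip_toggle_suffix(row[6]))
print(paper.title, "|", paper.abstract)
```

Because a namedtuple is still a tuple, index-based access such as `paper[7]` keeps working alongside `paper.pdf_url`.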

Saving the data

Save the results as both JSON and CSV:

import json

with open("sigmod2021.json", 'w') as f:
    json.dump(result, f)

import pandas as pd

dataframe = pd.DataFrame({
    'SESSION': [i[0] for i in result],
    'TITLE': [i[2] for i in result],
    'DOI': [i[3] for i in result],
    'PDF_URL': [i[7] for i in result]
})
dataframe.to_csv("sigmod2021.csv", index=False)
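Titles and author names may contain non-ASCII characters, so it can help to set the file encoding explicitly. The sketch below uses only the standard library, with the `csv` module swapped in for pandas, so it is an alternative rather than the approach shown above. `save_results` is a hypothetical helper name.

```python
import csv
import json


def save_results(result, json_path, csv_path):
    """Write the scraped rows to a UTF-8 JSON file and a UTF-8 CSV file."""
    # ensure_ascii=False keeps non-ASCII characters readable in the JSON file.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["SESSION", "TITLE", "DOI", "PDF_URL"])
        for row in result:
            writer.writerow([row[0], row[2], row[3], row[7]])
```

Calling `save_results(result, "sigmod2021.json", "sigmod2021.csv")` would produce the same two files as above, with the same four CSV columns.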

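Downloading and renaming files

The driver must be configured as described in the first section, on the download configuration. This step has to download each file, work out which file in the download directory is the newly downloaded one, and move it into the folder of its session.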

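import os
import shutil
from collections import defaultdict

# Assumption: session_collection was not defined in the original;
# a per-session counter starting at 0 is used here.
session_collection = defaultdict(int)

for r in result:
    session_collection[r[0]] += 1
    os.makedirs(os.path.join(download_dir, validateTitle(r[0])), exist_ok=True)
    title = ('%03d-' % session_collection[r[0]]) + validateTitle(r[2]) + '.pdf'
    now_files = os.listdir(download_dir)  # snapshot before the download starts
    driver.get(r[7])
    time.sleep(30)
    # The new file is the one that was not in the snapshot.
    for name in os.listdir(download_dir):
        if name not in now_files:
            shutil.move(os.path.join(download_dir, name), os.path.join(download_dir, validateTitle(r[0]), title))
            print("OK", name)
            break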
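The fixed `time.sleep(30)` is either too long for small files or too short for large ones. A hedged alternative is to poll the download directory until a new, finished file appears. `wait_for_new_file` and its parameters are hypothetical names; the sketch relies on Chrome giving in-progress downloads a `.crdownload` suffix and assumes no other program writes into the directory at the same time.

```python
import os
import time


def wait_for_new_file(directory, before, timeout=120.0, interval=0.5):
    """Return the name of a finished file that appeared in directory, or None on timeout.

    `before` is the os.listdir() snapshot taken just before the download started.
    """
    known = set(before)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for name in os.listdir(directory):
            # Skip files that existed already and downloads still in progress.
            if name not in known and not name.endswith(".crdownload"):
                return name
        time.sleep(interval)
    return None
```

In the loop above, the sleep and the inner `for` could be replaced by `name = wait_for_new_file(download_dir, now_files)`, moving the file only when `name` is not `None`.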


Selenium PDF Download in Practice, 2021-07-21
