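Python | Batch file downloads with Selenium

Goal: batch-download every driver for a given Lenovo model.

Most web scraping tasks either save page content or download a single file. Batch-downloading many files is harder, because neither the download time nor the resulting file names are known in advance.

Approach

Configuration

Before downloading anything, configure chromedriver with a default download directory: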

global base_path
profile = {'download.default_directory': base_path}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
driver = webdriver.Chrome(executable_path='../common/chromedriver', options=chrome_options)
driver.implicitly_wait(10)
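
One detail worth checking here: as far as I know, Chrome expects an absolute path for download.default_directory, so a relative value such as './drivers' may not take effect. Below is a minimal sketch of building the prefs dict with that in mind. The helper name build_download_prefs is hypothetical and not part of the original script, and the extra download.prompt_for_download key is my own addition.

```python
import os


def build_download_prefs(directory):
    """Return a Chrome 'prefs' dict that points downloads at `directory`.

    Hypothetical helper, not part of the original script.
    """
    # Assumption: Chrome wants an absolute path here, so resolve it first.
    absolute = os.path.abspath(directory)
    # Create the directory up front so later os.listdir() calls do not fail.
    os.makedirs(absolute, exist_ok=True)
    return {
        'download.default_directory': absolute,
        # Assumption: this key suppresses the "Save as" dialog.
        'download.prompt_for_download': False,
    }
```

The result would be passed in place of profile, e.g. chrome_options.add_experimental_option('prefs', build_download_prefs('./drivers')).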

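Page analysis

On Lenovo's website, each model's driver download page looks like the screenshot above. A login overlay sits in front of the page, but it does not actually block clicks. Two things to watch for:

Each driver category has to be clicked to expand its table of download items. Until then, the elements can be located but their text comes back wrong.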
```
driver_list.find_element_by_class_name('download-center_list_t_icon').click()
```

It is best to skip the header row of each download table:

if sub_list.find_element_by_class_name('download-center_usblist_td01').text == '驱动名称':continue

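Handling downloads

On the page, find the "普通下载" (normal download) element and click it to start a download. The goal is to rename each file according to the list on the page and archive it into the matching folder, but two problems come up:

The name of the downloaded file cannot be controlled.
With sequential downloads, there is no way to know how long each one takes. With parallel downloads, there is no reliable way to tell the files apart for renaming.

After a long search I found no way to rename a file at download time, so I settled on downloading sequentially and renaming and archiving each file once it finishes. The steps:

For each driver category, create a folder first, e.g. 主板 (motherboard).
Click the download link to start the download.
Use the os module to list the files in the download directory, sort them by timestamp (the code below uses modification time), and pick the newest one.
Unfinished Chrome downloads carry the .crdownload suffix, so use the suffix to decide whether the download is done; if not, keep waiting.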
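
The waiting step above can be sketched with the standard library alone. This is a rough sketch rather than the original implementation: newest_file and wait_for_download are hypothetical names, and the timeout is my own addition so that a stalled download cannot block the loop forever.

```python
import os
import time

# System files that should never be treated as downloads.
IGNORED = {'.DS_Store', 'thumbs.db'}


def newest_file(directory):
    """Return the most recently modified regular file in `directory`, or ''."""
    paths = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name not in IGNORED
    ]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime) if files else ''


def wait_for_download(directory, timeout=600, poll=1.0):
    """Block until the newest file is not a .crdownload, then return its path.

    Raises TimeoutError if nothing finishes within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        path = newest_file(directory)
        if path and not path.endswith('.crdownload'):
            return path
        time.sleep(poll)
    raise TimeoutError('download did not finish within %s seconds' % timeout)
```

This only works if the download directory holds no finished files when a download starts, which is the case here because each file is moved out as soon as it completes.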

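Once the download completes, rename the file and move it into the archive folder created earlier. Note that file names cannot contain the / character, or the rename fails, so replace it first.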
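
A minimal sketch of the rename-and-archive step, assuming hypothetical helpers safe_name and archive_file. Replacing the backslash as well is my own addition for Windows paths; the original script only replaces /.

```python
import os
import shutil


def safe_name(title):
    """Replace path separators so the title can be used as a file name."""
    return title.replace('/', '_').replace('\\', '_')


def archive_file(src, dest_dir, title):
    """Move `src` into `dest_dir`, renamed to `title` plus the original extension."""
    os.makedirs(dest_dir, exist_ok=True)
    extension = os.path.splitext(src)[1]  # '' if the file has no extension
    dest = os.path.join(dest_dir, safe_name(title) + extension)
    shutil.move(src, dest)
    return dest
```

shutil.move is used here instead of os.rename because it also works when the source and destination are on different file systems.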

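Later testing turned up a few more pitfalls:

When looking for the newest file, skip .DS_Store (on macOS; on Windows, thumbs.db).
Check that the newest entry is not a folder; the filter function handles this.

The lookup of the newest file is implemented as follows: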

def sort_file():
    # Sort the files and return the newest one, or '' if there is none
    dir_link = base_path
    dir_lists = list(filter(check_file, os.listdir(dir_link)))
    if len(dir_lists) == 0:
        return ''
    else:
        dir_lists.sort(key=lambda fn: os.path.getmtime(dir_link + os.sep + fn))
        return os.path.join(base_path, dir_lists[-1])


def check_file(filename):
    # Ignore system files
    if filename == '.DS_Store' or filename == 'thumbs.db':
        return False
    global base_path
    # Exclude folders
    return os.path.isfile(os.path.join(base_path, filename))
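
Sorting by modification time assumes the newest file is the one just downloaded. A different technique, which the original script does not use, is to snapshot the directory before clicking and diff it afterwards. A small sketch, with hypothetical helper names snapshot and new_entries:

```python
import os


def snapshot(directory):
    """Return the set of entry names currently in `directory`."""
    return set(os.listdir(directory))


def new_entries(directory, before):
    """Return the names added to `directory` since the `before` snapshot, sorted."""
    return sorted(snapshot(directory) - before)
```

Calling before = snapshot(base_path) just before the click and new_entries(base_path, before) afterwards would list only the files created by that download, including any .crdownload file still in progress.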

The final result looks like this:

Complete code

import os
import re
import time

from selenium import webdriver

# Note: this uses the Selenium 3 API (find_element_by_*, executable_path),
# which Selenium 4 has since removed.


def sort_file():
    # Sort the files and return the newest one, or '' if there is none
    dir_link = base_path
    dir_lists = list(filter(check_file, os.listdir(dir_link)))
    if len(dir_lists) == 0:
        return ''
    else:
        dir_lists.sort(key=lambda fn: os.path.getmtime(dir_link + os.sep + fn))
        return os.path.join(base_path, dir_lists[-1])


def check_file(filename):
    # Ignore system files
    if filename == '.DS_Store' or filename == 'thumbs.db':
        return False
    global base_path
    # Exclude folders
    return os.path.isfile(os.path.join(base_path, filename))


def download_drivers(url):
    global base_path
    profile = {'download.default_directory': base_path}
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_experimental_option('prefs', profile)
    driver = webdriver.Chrome(executable_path='../common/chromedriver', options=chrome_options)
    driver.implicitly_wait(10)
    driver.get(url)
    driver_lists = driver.find_elements_by_class_name('dlist-item')
    for driver_list in driver_lists:
        # Keep only Chinese characters and ASCII letters for the folder name
        title = ''.join(re.findall(r'[\u4e00-\u9fa5a-zA-Z]+', driver_list.text))
        temp_path = './drivers/' + title
        if not os.path.exists(temp_path):
            os.mkdir(temp_path)
        # Expand the category so its rows return the correct text
        driver_list.find_element_by_class_name('download-center_list_t_icon').click()
        sub_lists = driver_list.find_elements_by_tag_name('tr')
        for sub_list in sub_lists:
            try:
                # '驱动名称' ("driver name") is the header row text on the page
                if sub_list.find_element_by_class_name('download-center_usblist_td01').text == '驱动名称':
                    continue
                else:
                    sub_title = sub_list.find_element_by_class_name('download-center_usblist_td01').\
                        find_element_by_tag_name('a').get_attribute('title').replace('/', '_')
                    print('Starting download: ' + sub_title)
                    # '普通下载' ("normal download") is the link text on the page
                    sub_list.find_element_by_link_text('普通下载').click()
                    # Wait for the download to start
                    time.sleep(2)
                    while True:
                        oldname = sort_file()
                        file_type = oldname.split('.')[-1]
                        if oldname != '' and file_type != 'crdownload':
                            print('Download finished')
                            break
                        else:
                            print('Waiting for download...')
                            time.sleep(10)
                    newname = temp_path + os.sep + sub_title + '.' + file_type
                    os.rename(oldname, newname)
                    print('Archived')
            except Exception as e:
                print(e)
                continue
    print('All downloads done')
    driver.quit()


if __name__ == '__main__':
    base_path = './drivers'
    if not os.path.exists(base_path):
        os.mkdir(base_path)
        print('Created drivers folder')
    # T470s win10 64bit
    url = "https://think.lenovo.com.cn/support/driver/newdriversdownlist.aspx?categoryid=12832&CODEName=ThinkPad%20T470s&SearchType=1&wherePage=1&SearchNodeCC=ThinkPad%20T470s"
    # T470s win7 64bit
    # url = 'https://think.lenovo.com.cn/support/driver/newdriversdownlist.aspx?categoryid=12832&CODEName=ThinkPad%20T470s&SearchType=1&wherePage=1&SearchNodeCC=ThinkPad%20T470s&osid=26'
    # T460s win10 64bit
    # url = 'https://think.lenovo.com.cn/support/driver/newdriversdownlist.aspx?yt=pt&categoryid=12358&CODEName=ThinkPad%20T460s&SearchType=0&wherePage=2&osid=42'
    # T460s win7 64bit
    # url = 'https://think.lenovo.com.cn/support/driver/newdriversdownlist.aspx?yt=pt&categoryid=12358&CODEName=ThinkPad%20T460s&SearchType=0&wherePage=2&osid=26'
    # T450s win10 64bit
    # url = 'https://think.lenovo.com.cn/support/driver/newdriversdownlist.aspx?yt=pt&categoryid=12002&CODEName=ThinkPad%20T450s&SearchType=0&wherePage=2&osid=42'
    download_drivers(url)
