Various ways to download data with Python

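In the previous post I wrote about how to scrape download links from a website, along with some Python programs of my own that mainly used libraries such as requests and urllib.request. I found, however, that downloading files this way has problems: interrupted downloads cannot be resumed, large files are handled poorly, there is no progress display, there is no checksum verification, and so on. It only suits a small number of small files. Today I found two useful tools for downloading.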

The download code I wrote at first


import os
import time

import requests

Timeout = 200  # timeout for a single request, in seconds
url = "https://"  # download link
File_name = url.rsplit('/', 1)[-1]


def formatFloat(num):
    # Format numbers consistently for output
    return '{:.3f}'.format(num)


def Download_point_to_point(name, url, Timeout):
    # Download url into the file name; Timeout is the request timeout.
    # Returns the download status as a string.
    try:
        r1 = requests.get(url, stream=True, verify=False)
        total_size = int(r1.headers['Content-Length'])
    except Exception as e:
        error_type = e.__class__.__name__
        print('error type is:', error_type)  # show the exception name
        if error_type == 'MissingSchema':
            return 'link_error'
        if error_type == 'ConnectionError' or error_type == 'ReadTimeout':
            return 'Timeout'
        raise  # anything else is unexpected; total_size would be undefined below

    if os.path.exists(name):
        temp_size = os.path.getsize(name)  # size of what is already on disk
    else:
        temp_size = 0
    if temp_size == total_size:  # sizes match, nothing to do
        print(name + ' has successful download (file size check)')
        return 'successful'

    i = 0
    headers = {'Proxy-Connection': 'keep-alive'}
    while i < 3:  # try up to three times
        try:
            r = requests.get(url, timeout=Timeout, stream=True, headers=headers)
            length = float(r.headers['content-length'])  # file length
            count = 0
            count_tmp = 0
            time1 = time.time()
            with open(name, 'wb') as f:  # closed even if the download fails
                for chunk in r.iter_content(chunk_size=256):
                    if chunk:
                        f.write(chunk)
                        count += len(chunk)
                        if time.time() - time1 > 2:  # report every 2 s
                            p = count / length * 100  # percent downloaded
                            speed = (count - count_tmp) / 1024 / 1024 / 2  # speed in MB/s
                            count_tmp = count
                            print(name + ': ' + formatFloat(p) + '%' + ' Speed: ' + formatFloat(speed) + 'MB/S')
                            time1 = time.time()
            print(name + ': download finished!')
            return 'successful'
        except Exception as e:
            error_type = e.__class__.__name__
            print('error type is:', error_type)  # show the exception name
            if error_type == 'MissingSchema':
                print(name + " link error")
                return 'link_error'
            if error_type == 'ConnectionError' or error_type == 'ReadTimeout':
                i = i + 1  # only connection errors and timeouts are retried
                print(name + " timed out, try again ", i)
            else:
                raise  # do not loop forever on an unexpected error
    return 'Timeout'  # the loop ended without a finished download

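This code has a simple download progress display and mainly uses the requests library. The lack of resumable downloads is the most painful part, because the data I need downloads very slowly. I did not really know how to write the exception handling, so I just threw something together, and it does not work very well.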
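The retry loop above decides what to do by comparing exception class names as strings. A minimal sketch of another way to structure it, assuming the caller passes in the exception classes that are worth retrying: `retry_call` and its parameters are hypothetical names for illustration, not part of requests. With requests, `retriable` would probably be `(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`.

```python
import time


def retry_call(func, retriable, attempts=3, delay=0.0):
    """Call func() until it succeeds, retrying only on the given exception types.

    retry_call, retriable, attempts and delay are illustrative names.
    Any exception not listed in retriable propagates immediately.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retriable as e:
            last_error = e
            print('attempt %d failed: %s' % (attempt, e.__class__.__name__))
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
    raise last_error  # every attempt failed with a retriable error


if __name__ == '__main__':
    calls = {'n': 0}

    def flaky():
        # Simulated download: fails twice with a timeout, then succeeds
        calls['n'] += 1
        if calls['n'] < 3:
            raise TimeoutError('simulated timeout')
        return 'successful'

    print(retry_call(flaky, (ConnectionError, TimeoutError)))
```

Keeping the retry policy separate from the download code means an unexpected error stops the download at once instead of being retried or silently ignored.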

A requests-based method with resumable downloads (may not always work)

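I found a requests-based method online that supports resuming. It works by comparing the length of the local file with the total length of the file to be downloaded and continuing from there. After downloading, though, I found that some finished files were longer than the Content-Length in the headers, which makes me doubt the integrity of files downloaded this way. I mainly followed this article:
"Resumable file downloads in Python: still afraid of a large download breaking halfway? Not anymore!"
Here is my code as well: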

import os
import sys
import time

import requests

# Suppress warning messages
requests.packages.urllib3.disable_warnings()
url = 'https://d'  # download link
file_path = './spZbest'
Timeout = 100


def show_progress(file_path, temp_size, total_size, speed):
    # Redraw a 50-character progress bar on the current line
    done = int(50 * temp_size / total_size)
    sys.stdout.write("\r[%s%s] %d%% %s %dkB/S" % ('█' * done, ' ' * (50 - done), 100 * temp_size / total_size, file_path, int(speed)))
    sys.stdout.flush()


def Download_continued_mode(file_path, url, Timeout):
    try:
        # The first request is only there to get the total file size
        r1 = requests.get(url, stream=True, verify=False)
        total_size = int(r1.headers['Content-Length'])
        # Important: check how much of the file is already on disk
        if os.path.exists(file_path):
            temp_size = os.path.getsize(file_path)  # size already downloaded
        else:
            temp_size = 0
        # Show how much has been downloaded
        print(temp_size)
        print(total_size)
        # Core part: ask the server to continue after the bytes we already have
        headers = {'Range': 'bytes=%d-' % temp_size}
        # Request the URL again with the new header
        r = requests.get(url, stream=True, verify=False, headers=headers, timeout=Timeout)
        time1 = time.time()
        count_tmp = temp_size
        speed = 0
        # Note the "ab" mode: the file is opened for appending
        with open(file_path, "ab") as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    temp_size += len(chunk)
                    f.write(chunk)
                    f.flush()
                    # Progress display
                    if (time.time() - time1) >= 1:  # update once per second
                        speed = (temp_size - count_tmp) / 1024  # speed in kB/s
                        count_tmp = temp_size
                        time1 = time.time()
                        show_progress(file_path, temp_size, total_size, speed)
        show_progress(file_path, temp_size, total_size, speed)  # final update
    except Exception as e:
        error_type = e.__class__.__name__
        print('error type is:', error_type)  # show the exception name
        if error_type == 'MissingSchema':
            print(file_path + " link error")
            return 'link_error'
        if error_type == 'ConnectionError' or error_type == 'ReadTimeout':
            print(file_path + " timed out, try again ")
            return 'Timeout'
        raise  # do not report success after an unexpected error
    print()  # move past the line that was redrawn with \r
    return 'successful'
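A resumed file that ends up larger than Content-Length may come from how the server answers the Range request. A server that ignores Range replies with status 200 and the whole body, which is then appended to the partial file; a range that starts at or past the end of the file is answered with 416. Below is a minimal sketch of checks that could guard against this. `range_header_for` and `open_mode_for` are hypothetical helper names, and this status handling is my assumption about what the function should do, not something taken from the referenced article.

```python
def range_header_for(local_size, total_size):
    """Return the Range header for resuming, or None if nothing is left to fetch."""
    if local_size >= total_size:
        return None  # already complete (or the local file is larger than expected)
    return {'Range': 'bytes=%d-' % local_size}


def open_mode_for(status_code):
    """Pick the file mode from the HTTP status of the ranged request."""
    if status_code == 206:
        return 'ab'  # Partial Content: the server honoured Range, so append
    if status_code == 200:
        return 'wb'  # full body sent: Range was ignored, so start over
    raise ValueError('unexpected status code: %d' % status_code)


if __name__ == '__main__':
    print(range_header_for(40, 100))
    print(open_mode_for(206))
```

In Download_continued_mode this would mean returning early when `range_header_for` gives None, and opening the file with `open_mode_for(r.status_code)` instead of a fixed "ab".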

The wget method

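Last is the wget method. There is a Python library called wget that makes downloading files very convenient. For usage, see:
Link: "Using wget-style downloads in Python".
wget.download is very simple and convenient, but I looked through the help text and found no way to resume an interrupted download. That may be its only drawback.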

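import os

import wget

url = 'https://'  # download link
PATH = '.'  # target directory (assumed here; set it to your own download folder)

file_name = wget.filename_from_url(url)  # file name, i.e. the last part of the link
print(os.path.join(PATH, file_name))
help(wget)  # read the module documentation
wget.download(url, out=os.path.join(PATH, file_name))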

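Later I saw that the wget application itself (not the Python library) can be used directly. Driving it from Python through the command line works just as well and offers more download modes, including resuming, whereas the Python wget library has only a single download function.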

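So first download a wget build that runs on Windows. I followed this:
Link: "Wget for Windows: elegant batch downloading".

Once wget is set up, write the code. For multithreading, see:
Link: "Downloading files in parallel with Python and wget".

Then have a look at "wget command explained in detail"
and here, Link: "Linux wget command explained in detail".

Here is the code: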

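import subprocess
import urllib.request

import wget


def Download_wget_OS(url, PATH):
    # Download a single file. PATH is not used here: wget saves into the current directory.
    # Check that the URL can be opened
    try:
        url_opener = urllib.request.urlopen(url)
    except Exception:
        print('open url error')
        return 'url error'
    # Check that the response reports a content length
    file_name = wget.filename_from_url(url)  # file name, i.e. the last part of the link
    content_length = url_opener.headers['Content-Length']
    if not content_length:
        print('no content length')
        return 'no Content'
    print('The size of ' + file_name + ' is ' + content_length)
    # Same as running the command directly in a shell; -c resumes a partial download
    status_subprocess = subprocess.call('wget -c %s' % (url), shell=True)
    if status_subprocess == 0:
        print('[%s]:download complete!' % (file_name))
        return 'download complete'
    else:
        print('[%s]:download failed !' % (file_name))
        return 'download failed'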
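The call above interpolates the URL into a shell command string, so a URL containing characters such as & or spaces would be interpreted by the shell. One possible variation builds the command as an argument list and also tells wget where to save the file. `build_wget_command` and `run_wget` are hypothetical helper names; `-c` (continue), `-P` (directory prefix), `-t` (tries) and `-T` (timeout in seconds) are standard GNU wget options, and the default values are only examples.

```python
import subprocess


def build_wget_command(url, out_dir, tries=3, timeout=100):
    """Build the wget argument list; no shell quoting is needed for a list."""
    return ['wget', '-c', '-P', out_dir, '-t', str(tries), '-T', str(timeout), url]


def run_wget(url, out_dir):
    """Run wget and return its exit code (0 means success).

    Assumes the wget executable can be found on the system PATH.
    """
    return subprocess.call(build_wget_command(url, out_dir))


if __name__ == '__main__':
    # Only prints the command; call run_wget(...) to actually download
    print(build_wget_command('https://example.com/data.zip', 'downloads'))
```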

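This is of course the approach I would recommend, because it can not only download a list of URLs but also be combined with multiple processes. For more on concurrency, see:
Link: "Python concurrency: gevent coroutine basics (5)".
and here:
Link: "Python multithreading (1)".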
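As a rough sketch of downloading a list of URLs concurrently, the block below uses a thread pool from the standard library rather than gevent or multiple processes. Downloads spend most of their time waiting on the network, so threads are usually enough. `download_all` and the `download_one` callback are hypothetical names; in practice `download_one` could wrap a function like Download_wget_OS with the target directory fixed.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download_one, workers=4):
    """Run download_one(url) for every URL in a thread pool.

    Returns a dict mapping each URL to the status download_one returned,
    or to 'error: <ExceptionName>' if it raised.
    """
    urls = list(urls)  # allow any iterable and keep a stable order

    def safe(url):
        try:
            return download_one(url)
        except Exception as e:
            return 'error: ' + e.__class__.__name__

    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(safe, urls))  # map preserves input order
    return dict(zip(urls, statuses))


if __name__ == '__main__':
    def fake_download(url):
        # Stand-in for a real download function
        return 'download complete'

    print(download_all(['https://example.com/a', 'https://example.com/b'], fake_download))
```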
