For my English study I often download VOA MP3s from the web; a site I visit frequently is http://www.51voa.com/

To download an MP3 from that site, you have to pick an article by hand, open it, and then choose the MP3 file to download. That is fine for a single file, but downloading every MP3 within some period of time means repeating the same tedious clicks over and over. Could Python be used to build a VOA MP3 download tool?

The design is as follows:

一、Open the http://www.51voa.com/ home page, parse the HTML, and extract the file list from the "VOA美国之音听力最近更新" (latest updates) section, building a <file name, download url> dictionary

二、Filter the dictionary by the current date, keeping only the VOA MP3s published today

三、Iterate over the filtered dictionary and download each file

The techniques involved:

一、HTML parsing: the standard library offers HTMLParser and SGMLParser, and there are also third-party parsers such as BeautifulSoup (which handles both HTML and XML well); this article uses BeautifulSoup
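To give a flavor of the parsing step, here is a minimal Python 3 sketch using the standard-library HTMLParser mentioned above (the HTML snippet and the LinkExtractor class are illustrative only, not the real 51voa markup; the article's own code below uses BeautifulSoup under Python 2):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect a <text, href> dictionary for every <a> tag on the page."""
    def __init__(self):
        super().__init__()
        self._in_a = False
        self._href = None
        self._text = []
        self.links = {}  # <file name, download url> dictionary, as in the design

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            self.links[''.join(self._text).strip()] = self._href
            self._in_a = False

# an invented snippet shaped roughly like a "latest updates" list
html = '''<span id="list">
  <li><a href="/VOA_Special_English/a-38545.html">Sample Report (2010-8-18)</a></li>
  <li><a href="/VOA_Special_English/b-38546.html">Another Report (2010-8-18)</a></li>
</span>'''

parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

The same <name, url> dictionary drops out of BeautifulSoup with far less ceremony, which is why the article prefers it.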

二、Downloading the MP3s: urllib is used. To improve efficiency the download runs on multiple threads, and the Range parameter in the HTTP request header lets each thread fetch its own slice of the file, so the parts can be downloaded cooperatively.
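The range-splitting idea can be sketched as follows (Python 3; split_blocks is a hypothetical helper mirroring the SpliteBlocks function in the code below):

```python
def split_blocks(total_size, block_count):
    """Split [0, total_size) bytes into block_count inclusive (start, end)
    ranges, one per download thread."""
    block_size = total_size // block_count
    ranges = [(i * block_size, (i + 1) * block_size - 1)
              for i in range(block_count - 1)]
    # the last block absorbs the remainder
    ranges.append(((block_count - 1) * block_size, total_size - 1))
    return ranges

ranges = split_blocks(1000, 4)
print(ranges)  # → [(0, 249), (250, 499), (500, 749), (750, 999)]

# each thread then asks the server for only its slice:
for start, end in ranges:
    range_header = 'bytes=%d-%d' % (start, end)  # value of the Range header
```

The server must answer such a request with status 206 (Partial Content) for this to work; a server that ignores Range and returns the whole file would defeat the scheme.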

The code is as follows:

一、The multi-threaded download code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""It is a multi-thread downloading tool"""
import sys
import os
import time
import urllib
from threading import Thread

class MyWorkThread(Thread, urllib.FancyURLopener):
    """Multi-thread downloading class.

    run() is a virtual method of Thread.
    """
    def __init__(self, threadname, url, filename, ranges = 0):
        Thread.__init__(self, name = threadname)
        urllib.FancyURLopener.__init__(self)
        self.name = threadname
        self.url = url
        self.filename = filename
        self.ranges = ranges
        self.downloaded = 0

    def run(self):
        """virtual function in Thread"""
        try:
            self.downloaded = os.path.getsize(self.filename)
        except OSError:
            self.downloaded = 0
        # rebuild the start point from what is already on disk
        self.startpoint = self.ranges[0] + self.downloaded
        # if this part is already complete
        if self.startpoint >= self.ranges[1]:
            print 'Part %s has been downloaded over.' % self.filename
            return
        self.oneTimeSize = 8 * 1024  # 8K bytes per read
        print 'task %s will download from %d to %d' % (self.name, self.startpoint, self.ranges[1])
        self.addheader('Range', 'bytes=%d-%d' % (self.startpoint, self.ranges[1]))
        self.urlhandle = self.open(self.url)
        data = self.urlhandle.read(self.oneTimeSize)
        while data:
            filehandle = open(self.filename, 'ab+')
            filehandle.write(data)
            filehandle.close()
            self.downloaded += len(data)
            data = self.urlhandle.read(self.oneTimeSize)

def GetUrlFileSize(url):
    """Read the Content-Length header to get the size of the remote file."""
    urlHandler = urllib.urlopen(url)
    headers = urlHandler.info().headers
    length = 0
    for header in headers:
        if header.find('Length') != -1:
            length = int(header.split(':')[-1].strip())
    return length

def SpliteBlocks(totalsize, blocknumber):
    """Split totalsize bytes into blocknumber (start, end) ranges."""
    blocksize = totalsize / blocknumber
    ranges = []
    for i in range(0, blocknumber - 1):
        ranges.append((i * blocksize, i * blocksize + blocksize - 1))
    # the last block takes the remainder
    ranges.append((blocksize * (blocknumber - 1), totalsize - 1))
    return ranges

def isLive(tasks):
    for task in tasks:
        if task.isAlive():
            return True
    return False

def downLoadFile(url, output, blocks = 6):
    sys.stdout.write('Begin to download from %s\n' % url)
    sys.stdout.flush()
    size = GetUrlFileSize(url)
    ranges = SpliteBlocks(size, blocks)
    threadname = ["thread_%d" % i for i in range(0, blocks)]
    filename = ["tmpfile_%d" % i for i in range(0, blocks)]
    tasks = []
    for i in range(0, blocks):
        task = MyWorkThread(threadname[i], url, filename[i], ranges[i])
        task.setDaemon(True)
        task.start()
        tasks.append(task)
    time.sleep(2)
    while isLive(tasks):
        downloaded = sum([task.downloaded for task in tasks])
        process = downloaded / float(size) * 100
        show = u'\rFilesize: %d Downloaded: %d Completed: %.2f%%' % (size, downloaded, process)
        sys.stdout.write(show)
        sys.stdout.flush()
        time.sleep(1)
    # merge the temporary part files into the final output file
    output = formatFileName(output)
    filehandle = open(output, 'wb+')
    for i in filename:
        f = open(i, 'rb')
        filehandle.write(f.read())
        f.close()
        os.remove(i)
    filehandle.close()
    sys.stdout.write("Completed!\n")
    sys.stdout.flush()

def formatFileName(filename):
    """Strip characters that are illegal in file names."""
    if isinstance(filename, str):
        header, tail = os.path.split(filename)
        if tail != '':
            for char in ('\\', '/', ':', '*', '?', '"', '<', '>', '|'):
                if tail.find(char) != -1:
                    tail = tail.replace(char, '')
            filename = os.path.join(header, tail)
        return filename
    else:
        return 'None'

if __name__ == '__main__':
    url = r'http://www.51voa.com/path.asp?url=/201008/hennessy_africa_wildlife_18aug10-32b.mp3'
    output = r"D:\Voa\Study:'Shoot to Kill' Policy in Africa's Parks Abuses Human Rights.mp3"
    downLoadFile(url, output, blocks = 4)

二、The VOA page-parsing code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""File downloading from the web."""
import os
import re
import sys
import time
import urllib2
from BeautifulSoup import BeautifulSoup
import multiThreadDownloadTool

VOA_URL = r'http://www.51voa.com'
DOWNLOAD_DIR = r'D:/Voa'

def getURLContent(url):
    """Get the content of the url, starting at 'html' so that the
    DOCTYPE declaration is ignored (see the note below the code)."""
    file = urllib2.urlopen(url)
    data = file.read()
    file.close()
    index = data.find('html')
    data = data[index - 1:]
    return data

def getVOAURLs(content):
    """Find the VOA article urls in the content."""
    urls = {}
    soup = BeautifulSoup(content)
    divs = soup.findAll('div', {'id': 'rightContainer'})
    neededDiv = None
    if len(divs) >= 1:
        neededDiv = divs[0]
    if neededDiv != None:
        neededSpan = neededDiv.find('span', {'id': 'list'})
        lis = neededSpan.findAll('li')
        for li in lis:
            needAs = li.findAll('a')
            urls[needAs[-1].string] = VOA_URL + needAs[-1]['href']
    print "getVOAURLs() urls count is ", len(urls)
    return urls

def filterbyDate(urls, date):
    """Filter the urls by date; None means today."""
    neededURLs = {}
    currentDate = time.localtime(time.time())
    currentDateStr = "%s-%s-%s" % (currentDate.tm_year, currentDate.tm_mon, currentDate.tm_mday)
    if date != None:
        currentDateStr = date
    for url in urls.keys():
        name = url.strip()
        # the publish date is the last part of the name, wrapped in parentheses
        publishDate = name[-len(currentDateStr) - 1 : -1]
        if publishDate == currentDateStr:
            neededURLs[name] = urls[url]
            print 'find ', name
    print 'After filter, the count is ', len(neededURLs)
    return neededURLs

def findMP3FileInURL(url):
    """Find the MP3 file links in an article page."""
    print 'parse the content of ', url
    urls = []
    # regular expression matching the mp3 link path
    p = re.compile(r'/path.asp\?url=[-\w/]*\.mp3')
    content = getURLContent(url)
    matchLinks = p.findall(content)
    for link in matchLinks:
        tmp = VOA_URL + link
        if tmp not in urls:  # skip duplicates
            urls.append(tmp)
    print 'Current count of mp3 files is ', len(urls)
    return urls

def getHTMLFile(url, file_name):
    """Save the page at url to a local file (handy for debugging)."""
    ifile = urllib2.urlopen(url)
    content = ifile.read()
    local_file = open(file_name, 'w')
    local_file.write(content)
    local_file.close()

def downloadFile(url, fileName2Store):
    """Download a file from url and store it locally as fileName2Store
    (a single-threaded alternative to multiThreadDownloadTool)."""
    try:
        full_path = os.path.join(DOWNLOAD_DIR, fileName2Store)
        print 'begin to download url to ', full_path
        if os.path.isfile(full_path):
            print 'the file ', full_path, ' already exists, so just skip it!'
        else:
            print '\tDownloading the mp3 file...',
            data = urllib2.urlopen(url).read()
            print 'Done'
            print '\tWriting data into file...',
            f = open(full_path, 'wb')
            f.write(data)
            print 'Done'
            f.close()
    except Exception, ex:
        print 'some exceptions occur when downloading ', ex

if __name__ == "__main__":
    try:
        context = getURLContent(VOA_URL)
        print 'Begin to get download information, it may cost some minutes, please wait...'
        files2download = getVOAURLs(context)
        neededDownload = filterbyDate(files2download, None)
        neededDownloadMp3s = {}
        for name in neededDownload.keys():
            fullURL = neededDownload[name]
            formatedName = name[:-11].strip()  # drop the trailing '(date)' part
            mp3Names = findMP3FileInURL(fullURL)
            if len(mp3Names) == 1:
                # only one mp3 file on this page, so use the formatted name
                neededDownloadMp3s[formatedName] = mp3Names[0]
            else:
                # several mp3 files: name each one after its url
                for name in mp3Names:
                    print name
                    index_begin = name.rfind('/')
                    index_end = name.rfind('.')
                    tmpName = name[index_begin + 1 : index_end]
                    neededDownloadMp3s[tmpName] = name
        print 'Now, the mp3 files are:'
        print neededDownloadMp3s
        # download the files
        for filename in neededDownloadMp3s.keys():
            try:
                full_path = os.path.join(DOWNLOAD_DIR, filename) + r'.mp3'
                multiThreadDownloadTool.downLoadFile(neededDownloadMp3s[filename], full_path)
            except Exception, ex:
                print 'Some exceptions occur when downloading file from %s, exception messages are %s' % (neededDownloadMp3s[filename], ex)
    except Exception, ex:
        print 'Exception caught, tracebacks are :', sys.exc_info(), ex
    print 'download all completed!'
    raw_input("Press any key to continue...")

Things to note:

While parsing the HTML with BeautifulSoup, I found that it does not handle

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

very well and often fails to parse the page. As a workaround, the source is pre-processed first, and only the data between <html> and </html> is handed to BeautifulSoup. I have not yet worked out exactly why BeautifulSoup

trips over the DOCTYPE; if any reader knows, please tell me.
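The workaround can be sketched as a small helper (Python 3 for illustration; strip_to_html is a hypothetical name). Note that it anchors on the literal '<html' tag rather than the bare substring 'html', which would also match inside the DOCTYPE declaration itself:

```python
def strip_to_html(data):
    """Return the page content starting at the <html> tag, dropping the
    DOCTYPE declaration (and anything else) that precedes it."""
    index = data.find('<html')
    if index == -1:
        return data  # no <html> tag found; hand the page over unchanged
    return data[index:]

page = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
        '<html><body>hello</body></html>')
print(strip_to_html(page))  # → <html><body>hello</body></html>
```

The stripped string can then be fed straight to the parser, which never sees the DOCTYPE at all.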
