Batch-exporting Hive data to a local directory with Python

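Useful when you need to pull several days of data or serve an ad hoc data request. The HQL and the directory paths can be swapped out as needed, and the days are run in parallel to save time.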
/Users/nisj/PycharmProjects/BiDataProc/love/HiveRunData2LocalFile.py

# -*- coding=utf-8 -*-
import os
import datetime
import warnings
import time
import threadpool

warnings.filterwarnings("ignore")


def dateRange(beginDate, endDate):
    # Return every day from beginDate to endDate (inclusive) as 'YYYY-MM-DD' strings.
    dates = []
    dt = datetime.datetime.strptime(beginDate, "%Y-%m-%d")
    date = beginDate[:]
    while date <= endDate:
        dates.append(date)
        dt = dt + datetime.timedelta(1)
        date = dt.strftime("%Y-%m-%d")
    return dates


def hiveRunData2localFile(pt_day):
    # Run the query for one day and redirect the output to {pt_day}.txt.
    # pt_day7after is pt_day + 6 days, so the payment window spans 7 days.
    # '男' / '女' in the query are the output labels for male / female.
    os.system("""/usr/lib/hive-current/bin/hive -e " \
            with tab_pay_info as( \
            select '{pt_day}' runday,x.uid,sum(x.pay_amount) total_pay_amount,max(x.pay_amount) max_pay_amount \
            from (select pt_day,uid,amount pay_amount from oss_bi_all_chushou_pay_info where pt_day between '{pt_day}' and '{pt_day7after}' and state=0 \
            union all \
            select pt_day,uid,amount pay_amount from oss_bi_all_open_point_record where pt_day between '{pt_day}' and '{pt_day7after}' and state=0 and (key='alipay_cp' or key='tmall_pay') \
            ) x \
            group by x.uid \
            ), \
            tab_all_device as( \
            select pt_day,parms['uid'] uid,max(device_info['device_model']) device_model,max(parms['ip']) ip \
            from oss_bi_all_device_log \
            where pt_day='{pt_day}' \
            group by pt_day,parms['uid']) \
            select '{pt_day}' pt_day,a1.appkey,a1.appsource,a1.uid,a2.device_model,a2.ip, \
            substr(a3.id_card_num,7,8) birthday,2018-substr(a3.id_card_num,7,4) age,case when substr(a3.id_card_num,-2,1)%2=1 then '男' when substr(a3.id_card_num,-2,1)%2=0 then '女' end sex, \
            case when a4.uid is null then 0 else 1 end is_pay,a4.max_pay_amount,a4.total_pay_amount \
            from oss_bi_type_of_retention_user a1 \
            left join tab_all_device a2 on a1.uid=a2.uid and a1.pt_day=a2.pt_day \
            left join xxx_user_id_card a3 on a1.uid=a3.uid \
            left join tab_pay_info a4 on a1.uid=a4.uid and a1.pt_day=a4.runday \
            where a1.pt_day='{pt_day}' and a1.remain_type=1 and a1.remain_day_num=0 \
            ; \
            ">/home/hadoop/nisj/love/hive2Local/{pt_day}.txt """.format(pt_day=pt_day, pt_day7after=(datetime.datetime.strptime(pt_day, "%Y-%m-%d") + datetime.timedelta(6)).strftime('%Y-%m-%d')))


# # run serial batch
# for ptDay in dateRange(beginDate='2018-03-01', endDate='2018-03-01'):
#     print(ptDay)
#     hiveRunData2localFile(pt_day=ptDay)

# run parallel batch
now_time = time.strftime('%Y-%m-%d %X', time.localtime())
print("Current time: " + now_time)

runDay_list = dateRange(beginDate='2018-03-01', endDate='2018-03-31')
requests = []
request_hiveRunData2localFile_batchCtl = threadpool.makeRequests(hiveRunData2localFile, runDay_list)
requests.extend(request_hiveRunData2localFile_batchCtl)
main_pool = threadpool.ThreadPool(16)
[main_pool.putRequest(req) for req in requests]

if __name__ == '__main__':
    # Poll every 30 seconds until no results are pending.
    while True:
        try:
            time.sleep(30)
            main_pool.poll()
        except KeyboardInterrupt:
            print("**** Interrupted!")
            break
        except threadpool.NoResultsPending:
            break
    if main_pool.dismissedWorkers:
        print("Joining all dismissed worker threads...")
        main_pool.joinAllDismissedWorkers()

now_time = time.strftime('%Y-%m-%d %X', time.localtime())
print("Current time: " + now_time)
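The date handling in the script can also be written with date arithmetic instead of string comparison. The following is only a minimal sketch and not part of the original script; the names date_range and window_end are made up for illustration. It shows how the list of run days is expanded and that each day's payment window ends six days after the run day, which is the value the script passes as pt_day7after.

```python
import datetime

DATE_FMT = "%Y-%m-%d"


def date_range(begin_date, end_date):
    """Return every day from begin_date to end_date (inclusive) as strings."""
    start = datetime.datetime.strptime(begin_date, DATE_FMT).date()
    end = datetime.datetime.strptime(end_date, DATE_FMT).date()
    days = (end - start).days
    # range() is empty when end_date is before begin_date
    return [(start + datetime.timedelta(n)).strftime(DATE_FMT)
            for n in range(days + 1)]


def window_end(pt_day, length=7):
    """Return the last day of a `length`-day window that starts on pt_day."""
    start = datetime.datetime.strptime(pt_day, DATE_FMT)
    return (start + datetime.timedelta(length - 1)).strftime(DATE_FMT)


if __name__ == "__main__":
    print(date_range("2018-03-01", "2018-03-03"))
    print(window_end("2018-03-01"))
```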

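To run the batch in parallel, either install the threadpool module or copy the threadpool source script into the directory the batch runs from.
The resulting data files can be compressed before they are transferred, which makes the download faster.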
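If neither option is convenient, the standard library's concurrent.futures module (Python 3) also provides a thread pool. The sketch below is a swapped-in alternative and not what the original script does; run_batch and fake_job are hypothetical names, and fake_job only stands in for the real hive call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(days, run_one, max_workers=16):
    """Call run_one(day) for every day on a thread pool; return {day: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, day): day for day in days}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside run_one
            results[futures[future]] = future.result()
    return results


def fake_job(pt_day):
    """Placeholder job: a real job would launch the hive query for pt_day."""
    return "{}.txt".format(pt_day)


if __name__ == "__main__":
    print(run_batch(["2018-03-01", "2018-03-02"], fake_job, max_workers=2))
```

Unlike the polling loop in the script, leaving the with block waits for all submitted jobs, so no sleep interval is needed.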

cd love/
cd hive2Local/
zip 201803.zip 2018*
sz 201803.zip
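The zip step can also be done from Python, for example at the end of the batch script. This is a rough sketch using the standard zipfile module; zip_outputs is an illustrative name and does not appear in the original script.

```python
import glob
import os
import zipfile


def zip_outputs(src_dir, pattern, zip_name):
    """Pack the files in src_dir that match pattern into src_dir/zip_name."""
    zip_path = os.path.join(src_dir, zip_name)
    files = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            if os.path.abspath(path) == os.path.abspath(zip_path):
                continue  # an older archive may match the pattern; skip it
            archive.write(path, arcname=os.path.basename(path))
    return zip_path
```

Calling zip_outputs("love/hive2Local", "2018*", "201803.zip") mirrors the shell commands above, assuming the script is started from the same parent directory. The sz transfer still has to be run from the terminal.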
