Hive: a general method for deduplicating hundreds of millions of rows

A while ago I ran into a problem on a project at work: for every brand in each industry, I needed to deduplicate the uids and then count them.

The data comes to roughly 200 million rows after deduplication, and close to 300 million before.

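Approach 1: the most naive approach, count(distinct uid). This job takes about 3 hours to run.

Approach 2: deduplicate with group by. The performance is still poor.

Approach 3: use row_number() over(partition by uid order by uid desc) as rn and keep only rn=1. This does not help either.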

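General approach: split the job into 5 parts, i.e. run separate jobs for uid%5=0,1,2,3,4, then merge the results with union all. Because a given uid always falls into the same remainder bucket, no uid is counted in two buckets, so the per-bucket distinct counts can simply be summed. The job went from 3 hours down to 0.5 hours.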
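To see why the split does not change the result, here is a small sketch on toy data (the uid values are made up for illustration). It compares a single global distinct count with the sum of five per-bucket distinct counts, using plain shell tools rather than Hive.

```shell
#!/bin/bash
# Toy data: one uid per line, with duplicates (hypothetical values).
uids="101
102
103
101
207
102
310
207
415"

# Global distinct count: the equivalent of one big count(distinct uid).
global=$(printf '%s\n' "$uids" | sort -u | wc -l | tr -d ' ')

# Bucketed version: dedup each uid % 5 bucket on its own, then add up.
bucketed=0
for r in 0 1 2 3 4; do
    n=$(printf '%s\n' "$uids" | awk -v r="$r" '$1 % 5 == r' | sort -u | wc -l | tr -d ' ')
    bucketed=$((bucketed + n))
done

echo "global=$global bucketed=$bucketed"
```

Both numbers agree because duplicates of a uid can only ever meet inside one bucket. The same argument holds per industry_id, which is why the final merge can use sum(uv).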

Code: launch five jobs like the one below, one for each of uid%5=0,1,2,3,4, writing to wb_ad_brand_industry_count_temp1, 2, 3, 4 and 5 respectively.

#!/bin/bash
source /usr/local/jobclient/config/.hive_config.sh
source /usr/local/jobclient/lib/source $0 $1
source /usr/local/jobclient/demo/execute_modular.sh $work_log_notice
source ./mysql_comm.sh
source ./date_comm.sh
if [ $? -ne 0 ]
then
    exit 255
fi
workpath=$(dirname $(dirname $0))
# hive parameters
hive_db='default'
cust_mds_user_dis_info='cust_mds_user_dis_info'
wb_ad_brand_cust_industry_map='wb_ad_brand_cust_industry_map'
# final table name (defined as in the merge script); used here only in log messages
wb_ad_brand_industry_count='wb_ad_brand_industry_count'

function make_brand_industry_count {
    write_log "[INFO] make_brand_industry_count "
    write_log "create  hive table $wb_ad_brand_industry_count .........begin"
    # Bucket uid%5=0 goes to temp1; change the remainder and the table suffix for the other four jobs.
    # The inner group by (industry_id, uid) deduplicates; the outer count(1) counts distinct uids per industry.
    sql="use default;
    DROP TABLE IF EXISTS wb_ad_brand_industry_count_temp1;
    create table wb_ad_brand_industry_count_temp1 stored as textfile as
    select aa.industry_id, count(1) as uv
    from (
        select a.industry_id, b.uid
        from (
            select aa.industry_id, aa.cust_uid
            from wb_ad_brand_cust_industry_map aa
            join wb_ad_industry_white_list bb on aa.industry_id = bb.industry_id
        ) a
        join (
            select cust_uid, uid
            from cust_mds_user_dis_info
            where dt = $dt and uid % 5 = 0
        ) b on a.cust_uid = b.cust_uid
        group by a.industry_id, b.uid
    ) aa
    group by aa.industry_id"
    write_log "$sql"
    hive -e "$sql"
    if [ $? -ne 0 ]
    then
        write_log "create table $wb_ad_brand_industry_count fails"
        exit 255
    fi
}

write_log "make_brand_industry_count start"
make_brand_industry_count
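Rather than keeping five hand-edited copies of this script, the per-bucket statement could be generated in a loop. The following is only a sketch: build_bucket_sql is a hypothetical helper, the dt default is a placeholder (the original script presumably gets dt from its sourced helpers), and the loop only prints the statements instead of calling hive.

```shell
#!/bin/bash
# Placeholder date; in the real job dt comes from the sourced helper scripts.
dt=${dt:-20200101}
buckets=5

# Print the CTAS statement for one remainder r (0..4).
# The table suffix is r+1, matching temp1..temp5 in the article.
build_bucket_sql() {
    local r=$1
    local t="wb_ad_brand_industry_count_temp$((r + 1))"
    cat <<EOF
use default;
DROP TABLE IF EXISTS $t;
create table $t stored as textfile as
select aa.industry_id, count(1) as uv
from (
    select a.industry_id, b.uid
    from (
        select m.industry_id, m.cust_uid
        from wb_ad_brand_cust_industry_map m
        join wb_ad_industry_white_list w on m.industry_id = w.industry_id
    ) a
    join (
        select cust_uid, uid
        from cust_mds_user_dis_info
        where dt = $dt and uid % $buckets = $r
    ) b on a.cust_uid = b.cust_uid
    group by a.industry_id, b.uid
) aa
group by aa.industry_id;
EOF
}

for r in 0 1 2 3 4; do
    sql=$(build_bucket_sql "$r")
    # Printed for inspection only; a real job would pass "$sql" to hive -e.
    echo "$sql"
done
```

To actually run the buckets side by side, each statement could be passed to hive -e in the background and followed by a wait, with the exit status of every job checked before the merge step starts.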

Final merge:

#!/bin/bash
source /usr/local/jobclient/config/.hive_config.sh
source /usr/local/jobclient/lib/source $0 $1
source /usr/local/jobclient/demo/execute_modular.sh $work_log_notice
source ./mysql_comm.sh
source ./date_comm.sh
if [ $? -ne 0 ]
then
    exit 255
fi
workpath=$(dirname $(dirname $0))
# hive parameters
hive_db='default'
cust_mds_user_dis_info='cust_mds_user_dis_info'
wb_ad_brand_cust_industry_map='wb_ad_brand_cust_industry_map'
wb_ad_brand_industry_count='wb_ad_brand_industry_count'

function make_brand_industry_count {
    write_log "[INFO] make_brand_industry_count "
    write_log "create  hive table $wb_ad_brand_industry_count .........begin"
    # Each temp table holds per-industry distinct uid counts for one bucket; summing them gives the total.
    sql="use default;
    insert overwrite table $wb_ad_brand_industry_count partition(dt=$dt)
    select industry_id, sum(uv)
    from (
        select * from wb_ad_brand_industry_count_temp1
        UNION ALL
        select * from wb_ad_brand_industry_count_temp2
        UNION ALL
        select * from wb_ad_brand_industry_count_temp3
        UNION ALL
        select * from wb_ad_brand_industry_count_temp4
        UNION ALL
        select * from wb_ad_brand_industry_count_temp5
    ) a
    group by industry_id;"
    write_log "$sql"
    hive -e "$sql"
    if [ $? -ne 0 ]
    then
        write_log "create table $wb_ad_brand_industry_count fails"
        exit 255
    fi
}

write_log "make_brand_industry_count start"
make_brand_industry_count
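If the bucket count ever changes, the merge statement has to change with it. As a sketch under the same assumptions as the script above, the UNION ALL list could be built from the bucket count. Here build_merge_sql is a hypothetical helper, and the target and dt defaults are placeholders for the script's own variables.

```shell
#!/bin/bash
# Placeholders standing in for the variables the real script defines or sources.
target=${wb_ad_brand_industry_count:-wb_ad_brand_industry_count}
dt=${dt:-20200101}

# Print the merge statement for n per-bucket temp tables (temp1..tempN).
build_merge_sql() {
    local n=$1 i union=""
    for ((i = 1; i <= n; i++)); do
        # Join the per-bucket selects with UNION ALL.
        if [ -n "$union" ]; then
            union="$union UNION ALL "
        fi
        union="${union}select * from wb_ad_brand_industry_count_temp$i"
    done
    echo "use default; insert overwrite table $target partition(dt=$dt) select industry_id, sum(uv) from ($union) a group by industry_id;"
}

merge_sql=$(build_merge_sql 5)
echo "$merge_sql"
```

The number passed to build_merge_sql must match the modulus used when the temp tables were created. Otherwise some buckets would be missing from the sum.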
