hive统计每日的活跃用户和新用户sql开发（附shell脚本）

假如有一个web系统，每天生成以下日志文件：

2020年12月21日数据
192.228.33.6,hunter,2020-12-21 10:30:20,/a
192.228.33.7,hunter,2020-12-21 10:30:26,/b
192.228.33.6,jack,2020-12-21 10:30:27,/a
192.228.33.8,tom,2020-12-21 10:30:28,/b
192.228.33.9,rose,2020-12-21 10:30:30,/b
192.228.33.10,julia,2020-12-21 10:30:40,/c2020年12月22日数据
192.228.33.22,hunter,2020-12-22 10:30:20,/a
192.228.33.18,jerry,2020-12-22 10:30:30,/b
192.228.33.26,jack,2020-12-22 10:30:40,/a
192.228.33.18,polo,2020-12-22 10:30:50,/b
192.228.33.39,nissan,2020-12-22 10:30:53,/b
192.228.33.39,nissan,2020-12-22 10:30:55,/a
192.228.33.39,nissan,2020-12-22 10:30:58,/c
192.228.33.20,ford,2020-12-22 10:30:54,/c2020年12月23日数据
192.228.33.46,hunter,2020-12-23 10:30:21,/a
192.228.43.18,jerry,2020-12-23 10:30:22,/b
192.228.43.26,tom,2020-12-23 10:30:23,/a
192.228.53.18,bmw,2020-12-23 10:30:24,/b
192.228.63.39,benz,2020-12-23 10:30:25,/b
192.228.33.25,haval,2020-12-23 10:30:30,/c
192.228.33.10,julia,2020-12-23 10:30:40,/c

需求

1.建立一个表，来存储每天新增的数据(分区表)
2.统计每天的活跃用户(日活)（需要用户的ip，用户的账号，用户访问时间最早的一条url和时间）
3.统计每天的新增用户(日新)

实现

建表(存储数据的分区表)

create table t_web_log(ip string,uid string,access_time string,url string)
partitioned by (day string)
row format delimited fields terminated by ',';

导数据

load data local inpath '/root/hivetest/log.21' into table t_web_log partition (day='2020-12-21');
load data local inpath '/root/hivetest/log.22' into table t_web_log partition (day='2020-12-22');
load data local inpath '/root/hivetest/log.23' into table t_web_log partition (day='2020-12-23');

show partitions t_web_log;(查看分区表)

指标统计

每日活跃用户统计

1.建一个保存日活数据的表

create table t_user_active_day(ip string,uid string,first_access string,url string) partitioned by (day string);

2.从日志表中查出日活数据并插入日活数据表

插入21日数据
insert into table t_user_active_day partition(day='2020-12-21')
select ip,uid,access_time,url
from
(
select ip,uid,access_time,url,
row_number() over(partition by uid order by access_time) as rn
from t_web_log
where day='2020-12-21') tmp
where rn=1;插入22日数据
insert into table t_user_active_day partition(day='2020-12-22')
select ip,uid,access_time,url
from
(
select ip,uid,access_time,url,
row_number() over(partition by uid order by access_time) as rn
from t_web_log
where day='2020-12-22') tmp
where rn=1;插入23日数据
insert into table t_user_active_day partition(day='2020-12-23')
select ip,uid,access_time,url
from
(
select ip,uid,access_time,url,
row_number() over(partition by uid order by access_time) as rn
from t_web_log
where day='2020-12-23') tmp
where rn=1;

每日新用户统计

思路——将当日活跃用户跟历史用户表关联，找出那些在历史用户表中尚不存在的用户

1.建历史用户表

create table t_user_history(uid string);

2.建一个存放新用户的表

create table t_user_new_day like t_user_active_day;

3.求出每日的新用户并把数据插入新用户表

21日新用户
insert into table t_user_new_day partition (day='2020-12-21')
select ip,uid,first_access,url
from
(
select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from t_user_active_day a
left join t_user_history b on a.uid=b.uid
where a.day='2020-12-21') tmp
where tmp.b_uid is null;22日新用户
insert into table t_user_new_day partition (day='2020-12-22')
select ip,uid,first_access,url
from
(
select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from t_user_active_day a
left join t_user_history b on a.uid=b.uid
where a.day='2020-12-22') tmp
where tmp.b_uid is null;23日新用户
insert into table t_user_new_day partition (day='2020-12-23')
select ip,uid,first_access,url
from
(
select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from t_user_active_day a
left join t_user_history b on a.uid=b.uid
where a.day='2020-12-23') tmp
where tmp.b_uid is null;

4.将每日的新用户插入历史表

21日数据
insert into table t_user_history
select uid
from t_user_new_day
where day='2020-12-21';22日数据
insert into table t_user_history
select uid
from t_user_new_day
where day='2020-12-22';23日数据
insert into table t_user_history
select uid
from t_user_new_day
where day='2020-12-23';

加餐

清空表

truncate table t_user_active_day;

修改Linux系统时间

date -s '2020-12-20 00:20:00'

统计日新、日活的shell脚本

vi user_etl.sh

#!/bin/bash
day_str=`date -d '-1 day'+'%Y-%m-%d'`echo "准备处理$day_str的数据......"
hive_exec=/root/apps/hive-1.2.1/bin/hiveHQL_user_active_day="
insert into table exercise.t_user_active_day partition(day=\"$day_str\")
select ip,uid,access_time,url
from
(
select ip,uid,access_time,url,
row_number() over(partition by uid order by access_time) as rn
from t_web_log
where day=\"$day_str\") tmp
where rn=1;
"
$hive_exec "$HQL_user_active_day"HQL_user_new_day="
insert into table t_user_new_day partition (day=\"$day_str\")
select ip,uid,first_access,url
from
(
select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from t_user_active_day a
left join t_user_history b on a.uid=b.uid
where a.day=\"$day_str\") tmp
where tmp.b_uid is null;
"
$hive_exec "$HQL_user_new_day"HQL_new_to_history="
insert into table t_user_history
select uid
from t_user_new_day
where day=\"$day_str\;
"
$hive_exec "$HQL_new_to_history"

运行：sh user_etl.sh

hive统计每日的活跃用户和新用户sql开发（附shell脚本）相关推荐

安全狗“老用户推荐新用户”有奖活动进行中最高IPhone 4S手机
2019独角兽企业重金招聘Python工程师标准>>> 号外!号外!为回馈广大用户对安全狗长期以来的大力支持与宣传,即日起,我们将启动"老客户介绍新客户"安装安全 ...
LINUX--创建新用户为新用户设置权限
文章目录 [一张图总结] [详细说明] 1.登录root 2.新建用户并创建家目录 3.更改为bash命令 4.设置密码 5.设置sudo权限 [关于本文Linux命令的说明] 1.`useradd ...
用户管理（一）：使用shell脚本批量添加指定数量的用户
运行环境 CentOS 6.9.Xshell 5 前言我们在需要创建多个用户的时候,使用手工单独创建是比较耗费精力的事情,我们可以通过shell脚本实现批量添加用户,实现指定数量用户.创建用户默认名 ...
网站的活跃用户与流失用户
网站用户管理的目标是发掘新用户,保留老用户.但仅仅吸引新用户还不错,还需要保持新用户的活跃度,使其能持久地为网站创造价值:而一旦用户的活跃度下降,很可能用户就会渐渐地远离网站,进而流失.所以基于此,我 ...
全媒体运营师胡耀文教你：从Aha时刻来看新用户激活
Aha时刻是新用户第一次认识到产品的价值,脱口而出"啊哈,原来这个产品可以帮我做这个".当花了大力气吸引来了一波新用户时,如何帮助留存转化,就要依靠Aha时刻. 2008年,当你宅 ...
每个日期新用户的次日留存率
SQL练习题:网站每天有很多人登录,请你统计一下牛客每个日期新用户的次日留存率. 题目: 牛客每天有很多人登录,请你统计一下牛客每个日期新用户的次日留存率. 有一个登录(login)记录表,简况如下: ...
airbnb_airbnb新用户预订kaggle竞赛
airbnb Instead of waking to overlooked "Do not disturb" signs, Airbnb travelers find thems ...
Linux下对文件的操作及添加新用户
Linux下对文件的操作及添加新用户一.对文件的操作 1.打包压缩文件 2.解压缩文件 3.对文件操作的其他命令二.创建新用户一.对文件的操作 1.打包压缩文件 2.解压缩文件 3.对文件操作的 ...
Linux系统管理技术手册——第6章添加新用户
6.1/etc/passwd文件用户登录时Linux识别用户的文件/etc/passwd /etc/passwd包括7个字段: 登录名(不超过32位,使用NIS系统后不超过8位) 经过加密的口令或口 ...