hive 常用函数操作

一 hive mysql create table as/like区别

mysql

-- 没有开启gtid的情况下,不拷贝数据，只创建一模一样的表结构，包括索引约束等，结合insert语句可以实现复制一个表的结构和数据的目的
create table tbl_test_bak like  tbl_test;
insert into tbl_test_bak select * from tbl_test;-- 以下方式也可以创建表结构,包含数据，但是没有索引约束等，所以不推荐再使用了。而且，在开启gtid情况下，会报语法错误
create table tbl_test_bak as select * from tbl_test;

hive

create table as

CTAS建表语句（CREATE TABLE AS SELECT）
使用查询创建并填充表，select中选取的列名会作为新表的列名（所以通常是要取别名）

1)会改变表的属性、结构，比如只能是内部表、分区分桶也没了

2)目标表不允许使用分区分桶的，FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
3)对于旧表中的分区字段，如果通过select * 的方式，新表会把它看作一个新的字段，这里要注意
目标表不允许使用外部表，如create external table … as select…报错 FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot create external table
4)CTAS创建的表存储格式会变成默认的格式TEXTFILE
5)对了，还有字段的注释comment也会丢掉，同时新表也无法加上注释
6)但可以在CTAS语句中指定表的存储格式，行和列的分隔符等

create table xxx as select ...create table xxxrow format delimitedfields terminated by ' 'stored as parquet
as
select ...

create table like

如何快速复制一张分区表？完全复制备份表。

方法1

create table... like...+ insert into table ... partition(xxx)...select...

create [external] table partition_test like old_table;
insert into table partition_test
partition(dt)
select
trim(userid) userid,
trim(cardno) cardno,
if(lower(trim(flag)) in ("true","1","01"),"1","0") flag,
substr(cardtype,1,1) cardtype,
trim(times) times,
substr(times,0,10) dt
from old_table
order by rand()
limit 100;

方法2

或者通过hdfs复制数据并修复新表的分区相关元数据

create table newtable like oldtable
hdfs dfs -cp /old_table/ /path/new_table/
msck repair table newtable

二几个sql题目的网址

https://www.cnblogs.com/qingyunzong/p/8747656.html

1 列转行。行转列

explode就是将hive一行中复杂的array或者map结构拆分成多行。

select explode(split('a,b,c,d,e',','));
--------------------------------------------------------------------------------
OK
a
b
c
d
e
Time taken: 19.317 seconds, Fetched: 5 row(s)

Lateral View一般与用户自定义表生成函数（如explode()）结合使用。如内置表生成函数中所述，UDTF为每个输入行生成零个或多个输出行。 Lateral View 首先将UDTF应用于基表的每一行，然后将结果输出行连接到输入行，以形成具有提供的表别名的虚拟表。

select 'test' a,split('a,b,c,d,e',',') b
--------------------------------------------------------------------------------
OK
test    ["a","b","c","d","e"]
Time taken: 10.905 seconds, Fetched: 1 row(s)with test as (
select 'test' a,split('a,b,c,d,e',',') b
)
select a,adid from test LATERAL VIEW explode(b) adTable AS adid;
--------------------------------------------------------------------------------
OK
test    a
test    b
test    c
test    d
test    e
Time taken: 6.063 seconds, Fetched: 5 row(s)

正式的行列转换的例子

hive> select * from col_lie limit 10;
OK
col_lie.user_id    col_lie.order_id
104399    1715131
104399    2105395
104399    1758844
104399    981085
104399    2444143
104399    1458638
104399    968412
104400    1609001
104400    2986088
104400    1795054行转列#########################
hive>
create table lie_col as
select user_id,
concat_ws(',',collect_list(order_id)) as order_value
from col_lie
group by user_id;--------------------------------------------------------------------------------
OK
104399  1715131,2105395,1758844,981085,2444143,1458638,968412,1715131,2105395,1758844,981085,2444143,1458638,968412
104400  1609001,2986088,1795054,1609001,2986088,1795054
Time taken: 23.739 seconds, Fetched: 2 row(s)列转行###################33
hive>
select user_id,order_id
from lie_col
lateral view explode(split(order_value,',')) num as order_id;
--------------------------------------------------------------------------------
OK
104399  1715131
104399  2105395
104399  1758844
104399  981085
104399  2444143
104399  1458638
104399  968412
104399  1715131
104399  2105395
104399  1758844
104399  981085
104399  2444143
104399  1458638
104399  968412
104400  1609001
104400  2986088
104400  1795054
104400  1609001
104400  2986088
104400  1795054
Time taken: 8.171 seconds, Fetched: 20 row(s)

2 窗口函数（主要用途是筛选没有出现在group by后面的字段的需求）

over()指定分析函数工作的数据窗口大小，这个窗口大小可能随着行的变化而变化

current row:当前行

n preceding 往前n行数据

n following 往后n行数据

unbounded 起点

unbounded preceding 从前面的起点

unbounded following 到后面的终点

lag(col,n)往前第n行数据

lead(col,n) 往后第n行数据

ntile(n) 把有序分区中的行分发到指定数据的组中，各个组有编号，编号从1开始，对于每一行，ntile返回此行所属的组的编号。

例子：

表student中的数据格式如下：

name month degree

s1 201801 A
s1 201802 A
s1 201803 C
s1 201804 A
s1 201805 A
s1 201806 A
s2 201801 A
s2 201802 B
s2 201803 C
s2 201804 A
s2 201805 D
s2 201806 A
s3 201801 C
s3 201802 A
s3 201803 A
s3 201804 A
s3 201805 B
s3 201806 A

现要查询表中连续三个月以上degree均为A的记录

selecta1.name,a1.month,a1.degree
from
(selectname,month,degree,sum(if(degree = 'A', 1, 0)) OVER(PARTITION BY name ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS score1,sum(if(degree = 'A', 1, 0)) OVER(PARTITION BY name ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 following) AS score2,sum(if(degree = 'A', 1, 0)) OVER(PARTITION BY name ORDER BY month ROWS BETWEEN  CURRENT ROW AND 2 following) AS score3from student
) as a1
wherea1.score1 = 3 ora1.score2 = 3 ora1.score3 = 3

例子2

两者效果是一样的，窗口是开在了group by聚合之后的结果集，临时表上。

select a.name, count(*) over ()
from
(
select name
from business
where substr(orderdate,1,7)="2017-01"
group by name) a;--------------------------------------------------------------------------------
OK
andy    4
jack    4
lucy    4
tom 4
Time taken: 7.586 seconds, Fetched: 4 row(s)select name, count(*) over()
from business
where substr(orderdate,1,7)="2017-01"
group by name;--------------------------------------------------------------------------------
OK
andy    4
jack    4
lucy    4
tom 4
Time taken: 7.586 seconds, Fetched: 4 row(s)

例子3

over(partition by uname order by umonth) 窗口函数里面加order by 跟不加有什么变化。

经过测试 order by 相当于后面再加上rows between unbounded preceding and current row，感觉两者结合效果是一样的。

order by;

order by rows between unbounded preceding and current row;

###背景如下：select
uname,
umonth,
sum(ucount) mncounts
from t_access
group by uname,umonth；--------------------------------------------------------------------------------
OK
A   2015-01 33
A   2015-02 10
A   2015-03 38
B   2015-01 30
B   2015-02 15
B   2015-03 34
Time taken: 17.804 seconds, Fetched: 6 row(s)###第一次计算
with t1 as (
select
uname,
umonth,
sum(ucount) mncounts
from t_access
group by uname,umonth
)
select
uname,
umonth,
mncounts,
max(mncounts) over(partition by uname rows between unbounded preceding and current row),
sum(mncounts) over(partition by uname order by umonth) totalcounts
#######上面这句换成这句效果一样sum(mncounts) over(partition by uname  order by umonth rows #######   between unbounded preceding and current row) totalcounts
from
t1;
--------------------------------------------------------------------------------
OK
A   2015-01 33  33  33
A   2015-02 10  33  43
A   2015-03 38  38  81
B   2015-01 30  30  30
B   2015-02 15  30  45
B   2015-03 34  34  79
Time taken: 14.4 seconds, Fetched: 6 row(s)####注意这次窗口函数没有加order by umonth
with t1 as (
select
uname,
umonth,
sum(ucount) mncounts
from t_access
group by uname,umonth
)
select
uname,
umonth,
mncounts,
max(mncounts) over(partition by uname rows between unbounded preceding and current row),
sum(mncounts) over(partition by uname) totalcounts
from
t1;--------------------------------------------------------------------------------
OK
A   2015-01 33  33  81
A   2015-02 10  33  81
A   2015-03 38  38  81
B   2015-01 30  30  79
B   2015-02 15  30  79
B   2015-03 34  34  79
Time taken: 22.425 seconds, Fetched: 6 row(s)

例子4

分析顾客上次购物的时间？

=====================背景=====================
hive> select * from business;
OK
jack    2017-01-01  10
jack    2017-01-02  10
tom 2017-01-01  10
tom 2017-01-02  10
tom 2017-01-03  10
andy    2017-01-01  10
andy    2017-01-02  10
andy    2017-01-03  10
lucy    2017-01-01  10
lucy    2017-01-02  10
lucy    2017-01-03  10
jack    2017-01-03  10
jack    2017-02-01  10
jack    2017-02-02  10
tom 2017-02-01  10
tom 2017-02-02  10
tom 2017-02-03  10
andy    2017-02-01  10
Time taken: 0.071 seconds, Fetched: 18 row(s)===================================正式计算============================
select *, lag(orderdate,1) over(distribute by name sort by orderdate) from business;
--------------------------------------------------------------------------------
OK
andy    2017-01-01  10  NULL
andy    2017-01-02  10  2017-01-01
andy    2017-01-03  10  2017-01-02
andy    2017-02-01  10  2017-01-03
jack    2017-01-01  10  NULL
jack    2017-01-02  10  2017-01-01
jack    2017-01-03  10  2017-01-02
jack    2017-02-01  10  2017-01-03
jack    2017-02-02  10  2017-02-01
lucy    2017-01-01  10  NULL
lucy    2017-01-02  10  2017-01-01
lucy    2017-01-03  10  2017-01-02
tom 2017-01-01  10  NULL
tom 2017-01-02  10  2017-01-01
tom 2017-01-03  10  2017-01-02
tom 2017-02-01  10  2017-01-03
tom 2017-02-02  10  2017-02-01
tom 2017-02-03  10  2017-02-02
Time taken: 9.021 seconds, Fetched: 18 row(s)