精选——Hive十道面试题(下)

1.求出要求格式的数据

数据:

字段含义：
日期，MAC，颜色，销量
day,mac,color,num

20171011 1292    金色  1
20171011    1292    金色  14
20171011    1292    金色  2
20171011    1292    金色  11
20171011    1292    黑色  2
20171011    1292    粉金  58
20171011    1292    金色  1
20171011    2013    金色  10
20171011    2013    金色  9
20171011    2013    金色  2
20171011    2013    金色  1
20171012    1292    金色  5
20171012    1292    金色  7
20171012    1292    金色  5
20171012    1292    粉金  1
20171012    2013    粉金  1
20171012    2013    金色  6
20171013    1292    黑色  1
20171013    2013    粉金  2
20171011    12460   茶花金 1

求出来的结果要是这样

day      mac color   num nums
20171011    12460   茶花金 1   1
20171011    1292    金色  1   89
20171011    1292    金色  14  89
20171011    1292    金色  2   89
20171011    1292    金色  11  89
20171011    1292    黑色  2   89
20171011    1292    粉金  58  89
20171011    1292    金色  1   89
20171011    2013    金色  10  22
20171011    2013    金色  9   22
20171011    2013    金色  2   22
20171011    2013    金色  1   22
20171012    1292    金色  5   18
20171012    1292    金色  7   18
20171012    1292    金色  5   18
20171012    1292    粉金  1   18
20171012    2013    粉金  1   7
20171012    2013    金色  6   7
20171013    1292    黑色  1   1
20171013    2013    粉金  2   2

建表与导入数据

create table if not exists interview06(day string,mac string,color string,num int) row format delimited fields terminated by "\t";
load data local inpath "/home/hadoop/hive_data/interview06" into table interview06;

分析题目可知我们要求的列是当天某种型号的手机的总销售量

辅助SQL：

先按型号与日期分组，求出总销售量

select day,mac,sum(num)nums
from interview06
group by mac,day

最终SQL：

然后直接将上表与原表关联，注意这里有两个关联条件

select a.day,a.mac,color,num,nums
from interview06 a join
(select day,mac,sum(num)nums
from interview06
group by mac,day)b on a.mac = b.mac and a.day=b.day;

结果与题目给的需求相同

2.编写一个SQL实现查找所有至少连续三次出现的数字

数据

建表与导入数据

create table interview07(id int, number int) row format delimited fields terminated by "\t";
load data local inpath "/home/hadoop/hive_data/interview07" into table interview07;
select * from interview07;

这一题，就是让我们把当前行和后两行的相同的数据放置在同一行中，比如有abc三个数据，他们的id为123，我们可以加b的id减1与a比较，c的id在b的基础上减减1，如果这两个条件同时满足，这三个数就一定是相等的

辅助SQL：

select * from interview07 a join interview07 b on a.id = b.id - 1 join interview07 c on b.id = c.id - 1;

结果

a.id    a.number        b.id    b.number        c.id    c.number
1       1       2       1       3       1
2       1       3       1       4       2
3       1       4       2       5       1
4       2       5       1       6       2
5       1       6       2       7       2
6       2       7       2       8       3
7       2       8       3       9       3
8       3       9       3       10      3
9       3       10      3       11      3
10      3       11      3       12      4

最终SQL：

select distinct a.number counts
from interview07 a join interview07 b on a.id=b.id-1
join interview07 c on b.id=c.id-1
where a.number=b.number and b.number = c.number;

结果

counts
1
3

3.有一张hive表包含两列，分别是学生姓名name(类型string)，学生成绩score(类型map<string,int>)，成绩列中key是课程名称，例如语文、数学等，value是对应课程分数(0-100)，请用一个hql求一下每个学生成绩最好的课程及分数、最差的课程及分数

表名称：student_score
字段：
name,score
数据格式：

huangbo  yuwen:80,shuxue:89,yingyu:95
xuzheng yuwen:70,shuxue:65,yingyu:81
wangbaoqiang    yuwen:75,shuxue:100,yingyu:76

建表与导入数据

create table if not exists interview08(name string, score map<string,int>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
load data local inpath "/home/hadoop/hive_data/interview08" into table interview08;

辅助SQL：

首先我们得使用爆炸函数将map集合中的元素炸裂开来，然后求出最高分与最低分

select name,max(b.score)maxscore,min(b.score)minscore
from interview08 a lateral view explode(a.score)b as course,score group by name;

结果

name    maxscore        minscore
huangbo 95      80
wangbaoqiang    100     75
xuzheng 81      65

这里再次使用炸裂函数将信息分散开来，这样方便与上表进行关联

select name,b.course,b.score
from interview08 a lateral view explode(a.score) b as course,score;

结果

name    b.course        b.score
huangbo yuwen   80
huangbo shuxue  89
huangbo yingyu  95
xuzheng yuwen   70
xuzheng shuxue  65
xuzheng yingyu  81
wangbaoqiang    yuwen   75
wangbaoqiang    shuxue  100
wangbaoqiang    yingyu  76

最终SQL：

将上面两表进行关联，关联条件为姓名与分数对应的，将结果映射在一个表中，然后直接查出我们要的字段

select a.name,b.score,b.course
from
(select name,max(b.score)maxscore,min(b.score)minscore
from interview08 a lateral view explode(a.score)b as course,score group by name)a
join (select name,b.course,b.score
from interview08 a lateral view explode(a.score) b as course,score)b on a.name=b.name
and (a.maxscore=b.score or a.minscore=b.score);

最终结果：

a.name  b.score b.course
huangbo 80      yuwen
huangbo 95      yingyu
wangbaoqiang    75      yuwen
wangbaoqiang    100     shuxue
xuzheng 65      shuxue
xuzheng 81      yingyu

4.统计每个月每天出现的情况，比如下面两串数字

0100010101000101010100101010101 这串数字一共31位的，每一位代表某个月的某一天，如电信的号码某一天有通话记录就置成1，没有为0
1101010101010101010100101010100 这串数字一共31位的，每一位代表某个月的某一天，如电信的号码某一天有使用流量记录就置成1，没有为0

直接使用自定义函数实现

package udf;import org.apache.hadoop.hive.ql.exec.UDF;/*** 自定义类*/
public class MyUDF3 extends UDF {public static String evaluate(String s1, String s2) {char[] c1 = s1.toCharArray();char[] c2 = s2.toCharArray();char[] c3 = new char[c1.length];for (int i = 0; i < c1.length; i++) {if (c1[i] == '1' || c2[i] == '1')c3[i] = '1';elsec3[i] = '0';}String s = "";for (int i = 0; i < c3.length; i++) {s += c3[i];}return s;}/*public static void main(String[] args) {String s1 = "0100010101000101010100101010101";String s2 = "1101010101010101010100101010100";System.out.println(evaluate(s1, s2));}*/
}

首先自行打包，将jar包上传到linux

然后将jar添加到hive中

add jar /home/hadoop/hive_data/test.jar;

创建一个临时函数

create temporary function res as "udf.MyUDF3";

直接使用

select res("0100010101000101010100101010101","1101010101010101010100101010100");

结果

1101010101010101010100101010101

5.假设有表（日期，产品id，产品当日收入，产品当日成本），日期和产品id是组合主键，有若干条数据，日期范围2016年1月1日至今，且一定每个产品，每天都有数据，写出SQL实现如下要求

数据

2018-03-01,a,3000,2500
2018-03-01,b,4000,3200
2018-03-01,c,3200,2400
2018-03-01,d,3000,2500
2018-03-02,a,3000,2500
2018-03-02,b,1500,800
2018-03-02,c,2600,1800
2018-03-02,d,2400,1000
2018-03-03,a,3100,2400
2018-03-03,b,2500,2100
2018-03-03,c,4000,1200
2018-03-03,d,2500,1900
2018-03-04,a,2800,2400
2018-03-04,b,3200,2700
2018-03-04,c,2900,2200
2018-03-04,d,2700,2500
2018-03-05,a,2700,1000
2018-03-05,b,1800,200
2018-03-05,c,5600,2200
2018-03-05,d,1200,1000
2018-03-06,a,2900,2500
2018-03-06,b,4500,2500
2018-03-06,c,6700,2300
2018-03-06,d,7500,5000
2018-04-01,a,3000,2500
2018-04-01,b,4000,3200
2018-04-01,c,3200,2400
2018-04-01,d,3000,2500
2018-04-02,a,3000,2500
2018-04-02,b,1500,800
2018-04-02,c,4600,1800
2018-04-02,d,2400,1000
2018-04-03,a,6100,2400
2018-04-03,b,4500,2100
2018-04-03,c,6000,1200
2018-04-03,d,3500,1900
2018-04-04,a,2800,2400
2018-04-04,b,3200,2700
2018-04-04,c,2900,2200
2018-04-04,d,2700,2500
2018-04-05,a,4700,1000
2018-04-05,b,3800,200
2018-04-05,c,5600,2200
2018-04-05,d,5200,1000
2018-04-06,a,2900,2500
2018-04-06,b,4500,2500
2018-04-06,c,6700,2300
2018-04-06,d,7500,5000

建表与导入数据

create table if not exists interview10(dt string, name string, income int, cost int) row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/hive_data/interview10' into table interview10;
select * from interview10;

5.1输出每个产品，在2018年期间，每个月的净利润，日均成本

这里要对dt这个字段进行切割，得到我们需要的月份与年份，直接按照月份与名字分组即可

select
substring(dt,7,1)month,name,sum(income-cost)profits,avg(cost)avgcost
from interview10
where substring(dt,1,4)='2018'
group by substring(dt,7,1),name;

结果

month   name    profits avgcost
3       a       4200    2216.6666666666665
3       b       6000    1916.6666666666667
3       c       12900   2016.6666666666667
3       d       5400    2316.6666666666665
4       a       9200    2216.6666666666665
4       b       10000   1916.6666666666667
4       c       16900   2016.6666666666667
4       d       10400   2316.6666666666665

5.2输出每个产品，在2018年3月中每一天与上一天相比，成本的变化

这里要比较上一天的就需要用到自连接，直接将两天的cost进行相减，这里使用了DATE_ADD函数来作为条件

select a.dt,b.dt,a.name,(a.cost-b.cost)costs
from interview10 a join interview10 b
on a.dt=DATE_ADD(b.dt,1)
where a.name=b.name and substring(a.dt,1,4)='2018' and substring(a.dt, 7, 1)='3'
order by a.name;

结果

a.dt    b.dt    a.name  costs
2018-03-02      2018-03-01      a       0
2018-03-03      2018-03-02      a       -100
2018-03-05      2018-03-04      a       -1400
2018-03-06      2018-03-05      a       1500
2018-03-04      2018-03-03      a       0
2018-03-04      2018-03-03      b       600
2018-03-06      2018-03-05      b       2300
2018-03-05      2018-03-04      b       -2500
2018-03-03      2018-03-02      b       1300
2018-03-02      2018-03-01      b       -2400
2018-03-05      2018-03-04      c       0
2018-03-03      2018-03-02      c       -600
2018-03-06      2018-03-05      c       100
2018-03-02      2018-03-01      c       -600
2018-03-04      2018-03-03      c       1000
2018-03-05      2018-03-04      d       -1500
2018-03-03      2018-03-02      d       900
2018-03-06      2018-03-05      d       4000
2018-03-02      2018-03-01      d       -1500
2018-03-04      2018-03-03      d       600

5.3输出2018年4月，有多少个产品总收入大于22000元，必须用一句SQL语句实现，且不允许使用关联表查询、子查询。

这里要使用rank over，按照总收入进行倒序排序，给每一个排序结果一个count作为下标记录，每有一个符合要求的下标就会加1，最后按照count排序输出，取第一个即最大的就是我们要的结果

select
rank() over (order by sum(income) desc) as count
from interview10
where substring(dt,1,4)='2018' and substring(dt, 7, 1)='4'
group by name
having sum(income)>22000
order by count desc
limit 1;

结果

count
3

5.4输出2018年4月，总收入最高的那个产品，过程使用over()函数

我们要求某个产品的信息，先要按name进行分区，然后再对总收入进行排序,我们得到的结果就是总收入最高的那个产品

select
name, rank() over (order by sum(income) desc) as total
from interview10 a
WHERE substring(dt, 0, 7) == "2018-04"
group by name
limit 1;

结果

name total
c   1

最后直接对上表与原表进行关联查询，取出我们要的字段即可

最终SQL：

select dt,a.income,a.cost
from interview10 a join
(select
name,income,cost,rank() over (partition by name order by sum(income) desc) as total
from interview10
where substring(dt,1,4)='2018' and substring(dt, 7, 1)='4'
group by name,income,cost
limit 1)b on a.name=b.name
where substring(dt, 7, 1)='4';

最终结果

dt      a.income        a.cost
2018-04-01      3000    2500
2018-04-02      3000    2500
2018-04-03      6100    2400
2018-04-04      2800    2400
2018-04-05      4700    1000
2018-04-06      2900    2500