Big Data Interview Questions: Hive (2022 Edition)

No.	Title	Link
1	Big Data Interview Questions: General (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/124819089
2	Big Data Interview Questions: Hadoop (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/124822293
3	Big Data Interview Questions: MapReduce and YARN (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/124841929
4	Big Data Interview Questions: ZooKeeper (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/124548428
5	Big Data Interview Questions: Hive (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125105485
6	Big Data Interview Questions: Flume (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125132610
7	Big Data Interview Questions: HBase (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125145399
8	Big Data Interview Questions: Sqoop (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125145736
9	Big Data Interview Questions: Kafka (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125145841
10	Big Data Interview Questions: Azkaban (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125146859
11	Big Data Interview Questions: Scala (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125145976
12	Big Data Interview Questions: Spark (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125146030
13	Big Data Interview Questions: Flink (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125182137
14	Data Warehouse: Construction Standards (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125785261
15	Data Warehouse: Development Process (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125967931
16	Data Warehouse: Terminology and Relationships (2022 Edition)	https://blog.csdn.net/qq_43061290/article/details/125886082

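Table of Contents

1.1 Hive Data Model
1.2 Common Operations
- 1.2.1 Database Operations
- 1.2.2 Internal and External Tables
- 1.2.3 Creating Partitioned Tables
- 1.2.4 Adding and Dropping Partitions
- 1.2.5 Joins in Hive
- 1.2.6 JSON Parsing
1.3 Common Functions
- 1.3.1 Numeric Functions
- 1.3.2 Date Functions
- 1.3.3 Conditional Functions
- 1.3.4 String Functions
- 1.3.5 Type Conversion
1.4 Common Hive Optimizations
- 1.4.1 Fetch Task (lets Hive avoid running MapReduce)
- 1.4.2 Local Mode
- 1.4.3 Partitioned and Bucketed Tables
- 1.4.4 Join Optimization
  - 1.4.4.1 Joining a Small Table with a Large Table
  - 1.4.4.2 mapjoin
- 1.4.5 group by
- 1.4.6 Number of Map Tasks
- 1.4.7 Number of Reduce Tasks
- 1.4.8 JVM Reuse
- 1.4.9 Data Compression and Storage Formats
  - 1. Compression Methods
  - 2. Storage Formats
- 1.4.10 Parallel Execution
- 1.4.11 Merging Small Files
1.5 Data Skew in Hive
1.6 Miscellaneous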

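1.1 Hive Data Model

All Hive data is stored in HDFS; Hive has no dedicated storage format of its own.

Once the delimiters in the data are specified at table creation, Hive can map the files onto the table schema and parse the data.

Hive has the following data models:

**db:** appears in HDFS as a folder under the hive.metastore.warehouse.dir directory

**table:** appears in HDFS as a folder under the directory of the db it belongs to

**external table:** its data can be stored at any specified path in HDFS

**partition:** appears in HDFS as a subdirectory of the table directory

**bucket:** appears in HDFS as multiple files within the same table directory, produced by hashing a column value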
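
The mapping from these models to HDFS paths can be sketched in a few lines of Python. This is only an illustration, not Hive's implementation: the warehouse path /user/hive/warehouse, the database name demo, the table name, and all helper names are assumptions made for the example. The bucketing rule (an integer key hashes to itself, then modulo the bucket count) is a simplification, and real Hive hashing depends on the column type and Hive version.

```python
import posixpath

# Assumed value of hive.metastore.warehouse.dir (a common default).
WAREHOUSE_DIR = "/user/hive/warehouse"


def db_path(db):
    """A database is a '<db>.db' folder under the warehouse directory."""
    return posixpath.join(WAREHOUSE_DIR, db + ".db")


def table_path(db, table):
    """A managed table is a folder under its database folder."""
    return posixpath.join(db_path(db), table)


def partition_path(db, table, partition):
    """Each partition level is a 'column=value' subfolder of the table folder."""
    levels = [f"{col}={val}" for col, val in partition]
    return posixpath.join(table_path(db, table), *levels)


def bucket_index(int_key, num_buckets):
    """Simplified bucketing rule: non-negative hash of the key, modulo bucket count.

    Assumption: an integer key hashes to itself; real Hive hashing depends on
    the column type and the Hive version.
    """
    return (int_key & 0x7FFFFFFF) % num_buckets


print(partition_path("demo", "day_hour_table", [("dt", "2017-07-07"), ("hour", "08")]))
print([bucket_index(k, 4) for k in (1, 2, 7, 8)])
```

Rows whose keys land on the same bucket index end up in the same file of the table (or partition) directory, which is why a bucketed table shows up as several files rather than several folders.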

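1.2 Common Operations

1.2.1 Database Operations

Hive includes a default database named default.

-- Create a database

create database [if not exists] database_name;

-- Show all databases

show databases;

-- Drop a database

drop database if exists database_name [restrict|cascade];

By default, Hive refuses to drop a database that still contains tables; drop the tables first, otherwise the statement fails.
-- Adding the cascade keyword force-drops the database together with its tables

hive> drop database if exists users cascade;

-- Switch to a database

use database_name;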

1.2.2 Internal and External Tables

Create an internal (managed) table:
create table
student(Sno int,Sname string,Sex string,Sage int,Sdept string)
row format delimited fields terminated by ',';
Create an external table:
create external table
student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string)
row format delimited fields terminated by ',' location '/stu';

Load data into the internal and external tables:
load data local inpath '/root/hivedata/students.txt' overwrite into table student;
load data inpath '/stu' into table student_ext;

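1.2.3 Creating Partitioned Tables

Partitioned tables come in two kinds. A single-partition table has only one level of subfolders under the table directory. A multi-partition table has nested subfolders under the table directory.
Single-partition table DDL

create table day_table (id int, content string) partitioned by (dt string);
A single-partition table, partitioned by day; the table schema has three columns: id, content, and dt.

Two-level partition table DDL

create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
A two-level partitioned table, partitioned by day and hour; dt and hour are added to the table schema as two extra columns.

Load data:
load data local inpath '/root/hivedata/dat_table.txt' into table day_table partition(dt='2017-07-07');
load data local inpath '/root/hivedata/dat_table.txt' into table day_hour_table partition(dt='2017-07-07', hour='08');
Query by partition:
SELECT day_table.* FROM day_table WHERE day_table.dt = '2017-07-07';
List partitions:
show partitions day_hour_table;
In short, partitions exist to assist queries: they narrow the range of data scanned, speed up retrieval, and let data be managed according to defined rules and conditions.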
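
How a partition filter narrows the scan can be modeled with a small Python sketch. The in-memory table, its sample rows, and the scan helper are all hypothetical; the point is only that a filter on dt or hour decides which partition directories are read at all, and that the partition columns are attached to each row rather than stored in the data files.

```python
# Partition (dt, hour) -> rows stored in that partition directory.
# The rows are made-up sample data for illustration.
day_hour_table = {
    ("2017-07-07", "08"): [(1, "a"), (2, "b")],
    ("2017-07-07", "09"): [(3, "c")],
    ("2017-07-08", "08"): [(4, "d")],
}


def scan(table, dt=None, hour=None):
    """Return (rows, partitions_read), reading only partitions that match the filter."""
    rows, partitions_read = [], []
    for (p_dt, p_hour), data in sorted(table.items()):
        if dt is not None and p_dt != dt:
            continue  # pruned: this directory is never opened
        if hour is not None and p_hour != hour:
            continue
        partitions_read.append(f"dt={p_dt}/hour={p_hour}")
        # Partition columns are appended to every row of the partition.
        rows.extend((rid, content, p_dt, p_hour) for rid, content in data)
    return rows, partitions_read


rows, partitions_read = scan(day_hour_table, dt="2017-07-07")
print(partitions_read)
print(rows)
```

With the filter dt='2017-07-07', only two of the three partition directories are visited; without any filter, every directory is read.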

Specifying delimiters

-- Create a partitioned table with an explicit field delimiter

create table day_table (id int, content string) partitioned by (dt string) row format delimited fields terminated by ',';

-- Specify delimiters for a table with a complex-typed column

Sample data (the name and the city list are separated by a tab):

zhangsan beijing,shanghai,tianjin,hangzhou
wangwu  shanghai,chengdu,wuhan,haerbin

Table DDL

create table
complex_array(name string,work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';
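
The effect of the two delimiters can be mimicked in Python. This is a rough model of how a text row is split under the DDL above, not Hive's actual SerDe code; parse_line is a hypothetical helper.

```python
def parse_line(line, field_sep="\t", item_sep=","):
    """Split a row into (name, work_locations) using the field and collection delimiters."""
    name, locations = line.rstrip("\n").split(field_sep, 1)
    return name, locations.split(item_sep)


lines = [
    "zhangsan\tbeijing,shanghai,tianjin,hangzhou",
    "wangwu\tshanghai,chengdu,wuhan,haerbin",
]
rows = [parse_line(line) for line in lines]

for name, work_locations in rows:
    # work_locations[0] plays the role of work_locations[0] in a Hive query.
    print(name, work_locations[0], len(work_locations))
```

The field delimiter separates columns first; the collection delimiter is then applied only inside the array column.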

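1.2.4 Adding and Dropping Partitions

Add a partition

alter table t_partition add partition (dt='2008-08-08') location 'hdfs://node-21:9000/t_parti/';
When the partition is added, the data under /t_parti is not moved, and no partition directory dt=2008-08-08 is created.

Drop a partition

alter table t_partition drop partition (dt='2008-08-08');
When the partition is dropped, the data under /t_parti is deleted, and the /t_parti folder itself is removed with it (this applies to a managed table; for an external table, dropping a partition removes only the metadata).
Note how this differs from adding a partition through load data, which moves the data and creates the partition directory.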

1.2.5 Joins in Hive

Prepare the data
a.txt:
1,a
2,b
3,c
4,d
7,y
8,u
b.txt:
2,bb
3,cc
7,yy
9,pp
Create the tables:
create table a(id int,name string)
row format delimited fields terminated by ',';
create table b(id int,name string)
row format delimited fields terminated by ',';
Load the data:
load data local inpath '/root/hivedata/a.txt' into table a;
load data local inpath '/root/hivedata/b.txt' into table b;
Experiments:
**inner join**
select * from a inner join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
+-------+---------+-------+---------+--+
**left join**
select * from a left join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
+-------+---------+-------+---------+--+
**right join**
select * from a right join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+
Note: select * from b right join a on b.id=a.id; is not the same query. It keeps every row of a (the same rows as a left join b), with b's columns listed first.
**full outer join**
select * from a full outer join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+
**Hive's special join**
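select * from a left semi join b on a.id = b.id;
select a.* from a inner join b on a.id=b.id;
+-------+---------+
| a.id  | a.name  |
+-------+---------+
| 2     | b       |
| 3     | c       |
| 7     | y       |
+-------+---------+
Equivalent to:
select a.id,a.name from a where a.id in (select b.id from b);  -- very inefficient in Hive
select a.id,a.name from a join b on (a.id = b.id);
select * from a inner join b on a.id=b.id;  -- also returns b's columns
The inner-join forms return the same rows of a only because b.id is unique here; if b contained duplicate ids, an inner join would repeat rows of a, whereas left semi join returns each row of a at most once.
**cross join (use with caution)**
Returns the Cartesian product of the two tables; no join key is required.
select a.*,b.* from a cross join b;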
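
The semantics of each join type can be reproduced on the same sample data with plain Python lists. This is a teaching sketch with hypothetical helper names, not how Hive executes joins; None stands in for NULL, and the row order shown is just the order these loops produce, since Hive itself guarantees no particular order without an order by.

```python
a = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u")]
b = [(2, "bb"), (3, "cc"), (7, "yy"), (9, "pp")]
NULL_ROW = (None, None)  # stands in for a row of NULLs


def inner_join(left, right):
    """Keep only pairs of rows whose ids match."""
    return [l + r for l in left for r in right if l[0] == r[0]]


def left_join(left, right):
    """Keep every left row; pad with NULLs when nothing on the right matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        out.extend(l + r for r in (matches or [NULL_ROW]))
    return out


def right_join(left, right):
    """Mirror image of left_join, with columns kept in (left, right) order."""
    return [row[2:] + row[:2] for row in left_join(right, left)]


def full_outer_join(left, right):
    """Left join plus the right rows that matched nothing."""
    left_keys = {l[0] for l in left}
    return left_join(left, right) + [NULL_ROW + r for r in right if r[0] not in left_keys]


def left_semi_join(left, right):
    """Left rows that have at least one match; each is returned at most once."""
    right_keys = {r[0] for r in right}
    return [l for l in left if l[0] in right_keys]


def cross_join(left, right):
    """Cartesian product: every left row paired with every right row."""
    return [l + r for l in left for r in right]


print(inner_join(a, b))
print(left_semi_join(a, b))
# A duplicate id in b repeats rows in an inner join but not in a left semi join.
b_dup = b + [(2, "zz")]
print(len(inner_join(a, b_dup)), len(left_semi_join(a, b_dup)))
```

The last line shows why left semi join and inner join are interchangeable only when the join key on the right side is unique.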

1.2.6 JSON Parsing

1. Load the rating.json file into a raw Hive table, rat_json
Sample: {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
create table rat_json(line string) row format delimited;
load data local inpath '/root/hivedata/rating.json' into table rat_json;
2. Parse the JSON data into four fields and insert them into a new table, t_rating
drop table if exists t_rating;
create table t_rating(movieid string,rate int,timestring string,uid string)
row format delimited fields terminated by '\t';
3. Parse the rows of the JSON table into t_rating
insert overwrite table t_rating
select
get_json_object(line,'$.movie') as movieid,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as timestring,
get_json_object(line,'$.uid') as uid
from rat_json limit 10;
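
What the query does to one input line can be sketched in Python with the standard json module. The helper json_field is hypothetical and handles only top-level paths of the form $.key, which is all this example needs; it returns None where the Hive function would yield NULL.

```python
import json


def json_field(line, path):
    """Return the top-level field named by a '$.key' path, or None if absent or invalid."""
    if not path.startswith("$."):
        return None
    try:
        record = json.loads(line)
    except ValueError:
        return None  # malformed JSON behaves like a NULL result
    if not isinstance(record, dict):
        return None
    return record.get(path[2:])


def to_rating_row(line):
    """Build a (movieid, rate, timestring, uid) row, casting rate to int like the target table."""
    rate = json_field(line, "$.rate")
    return (
        json_field(line, "$.movie"),
        int(rate) if rate is not None else None,
        json_field(line, "$.timeStamp"),
        json_field(line, "$.uid"),
    )


sample = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'
print(to_rating_row(sample))
```

Each raw line becomes one four-column row, with rate converted to an integer to match the int column of t_rating.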

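1.3 Common Functions

1.3.1 Numeric Functions

Round to a given precision: round

Syntax: round(double a, int d)

Return type: DOUBLE

Description: returns a rounded to d decimal places, as a double

Example:
```
hive> select round(3.1415926,4) from dual;
3.1416
```
Round down: floor

Syntax: floor(double a)

Return type: BIGINT

Description: returns the largest integer less than or equal to the given double

Example:
```
hive> select floor(3.1415926) from dual;
3
hive> select floor(25) from dual;
25
```
Round up: ceil

Syntax: ceil(double a)

Return type: BIGINT

Description: returns the smallest integer greater than or equal to the given double

Example:
```
hive> select ceil(3.1415926) from dual;
4
hive> select ceil(46) from dual;
46
```
Random number: rand

Syntax: rand(), rand(int seed)

Return type: double

Description: returns a random number in the range 0 to 1. If a seed is given, the sequence of random numbers is reproducible.

Example:
```
hive> select rand() from dual;
0.5577432776034763
```
Absolute value: abs

Syntax: abs(double a), abs(int a)

Return type: double or int, matching the argument

Description: returns the absolute value of a

Example:
```
hive> select abs(-3.9) from dual;
3.9
hive> select abs(10.9) from dual;
10.9
```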
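
For comparison, the same calculations can be checked in Python. The helper half_up_round is hypothetical and assumes half-up rounding for round, which differs from Python's built-in round (it rounds halves to the nearest even digit). A seeded random.Random illustrates the reproducible sequence that rand(seed) describes, although the actual numbers will not match Hive's.

```python
import math
import random
from decimal import Decimal, ROUND_HALF_UP


def half_up_round(x, d=0):
    """Round x to d decimal places, with halves rounded away from zero (assumed rule)."""
    quantum = Decimal(10) ** -d  # e.g. d=4 -> 0.0001
    return float(Decimal(str(x)).quantize(quantum, rounding=ROUND_HALF_UP))


print(half_up_round(3.1415926, 4))
print(math.floor(3.1415926), math.floor(25))
print(math.ceil(3.1415926), math.ceil(46))
print(abs(-3.9), abs(10.9))

# Two generators with the same seed yield the same sequence.
first = random.Random(42)
second = random.Random(42)
print([first.random() for _ in range(3)] == [second.random() for _ in range(3)])
```

The difference between the two rounding rules only shows on exact halves: half_up_round(2.5) gives 3.0, while Python's round(2.5) gives 2.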

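1.3.2 Date Functions

to_date(string timestamp): returns the date part of a timestamp string,
- e.g. to_date('1970-01-01 00:00:00')='1970-01-01'
current_date: returns the current date
year(date): returns the year of date, as an int
- e.g. year('2019-01-01')=2019
month(date): returns the month of date, as an int,
- e.g. month('2019-01-01')=1
day(date): returns the day of the month of date, as an int,
- e.g. day('2019-01-01')=1
weekofyear(date1): returns which week of the year date1 falls in.
- e.g. weekofyear('2019-03-06')=10
datediff(date1,date2): returns the number of days from date2 to date1
- e.g. datediff('2019-03-06','2019-03-05')=1
date_add(date1,int1): returns date1 plus int1 days
- e.g. date_add('2019-03-06',1)='2019-03-07'
date_sub(date1,int1): returns date1 minus int1 days
- e.g. date_sub('2019-03-06',1)='2019-03-05'
months_between(date1,date2): returns the number of months between date1 and date2, as a double; the result is a whole number only when both dates fall on the same day of the month or both are month ends, otherwise the fraction is computed on a 31-day month
- e.g. months_between('2019-03-06','2019-01-01') is about 2.16129032
add_months(date1,int1): returns date1 plus int1 months; int1 may be negative
- e.g. add_months('2019-02-11',-1)='2019-01-11'
last_day(date1): returns the last day of the month that date1 falls in
- e.g. last_day('2019-02-01')='2019-02-28'
next_day(date1,day1): returns the first date after date1 that falls on weekday day1. day1 is the English weekday name, for example its first two letters
- e.g. next_day('2019-03-06','MO') returns '2019-03-11'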
**trunc(date1,string1)
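
Several of the examples above can be cross-checked with Python's datetime module. The functions below are hypothetical stand-ins that borrow the Hive names for readability; they accept only yyyy-MM-dd strings, and months_between follows the 31-day-month rule as described above rather than Hive's source code.

```python
import calendar
from datetime import date, timedelta


def datediff(d1, d2):
    """Days from d2 to d1."""
    return (date.fromisoformat(d1) - date.fromisoformat(d2)).days


def date_add(d, n):
    return (date.fromisoformat(d) + timedelta(days=n)).isoformat()


def date_sub(d, n):
    return date_add(d, -n)


def weekofyear(d):
    """ISO week number (weeks start on Monday)."""
    return date.fromisoformat(d).isocalendar()[1]


def last_day(d):
    dt = date.fromisoformat(d)
    return dt.replace(day=calendar.monthrange(dt.year, dt.month)[1]).isoformat()


def next_day(d, day_of_week):
    """First date strictly after d that falls on the given weekday (e.g. 'MO')."""
    names = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]
    target = names.index(day_of_week[:2].upper())
    dt = date.fromisoformat(d)
    delta = (target - dt.weekday() - 1) % 7 + 1  # always 1..7 days ahead
    return (dt + timedelta(days=delta)).isoformat()


def months_between(d1, d2):
    """Whole months plus a fraction based on a 31-day month, rounded to 8 places."""
    a, b = date.fromisoformat(d1), date.fromisoformat(d2)
    months = (a.year - b.year) * 12 + (a.month - b.month)
    a_is_month_end = a.day == calendar.monthrange(a.year, a.month)[1]
    b_is_month_end = b.day == calendar.monthrange(b.year, b.month)[1]
    if a.day == b.day or (a_is_month_end and b_is_month_end):
        return float(months)
    return round(months + (a.day - b.day) / 31, 8)


print(datediff("2019-03-06", "2019-03-05"), date_add("2019-03-06", 1), date_sub("2019-03-06", 1))
print(weekofyear("2019-03-06"), last_day("2019-02-01"), next_day("2019-03-06", "MO"))
print(months_between("2019-03-06", "2019-01-01"), months_between("2019-03-01", "2019-01-01"))
```

The months_between calls show both cases: differing days of the month give a fractional result, while matching days give a whole number.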
