Hive: Data Types and Query Operations

3. Data types, tables, and table operations

1. Data types

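tinyint / smallint / int / bigint (integer literal suffixes: Y for tinyint, S for smallint, L for bigint)
float / double / decimal
boolean
string / varchar / char
timestamp / date / interval
binary
Array (ordered list), Map (key-value pairs), Struct (object with named fields)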

2. Tables

•Table: internal table

•External Table: external table

•Partition: partitioned table

•Bucket Table: bucketed table

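2.1 Internal tables

Every Table has a corresponding directory in Hive that stores its data; all the data of an internal table is kept in that directory.
When the table definition is dropped, the table's data and its metadata are deleted along with it.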

use bd04;
create table bd04.t_student
(id int,name string,weight double)
row format delimited
fields terminated by ','
stored as textfile;

insert into bd04.t_student values(1,'tom',90.3);

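Hive file storage formats include the following:

· TEXTFILE
· SEQUENCEFILE
· RCFILE
· Custom formats

1. TEXTFILE is the default format. The data is not compressed, so disk usage and parsing cost are both high. It can be combined with Gzip or Bzip2 (Hive detects the compression automatically and decompresses the data when a query runs), but a compressed file such as a Gzip file cannot be split by Hive, so the data in it cannot be processed in parallel.

2. SequenceFile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible.
SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD gives a low compression ratio, so BLOCK compression is generally recommended.

3. RCFILE combines row-oriented and column-oriented storage. First, it divides the data into blocks by row, which guarantees that a single record stays within one block and avoids reading several blocks to fetch one record. Second, the data inside each block is stored by column, which helps compression and fast column access.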
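The two steps described for RCFILE (split by row, then store each block by column) can be pictured with a small Python sketch. This is only a model of the layout idea, not the actual RCFile on-disk format; the function name `to_row_groups` and the sample rows are made up for illustration.

```python
def to_row_groups(rows, group_size):
    """Split rows into horizontal groups, then lay out each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]      # step 1: a block of whole rows
        columns = [list(col) for col in zip(*chunk)]  # step 2: transpose the block into columns
        groups.append(columns)
    return groups


# Hypothetical sample rows: (id, name, weight)
rows = [(1, "tom", 90.3), (2, "amy", 55.0), (3, "bob", 71.2)]
groups = to_row_groups(rows, group_size=2)
print(groups)
```

Every value of a given record stays inside one group, while values of the same column sit next to each other within that group, which is what makes per-column compression and column reads cheap.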

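Tables in SequenceFile or RCFile format cannot load data directly from a local file. The data must first be loaded into a textfile table and then inserted from the textfile table into the SequenceFile or RCFile table.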

create table zone0000tf(ra int, dec int, mag int
) row format delimited
fields terminated by '|';

create table zone0000rc(ra int, dec int, mag int
) row format delimited
fields terminated by '|'
stored as rcfile;

load data local inpath '/home/briup/zone0000.asc' into table zone0000tf;
insert overwrite table zone0000rc select * from zone0000tf;

create table test2(str STRING)
STORED AS SEQUENCEFILE;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE test2 SELECT * FROM test1;

create table test3(str STRING)
STORED AS RCFILE;

INSERT OVERWRITE TABLE test3 SELECT * FROM test1;

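Compared with TEXTFILE and SEQUENCEFILE, RCFILE costs more when loading data because of its columnar storage, but it offers a better compression ratio and better query response. A data warehouse is typically written once and read many times, so overall RCFILE has a clear advantage over the other two formats.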

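2.2 External tables

For an external table, the existence of the data and the table definition do not constrain each other: the table is merely a reference to the corresponding files on HDFS. When the table definition is dropped, the data in the table still exists.

Dropping an external table removes only that link.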
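The difference between dropping an internal table and dropping an external table can be modelled in a few lines of Python. This is a toy sketch: the dictionaries standing in for the metastore and for HDFS, the `/warehouse/...` path, and the sample rows are all assumptions made for illustration.

```python
def drop_table(name, metastore, filesystem):
    """Remove a table definition; delete its data only for an internal table."""
    table = metastore.pop(name)  # the metadata is always removed
    if not table["external"]:
        # internal table: the data directory is removed together with the definition
        filesystem.pop(table["location"], None)


# Hypothetical metastore entries and file system contents.
metastore = {
    "t_student": {"location": "/warehouse/t_student", "external": False},
    "t_phone": {"location": "/user/t_phone", "external": True},
}
filesystem = {
    "/warehouse/t_student": ["1,tom,90.3"],
    "/user/t_phone": ["1\tphoneA\t1999.0"],
}

drop_table("t_student", metastore, filesystem)
drop_table("t_phone", metastore, filesystem)
print(sorted(filesystem))
```

After both drops no table definitions remain, yet the directory that backed the external table is still there.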

create external table t_phone
(id int,name string,price double)
row format delimited
fields terminated by '\t'
stored as textfile
location '/user/t_phone';

create external table t_phone_ext2(id int,name string,price double)
row format delimited
fields terminated by ','
stored as textfile;
-- an external table treats the data under a directory as a table
load data inpath '/user/phone_extr' into table t_phone_ext2;

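2.4 Partitioned tables

In many scenarios, partitioning reduces the total amount of data scanned by each query, which can improve performance significantly.

Partitioned tables are created along dimensions such as business code, date, or some other category.
For example, if each of the nine districts of Chongqing has its own partition, a query for one district only needs to read that one partition instead of filtering the full data set.
The underlying implementation of partitioning is:
within the directory of a table, each partition corresponds to one directory.

Hive implements partitions with HDFS subdirectories. The name of each subdirectory contains the partition column name and its value. However, HDFS does not cope well with a very large number of subdirectories, which limits how partitions can be used. The number of partitions in a table should be estimated in advance to avoid the problems caused by having too many of them.

Use case:
When a single table holds a huge amount of data and queries usually restrict it to one category, the table can be partitioned by that category to make queries more efficient and reduce resource consumption.

Hive queries usually use the partition columns as query conditions. This lets the MapReduce job scan only the specified subdirectories in HDFS, so the HDFS directory structure can be used as efficiently as an index.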
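How a filter on the partition column narrows the scan to a single subdirectory can be sketched in Python. The `column=value` directory naming follows Hive's convention; the table directory `/warehouse/t_order` and the helper names are hypothetical.

```python
def partition_dir(table_dir, column, value):
    """Hive names each partition subdirectory '<column>=<value>'."""
    return f"{table_dir}/{column}={value}"


def dirs_to_scan(table_dir, column, all_values, wanted):
    """Only the partitions that match the filter need to be read."""
    return [partition_dir(table_dir, column, v) for v in all_values if v == wanted]


table_dir = "/warehouse/t_order"  # hypothetical table location
months = ["5", "8"]               # existing partitions
scan = dirs_to_scan(table_dir, "month", months, "8")
print(scan)
```

A query with `where month='8'` would, in this model, touch one directory out of two; without the filter every partition directory has to be scanned.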

create external table t_order(id int,name string,cost double
)partitioned by (month string)
row format delimited
fields terminated by ',';

load data local inpath '/home/hdfs/order_data' into table t_order partition(month='8');

load data local inpath 'order_data' into table t_order
partition(month='5');

show partitions t_order;

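2.5 Bucketed tables

When the number of partitions becomes so large that it could overwhelm the file system, bucketing is the way to solve the problem.

A large table is stored hash-distributed so that it can be sampled, which is convenient for validating data and code. For example, split the table into 10 parts and use only one tenth of it at a time to get results quickly.

The underlying implementation of bucketing is:
within the directory of the table, the source data is split into N smaller files.

Use case:
With a huge data set we often need to take out a small part as a sample, validate our queries against that sample, and tune our programs. A bucketed table makes this kind of sampling possible.

Keep in mind that when the data volume is large enough, bucketing can give higher query efficiency than partitioning.

The biggest difference between partitioning and bucketing is that bucketing splits the data (pseudo-)randomly, whereas partitioning splits it non-randomly.

Bucketing splits the data by a hash function of a column (for example a unique OrderID), so the result is relatively even; partitioning splits the data by the value of a column, which easily leads to data skew.

Another difference is that buckets correspond to separate files (fine-grained), while partitions correspond to separate directories (coarse-grained).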
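The hash-based split described above can be sketched in Python. The sketch assumes that an integer key hashes to itself and that the bucket is the non-negative hash modulo the bucket count; treat that as an assumption about Hive's behaviour rather than a specification. The function names and sample rows are illustrative.

```python
def bucket_for(key, num_buckets):
    """Assumed rule: integer hash is the value itself, made non-negative, modulo N."""
    return (key & 0x7FFFFFFF) % num_buckets


def split_into_buckets(rows, num_buckets):
    """Distribute rows into N lists, one per bucket file."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    return buckets


# Hypothetical rows: (id, name), with unique ids 1..9
rows = [(i, f"phone{i}") for i in range(1, 10)]
buckets = split_into_buckets(rows, 3)
sizes = [len(b) for b in buckets]
print(sizes)
```

Because the ids are unique and evenly spread, each bucket ends up with the same number of rows, which illustrates why bucketing on a unique column tends to avoid skew.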

create table t_phone_bucket(id int,name string ,price string)
clustered by(id) into 3 buckets
row format delimited fields terminated by ',';

create table t_phone_str(id int,name string,price string)
row format delimited fields terminated by ','
stored as textfile;

-- data in a bucketed table can only be inserted from another table through a subquery
set hive.enforce.bucketing=true;
insert into table t_phone_bucket select * from t_phone;
select * from t_phone_bucket tablesample(bucket 3 out of 3 on id);
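The `tablesample(bucket x out of y on id)` clause can be read as "keep the rows whose id falls into the x-th of y hash buckets". A minimal Python model, under the same assumption that an integer id hashes to itself, looks like this; the function name and sample rows are illustrative only.

```python
def tablesample(rows, x, y, key=lambda row: row[0]):
    """Keep rows whose key lands in bucket x (1-based) when hashed into y buckets."""
    return [row for row in rows if (key(row) & 0x7FFFFFFF) % y == x - 1]


# Hypothetical rows: (id, name)
rows = [(i, f"phone{i}") for i in range(1, 10)]
sample = tablesample(rows, 3, 3)
print([row[0] for row in sample])
```

With 3 buckets the sample holds roughly one third of the rows, so a query can be validated on it before being run against the full table.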

3. Importing data

1 Using the insert command
2 Using the load command
3 Using a subquery

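-- 1. insert syntax
insert into table_name(col1,col2) values(val1,val2);
insert into t_student(id,name) values(1,'tom');

-- 2. load data
load data [local] inpath 'path to the data' [overwrite] into table table_name;

-- upload a local file to HDFS first (run from a shell, not inside Hive)
hdfs dfs -put  /home/hdfs/phone_data /data2
-- import from a local file
load data local inpath '/home/hdfs/phone_data' overwrite into table t_student;
-- import from a file in HDFS
load data inpath '/data2' into table t_student;

-- 3. subquery
-- 3.1 populate the table from a subquery when creating it
create table t_phone_back
as
select * from t_phone;
-- 3.2 populate the table from a subquery with insert (use either overwrite or into)
insert overwrite table t_phone_back select * from t_phone;
insert into table t_phone_back select * from t_phone;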

4. Other types

-- 1. array
create table tab_array (a array<int>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

-- the file /home/hdfs/data_array contains:
-- 1,2,3  hello,world,briup
load data local inpath '/home/hdfs/data_array' into table tab_array;
select a[2],b[1] from tab_array;

-- 2. map
create table tab_map (name string,info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

-- the file /home/hdfs/data_map contains:
-- zhangsan    name:zhangsan,age:18,gender:male
load data local inpath '/home/hdfs/data_map' into table tab_map;
select  info['name'] from tab_map;

-- 3. struct
create table tab_struct(name string,info struct<age:int,tel:string,salary:double>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

-- the file /home/hdfs/data_struct contains:
-- zhangsan   18,189,22.3
load data local inpath '/home/hdfs/data_struct' into table tab_struct;
select name,info.age,info.tel from tab_struct;
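To see how the three delimiters (field, collection item, map key) carve up one line of a data file, here is a small Python sketch that parses the sample rows the way the table definitions above describe. It assumes the fields are separated by a tab character; the parser function names are made up for illustration.

```python
def parse_array_row(line):
    """fields: tab; collection items: comma -> (array<int>, array<string>)"""
    a, b = line.split("\t")
    return [int(x) for x in a.split(",")], b.split(",")


def parse_map_row(line):
    """fields: tab; items: comma; map keys: colon -> (string, map<string,string>)"""
    name, info = line.split("\t")
    return name, dict(item.split(":", 1) for item in info.split(","))


def parse_struct_row(line):
    """fields: tab; struct members: comma -> (string, struct<age,tel,salary>)"""
    name, info = line.split("\t")
    age, tel, salary = info.split(",")
    return name, {"age": int(age), "tel": tel, "salary": float(salary)}


a, b = parse_array_row("1,2,3\thello,world,briup")
name, info = parse_map_row("zhangsan\tname:zhangsan,age:18,gender:male")
sname, sinfo = parse_struct_row("zhangsan\t18,189,22.3")
print(a[2], b[1], info["name"], sinfo["age"], sinfo["tel"])
```

The indexing mirrors the queries: `a[2]` and `b[1]` are zero-based array lookups, `info['name']` is a map lookup by key, and `info.age` is a struct member access.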

4. Querying data with HiveQL

select
from
where
order by sort by
group by  distinct
distribute by   cluster by
limit

1. order by, sort by, distribute by, cluster by

1. Prepare the table and data

create table test1(id int)
row format delimited
fields terminated by '\t'
stored as textfile;

load data local inpath '/home/hdfs/data1' into table test1;

2. Tests

1. order by id asc: global sort (all rows pass through a single reducer, whatever mapred.reduce.tasks is set to)
ex:
hive> set mapred.reduce.tasks=2;
hive> select * from test1 order by id;

2. sort by id desc: local sort (rows are sorted within each reducer only)
ex:
hive> set mapred.reduce.tasks=2;
hive> select * from test1 sort by id;

3. distribute by: divides the data by the given column or expression and sends each part to the corresponding reducer or output file
ex:
hive> set mapred.reduce.tasks=2;
hive> INSERT overwrite LOCAL directory '/home/hdfs/res1'
SELECT id FROM test1 distribute BY id;

4. cluster by
Combines the function of distribute by with the sorting of sort by
ex:
hive> set mapred.reduce.tasks=2;
hive> INSERT overwrite LOCAL directory '/home/hdfs/res2'
SELECT id FROM test1 cluster by id;
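The four clauses are easier to tell apart when simulated with two "reducers" in Python. This is a rough model, not Hive's implementation: it assumes rows are routed by `id % num_reducers` for distribute by, and it deals rows round-robin for sort by, since Hive does not promise any particular routing when only sort by is given.

```python
def order_by(rows):
    """Global sort: one reducer sees every row."""
    return sorted(rows)


def sort_by(rows, num_reducers):
    """Local sort: each reducer sorts only its own rows (round-robin routing assumed)."""
    parts = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(p) for p in parts]


def distribute_by(rows, num_reducers):
    """Rows with the same key go to the same reducer; no ordering inside it."""
    parts = [[] for _ in range(num_reducers)]
    for r in rows:
        parts[r % num_reducers].append(r)  # assumed hash: the integer value itself
    return parts


def cluster_by(rows, num_reducers):
    """distribute by plus sort by on the same key."""
    return [sorted(p) for p in distribute_by(rows, num_reducers)]


ids = [5, 2, 8, 1, 4, 7]  # hypothetical contents of test1
print(order_by(ids))
print(sort_by(ids, 2))
print(distribute_by(ids, 2))
print(cluster_by(ids, 2))
```

Only order by yields one totally ordered result; sort by and cluster by yield one sorted list per reducer, and concatenating those lists is generally not globally sorted.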

2. group by and distinct

1. Prepare the table and data

create table test2(name String,age int,num String)
row format delimited
fields terminated by '\t'
stored as textfile;

The file contents are as follows:

zhao    15  20170807
zhao    14  20170809
zhao    15  20170809
zhao    16  20170809

Load the data:

load data local inpath '/home/hdfs/data_test2' into table test2;

Test:

select name from test2 group by name;
select distinct name from test2;

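Both queries return the same result. With larger data volumes distinct tends to be somewhat less efficient, so group by is generally recommended.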

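3. Virtual columns

Hive queries expose two virtual columns: INPUT__FILE__NAME, the name of the HDFS file that the row comes from; and BLOCK__OFFSET__INSIDE__FILE, the offset of the row within that file.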
ex:
hive> select id,INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE from test1;
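As a rough picture of what the two virtual columns carry for a plain text file, the Python sketch below attaches a file name and a byte offset to every line. It assumes the offset is the byte position at which each line starts, which is how it behaves for simple text files; the file path and contents are hypothetical.

```python
def with_virtual_columns(file_name, content):
    """Return (row, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE) for each line."""
    rows, offset = [], 0
    for line in content.splitlines(keepends=True):
        rows.append((line.rstrip("\n"), file_name, offset))
        offset += len(line.encode("utf-8"))  # advance by the line's size in bytes
    return rows


# Hypothetical file path and contents for table test1.
rows = with_virtual_columns("/user/hive/warehouse/test1/data1", "1\n22\n333\n")
for row in rows:
    print(row)
```

Columns like these are handy for tracing a suspicious row back to the exact file and position it was read from.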

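4. Join operations

Joins in Hive fall into two kinds: Common Join (the join is done in the Reduce phase) and Map Join (the join is done in the Map phase). If MapJoin is not specified, or the conditions for MapJoin are not met (hive.auto.convert.join=true enables map join for small tables; hive.mapjoin.smalltable.filesize=25M sets the small-table threshold), the Hive parser converts the join into a Common Join, that is, the join is completed in the Reduce phase.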
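The two strategies can be contrasted with a simplified Python model. In the map join, the small table is loaded into an in-memory hash table and the big table is streamed past it; in the common join, both inputs are grouped by key first (the shuffle) and matched afterwards (the reduce). The function names and sample rows are illustrative, and the model ignores everything about how Hive actually schedules tasks.

```python
from collections import defaultdict


def map_join(big, small):
    """Build a hash table from the small side, then probe it while streaming the big side."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(k, v, sv) for k, v in big for sv in lookup.get(k, [])]


def common_join(left, right):
    """Group both sides by key (shuffle), then pair the values per key (reduce)."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    out = []
    for k in sorted(groups):
        lefts, rights = groups[k]
        out.extend((k, l, r) for l in lefts for r in rights)
    return out


# Hypothetical sample rows: (id, name) and (id, age)
names = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]
ages = [(1, 30), (2, 29), (4, 21)]
print(map_join(names, ages))
print(common_join(names, ages))
```

Both produce the same rows; the map join simply avoids the shuffle, which is why it only pays off when one side is small enough to fit in memory.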

Join types:

Inner join: inner join
Outer joins:
  left outer join: left outer join
  right outer join: right outer join
  full outer join: full outer join
Left semi join: left semi join
Cartesian product (cross join)

1. Prepare the tables and data

1.1 Create the user name table:

create table user_name(id int,name string)
row format delimited
fields terminated by '\t'
stored as textfile;

The file data_user_name contains:

1    zhangsan
2    lisi
3    wangwu

Load the data:

load data local inpath '/home/hdfs/data_user_name' into table user_name;

1.2 Create the user age table:

create table user_age(id int,age int)
row format delimited
fields terminated by '\t'
stored as textfile;

The file data_user_age contains:

1    30
2    29
4    21

Load the data:

load data local inpath '/home/hdfs/data_user_age' into table user_age;

2. Exercises

1. Inner join
SELECT a.id,
a.name,
b.age
FROM user_name a
inner join user_age b
ON (a.id = b.id);

2. Left outer join
SELECT a.id,
a.name,
b.age
FROM user_name a
left join user_age b
ON (a.id = b.id);

3. Right outer join
SELECT a.id,
a.name,
b.age
FROM user_name a
RIGHT OUTER JOIN user_age b
ON (a.id = b.id);

4. Full outer join
SELECT a.id,
a.name,
b.age
FROM user_name a
FULL OUTER JOIN user_age b
ON (a.id = b.id);

5. Left semi join
A left semi join is used in place of an in or exists operation:
SELECT a.id,
a.name
FROM user_name a
LEFT SEMI JOIN user_age b
ON (a.id = b.id);
-- equivalent to:
SELECT a.id,
a.name
FROM user_name a
WHERE a.id
IN (SELECT id FROM user_age);
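However, older versions of Hive do not support in subqueries in the where clause (support was added in Hive 0.13), so the workaround is to use left semi join.

6. Cartesian product (CROSS JOIN)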
SELECT a.id,
a.name,
b.age
FROM user_name a
CROSS JOIN user_age b;
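To check the expected result of each exercise, the six joins can be replayed in Python on the same sample data. The sketch relies on the ids being unique in both tables (so plain dictionaries suffice) and uses None where Hive would return NULL; row order in real Hive output is not guaranteed.

```python
names = {1: "zhangsan", 2: "lisi", 3: "wangwu"}  # user_name: id -> name
ages = {1: 30, 2: 29, 4: 21}                     # user_age: id -> age

# (a.id, a.name, b.age) for each join type
inner = [(i, names[i], ages[i]) for i in names if i in ages]
left = [(i, names[i], ages.get(i)) for i in names]
right = [(i if i in names else None, names.get(i), ages[i]) for i in ages]
full = left + [(None, None, ages[i]) for i in ages if i not in names]

# left semi join returns only columns of the left table
semi = [(i, names[i]) for i in names if i in ages]

# cross join pairs every left row with every right row
cross = [(i, names[i], ages[j]) for i in names for j in ages]

for label, rows in [("inner", inner), ("left", left), ("right", right),
                    ("full", full), ("semi", semi)]:
    print(label, rows)
print("cross rows:", len(cross))
```

Note that in the right and full outer joins the unmatched row from user_age shows NULL for a.id as well, because the query selects a.id rather than b.id.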