Hive: Data Definition

When reposting, please credit the source: https://blog.csdn.net/l1028386804/article/details/88379603

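Hive creates one directory per database, and the tables in a database are stored as subdirectories of that directory. The one exception is tables in the default database, because that database has no directory of its own.

A database's directory sits under the top-level directory specified by the property hive.metastore.warehouse.dir. With the default configuration, which is /user/hive/warehouse, creating a database named test makes Hive create the directory /user/hive/warehouse/test.db. Note that the database directory name ends in .db.
You can override this default location when creating the database: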

hive> create database test location '/my/test/directory';
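
To confirm where a database actually lives, describe database prints its directory. The following is a minimal sketch that reuses the test database from the command above; the comment text is illustrative only and does not come from the original:

```sql
-- Create the database only if it does not exist yet.
-- The comment string is a made-up example.
create database if not exists test
comment 'Scratch database for the examples'
location '/my/test/directory';

-- Prints the database name, its comment, and its directory location.
describe database test;
```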

Copy the schema of an existing table (the data is not copied):

create table if not exists mydb.employees2 like mydb.employees;

List the tables in the current database:

show tables;

List the tables in a specific database:

show tables in test;

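Filter the table names with a regular expression:

show tables 'empl.*';

This lists every table in the current database whose name starts with empl.

Note: in older Hive releases, the in database_name clause and a regular expression on the table name cannot be used in the same statement.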

To see the information for a single column, append the column name to the table name:

describe mydb.employees.salary;

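Managed tables

Managed tables are also called internal tables. By default, their data is stored in a subdirectory of the directory defined by the configuration property hive.metastore.warehouse.dir (/user/hive/warehouse by default).
When a managed table is dropped, Hive also deletes the data in the table.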

External tables

For example, create an external table that reads all comma-delimited data under the /data/stocks directory:

create external table if not exists stocks(
exchange1 string,
symbol string,
ymd string,
price_open float,
price_high float,
price_low float,
price_close float,
volume int,
price_adj_close float
)
row format delimited fields terminated by ','
location '/data/stocks';

Dropping an external table does not delete the data; it only removes the metadata that describes the table.
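
Before dropping a table, it can help to check whether it is managed or external. A short sketch, assuming the stocks table defined above exists:

```sql
-- The "Table Type" row of the output reads MANAGED_TABLE or EXTERNAL_TABLE.
describe formatted stocks;

-- For an external table this removes only the metadata;
-- the files under /data/stocks stay where they are.
drop table if exists stocks;
```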

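As with managed tables, you can copy the schema of an existing table without copying its data:

create external table if not exists mydb.employees3 like mydb.employees location '/path/to/data';

Here, if the external keyword is omitted and the source table is external, the new table is also external; if the external keyword is omitted and the source table is managed, the new table is also managed. If the statement includes the external keyword and the source table is managed, the new table is external. Even in that case, the location clause is still optional.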

Partitioned managed tables

Let us partition the data first by country and then by state:

create table employees(
name string,
salary float,
subordinates array<string>,
deductions map<string, float>,
address struct<street:string, city:string, state:string, zip:int>
)
partitioned by (country string, state string);

Run a query that filters on the partition columns:

select * from employees where country = 'US' and state = 'IL';

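Note: because the country and state values are already encoded in the directory names, there is no need to store them in the data files under those directories.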
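
Although the partition values are not stored in the data files, they can be selected and filtered like ordinary columns. A small sketch against the employees table defined above (the limit value is arbitrary):

```sql
-- country and state come from the directory names, not from the files,
-- yet they appear in the result like any other column.
select name, salary, country, state
from employees
where country = 'US' and state = 'IL'
limit 10;
```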

It is advisable to run Hive in "strict" mode, which refuses to submit a query against a partitioned table when the where clause has no partition filter.
Enable strict mode:

hive> set hive.mapred.mode=strict;

Switch back to nonstrict mode:

hive> set hive.mapred.mode=nonstrict;
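
The effect of the two modes can be sketched as follows, using the employees table from above. The rejected query is left commented out; the exact error text depends on the Hive version and is therefore not shown:

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode: a partitioned table is queried without a partition filter.
-- select e.name, e.salary from employees e limit 100;

-- Accepted: the where clause filters on the partition column country.
select e.name, e.salary from employees e where e.country = 'US' limit 100;
```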

The show partitions command lists all partitions of a table:

hive> show partitions employees;

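If a table has many partitions and you only want to check whether partitions exist for a particular partition key value, add a partition clause that fixes one or more partition column values to filter the output: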

hive> show partitions employees partition(country='US');

describe extended employees also displays the partition keys:

hive> describe extended employees;

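For a managed table, you can create partitions by loading data. For example, loading data from a local directory ($HOME/california-employees) into the table creates the US/CA partition. You must specify a value for every partition column. Note how the HOME environment variable is referenced in HiveQL: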

load data local inpath '${env:HOME}/california-employees' into table employees partition(country = 'US', state = 'CA');

Hive creates the directory for this partition, .../employees/country=US/state=CA, and copies the files under ${env:HOME}/california-employees into that directory.
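
One way to check that the load created the partition is to list it and count its rows. This is only a sketch and assumes the load statement above has completed:

```sql
-- Should list country=US/state=CA once the load has finished.
show partitions employees partition(country = 'US', state = 'CA');

-- Counts only the rows stored in that one partition.
select count(*) from employees where country = 'US' and state = 'CA';
```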

External partitioned tables

create external table if not exists log_message(
hms int,
severity string,
server string,
process_id int,
message string
)
partitioned by (year int, month int, day int)
row format delimited fields terminated by '\t';

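A nonpartitioned external table, such as stocks above, is normally created with a location clause, whereas an external partitioned table does not need one at creation time. Instead, partitions are added one by one with alter table statements, each of which must specify a value for every partition key.

Add a partition for March 7, 2019:

alter table log_message add partition(year=2019, month=3, day=7) location 'hdfs://liuyazhuang11:9000/data/log_message/2019/03/07';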

Each day, a procedure like the following can move data older than one month to S3.
(1) Copy the partition's data to S3:

hadoop distcp /data/log_message/2019/03/07 s3n://ourbucket/logs/2019/03/07

(2) Alter the table so that the partition points to the S3 path:

alter table log_message partition(year=2019, month=3, day=7) set location 's3n://ourbucket/logs/2019/03/07';

(3) Delete the partition's data from HDFS:

hadoop fs -rm -r /data/log_message/2019/03/07

Dropping an external partitioned table likewise removes only the metadata.

List the partitions of an external partitioned table:

show partitions log_message;

Show the path where a partition's data is stored:

desc extended log_message partition (year=2019, month=3, day=7);

Customizing table storage formats
A table's storage format is specified with stored as ..., for example:

stored as textfile
stored as sequencefile
stored as rcfile

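You can also specify third-party input and output formats and SerDes, which lets Hive handle a wide range of file formats that it does not support natively.
For example, create a table with a custom SerDe, input format, and output format: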

create table kst
partitioned by (ds string)
row format serde 'com.linkedin.haivvreo.AvroSerde'
with serdeproperties ('schema.url'='http://schema_provider/kst.avsc')
stored as
inputformat 'com.linkedin.haivvreo.AvroContainerInputFormat'
outputformat 'com.linkedin.haivvreo.AvroContainerOutputFormat';

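row format serde ... specifies the SerDe to use. The with serdeproperties clause passes configuration to the SerDe; both the property names and the values must be quoted strings.
The stored as inputformat ... outputformat ... clause specifies the Java classes used for the input format and the output format. If you specify one of them, you must specify both.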

create external table if not exists stocks(
exchange1 string,
symbol string,
ymd string,
price_open float,
price_high float,
price_low float,
price_close float,
volume int,
price_adj_close float
)
clustered by (exchange1, symbol)
sorted by (ymd asc)
into 96 buckets
row format delimited fields terminated by ','
location '/data/stocks';

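The clustered by ... into ... buckets clause can be followed by an optional sorted by ... clause, which optimizes certain kinds of queries.

Altering tables
alter table changes only the table metadata; the table data itself is not modified. It is up to you to make sure that every change is consistent with the actual data.

Renaming a table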

alter table log_message rename to logmsgs;

Adding, modifying, and dropping table partitions

The alter table xxx add partition ... statement adds a new partition to a table (usually an external table).
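
Several partitions can be added in one statement, and if not exists avoids an error when a partition is already present. A sketch against the log_message table; the dates and HDFS paths below are hypothetical:

```sql
-- Each partition clause names a value for every partition key
-- and may carry its own location (paths here are made up).
alter table log_message add if not exists
partition (year = 2019, month = 3, day = 8) location '/data/log_message/2019/03/08'
partition (year = 2019, month = 3, day = 9) location '/data/log_message/2019/03/09';
```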

Change the path of a partition:

alter table log_message partition (year=2019, month=3, day=7) set location 's3n://ourbucket/logs/2019/03/07';

This command neither moves the data away from the old path nor deletes the old path.

Drop a partition:

alter table log_message drop if exists partition (year=2019, month=3, day=7);

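For a managed table, dropping a partition deletes its data together with the metadata, even if the partition was added with an alter table ... add partition statement. For an external table, the data in the partition is not deleted.

Changing column information
Rename a column and change its position, type, or comment: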

alter table log_message
change column hms hours_minutes_seconds int
comment 'The hours, minutes, and seconds part of the timestamp'
after severity;

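Even if the column name or type does not change, you must still give the old column name, the new column name, and the new column type in full. This example moves the column to just after the severity column. To move a column to the first position instead, use the first keyword in place of the after other_column clause.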
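
As a sketch of the first variant, the statement below moves a column to the front without renaming it. It assumes the column is still called hms, that is, the rename in the example above has not been run:

```sql
-- Old name, new name, and type are all required even though only the position changes.
alter table log_message
change column hms hms int
first;
```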

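As usual, this command changes only the metadata. If you move a column, the data must already match the new schema, or you must modify the data by some other means so that it does.

Add columns: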

alter table log_message add columns(
app_name string comment 'application name',
session_id bigint comment 'the current session id'
);

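If one or more of the new columns end up in the wrong position, move them one at a time with alter table table_name change column statements.

Dropping or replacing columns
Remove all existing columns and define a new set of columns: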

alter table log_message replace columns(
hours_minutes_seconds int comment 'hour, minute, seconds from timestamp',
severity string comment 'The message severity',
message string comment 'The rest of the message'
);

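The replace statement can only be used on tables that use one of the two built-in SerDe modules, DynamicSerDe or MetadataTypedColumnsetSerDe.

Changing table properties
You can add table properties or change existing ones but, in the older Hive versions described here, you cannot delete them: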

alter table log_message set tblproperties('notes'='The process id is no longer captured; this column is always NULL');
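
To check that the property was stored, show tblproperties can list all properties of a table or look up a single key. A brief sketch using the notes property set above:

```sql
-- Lists every property of the table.
show tblproperties log_message;

-- Prints only the value stored under the notes key.
show tblproperties log_message('notes');
```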

Changing storage properties
Change the storage format of one partition to sequence file:

alter table log_message
partition (year=2019, month=3, day=7)
set fileformat sequencefile;

Use a Java class named com.example.JSONSerDe to process files whose records are JSON-encoded:

alter table table_using_JSON_storage
set SERDE 'com.example.JSONSerDe'
with serdeproperties(
'prop1' = 'value1',
'prop2' = 'value2');

Here, both the property names and the property values must be quoted.

Add new serdeproperties to a SerDe that is already in place:

alter table table_using_JSON_storage
set serdeproperties (
'prop3' = 'value3',
'prop4' = 'value4');

Miscellaneous alter table statements
Trigger hook functions:

alter table log_message touch partition (year=2019, month=3, day=7);

Pack the files in a partition into a Hadoop archive (HAR) file:

alter table log_message archive partition (year=2019, month=3, day=7);

Unpack the HAR file in a partition:

alter table log_message unarchive partition (year=2019, month=3, day=7);

Protect a partition from being dropped:

alter table log_message partition (year=2019, month=3, day=7) enable no_drop;

Prevent a partition from being queried:

alter table log_message partition (year=2019, month=3, day=7) enable offline;

Replacing enable with disable reverses either operation.
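
Applied to the two statements above, the reverse operations would look like this:

```sql
-- Allow the partition to be dropped again.
alter table log_message partition (year=2019, month=3, day=7) disable no_drop;

-- Allow the partition to be queried again.
alter table log_message partition (year=2019, month=3, day=7) disable offline;
```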