工作中发现很多同事连基础的hive命令都不知道，所以准备写一个系列把hive一些常用的命令进行一个总结。第一个讲的命令是MSCK REPAIR TABLE。

MSCK REPAIR TABLE 命令是做啥的

MSCK REPAIR TABLE命令主要是用来解决通过hdfs dfs -put或者hdfs api写入hive分区表的数据在hive中无法被查询到的问题。

我们知道hive有个服务叫metastore，这个服务主要是存储一些元数据信息，比如数据库名，表名或者表的分区等等信息。如果不是通过hive的insert等插入语句，很多分区信息在metastore中是没有的，如果插入分区数据量很多的话，你用 ALTER TABLE table_name ADD PARTITION 一个个分区添加十分麻烦。这时候MSCK REPAIR TABLE就派上用场了。只需要运行MSCK REPAIR TABLE命令，hive就会去检测这个表在hdfs上的文件，把没有写入metastore的分区信息写入metastore。

例子

我们先创建一个分区表，然后往其中的一个分区插入一条数据，在查看分区信息

CREATE TABLE repair_test (col_a STRING) PARTITIONED BY (par STRING);
INSERT INTO TABLE repair_test PARTITION(par="partition_1") VALUES ("test");
SHOW PARTITIONS repair_test;

查看分区信息的结果如下

0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO  : Compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.029 seconds
INFO  : Executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.017 seconds
INFO  : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
+------------------+--+
1 row selected (0.073 seconds)
0: jdbc:hive2://localhost:10000>

然后我们通过hdfs的put命令手动创建一个数据

[ericsson@h3cnamenode1 pcc]$ echo "123123" > test.txt
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -mkdir -p /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -put -f test.txt /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -ls -R /user/hive/warehouse/test.db/repair_test
drwxrwxrwt   - ericsson hive          0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1
drwxrwxrwt   - ericsson hive          0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1
drwxrwxrwt   - ericsson hive          0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1/-ext-10000
-rwxrwxrwt   3 ericsson hive          5 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/000000_0
drwxr-xr-x   - ericsson hive          0 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2
-rw-r--r--   3 ericsson hive          7 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2/test.txt
[ericsson@h3cnamenode1 pcc]$

这时候我们查询分区信息，发现partition_2这个分区并没有加入到hive中

0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO  : Compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.029 seconds
INFO  : Executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.02 seconds
INFO  : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
+------------------+--+
1 row selected (0.079 seconds)
0: jdbc:hive2://localhost:10000>

运行MSCK REPAIR TABLE 命令后再查询分区信息,可以看到通过put命令放入的分区已经可以查询了

0: jdbc:hive2://localhost:10000> MSCK REPAIR TABLE repair_test;
INFO  : Compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.004 seconds
INFO  : Executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.138 seconds
INFO  : OK
No rows affected (0.154 seconds)
0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO  : Compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.045 seconds
INFO  : Executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.016 seconds
INFO  : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
| par=partition_2  |
+------------------+--+
2 rows selected (0.088 seconds)
0: jdbc:hive2://localhost:10000> select * from repair_test;
INFO  : Compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:repair_test.col_a, type:string, comment:null), FieldSchema(name:repair_test.par, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.059 seconds
INFO  : Executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO  : Completed executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.001 seconds
INFO  : OK
+--------------------+------------------+--+
| repair_test.col_a  | repair_test.par  |
+--------------------+------------------+--+
| test               | partition_1      |
| 123123             | partition_2      |
+--------------------+------------------+--+
2 rows selected (0.121 seconds)
0: jdbc:hive2://localhost:10000>

后续

后面发生了更有意思的事情。大致情况是很多人以为alter table drop partition只能删除一个分区的数据，结果用hdfs dfs -rmr 删除hive分区表的hdfs文件。这就导致了一个问题hdfs上的文件虽然删除了，但是hive metastore中的原信息没有删除。如果用show parttions table_name 这些分区信息还在，需要把这些分区原信息清除。

后来我想看看MSCK REPAIR TABLE这个命令能否删除已经不存在hdfs上的表分区信息，发现不行，我去jira查了下，发现Fix Version/s: 3.0.0, 2.4.0, 3.1.0 这几个版本的hive才支持这个功能。但由于我们的hive版本是1.1.0-cdh5.11.0，这个方法无法使用。

附上官网的链接
Recover Partitions (MSCK REPAIR TABLE)

Recover Partitions (MSCK REPAIR TABLE)

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.
However, users can run a metastore check command with the repair table option:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for MSC command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once. MSCK command without the REPAIR option can be used to find details about metadata mismatch metastore.
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE table_name RECOVER PARTITIONS;
Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work.

HIVE-17824 是关于hive msck repair 增加清理metastore中已经不在hdfs上的分区信息

作者：润土1030
链接：https://www.jianshu.com/p/c1b0dc86f9b0
来源：简书
著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。

HIVE常用命令之MSCK REPAIR TABLE命令简述相关推荐

HIVE常用命令之MSCK REPAIR TABLE
目录 MSCK REPAIR TABLE 命令是做啥的例子后续 MSCK REPAIR TABLE 命令是做啥的 MSCK REPAIR TABLE命令主要是用来解决通过hdfs dfs -put ...
Hive 修复分区 MSCK REPAIR TABLE的使用
因为昨天工作的时候踩了坑,所以来记录一下.(我的问题是:我把hive表手动删掉 ,后来重新创建了一个一样的表,然后原有的分区数据全部损坏了,数据导不进去了) 一.msck repair table ...
Hive有分区文件到时select不到数据问题-----修复分区命令 msck repair table xxxxx
问题:在导数据到hive分区表时, 手动把HDFS路径建好了,然后把对应的文件添加到路径下. 这时用select语句查询却查不到数据. 原因:虽然分区文件有了,但是分区信息没有添加到hive元数据表中 ...
HIVE——常用sql命令总结
文章目录 hive常用交互命令 `-e`执行sql `-f`执行脚本中sql语句 hive cli命令行窗口操作hdfs 查看hive中输入的所有历史命令库创建库查看库使用库修改库删除库 ...
hive执行msck repair报错msck is missing partition columns under hdfs：//表分区路径
排查: 查看hiveserver日志报以下异常 msck is missing partition columns under hdfs://表分区路径查看hdfs该表分区目录,存在分区=$%xxx ...
MySQL建表(create table)命令详解
MySQL建表(create table)命令详解 create table命令强调:使用建表命令之前必须使用use命令选择表所在的数据库.create table命令的格式如下: create t ...
Hive 常用指令记录
一.Hive基本概念 1.1 hive是什么 hive是基于hadoop的一个数仓分析工具,hive可以将hdfs上存储的结构化的数据,映射成一张表,然后让用户写HQL(类SQL)来分析数据 tel ...
hive动态分区，分区数据的几种插入方式，hive常用优化
首先列举下hive分区插入的方式: 1:从文件导入数据到hive指定分区方式 load data local inpath 'filepath' into table tableName partit ...
hive 的drop table命令出错
一前言今天遇到如题所示问题,出错内容提示如下 FAILED: Error in metadata: javax.jdo.JDODataStoreException: Error(s) were ...

HIVE常用命令之MSCK REPAIR TABLE命令简述

MSCK REPAIR TABLE 命令是做啥的

例子

后续

HIVE常用命令之MSCK REPAIR TABLE命令简述相关推荐

最新文章

热门文章