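Common HBase Operations

HBase supports only row-level transactions; multi-row transactions are not supported.

Enter the shell: hbase shell

After configuring a distributed ZooKeeper (zk):

Start only the HMaster: hbase-daemon.sh start master

A store file (HFile) that grows past a size threshold triggers a region split; the original note gives roughly one billion bytes as the default (the setting is hbase.hregion.max.filesize, and its default depends on the HBase version). HBase is a versioned database.

Default write buffer size: 2 MB. It can also be set through the API with HTable.setWriteBufferSize().

The code above fails at run time. The reason: scan returns an iterator, and the iterator's cursor points at a single cell.

Cell: in an HBase table, a cell is identified by rowkey + (column family:column qualifier) + version. The user's data is stored in the cell; its value is of type byte[] and has to be converted to the required type on the client.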
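The cell coordinates and the client-side conversion can be sketched with plain JDK classes. This is only a model: CellCoordinate is a hypothetical class, not part of the HBase API, and it assumes the value was written as UTF-8 text (as the Put examples in this document do with Bytes.toBytes("30")).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical plain-JDK model of a cell's coordinates; not an HBase class.
final class CellCoordinate {
    final String rowKey;
    final String family;
    final String qualifier;
    final long version;

    CellCoordinate(String rowKey, String family, String qualifier, long version) {
        this.rowKey = rowKey;
        this.family = family;
        this.qualifier = qualifier;
        this.version = version;
    }

    // Renders the coordinate as rowkey/family:qualifier/version.
    String describe() {
        return rowKey + "/" + family + ":" + qualifier + "/" + version;
    }

    // The stored value is raw bytes; the client decides how to decode it.
    static String asString(byte[] value) {
        return new String(value, StandardCharsets.UTF_8);
    }

    // Assumes the number was stored as text, e.g. the bytes of "30".
    static int asInt(byte[] value) {
        return Integer.parseInt(asString(value));
    }
}
```

If the writer had stored a binary integer instead of text, the reader would need the matching binary decoding; the cell itself carries no type information.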

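So to reach the next column you have to advance the iterator each time. In the HBase storage model, however, column names are kept sorted, so the first column returned holds the value of age, whereas the code above asked for no; that is why the NullPointerException occurred.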
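The ordering effect can be reproduced without a cluster. The sketch below is a plain-JDK model, assuming qualifiers are compared byte by byte as unsigned values; the class name and the stored values are made up for illustration, and Arrays.compareUnsigned requires Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical model of one row whose columns are kept sorted by qualifier bytes.
final class QualifierOrder {
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Builds a row in "insertion order" no, age; the map keeps it sorted.
    static TreeMap<byte[], String> sortedRow() {
        Comparator<byte[]> byUnsignedBytes = Arrays::compareUnsigned;
        TreeMap<byte[], String> row = new TreeMap<>(byUnsignedBytes);
        row.put(bytes("no"), "001");   // illustrative value
        row.put(bytes("age"), "12");   // illustrative value
        return row;
    }

    // The first column an iterator would see.
    static String firstQualifier() {
        return new String(sortedRow().firstKey(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(firstQualifier());
    }
}
```

Because "age" sorts before "no", code that assumes the first cell belongs to no reads the wrong column.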

HBase cannot run the following kind of query directly:

You have to use a filter.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

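There is no value; the value length is 0.

Users can define their own filter by extending FilterBase. Build it into a jar with mvn clean package -DskipTests and distribute the jar to hbase/lib.

Priority determines the order in which coprocessors run; when priorities are equal, the sequence number decides. CoprocessorHost: provides common runtime services for coprocessors.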
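The ordering rule can be written as a two-level comparison. The sketch below is a plain-JDK model, not HBase code; Entry and its fields are hypothetical, and it assumes that a lower priority number runs first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of coprocessor ordering: priority first, then sequence number.
final class CoprocessorOrder {
    static final class Entry {
        final String name;
        final int priority;
        final int sequence;

        Entry(String name, int priority, int sequence) {
            this.name = name;
            this.priority = priority;
            this.sequence = sequence;
        }
    }

    // Returns the names in the order the coprocessors would be invoked.
    static List<String> executionOrder(List<Entry> loaded) {
        List<Entry> sorted = new ArrayList<>(loaded);
        sorted.sort(Comparator.comparingInt((Entry e) -> e.priority)
                .thenComparingInt(e -> e.sequence));
        List<String> names = new ArrayList<>();
        for (Entry e : sorted) {
            names.add(e.name);
        }
        return names;
    }
}
```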

The configuration file lists the fully qualified class names of system or custom coprocessors.

log() is a custom method that writes to the log.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

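Place it in META-INF/persistence.xml

"ns1:t1" is the table name

Create the entity class:

The second argument after move is 'host,port,startcode'

compact: a physical operation that merges the store files on disk

assign has no force option, while unassign can take force. Specify the region name or the encoded region name. A region taken offline this way can still be served again afterwards; taking a region offline and bringing it online is similar to move.

HBase MR wordcount, with HBase as the data source and HDFS as the output destination:

HBase as the output path:

Memory tuning: the java -X options

When compression is used, every node must have the corresponding compression library installed

The trailing 3 is the number of regions the whole table is split into

You can also use hbase> create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30'] (f1 is a placeholder column family name)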
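To make the effect of split points concrete: three split points such as 10, 20 and 30 produce four regions, each covering a half-open key range. The sketch below is a plain-JDK model with hypothetical names, and it compares keys as Java strings, which matches byte order only for ASCII keys.

```java
import java.util.Arrays;

// Hypothetical model of pre-split regions; not HBase code.
final class PreSplit {
    // n split points give n + 1 regions.
    static int regionCount(String[] splitPoints) {
        return splitPoints.length + 1;
    }

    // Index of the region whose [startKey, endKey) range contains rowKey.
    // splitPoints must be sorted ascending.
    static int regionIndex(String[] splitPoints, String rowKey) {
        int pos = Arrays.binarySearch(splitPoints, rowKey);
        // An exact hit means rowKey is the start key of the next region.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

For example, with split points 10, 20, 30 the row key 15 falls into region 1 (range [10, 20)), and the row key 30 opens the last region.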

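We generally do not turn off the write-ahead log.

3. Implementing an Observer coprocessor

An Observer is comparatively simple to implement: only the server-side logic is needed. We implement a RegionObserver to get a better understanding.

The functionality to implement:

Assume a table has two columns, A and B; for ease of description we call it coprocessor_table.
1. When data is inserted into column A, the coprocessor also inserts it into column B.
2. When reading, clients may read only column B and not column A. In other words, A is write-only and B is read-only. (For simplicity, users must specify the column name when reading.)
3. Values in column A must be integers; consequently values in column B are integers too.
4. A delete must not target column B directly.
5. Deleting column A must also delete column B.
6. Deletes of other columns are not checked.

With the requirements settled, we can start implementing them. Fortunately the HBase API provides BaseRegionObserver, which already supplies default implementations for most methods, so we only need to override the methods our logic cares about.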

Code skeleton:

public class CoprocessorClassName extends BaseRegionObserver {
    private static final Log LOG = LogFactory.getLog(CoprocessorClassName.class);

    private RegionCoprocessorEnvironment env = null;

    // Coprocessors run inside a region; every region loads the coprocessor.
    // This method runs when the region server opens the region (before it is actually open).
    @Override
    public void start(CoprocessorEnvironment e) throws IOException {
        env = (RegionCoprocessorEnvironment) e;
    }

    // This method runs when the region server closes the region (before it is actually closed).
    @Override
    public void stop(CoprocessorEnvironment e) throws IOException {
        // nothing to do here
    }

    /**
     * Hook points: override methods such as prePut or postPut here to add
     * logic before and after data is inserted.
     */
    // @Override hook methods go here
}

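Business logic implementation:

Based on the requirements and the skeleton above, the concrete logic is as follows.

Inserts have to be checked first, so prePut is overridden.
Deletes have to be checked first, so preDelete is overridden.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class MyRegionObserver extends BaseRegionObserver {
    private static final Log LOG = LogFactory.getLog(MyRegionObserver.class);

    private RegionCoprocessorEnvironment env = null;

    // Only columns under family F are handled; column A is write-only, column B is read-only.
    private static final String FAMAILLY_NAME = "F";
    private static final String ONLY_PUT_COL = "A";
    private static final String ONLY_READ_COL = "B";

    // Coprocessors run inside a region; every region loads the coprocessor.
    // This method runs when the region server opens the region (before it is actually open).
    @Override
    public void start(CoprocessorEnvironment e) throws IOException {
        env = (RegionCoprocessorEnvironment) e;
    }

    // This method runs when the region server closes the region (before it is actually closed).
    @Override
    public void stop(CoprocessorEnvironment e) throws IOException {
        // nothing to do here
    }

    /**
     * Requirements: 1. column B must not be written 2. only column A may be written
     * 3. the value must be an integer 4. writing column A also writes column B
     */
    @Override
    public void prePut(final ObserverContext<RegionCoprocessorEnvironment> e,
            final Put put, final WALEdit edit, final Durability durability)
            throws IOException {
        // First check whether this put writes to the read-only column.
        List<Cell> cells = put.get(Bytes.toBytes(FAMAILLY_NAME),
                Bytes.toBytes(ONLY_READ_COL));
        if (cells != null && cells.size() != 0) {
            LOG.warn("User is not allowed to write read_only col.");
            throw new IOException("User is not allowed to write read_only col.");
        }

        // Check column A.
        cells = put.get(Bytes.toBytes(FAMAILLY_NAME),
                Bytes.toBytes(ONLY_PUT_COL));
        if (cells == null || cells.size() == 0) {
            // No operation on column A: nothing to do, let the put through.
            LOG.info("No A col operation, just do it.");
            return;
        }

        // Column A is present: check that the value is an integer.
        byte[] aValue = null;
        for (Cell cell : cells) {
            try {
                aValue = CellUtil.cloneValue(cell);
                LOG.warn("aValue = " + Bytes.toString(aValue));
                Integer.valueOf(Bytes.toString(aValue));
            } catch (Exception e1) {
                LOG.warn("Can not put un number value to A col.");
                throw new IOException("Can not put un number value to A col.");
            }
        }

        // Everything is fine: build the value for column B, because writing
        // column A must also write column B.
        LOG.info("B col also been put value!");
        put.addColumn(Bytes.toBytes(FAMAILLY_NAME),
                Bytes.toBytes(ONLY_READ_COL), aValue);
    }

    /**
     * Requirements: 1. column B must not be deleted 2. only column A may be deleted
     * 3. deleting column A also deletes column B
     */
    @Override
    public void preDelete(final ObserverContext<RegionCoprocessorEnvironment> e,
            final Delete delete, final WALEdit edit, final Durability durability)
            throws IOException {
        // First check whether column B is explicitly targeted by the delete.
        List<Cell> cells = delete.getFamilyCellMap().get(Bytes.toBytes(FAMAILLY_NAME));
        if (cells == null || cells.size() == 0) {
            // The client does not touch family FAMAILLY_NAME; let the delete continue.
            LOG.info("NO F famally operation ,just do it.");
            return;
        }

        // Inspect the operations inside family F.
        byte[] qualifierName = null;
        boolean aDeleteFlg = false;
        for (Cell cell : cells) {
            qualifierName = CellUtil.cloneQualifier(cell);

            // Deleting column B is not allowed.
            if (Arrays.equals(qualifierName, Bytes.toBytes(ONLY_READ_COL))) {
                LOG.info("Can not delete read only B col.");
                throw new IOException("Can not delete read only B col.");
            }

            // Check whether column A is being deleted.
            if (Arrays.equals(qualifierName, Bytes.toBytes(ONLY_PUT_COL))) {
                LOG.info("there is A col in delete operation!");
                aDeleteFlg = true;
            }
        }

        // If column A is deleted, column B has to be deleted as well.
        if (aDeleteFlg) {
            LOG.info("B col also been deleted!");
            delete.addColumn(Bytes.toBytes(FAMAILLY_NAME), Bytes.toBytes(ONLY_READ_COL));
        }
    }
}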

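4. Uploading and loading the Observer coprocessor

Once the implementation is complete, package the coprocessor class into a jar file. There are usually three ways to load a coprocessor:

1. Configuration file: load it through hbase-site.xml. Coprocessors loaded this way are usually system-level, global coprocessors, for example permission checks.

2. Shell: use the alter command to modify the table schema and load the coprocessor.

3. API: load the coprocessor from code through the client API.

Of these, methods 1 and 3 require the coprocessor jar to be on the HBase classpath of the cluster, whereas method 2 only requires uploading the jar to HDFS in the cluster environment.

Only method 2 is described below.

Step 1: create the table as follows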

hbase(main):001:0> create 'coprocessor_table','F'
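0 row(s) in 2.7570 seconds
=> Hbase::Table - coprocessor_table

Step 2: load the coprocessor into the table with the alter command

alter 'coprocessor_table' , METHOD =>'table_att','coprocessor'=>'hdfs://ns1/testdata/Test-HBase-Observer.jar|cn.com.newbee.feng.MyRegionObserver|1001'

Here the format is: 'coprocessor'=>'absolute path of the jar on HDFS|coprocessor main class|priority|coprocessor arguments'

The coprocessor above takes no arguments, so none are given. Coprocessor priority is not discussed here.

Step 3: check that the coprocessor was loaded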

hbase(main):021:0> describe 'coprocessor_table'
Table coprocessor_table is ENABLED
coprocessor_table, {TABLE_ATTRIBUTES => {coprocessor$1 => 'hdfs://ns1/testdata/T
est-HBase-Observer.jar|cn.com.newbee.feng.MyRegionObserver|1001'}
COLUMN FAMILIES DESCRIPTION
{NAME => 'F', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_S
COPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL =>'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'f
alse', BLOCKCACHE => 'true'}
1 row(s) in 0.0300 seconds
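The coprocessor attribute shown in the alter command and in the describe output is a pipe-separated string: jar path, class name, priority, and optional arguments. The sketch below splits such a string with plain JDK code. CoprocessorSpec is a hypothetical helper, not an HBase class, and it assumes the first three fields are always present.

```java
// Hypothetical parser for the 'path|class|priority|args' attribute value.
final class CoprocessorSpec {
    final String jarPath;
    final String className;
    final int priority;
    final String args;

    private CoprocessorSpec(String jarPath, String className, int priority, String args) {
        this.jarPath = jarPath;
        this.className = className;
        this.priority = priority;
        this.args = args;
    }

    static CoprocessorSpec parse(String spec) {
        // A limit of -1 keeps trailing empty fields.
        String[] parts = spec.split("\\|", -1);
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected path|class|priority[|args]: " + spec);
        }
        String args = parts.length > 3 ? parts[3] : "";
        return new CoprocessorSpec(parts[0].trim(), parts[1].trim(),
                Integer.parseInt(parts[2].trim()), args);
    }
}
```

Parsing the value used in this walkthrough yields the jar path hdfs://ns1/testdata/Test-HBase-Observer.jar, the class cn.com.newbee.feng.MyRegionObserver, priority 1001 and empty arguments.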

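The table has loaded the coprocessor successfully. Internally, every region of the table is updated so that it loads the coprocessor.

5. Testing the Observer coprocessor

With the code written and loaded, we run a few simple tests. An Observer needs no custom client code, so we can test directly from the shell.

Case 1: normal insert into column A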

hbase(main):024:0> scan 'coprocessor_table'
ROW                   COLUMN+CELL
0 row(s) in 0.0100 seconds
hbase(main):025:0> put 'coprocessor_table','row1','F:A',123
0 row(s) in 0.0210 seconds
hbase(main):026:0> scan 'coprocessor_table'
ROW                   COLUMN+CELL
 row1                 column=F:A, timestamp=1469838240645, value=123
 row1                 column=F:B, timestamp=1469838240645, value=123
1 row(s) in 0.0180 seconds

Result: column B was inserted as well, OK

Case 2: insert into column A with a non-integer value

hbase(main):027:0> put 'coprocessor_table','row1','F:A','cc'
ERROR: Failed 1 action: IOException: 1 time,

Result: the insert fails and the server logs the following error, OK

2016-07-29 20:25:45,406 WARN  [B.defaultRpcServer.handler=3,queue=0,port=60020] feng.MyRegionObserver: Can not put un number value to A col.

Case 3: insert into column B

hbase(main):028:0> put 'coprocessor_table','row1','F:B',123
ERROR: Failed 1 action: IOException: 1 time,

Result: the insert fails and the server logs the following error, OK

2016-07-29 20:27:13,342 WARN  [B.defaultRpcServer.handler=20,queue=2,port=60020] feng.MyRegionObserver: User is not allowed to write read_only col.

Case 4: delete column B

hbase(main):029:0> delete 'coprocessor_table','row1','F:B'
ERROR: java.io.IOException: Can not delete read only B col.

Result: the delete fails, OK

Case 5: delete column A

hbase(main):030:0> scan 'coprocessor_table'
ROW                   COLUMN+CELL
 row1                 column=F:A, timestamp=1469838240645, value=123
 row1                 column=F:B, timestamp=1469838240645, value=123
1 row(s) in 0.0230 seconds
hbase(main):031:0> delete 'coprocessor_table','row1','F:A'
0 row(s) in 0.0060 seconds
hbase(main):032:0> scan 'coprocessor_table'
ROW                   COLUMN+CELL
0 row(s) in 0.0070 seconds

Result: columns A and B were deleted together.

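For details on installing HBase, see http://abloz.com/hbase/book.html and the official site http://hbase.apache.org/

1. Quick single-node installation

This section shows how to install HBase on a single machine. It walks you through creating a table in the shell, inserting a row, deleting the table, and finally stopping HBase. The steps below take about ten minutes.

1.1 Download and unpack the latest release

Choose an Apache download mirror at http://www.apache.org/dyn/closer.cgi/hbase/ and download an HBase release. Click the stable directory, then download the file ending in .tar.gz, for example hbase-0.90.4.tar.gz.

A cluster will be installed later and integrated with Hadoop, so take care to pick the version that matches your Hadoop:

Choosing the Hadoop version is critical for an HBase deployment. The table below shows which Hadoop versions each HBase version supports; pick a suitable Hadoop version for your HBase version. We are not tied to a particular Hadoop distribution: you can use the Hadoop release from Apache, or look at the products of Hadoop vendors: http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support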

Table 2.1. Hadoop version support matrix

	HBase-0.92.x	HBase-0.94.x	HBase-0.96
Hadoop-0.20.205	S	X	X
Hadoop-0.22.x	S	X	X
Hadoop-1.0.x	S	S	S
Hadoop-1.1.x	NT	S	S
Hadoop-0.23.x	X	S	NT
Hadoop-2.x	X	S	S

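S = supported and tested

X = not supported

NT = not tested enough (it runs, but testing is insufficient)

Because HBase depends on Hadoop, it ships a Hadoop jar in its lib directory. The bundled jar is meant for standalone mode only. In distributed mode, the Hadoop version on the cluster must match the one under HBase. Replace the Hadoop jar in HBase's lib directory with the jar from the Hadoop version you actually run, to avoid version mismatches, and make sure you replace it on every HBase node in the cluster. Hadoop version mismatches show up in different ways, but they all look like a hang.

Installation: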

$ tar xfz hbase-0.90.4.tar.gz
$ cd hbase-0.90.4

You can now start HBase. You may first want to edit conf/hbase-site.xml and set hbase.rootdir to choose the directory HBase writes its data to.

For a single machine, hbase-site.xml only needs the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///DIRECTORY/hbase</value>
</property>
</configuration>

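Replace DIRECTORY with the directory you want HBase to write to. By default hbase.rootdir points to /tmp/hbase-${user.name}, which means you lose your data after a reboot (the operating system cleans /tmp on restart).

1.2. Start HBase

Now start HBase: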

$ ./bin/start-hbase.sh

starting Master, logging to logs/hbase-user-master-example.org.out

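You are now running HBase in standalone mode. All services, including HBase and ZooKeeper, run in a single JVM. HBase writes its logs to the logs directory; check them when startup goes wrong.

1.3. HBase shell exercise

Connect to HBase with the shell: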

$ ./bin/hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010

hbase(main):001:0>

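Type help and then <RETURN> to see a list of shell commands. The help is detailed; note that table names, rows and columns must be quoted.

Create a table named test with a single column family, cf. You can list the tables to check that it was created, then insert some values.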

hbase(main):003:0> create 'test', 'cf'

0 row(s) in 1.2200 seconds

hbase(main):003:0> list 'test'

test

1 row(s) in 0.0550 seconds

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'

0 row(s) in 0.0560 seconds

hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'

0 row(s) in 0.0370 seconds

hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'

0 row(s) in 0.0450 seconds

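Above we inserted three rows. The first row has the key row1, the column cf:a and the value value1. A column in HBase consists of a column family prefix and a column name, separated by a colon. For example, the column name in this row is a.

Check what was inserted.

Scan the table as follows: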

hbase(main):007:0> scan 'test'

ROW        COLUMN+CELL

row1       column=cf:a, timestamp=1288380727188, value=value1

row2       column=cf:b, timestamp=1288380738440, value=value2

row3       column=cf:c, timestamp=1288380747365, value=value3

3 row(s) in 0.0590 seconds

Get a single row as follows:

hbase(main):008:0> get 'test', 'row1'

COLUMN      CELL

cf:a        timestamp=1288380727188, value=value1

1 row(s) in 0.0400 seconds

Disable and then drop the table to clean up what you just did:

hbase(main):012:0> disable 'test'

0 row(s) in 1.0930 seconds

hbase(main):013:0> drop 'test'

0 row(s) in 0.0770 seconds

Exit the shell:

hbase(main):014:0> exit

1.4. Stop HBase

Run the stop script to stop HBase.

$ ./bin/stop-hbase.sh

stopping hbase...............

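2. Before installing an HBase cluster

1) Java (already installed together with Hadoop)

2) Hadoop 0.20.x / Hadoop-2.x is installed correctly and HDFS can be started. A Hadoop installation guide for reference: Hadoop cluster configuration (a comprehensive summary) http://blog.csdn.net/hguisu/article/details/7237395

3) ssh: ssh must be installed and sshd must be running, so that the Hadoop scripts can remotely control the other Hadoop and HBase processes. All nodes must be able to ssh to each other without a password; for details, search for "ssh passwordless login".

4) NTP: the cluster clocks must be roughly in sync. A small skew is tolerated, but a large skew causes strange behavior. Run NTP or some other tool to synchronize the time.

If queries behave oddly or you run into strange failures, check whether the system time is correct!

Set the clock on each cluster node: date -s "2012-02-13 14:00:00"

5) ulimit and nproc:

HBase is a database and uses many file handles at the same time. The default of 1024 on most Linux systems is not enough and leads to the exception described in FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs? Exceptions like the following may also occur: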

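2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

So you need to raise your maximum file handle limit; it can be set to 10k. You also need to raise nproc for the hbase user, since a value that is too low causes OutOfMemoryError exceptions.

To be clear, both settings belong to the operating system, not to HBase itself. A common mistake is that the user running HBase is not the user for whom the limits were raised. When HBase starts, the first line of its log shows the ulimit information, so it is worth checking.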

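First check the current user's ulimit:

ulimit -n

Setting ulimit:

On Ubuntu you can set it like this:

Add a line to /etc/security/limits.conf, for example:

hadoop - nofile 32768

Replace hadoop with the user that runs HBase and Hadoop. If you use two users, configure both. Also configure the nproc hard and soft limits, for example: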

hadoop soft/hard nproc 32000

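Add this line to /etc/pam.d/common-session:

session required pam_limits.so

Otherwise the settings in /etc/security/limits.conf do not take effect.

You also have to log out and log in again for these settings to take effect!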

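7) Raise the upper bound on the number of files an HDFS DataNode serves at the same time: dfs.datanode.max.xcievers

A Hadoop HDFS DataNode has an upper bound on the number of files it serves at any one time. The parameter is called xcievers (the Hadoop authors misspelled the word). Before loading any data, make sure the xcievers parameter is set in conf/hdfs-site.xml to at least 4096:

<property>

<name>dfs.datanode.max.xcievers</name>

<value>4096</value>

</property>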

Remember to restart HDFS after changing its configuration.

Without this setting you may run into strange failures. The DataNode log contains xcievers exceeded, but at run time the error shows up as missing blocks. For example: 02/12/12 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...

8) Notes carried over from the Hadoop installation:

/etc/hosts on every machine:

10.64.56.74 node2 (master)

10.64.56.76 node1 (slave)

10.64.56.77 node3 (slave)

9) Keep using the hadoop user for the installation

chown -R hadoop /usr/local/hbase

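3. Distributed mode configuration

3.1 Configure `conf/hbase-env.sh`

# export JAVA_HOME=/usr/java/jdk1.6.0/

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26

# Tell HBase whether it should manage it's own instance of Zookeeper or not.

export HBASE_MANAGES_ZK=true

Whatever the mode, you need to edit conf/hbase-env.sh to tell HBase where Java is installed. In this file you can also set up the HBase runtime environment, such as the heap size and other JVM options, the log file location, and so on. Set JAVA_HOME to the Java installation path.

A distributed HBase depends on a ZooKeeper ensemble. All nodes and clients must be able to reach ZooKeeper. By default HBase manages a ZooKeeper ensemble itself, starting it together with HBase. You can also manage a ZooKeeper ensemble yourself, but then HBase has to be configured accordingly: change HBASE_MANAGES_ZK in conf/hbase-env.sh. Its default is true, which makes HBase start ZooKeeper along with itself.

To make HBase use an existing ZooKeeper ensemble that it does not manage, set HBASE_MANAGES_ZK to false in conf/hbase-env.sh:

# Tell HBase whether it should manage it's own instance of Zookeeper or not.

export HBASE_MANAGES_ZK=false

3.2 Configure conf/hbase-site.xml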

<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://node1:49002/hbase</value>
<description>The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed Zookeeper
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
</description>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2222</value>
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node1,node2,node3</value>
<description>Comma separated list of servers in the ZooKeeper Quorum.
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
By default this is set to localhost for local and pseudo-distributed modes
of operation. For a fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
</description>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zookeeper</value>
<description>Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.
</description>
</property>
</configuration>

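To run in fully distributed mode, add the property hbase.cluster.distributed set to true and point hbase.rootdir at the HDFS NameNode. For example, if your NameNode runs on node1 at port 49002 and the directory you want is /hbase, use: hdfs://node1:49002/hbase

hbase.rootdir: the directory shared by the region servers, where HBase persists its data. The URL must be fully qualified and include the file system scheme. For example, to refer to the /hbase directory in HDFS with the NameNode running on node1 at port 49002, set it to hdfs://node1:49002/hbase. By default HBase writes to /tmp; if you leave this unchanged, data is lost on reboot. Default: file:///tmp/hbase-${user.name}/hbase

hbase.cluster.distributed: the mode HBase runs in. false is standalone mode, true is distributed mode. If false, HBase and ZooKeeper run in the same JVM.

Default: false

Configuring ZooKeeper in hbase-site.xml:

When HBase manages ZooKeeper, you can configure ZooKeeper by editing zoo.cfg;

a simpler way is to set the ZooKeeper options in conf/hbase-site.xml. ZooKeeper options are written as properties in hbase-site.xml.

At a minimum you must list the ZooKeeper ensemble servers in hbase-site.xml, in the property hbase.zookeeper.quorum. Its default is localhost, which clearly does not work for a distributed setup (remote connections cannot use it).

hbase.zookeeper.property.clientPort: a property from ZooKeeper's zoo.cfg. The port clients connect to.

hbase.zookeeper.quorum: the comma-separated list of servers in the ZooKeeper ensemble, for example "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". The default is localhost, which is meant for pseudo-distributed mode and must be changed for a fully distributed setup. If HBASE_MANAGES_ZK is set in hbase-env.sh, these ZooKeeper nodes are started together with HBase.

Default: localhost

Running a single ZooKeeper node works, but in production you should deploy 3, 5 or 7 nodes. More nodes give higher reliability. Use an odd number: an even number is possible but tolerates no more failures than the next lower odd number. Give each ZooKeeper node about 1 GB of memory and, if possible, its own disk (a dedicated disk keeps ZooKeeper fast). If your cluster is heavily loaded, do not run ZooKeeper on the same machines as RegionServers, just as with DataNodes and TaskTrackers.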
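The preference for odd ensemble sizes follows from majority arithmetic: an ensemble stays available only while a strict majority of its nodes is up. A small plain-JDK sketch (QuorumMath is a hypothetical name) makes the numbers explicit.

```java
// Majority arithmetic for a quorum-based ensemble of n nodes.
final class QuorumMath {
    // Smallest number of nodes that forms a strict majority.
    static int majority(int nodes) {
        return nodes / 2 + 1;
    }

    // Number of node failures the ensemble can survive.
    static int toleratedFailures(int nodes) {
        return nodes - majority(nodes);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 7; n++) {
            System.out.println(n + " nodes tolerate " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

Three nodes and four nodes both tolerate a single failure, so the fourth node adds cost without adding fault tolerance.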

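hbase.zookeeper.property.dataDir: a property from ZooKeeper's zoo.cfg. The directory where snapshots are stored.

Change the directory where ZooKeeper stores its data. The default is /tmp, which the operating system clears on reboot; change it to, for example, /home/hadoop/zookeeper (a path the hadoop user has permissions on).

For a standalone ZooKeeper, specify the ZooKeeper host and port. You can set them in hbase-site.xml, or add a zoo.cfg file to HBase's CLASSPATH. HBase prefers the settings in zoo.cfg, which override those in hbase-site.xml.

See http://www.yankay.com/wp-content/hbase/book.html#hbase_default_configurations and search for the hbase.zookeeper.property prefix to find the ZooKeeper-related settings.

3.3 Configure conf/regionservers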

Node1

Node2

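Fully distributed mode also requires editing conf/regionservers. List every host that should run an HRegionServer, one host per line (like the slaves file in Hadoop). The servers listed here are started and stopped together with the cluster.

3.4 Replacing the Hadoop jars

The basic HBase configuration is done.

Look in HBase's lib directory:

ls lib | grep hadoop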

hadoop-annotations-2.1.0-beta.jar
hadoop-auth-2.1.0-beta.jar
hadoop-client-2.1.0-beta.jar
hadoop-common-2.1.0-beta.jar
hadoop-hdfs-2.1.0-beta.jar
hadoop-hdfs-2.1.0-beta-tests.jar
hadoop-mapreduce-client-app-2.1.0-beta.jar
hadoop-mapreduce-client-common-2.1.0-beta.jar
hadoop-mapreduce-client-core-2.1.0-beta.jar
hadoop-mapreduce-client-jobclient-2.1.0-beta.jar
hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar
hadoop-mapreduce-client-shuffle-2.1.0-beta.jar
hadoop-yarn-api-2.1.0-beta.jar
hadoop-yarn-client-2.1.0-beta.jar
hadoop-yarn-common-2.1.0-beta.jar
hadoop-yarn-server-common-2.1.0-beta.jar
hadoop-yarn-server-nodemanager-2.1.0-beta.jar

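These jars are built against Hadoop 2.1.0, so we replace them with the jars from our Hadoop 2.2.0 to keep the versions consistent. The Hadoop jars all live under $HADOOP_HOME/share/hadoop.

First cd to /home/hadoop/hbase-0.96.0-hadoop2/lib and run rm -rf hadoop*.jar to delete all Hadoop jars, then run:

find /home/hadoop/hadoop-2.2.0/share/hadoop -name "hadoop*jar" | xargs -i cp {} /home/hadoop/hbase-0.96.0-hadoop2/lib/

This copies all Hadoop 2.2.0 jars into HBase so that the Hadoop versions match.

4. Running and verifying the installation

4.1 When HBase manages ZooKeeper

When HBase manages ZooKeeper, starting the ZooKeeper ensemble is part of the HBase start script.

First make sure HDFS is running. You can start it by running bin/start-dfs.sh under HADOOP_HOME. Test it by putting a file with put and reading it back with get. HBase normally does not run MapReduce, so there is no need to check that.

Start HBase with the following command:

bin/start-hbase.sh

This script is in the HBASE_HOME directory.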

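HBase is now running. It writes its logs to the logs subdirectory; look there when HBase has trouble starting.

HBase also has a web UI that lists important properties. By default it is on port 60010 of the Master (HBase RegionServers bind to port 60020 by default and serve an informational UI on port 60030). If the Master runs on node1 with the default port, open http://node1:60010 in a browser to see the main page.

Once HBase is up, see the shell exercise above for how to create a table, insert data, scan the table, disable it and finally drop it.

To stop HBase, run the stop script:

$ ./bin/stop-hbase.sh

stopping hbase...............

Stopping takes a while, and longer on larger clusters. If you are running a distributed operation, make sure Hadoop is not stopped before HBase has shut down completely.

4.2 Starting with a standalone ZooKeeper

Besides starting HBase,

by running bin/start-hbase.sh,

you need to run ZooKeeper yourself:

${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper

You can use this command to start ZooKeeper without starting HBase. With HBASE_MANAGES_ZK set to false, this is the way to keep ZooKeeper running across HBase restarts.

5. Testing

Use jps to check the processes. On the master:

On node2 and node3 (slave nodes):

Check port 60010 in a browser: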

1. Problems encountered during installation

1)

After starting HBase with ./start-hbase.sh, run hbase shell
# bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.6, rUnknown, Thu Oct 28 19:02:04 CST 2010
Creating a table then fails as follows:
hbase(main):001:0> create 'test', 'c'
NativeException: org.apache.hadoop.hbase.MasterNotRunningException: null

jps shows that HMaster did not start on the master node. The HBase log (logs/hbase-hadoop-master-ubuntu.log) contains the following exception:
FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Call to node1/10.64.56.76:49002 failed on local exception: java.io.EOFException

Fix:

Copy hadoop-core-0.20.203.0.jar from hadoop_home/ into hbase_home/lib.

HBase is built on top of Hadoop, so it uses hadoop.jar, which sits in its lib directory. That jar is a hadoop.jar that HBase built itself with the branch-0.20-append patches. The hadoop.jar used by Hadoop and the one used by HBase must be identical, so replace the hadoop.jar in HBase's lib directory with the one from Hadoop to prevent version conflicts. For example, the CDH release lacks HDFS-724 while branch-0.20-append has it, and the HDFS-724 patch changes the RPC protocol. Without the replacement there is a version conflict that causes serious errors, and Hadoop appears to hang.

After starting HBase with ./start-hbase.sh again, jps shows that HMaster still has not started on the master node. The HBase log contains the following exception:
FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
Fix:
The NoClassDefFoundError says that org/apache/commons/configuration/Configuration is missing,
so add the commons-configuration package:
copy commons-configuration-1.6.jar from hadoop_home/lib into hbase_home/lib.

(The HBase configuration must be the same on every machine in the cluster.)

Creating a table fails with:

ERROR: java.io.IOException: Table Namespace Manager not ready yet, try again later
at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3101)
at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1738)
at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1777)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38221)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2146)
at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1851)

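Fix:

1) On every machine in the cluster, check

whether the HRegionServer and HQuorumPeer processes are running.

2) Check the logs on every machine in the cluster for error messages:

tail -f hbase-hadoop-regionserver-XXX.log

2 Notes:

1) Start Hadoop first, then start HBase
2) Take Hadoop out of safe mode: hadoop dfsadmin -safemode leave
3) In /etc/hosts, change the IP for ubuntu to the server's current IP
4) Make sure that in HBase's hbase-site.xml
          <name>hbase.rootdir</name>
                 <value>hdfs://node:49002/hbase</value>
         and in Hadoop's core-site.xml
                   <name>fs.default.name</name>
                  <value>hdfs://node:49002/hbase</value>
       the hdfs://host:port part (shown in red in the original) is the same

<value>hdfs://localhost:8020/hbase</value>

Otherwise you get the error: java.lang.RuntimeException: HMaster Aborted

6) Before running ./start-hbase.sh again, kill the current hbase and zookeeper processes

7) Mind the order of the entries in hosts:

192.168.1.214 master
192.168.1.205 node1
192.168.1.207 node2
192.168.1.209 node3
192.168.1.205 T205.joy.cc

PS: when you hit a problem, check the logs first; they help a lot.

The official HBase documentation, a full introduction to installing and configuring HBase:

http://www.yankay.com/wp-content/hbase/book.html#hbase_default_configurations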

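Overview

For table creation, HBase has a namespace concept similar to an RDBMS: you can create a table in a given namespace, or create it directly, in which case it goes into the default namespace.

For data access, HBase supports four main kinds of operation:

Put: add a row, modify a row;
Delete: delete a row, delete a given column family, delete multiple versions of a given column, delete a specific version of a given column, and so on;
Get: fetch everything in a given row, fetch all columns of a given column family in a row, fetch a given column, fetch several versions of a given column, fetch a specific version of a given column, and so on;
Scan: fetch all rows, fetch the rows in a row key range, fetch a number of rows starting from a given row, fetch the rows that match a filter, and so on.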
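The difference between Get and a range Scan can be modeled with a sorted map, since HBase keeps rows ordered by row key. The sketch below is a plain-JDK model with hypothetical names, not the HBase client API; it assumes the usual range convention of an inclusive start row and an exclusive stop row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

// Hypothetical model of a table as a map sorted by row key.
final class ScanModel {
    // Get: look up exactly one row.
    static String get(NavigableMap<String, String> table, String rowKey) {
        return table.get(rowKey);
    }

    // Scan: row keys in [startRow, stopRow), in sorted order.
    static List<String> scan(NavigableMap<String, String> table,
                             String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }
}
```

With rows row1, row2 and row3, scanning from row1 to row3 returns row1 and row2 only, because the stop row is excluded.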

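All four classes live in the org.apache.hadoop.hbase.client package; see the official API documentation for the details. This article only summarizes the commonly used methods, aiming to let readers master 80% of the common functionality in 20% of the time.

1. Namespace

2. Creating a table

3. Deleting a table

4. Modifying a table

5. Inserting and updating data: Put

6. Deleting data: Delete

7. Fetching a single row: Get

8. Fetching multiple rows: Scan

1. Namespace

A namespace is a logical grouping of tables, analogous to a database in a relational database system; tables in the same group serve similar purposes. The namespace concept lays the groundwork for upcoming multi-tenancy features:

Quota Management (HBASE-8410): restrict the resources a namespace can use, such as regions and tables;
Namespace Security Administration (HBASE-9206): provide another level of security administration for tenants;
Region server groups (HBASE-6721): a namespace or a table can be pinned to a set of regionservers, which guarantees data isolation.

1.1. Managing namespaces

Namespaces can be created, removed and altered.

Which namespace a table belongs to is decided when the table is created, by giving the table name in the form namespace:table qualifier.

Example: creating a namespace, creating a table in a namespace, dropping a namespace and altering a namespace in the hbase shell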
#Create a namespace
create_namespace 'my_ns'

#create my_table in my_ns namespace
create 'my_ns:my_table', 'fam'

#drop namespace
drop_namespace 'my_ns'

#alter namespace
alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}

1.2. Predefined namespaces

There are two built-in, predefined namespaces:

hbase: the system namespace, which holds HBase's internal tables
default: every table created without a namespace goes into this namespace automatically

Example: an explicit namespace and the default namespace

#namespace=foo and table qualifier=bar
create 'foo:bar', 'fam'

#namespace=default and table qualifier=bar
create 'bar', 'fam'

2. Creating a table

Here is the template code first; notes and key points follow after it:

Configuration conf = HBaseConfiguration.create();

HBaseAdmin admin = new HBaseAdmin(conf);

//create namespace named "my_ns"

admin.createNamespace(NamespaceDescriptor.create("my_ns").build());

//create tableDesc, with namespace name "my_ns" and table name "mytable"

HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("my_ns:mytable"));

tableDesc.setDurability(Durability.SYNC_WAL);

//add a column family "mycf"

HColumnDescriptor hcd = new HColumnDescriptor("mycf");

tableDesc.addFamily(hcd);

admin.createTable(tableDesc);

admin.close();

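Key points:

The cluster's hbase-site.xml must be added to the project classpath, or the relevant properties must be set on the Configuration object; otherwise the program cannot obtain the cluster information, cannot find the cluster, and fails at run time;
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("my_ns:mytable")) describes the table mytable and places it in the my_ns namespace. The namespace must already exist; if it does not, org.apache.hadoop.hbase.NamespaceNotFoundException is thrown;
Namespaces are usually created from the command line during the modeling phase; creating them from Java with admin.createNamespace(NamespaceDescriptor.create("my_ns").build()) is rarely needed;
Creating the HBaseAdmin object already opens the connection between the client program and the HBase cluster, so be sure to close the connection with admin.close() when the program is done;
Table properties can be set through the HTableDescriptor object. For example, tableDesc.setMaxFileSize(512) sets the maximum size of a store file in a region; when the largest store file in a region reaches this size, the region starts to split. tableDesc.setMemStoreFlushSize(512) sets the maximum size of the memstore in a region's memory; when the memstore reaches this size, data is flushed to disk. See the official API documentation for more properties;
Column family properties can be set through the HColumnDescriptor object. For example, hcd.setTimeToLive(5184000) sets the maximum time data is kept, in seconds; hcd.setInMemory(true) keeps the data in memory for faster responses; hcd.setMaxVersions(10) sets the maximum number of versions kept; hcd.setMinVersions(5) sets the minimum number of versions kept (used together with TimeToLive). See the official API documentation for more properties;
The number of versions can only be set through the HColumnDescriptor object, not through the HTableDescriptor object;
HBase writes data to memory first and flushes it to disk only when the memory threshold is reached, so if a regionserver goes down before the data is flushed, the data in memory is lost. The WAL (Write-Ahead-Log) addresses this scenario. tableDesc.setDurability(Durability.SYNC_WAL) sets the WAL write level; the example uses synchronous WAL writes, which are safer but inevitably cost some performance, so choose according to your scenario;
The setDurability(Durability d) method is available on three related objects: HTableDescriptor, Delete and Put (Delete and Put inherit it from the parent class org.apache.hadoop.hbase.client.Mutation). They set the WAL write level for the table, for delete operations and for put operations respectively. Note that Delete and Put do not inherit the table's Durability level (verified by testing). Durability is an enum; see section 5.2 for the possible values. If no WAL level is set through this method, the default USE_DEFAULT level applies.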
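The value 5184000 passed to hcd.setTimeToLive above is easier to read once converted, assuming the TTL is expressed in seconds. A plain-JDK sketch (TtlMath is a hypothetical helper) does the arithmetic in both directions:

```java
import java.time.Duration;

// Converts between a TTL in seconds and whole days.
final class TtlMath {
    static long toDays(long ttlSeconds) {
        return Duration.ofSeconds(ttlSeconds).toDays();
    }

    static long toSeconds(long days) {
        return Duration.ofDays(days).getSeconds();
    }

    public static void main(String[] args) {
        // 5184000 seconds / 86400 seconds per day = 60 days
        System.out.println(toDays(5184000L));
    }
}
```

Computing the TTL from a number of days in code, rather than hard-coding the seconds, keeps the intended retention period visible.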

3. Deleting a table

Deleting a table is less involved than creating one, so straight to the code:

Configuration conf = HBaseConfiguration.create();

HBaseAdmin admin = new HBaseAdmin(conf);

String tablename = "my_ns:mytable";

if(admin.tableExists(tablename)) {

try {

admin.disableTable(tablename);

admin.deleteTable(tablename);

} catch (Exception e) {

// TODO: handle exception

e.printStackTrace();

}

}

admin.close();

Note: a table must be disabled before it can be deleted.

4. Modifying a table

4.1. Example code

(1) Removing and adding column families

Before the change there are four column families:

hbase(main):014:0> describe 'rd_ns:itable'

DESCRIPTION ENABLED

'rd_ns:itable', {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', V true

ERSIONS => '10', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false',

BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'newcf', DATA_BLOCK_ENCODING => 'NONE

', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => '2147483647',

MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'tr

ue'}, {NAME => 'note', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS =>

'10', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE

=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'sysinfo', DATA_BLOCK_ENCODING => 'NONE', BLOOM

FILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => '2147483647', MIN_VERS

IONS => '0', KEEP_DELETED_CELLS => 'true', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}

1 row(s) in 0.0450 seconds

Modify the table by removing three column families and adding one, with the following code:

Configuration conf = HBaseConfiguration.create();

HBaseAdmin admin = new HBaseAdmin(conf);

String tablename = "rd_ns:itable";

if(admin.tableExists(tablename)) {

try {

admin.disableTable(tablename);

//get the TableDescriptor of target table

HTableDescriptor newtd = admin.getTableDescriptor(Bytes.toBytes("rd_ns:itable"));

//remove 3 useless column families

newtd.removeFamily(Bytes.toBytes("note"));

newtd.removeFamily(Bytes.toBytes("newcf"));

newtd.removeFamily(Bytes.toBytes("sysinfo"));

//create HColumnDescriptor for new column family

HColumnDescriptor newhcd = new HColumnDescriptor("action_log");

newhcd.setMaxVersions(10);

newhcd.setKeepDeletedCells(true);

//add the new column family(HColumnDescriptor) to HTableDescriptor

newtd.addFamily(newhcd);

//modify target table structure

admin.modifyTable(Bytes.toBytes("rd_ns:itable"), newtd);

admin.enableTable(tablename);

} catch (Exception e) {

// TODO: handle exception

e.printStackTrace();

}

}

admin.close();

After the change:

hbase(main):015:0> describe 'rd_ns:itable'

DESCRIPTION ENABLED

'rd_ns:itable', {NAME => 'action_log', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => true

'0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'tr

ue', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'info', DATA_BLOCK_ENCODING => '

NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '10', COMPRESSION => 'NONE', MIN_VERSIONS => '

0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>

'true'}

1 row(s) in 0.0400 seconds

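The logic is simple:

admin.getTableDescriptor(Bytes.toBytes("rd_ns:itable")) obtains the descriptor object of the target table, presumably as a reference to that object;
modify the descriptor object;
admin.modifyTable(Bytes.toBytes("rd_ns:itable"),newtd) applies the modified descriptor to the target table.

(2) Changing a property of an existing column family (setMaxVersions)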

Configuration conf = HBaseConfiguration.create();

HBaseAdmin admin = new HBaseAdmin(conf);

String tablename = "rd_ns:itable";

if(admin.tableExists(tablename)) {

try {

admin.disableTable(tablename);

//get the TableDescriptor of target table

HTableDescriptor htd = admin.getTableDescriptor(Bytes.toBytes("rd_ns:itable"));

HColumnDescriptor infocf = htd.getFamily(Bytes.toBytes("info"));

infocf.setMaxVersions(100);

//modify target table structure

admin.modifyTable(Bytes.toBytes("rd_ns:itable"),htd);

admin.enableTable(tablename);

} catch (Exception e) {

// TODO: handle exception

e.printStackTrace();

}

}

admin.close();

5.新增、更新数据Put

5.1.常用构造函数：

（1）指定行键

public Put(byte[] row)

参数：row 行键

（2）指定行键和时间戳

public Put(byte[] row, long ts)

参数：row 行键，ts 时间戳

（3）从目标字符串中提取子串，作为行键

Put(byte[] rowArray, int rowOffset, int rowLength)

（4）从目标字符串中提取子串，作为行键，并加上时间戳

Put(byte[] rowArray, int rowOffset, int rowLength, long ts)

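5.2. Common methods:

(1) Add a value for a given column family and qualifier

add(byte[] family, byte[] qualifier, byte[] value)

(2) Add a value for a given column family, qualifier, and timestamp

add(byte[] family, byte[] qualifier, long ts, byte[] value)

(3) Set the WAL (Write-Ahead Log) durability level

public void setDurability(Durability d)

The parameter is an enum with the following values:

ASYNC_WAL: write the WAL asynchronously when data changes
SYNC_WAL: write the WAL synchronously when data changes
FSYNC_WAL: write the WAL synchronously when data changes and force the data to disk
SKIP_WAL: do not write the WAL
USE_DEFAULT: use HBase's global default WAL durability level, which is SYNC_WAL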

5.3. Example code

(1) Insert a row

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Put put = new Put(Bytes.toBytes("100001"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("lion"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("address"), Bytes.toBytes("shangdi"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("30"));

put.setDurability(Durability.SYNC_WAL);

table.put(put);

table.close();

(2) Update a row

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Put put = new Put(Bytes.toBytes("100001"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("lee"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("address"), Bytes.toBytes("longze"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("31"));

put.setDurability(Durability.SYNC_WAL);

table.put(put);

table.close();

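Notes:

Every Put constructor requires a row key. If the row key is new, a row is inserted; if it already exists, the existing row is updated.
Creating the Put object and calling put.add only builds one row of data on the client: creating the Put is like creating a row object, and each add appends a cell to that row. Nothing is written to the table until table.put is called;
The code above uses constructor (1); constructor (2) also works, its second argument being the timestamp;
Put has other constructors as well; see the official API documentation.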
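
The insert-or-update behaviour described in the notes above can be pictured with a small model. The following is only a sketch using plain JDK collections, not HBase code: the class name VersionedCell and its methods are hypothetical, and it trims old versions eagerly on every write, whereas a real HBase store cleans up surplus versions later.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical, simplified model of one HBase cell that keeps several versions.
final class VersionedCell {
    private final int maxVersions;
    // Versions keyed by timestamp, newest first.
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // A "put" never overwrites in place: it adds a version with the given timestamp.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // drop the oldest version
        }
    }

    // A plain read returns the newest version, which is why a second put looks like an update.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    int versionCount() {
        return versions.size();
    }

    public static void main(String[] args) {
        VersionedCell name = new VersionedCell(2);
        name.put(1L, "lion");
        name.put(2L, "lee");
        name.put(3L, "leon");
        System.out.println(name.latest() + " / versions kept: " + name.versionCount());
    }
}
```

In this model the "update" in example (2) is just a newer version hiding the older one, which matches the multi-version output shown later for Get with setMaxVersions().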

(3) Build a Put whose row key is a sub-range of a byte array

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Put put = new Put(Bytes.toBytes("100001_100002"),7,6);

put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("show"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("address"), Bytes.toBytes("caofang"));

put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("30"));

table.put(put);

table.close();

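Notes on: Put put = new Put(Bytes.toBytes("100001_100002"),7,6)

The second argument is the offset, that is, the byte index in the first argument at which the row key starts;
The third argument is the length of the row key in bytes;
This code therefore takes the sub-range 100002 out of 100001_100002 and uses it as the row key of the target row.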
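
The offset and length arithmetic can be checked without a cluster. The sketch below uses only the JDK; the class RowKeySlice and its helper methods are hypothetical names that mimic which bytes the three-argument constructor selects, assuming the string is encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: shows which bytes Put(rowArray, rowOffset, rowLength) would use as the row key.
final class RowKeySlice {
    // rowLength bytes starting at byte index rowOffset.
    static byte[] slice(byte[] rowArray, int rowOffset, int rowLength) {
        return Arrays.copyOfRange(rowArray, rowOffset, rowOffset + rowLength);
    }

    static String sliceToString(String source, int rowOffset, int rowLength) {
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        return new String(slice(bytes, rowOffset, rowLength), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes 0-5 are "100001", byte 6 is "_", bytes 7-12 are "100002".
        System.out.println(sliceToString("100001_100002", 7, 6));
    }
}
```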

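6. Deleting data: Delete

The Delete class removes data from a single row of a table; the operation is executed through HTable.delete.

When a Delete is executed, HBase does not remove the data immediately. It marks the data to be deleted with a "tombstone", and the marked data is only purged later, when the StoreFiles are compacted.

To delete an entire row, initialize a Delete object with the row key. To narrow down what is deleted, use the following methods of the Delete object:

To delete a column family, use deleteFamily
To delete multiple versions of a column, use deleteColumns
To delete one specific version of a column, use deleteColumn; only the cell whose version (timestamp) equals the given one is deleted. If no timestamp is given, only the latest version is deleted

The constructors and common methods are described in detail below: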

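6.1. Constructors

(1) Specify the row key to delete

Delete(byte[] row)

Deletes the data of the row identified by the row key.

With no further calls, this constructor deletes all versions of all columns in all column families of that row!

(2) Specify the row key and a timestamp

Delete(byte[] row, long timestamp)

Deletes the data identified by the row key together with the timestamp.

With no further calls, this constructor deletes, in the given row, every version of every column in every column family whose timestamp is less than or equal to the given timestamp.

Note: this timestamp only applies to deleting the row as a whole. If you go on to specify column families or columns, you must give each of them its own timestamp.

(3) Given a byte array, the offset of the row key, and its length

Delete(byte[] rowArray, int rowOffset, int rowLength)

(4) Given a byte array, the offset of the row key, its length, and a timestamp

Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)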

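6.2. Common methods

Delete deleteColumn(byte[] family, byte[] qualifier) Deletes the latest version of the specified column.

Delete deleteColumns(byte[] family, byte[] qualifier) Deletes all versions of the specified column.

Delete deleteColumn(byte[] family, byte[] qualifier, long timestamp) Deletes the specified version of the specified column.

Delete deleteColumns(byte[] family, byte[] qualifier, long timestamp) Deletes all versions of the specified column whose timestamp is less than or equal to the given timestamp.

Delete deleteFamily(byte[] family) Deletes all versions of all columns in the specified column family.

Delete deleteFamily(byte[] family, long timestamp) Deletes all data in the specified column family whose timestamp is less than or equal to the given timestamp.

Delete deleteFamilyVersion(byte[] family, long timestamp) Deletes, in every column of the specified column family, the version whose timestamp equals the given timestamp.

void setTimestamp(long timestamp) Sets the timestamp of the Delete object.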
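
The difference between "latest version", "all versions up to a timestamp" and "exactly this timestamp" is easy to mix up. The following sketch models one column family of one row with plain JDK maps. It is not HBase code: the class DeleteSemantics is a hypothetical name, and it removes versions directly instead of writing tombstones as HBase does. The timestamps and values are the ones used for row 100003 in the examples of section 6.3.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of one column family in one row: qualifier -> (timestamp -> value).
final class DeleteSemantics {
    private final Map<String, TreeMap<Long, String>> family = new TreeMap<>();

    void put(String qualifier, long ts, String value) {
        family.computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.reverseOrder()))
              .put(ts, value);
    }

    // Like deleteColumn(family, qualifier): only the newest version goes away.
    void deleteColumn(String qualifier) {
        TreeMap<Long, String> versions = family.get(qualifier);
        if (versions != null && !versions.isEmpty()) {
            versions.pollFirstEntry();
        }
    }

    // Like deleteColumns(family, qualifier, ts): every version with timestamp <= ts goes away.
    void deleteColumns(String qualifier, long ts) {
        TreeMap<Long, String> versions = family.get(qualifier);
        if (versions != null) {
            versions.keySet().removeIf(t -> t <= ts);
        }
    }

    // Like deleteFamilyVersion(family, ts): in every column, only the version with timestamp == ts.
    void deleteFamilyVersion(long ts) {
        for (TreeMap<Long, String> versions : family.values()) {
            versions.remove(ts);
        }
    }

    // What a plain read would show: the newest remaining version, or null.
    String latest(String qualifier) {
        TreeMap<Long, String> versions = family.get(qualifier);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        DeleteSemantics info = new DeleteSemantics();
        info.put("age", 1405390728175L, "30");
        info.put("age", 1405390959464L, "301");
        info.put("address", 1405391883886L, "shangdi");
        info.deleteFamilyVersion(1405390959464L);
        System.out.println(info.latest("age") + " " + info.latest("address"));
    }
}
```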

6.3. Example code

(1) Delete an entire row: all column families, all columns, all versions

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Delete delete = new Delete(Bytes.toBytes("000"));

table.delete(delete);

table.close();

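(2) Delete the latest version of a column

Below is the data before the delete. Look at info:address in row 100003: caofang1 is the latest version of that column, and the version before it is caofang: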

hbase(main):007:0> scan 'rd_ns:leetable'

ROW COLUMN+CELL

100001 column=info:address, timestamp=1405304843114, value=longze

100001 column=info:age, timestamp=1405304843114, value=31

100001 column=info:name, timestamp=1405304843114, value=leon

100002 column=info:address, timestamp=1405305471343, value=caofang

100002 column=info:age, timestamp=1405305471343, value=30

100002 column=info:name, timestamp=1405305471343, value=show

100003 column=info:address, timestamp=1405390959464, value=caofang1

100003 column=info:age, timestamp=1405390959464, value=301

100003 column=info:name, timestamp=1405390959464, value=show1

3 row(s) in 0.0270 seconds

Run the following code:

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Delete delete = new Delete(Bytes.toBytes("100003"));

delete.deleteColumn(Bytes.toBytes("info"), Bytes.toBytes("address"));

table.delete(delete);

table.close();

Looking at the data again, the info:address column of row 100003 now shows the previous version, caofang. All other values are unchanged:

hbase(main):008:0> scan 'rd_ns:leetable'

ROW COLUMN+CELL

100001 column=info:address, timestamp=1405304843114, value=longze

100001 column=info:age, timestamp=1405304843114, value=31

100001 column=info:name, timestamp=1405304843114, value=leon

100002 column=info:address, timestamp=1405305471343, value=caofang

100002 column=info:age, timestamp=1405305471343, value=30

100002 column=info:name, timestamp=1405305471343, value=show

100003 column=info:address, timestamp=1405390728175, value=caofang

100003 column=info:age, timestamp=1405390959464, value=301

100003 column=info:name, timestamp=1405390959464, value=show1

3 row(s) in 0.0560 seconds

(3) Delete all versions of a column

Continuing from the scenario above, run the following code:

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Delete delete = new Delete(Bytes.toBytes("100003"));

delete.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("address"));

table.delete(delete);

table.close();

The entire info:address column of row 100003 is now gone:

hbase(main):009:0> scan 'rd_ns:leetable'

ROW COLUMN+CELL

100001 column=info:address, timestamp=1405304843114, value=longze

100001 column=info:age, timestamp=1405304843114, value=31

100001 column=info:name, timestamp=1405304843114, value=leon

100002 column=info:address, timestamp=1405305471343, value=caofang

100002 column=info:age, timestamp=1405305471343, value=30

100002 column=info:name, timestamp=1405305471343, value=show

100003 column=info:age, timestamp=1405390959464, value=301

100003 column=info:name, timestamp=1405390959464, value=show1

3 row(s) in 0.0240 seconds

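(4) Delete, in every column of a column family, the version whose timestamp equals a given timestamp

To demonstrate the effect, I have already inserted a new value into the info:address column of row 100003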

hbase(main):010:0> scan 'rd_ns:leetable'

ROW COLUMN+CELL

100001 column=info:address, timestamp=1405304843114, value=longze

100001 column=info:age, timestamp=1405304843114, value=31

100001 column=info:name, timestamp=1405304843114, value=leon

100002 column=info:address, timestamp=1405305471343, value=caofang

100002 column=info:age, timestamp=1405305471343, value=30

100002 column=info:name, timestamp=1405305471343, value=show

100003 column=info:address, timestamp=1405391883886, value=shangdi

100003 column=info:age, timestamp=1405390959464, value=301

100003 column=info:name, timestamp=1405390959464, value=show1

3 row(s) in 0.0250 seconds

Now the goal is to delete all column data in the info column family whose timestamp is 1405390959464:

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Delete delete = new Delete(Bytes.toBytes("100003"));

delete.deleteFamilyVersion(Bytes.toBytes("info"), 1405390959464L);

table.delete(delete);

table.close();

hbase(main):011:0> scan 'rd_ns:leetable'

ROW COLUMN+CELL

100001 column=info:address, timestamp=1405304843114, value=longze

100001 column=info:age, timestamp=1405304843114, value=31

100001 column=info:name, timestamp=1405304843114, value=leon

100002 column=info:address, timestamp=1405305471343, value=caofang

100002 column=info:age, timestamp=1405305471343, value=30

100002 column=info:name, timestamp=1405305471343, value=show

100003 column=info:address, timestamp=1405391883886, value=shangdi

100003 column=info:age, timestamp=1405390728175, value=30

100003 column=info:name, timestamp=1405390728175, value=show

3 row(s) in 0.0250 seconds

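As the output shows, the info column family of row 100003 no longer holds any data with timestamp 1405390959464, so the earlier versions are returned instead. The address column in the info family, whose timestamp is not 1405390959464, is unaffected by this delete.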

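7. Reading a single row: Get

To read an entire row, initialize a Get object with the row key. To narrow down what is read, use the following methods of the Get object:

To read all columns of specific column families, call addFamily for each target family;
To read specific columns, call addColumn for each target column;
To read the versions of the target columns within a timestamp range, use setTimeRange;
To read only the version with a specific timestamp, use setTimeStamp;
To limit the number of versions returned per column, use setMaxVersions;
To add a filter, use setFilter

The constructor and common methods are described in detail below:

7.1. Constructor

Get has a simple constructor: Get(byte[] row), where the parameter is the row key.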

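7.2. Common methods

Get addFamily(byte[] family) Specifies a column family to read
Get addColumn(byte[] family, byte[] qualifier) Specifies a column to read
Get setTimeRange(long minStamp, long maxStamp) Sets the timestamp range of the data to read
Get setTimeStamp(long timestamp) Sets the timestamp of the data to read
Get setMaxVersions(int maxVersions) Sets the number of versions to read
Get setMaxVersions() Reads all versions
Get setFilter(Filter filter) Adds a filter to the Get object; for details on filters see: http://blog.csdn.net/u010967382/article/details/37653177
void setCacheBlocks(boolean cacheBlocks) Sets whether the data read by this Get is cached in memory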

7.3. Tested code

All data in the test table:

hbase(main):016:0> scan 'rd_ns:leetable'

ROW COLUMN+CELL

100001 column=info:address, timestamp=1405304843114, value=longze

100001 column=info:age, timestamp=1405304843114, value=31

100001 column=info:name, timestamp=1405304843114, value=leon

100002 column=info:address, timestamp=1405305471343, value=caofang

100002 column=info:age, timestamp=1405305471343, value=30

100002 column=info:name, timestamp=1405305471343, value=show

100003 column=info:address, timestamp=1405407883218, value=qinghe

100003 column=info:age, timestamp=1405407883218, value=28

100003 column=info:name, timestamp=1405407883218, value=shichao

3 row(s) in 0.0250 seconds

(1) Get the latest version of all columns in all column families of the row with the given row key

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Get get = new Get(Bytes.toBytes("100003"));

Result r = table.get(get);

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))

);

}

table.close();

Output:

Rowkey : 100003 Familiy:Quilifier : address Value : qinghe

Rowkey : 100003 Familiy:Quilifier : age Value : 28

Rowkey : 100003 Familiy:Quilifier : name Value : shichao

(2) Get the latest version of a specific column in the row with the given row key

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Get get = new Get(Bytes.toBytes("100003"));

get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));

Result r = table.get(get);

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))

);

}

table.close();

Output:

Rowkey : 100003 Familiy:Quilifier : name Value : shichao

(3) Get the data with a specific timestamp in the row with the given row key

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:leetable");

Get get = new Get(Bytes.toBytes("100003"));

get.setTimeStamp(1405407854374L);

Result r = table.get(get);

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))

);

}

table.close();

The code prints historical data that the scan output above does not show:

Rowkey : 100003 Familiy:Quilifier : address Value : huangzhuang

Rowkey : 100003 Familiy:Quilifier : age Value : 32

Rowkey : 100003 Familiy:Quilifier : name Value : lily

(4) Get all versions of the data in the row with the given row key

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:itable");

Get get = new Get(Bytes.toBytes("100003"));

get.setMaxVersions();

Result r = table.get(get);

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))+

" Time : "+cell.getTimestamp()

);

}

table.close();

Output:

Rowkey : 100003 Familiy:Quilifier : address Value : xierqi Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : address Value : shangdi Time : 1405417477465

Rowkey : 100003 Familiy:Quilifier : address Value : longze Time : 1405417448414

Rowkey : 100003 Familiy:Quilifier : age Value : 29 Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : age Value : 30 Time : 1405417477465

Rowkey : 100003 Familiy:Quilifier : age Value : 31 Time : 1405417448414

Rowkey : 100003 Familiy:Quilifier : name Value : leon Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : name Value : lee Time : 1405417477465

Rowkey : 100003 Familiy:Quilifier : name Value : lion Time : 1405417448414

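Note:

Multiple versions can only be returned if the column family keeps multiple versions. The number of versions a column family keeps is set with the setMaxVersions(int) method of HColumnDescriptor.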

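8. Reading multiple rows: Scan

A Scan object returns the rows that satisfy the given conditions. To read every row, simply create a Scan object with no arguments. To restrict what is scanned, use the following methods:

To read all columns of specific column families, call addFamily for each family
To read specific columns, call addColumn for each column
Use setTimeRange to restrict the timestamp range of the returned cells
Use setTimeStamp to return only the cells with exactly the given timestamp
Use setMaxVersions to limit the number of versions returned
Use setBatch to limit the number of cells returned in each Result (it limits columns per call, not rows)
Use setFilter to attach a filter to the Scan; for details on filters see: http://blog.csdn.net/u010967382/article/details/37653177
Scan results are fetched in batches and cached on the client. getCaching() returns the number of rows currently fetched per request, and setCaching(int caching) changes it. A larger value means fewer round trips and faster iteration, at the cost of more memory. Separately, setCacheBlocks controls whether the data blocks read by the Scan are cached; the default is true
setMaxResultSize(long) limits the size, in bytes, of the results returned by the Scan.

The following is an introductory example from the official documentation: assume the table has rows with keys "row1", "row2", "row3" and other rows with keys "abc1", "abc2", and "abc3"; the goal is to return the rows whose keys start with "row":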

HTable htable = ... // instantiate HTable

Scan scan = new Scan();

scan.addColumn(Bytes.toBytes("cf"),Bytes.toBytes("attr"));

scan.setStartRow(Bytes.toBytes("row")); // start key is inclusive

scan.setStopRow(Bytes.toBytes("rox")); // stop key is exclusive: the prefix "row" with its last byte incremented ("row" + (char)0 sorts before "row1" and would exclude it)

ResultScanner rs = htable.getScanner(scan);

try {

for (Result r = rs.next(); r != null; r = rs.next()) {

// process result...

}

} finally {

rs.close(); // always close the ResultScanner!

}
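
Row keys are compared as unsigned bytes, so for a prefix scan the exclusive stop row has to sort after every key that starts with the prefix. One common way to get such a key is to increment the last byte of the prefix. The sketch below shows that idea with plain JDK code; the class PrefixScanRange and its methods are hypothetical names and not part of the HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper that derives the [start, stop) range for a row key prefix.
final class PrefixScanRange {
    // Exclusive stop row: drop trailing 0xFF bytes, then increment the last remaining byte.
    // An empty array means "no stop row" (scan to the end of the table).
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    // Unsigned lexicographic comparison, the ordering used for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // start is inclusive, stop is exclusive; an empty stop means unbounded.
    static boolean inRange(String row, byte[] start, byte[] stop) {
        byte[] key = row.getBytes(StandardCharsets.UTF_8);
        return compare(key, start) >= 0 && (stop.length == 0 || compare(key, stop) < 0);
    }

    public static void main(String[] args) {
        byte[] start = "row".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        for (String row : new String[] {"abc1", "row1", "row3", "rox"}) {
            System.out.println(row + " in range: " + inRange(row, start, stop));
        }
    }
}
```

With this range "row1" to "row3" fall inside and "abc1" to "abc3" fall outside. A stop row made of the prefix plus a single 0x00 byte would be too small, because "row1" already sorts after it.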

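8.1. Common constructors

(1) Create a Scan over all rows

Scan()

(2) Create a Scan that starts at a given row

Scan(byte[] startRow)

Parameter: startRow is the row key to start from

Note: if the given row does not exist, the scan starts at the next closest row

(3) Create a Scan with a start row and a stop row

Scan(byte[] startRow, byte[] stopRow)

Parameters: startRow is the start row, stopRow is the stop row

Note: startRow <= result set < stopRow

(4) Create a Scan with a start row and a filter

Scan(byte[] startRow, Filter filter)

Parameters: startRow is the start row, filter is the filter

Note: for what filters do and how to construct them, see http://blog.csdn.net/u010967382/article/details/37653177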

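8.2. Common methods

Scan setStartRow(byte[] startRow) Sets the start row of the Scan. The start row is included in the results by default; to exclude it, append a trailing zero byte to the row key.
Scan setStopRow(byte[] stopRow) Sets the stop row of the Scan. The stop row is excluded from the results by default; to include it, append a trailing zero byte to the row key.

Scan setTimeRange(long minStamp, long maxStamp) Scans data within the given timestamp range
Scan setTimeStamp(long timestamp) Scans data with the given timestamp

Scan addColumn(byte[] family, byte[] qualifier) Specifies a column to scan
Scan addFamily(byte[] family) Specifies a column family to scan

Scan setFilter(Filter filter) Sets a filter on the Scan

Scan setReversed(boolean reversed) Sets the scan direction. The default is a forward scan (false); set it to true for a reverse scan. Note: this method is only available from version 0.98 onwards!

Scan setMaxVersions() Reads all versions
Scan setMaxVersions(int maxVersions) Sets the maximum number of versions to read

void setCaching(int caching) Sets the number of rows fetched per request and cached on the client; a larger value speeds up iteration but uses more memory

void setRaw(boolean raw) Enables or disables raw mode. In raw mode the Scan also returns the delete markers and the data that has been marked as deleted but not yet physically removed. This is mainly useful for column families that keep deleted cells, that is, families configured with hcd.setKeepDeletedCells(true). Once raw mode is enabled, no specific column may be selected on the Scan, otherwise an error is raised. The Javadoc for this method reads:

Enable/disable "raw" mode for this scan. If "raw" is enabled the scan will return all delete marker and deleted rows that have not been collected, yet. This is mostly useful for Scan on column families that have KEEP_DELETED_ROWS enabled. It is an error to specify any column when "raw" is set.

hcd.setKeepDeletedCells(true);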
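
The trick of appending a zero byte to make a stop row inclusive (or a start row exclusive) works because the key followed by 0x00 is the smallest key that sorts after the key itself. A minimal JDK-only sketch of that reasoning follows; the class InclusiveStopRow and its methods are hypothetical names, not HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating inclusive/exclusive scan bounds.
final class InclusiveStopRow {
    // The immediate successor of a key in byte order: the key followed by one 0x00 byte.
    static byte[] successor(byte[] row) {
        return Arrays.copyOf(row, row.length + 1); // copyOf pads with zero bytes
    }

    // Unsigned lexicographic comparison of row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Default Scan semantics: start inclusive, stop exclusive.
    static boolean scanned(byte[] row, byte[] start, byte[] stop) {
        return compare(row, start) >= 0 && compare(row, stop) < 0;
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] start = key("100001");
        byte[] stop = key("100002");
        System.out.println("plain stop row includes 100002: " + scanned(key("100002"), start, stop));
        System.out.println("successor stop row includes 100002: "
                + scanned(key("100002"), start, successor(stop)));
    }
}
```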

8.3. Tested code

(1) Scan the latest version of every row in the table

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:itable");

Scan s = new Scan();

ResultScanner rs = table.getScanner(s);

for (Result r : rs) {

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))+

" Time : "+cell.getTimestamp()

);

}

}

rs.close();

table.close();

Output:

Rowkey : 100001 Familiy:Quilifier : address Value : anywhere Time : 1405417403438

Rowkey : 100001 Familiy:Quilifier : age Value : 24 Time : 1405417403438

Rowkey : 100001 Familiy:Quilifier : name Value : zhangtao Time : 1405417403438

Rowkey : 100002 Familiy:Quilifier : address Value : shangdi Time : 1405417426693

Rowkey : 100002 Familiy:Quilifier : age Value : 28 Time : 1405417426693

Rowkey : 100002 Familiy:Quilifier : name Value : shichao Time : 1405417426693

Rowkey : 100003 Familiy:Quilifier : address Value : xierqi Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : age Value : 29 Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : name Value : leon Time : 1405417500485

(2) Scan a row key range, appending 0 to the stop row so that the result set includes the stop row

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:itable");

Scan s = new Scan();

s.setStartRow(Bytes.toBytes("100001"));

s.setStopRow(Bytes.toBytes("1000020"));

ResultScanner rs = table.getScanner(s);

for (Result r : rs) {

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))+

" Time : "+cell.getTimestamp()

);

}

}

rs.close();

table.close();

Output:

Rowkey : 100001 Familiy:Quilifier : address Value : anywhere Time : 1405417403438

Rowkey : 100001 Familiy:Quilifier : age Value : 24 Time : 1405417403438

Rowkey : 100001 Familiy:Quilifier : name Value : zhangtao Time : 1405417403438

Rowkey : 100002 Familiy:Quilifier : address Value : shangdi Time : 1405417426693

Rowkey : 100002 Familiy:Quilifier : age Value : 28 Time : 1405417426693

Rowkey : 100002 Familiy:Quilifier : name Value : shichao Time : 1405417426693

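(3) Return all data that has been marked as deleted but not yet physically removed

This test uses row 100003 of the rd_ns:itable table.

A Get combined with setMaxVersions() returns all versions that have not been deleted. The output is: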

Rowkey : 100003 Familiy:Quilifier : address Value : huilongguan Time : 1405494141522

Rowkey : 100003 Familiy:Quilifier : address Value : shangdi Time : 1405417477465

Rowkey : 100003 Familiy:Quilifier : age Value : new29 Time : 1405494141522

Rowkey : 100003 Familiy:Quilifier : name Value : liyang Time : 1405494141522

With s.setRaw(true) on a Scan, however, all data that has been marked as deleted but not yet physically removed is returned as well.

The code:

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:itable");

Scan s = new Scan();

s.setStartRow(Bytes.toBytes("100003"));

s.setRaw(true);

s.setMaxVersions();

ResultScanner rs = table.getScanner(s);

for (Result r : rs) {

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))+

" Time : "+cell.getTimestamp()

);

}

}

rs.close();

table.close();

The output is:

Rowkey : 100003 Familiy:Quilifier : address Value : huilongguan Time : 1405494141522

Rowkey : 100003 Familiy:Quilifier : address Value : Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : address Value : xierqi Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : address Value : shangdi Time : 1405417477465

Rowkey : 100003 Familiy:Quilifier : address Value : Time : 1405417448414

Rowkey : 100003 Familiy:Quilifier : address Value : longze Time : 1405417448414

Rowkey : 100003 Familiy:Quilifier : age Value : new29 Time : 1405494141522

Rowkey : 100003 Familiy:Quilifier : age Value : Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : age Value : 29 Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : age Value : 30 Time : 1405417477465

Rowkey : 100003 Familiy:Quilifier : age Value : 31 Time : 1405417448414

Rowkey : 100003 Familiy:Quilifier : name Value : liyang Time : 1405494141522

Rowkey : 100003 Familiy:Quilifier : name Value : Time : 1405493879419

Rowkey : 100003 Familiy:Quilifier : name Value : leon Time : 1405417500485

Rowkey : 100003 Familiy:Quilifier : name Value : lee Time : 1405417477465

Rowkey : 100003 Familiy:Quilifier : name Value : lion Time : 1405417448414

(4) Use filters to get all rows whose age is between 25 and 30

The current data:

hbase(main):049:0> scan 'rd_ns:itable'

ROW COLUMN+CELL

100001 column=info:address, timestamp=1405417403438, value=anywhere

100001 column=info:age, timestamp=1405417403438, value=24

100001 column=info:name, timestamp=1405417403438, value=zhangtao

100002 column=info:address, timestamp=1405417426693, value=shangdi

100002 column=info:age, timestamp=1405417426693, value=28

100002 column=info:name, timestamp=1405417426693, value=shichao

100003 column=info:address, timestamp=1405494141522, value=huilongguan

100003 column=info:age, timestamp=1405494999631, value=29

100003 column=info:name, timestamp=1405494141522, value=liyang

3 row(s) in 0.0240 seconds

Code:

Configuration conf = HBaseConfiguration.create();

HTable table = new HTable(conf, "rd_ns:itable");

FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);

SingleColumnValueFilter filter1 = new SingleColumnValueFilter(

Bytes.toBytes("info"),

Bytes.toBytes("age"),

CompareOp.GREATER_OR_EQUAL,

Bytes.toBytes("25")

);

SingleColumnValueFilter filter2 = new SingleColumnValueFilter(

Bytes.toBytes("info"),

Bytes.toBytes("age"),

CompareOp.LESS_OR_EQUAL,

Bytes.toBytes("30")

);

filterList.addFilter(filter1);

filterList.addFilter(filter2);

Scan scan = new Scan();

scan.setFilter(filterList);

ResultScanner rs = table.getScanner(scan);

for (Result r : rs) {

for (Cell cell : r.rawCells()) {

System.out.println(

"Rowkey : "+Bytes.toString(r.getRow())+

" Familiy:Quilifier : "+Bytes.toString(CellUtil.cloneQualifier(cell))+

" Value : "+Bytes.toString(CellUtil.cloneValue(cell))+

" Time : "+cell.getTimestamp()

);

}

}

rs.close();

table.close();

Output:

Rowkey : 100002 Familiy:Quilifier : address Value : shangdi Time : 1405417426693

Rowkey : 100002 Familiy:Quilifier : age Value : 28 Time : 1405417426693

Rowkey : 100002 Familiy:Quilifier : name Value : shichao Time : 1405417426693

Rowkey : 100003 Familiy:Quilifier : address Value : huilongguan Time : 1405494141522

Rowkey : 100003 Familiy:Quilifier : age Value : 29 Time : 1405494999631

Rowkey : 100003 Familiy:Quilifier : name Value : liyang Time : 1405494141522

Note:

HBase column family names and column names are case-sensitive
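
One more point is worth checking about example (4): the ages are stored as strings, and a comparison against Bytes.toBytes("25") is a byte-wise, lexicographic comparison. That gives the expected result for two-digit ages but not in general. The sketch below reproduces the comparison with plain JDK code; the class AgeCompare and the zero-padding helper are hypothetical, and fixed-width padding is only one possible remedy.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: compares values the way a byte-wise comparator would.
final class AgeCompare {
    // Unsigned lexicographic comparison of the UTF-8 bytes of two strings.
    static int compareBytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    // low <= value <= high in byte order, like GREATER_OR_EQUAL combined with LESS_OR_EQUAL.
    static boolean between(String value, String low, String high) {
        return compareBytes(value, low) >= 0 && compareBytes(value, high) <= 0;
    }

    // Fixed-width, zero-padded encoding makes byte order agree with numeric order (non-negative values).
    static String pad(int age) {
        return String.format("%03d", age);
    }

    public static void main(String[] args) {
        System.out.println("28 matches: " + between("28", "25", "30"));
        System.out.println("3 matches (unpadded): " + between("3", "25", "30"));
        System.out.println("3 matches (padded): " + between(pad(3), pad(25), pad(30)));
    }
}
```

In byte order "3" sorts between "25" and "30", so an unpadded age of 3 would pass both filters, while the padded form "003" does not.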

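Next, let's look at some basic HBase Shell commands. The most commonly used ones are listed below:

Name	Command expression
Create a table	create 'table name', 'column family 1', 'column family 2', 'column family N'
Add a record	put 'table name', 'row key', 'column family:qualifier', 'value'
View a record	get 'table name', 'row key'
Count the records in a table	count 'table name'
Delete a record	delete 'table name', 'row key', 'column name'
Delete a table	The table must be disabled before it can be dropped: first disable 'table name', then drop 'table name'
View all records	scan 'table name'
View all data of one column of a table	scan 'table name', {COLUMNS => 'column name'}
Update a record	Write the record again with put to overwrite it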

I. General operations
1. Check the server status
hbase(main):024:0> status
3 servers, 0 dead, 1.0000 average load

2. Check the HBase version

hbase(main):025:0> version
0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

II. DDL operations

1. Create a table
hbase(main):011:0> create 'member','member_id','address','info'
0 row(s) in 1.2210 seconds

2. Get the description of a table
hbase(main):012:0> list
TABLE
member
1 row(s) in 0.0160 seconds
hbase(main):006:0> describe 'member'
DESCRIPTION                                                                                        ENABLED
{NAME => 'member', FAMILIES => [{NAME=> 'address', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', true
  VERSIONS => '3', COMPRESSION => 'NONE',TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'fa
lse', BLOCKCACHE => 'true'}, {NAME =>'info', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSI
ONS => '3', COMPRESSION => 'NONE', TTL=> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}
1 row(s) in 0.0230 seconds

3. Delete a column family: alter, disable, enable
We created three column families earlier, but member_id turns out to be redundant because it is simply the primary key, so we delete it.
hbase(main):003:0> alter 'member',{NAME=>'member_id',METHOD=>'delete'}

ERROR: Table member is enabled. Disable it first before altering.

The command fails: the table must be disabled before a column family can be deleted.
hbase(main):004:0> disable 'member'
0 row(s) in 2.0390 seconds
hbase(main):005:0> alter 'member',NAME=>'member_id',METHOD=>'delete'
0 row(s) in 0.0560 seconds
hbase(main):006:0> describe 'member'
DESCRIPTION                                                                                        ENABLED
{NAME => 'member', FAMILIES => [{NAME=> 'address', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',false
  VERSIONS => '3', COMPRESSION => 'NONE',TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'fa
lse', BLOCKCACHE => 'true'}, {NAME =>'info', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSI
ONS => '3', COMPRESSION => 'NONE', TTL=> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}
1 row(s) in 0.0230 seconds
The column family has been deleted, so we enable the table again
hbase(main):008:0> enable 'member'
0 row(s) in 2.0420 seconds

4. List all tables
hbase(main):028:0> list
TABLE
member
temp_table
2 row(s) in 0.0150 seconds

5. Drop a table
hbase(main):029:0> disable 'temp_table'
0 row(s) in 2.0590 seconds

hbase(main):030:0> drop 'temp_table'
0 row(s) in 1.1070 seconds

6. Check whether a table exists
hbase(main):021:0> exists 'member'
Table member does exist
0 row(s) in 0.1610 seconds

7. Check whether a table is enabled
hbase(main):034:0> is_enabled 'member'
true
0 row(s) in 0.0110 seconds

8. Check whether a table is disabled
hbase(main):032:0> is_disabled 'member'
false
0 row(s) in 0.0110 seconds

III. DML operations

1. Insert a few records
put 'member','scutshuxue','info:age','24'
put 'member','scutshuxue','info:birthday','1987-06-17'
put 'member','scutshuxue','info:company','alibaba'
put 'member','scutshuxue','address:contry','china'
put 'member','scutshuxue','address:province','zhejiang'
put 'member','scutshuxue','address:city','hangzhou'

put 'member','xiaofeng','info:birthday','1987-4-17'
put 'member','xiaofeng','info:favorite','movie'
put 'member','xiaofeng','info:company','alibaba'
put 'member','xiaofeng','address:contry','china'
put 'member','xiaofeng','address:province','guangdong'
put 'member','xiaofeng','address:city','jieyang'
put 'member','xiaofeng','address:town','xianqiao'

2. Get a single record
Get all data for one id
hbase(main):001:0> get 'member','scutshuxue'
COLUMN                                  CELL
address:city                         timestamp=1321586240244, value=hangzhou
address:contry                      timestamp=1321586239126, value=china
address:province                      timestamp=1321586239197, value=zhejiang
info:age                            timestamp=1321586238965, value=24
info:birthday                         timestamp=1321586239015, value=1987-06-17
info:company                         timestamp=1321586239071, value=alibaba
6 row(s) in 0.4720 seconds

Get all data of one column family for one id
hbase(main):002:0> get 'member','scutshuxue','info'
COLUMN                                  CELL
info:age                            timestamp=1321586238965, value=24
info:birthday                         timestamp=1321586239015, value=1987-06-17
info:company                         timestamp=1321586239071, value=alibaba
3 row(s) in 0.0210 seconds

Get all data of one column in one column family for one id
hbase(main):002:0> get 'member','scutshuxue','info:age'
COLUMN CELL
info:age timestamp=1321586238965, value=24
1 row(s) in 0.0320 seconds

6. Update a record
Change the age of scutshuxue to 99
hbase(main):004:0> put 'member','scutshuxue','info:age','99'
0 row(s) in 0.0210 seconds

hbase(main):005:0> get 'member','scutshuxue','info:age'
COLUMN CELL
info:age timestamp=1321586571843, value=99
1 row(s) in 0.0180 seconds

3. Get the two versions of the data by timestamp
hbase(main):010:0> get 'member','scutshuxue',{COLUMN=>'info:age',TIMESTAMP=>1321586238965}
COLUMN CELL
info:age timestamp=1321586238965, value=24
1 row(s) in 0.0140 seconds

hbase(main):011:0> get 'member','scutshuxue',{COLUMN=>'info:age',TIMESTAMP=>1321586571843}
COLUMN CELL
info:age timestamp=1321586571843, value=99
1 row(s) in 0.0180 seconds

4. Full table scan:
hbase(main):013:0> scan 'member'
ROW                                  COLUMN+CELL
scutshuxue                            column=address:city, timestamp=1321586240244, value=hangzhou
scutshuxue                            column=address:contry, timestamp=1321586239126, value=china
scutshuxue                            column=address:province, timestamp=1321586239197, value=zhejiang
scutshuxue                            column=info:age,timestamp=1321586571843, value=99
scutshuxue                            column=info:birthday, timestamp=1321586239015, value=1987-06-17
scutshuxue                            column=info:company, timestamp=1321586239071, value=alibaba
temp                                  column=info:age, timestamp=1321589609775, value=59
xiaofeng                            column=address:city, timestamp=1321586248400, value=jieyang
xiaofeng                            column=address:contry, timestamp=1321586248316, value=china
xiaofeng                            column=address:province, timestamp=1321586248355, value=guangdong
xiaofeng                            column=address:town, timestamp=1321586249564, value=xianqiao
xiaofeng                            column=info:birthday, timestamp=1321586248202, value=1987-4-17
xiaofeng                            column=info:company, timestamp=1321586248277, value=alibaba
xiaofeng                            column=info:favorite, timestamp=1321586248241, value=movie
3 row(s) in 0.0570 seconds

5. Delete the 'info:age' field of the row whose id is temp
hbase(main):016:0> delete 'member','temp','info:age'
0 row(s) in 0.0150 seconds
hbase(main):018:0> get 'member','temp'
COLUMN CELL
0 row(s) in 0.0150 seconds

6. Delete an entire row

hbase(main):001:0> deleteall 'member','xiaofeng'
0 row(s) in 0.3990 seconds

7. Count the rows in the table:
hbase(main):019:0> count 'member'
2 row(s) in 0.0160 seconds

8. Add an 'info:age' field to the id 'xiaofeng' and increment it as a counter
hbase(main):057:0* incr 'member','xiaofeng','info:age'
COUNTER VALUE = 1

hbase(main):058:0> get 'member','xiaofeng','info:age'
COLUMN CELL
info:age timestamp=1321590997648, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0140 seconds

hbase(main):059:0> incr 'member','xiaofeng','info:age'
COUNTER VALUE = 2

hbase(main):060:0> get 'member','xiaofeng','info:age'
COLUMN CELL
info:age timestamp=1321591025110, value=\x00\x00\x00\x00\x00\x00\x00\x02
1 row(s) in 0.0160 seconds

Get the current value of the counter
hbase(main):069:0> get_counter 'member','xiaofeng','info:age'
COUNTER VALUE = 2
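
The value shown by get after incr is not text: the counter is stored as an 8-byte big-endian long, which is why the shell prints \x00...\x01 rather than 1. The sketch below reproduces that encoding with java.nio only. The class CounterEncoding is a hypothetical name, and its shellEscape method is a simplification that escapes every byte as hex.

```java
import java.nio.ByteBuffer;

// Hypothetical helper showing how a counter value maps to the 8 bytes stored in the cell.
final class CounterEncoding {
    // 8-byte big-endian encoding (ByteBuffer is big-endian by default).
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    // Simplified rendering: every byte as \xNN.
    static String shellEscape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shellEscape(encode(1L)));
        System.out.println(decode(encode(2L)));
    }
}
```

This is also why a counter column should not be written with a string put such as '24': a later incr expects exactly 8 bytes in the cell.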

9. Truncate the whole table:
hbase(main):035:0> truncate 'member'
Truncating 'member' table (it may take a while):
- Disabling table...
- Dropping table...
- Creating table...
0 row(s) in 4.3430 seconds

As the output shows, HBase implements truncate by disabling the table, dropping it, and then recreating it.

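A typical HBase write goes through an HTable object. The data is wrapped in a Put object, which takes the row key when it is created and then has the column family, column name, and value added to it. HTable.put then submits it to the RegionServer through an RPC request. Writes can be done in the following ways:

Single put
Batch put
MapReduce
Bulk load

HTable

Writing to HBase inevitably involves HTable, which reads from and writes to a single HBase table. HTable objects are not thread-safe, so take care when using them from multiple threads. The table name must be given when an HTable is created. Internally, HTable holds a LinkedList<Row> queue named writeAsyncBuffer that buffers writes to HBase on the client. Buffering is enabled with table.setAutoFlushTo(false). By default it is off, and every put makes the HTable call flushCommits to submit the data to the RegionServer. With buffering on, the size of the queue is checked and flushCommits is only called once it exceeds a threshold. The threshold defaults to 2 MB (2097152 bytes) and can be changed with the "hbase.client.write.buffer" parameter in hbase-site.xml. Closing the HTable calls flushCommits implicitly, so that all data is submitted. On submission, each put is routed by row key to the RegionServer that should receive it, and the actions are sent out grouped per RegionServer. (As an aside, Solr differs slightly here: a request can be sent to any Solr node, which checks whether the request belongs to it and, if not, forwards it to all nodes; Solr can also be set up to behave like HBase.)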
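
The effect of the client-side write buffer on the number of RPC round trips can be illustrated with a small model. This is only a sketch of the behaviour described above, not HTable's actual implementation: the class WriteBufferModel and all of its members are hypothetical, and a "flush" here just counts a simulated round trip.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a client write buffer with an auto-flush switch.
final class WriteBufferModel {
    private final boolean autoFlush;
    private final long bufferLimitBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0; // number of simulated RPC round trips

    WriteBufferModel(boolean autoFlush, long bufferLimitBytes) {
        this.autoFlush = autoFlush;
        this.bufferLimitBytes = bufferLimitBytes;
    }

    // Buffer the payload; flush right away if auto-flush is on or the buffer exceeds its limit.
    void put(byte[] payload) {
        pending.add(payload);
        pendingBytes += payload.length;
        if (autoFlush || pendingBytes > bufferLimitBytes) {
            flush();
        }
    }

    // Send everything that is buffered in one simulated round trip.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        pendingBytes = 0;
        flushCount++;
    }

    // Closing flushes whatever is still buffered, so no write is lost.
    void close() {
        flush();
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        WriteBufferModel buffered = new WriteBufferModel(false, 2097152L);
        for (int i = 0; i < 1000; i++) {
            buffered.put(new byte[100]);
        }
        buffered.close();
        System.out.println("round trips with buffering: " + buffered.flushCount());
    }
}
```

With auto-flush on, the same 1000 puts would cost 1000 round trips in this model, which is the motivation for the batch approaches that follow.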

Single put

This is the simplest, most basic way to write to HBase. The typical use case is an online service inserting one record at a time while it runs, such as a message record or a processing record; the HTable object is released after the write. Every submission is one RPC request.

table.setAutoFlushTo(true);

    /**
     * Insert one record.
     * The row key is rk001 and the column family is f1.
     * Two columns are inserted: c1 with value 001
     *                           c2 with value 002
     */
    public void insertPut() {
        // Configuration loads the HBase settings: HBaseConfiguration.create() creates a new
        // Configuration and then calls addResource to load the hbase-site.xml file.
        Configuration conf = HBaseConfiguration.create();
        try {
            table = new HTable(conf, tableName);
            table.setAutoFlushTo(true); // defaults to true if not set explicitly

            String rowkey = "rk001";
            Put put = new Put(rowkey.getBytes());
            put.add(cf.getBytes(), "c1".getBytes(), "001".getBytes());
            put.add(cf.getBytes(), "c2".getBytes(), "002".getBytes());
            table.put(put);
            table.close(); // close the HBase connection

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

　　多条put

　　有了单条的put自然就想到这种方式其实是低效的，每次只能提交一条记录，有没有上面方法可以一次提交多条记录呢？减少请求次数，最简单的方式使用List<Put>，这种方式操作时和单条put没有区别，将put对象add到list中，然后调用put(List<Put>)方法，过程和单条put基本一致，应用场景一般在数据量稍多的环境下，通过批量提交减少请求次数

/**
 * Batch request: submits two rows in one call.
 * tableName and cf are String fields of the enclosing class.
 */
public void insertPuts() {
    Configuration conf = HBaseConfiguration.create();
    // try-with-resources closes the table even if the put fails
    try (HTable table = new HTable(conf, tableName)) {
        table.setAutoFlushTo(true);
        List<Put> puts = new ArrayList<Put>();

        String rowkey1 = "rk001";
        Put put1 = new Put(Bytes.toBytes(rowkey1));
        put1.add(Bytes.toBytes(cf), Bytes.toBytes("c1"), Bytes.toBytes("001"));
        put1.add(Bytes.toBytes(cf), Bytes.toBytes("c2"), Bytes.toBytes("002"));
        puts.add(put1);

        String rowkey2 = "rk002";
        Put put2 = new Put(Bytes.toBytes(rowkey2));
        put2.add(Bytes.toBytes(cf), Bytes.toBytes("c1"), Bytes.toBytes("v2001"));
        put2.add(Bytes.toBytes(cf), Bytes.toBytes("c2"), Bytes.toBytes("v2002"));
        puts.add(put2);

        table.put(puts); // both rows are submitted together
    } catch (IOException e) {
        e.printStackTrace();
    }
}

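MapReduce

The two approaches above are normally used for small batches. What about large volumes of data? One common answer is to write to HBase in parallel from multiple threads, but then you have to split up the work yourself, which is tedious. Also remember that HTable is not thread-safe, so multithreaded writes need extra care. The more common answer is MapReduce. HBase runs on top of HDFS, so it integrates well with MapReduce.

When MapReduce writes to HBase, the input files are split into blocks and handed to the cluster, which reads the blocks in a distributed fashion and writes the data into HBase. The details vary with the business case. The most common pattern uses TableMapReduceUtil, the utility class HBase provides for integrating with MapReduce; see the official HBase manual for the details. The listing here only covers the case where the map side reads the data and writes it straight to HBase. This is typically used to load Hive tables or files into HBase: no business logic is needed and the data is stored as is. The row key is usually one field or several fields concatenated. For card data, for example, the card number serves as the row key. (The card number has to be hashed first: card numbers usually start with 62 or 40, which would otherwise cause hotspotting.)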
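One way to avoid the hotspot mentioned above is to prefix the card number with a short, deterministic bucket derived from a hash of the number. The sketch below is an illustration only, under the assumption that a two-digit bucket prefix is acceptable for the table; the class SaltedRowKey, the method salt and the "NN_" prefix format are hypothetical and not taken from the original code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that spreads sequential keys over a fixed number of buckets. */
final class SaltedRowKey {

    /** Returns "NN_cardNo", where NN is a bucket in [0, buckets) derived from an MD5 hash. */
    static String salt(String cardNo, int buckets) {
        if (buckets < 1 || buckets > 100) {
            throw new IllegalArgumentException("buckets must be between 1 and 100");
        }
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(cardNo.getBytes(StandardCharsets.UTF_8));
            int bucket = (digest[0] & 0xFF) % buckets; // first hash byte picks the bucket
            return String.format("%02d_%s", bucket, cardNo);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(salt("6200000000000001", 16));
        System.out.println(salt("6200000000000002", 16));
    }
}
```

Because the prefix is computed from the card number itself, a reader can recompute it for point lookups. The cost is that a range scan over card numbers has to be issued once per bucket.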

package hbase.demo.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Created by BB on 2017/2/26.
 */
public class InsertMR extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options and then calls run()
        System.exit(ToolRunner.run(HBaseConfiguration.create(), new InsertMR(), args));
    }

    @Override
    public int run(String[] strings) throws Exception {
        String jobName = "insert data into hbase";
        String outputTable = "OutTable";
        String inputPath = "/usr/mapreduce/input";
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, jobName);

        job.setJarByClass(InsertMR.class);

        job.setMapperClass(InsertMap.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        job.setInputFormatClass(TextInputFormat.class); // Hadoop uses TextInputFormat by default

        // Only the input path is set: initTableReducerJob below switches the
        // output format to TableOutputFormat, so no file output path is used.
        FileInputFormat.setInputPaths(job, new Path(inputPath));

        TableMapReduceUtil.initTableReducerJob(
                outputTable,
                null,
                job);
        job.setNumReduceTasks(0); // map-only job: the mapper's Puts go straight to the table
        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Must be static so that Hadoop can instantiate the mapper by reflection
    public static class InsertMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] items = value.toString().split(",", -1);
            if (items.length < 2) {
                return; // skip malformed lines; real I/O errors are no longer swallowed
            }
            byte[] rk = Bytes.toBytes(items[0]); // row key field
            Put put = new Put(rk);
            put.add(Bytes.toBytes("f1"), Bytes.toBytes("c1"), Bytes.toBytes(items[0]));
            put.add(Bytes.toBytes("f1"), Bytes.toBytes("c2"), Bytes.toBytes(items[1]));
            context.write(new ImmutableBytesWritable(rk), put);
        }
    }
}

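This approach ends up in the TableOutputFormat class, and at its core it still uses HTable's put method. Because MapReduce submits to HBase in a distributed fashion, it is much faster than a single thread. It is not a cure-all, though. When puts arrive too fast they put considerable pressure on HBase, and the resulting GC pauses can easily bring a node down. This is especially true when a table is first initialized, since there is usually a large amount of historical data to load. In that situation the bulk load approach described next is recommended, because it reduces the pressure on HBase.

The listing above builds the Put in the map phase and hands it to TableOutputFormat to submit, because almost no logic is needed here. If business logic is required, such as joins or aggregation, it is usually done on the map side and the Put objects are built on the reduce side.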

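Bulk load

If the approaches above still cannot meet your needs, consider bulk load. Although the approaches above differ in implementation, they are the same underneath: an HTable object calls put, HTable submits to the RegionServer over RPC, and the data passes through the LSM write path before it finally reaches disk. HBase data ultimately lands on disk as HFiles. Is there a way to bypass all of the preceding steps and produce the final HFiles directly? There is, and that is exactly how bulk load works: MapReduce generates the HFiles, and the files are then moved into the directory where HBase stores its data. This saves the time of a huge number of requests and spares the RegionServers the work of processing the data.

Since this touches on the HBase storage architecture, only an outline of the process is given. The map side emits the cells (KeyValue objects in the listing that follows, although Put objects also work with a matching reducer), and the reduce side simply uses the KeyValueSortReducer provided by HBase. The reduce side sorts the data by row key and generates the HFiles, the HFiles are split according to the region boundaries, and each piece is placed in the directory of the corresponding region. Without going into more detail, here is the code.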

driver

package com.hbase.mapreudce.driver;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.SimpleTotalOrderPartitioner;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

import com.hbase.mapreudce.map.BaseBulkLoadBaseMapper;
import com.spdbccc.mapreduce.plus.util.ConnectUtil;
import com.spdbccc.mapreduce.plus.util.Util;

public class BulkLoadHFileDriver extends Configured implements Tool {

    private static Logger logger = Logger.getLogger(BulkLoadHFileDriver.class);

    private Configuration conf;

    public static void main(String[] args) throws Exception {
        BulkLoadHFileDriver bld = new BulkLoadHFileDriver();
        bld.excute(args);
    }

    public void excute(String[] args) throws Exception {
        // Run this instance (not a new one) so that the conf set in run() is visible here
        int rtn = ToolRunner.run(this, args);
        if (rtn != 0) {
            logger.error("HFile generation failed, skipping bulk load");
            return;
        }
        // Load the HFiles only after the job has finished writing them
        this.dobulkLoadFile(conf);
    }

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries the generic options parsed by ToolRunner
        this.conf = HBaseConfiguration.create(getConf());

        // The configuration keys are left blank in the original listing
        String tablename = conf.get("", "");
        String inputPathstr = conf.get("", "");
        String outputPathstr = conf.get("", "");

        Path outputPath = Util.getTempPath(conf, outputPathstr, true);

        Job job = Job.getInstance(conf, "HFile bulk load test");
        job.setJarByClass(BulkLoadHFileDriver.class);

        job.setMapperClass(BaseBulkLoadBaseMapper.class);
        job.setReducerClass(KeyValueSortReducer.class);

        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);

        job.setPartitionerClass(SimpleTotalOrderPartitioner.class);

        FileInputFormat.addInputPath(job, new Path(inputPathstr));
        FileOutputFormat.setOutputPath(job, outputPath);

        // Configures the job so that the HFiles match the table's current regions
        HFileOutputFormat2.configureIncrementalLoad(job, ConnectUtil.getHTable(conf, tablename));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    private void dobulkLoadFile(Configuration conf) throws Exception {
        String tablename = conf.get("", "");
        String hfiledirpathstr = conf.get("", "");

        // Moves the generated HFiles into the region directories of the table
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path(hfiledirpathstr), ConnectUtil.getHTable(conf, tablename));
    }
}

map

package com.hbase.mapreudce.map;

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BaseBulkLoadBaseMapper extends
        Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Expected input line: rowkey,family,qualifier,value
        String[] items = value.toString().split(",", -1);
        if (items.length < 4) {
            return; // skip malformed lines
        }
        byte[] row = Bytes.toBytes(items[0]);
        ImmutableBytesWritable rowkey = new ImmutableBytesWritable(row);

        KeyValue kv = new KeyValue(row,
                Bytes.toBytes(items[1]), Bytes.toBytes(items[2]),
                System.currentTimeMillis(), Bytes.toBytes(items[3]));
        context.write(rowkey, kv);
    }
}

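I. HBase characteristics:

1. Loose schema. HBase is essentially an efficient nested map: users can define columns at run time, and every row has its own set of columns.

2. Denormalized data.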

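II. Structure of an HBase table

1. Row key: rows are stored in lexicographic order of the row key.

2. Column family: a group of columns.

3. Cell: the intersection of a row and a column. Cells are versioned (by timestamp by default), with the newest version first. Three versions are kept by default (this was the default in older releases; since 0.96 the default is one). The data in a cell is stored as a raw byte array.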
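The table structure described above can be pictured as a sorted map of sorted maps. The following is a minimal in-memory sketch of that idea and not how HBase stores data on disk; the class NestedMapTable and its methods are hypothetical, the column is written as a single "family:qualifier" string, and values are Strings rather than byte arrays to keep the example short.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory model: row -> "family:qualifier" -> timestamp (newest first) -> value. */
final class NestedMapTable {
    private final int maxVersions;
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>(); // rows sorted lexicographically by row key

    NestedMapTable(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Stores one cell version and drops the oldest versions beyond maxVersions. */
    void put(String row, String column, long timestamp, String value) {
        NavigableMap<Long, String> versions = rows
                .computeIfAbsent(row, r -> new TreeMap<>())
                .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()));
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry = smallest timestamp = oldest version
        }
    }

    /** Returns the newest version of the cell, or null if the cell does not exist. */
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    /** Number of versions currently kept for the cell. */
    int versionCount(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return 0;
        }
        return columns.get(column).size();
    }
}
```

A cell that was never written simply has no entry in the inner maps, which matches the point that empty values take up no storage.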

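III. Column families

1. Column families must be defined when the table is created.

2. There is no limit on the number of columns in a column family.

3. All columns of a column family are stored together. An HFile holds the data of a single column family.

4. Columns within a column family are sorted in lexicographic order.

5. Columns can be created dynamically at run time.

6. A column exists only once data has been inserted into it. Empty values are not stored and take up no space.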

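IV. HBase storage settings

1. Compression: none by default. GZ, LZO and SNAPPY can be used.

2. The default HBase block size is 64 KB, not 64 MB, because HBase has to support random access. Once the block containing a row key has been found, that block is scanned for the cell data, and scanning a 64 KB block is clearly faster than scanning 64 MB.

3. In-memory: false by default. When set to true, HBase tries to keep the whole column family in memory.

4. Block cache: true by default. HBase uses the block cache to keep recently used blocks in memory. Blocks are evicted from the block cache on a least recently used (LRU) basis.

5. Bloom filter: a space-efficient data structure. Set it to ROW for a row-level Bloom filter, or to ROWCOL for a Bloom filter at the level of row key plus column qualifier.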
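To make the Bloom filter setting less abstract, here is a small generic Bloom filter. It is a sketch under simplifying assumptions and not the HBase implementation: the class RowBloomFilter is a hypothetical name, and the hash functions are derived from String.hashCode for brevity. For a ROW filter the key would be the row key; for ROWCOL it would be the row key combined with the column qualifier.

```java
import java.util.BitSet;

/** Hypothetical minimal Bloom filter: may report false positives, never false negatives. */
final class RowBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    RowBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** i-th bit position for a key, using double hashing on String.hashCode. */
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step so the probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means the key is definitely absent, so the file need not be read. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The benefit on the read path is that a get can skip any file whose filter answers false, at the price of a few bits per stored key.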

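V. HBase components

Architecture diagram (official figure)

1. ZooKeeper

Main responsibility: it provides loosely coupled messaging between remote components, which coordinate their state and health information through heartbeats.

2. Master

A cluster can contain several master nodes at the same time, but only one of them acts as the active master at any moment; the others are standbys. If the active master's lease with ZooKeeper expires, the remaining standbys compete again and ZooKeeper elects the new master.

Functions: it performs cluster administration, tracks which node holds which data (identified by row key), and controls load balancing and failover. The master does not query the RegionServers directly; it learns the running state of each RegionServer through ZooKeeper.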

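3. RegionServer

A RegionServer manages multiple regions, and the data is physically stored in the regions. Each RegionServer has one WAL (write-ahead log) file and one block cache. The WAL file is stored in HDFS so that it cannot be lost. All regions on the RegionServer share the WAL and the block cache.

4. WAL

Function: it guarantees that data is not lost. If a failure occurs, the data can be recovered from the WAL.

The WAL holds every operation that the RegionServer has not yet written to an HFile. Data is written to the WAL first and then to the MemStore. When the MemStore reaches its preset size, the data is written to an HFile and persisted to disk. Turning the WAL off improves write performance, but at the risk of losing data.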

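5. Block cache

An in-memory structure that follows the LRU (least recently used) eviction rule. It is enabled for tables by default. Data fetched by SCAN and GET is loaded into the cache. Since version 0.96 an off-heap cache implemented with Java NIO is supported, an optimization that greatly reduces GC pressure on the RegionServer.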
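The LRU rule can be shown in a few lines with java.util.LinkedHashMap in access order. This is a toy sketch and not the cache HBase ships: the class name ToyLruCache is hypothetical, and capacity is counted in entries, whereas a real block cache is sized in bytes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache: the least recently accessed entry is evicted first. */
final class ToyLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    ToyLruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

For example, with a capacity of two, putting blocks "a" and "b", reading "a", and then putting "c" evicts "b", because "a" was touched more recently.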

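6. Region

A RegionServer hosts multiple regions. Each region belongs to one table and manages the rows within a certain row key range of that table. The regions hosted by one RegionServer do not have to cover adjacent ranges. A region stores the different column families of the table separately, and an HFile holds the data of a single column family.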

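7. MemStore

When HBase receives an update, it is written to the WAL first and then to the MemStore (an in-memory structure). The MemStore size is preset with hbase.hregion.memstore.flush.size. Every flush creates one HFile for each column family in the region. When a user reads data, the MemStore is checked first, and HBase tries the HFiles only if the value is not found in the MemStore.

Note: flushing a MemStore can block updates. Writes to the region stall while the flush takes its snapshot, and for longer if the MemStore grows past its blocking limit before the flush completes.

Early versions of the MemStore were implemented on the Java heap, which produced heavy heap fragmentation and led to periodic full GCs, and a full GC pauses the process.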
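The write path (WAL first, then MemStore, then a flush to an HFile) and the read path (MemStore first, then HFiles) can be condensed into a toy model. Everything in this sketch is an assumption made for illustration: the class RegionStoreSketch is hypothetical, the flush threshold counts cells rather than bytes, an "HFile" is an immutable sorted map held in memory, and there is only one column family.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical single-family store: WAL -> MemStore -> flushed, sorted "HFiles". */
final class RegionStoreSketch {
    private final int flushThreshold; // number of cells; stands in for the flush size in bytes
    private final List<String[]> wal = new ArrayList<>();
    private NavigableMap<String, String> memStore = new TreeMap<>();
    private final List<NavigableMap<String, String>> hFiles = new ArrayList<>(); // oldest first

    RegionStoreSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(new String[] {key, value}); // 1. log the edit so it can be replayed after a crash
        memStore.put(key, value);           // 2. apply it in memory, kept sorted by key
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    /** Persists the MemStore as a new sorted file and drops the WAL entries it covers. */
    void flush() {
        if (memStore.isEmpty()) {
            return;
        }
        hFiles.add(Collections.unmodifiableNavigableMap(memStore));
        memStore = new TreeMap<>();
        wal.clear();
    }

    /** Reads the MemStore first, then the files from newest to oldest. */
    String get(String key) {
        String value = memStore.get(key);
        if (value != null) {
            return value;
        }
        for (int i = hFiles.size() - 1; i >= 0; i--) {
            value = hFiles.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int hFileCount() {
        return hFiles.size();
    }

    int walSize() {
        return wal.size();
    }
}
```

The model also shows why skipping the WAL is risky: anything that is only in the MemStore disappears with the process unless the log can replay it.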

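8. HFile

The file that ultimately holds HBase data (stored in HDFS). An HFile belongs to one column family of one table. Its data is sorted lexicographically by row key and column, and within a cell the versions are sorted in descending order. The HFiles produced by separate MemStore flushes are not ordered relative to one another, so HBase merges them through minor compaction and major compaction. Once the number of HFiles reaches a threshold, a minor compaction starts and merges several HFiles into one larger HFile. Major compaction runs at a fixed time each day, ideally when system activity is low (daily is the default in older releases; newer releases default to every seven days). The merge re-sorts the data, so the data in the merged file is still ordered.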

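9. Two important tables: -ROOT- and .META.

(This describes releases before 0.96. Later releases removed -ROOT- and keep the location of the meta table, renamed hbase:meta, in ZooKeeper.)

-ROOT- stores the location of .META.: it points to the node in the cluster that holds .META. It is the only table that is never split.

.META. stores the metadata of every region in the cluster. The key class is HRegionInfo, which includes the table information and the start and end row keys of the region.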

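VI. How a client connects to HBase

A. The client connects to the ZooKeeper ensemble and obtains the location of the -ROOT- table from it.

B. The client contacts the -ROOT- table and obtains the location of the .META. table.

C. The client contacts the .META. table and downloads the list of regions and their locations.

D. Using the information downloaded from .META., the client connects directly to the regions on the RegionServers to work with the data. The client performs a series of lookups during this process.

E. The client caches -ROOT- and .META. locally. It refreshes this local catalog cache only when, after connecting to a RegionServer, it finds that the row key range given by .META. is no longer served by that region.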
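The client-side catalog cache from steps C to E boils down to a sorted map from region start key to server, where a row is served by the region with the greatest start key that is less than or equal to the row key. The sketch below illustrates that lookup only; RegionLocationCache, its methods and the server names are hypothetical, and the real client also tracks end keys and region names.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical client-side cache: region start key -> hosting server. */
final class RegionLocationCache {
    private final TreeMap<String, String> serverByStartKey = new TreeMap<>();

    /** Records one region as downloaded from the catalog table. */
    void cache(String startKey, String server) {
        serverByStartKey.put(startKey, server);
    }

    /** Finds the server whose region starts at or before rowKey; null if nothing is cached. */
    String locate(String rowKey) {
        Map.Entry<String, String> entry = serverByStartKey.floorEntry(rowKey);
        return entry == null ? null : entry.getValue();
    }

    /** Called when a server reports that it no longer hosts the row: forces a re-download. */
    void invalidate() {
        serverByStartKey.clear();
    }

    public static void main(String[] args) {
        RegionLocationCache cache = new RegionLocationCache();
        cache.cache("", "rs1");  // first region starts at the empty key
        cache.cache("m", "rs2"); // second region starts at "m"
        System.out.println(cache.locate("apple"));
        System.out.println(cache.locate("zebra"));
    }
}
```

Because the lookup is local, most requests go straight to the right RegionServer without touching ZooKeeper or the catalog tables.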

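VII. Regions and compaction

If no pre-splitting is configured when a table is created, the new table starts with a single region. When that region reaches the size set by hbase.hregion.max.filesize (10 GB by default), it is split. Note that this split is a logical one at first: only the row key boundaries of the new regions are recorded. The regions keep growing, and only a major compaction turns the logical split into a physical one, rewriting the original data files into separate files under the HDFS paths of the new regions. The original region is then physically divided into two regions, and these two regions may be served by two different RegionServers.

Compaction is the process of merging the HFiles in a region. Reducing the number of HFiles improves HBase performance because scans and gets have fewer files to read. There are two kinds of compaction: minor and major.

Minor compaction merges several HFiles into one. While a minor compaction is running, clients can still write to the MemStore, but the MemStore is not flushed. If the MemStore fills up, the region stops serving requests and the clients talking to it block until the compaction finishes and operation resumes.

Major compaction merges all HFiles in a region into a single file. During this process all deleted data and all cells whose versions have expired are removed. By default a major compaction runs once a day. (Tip: an HBase delete does not remove the data as soon as the delete command completes. The data is first marked with a delete marker and is physically removed only during a major compaction.)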
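The difference between the two kinds of compaction can be sketched by merging sorted maps. This is a simplified illustration rather than HBase's compaction code: the class CompactionSketch and the TOMBSTONE marker are hypothetical, each "HFile" is a sorted map with one value per key, and version expiry is left out.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical compaction over in-memory "HFiles" (sorted maps, oldest file first). */
final class CompactionSketch {
    /** Stand-in for a delete marker; the value is made up for this example. */
    static final String TOMBSTONE = "__deleted__";

    /**
     * Merges the files into one sorted file. Newer files override older ones.
     * Only a major compaction physically drops the deleted entries.
     */
    static NavigableMap<String, String> compact(List<NavigableMap<String, String>> files,
                                                boolean major) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> file : files) {
            merged.putAll(file); // later (newer) files win on key collisions
        }
        if (major) {
            merged.values().removeIf(TOMBSTONE::equals);
        }
        return merged;
    }
}
```

A minor compaction has to keep the delete markers because an older file outside the merged set may still contain the deleted cell; a major compaction sees every file of the region and can drop both.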