This is too good to keep to myself; the author, Bruno Dumon, writes very well!

Source: http://outerthought.org/blog/465-ot.html

I was looking in more detail at how HBase compactions work, and given my experience collecting metrics for Lily, and also inspired by this blog post on Lucene, I thought it would be nice to run some tests and make some graphs of the HBase flush and compact process.

Background

When something is written to HBase, it is first written to an in-memory store (the memstore). Once this memstore reaches a certain size, it is flushed to disk into a store file (everything is also written immediately to a log file for durability). The store files created on disk are immutable. Sometimes the store files are merged together; this is done by a process called compaction.

This buffer-flush-merge strategy is a common pattern, also used by Lucene for example, and is described in a paper called the Log-Structured Merge-Tree (I have never read it though). In contrast to Lucene, where one usually has only one IndexWriter, HBase typically has hundreds of regions open.

There are two kinds of compactions:

  • minor compactions: these are triggered each time a memstore is flushed, and merge some of the store files, selected by an algorithm described below.

  • major compactions: these run about every 24 hours (counted from when the currently oldest store file was written), and merge all store files together into one. The 24 hours is adjusted by a random margin of up to 20% to avoid many major compactions happening at the same time. Major compactions can also be triggered manually, via the API or the shell.
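As a rough sketch of that jittered schedule (illustrative arithmetic only, not the actual region server code):

    import java.util.Random;

    // A 24-hour base period reduced by a random margin of up to 20%,
    // so that regions do not all major-compact at the same moment.
    long basePeriodMs = 24L * 60 * 60 * 1000;
    Random rnd = new Random();
    long jitterMs = (long) (basePeriodMs * 0.20 * rnd.nextDouble());
    long effectivePeriodMs = basePeriodMs - jitterMs; // somewhere between 19.2h and 24h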

There is another difference between minor and major compactions: major compactions process delete markers, max versions, etc., while minor compactions don't. This is because delete markers might also affect data in the files that are not being merged, so it is only possible to process them when merging all files.

If, after a compaction, a newly written store file is larger than a certain size (default 256 MB, see the property hbase.hregion.max.filesize), the region is split into two new regions.

The basic compaction process

The illustration below shows a simple scenario: in a loop we perform a put of a row with a single 1k value. The table has one column family and only one region. The flush size of the memstore has been set to 100MB (hbase.hregion.memstore.flush.size), and the maximum file size (hbase.hregion.max.filesize) has been set high enough to avoid splits from happening. All tests in this blog have been done on a single node (my laptop).
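As a reference point, such a table could be created through the Java admin API of that era; the table and family names below are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTestTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("compactiontest");
        // per-table equivalent of hbase.hregion.memstore.flush.size: 100 MB
        table.setMemStoreFlushSize(100L * 1024 * 1024);
        // per-table equivalent of hbase.hregion.max.filesize: set very high
        // so the region never splits during the test
        table.setMaxFileSize(100L * 1024 * 1024 * 1024);
        table.addFamily(new HColumnDescriptor("data"));
        admin.createTable(table);
      }
    }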

The main goal here is to illustrate how the minor compaction process decides when and what to merge.

We see that the memstore grows gradually until it reaches its flush size of 100MB; when this happens, it is flushed to disk into a new store file. This process repeats itself until we have 3 store files. At that point, a compaction is performed which merges the files together into one file.

The second compaction is only performed after 4 store files are written. How does HBase decide this?

The algorithm is basically as follows:

  1. Run over the set of all store files, from oldest to youngest.

  2. If there are at least 3 (hbase.hstore.compactionThreshold) store files left, and the current store file is 20% larger than the sum of all younger store files, and it is larger than the memstore flush size, then we move on to the next, younger, store file and repeat step 2.

  3. Once one of the conditions in step 2 no longer holds, the store files from the current one up to the youngest are the ones that will be merged together. If there are fewer than compactionThreshold of them, no merge is performed. There is also a limit which prevents more than 10 (hbase.hstore.compaction.max) store files from being merged in one compaction.

If you need more detail, have a look at the source of the Store.compact() method.
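As a simplified sketch of the selection rule (an illustration of steps 1-3 above, not the actual Store.compact() code):

    import java.util.Collections;
    import java.util.List;

    public class CompactionSelection {
      // 'sizes' holds the store file sizes in bytes, ordered oldest first.
      // Returns the (possibly empty) sublist of files that would be merged.
      static List<Long> select(List<Long> sizes, long memstoreFlushSize,
                               int compactionThreshold,  // hbase.hstore.compactionThreshold, default 3
                               int maxFilesToCompact) {  // hbase.hstore.compaction.max, default 10
        int start = 0;
        while (sizes.size() - start >= compactionThreshold) {
          long current = sizes.get(start);
          long sumYounger = 0;
          for (int i = start + 1; i < sizes.size(); i++) {
            sumYounger += sizes.get(i);
          }
          // Skip this (older) file if it is larger than the flush size AND
          // more than 20% larger than the sum of all younger files.
          if (current > memstoreFlushSize && current > 1.2 * sumYounger) {
            start++;
          } else {
            break;
          }
        }
        List<Long> candidates = sizes.subList(start, sizes.size());
        if (candidates.size() < compactionThreshold) {
          return Collections.emptyList(); // too few files: no compaction
        }
        // never merge more than maxFilesToCompact files in one compaction
        return candidates.subList(0, Math.min(candidates.size(), maxFilesToCompact));
      }
    }

Feeding this the file sizes from the test reproduces the decisions discussed below.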

Now, with this knowledge, we can see why the second compaction does not run after flush 5: the oldest file was about ~240MB and the two younger ones together about ~160MB; since 240 > 160 * 1.2, the oldest file is skipped, and the two remaining files are fewer than the compaction threshold, so nothing is merged.

When the second compaction runs after flush 6, it does merge all the files together, instead of just the 3 youngest, since ~240 > 3*80*1.2 = 288 is not true. After flush 9, we have ~480 > 288, so the oldest store file is not included in the merge.

With the default configuration of HBase, a store file of 480MB would cause the region to be split. I deliberately raised that limit to illustrate what happens when there are more store files, though I could also have lowered the memstore flush size.

If you wonder how the store file merging behaves over a longer period, below is the result of a longer run with 157 flushes of 10 MB. Click on the image below to see it bigger (10000px by 800px).

The largest number of store files at any time is 7 (see number 143 in the chart), though the oldest (biggest) one is only merged occasionally. It is a good thing that the number of store files stays small, because when you read a row from HBase, it needs to check all the store files as well as the memstore (this can be optimized through bloom filters or by specifying timestamp ranges).
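For example, with the 0.90-era client API (the table, family, row key, and timestamps below are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.regionserver.StoreFile;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();

    // Restrict a Get to a timestamp range: store files whose data falls
    // entirely outside the range can be skipped during the read.
    HTable htable = new HTable(conf, "compactiontest");
    Get get = new Get(Bytes.toBytes("row1"));
    get.setTimeRange(1298246400000L, 1298332800000L); // hypothetical [min, max) range
    Result result = htable.get(get);

    // Enable a row-level bloom filter on a column family: reads can then
    // skip store files that certainly do not contain the requested row.
    HColumnDescriptor family = new HColumnDescriptor("data");
    family.setBloomFilterType(StoreFile.BloomType.ROW);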

The number 7 might make you think of the property hbase.hstore.blockingStoreFiles, which has a default value of 7, but I raised it for this test to be sure it did not have any influence.

Compactions and deletes

When you perform a delete in HBase, nothing gets deleted immediately; rather, a delete marker (a.k.a. tombstone) is written. This is because HBase does not modify files once they are written.

The deletes are processed during the major compaction process, at which point neither the data they hide nor the delete markers themselves will be present in the merged file.

Let's illustrate this with an extreme scenario. The following graphs are from a scenario where, in a loop, we put a row and then immediately delete it again. Conceptually, if we look at the table during the test, it will contain at most one row. Nevertheless, the store files will only grow.
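In client code, the scenario boils down to something like this (the table, family, and qualifier names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutDeleteLoop {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "deletetest");
        byte[] family = Bytes.toBytes("data");
        byte[] qualifier = Bytes.toBytes("field");
        byte[] value = new byte[1024]; // a 1k value, as in the earlier tests

        long i = 0;
        while (true) {
          byte[] row = Bytes.toBytes("row-" + i++);
          Put put = new Put(row);
          put.add(family, qualifier, value);
          htable.put(put);
          // The delete only writes a tombstone; the put is not physically
          // removed until a major compaction runs.
          htable.delete(new Delete(row));
        }
      }
    }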

What we see is:

  • the memstore grows and is flushed three times; three store files are created

  • after the third store file is written, a compaction is performed

  • the result of the compaction is a new store file of size 0.

This is a surprising result: only major compactions process deletes, and major compactions supposedly only run every 24 hours.

What happens is that if the set of store files selected for compaction is the set of all store files, then HBase decides to do a major compaction. In fact, it has to, otherwise the 24-hour check, which looks at the age of the oldest store file, would mistakenly assume that this one file was the result of a major compaction.

Besides explicit deletes, there are other cases where store files shrink after a major compaction: cells that are removed because there are more than max-versions versions of them, or cells that have expired when using the TTL feature.
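Both are per-column-family settings; a sketch (the family name is made up):

    import org.apache.hadoop.hbase.HColumnDescriptor;

    // Settings that cause cells to be dropped at major compaction time:
    HColumnDescriptor family = new HColumnDescriptor("data");
    family.setMaxVersions(3);             // keep at most 3 versions of each cell
    family.setTimeToLive(7 * 24 * 3600);  // expire cells older than 7 days (in seconds)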

Multiple column families

Each column family has its own memstore and its own set of store files. But all the column families within a table stick together: they all flush their memstores at the same time (this might change). They are also split along the same region boundaries, so when one of the column families grows too large, the others, however small they might be, are split as well.

The illustration below shows a scenario where a table with two column families is used. In family 2, we only do one put for every 10 puts we do in family 1, so family 1 will grow much faster than family 2 (family 2 is the sparse family).

The memstore and store file count graphs show the data for both families counted together. Note how at each memstore flush, the number of store files increases by two.

You can also see how a second compaction is performed for family 2 but not for family 1. This is because all store files in family 2 are smaller than the memstore flush size of 100 MB.

Multiple regions

Now let's do a scenario with a table that has multiple regions.

For this test, a table was created with 10 initial regions, divided so that random UUIDs would be evenly spread over these regions. This time the memstore flush size was set to 30 MB.
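Creating such a pre-split table could look like this (the names and split points are made up; real split keys must match the distribution of the row keys, here random UUID strings):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePresplitTable {
      public static void main(String[] args) throws Exception {
        HTableDescriptor desc = new HTableDescriptor("regiontest");
        desc.setMemStoreFlushSize(30L * 1024 * 1024); // 30 MB, as in this test
        desc.addFamily(new HColumnDescriptor("data"));

        byte[][] splitKeys = new byte[9][]; // 9 split points give 10 regions
        for (int i = 1; i <= 9; i++) {
          splitKeys[i - 1] = Bytes.toBytes(Integer.toString(i)); // illustrative boundaries
        }
        new HBaseAdmin(HBaseConfiguration.create()).createTable(desc, splitKeys);
      }
    }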

As the puts are performed, the memstores of each region fill up evenly, so we expect them all to reach the flush size at about the same time, and thus to flush at about the same time. However, in the HBase region server there is only one thread which performs flushes, so they happen in sequence. This means that the memstores of some regions will keep growing above their limit. The property hbase.hregion.memstore.block.multiplier controls how many times a memstore can grow beyond its limit before HBase blocks updates. The default is 2; in our example this means a memstore would be allowed to grow to at most 2*30 = 60MB.
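In other words (illustrative arithmetic, not HBase code):

    long flushSize = 30L * 1024 * 1024;              // hbase.hregion.memstore.flush.size
    int blockMultiplier = 2;                         // hbase.hregion.memstore.block.multiplier
    long blockingSize = blockMultiplier * flushSize; // 60 MB: above this, updates are blocked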

Similarly, there is only one thread for doing the compactions and splits, so these are performed in sequence too. Still, during that period resources are used which would otherwise have been available to serve client requests. If you ran a similar scenario on a cluster, the different nodes could still run their compactions in parallel. In such a scenario, where all regions fill very evenly and decide to do compactions at about the same time, you can see some negative spikes when observing the hbaseRequestCount metric. In the chart here you can notice that the memstore grows a bit less steeply during the compactions.

Memstore upperLimit

All the above illustrations were quite synthetic. On a real system, there will often be many more regions, and more than one table, and there will often not be enough memory to let the memstores grow to their configured flush size.

By default, HBase uses 40% of the heap (see the property hbase.regionserver.global.memstore.upperLimit) for all memstores of all regions of all column families of all tables. If this limit is reached, it starts flushing memstores until the memory used by memstores drops below the lower limit of 35% of the heap (hbase.regionserver.global.memstore.lowerLimit).

Let's simulate this scenario by again using 10 regions, but now with a memstore flush size of 100 MB. The region server in this test has only 1GB of memory, so 400MB is available for the memstores. Given that we have 10 memstores (1 table, 1 column family, 10 regions), and that we fill them evenly, we expect the first flushes to happen when the memstores reach 400/10 = 40MB (instead of going on to their 100MB flush size).
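The expected flush point follows from simple arithmetic (illustrative, not HBase code):

    long heap = 1024L * 1024 * 1024;                    // 1 GB region server heap
    double upperLimit = 0.4;                            // hbase.regionserver.global.memstore.upperLimit
    int memstores = 10;                                 // 1 table x 1 column family x 10 regions
    long globalLimit = (long) (heap * upperLimit);      // 400 MB for all memstores together
    long expectedFlushPoint = globalLimit / memstores;  // ~40 MB per memstore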

And, hurray, the theory is confirmed by the plot below.

Some final notes:

  • during flushes & compactions, HBase keeps processing put and get requests, always giving a consistent view of the data
  • certain properties, like the flush size, can be configured at the table level

The charts were made with gnuplot and afterwards annotated with Inkscape. The metrics were collected through HBaseAdmin's ClusterStatus and JMX. The store file sizes were read directly from HDFS. I set hbase.regionserver.msginterval to 1000 to get fine-grained data from ClusterStatus. Lily's clientmetrics package was used to collect and output the metrics, and to generate the necessary input files for gnuplot.
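Polling those cluster-wide metrics looks roughly like this (a sketch against the 0.90-era API):

    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HServerInfo;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    ClusterStatus status = admin.getClusterStatus();
    for (HServerInfo server : status.getServerInfo()) {
      // per-server load: request counts, region counts, memstore sizes, ...
      System.out.println(server.getLoad());
    }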

categories: hbase nosql design
by Bruno Dumon on 2/21/11
