Seek vs. Transfer

I have previously written a dedicated comparison of B+ trees and LSM trees:

http://www.cnblogs.com/fxjwind/archive/2012/06/09/2543357.html

The last post there gives a fairly good analysis of the essential differences between B+ trees and LSM trees (Log-Structured Merge-Trees): the balance between read and write efficiency, global ordering versus local ordering, and so on.
But I never quite understood the title "Seek vs. Transfer", so let me focus on explaining it here.

A B+ tree ultimately lives on disk, and the unit of data exchanged with the disk is the page. In the book's example, when a page exceeds its configured size, it is split.

The problem, as described below, is that logically adjacent pages are not necessarily adjacent on disk; they may be far apart.

The issue here is that the new pages aren't necessarily next to each other on disk. So now if you ask to query a range from key 1 to key 3, it's going to have to read two leaf pages which could be far apart from each other.
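
To make this concrete, here is a minimal, purely illustrative sketch (the page size, keys, and append-only allocator are all made up, not code from any B+-tree library): the new leaf created by a split is allocated at the end of the page file, so two logically adjacent leaves end up physically far apart, and a small range scan pays two distant seeks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Purely illustrative: pages are allocated append-only in a "page file",
// so the new right-hand leaf created by a split ends up far from its left sibling.
public class LeafSplitSketch {

    static class Leaf {
        final int pageNo;                        // position within the page file
        final TreeMap<Integer, String> kv = new TreeMap<>();
        Leaf(int pageNo) { this.pageNo = pageNo; }
    }

    static final List<Leaf> pageFile = new ArrayList<>();

    static Leaf allocate() {                     // new pages always go to the end of the file
        Leaf l = new Leaf(pageFile.size());
        pageFile.add(l);
        return l;
    }

    public static void main(String[] args) {
        Leaf left = allocate();                  // leaf holding keys 1..3, page 0
        left.kv.put(1, "a");
        left.kv.put(2, "b");
        left.kv.put(3, "c");

        for (int i = 0; i < 50; i++) allocate(); // other inserts allocate 50 unrelated pages

        // The leaf overflows its configured size and is split: the upper half
        // moves to a freshly allocated page at the end of the file.
        Leaf right = allocate();                 // page 51
        right.kv.put(3, left.kv.remove(3));

        // A range query for keys 1..3 must now read page 0 and page 51,
        // i.e. two leaf pages that can be far apart on disk.
        System.out.printf("keys 1-2 on page %d, key 3 on page %d%n",
                left.pageNo, right.pageNo);
    }
}
```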

So whether reading or writing a node of the B+ tree, the first step is a disk seek to locate the page holding that node, which is very inefficient. And as the quote below shows, this problem keeps getting worse as disks grow larger.

While CPU, RAM and disk size double every 18-24 months the seek time remains nearly constant at around 5% speed-up per year.

For reads we can partially mitigate this with a buffer cache, but for random writes the issue is unavoidable, and on top of that random writes produce a lot of page fragmentation.
So for workloads with heavy random writes, a B+ tree is not a suitable index structure.

The LSM tree, in contrast, optimizes random writes quite well. Of course, random writes to disk themselves are hard to optimize, so the strategy is to buffer the random writes in memory, keep them sorted, and finally flush them to disk in batches, thereby turning random writes into sequential writes. For a detailed introduction to LSM trees, see http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html

This effectively avoids the seek problem: the system just keeps transferring data to disk sequentially, which makes writes much more efficient.
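
As a minimal sketch of this write path (not HBase code; the class, threshold, and file names are made up), random writes land in a sorted in-memory map and are flushed in key order as one sequential file:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Illustrative LSM-style write path: buffer + sort in memory, flush sequentially.
public class LsmWriteSketch {
    private static final int FLUSH_THRESHOLD = 4;    // made-up, tiny for demonstration

    private final TreeMap<String, String> memtable = new TreeMap<>(); // keeps keys sorted
    private int flushCount = 0;

    public void put(String key, String value) throws IOException {
        memtable.put(key, value);                    // random write lands in memory only
        if (memtable.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    private void flush() throws IOException {
        Path file = Path.of("storefile-" + (flushCount++) + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            // Writing the already-sorted entries front to back is a purely
            // sequential transfer: no per-key disk seek is needed.
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue());
                out.newLine();
            }
        }
        memtable.clear();
    }

    public static void main(String[] args) throws IOException {
        LsmWriteSketch store = new LsmWriteSketch();
        // Keys arrive in random order but are flushed in sorted order.
        for (String k : new String[]{"k42", "k07", "k99", "k13"}) {
            store.put(k, "value-of-" + k);
        }
    }
}
```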

The book gives numbers showing just how much higher the write efficiency of an LSM tree can be:

When updating 1% of entries (100,000,000) it takes:

• 1,000 days with random B-tree updates
• 100 days with batched B-tree updates
• 1 day with sort and merge

Of course the downside is that global ordering is no longer guaranteed, so reads become somewhat less efficient; this can be addressed with merges (compactions) and Bloom filters.
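
To illustrate the Bloom filter part (a hand-rolled sketch, not HBase's implementation), each store file can keep a small bit array that answers "this key is definitely not in this file", so most files can be skipped without paying a seek:

```java
import java.util.BitSet;

// Simplified Bloom filter: k hash probes into a bit array. False positives are
// possible, false negatives are not, so a miss lets a reader skip the file entirely.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int probe(String key, int i) {
        // Derive several probe positions from two base hash values.
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 15) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(probe(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(probe(key, i))) {
                return false;                    // definitely not in this file
            }
        }
        return true;                             // possibly in this file, must read it
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("row-100");
        System.out.println(filter.mightContain("row-100"));  // true
        System.out.println(filter.mightContain("row-999"));  // almost certainly false
    }
}
```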

The B+ tree is the traditional approach and supports full CRUD, but experience shows that supporting Update and Delete really does make a data system much more complex; and fundamentally, the arrival of new data does not erase the fact that the old data once existed.
So by simplifying the model to Create and Read only, the system becomes much simpler and easier to make fault tolerant.

That is my understanding; below is the original comparison from the book:

Comparing B+ trees and LSM-trees is about understanding where they have their relative strengths and weaknesses.
B+ trees work well until there are too many modifications, because they force you to perform costly optimizations to retain that advantage for a limited amount of time.
The more and faster you add data at random locations, the faster the pages become fragmented again. Eventually you may take in data at a higher rate than the optimization process takes to rewrite the existing files.
The updates and deletes are done at disk seek rates, and force you to use one of the slowest metrics a disk has to offer.

LSM-trees work at disk transfer rates and scale much better to handle vast amounts of data.
They also guarantee a very consistent insert rate, as they transform random writes into sequential ones using the log file plus in-memory store.
The reads are independent from the writes, so you also get no contention between these two operations.
The stored data is always in an optimized layout. So, you have a predictable and consistent bound on the number of disk seeks to access a key, and reading any number of records following that key doesn't incur any extra seeks. In general, what could be emphasized about an LSM-tree based system is cost transparency: you know that if you have five storage files, access will take a maximum of five disk seeks. Whereas you have no way to determine the number of disk seeks an RDBMS query will take, even if it is indexed.

Storage

I have already made notes on this in this earlier post:

http://www.cnblogs.com/fxjwind/archive/2012/08/21/2649499.html

Write-Ahead Log

The region servers keep data in-memory until enough is collected to warrant a flush to disk, avoiding the creation of too many very small files. While the data resides in memory it is volatile, meaning it could be lost if the server loses power, for example. This is a typical problem, as explained in the section called "Seek vs. Transfer".
A common approach to solving this issue is write-ahead logging[87]:
each update (also called an "edit") is written to a log, and only if that has succeeded is the client informed that the operation has succeeded.
The server then has the liberty to batch or aggregate the data in memory as needed.

The idea is actually simple: data needs to be buffered in memory and flushed to disk in batches, but data sitting in memory is easily lost, so a WAL is used to solve that problem.
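
A minimal sketch of the pattern (this is not HBase's HLog; the class and file names are made up): the edit is appended and synced to the log first, and only then is it acknowledged and kept in the in-memory store.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.TreeMap;

// Illustrative write-ahead-log pattern: log first, acknowledge, then buffer in memory.
public class WalSketch {
    private final Path logFile;
    private final TreeMap<String, String> memstore = new TreeMap<>();

    public WalSketch(Path logFile) {
        this.logFile = logFile;
    }

    // Returns true only after the edit has been durably appended to the log.
    public boolean put(String key, String value) {
        String record = key + "\t" + value + "\n";
        try {
            Files.write(logFile, record.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                    StandardOpenOption.SYNC);          // edit hits the log before we ack
        } catch (IOException e) {
            return false;                              // WAL append failed -> operation fails
        }
        memstore.put(key, value);                      // safe to keep in memory now
        return true;                                   // client is informed of success
    }

    public static void main(String[] args) {
        WalSketch server = new WalSketch(Path.of("wal.log"));
        System.out.println(server.put("row-1", "hello"));
    }
}
```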

Overview

The WAL is the lifeline that is needed when disaster strikes. Similar to a binary log in MySQL, it records all changes to the data.
This is important in case something happens to the primary storage. If the server crashes it can effectively replay the log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails, the whole operation must be considered a failure.

Since it is shared by all regions hosted by the same region server it acts as a central logging backbone for every modification.
All regions on a region server share a single WAL. See the Refinements section of the Bigtable paper for the optimizations it describes for the commit-log mechanism.

HLog Class

The class which implements the WAL is called HLog. When an HRegion is instantiated the single HLog instance is passed on as a parameter to the constructor of HRegion. When a region receives an update operation, it can save the data directly to the shared WAL instance.

HLogKey Class

Currently the WAL is using a Hadoop SequenceFile, which stores records as sets of key/values.
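
For reference, here is a tiny example of writing key/value records to a Hadoop SequenceFile, using plain Text keys and values for simplicity (the real HLog uses HLogKey as the key and the edit as the value; this is only meant to show the record-oriented key/value format):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Minimal SequenceFile usage: records are appended as key/value pairs.
public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("example.seq");

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        try {
            writer.append(new Text("region1/row-1"), new Text("edit payload"));
            writer.append(new Text("region1/row-2"), new Text("another edit"));
        } finally {
            writer.close();
        }
    }
}
```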

Read Path

This section of the book is not written very coherently, but the point it is trying to make is that reading a row is not a simple get; internally it is a scan. Why? Because of the LSM tree structure, the data for a single row may be scattered across the memstore and several different files.
For details, see http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html

This shows why continuous compaction is so important: faced with a large number of files, reads would otherwise be very slow. In addition, Bloom filters and timestamps are used to filter out files further and improve read efficiency.
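
A rough sketch of why a get is really a scan (illustrative only; the real read path also deals with versions, delete markers, and KeyValue scanners): the newest value for a row must be looked up in the memstore and then in every store file, newest first, so the fewer files there are, the cheaper the read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative read path: a "get" consults the memstore plus every store file,
// newest first, because the row's latest value may live in any of them.
public class ReadPathSketch {
    private final TreeMap<String, String> memstore = new TreeMap<>();
    // Store files ordered newest first; each one is a sorted key -> value map.
    private final List<TreeMap<String, String>> storeFiles = new ArrayList<>();

    public String get(String row) {
        if (memstore.containsKey(row)) {
            return memstore.get(row);            // most recent data wins
        }
        for (TreeMap<String, String> file : storeFiles) {
            // In the real system a Bloom filter / timestamp range check can skip
            // files here before paying for a disk seek.
            if (file.containsKey(row)) {
                return file.get(row);
            }
        }
        return null;                             // row not present anywhere
    }

    public static void main(String[] args) {
        ReadPathSketch store = new ReadPathSketch();
        TreeMap<String, String> older = new TreeMap<>(Map.of("row-1", "v1"));
        TreeMap<String, String> newer = new TreeMap<>(Map.of("row-2", "v2"));
        store.storeFiles.add(newer);             // newest file first
        store.storeFiles.add(older);
        store.memstore.put("row-3", "v3");
        System.out.println(store.get("row-1"));  // found only in the oldest file
    }
}
```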

Region Lookups

For the clients to be able to find the region server hosting a specific row key range HBase provides two special catalog tables called -ROOT- and .META..

The -ROOT- table is used to refer to all regions in the .META. table.
The design considers only one root region, i.e., the root region is never split, to guarantee a three-level, B+-tree-like lookup scheme:
the first level is a node stored in ZooKeeper that contains the location of the root table's region, in other words the name of the region server hosting that specific region.
The second level is the lookup of a matching meta region from the -ROOT- table,
and the third is the retrieval of the user table region from the .META. table.

See section 5.1, Tablet Location, in the Bigtable paper; the scheme is exactly the same.
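
A sketch of the three-level lookup as a client might perform it (illustrative pseudocode in Java; the real client also caches locations and handles region splits): ZooKeeper gives the server hosting -ROOT-, -ROOT- locates the matching .META. region, and .META. locates the user-table region.

```java
import java.util.TreeMap;

// Illustrative three-level lookup: ZooKeeper -> -ROOT- -> .META. -> user region.
// Each catalog table is modeled as a sorted map from region start key to server name.
public class RegionLookupSketch {
    String zookeeperRootLocation = "regionserver-A";            // level 1: where -ROOT- lives

    // level 2: -ROOT- maps .META. region start keys to their servers
    TreeMap<String, String> rootTable = new TreeMap<>();
    // level 3: .META. maps user-table region start keys to their servers
    TreeMap<String, String> metaTable = new TreeMap<>();

    String locateUserRegion(String rowKey) {
        // Step 1: ask ZooKeeper which server hosts the single root region.
        String rootServer = zookeeperRootLocation;

        // Step 2: in -ROOT-, find the .META. region covering this row key
        // (the entry with the greatest start key <= rowKey).
        String metaServer = rootTable.floorEntry(rowKey).getValue();

        // Step 3: in that .META. region, find the user-table region for the row.
        String userRegionServer = metaTable.floorEntry(rowKey).getValue();

        System.out.printf("root@%s -> meta@%s -> region@%s%n",
                rootServer, metaServer, userRegionServer);
        return userRegionServer;
    }

    public static void main(String[] args) {
        RegionLookupSketch lookup = new RegionLookupSketch();
        lookup.rootTable.put("", "regionserver-B");             // one .META. region from ""
        lookup.metaTable.put("", "regionserver-C");             // user region starting at ""
        lookup.metaTable.put("row-m", "regionserver-D");        // user region starting at "row-m"
        lookup.locateUserRegion("row-x");                       // -> regionserver-D
    }
}
```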

Reposted from: https://www.cnblogs.com/fxjwind/archive/2012/10/12/2721616.html
