HBase（2）

详述人员角色表的设计思路以及实现

思路:两个部分的信息分别保存到两张表中,因为hbase是列存储的表,一般存储非关系数据,就像记笔记一样,把关键点写上.

第一张表: 个人信息表

rowkey为编号, 列族1为个人信息(性别,名字等),列族2为其拥有的角色(包含优先级)

第二张表: 角色信息表

rowkey: 角色id,列族1 角色信息(主要是名称),列族2拥有的人员

详述电话案例分析的设计思路以及实现

背景: 10个用户,每个用户每年产生1000条通话记录

2.1 电话表

rowkey为手机号+"-"+(Long.MAX_VALUE - sdf.parse(date).getTime()),sdf为simpledateformat

cf族中,

dnum为对方手机号

type为类型 ,0主叫，1被叫

length：长度

date:时间

2.2insert数据

public void insert() throws Exception {List<Put> puts = new ArrayList<Put>();for (int i = 0; i < 10; i++) {String phoneNumber = getPhone("158");for (int j = 0; j < 1000; j++) {// 属性String dnum = getPhone("177");String length = String.valueOf(r.nextInt(100));String type = String.valueOf(r.nextInt(2));// yyyyMMddHHmmssString date = getDate("2018");// rowkey设计String rowkey = phoneNumber + "_" + (Long.MAX_VALUE - sdf.parse(date).getTime());Put put = new Put(rowkey.getBytes());put.add("cf".getBytes(), "dnum".getBytes(), dnum.getBytes());put.add("cf".getBytes(), "length".getBytes(), length.getBytes());put.add("cf".getBytes(), "type".getBytes(), type.getBytes());put.add("cf".getBytes(), "date".getBytes(), date.getBytes());puts.add(put);}table.put(puts);}}

2.3 查看数据

public void scan() throws Exception {String phoneNumber = "15822158090";// 小于等于这个时间的都包含String startRow = phoneNumber + "_" + (Long.MAX_VALUE - sdf.parse("20180331000000").getTime());// 大于这个时间的都包含String stopRow = phoneNumber + "_" + (Long.MAX_VALUE - sdf.parse("20180231000000").getTime());Scan scan = new Scan();scan.setStartRow(startRow.getBytes());scan.setStopRow(stopRow.getBytes());ResultScanner scanner = table.getScanner(scan);for (Result result : scanner) {System.out.print(Bytes.toString(CellUtil.cloneValue(result.getColumnLatestCell("cf".getBytes(), "dnum".getBytes()))));System.out.print("--" + Bytes.toString(CellUtil.cloneValue(result.getColumnLatestCell("cf".getBytes(), "type".getBytes()))));System.out.print("--" + Bytes.toString(CellUtil.cloneValue(result.getColumnLatestCell("cf".getBytes(), "date".getBytes()))));System.out.println("--" + Bytes.toString(CellUtil.cloneValue(result.getColumnLatestCell("cf".getBytes(), "length".getBytes()))));}}

protobuf做什么用的？详述如何使用？

Protocol Buffers 是一种轻便高效的结构化数据存储格式，可以用于结构化数据串行化，或者说序列化。它很适合做数据存储或 RPC 数据交换格式。可用于通讯协议、数据存储等领域的语言无关、平台无关、可扩展的序列化结构数据格式。目前提供了 C++、Java、Python 三种语言的 API。

在本课程中，是把数据当成一个protobuf对象存储到Hbase中,而不是向以前一样,把字段直接存储到Hbase中.目的是为了节省空间.根据课堂上的例子,大概可以压缩到原来的3分之1左右.

1安装protobuf-2.5.0

(1)解压

[root@node1 temp]# tar -zxf protobuf-2.5.0.tar.gz -C /usr/local/protobuf/

(2)安装开发工具

yum -y groupinstall "Development tools"

(3) 使用配置文件生成make文件和相关配置

./configure

(4) make和install

make && make install

2 在程序中使用protobuf

书写.proto文件便于生成phone的protobuf类

package com.sxt
message PhoneRecord{required string dnum=1;required string type=2;required int32 length=3;required string date=4;
}

详述hbase的优化思路

(1) 表设计

a:预先创建region分区,当数据写入HBase中时,会按照region分区情况,负载均衡也就是指定startkey和endkey,在初期提高吞吐量

b:rowkey设计,越小越好,依据实际业务,散列(取反或者哈希)

c:列族最多两个,否则服务器之间的io太多

d: in memory 通过HColoumDescriptor.setInMemory(true)

e: 最大版本,不易太多

HColumnDescriptor.setMaxVersions(int maxVersions)

f:time to live

HColumnDescriptor.setTimeToLive(int timeToLive)设置表中数据的存储生命期，过期数据将自动被删除,单位为秒

I：实际应用中，可以考虑必要时手动进行major compact，将同一个row key的修改进行合并形成一个大的StoreFile。同时，可以将StoreFile设置大些，减少split的发生。

hbase.hregion.majoucompaction 默认为24 小时、hbase.hregion.majorcompaction.jetter 默认值为0.2 防止region server 在同一时间进行major compaction）。

hbase.hregion.majorcompaction.jetter参数的作用是：对参数hbase.hregion.majoucompaction 规定的值起到浮动的作用，假如两个参数都为默认值24和0,2，那么major compact最终使用的数值为：19.2~28.8 这个范围。

最好手动合并，编写 major compaction

(2)写表

a：关闭自动flush（持久化到硬盘）

b：通过调用HTable.setWriteBufferSize(writeBufferSize)方法可以设置HTable客户端的写buffer大小，如果新设置的buffer小于当前写buffer中的数据时，buffer将会被flush到服务端。其中，writeBufferSize的单位是byte字节数

c：WAL

不重要到数据，关闭 WAL功能

d: 多用List

e:多线程并发写可以使用MR

(3)读表

a 设置scanner 一次从服务器抓取多条数据

HTable.setScannerCaching(int scannerCaching)

b 指定要查找的列族

get.addcolum(‘cf’.getBytes(),‘name’.getBytes())相当于不是查询* 而是查询有条件的数据

c 使用完resultScanner后关闭resultScanner

d 多线程并发读

e 加入缓存（自己单独做 redis）

f blockCache

读流程 memstore 》 blockCache（默认65536字节）》storefile

为什么用hbase+mapreduce整合？

快

（1）如果使用一个脚本把hbse中的数据，进行处理，效率是及其底下的，hbase中存放的是海量的数据

（2）如果使用多线程，效率是有些提升，但是，线程之间会占用资源，在hbase的海量数据面前，速度还是不够快

列出从hdfs读取数据写到hbase以及hbase读取数据写到hdfs的步骤及实现

1 从hdfs读取数据到hbase

(1) 主类

public class MainClass {private static String targetTable = "table1y";public static void main(String[] args) throws Exception {Configuration conf = HBaseConfiguration.create();conf.set("hbase.zookeeper.quorum", "node2,node3,node4");conf.set("mapreduce.framework.name", "local");Job job = Job.getInstance(conf);job.setJobName("hbase to mr to hbase");job.setJarByClass(MainClass.class);FileInputFormat.addInputPath(job, new Path("/mr/fof/input/fof.txt"));job.setMapperClass(MyMapper.class);job.setMapOutputKeyClass(Text.class);job.setMapOutputValueClass(IntWritable.class);job.setOutputKeyClass(NullWritable.class);job.setOutputValueClass(Put.class);TableMapReduceUtil.initTableReducerJob(targetTable, MyReducer.class,job, null, null, null, null, false);//      TableMapReduceUtil.initTableReducerJob(targetTable, MyReducer.class, job);//        job.setOutputFormatClass(cls);
//      job.setInputFormatClass(cls);boolean b = job.waitForCompletion(true);}
}

(2) mapper类

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {private Text outKey = new Text();private IntWritable outValue = new IntWritable(1);@Overrideprotected void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {String line = value.toString();String[] words = line.split(" ");for (String word : words) {outKey.set(word);context.write(outKey, outValue);}}
}

(3) reducer类

public class MyReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {@Overrideprotected void reduce(Text key, Iterable<IntWritable> values, Context context)throws IOException, InterruptedException {Put put = new Put(key.toString().getBytes());Iterator<IntWritable> itera = values.iterator();int sum = 0;int num = 0;while (itera.hasNext()) {IntWritable val = itera.next();num = val.get();sum += num;}put.add("cf".getBytes(), "age".getBytes(), String.valueOf(sum).getBytes());context.write(null, put);}}

2 hbase到hdfs

(1)主类

public class MainClass {private static String sourceTable = "table1";public static void main(String[] args) throws Exception {Configuration conf = HBaseConfiguration.create();
//      Configuration config = new Configuration(true);conf.set("hbase.zookeeper.quorum", "node2,node3,node4");conf.set("mapreduce.framework.name", "local");Job job = Job.getInstance(conf);job.setJarByClass(MainClass.class); // MR主入口类Scan scan = new Scan();scan.setCaching(500); // 默认值是1，对MR作业太小了，设置为500scan.setCacheBlocks(false); // 服务端缓存没有意义TableMapReduceUtil.initTableMapperJob(sourceTable, // 源表scan, // Scan对象MyMapper.class, // 继承自TableMapper的Mapper类Text.class, // mapper输出key的类型IntWritable.class, // mapper输出value的类型job); //  TableInputFormatjob.setReducerClass(MyReducer.class); // reducer类job.setNumReduceTasks(1); // 至少一个reducer//设置输出路径 FileOutputFormat.setOutputPath(job, new Path("/hbase2hdfs/output"));boolean b = job.waitForCompletion(true);if (!b) {throw new IOException("error with job!");}}}

(2)mapper类

public class MyMapper extends TableMapper<Text, IntWritable> {private Text outKey = new Text();private IntWritable outValue = new IntWritable(1);@Overrideprotected void map(ImmutableBytesWritable key, Result value, Context context)throws IOException, InterruptedException {Cell nameCell = value.getColumnLatestCell("cf".getBytes(), "name".getBytes());Cell ageCell = value.getColumnLatestCell("cf".getBytes(), "age".getBytes());Cell sexCell = value.getColumnLatestCell("cf".getBytes(), "sex".getBytes());String name = Bytes.toString(CellUtil.cloneValue(nameCell));String age = Bytes.toString(CellUtil.cloneValue(ageCell));String sex = Bytes.toString(CellUtil.cloneValue(sexCell));//从mapper的key中获取rowkey的值String rowKey = Bytes.toString(key.get());StringBuffer sb = new StringBuffer();sb.append(rowKey).append(":");sb.append(name).append("-").append(age).append("-").append(sex);outKey.set(sb.toString());context.write(outKey, outValue);}
}

(3) reducer类

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {@Overrideprotected void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {int i = 0;for (IntWritable val : values) {i += val.get();}context.write(key, new IntWritable(i));}}

BigBrother的大数据之旅Day 13 hbase(2)相关推荐

BigBrother的大数据之旅Day 12 Hbase(1)
HBASE (Hadoop Database)是个高可用,高性能, 面向列可伸缩.实时读写的分布式数据库 Hadoop HDFS作为其文件存储系统,利用Hadoop MapReduce来处理HBase ...
BigBrother的大数据之旅 Day 5 负载均衡
BigBrother的大数据之旅 Day 5 负载均衡学习目标: 掌握nginx集群 nginx的配置文件基本配置 Location(难点,重点) 一个配置属性:根据用户的请求地址,去服务器上匹配文 ...
BigBrother的大数据之旅 Day 1 Linux(1)
Today is a new day! 1 scala是大数据的主流编程语言 Scala是一种强大的JVM语言,混合了函数式和面向对象编程范式 2 大数据的本质为:分布式计算 3 大数据中常见的数据类 ...
BigBrother的大数据之旅Day 14 某项目流程
某平台日志项目流程架构分析 1 埋点程序把数据以url的方式提交给nginx服务器 2 nginx服务器把日志信息(文本文件)保存到本地硬盘 3 flume,安装中nginx上通过主动方式读取日志 ...
BigBrother的大数据之旅Day 15 sqoop
sqoop part one 1 数据迁移工具 2 通常用于hdfs和关系型数据库之间 3 实质是mapreduce 4 数据的导入导出是从hdfs的角度来说的,谁让sqoop是apache的东西呢 ...
BigBrother的大数据之旅Day 11 hive(2)
HIVE参数命名空间读写权限含义 hiveconf 可读写 hive-site.xml当中的各配置变量例:hive --hiveconf hive.cli.print.header=true ...
BigBrother的大数据之旅 Day 4 Linux(4)
学习目标: 掌握shell的基本语法变量本地变量:作用在当前shellscript中环境变量:作用在当前的shell(bash)及其子shell(bash) 局部变量特殊变量 :$#, $?, ...
BigBrother的大数据之旅Day 10 hive(1)
HIVE 1 一些概念 HIve存在的目的:非编程人员使用mapreduce进行分析 HIVE:是基于Hadoop的一个数据仓库工具,可以将结构化的数据文件映射为一张数据库表,并提供简单的sql查询功 ...
BigBrother的大数据之旅Day 17 redis(2)
Redis(2) 1 redis持久化持久化: 将数据从掉电易失的内存存放到能够永久存储的设备上 redis持久化两种 : RDB (redis DB)镜像文件,dump.rdb AOF(appe ...

BigBrother的大数据之旅Day 13 hbase(2)