Druid Hadoop-based Batch Ingestion

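Background

With the Kafka Indexing Service, the number of segments generated depends on the number of partitions in the topic. Suppose a topic has 12 partitions and the segment granularity is 1 hour: a single day can then produce up to 12 × 24 = 288 segments. The official documentation recommends a segment size of 500-700 MB, yet some of these segments are only a few tens of KB, which is far from reasonable.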
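As a quick sanity check of that upper bound, here is a minimal sketch in Python. It assumes exactly one segment per partition per segment-granularity interval, which is a simplification; `segments_per_day` is a hypothetical helper, not part of Druid.

```python
def segments_per_day(partitions: int, granularity_hours: int) -> int:
    """Upper bound on segments per day, assuming one segment per
    partition per segment-granularity interval (a simplification)."""
    if partitions <= 0 or granularity_hours <= 0:
        raise ValueError("partitions and granularity_hours must be positive")
    if 24 % granularity_hours != 0:
        raise ValueError("granularity_hours must divide 24 evenly")
    return partitions * (24 // granularity_hours)


# 12 partitions with hourly intervals, as in the scenario above
print(segments_per_day(12, 1))  # 288
```

With daily intervals the same topic would yield at most 12 segments per day, which is why coarser granularity or a later merge keeps segment counts manageable.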

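Merging

The merge example from the official documentation did not run successfully for me at the time. After some experimentation, the following spec worked: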

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "timestamp", "format": "auto"},
          "dimensionsSpec": {
            "dimensions": ["page", "language", "user", "unpatrolled", "newPage", "robot", "anonymous", "namespace", "continent", "country", "region", "city"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {"type": "count", "name": "count"},
        {"type": "doubleSum", "name": "added", "fieldName": "added"},
        {"type": "doubleSum", "name": "deleted", "fieldName": "deleted"},
        {"type": "doubleSum", "name": "delta", "fieldName": "delta"}
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2013-08-31/2013-09-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {"dataSource": "wikipedia", "intervals": ["2013-08-31/2013-09-01"]}
      }
    },
    "tuningConfig": {"type": "hadoop"}
  }
}
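Because a misplaced brace in a hand-written spec is easy to miss, it can help to build the re-indexing spec programmatically. The sketch below is one way to do that in Python. `build_reindex_spec` is a hypothetical helper; the field names mirror the spec above, and `tuningConfig` is placed next to `ioConfig` under `spec`, which is where Druid's Hadoop index task expects it.

```python
import json


def build_reindex_spec(datasource, interval, dimensions, metrics, segment_granularity="DAY"):
    """Build an index_hadoop spec that re-reads an existing datasource
    and rewrites its segments for the given interval."""
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "parser": {
                    "type": "hadoopyString",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": "timestamp", "format": "auto"},
                        "dimensionsSpec": {"dimensions": list(dimensions)},
                    },
                },
                "metricsSpec": list(metrics),
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": segment_granularity,
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "hadoop",
                # Read existing Druid segments instead of raw files.
                "inputSpec": {
                    "type": "dataSource",
                    "ingestionSpec": {"dataSource": datasource, "intervals": [interval]},
                },
            },
            "tuningConfig": {"type": "hadoop"},
        },
    }


spec = build_reindex_spec(
    "wikipedia",
    "2013-08-31/2013-09-01",
    ["page", "language", "user"],
    [{"type": "count", "name": "count"}],
)
print(json.dumps(spec, indent=2))
```

Generating the JSON with `json.dumps` guarantees balanced braces, so the only thing left to review is the content of each field.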

Notes

"inputSpec": {"type": "dataSource", "ingestionSpec": {"dataSource": "wikipedia", "intervals": ["2013-08-31/2013-09-01"]}}

Setting the Hadoop job working directory: the default is under /tmp, and if the temporary directory is short on free space the task will fail

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "test",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "timeStamp", "format": "auto"},
          "dimensionsSpec": {
            "dimensions": ["test_id"],
            "dimensionExclusions": ["timeStamp", "value"]
          }
        }
      },
      "metricsSpec": [{"type": "count", "name": "count"}],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": "HOUR",
        "intervals": ["2017-12-01/2017-12-31"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {"dataSource": "test", "intervals": ["2017-12-01/2017-12-31"]}
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "maxRowsInMemory": 500000,
      "partitionsSpec": {"type": "hashed", "targetPartitionSize": 5000000},
      "numBackgroundPersistThreads": 1,
      "jobProperties": {
        "mapreduce.job.local.dir": "/home/ant/druid/druid-0.11.0/var/mapred",
        "mapreduce.cluster.local.dir": "/home/ant/druid/druid-0.11.0/var/mapred",
        "mapreduce.map.memory.mb": 2300,
        "mapreduce.reduce.memory.mb": 2300
      }
    }
  }
}

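As before, the inputSpec describes the data to load: the existing segments of the test datasource within the given interval.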
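Before pointing the job at a different directory, it can be worth checking how much space is actually free there. The following is a minimal sketch using Python's standard `shutil.disk_usage`. `local_dir_job_properties` is a hypothetical helper, and the 10 GiB threshold is an arbitrary assumption, not a Druid or Hadoop requirement.

```python
import shutil

GIB = 1024 ** 3


def free_bytes(path):
    """Free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def local_dir_job_properties(work_dir, min_free_bytes=10 * GIB):
    """Return jobProperties that point Hadoop's local dirs at work_dir,
    refusing directories with less than min_free_bytes available."""
    available = free_bytes(work_dir)
    if available < min_free_bytes:
        raise RuntimeError(
            f"{work_dir} has only {available} bytes free, need {min_free_bytes}"
        )
    return {
        "mapreduce.job.local.dir": work_dir,
        "mapreduce.cluster.local.dir": work_dir,
    }
```

The returned dictionary can be merged into the `jobProperties` of the spec's `tuningConfig`, for example with `local_dir_job_properties("/home/ant/druid/druid-0.11.0/var/mapred")`.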

Submitting

URL

http://overlord:8090/druid/indexer/v1/task

HTTP method

POST

Parameters

Parameter name    Type      Value
Content-Type      header    application/json
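The request described above can be sketched with Python's standard library alone. `build_task_request` and `submit_task` are hypothetical helper names, and the 30-second timeout is an arbitrary assumption; the URL, method, and header are the ones listed above.

```python
import json
import urllib.request

# Host and port are taken from the URL above; adjust for your cluster.
OVERLORD_TASK_URL = "http://overlord:8090/druid/indexer/v1/task"


def build_task_request(spec, url=OVERLORD_TASK_URL):
    """Build a POST request carrying the task spec as a JSON body."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec, url=OVERLORD_TASK_URL, timeout=30):
    """Send the spec to the Overlord and return the decoded JSON response."""
    request = build_task_request(spec, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `submit_task(spec)` with a spec dictionary performs the request. On success the Overlord should answer with a small JSON object containing the ID of the new task.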

Other approaches

Druid itself also provides a merge task, but I still recommend doing the merge directly with Hadoop.

References

http://druid.io/docs/latest/ingestion/batch-ingestion.html

http://druid.io/docs/latest/ingestion/update-existing-data.html

Reposted from: https://my.oschina.net/u/3247419/blog/1588538
