Using Spark as the Kylin Build Engine (on HDP 2.6.5.0)
1. Environment
- Kylin version: 2.6.4
- HDP version: 2.6.5.0
- Spark version: 2.3.2
2. Configuration
1) Set HADOOP_CONF_DIR
export HADOOP_CONF_DIR=/usr/hdp/2.6.5.0-292/hadoop/conf
2) Set SPARK_HOME
# spark
export SPARK_HOME=/usr/hdp/2.6.5.0-292/spark2
export PATH=$SPARK_HOME/bin:$PATH
3) Set KYLIN_HOME
export KYLIN_HOME=/opt/kylin
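Taken together, the three steps above are just a few lines in the build user's profile. A minimal sketch, assuming the HDP 2.6.5.0-292 paths used throughout this post, with a quick print-out so a typo is caught before starting Kylin:

```shell
# Environment for the Kylin build user (paths from this HDP 2.6.5.0-292 install)
export HADOOP_CONF_DIR=/usr/hdp/2.6.5.0-292/hadoop/conf
export SPARK_HOME=/usr/hdp/2.6.5.0-292/spark2
export KYLIN_HOME=/opt/kylin
export PATH=$SPARK_HOME/bin:$PATH

# Print each variable so an empty or misspelled value is obvious
for v in HADOOP_CONF_DIR SPARK_HOME KYLIN_HOME; do
  printf '%s=%s\n' "$v" "$(printenv "$v")"
done
```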
4) Configure Kylin (recommended)
# dynamic Spark resource allocation
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
# Spark settings
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=10
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.yarn.archive=hdfs://192.168.2.101:8020/kylin/spark/spark-libs.jar
## uncomment for HDP
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=2.6.5.0-292
kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.5.0-292
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=2.6.5.0-292
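One prerequisite worth checking: spark.dynamicAllocation.enabled together with spark.shuffle.service.enabled requires the Spark external shuffle service to be registered as a YARN auxiliary service on every NodeManager. HDP normally ships this out of the box; the relevant yarn-site.xml entries look roughly like the fragment below (shown for reference, verify against your own cluster rather than copying blindly):

```xml
<!-- yarn-site.xml on each NodeManager; typical HDP defaults -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark2_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark2_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```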
5) Download the latest Spark release (2.4.4)
# download
cd $KYLIN_HOME
wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
# extract
tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz -C .
mv spark-2.4.4-bin-hadoop2.7 spark
# package the jars into an assembly
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
# upload to HDFS
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/
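Before kicking off a build it is worth sanity-checking both the archive and the upload. A sketch, assuming a working JDK and HDFS client on the node:

```shell
# List a few entries from the assembly: jar cv0f produces an uncompressed
# archive, so the individual Spark jars should appear at its top level
jar tf $KYLIN_HOME/spark-libs.jar | head -5

# Confirm the copy on HDFS exists and has a plausible size (a few hundred MB)
hadoop fs -ls /kylin/spark/spark-libs.jar
```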
3. Usage
1) Switch the build engine (in the cube designer's "Advanced Setting" page, set "Cube Engine" to "Spark")
2) Tune per cube as needed (optional)
The sample cube has two memory-hungry measures, COUNT DISTINCT and TOPN(100). When the source data is small, their size estimates are inaccurate: the estimated size comes out much larger than the real size, so far more RDD partitions get split off, which slows down the build. 500 is a more reasonable value in that case.
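The knob the paragraph above alludes to is, per the Kylin documentation, kylin.engine.spark.rdd-partition-cut-mb (default 10 MB per partition). The suggested value can be set as a cube-level configuration override rather than globally:

```properties
# Cube designer -> Advanced Setting -> Configuration Overwrites:
# cut RDD partitions at ~500 MB of estimated cuboid data instead of 10 MB
kylin.engine.spark.rdd-partition-cut-mb=500
```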
Pitfall 1:
Caused by: java.lang.RuntimeException: Could not create interface org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory Is the hadoop compatibility jar on the classpath?
    at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:73)
    at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:31)
    at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
    ... 15 more
Caused by: java.util.NoSuchElementException
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:365)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:59)
    ... 17 more
Solution:
Copy hbase-hadoop2-compat-*.jar and hbase-hadoop-compat-*.jar into the $KYLIN_HOME/spark/jars directory (both jars can be found in HBase's lib directory). If you have already built the Spark assembly jar and uploaded it to HDFS, you need to repackage and re-upload it. After that, resume the failed cube job; it should now succeed. The related JIRA issue is KYLIN-3607, which has since been fixed.
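The fix can be sketched as the commands below. HBASE_LIB is an assumed location based on this post's HDP layout; adjust it to wherever your HBase lib directory actually lives:

```shell
# Assumed HBase location on this HDP release; adjust as needed
HBASE_LIB=/usr/hdp/2.6.5.0-292/hbase/lib

# Copy the two compatibility jars into the Spark jars directory Kylin uses
cp "$HBASE_LIB"/hbase-hadoop2-compat-*.jar "$KYLIN_HOME/spark/jars/"
cp "$HBASE_LIB"/hbase-hadoop-compat-*.jar  "$KYLIN_HOME/spark/jars/"

# Rebuild the assembly and overwrite the copy on HDFS (-f replaces it)
cd "$KYLIN_HOME"
jar cv0f spark-libs.jar -C "$KYLIN_HOME/spark/jars/" .
hadoop fs -put -f spark-libs.jar /kylin/spark/
```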
Pitfall 2:
19/10/14 16:48:25 INFO Client:
     client token: N/A
     diagnostics: User class threw exception: java.lang.RuntimeException: error execute org.apache.kylin.storage.hbase.steps.SparkCubeHFile. Root cause: Job aborted.
    at org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:42)
    at org.apache.kylin.common.util.SparkEntry.main(SparkEntry.java:44)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
    at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:831)
    at org.apache.kylin.storage.hbase.steps.SparkCubeHFile.execute(SparkCubeHFile.java:238)
    at org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:37)
    ... 6 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, server3.tuzhanai.com, executor 39): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Counter
    at org.apache.hadoop.metrics2.lib.MutableHistogram.<init>(MutableHistogram.java:42)
    at org.apache.hadoop.metrics2.lib.MutableRangeHistogram.<init>(MutableRangeHistogram.java:41)
    at org.apache.hadoop.metrics2.lib.MutableTimeHistogram.<init>(MutableTimeHistogram.java:42)
    at org.apache.hadoop.metrics2.lib.MutableTimeHistogram.<init>(MutableTimeHistogram.java:38)
    at org.apache.hadoop.metrics2.lib.DynamicMetricsRegistry.newTimeHistogram(DynamicMetricsRegistry.java:262)
    at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:49)
    at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:36)
    at org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactoryImpl.createIO(MetricsRegionServerSourceFactoryImpl.java:89)
    at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:32)
    at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.getNewWriter(HFileOutputFormat2.java:247)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:194)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
    at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
    ... 8 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.util.Counter
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 26 more
Solution:
The root cause is that both the Spark 2.3.0 shipped with HDP and the Spark 2.3.2 bundled with Kylin carry this bug. Download Spark 2.4.4, package its jars directory into spark-libs.jar, upload it to the /kylin/spark/ directory on HDFS (as described in section 2 above), and point kylin.engine.spark-conf.spark.yarn.archive at it in the configuration file.