1. Environment

  • Kylin version: 2.6.4
  • HDP version: 2.6.5.0
  • Spark version: 2.3.2

2. Configuration

1) Configure HADOOP_CONF_DIR

export HADOOP_CONF_DIR=/usr/hdp/2.6.5.0-292/hadoop/conf

2) Configure SPARK_HOME

# spark
export SPARK_HOME=/usr/hdp/2.6.5.0-292/spark2
export PATH=$SPARK_HOME/bin:$PATH

3) Configure KYLIN_HOME

export KYLIN_HOME=/opt/kylin
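
These exports must be visible to the shell that starts Kylin; a common choice is the Kylin user's ~/.bashrc (that location is an assumption, adjust to your setup). Kylin ships an environment check script, which is a quick way to confirm the three variables above:

# reload the profile, then let Kylin verify HADOOP_CONF_DIR, SPARK_HOME, etc.
source ~/.bashrc
$KYLIN_HOME/bin/check-env.sh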

4) Configure Kylin (recommended settings)

# dynamic allocation of Spark resources
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300

# Spark job settings
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=10
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.yarn.archive=hdfs://192.168.2.101:8020/kylin/spark/spark-libs.jar

# HDP only: pass the stack version to the driver, AM, and executors
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=2.6.5.0-292
kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.5.0-292
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=2.6.5.0-292
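
These properties belong in $KYLIN_HOME/conf/kylin.properties and only take effect after a restart:

# restart Kylin to pick up the new spark-conf properties
$KYLIN_HOME/bin/kylin.sh stop
$KYLIN_HOME/bin/kylin.sh start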

5) Download the latest Spark (2.4.4)

# download
cd $KYLIN_HOME
wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

# extract
tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz -C .
mv spark-2.4.4-bin-hadoop2.7 spark

# package the jars into a single archive
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .

# upload to HDFS
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/
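
A quick check that the upload succeeded; this HDFS path must match the kylin.engine.spark-conf.spark.yarn.archive value set in step 4:

hadoop fs -ls /kylin/spark/spark-libs.jar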

3. Usage

1) Switch the build engine

In the cube design wizard, open the "Advanced Setting" page and change "Cube Engine" from "MapReduce" to "Spark", then save the cube. The next build will run on Spark.
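
If Spark should be the default engine for newly created cubes, Kylin 2.x also exposes a default-engine property in kylin.properties (2 is MapReduce, 4 is Spark's engine ID in Kylin 2.x):

# use Spark as the default cube engine for new cubes
kylin.engine.default=4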

2) Tune as needed (optional)

The sample cube has two memory-hungry measures, "COUNT DISTINCT" and "TOPN(100)". When the source data is small, their size estimates are inaccurate: the estimated size comes out much larger than the real size, which causes far more RDD partitions to be split out and slows the build down. 500 is a more reasonable number for this case, as shown below.
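
The "500" refers to Kylin's RDD partition cut threshold, kylin.engine.spark.rdd-partition-cut-mb (default 10). Following the Kylin "Build Cube with Spark" guide, raise it for such cubes as a cube-level setting on the cube's "Configuration Overrides" page:

# cube-level override: put ~500 MB of estimated data into each RDD partition
kylin.engine.spark.rdd-partition-cut-mb=500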

Pitfall 1:

Caused by: java.lang.RuntimeException: Could not create  interface org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory Is the hadoop compatibility jar on the classpath?
    at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:73)
    at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:31)
    at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
    ... 15 more
Caused by: java.util.NoSuchElementException
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:365)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:59)
    ... 17 more

Solution:

Copy hbase-hadoop2-compat-*.jar and hbase-hadoop-compat-*.jar into the $KYLIN_HOME/spark/jars directory (both jars can be found in HBase's lib directory). If you have already built the Spark assembly jar and uploaded it to HDFS, you need to repackage and re-upload it. After that, resume the failed cube job and it should succeed. The related JIRA issue is KYLIN-3607, which has since been fixed.
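
A sketch of that fix, assuming the standard HDP 2.6.5 HBase lib location (the path is an assumption; adjust to your install):

# copy the HBase compatibility jars next to the other Spark jars
cp /usr/hdp/2.6.5.0-292/hbase/lib/hbase-hadoop-compat-*.jar $KYLIN_HOME/spark/jars/
cp /usr/hdp/2.6.5.0-292/hbase/lib/hbase-hadoop2-compat-*.jar $KYLIN_HOME/spark/jars/

# repackage the assembly and overwrite the old copy on HDFS
cd $KYLIN_HOME
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
hadoop fs -put -f spark-libs.jar /kylin/spark/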

Pitfall 2:

19/10/14 16:48:25 INFO Client: client token: N/A
diagnostics: User class threw exception: java.lang.RuntimeException: error execute org.apache.kylin.storage.hbase.steps.SparkCubeHFile. Root cause: Job aborted.
    at org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:42)
    at org.apache.kylin.common.util.SparkEntry.main(SparkEntry.java:44)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
    at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:831)
    at org.apache.kylin.storage.hbase.steps.SparkCubeHFile.execute(SparkCubeHFile.java:238)
    at org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:37)
    ... 6 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, server3.tuzhanai.com, executor 39): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Counter
    at org.apache.hadoop.metrics2.lib.MutableHistogram.<init>(MutableHistogram.java:42)
    at org.apache.hadoop.metrics2.lib.MutableRangeHistogram.<init>(MutableRangeHistogram.java:41)
    at org.apache.hadoop.metrics2.lib.MutableTimeHistogram.<init>(MutableTimeHistogram.java:42)
    at org.apache.hadoop.metrics2.lib.MutableTimeHistogram.<init>(MutableTimeHistogram.java:38)
    at org.apache.hadoop.metrics2.lib.DynamicMetricsRegistry.newTimeHistogram(DynamicMetricsRegistry.java:262)
    at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:49)
    at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:36)
    at org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactoryImpl.createIO(MetricsRegionServerSourceFactoryImpl.java:89)
    at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:32)
    at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.getNewWriter(HFileOutputFormat2.java:247)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:194)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
    at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
    ... 8 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.util.Counter
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 26 more

Solution:

The root cause is that both the Spark 2.3.0 bundled with HDP and the Spark 2.3.2 bundled with Kylin hit this bug. Download Spark 2.4.4 (see step 5 in section 2), package its jars into spark-libs.jar, upload it to the /kylin/spark/ directory on HDFS, and point the kylin.engine.spark-conf.spark.yarn.archive property at it.
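
A final sanity check before resuming the failed job: the archive must exist on HDFS and kylin.properties must point at it:

hadoop fs -ls /kylin/spark/spark-libs.jar
grep spark.yarn.archive $KYLIN_HOME/conf/kylin.properties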
