Objective:
Configure Kettle to submit jobs to a Spark cluster.

Environment:
Spark History Server:
172.16.1.126

Spark Gateway:
172.16.1.124
172.16.1.125
172.16.1.126
172.16.1.127

PDI:
172.16.1.105

Hadoop version: CDH 6.3.1
Spark version: 2.4.0-cdh6.3.1
PDI version: 8.3

For connecting Kettle to CDH, see "https://wxy0327.blog.csdn.net/article/details/106406702". Configuration steps:
1. Copy the Spark library files from CDH to the PDI host

-- Run on 172.16.1.126
cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark
scp -r * 172.16.1.105:/root/spark/
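
A quick sanity check after the copy can catch an incomplete transfer before Kettle is configured. The sketch below is only an illustration and not part of the original procedure: the function name check_spark_home is hypothetical, and the list of expected entries is inferred from the paths used later in this article. Note also that scp with several sources normally expects the target directory (/root/spark/ here) to exist already on 172.16.1.105.

```shell
# Hypothetical helper: verify that a copied Spark directory contains
# the entries this article relies on later.
check_spark_home() {
  local home="$1"
  local missing=0
  local path
  for path in bin/spark-submit conf jars examples/jars; do
    if [ ! -e "$home/$path" ]; then
      echo "missing: $home/$path" >&2
      missing=1
    fi
  done
  # 0 when everything was found, 1 otherwise
  return "$missing"
}
```

On the PDI host it could be called as `check_spark_home /root/spark && echo "Spark home looks complete"`.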

2. Configure Spark for Kettle
All of the following operations are performed on 172.16.1.105 as the root user.
(1) Back up the original configuration files

cd /root/spark/conf
cp spark-defaults.conf spark-defaults.conf.bak
cp spark-env.sh spark-env.sh.bak

(2) Edit the spark-defaults.conf file

vim /root/spark/conf/spark-defaults.conf

Contents:

spark.yarn.archive=hdfs://manager:8020/user/spark/lib/spark_jars.zip
spark.hadoop.yarn.timeline-service.enabled=false
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://manager:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://node2:18088
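
The spark.yarn.archive and spark.eventLog.dir settings point at HDFS locations that must already exist; the article does not show how they were created. As a hedged illustration, the sketch below reads a single property back from the file so the value can be passed to other commands. The function name get_spark_conf is hypothetical, and the parsing is a simplification that handles only "key=value" and "key value" lines.

```shell
# Hypothetical helper: print the value of one property from a
# spark-defaults.conf style file. Exits non-zero if the key is absent.
get_spark_conf() {
  # $1: path to the properties file, $2: property name
  awk -v key="$2" '
    /^[ \t]*#/ { next }                     # skip comment lines
    /^[ \t]*$/ { next }                     # skip blank lines
    {
      line = $0
      sub(/^[ \t]+/, "", line)              # trim leading whitespace
      if (match(line, /[= \t]+/)) {         # first run of separators
        name  = substr(line, 1, RSTART - 1)
        value = substr(line, RSTART + RLENGTH)
        sub(/[ \t]+$/, "", value)           # trim trailing whitespace
        if (name == key) { print value; found = 1; exit }
      }
    }
    END { exit !found }
  ' "$1"
}
```

On a host that has an HDFS client, `hdfs dfs -ls "$(get_spark_conf /root/spark/conf/spark-defaults.conf spark.yarn.archive)"` would then confirm that the archive is in place.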

(3) Edit the spark-env.sh file

vim /root/spark/conf/spark-env.sh

Contents:

#!/usr/bin/env bash
HADOOP_CONF_DIR=/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61
SPARK_HOME=/root/spark
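
When submitting to YARN, spark-submit locates the cluster through the client configuration files in HADOOP_CONF_DIR, so it is worth confirming that spark-env.sh really sets the variable and that the directory holds a yarn-site.xml. The following is a minimal sketch under those assumptions; the function name check_spark_env is hypothetical.

```shell
# Hypothetical helper: source a spark-env.sh in a subshell and verify
# that it defines a usable HADOOP_CONF_DIR.
check_spark_env() {
  local env_file="$1"
  (
    # Start from a clean slate so inherited values do not mask a broken file
    unset HADOOP_CONF_DIR SPARK_HOME
    . "$env_file" || exit 1
    if [ -z "${HADOOP_CONF_DIR:-}" ]; then
      echo "HADOOP_CONF_DIR is not set by $env_file" >&2
      exit 1
    fi
    if [ ! -f "$HADOOP_CONF_DIR/yarn-site.xml" ]; then
      echo "no yarn-site.xml in $HADOOP_CONF_DIR" >&2
      exit 1
    fi
    echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
    echo "SPARK_HOME=${SPARK_HOME:-}"
  )
}
```

For example, `check_spark_env /root/spark/conf/spark-env.sh` prints both variables when the file is well formed and fails otherwise.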

(4) Edit the core-site.xml file

vim /root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/core-site.xml

Uncomment the following block:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>

Submitting a Spark job:
1. Modify the Spark example that ships with PDI

cp /root/data-integration/samples/jobs/Spark\ Submit/Spark\ submit.kjb /root/big_data/

Open /root/big_data/Spark\ submit.kjb in Kettle, as shown in Figure 1.

Figure 1

Edit the Spark Submit Sample job entry, as shown in Figure 2.

Figure 2
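
The figure is not reproduced here, but the job log indicates that the entry submits /root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar with the deprecated yarn-cluster master. A roughly equivalent command line can be sketched as below. The class name org.apache.spark.examples.SparkPi and the argument 10 are assumptions rather than values taken from the figure, the function name submit_spark_pi is hypothetical, and the sketch uses --master yarn with --deploy-mode cluster, the form recommended by the deprecation warning.

```shell
# Hypothetical helper: run (or, with DRY_RUN set, only print) a
# spark-submit command for the bundled SparkPi example.
submit_spark_pi() {
  local spark_home="${1:-/root/spark}"
  local slices="${2:-10}"          # assumed argument for SparkPi
  set -- "$spark_home/bin/spark-submit" \
    --master yarn --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    "$spark_home/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar" \
    "$slices"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"                      # show the command without running it
  else
    "$@"
  fi
}
```

Running `DRY_RUN=1 submit_spark_pi` prints the command for inspection; without DRY_RUN the job is actually submitted, which is a convenient way to test the Spark client setup outside Kettle.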

2. Save and run the job

The log is as follows:

2020/06/10 10:12:19 - Spoon - Starting job...
2020/06/10 10:12:19 - Spark submit - Start of job execution
2020/06/10 10:12:19 - Spark submit - Starting entry [Spark PI]
2020/06/10 10:12:19 - Spark PI - Submitting Spark Script
2020/06/10 10:12:20 - Spark PI - Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO client.RMProxy: Connecting to ResourceManager at manager/172.16.1.124:8032
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO conf.Configuration: resource-types.xml not found
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up container launch context for our AM
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up the launch environment for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Preparing resources for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://manager:8020/user/spark/lib/spark_jars.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/spark-examples_2.11-2.4.0-cdh6.3.1.jar
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/tmp/spark-281973dd-8233-4f12-b416-36d28b74159c/__spark_conf__2533521329006469303.zip -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/__spark_conf__.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO conf.HiveConf: Found configuration file file:/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/hive-site.xml
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO security.YARNHadoopDelegationTokenManager: Attempting to load user's ticket cache.
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Submitting application application_1591323999364_0060 to ResourceManager
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO impl.YarnClientImpl: Submitted application application_1591323999364_0060
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client: Application report for application_1591323999364_0060 (state: ACCEPTED)
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client:
2020/06/10 10:12:23 - Spark PI -      client token: N/A
2020/06/10 10:12:23 - Spark PI -      diagnostics: AM container is launched, waiting for AM container to Register with RM
2020/06/10 10:12:23 - Spark PI -      ApplicationMaster host: N/A
2020/06/10 10:12:23 - Spark PI -      ApplicationMaster RPC port: -1
2020/06/10 10:12:23 - Spark PI -      queue: root.users.root
2020/06/10 10:12:23 - Spark PI -      start time: 1591755142818
2020/06/10 10:12:23 - Spark PI -      final status: UNDEFINED
2020/06/10 10:12:23 - Spark PI -      tracking URL: http://manager:8088/proxy/application_1591323999364_0060/
2020/06/10 10:12:24 - Spark submit - Starting entry [Success]
2020/06/10 10:12:24 - Spark submit - Finished job entry [Success] (result=[true])
2020/06/10 10:12:24 - Spark submit - Finished job entry [Spark PI] (result=[true])
2020/06/10 10:12:24 - Spark submit - Job execution finished
2020/06/10 10:12:24 - Spoon - Job has ended.
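
Note that the job entry finishes about one second after the application is reported as ACCEPTED with final status UNDEFINED, so result=[true] here appears to mean only that the submission succeeded, not that the Spark application completed. To follow up on the outcome, the application ID can be pulled out of the log and passed to the YARN command-line client. The sketch below is a minimal illustration; the function name extract_app_id and the log file path in the usage note are hypothetical.

```shell
# Hypothetical helper: print the first YARN application ID found in a log file.
extract_app_id() {
  grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' "$1" | head -n 1
}
```

With the Kettle log saved to a file, `yarn application -status "$(extract_app_id /root/big_data/spark_submit.log)"` would report the current state and final status on a host where the yarn client is configured.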

The Spark History Server Web UI is shown in Figure 3.

Figure 3

Click "application_1591323999364_0061" to open its details, as shown in Figure 4.

Figure 4

References:

  • https://help.pentaho.com/Documentation/8.3/Products/Spark_Submit
  • https://blog.csdn.net/wzy0623/article/details/51097471
