Introduction
For work reasons I will soon be embracing Spark. I studied it some time ago; now I plan to read through selected parts of the official documentation for the latest release, Spark 1.3, in detail: first as review, second to catch up on the latest developments, and third to prepare material for training my team.

Reposting is welcome; please credit the source:
http://blog.csdn.net/u010967382/article/details/45062407

Original URL: http://spark.apache.org/docs/latest/running-on-yarn.html
This document focuses on how to run Spark programs on YARN.
First, a quick review of the YARN architecture:
I will not describe the YARN architecture in detail here; please look it up yourself if needed. The key part to understand is the gray area in the diagram: the Containers, the unit of resource allocation in YARN. A program, such as the MR program in the figure, consists of one MR Application Master responsible for overall scheduling and multiple Map/Reduce Tasks that execute the actual work.
Spark has supported YARN since version 0.6.0 and has kept improving that support in subsequent releases.
Running Spark on YARN requires a Spark distribution built with YARN support; you can download one from the official site (http://spark.apache.org/downloads.html) or build it yourself (http://spark.apache.org/docs/latest/building-spark.html).
As for configuring Spark on YARN, the basic settings are the same as in the other deploy modes, but there are some parameters designed specifically for YARN, listed in the table below; I leave them in the original English for reference:
Property Name | Default | Meaning
spark.yarn.am.memory | 512m | Amount of memory to use for the YARN Application Master in client mode, in the same format as JVM memory strings (e.g. 512m, 2g). In cluster mode, use spark.driver.memory instead.
spark.driver.cores | 1 | Number of cores used by the driver in YARN cluster mode. Since the driver is run in the same JVM as the YARN Application Master in cluster mode, this also controls the cores used by the YARN AM. In client mode, use spark.yarn.am.cores to control the number of cores used by the YARN AM instead.
spark.yarn.am.cores | 1 | Number of cores to use for the YARN Application Master in client mode. In cluster mode, use spark.driver.cores instead.
spark.yarn.am.waitTime | 100000 | In yarn-cluster mode, time in milliseconds for the application master to wait for the SparkContext to be initialized. In yarn-client mode, time for the application master to wait for the driver to connect to it.
spark.yarn.submit.file.replication | The default HDFS replication (usually 3) | HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
spark.yarn.preserve.staging.files | false | Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
spark.yarn.scheduler.heartbeat.interval-ms | 5000 | The interval in ms in which the Spark application master heartbeats into the YARN ResourceManager.
spark.yarn.max.executor.failures | numExecutors * 2, with minimum of 3 | The maximum number of executor failures before failing the application.
spark.yarn.historyServer.address | (none) | The address of the Spark history server (e.g. host.com:18080). The address should not contain a scheme (http://). Defaults to not being set, since the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes, to link the application from the ResourceManager UI to the Spark history server UI.
spark.yarn.dist.archives | (none) | Comma-separated list of archives to be extracted into the working directory of each executor.
spark.yarn.dist.files | (none) | Comma-separated list of files to be placed in the working directory of each executor.
spark.executor.instances | 2 | The number of executors. Note that this property is incompatible with spark.dynamicAllocation.enabled.
spark.yarn.executor.memoryOverhead | executorMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
spark.yarn.driver.memoryOverhead | driverMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%).
spark.yarn.am.memoryOverhead | AM memory * 0.07, with minimum of 384 | Same as spark.yarn.driver.memoryOverhead, but for the Application Master in client mode.
spark.yarn.queue | default | The name of the YARN queue to which the application is submitted.
spark.yarn.jar | (none) | The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to "hdfs:///some/path".
spark.yarn.access.namenodes | (none) | A list of secure HDFS namenodes your Spark application is going to access. For example, spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032. The Spark application must have access to the namenodes listed, and Kerberos must be properly configured to be able to access them (either in the same realm or in a trusted realm). Spark acquires security tokens for each of the namenodes so that the Spark application can access those remote HDFS clusters.
spark.yarn.appMasterEnv.[EnvironmentVariableName] | (none) | Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN. The user can specify multiple of these to set multiple environment variables. In yarn-cluster mode this controls the environment of the Spark driver, and in yarn-client mode it only controls the environment of the executor launcher.
spark.yarn.containerLauncherMaxThreads | 25 | The maximum number of threads to use in the application master for launching executor containers.
spark.yarn.am.extraJavaOptions | (none) | A string of extra JVM options to pass to the YARN Application Master in client mode. In cluster mode, use spark.driver.extraJavaOptions instead.
spark.yarn.maxAppAttempts | yarn.resourcemanager.am.max-attempts in YARN | The maximum number of attempts that will be made to submit the application. It should be no larger than the global number of max attempts in the YARN configuration.
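The memoryOverhead defaults in the table can be worked out by hand. A minimal sketch of the formula max(executorMemory * 0.07, 384) MB, using an illustrative 2g executor:

```shell
# Reproduce the default spark.yarn.executor.memoryOverhead formula and the
# resulting container request. The 0.07 factor and the 384 MB floor are the
# documented defaults; the executor size below is just an example.
executor_memory_mb=2048            # corresponds to --executor-memory 2g
overhead_mb=$(( executor_memory_mb * 7 / 100 ))
if [ "$overhead_mb" -lt 384 ]; then
  overhead_mb=384                  # the documented minimum kicks in here
fi
echo "total container memory: $(( executor_memory_mb + overhead_mb )) MB"
```

For a 2g executor, 7% is about 143 MB, so the 384 MB floor applies and YARN is asked for roughly 2432 MB per container; this is why containers are noticeably larger than --executor-memory alone suggests.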
To run a Spark program on YARN, first make sure the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable correctly points at the Hadoop configuration directory (e.g. export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop).
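As a concrete sketch (the install path /opt/hadoop is a hypothetical example; adjust to your own layout):

```shell
# Point Spark at the Hadoop/YARN client-side configuration directory,
# i.e. where core-site.xml and yarn-site.xml live.
export HADOOP_HOME=/opt/hadoop                     # hypothetical install path
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
echo "$HADOOP_CONF_DIR"
```

These exports are typically placed in spark-env.sh or the shell profile so every spark-submit invocation inherits them.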
Next, decide which yarn-xxx mode to use.
Depending on where the driver process runs, there are two modes for launching Spark on YARN:
  • yarn-client mode: the driver runs in the client process, and the Application Master (part of the YARN architecture: the per-application scheduler for programs running on YARN) is used only to request resources from YARN (the client sees the program's printed output on its console).

  • yarn-cluster mode: the Spark driver runs inside the Application Master process (the client console does not show the program's printed output).
Note: unlike the Spark standalone and Mesos modes, where the master address is specified with the master parameter, in YARN mode the ResourceManager's address is read from the Hadoop configuration files, so the --master argument to spark-submit only needs to be yarn-client or yarn-cluster.
An application can be launched on YARN in yarn-cluster mode with the following command format:
./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
For example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
The script above starts a yarn-cluster mode application; SparkPi runs as a child thread of the Application Master. The client (outside the cluster) periodically polls the Application Master for the latest application status and prints the status it receives to the console. The client exits once the application finishes.
In yarn-cluster mode, the driver and the client run on different machines, so SparkContext.addJar in the driver cannot see files on the client machine. To make files on the client machine available to SparkContext.addJar, pass them with the --jars option of spark-submit, for example:
$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2
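Rather than repeating options on every spark-submit invocation, the YARN-related properties from the table above can also be set in conf/spark-defaults.conf. A sketch with purely illustrative values (the queue name and the HDFS path are placeholders, not values from this document):

```
spark.yarn.queue                      thequeue
spark.executor.instances              3
spark.yarn.executor.memoryOverhead    512
spark.yarn.jar                        hdfs:///user/spark/share/spark-assembly.jar
```

Command-line flags still override anything set here, so the file is best used for cluster-wide defaults such as the queue and the cached Spark jar location.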
