A Deeper Understanding of Spark Internals

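This post summarizes the slides from the talk Aaron Davidson gave at Spark Summit 2014. The talk walks through optimizing a Spark program that counts the number of distinct names for each first letter.

I have organized the slides briefly and added some of my own understanding.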

Goal: Understanding how Spark runs, focus on performance

• Major core components:

– Execution Model

– The Shuffle

– Caching

Why understand internals?

Goal: Find number of distinct names per “first letter”

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()

How the elements of the RDD are transformed
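To make the transformation concrete, here is a minimal sketch that runs the same chain on a small in-memory sample in local mode. The sample names, the object name `NamesTrace`, and the `local[2]` master are assumptions made for illustration; the slides read from `hdfs:/names` instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NamesTrace {
  // Hypothetical sample input standing in for the contents of hdfs:/names.
  val sampleNames: Seq[String] = Seq("Ahir", "Pat", "Andy", "Pat", "Ahir")

  def distinctPerFirstLetter(sc: SparkContext, names: Seq[String]): Map[Char, Int] =
    sc.parallelize(names)
      // (A,Ahir) (P,Pat) (A,Andy) (P,Pat) (A,Ahir)
      .map(name => (name.charAt(0), name))
      // (A,[Ahir, Andy, Ahir]) (P,[Pat, Pat]); order within a group is not guaranteed
      .groupByKey()
      // (A,2) (P,1): duplicates disappear only here, after the shuffle
      .mapValues(group => group.toSet.size)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("names-trace")
    val sc = new SparkContext(conf)
    try println(distinctPerFirstLetter(sc, sampleNames))
    finally sc.stop()
  }
}
```

Note that every name, including duplicates, travels through `groupByKey()` before `toSet` removes the repeats.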

Spark Execution Model

1. Create DAG of RDDs to represent computation

2. Create logical execution plan for DAG

3. Schedule and execute individual tasks
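One way to look at the first two steps is to print an RDD's lineage with `toDebugString`. The sketch below builds the same chain on hypothetical sample data without running any action; the object name `LineageDemo` and the local master are assumptions. In the printed text, the shuffle introduced by `groupByKey()` typically shows up as a `ShuffledRDD` entry, and the indentation change around it marks the stage boundary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  // Builds the chain of transformations lazily and returns Spark's
  // textual description of the resulting RDD lineage. No job is executed.
  def lineage(sc: SparkContext): String =
    sc.parallelize(Seq("Ahir", "Pat", "Andy"), numSlices = 2)
      .map(name => (name.charAt(0), name))
      .groupByKey()
      .mapValues(group => group.toSet.size)
      .toDebugString

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-demo")
    val sc = new SparkContext(conf)
    try println(lineage(sc))
    finally sc.stop()
  }
}
```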

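One book gives an explanation of why a shuffle is needed that I find convincing: a shuffle is needed because records that share some common characteristic must eventually be brought together on a single compute node to be processed.

I also used to have a misconception. Having read in articles that Spark computes in memory, I assumed that Spark really did all of its computation purely in memory. I now realize how naive that was; the road ahead is long and hard. From reading related books together with these slides, I learned that since Spark 0.8, shuffle write persists data to disk. In other words, the shuffle still reads from and writes to disk and remains a performance bottleneck. As a result, I am now a little unsure what the difference between Spark and Hadoop is.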
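The idea that records with a common key must meet on one node can be sketched with Spark's `HashPartitioner`, which is the partitioner `groupByKey()` falls back to when none is already set. The object name `ShuffleRouting` and the sample keys are assumptions for illustration; the sketch only computes which reduce-side partition each key would be routed to and does not run a shuffle.

```scala
import org.apache.spark.HashPartitioner

object ShuffleRouting {
  // Groups keys by the partition index a HashPartitioner assigns to them.
  // Every occurrence of a key gets the same index, so all of its records
  // end up in the same reduce-side partition.
  def route(keys: Seq[Char], numPartitions: Int): Map[Int, Seq[Char]] = {
    val partitioner = new HashPartitioner(numPartitions)
    keys.groupBy(key => partitioner.getPartition(key))
  }

  def main(args: Array[String]): Unit =
    println(route(Seq('A', 'P', 'A', 'P', 'A'), numPartitions = 2))
}
```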

What went wrong?

• Too few partitions to get good concurrency

• Large per-key groupBy()

• Shipped all data across the cluster

Common issue checklist

1. Ensure enough partitions for concurrency

2. Minimize memory consumption (esp. of sorting and large keys in groupBys)

3. Minimize amount of data shuffled

4. Know the standard library

1 & 2 are about tuning number of partitions!

Importance of Partition Tuning

• Main issue: too few partitions

– Less concurrency

– More susceptible to data skew

– Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.

• Secondary issue: too many partitions

• Need “reasonable number” of partitions

– Commonly between 100 and 10,000 partitions

– Lower bound: At least ~2x number of cores in cluster

– Upper bound: Ensure tasks take at least 100ms
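The two bounds above can be turned into simple arithmetic. The helper below is a rough sketch, not a rule from the slides: `PartitionHeuristic`, its method names, and the idea of estimating total task time in milliseconds are all assumptions made for illustration.

```scala
object PartitionHeuristic {
  // Lower bound from the slide: roughly twice the number of cores in the cluster.
  def minPartitions(totalCores: Int): Int = 2 * totalCores

  // Upper bound: if the whole stage needs about totalWorkMillis of task time
  // and each task should run for at least minTaskMillis, then at most
  // totalWorkMillis / minTaskMillis tasks (partitions) fit.
  def maxPartitions(totalWorkMillis: Long, minTaskMillis: Long = 100L): Long =
    math.max(1L, totalWorkMillis / minTaskMillis)

  def main(args: Array[String]): Unit = {
    println(minPartitions(totalCores = 8))
    println(maxPartitions(totalWorkMillis = 60000L))
  }
}
```

A number chosen inside this range can then be passed to `repartition(n)` or as the `numPartitions` argument of a shuffle operation such as `reduceByKey(_ + _, n)`.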

Memory Problems

• Symptoms:

– Inexplicably bad performance

– Inexplicable executor/machine failures

(can indicate too many shuffle files too)

• Diagnosis:

– Set spark.executor.extraJavaOptions to include

• -XX:+PrintGCDetails

• -XX:+HeapDumpOnOutOfMemoryError

– Check dmesg for oom-killer logs

• Resolution:

– Increase spark.executor.memory

– Increase number of partitions

– Re-evaluate program structure (!)
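The diagnosis and resolution settings above can be applied through a `SparkConf`. This is a minimal sketch: the object name `DiagnosticConf`, the application name, and the example memory value are assumptions, while the property names and JVM flags are the ones listed on the slide.

```scala
import org.apache.spark.SparkConf

object DiagnosticConf {
  // Returns a configuration that turns on GC logging and heap dumps on the
  // executors and sets the executor heap size to the given value.
  def build(executorMemory: String): SparkConf =
    new SparkConf()
      .setAppName("names-diagnostics") // hypothetical application name
      .set("spark.executor.extraJavaOptions",
        "-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError")
      .set("spark.executor.memory", executorMemory)

  def main(args: Array[String]): Unit =
    // "4g" is only an example value, not a recommendation from the slides.
    build("4g").getAll.foreach { case (key, value) => println(s"$key=$value") }
}
```

The same properties can also be supplied on the command line with `spark-submit --conf key=value`.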

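Now that we know what is wrong with our program, how to diagnose problems in Spark, and some of the ways to resolve them, we can revise the earlier program.

The original program is the one shown at the start of this post.

The slides then present the revised code.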

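Shuffle optimization was mentioned earlier, but I do not yet understand the shuffle process in detail, and the slides do not explain exactly which step optimizes the shuffle. Based on another article, my preliminary judgment is that replacing groupByKey with reduceByKey is the step that optimizes the shuffle.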
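A sketch of what such a revision could look like is given below, next to the original chain for comparison. It is an illustration consistent with the checklist above, not a verbatim copy of the slide: the object and method names, the sample data, and the partition count are assumptions. `distinct(numPartitions)` removes duplicate names and sets the partition count at the same time, and `reduceByKey` combines the counts for each key inside every partition before the shuffle, so only one small record per key and partition crosses the network.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctNamesRevised {
  // Original approach: ships every name across the cluster, then deduplicates.
  def original(names: RDD[String]): Map[Char, Int] =
    names.map(name => (name.charAt(0), name))
      .groupByKey()
      .mapValues(group => group.toSet.size)
      .collect()
      .toMap

  // Revised approach: deduplicate first, then count with map-side combining.
  def revised(names: RDD[String], numPartitions: Int): Map[Char, Int] =
    names.distinct(numPartitions)
      .map(name => (name.charAt(0), 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("names-revised")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample input standing in for hdfs:/names.
      val names = sc.parallelize(Seq("Ahir", "Pat", "Andy", "Pat", "Ahir"))
      println(original(names))
      println(revised(names, numPartitions = 6))
    } finally sc.stop()
  }
}
```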
