Improving Spark execution efficiency by setting spark.default.parallelism sensibly

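Spark has the concept of a partition (the same thing as a slice, as the official documentation has noted since Spark 1.2), and in general each partition corresponds to one task. In my tests, when spark.default.parallelism was not set, the number of partitions Spark computed was huge and badly out of proportion to the number of cores I had. On two machines (8 cores and 6 GB of memory each), Spark came up with roughly 28,000 to 29,000 partitions, and therefore as many tasks. Each task finished in a few milliseconds or even a fraction of a millisecond, yet the job as a whole ran very slowly. After I set spark.default.parallelism, the number of tasks dropped to 10 and one run of the computation went from minutes down to 20 seconds.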
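
To make the mismatch between task count and core count concrete, here is a rough back-of-the-envelope sketch in Python. It assumes each core runs one task at a time and counts how many scheduling rounds ("waves") a stage needs. The helper name `scheduling_waves` is hypothetical, and 28,000 is only the approximate task count reported above. The sketch does not model real per-task scheduling overhead, which is what actually makes thousands of sub-millisecond tasks slow.

```python
import math


def scheduling_waves(num_tasks: int, total_cores: int) -> int:
    """Rounds needed to run num_tasks when each core runs one task at a time."""
    if num_tasks < 0 or total_cores <= 0:
        raise ValueError("num_tasks must be >= 0 and total_cores must be > 0")
    return math.ceil(num_tasks / total_cores)


# Two machines with 8 cores each, as in the test setup described above.
TOTAL_CORES = 8 * 2

# Approximate task count before tuning (assumption: about 28,000).
waves_before = scheduling_waves(28_000, TOTAL_CORES)
# Task count after setting spark.default.parallelism to 10.
waves_after = scheduling_waves(10, TOTAL_CORES)

print(waves_before)  # 1750
print(waves_after)   # 1
```

With so many rounds of tiny tasks, the fixed cost of launching each task can easily outweigh the few milliseconds of useful work it does.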

The parameter can be set in the configuration file $SPARK_HOME/conf/spark-defaults.conf.

For example:

 spark.master                       spark://master:7077
 spark.default.parallelism          10
 spark.driver.memory                2g
 spark.serializer                   org.apache.spark.serializer.KryoSerializer
 spark.sql.shuffle.partitions       50
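
Each property belongs on its own line, with the key and the value separated by whitespace. As a quick sanity check of such a file, the following Python sketch parses that layout into a dictionary. The function name `parse_spark_defaults` is hypothetical and this is not how Spark itself loads the file; it only handles whitespace-separated pairs and `#` comments, which is a simplifying assumption.

```python
from typing import Dict


def parse_spark_defaults(text: str) -> Dict[str, str]:
    """Parse 'key<whitespace>value' lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace only; the rest is the value.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        props[key] = value
    return props


# The example settings from above, one property per line.
SAMPLE = """\
spark.master                       spark://master:7077
spark.default.parallelism          10
spark.driver.memory                2g
spark.serializer                   org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions       50
"""

props = parse_spark_defaults(SAMPLE)
print(props["spark.default.parallelism"])  # 10
print(len(props))                          # 5
```

If several settings end up on a single line, a parser of this kind sees only one key with a long, meaningless value, so keeping the line breaks matters.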

Below are the relevant descriptions from the official documentation:

from: http://spark.apache.org/docs/latest/configuration.html

Property Name: `spark.default.parallelism`

Default: For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like `parallelize` with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Meaning: Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.
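
The documented default for operations without a parent RDD can be restated as a small Python sketch. This only models the rule quoted above; it is not Spark's own code. The function name and the mode strings `"local"` and `"mesos-fine-grained"` are hypothetical labels chosen for illustration.

```python
def default_parallelism_without_parent(
    cluster_mode: str,
    local_cores: int = 1,
    total_executor_cores: int = 0,
) -> int:
    """Model of the documented default for operations such as parallelize."""
    if cluster_mode == "local":
        # Local mode: number of cores on the local machine.
        return local_cores
    if cluster_mode == "mesos-fine-grained":
        # Mesos fine grained mode: fixed at 8.
        return 8
    # Others: total cores on all executor nodes or 2, whichever is larger.
    return max(total_executor_cores, 2)


print(default_parallelism_without_parent("local", local_cores=4))
print(default_parallelism_without_parent("mesos-fine-grained"))
print(default_parallelism_without_parent("standalone", total_executor_cores=16))
```

For shuffle operations such as `reduceByKey` and `join`, the quoted rule is different: the default follows the largest number of partitions in a parent RDD, so a parent with very many partitions carries that count forward unless the property is set.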

from: http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.
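
The "2-3 tasks per CPU core" guideline is simple arithmetic, sketched here in Python. The helper `recommended_task_range` is a hypothetical name, and the result is only a starting point for experiments, not a guaranteed optimum for any particular job.

```python
from typing import Tuple


def recommended_task_range(
    total_cores: int,
    low_per_core: int = 2,
    high_per_core: int = 3,
) -> Tuple[int, int]:
    """Return the (low, high) task counts for the 2-3 tasks per core guideline."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return (low_per_core * total_cores, high_per_core * total_cores)


# Two machines with 8 cores each, as in the test setup described earlier.
low, high = recommended_task_range(8 * 2)
print(low, high)  # 32 48
```

For the 16 cores used in the test above, the guideline suggests trying values between 32 and 48 for spark.default.parallelism and measuring which works best for the workload.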
