[spark] Spark speculative execution

Overview

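A speculative task is a second copy of a straggling task in a stage: Spark launches the same task again on an executor on another node. Whichever attempt finishes successfully first supplies the final result, and the attempts still running on other executors are killed. Speculative execution is off by default in Spark and is enabled through the spark.speculation property.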
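
As a minimal sketch of switching the feature on from application code, the snippet below assumes Spark is on the classpath and that the application builds its own SparkConf. The object name and application name are placeholders chosen for illustration; only the spark.speculation key comes from the text above.

```scala
import org.apache.spark.SparkConf

object SpeculationConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("speculation-demo")    // hypothetical application name
      .set("spark.speculation", "true")  // the property is off unless set

    // Reads the flag back the same way the scheduler code in this post does.
    println(conf.getBoolean("spark.speculation", false))
  }
}
```

The same key can also be passed on the command line, for example with --conf spark.speculation=true on spark-submit.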

Detecting tasks that need speculative execution

After SparkContext creates the schedulerBackend and the taskScheduler, it immediately calls the taskScheduler's start method:

override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}

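As shown, once TaskScheduler has started the SchedulerBackend, it checks, in non-local mode only, whether speculative execution is enabled (off by default; turned on with spark.speculation). If it is, a thread is started that calls checkSpeculatableTasks every SPECULATION_INTERVAL_MS (100 ms by default, configurable through spark.speculation.interval) to detect tasks that need speculative execution: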

// Check for speculatable tasks in all our active jobs.
def checkSpeculatableTasks() {
  var shouldRevive = false
  synchronized {
    shouldRevive = rootPool.checkSpeculatableTasks()
  }
  if (shouldRevive) {
    backend.reviveOffers()
  }
}
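
The "check periodically, revive offers when something was found" pattern can be tried in isolation. The sketch below is not Spark code: PeriodicCheckSketch, runFor, check and revive are hypothetical names, and a plain ScheduledExecutorService stands in for speculationScheduler.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object PeriodicCheckSketch {
  /**
   * Calls `check` every `intervalMs` for roughly `durationMs`,
   * and calls `revive` each time `check` returns true.
   */
  def runFor(durationMs: Long, intervalMs: Long)(check: () => Boolean, revive: () => Unit): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val tick = new Runnable {
      override def run(): Unit = if (check()) revive()
    }
    // Same initial delay and period, as in the start() method above.
    scheduler.scheduleAtFixedRate(tick, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
    try {
      Thread.sleep(durationMs)
    } finally {
      // Stop the timer and wait for a tick that may still be running.
      scheduler.shutdownNow()
      scheduler.awaitTermination(1, TimeUnit.SECONDS)
    }
  }
}
```

One detail worth knowing about scheduleAtFixedRate: if a run throws, later runs are suppressed. That is presumably why the real code wraps the body in Utils.tryOrStopSparkContext rather than letting the exception escape silently.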

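checkSpeculatableTasks in turn asks rootPool whether any tasks need speculative execution; if so, it calls the SchedulerBackend's reviveOffers to try to obtain resources for running the speculative tasks. The detection logic in rootPool looks like this: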

override def checkSpeculatableTasks(): Boolean = {
  var shouldRevive = false
  for (schedulable <- schedulableQueue.asScala) {
    shouldRevive |= schedulable.checkSpeculatableTasks()
  }
  shouldRevive
}
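
One property of this loop is easy to miss: `|=` on a Boolean does not short-circuit, so every schedulable is checked even after one of them has already returned true. The sketch below illustrates that with stand-in types; Node, Leaf and PoolNode are hypothetical names, not Spark classes.

```scala
import scala.collection.mutable.ArrayBuffer

object PoolSketch {
  trait Node {
    def checkSpeculatableTasks(): Boolean
  }

  // Stand-in for a task set; `hasStraggler` is a hypothetical flag.
  final class Leaf(hasStraggler: Boolean) extends Node {
    var checked = false
    override def checkSpeculatableTasks(): Boolean = {
      checked = true // the real method marks stragglers as a side effect
      hasStraggler
    }
  }

  // Stand-in for a pool: ORs the results of all children.
  final class PoolNode extends Node {
    val children = ArrayBuffer.empty[Node]
    override def checkSpeculatableTasks(): Boolean = {
      var shouldRevive = false
      for (child <- children) {
        // Non-short-circuit OR: the child is always visited.
        shouldRevive |= child.checkSpeculatableTasks()
      }
      shouldRevive
    }
  }
}
```

Had the loop been written with `exists`, it would stop at the first child that reports a straggler, and the remaining task sets would not get their stragglers marked in that round.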

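rootPool in turn calls the same method on every schedulable in schedulableQueue, which is a ConcurrentLinkedQueue[Schedulable] whose entries are TaskSetManager instances. The checkSpeculatableTasks method of TaskSetManager is where the actual detection finally happens: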

override def checkSpeculatableTasks(): Boolean = {
  // No need to check if there is only one task or no task needs to run any more
  if (isZombie || numTasks == 1) {
    return false
  }
  var foundTasks = false
  // Total number of tasks * SPECULATION_QUANTILE (default 0.75, set via spark.speculation.quantile)
  val minFinishedForSpeculation = (SPECULATION_QUANTILE * numTasks).floor.toInt
  logDebug("Checking for speculative tasks: minFinished = " + minFinishedForSpeculation)
  // Have at least 75% of the tasks succeeded, and has at least one task succeeded?
  if (tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0) {
    val time = clock.getTimeMillis()
    // Collect the durations of the successful tasks and sort them
    val durations = taskInfos.values.filter(_.successful).map(_.duration).toArray
    Arrays.sort(durations)
    // Take the median of these durations
    val medianDuration = durations(min((0.5 * tasksSuccessful).round.toInt, durations.length - 1))
    // Median * SPECULATION_MULTIPLIER (default 1.5, set via spark.speculation.multiplier)
    val threshold = max(SPECULATION_MULTIPLIER * medianDuration, 100)
    logDebug("Task length threshold for speculation: " + threshold)
    // Go through the tasks of this TaskSet and add to speculatableTasks every task that has
    // not succeeded, has exactly one copy running, has run longer than threshold, and is
    // not already in the speculatable list
    for ((tid, info) <- taskInfos) {
      val index = info.index
      if (!successful(index) && copiesRunning(index) == 1 && info.timeRunning(time) > threshold &&
        !speculatableTasks.contains(index)) {
        logInfo(
          "Marking task %d in stage %s (on %s) as speculatable because it ran more than %.0f ms"
            .format(index, taskSet.id, info.host, threshold))
        speculatableTasks += index
        foundTasks = true
      }
    }
  }
  foundTasks
}

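The comments in the detection code make the logic clear. Once the number of successful tasks reaches 75% of the total (configurable with spark.speculation.quantile), the run times of all successful tasks are collected and their median is taken. The median multiplied by 1.5 (configurable with spark.speculation.multiplier), with a floor of 100 ms, gives the run-time threshold. Any task that is still running and has already run longer than this threshold is marked for speculation. In short, speculation targets the tasks that hold back overall progress, in order to speed up the whole stage.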
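
To make the arithmetic concrete, here is a small standalone sketch of the threshold calculation only. SpeculationThreshold and its parameter names are hypothetical; the formula follows the TaskSetManager code quoted above, with the defaults 0.75 and 1.5 taken from the text.

```scala
object SpeculationThreshold {
  /**
   * Returns Some(threshold in ms) once enough tasks have succeeded, else None.
   * `successfulDurations` holds the run times (ms) of the tasks that succeeded.
   */
  def threshold(
      successfulDurations: Seq[Long],
      numTasks: Int,
      quantile: Double = 0.75,
      multiplier: Double = 1.5): Option[Double] = {
    val finished = successfulDurations.length
    val minFinished = (quantile * numTasks).floor.toInt
    if (numTasks == 1 || finished == 0 || finished < minFinished) {
      None // too early to judge, or nothing to compare against
    } else {
      val sorted = successfulDurations.sorted
      // Same index rule as the quoted code: round(0.5 * n), capped at the last element.
      val medianIndex = math.min((0.5 * finished).round.toInt, sorted.length - 1)
      val median = sorted(medianIndex)
      Some(math.max(multiplier * median, 100.0)) // never below 100 ms
    }
  }
}
```

For example, with 8 tasks of which 6 have succeeded in 900, 1000, 1000, 1100, 1200 and 5000 ms, the minimum finished count is floor(0.75 * 8) = 6, the median picked is 1100 ms, and the threshold is 1650 ms. A task that has been running for 2000 ms would therefore be marked, while one at 1500 ms would not.
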
The rough flow of the algorithm is shown in the figure:

When speculative tasks are scheduled

When TaskSetManager assigns a task to an executor under the delay scheduling policy, it calls the dequeueTask method:

private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
  : Option[(Int, TaskLocality.Value, Boolean)] =
{
  for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
    return Some((index, TaskLocality.PROCESS_LOCAL, false))
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
    for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
      return Some((index, TaskLocality.NODE_LOCAL, false))
    }
  }

  ......

  // find a speculative task if all others tasks have been scheduled
  dequeueSpeculativeTask(execId, host, maxLocality).map {
    case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
}

The last part of this method schedules speculative tasks once all other tasks have been scheduled. Its implementation is:

protected def dequeueSpeculativeTask(execId: String, host: String, locality: TaskLocality.Value)
  : Option[(Int, TaskLocality.Value)] =
{
  // Drop tasks that have already succeeded from the speculatable set: some time passes
  // between detection and scheduling, and some tasks may have finished in the meantime
  speculatableTasks.retain(index => !successful(index)) // Remove finished tasks from set

  // Can the task run on the host of this executor? The conditions are:
  //   no attempt of the task is already running on this host;
  //   the executor is not blacklisted for the task (the task failed on this executor
  //   before and the blacklist period has not yet expired)
  def canRunOnHost(index: Int): Boolean =
    !hasAttemptOnHost(index, host) && !executorIsBlacklisted(execId, index)

  if (!speculatableTasks.isEmpty) {
    // Look for a task index that can be launched on this executor
    for (index <- speculatableTasks if canRunOnHost(index)) {
      // Get the preferred locations of the task
      val prefs = tasks(index).preferredLocations
      val executors = prefs.flatMap(_ match {
        case e: ExecutorCacheTaskLocation => Some(e.executorId)
        case _ => None
      });
      // If a preferred location is an ExecutorCacheTaskLocation and the executors holding
      // the data include the current executor, return the task's index in the TaskSet
      // together with the locality level
      if (executors.contains(execId)) {
        speculatableTasks -= index
        return Some((index, TaskLocality.PROCESS_LOCAL))
      }
    }

    // This check is where delay scheduling applies: even a speculative task is launched
    // at the best locality level possible
    if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
      for (index <- speculatableTasks if canRunOnHost(index)) {
        val locations = tasks(index).preferredLocations.map(_.host)
        if (locations.contains(host)) {
          speculatableTasks -= index
          return Some((index, TaskLocality.NODE_LOCAL))
        }
      }
    }

    ........
  }

  None
}

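The code is too long, so only the first part is listed; the rest follows the same pattern, and the comments in the code are clear. Tasks that have already succeeded are filtered out first. A speculative copy is never run on the host where the task is already running, nor on an executor that is blacklisted for it. Subject to those constraints, the delay scheduling policy uses the task's preferred locations to decide whether the task is scheduled on the offered executor at a given locality level.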
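
The selection rules just summarized can be sketched without any Spark types. Everything below is a simplified stand-in: Task, Locality, pick and the blacklist set are hypothetical names, and only the two locality levels shown in the listing are modelled; the levels elided from the listing are deliberately left out.

```scala
object SpeculativeDequeueSketch {
  // Simplified locality levels, ordered from best (0) to worse.
  sealed abstract class Locality(val rank: Int)
  case object ProcessLocal extends Locality(0)
  case object NodeLocal extends Locality(1)

  // Hypothetical description of a speculatable task.
  final case class Task(
      index: Int,
      runningOnHosts: Set[String],     // hosts that already run an attempt
      preferredExecutors: Set[String], // executors that cache the task's data
      preferredHosts: Set[String])     // hosts that hold the task's data

  def pick(
      speculatable: Seq[Task],
      finished: Set[Int],
      blacklist: Set[(String, Int)],   // (executor id, task index) pairs
      execId: String,
      host: String,
      maxLocality: Locality): Option[(Int, Locality)] = {
    // Drop finished tasks, tasks already running on this host, and blacklisted pairs.
    val candidates = speculatable.filter { t =>
      !finished.contains(t.index) &&
      !t.runningOnHosts.contains(host) &&
      !blacklist.contains((execId, t.index))
    }
    def allowed(level: Locality): Boolean = level.rank <= maxLocality.rank

    // Best level first: data cached on the offered executor itself.
    val processLocal: Option[(Int, Locality)] =
      candidates.find(_.preferredExecutors.contains(execId)).map(t => (t.index, ProcessLocal))
    // Then data on the offered host, but only if delay scheduling permits this level.
    def nodeLocal: Option[(Int, Locality)] =
      if (allowed(NodeLocal)) {
        candidates.find(_.preferredHosts.contains(host)).map(t => (t.index, NodeLocal))
      } else None

    processLocal.orElse(nodeLocal)
  }
}
```

Unlike the real method, this sketch is pure: it does not remove the chosen index from the speculatable set, so a caller would have to do that itself before making the next offer.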