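I had often seen "Container killed by the ApplicationMaster" in map tasks and found it odd, but since the jobs ultimately succeeded I never looked into it. Recently, though, jobs on our test cluster started reporting error 143, so I took another look at the problem.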

Version analyzed: Hadoop CDH 5.4

Error log

2015-12-30 17:31:09,994 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1451360650431_0228_m_000000_0 TaskAttempt Transitioned from RUNNING to SUCCESS_CONTAINER_CLEANUP
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1451360650431_0228_01_000002 taskAttempt attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : uaerouter2-vm02:8041
2015-12-30 17:31:10,012 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1451360650431_0228_m_000000_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2015-12-30 17:31:10,020 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:10,021 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1451360650431_0228_m_000000 Task Transitioned from RUNNING to SUCCEEDED
2015-12-30 17:31:10,023 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
2015-12-30 17:31:10,895 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:14336, vCores:14>
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold reached. Scheduling reduces.
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: All maps assigned. Ramping up all remaining reduces:1
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:11,906 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1451360650431_0228: ask=1 release= 0 newContainers=0 finishedContainers=1 resourcelimit=<memory:15360, vCores:15> knownNMs=2
2015-12-30 17:31:11,906 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1451360650431_0228_01_000002
2015-12-30 17:31:11,907 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1 AssignedMaps:0 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:11,908 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1451360650431_0228_m_000000_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
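
The 143 in the last lines of the log follows the usual shell convention of 128 + signal number, so it corresponds to signal 15 (SIGTERM). The helper below is only an illustrative sketch for decoding such codes; `describe_exit_code` is a name made up for this post and is not part of Hadoop.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code, assuming the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(143))  # terminated by SIGTERM
```

In other words, the container did not crash by itself; something asked it to terminate.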

Analyzing the error log

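A mapper task finishes and sends the event TaskAttemptEventType.TA_DONE.
CleanupContainerTransition runs: it unregisters the attempt from taskAttemptListener and sends the event ContainerLauncher.EventType.CONTAINER_REMOTE_CLEANUP to containerLauncher.
containerLauncher handles the cleanup request; it sees that the container has not finished yet, so it has the container killed.
Because the container had not exited yet and was killed, exit code 143 is reported.
The TaskAttempt state goes from RUNNING to SUCCESS_CONTAINER_CLEANUP.
After the kill, the event TaskAttemptEventType.TA_CONTAINER_CLEANED is sent.
The TaskAttempt state goes from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED.
The 143 error appears because the kill is issued without waiting for the container to exit. A simple fix would be to wait a few seconds before killing it, but how many seconds is enough? And waiting a few seconds on every task would slow the whole cluster down. So let's see how the experts fixed it.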
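
To see the effect outside Hadoop, here is a minimal sketch of the "wait a bit, then kill" idea using ordinary child processes. It assumes a POSIX system, where `Popen.terminate()` sends SIGTERM and a signal death shows up as a negative return code. `stop_child` and `grace_seconds` are names invented for this illustration and have nothing to do with the actual AM code.

```python
import subprocess
import sys


def stop_child(proc: subprocess.Popen, grace_seconds: float) -> int:
    """Wait up to grace_seconds for the child to exit by itself, then SIGTERM it.

    Returns a shell-style exit code: 128 + signal number if the child was
    killed by a signal, otherwise the child's own exit status.
    """
    try:
        code = proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM on POSIX
        code = proc.wait()
    # On POSIX, Popen reports death by signal N as return code -N.
    return 128 - code if code < 0 else code


def spawn(python_source: str) -> subprocess.Popen:
    """Start a child Python interpreter running the given source."""
    return subprocess.Popen([sys.executable, "-c", python_source])
```

Calling `stop_child(spawn("import time; time.sleep(30)"), grace_seconds=0)` kills the child right away and yields 143, while `stop_child(spawn("pass"), grace_seconds=10)` lets the child finish and yields 0. The weakness is the one mentioned above: any fixed grace period is a guess, and paying it on every task adds up.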

Analyzing the patch
The relevant patch is MAPREDUCE-5465.

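This patch inserts a new state, SUCCESS_FINISHING_CONTAINER, into the RUNNING -> SUCCESS_CONTAINER_CLEANUP transition (this is the success path; failures get a separate state). The whole point of the new state is to wait for the container to exit by itself.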

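A mapper task finishes and sends the event TaskAttemptEventType.TA_DONE.
The attempt registers with TaskAttemptFinishingMonitor (this class guards against a container that never exits; more on it below). Although the container has not exited yet, TaskEventType.T_ATTEMPT_SUCCEEDED is sent anyway.
The state goes from RUNNING to SUCCESS_FINISHING_CONTAINER.
The attempt waits for the container to exit (this relies on the NodeManager noticing the exit and reporting it to the RM, which then passes it on to the AM).
Once the AM learns that the container has exited, it checks the exit status: if the container was terminated, it sends TaskAttemptEventType.TA_KILL; otherwise it sends TaskAttemptEventType.TA_CONTAINER_COMPLETED.
On TA_KILL the state becomes SUCCESS_CONTAINER_CLEANUP or KILL_CONTAINER_CLEANUP. SUCCESS_CONTAINER_CLEANUP can then move on to SUCCEEDED.
On TA_CONTAINER_COMPLETED the state becomes SUCCEEDED directly.
The event TA_CONTAINER_CLEANED may also arrive. It is only sent under certain conditions; judging from the code comments, it seems to be equivalent to TA_CONTAINER_COMPLETED (annoying).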
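
The transitions listed above can be written down as a small lookup table. This is only a model of what the text describes, not the real TaskAttemptImpl state machine: the state and event names are taken from the post, everything else (`TRANSITIONS`, `next_states`, `happy_path`) is made up for illustration, and treating TA_CONTAINER_CLEANED like TA_CONTAINER_COMPLETED is my tentative reading of the comments rather than a confirmed fact.

```python
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()
    SUCCESS_FINISHING_CONTAINER = auto()
    SUCCESS_CONTAINER_CLEANUP = auto()
    KILL_CONTAINER_CLEANUP = auto()
    SUCCEEDED = auto()


class Event(Enum):
    TA_DONE = auto()
    TA_CONTAINER_COMPLETED = auto()
    TA_CONTAINER_CLEANED = auto()
    TA_KILL = auto()


# (current state, event) -> set of states the attempt may move to.
TRANSITIONS = {
    (State.RUNNING, Event.TA_DONE): {State.SUCCESS_FINISHING_CONTAINER},
    (State.SUCCESS_FINISHING_CONTAINER, Event.TA_CONTAINER_COMPLETED): {State.SUCCEEDED},
    # Assumption: handled like TA_CONTAINER_COMPLETED, per the hedged reading above.
    (State.SUCCESS_FINISHING_CONTAINER, Event.TA_CONTAINER_CLEANED): {State.SUCCEEDED},
    (State.SUCCESS_FINISHING_CONTAINER, Event.TA_KILL): {
        State.SUCCESS_CONTAINER_CLEANUP,
        State.KILL_CONTAINER_CLEANUP,
    },
    (State.SUCCESS_CONTAINER_CLEANUP, Event.TA_CONTAINER_CLEANED): {State.SUCCEEDED},
}


def next_states(state: State, event: Event) -> set:
    """Return the possible next states, or an empty set if the model has no such transition."""
    return TRANSITIONS.get((state, event), set())


def happy_path() -> list:
    """Walk the path where the container exits by itself and is never killed."""
    state = State.RUNNING
    visited = [state]
    for event in (Event.TA_DONE, Event.TA_CONTAINER_COMPLETED):
        (state,) = next_states(state, event)  # exactly one target on this path
        visited.append(state)
    return visited
```

On the happy path the attempt goes RUNNING, SUCCESS_FINISHING_CONTAINER, SUCCEEDED without any kill being issued, which is why the 143 no longer shows up for containers that exit on their own.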

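Registering with TaskAttemptFinishingMonitor sets up a timer. If the container still has not exited when the timeout elapses, the monitor raises TA_TIMED_OUT, which moves the state from SUCCESS_FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP, and the container is then cleaned up by killing it.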
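
The idea behind the monitor can be sketched as a table of deadlines. This is a simplified, hypothetical model and not the real TaskAttemptFinishingMonitor: the class name, method names, and the 60-second value in the usage note are all assumptions, and the clock is passed in explicitly to keep the example deterministic.

```python
class FinishingMonitorSketch:
    """Tracks attempts that are waiting for their container to exit."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._deadlines = {}  # attempt id -> time at which it counts as timed out

    def register(self, attempt_id: str, now: float) -> None:
        """Start the timer when the attempt enters the finishing state."""
        self._deadlines[attempt_id] = now + self.timeout_seconds

    def unregister(self, attempt_id: str) -> None:
        """Stop the timer once the container has exited by itself."""
        self._deadlines.pop(attempt_id, None)

    def collect_timed_out(self, now: float) -> list:
        """Return and forget the attempts whose deadline has passed.

        In the flow described above, each of these would receive TA_TIMED_OUT.
        """
        expired = sorted(a for a, deadline in self._deadlines.items() if deadline <= now)
        for attempt_id in expired:
            del self._deadlines[attempt_id]
        return expired
```

For example, with `FinishingMonitorSketch(timeout_seconds=60)`, registering two attempts at time 0 and unregistering one of them means that `collect_timed_out(now=61)` returns only the attempt whose container never exited. Only that straggler gets killed, so well-behaved tasks pay no fixed waiting cost.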

Failed task attempts are handled the same way, with their own corresponding states.
