Container killed by the ApplicationMaster, Exit code is 143
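I had often seen "Container killed by the ApplicationMaster" in map tasks and found it odd, but since the jobs eventually succeeded I never looked into it. Recently, though, jobs on our test cluster started reporting error 143, so it is worth taking another look at this problem.
Version analyzed: Hadoop CDH 5.4
Error log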
2015-12-30 17:31:09,994 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1451360650431_0228_m_000000_0 TaskAttempt Transitioned from RUNNING to SUCCESS_CONTAINER_CLEANUP
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1451360650431_0228_01_000002 taskAttempt attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : uaerouter2-vm02:8041
2015-12-30 17:31:10,012 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1451360650431_0228_m_000000_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2015-12-30 17:31:10,020 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:10,021 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1451360650431_0228_m_000000 Task Transitioned from RUNNING to SUCCEEDED
2015-12-30 17:31:10,023 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
2015-12-30 17:31:10,895 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:14336, vCores:14>
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold reached. Scheduling reduces.
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: All maps assigned. Ramping up all remaining reduces:1
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:11,906 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1451360650431_0228: ask=1 release= 0 newContainers=0 finishedContainers=1 resourcelimit=<memory:15360, vCores:15> knownNMs=2
2015-12-30 17:31:11,906 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1451360650431_0228_01_000002
2015-12-30 17:31:11,907 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1 AssignedMaps:0 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:11,908 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1451360650431_0228_m_000000_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
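Analyzing the error log
One of the mapper tasks finishes and sends the event TaskAttemptEventType.TA_DONE
CleanupContainerTransition runs, unregisters the attempt from taskAttemptListener, and sends the event ContainerLauncher.EventType.CONTAINER_REMOTE_CLEANUP to containerLauncher
containerLauncher handles the cleanup event, finds that the container has not yet reached a completed state, and asks for the container to be killed
The container has not exited yet and is forcibly killed, hence the 143 error
The TaskAttempt state goes from RUNNING to SUCCESS_CONTAINER_CLEANUP
After the kill, the event TaskAttemptEventType.TA_CONTAINER_CLEANED is sent
The TaskAttempt state goes from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED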
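The sequence above can be sketched as a tiny state machine. This is a toy model in Python, not Hadoop's Java code: the state and event names come from the log and the steps above, while run_old_flow, its parameter, and the "exit code 0 when the container already exited" branch are assumptions made for illustration.

```python
# Toy model of the pre-patch flow; NOT Hadoop's actual implementation.
# State and event names are taken from the log, everything else is hypothetical.
OLD_TRANSITIONS = {
    ("RUNNING", "TA_DONE"): "SUCCESS_CONTAINER_CLEANUP",
    ("SUCCESS_CONTAINER_CLEANUP", "TA_CONTAINER_CLEANED"): "SUCCEEDED",
}


def run_old_flow(container_already_exited):
    """Walk one successful attempt through the old flow.

    Returns (final_state, container_exit_code).
    """
    state = "RUNNING"

    # The mapper reports that it is done.
    state = OLD_TRANSITIONS[(state, "TA_DONE")]

    # CONTAINER_REMOTE_CLEANUP reaches the launcher, which kills the
    # container if it is still alive. Assumption: a container that had
    # already exited on its own would have reported exit code 0.
    exit_code = 0 if container_already_exited else 143

    # The launcher reports that the container has been cleaned up.
    state = OLD_TRANSITIONS[(state, "TA_CONTAINER_CLEANED")]
    return state, exit_code


if __name__ == "__main__":
    print(run_old_flow(container_already_exited=False))
```

The point of the sketch is that the attempt reaches SUCCEEDED either way; only the container's exit code differs, which matches the log where the task succeeds and the 143 appears only in the diagnostics.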
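The 143 error shows up because the kill is issued without waiting for the container to exit. The simple fix would be to wait a few seconds before killing it, but how many seconds is enough? Making every attempt wait a few seconds would also slow down the cluster. It is better to see how the experts fixed it.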
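As background on the number itself: 143 is 128 + 15, the conventional shell exit status for a process ended by SIGTERM (signal 15). The sketch below shows that mapping in Python. The helper names are hypothetical, and the subprocess part assumes a POSIX system, where Python reports a signal-terminated child as a negative return code.

```python
import signal
import subprocess
import sys


def shell_style_exit_code(returncode):
    """Convert a subprocess return code to the shell convention.

    On POSIX, Python reports "killed by signal N" as -N, while shells
    (and the container diagnostics) report it as 128 + N.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def kill_sleeping_child():
    """Start a child that sleeps, send it SIGTERM, return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.terminate()  # sends SIGTERM on POSIX
    child.wait()
    return child.returncode


if __name__ == "__main__":
    # Pure arithmetic: SIGTERM is signal 15, so the shell-style code is 143.
    print(shell_style_exit_code(-signal.SIGTERM))
    # Live demonstration (POSIX only).
    print(shell_style_exit_code(kill_sleeping_child()))
```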
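Analyzing the patch
The relevant patch is MAPREDUCE-5465.
This patch inserts a new state, SUCCESS_FINISHING_CONTAINER, into the RUNNING -> SUCCESS_CONTAINER_CLEANUP transition (this is the success case; failure has its own state). The state exists so that the AM can wait for the container to exit by itself.
One of the mapper tasks finishes and sends the event TaskAttemptEventType.TA_DONE
The attempt registers with TaskAttemptFinishingMonitor (this class guards against a container that never exits; more on it below). Even though the container has not exited yet, the TaskEventType.T_ATTEMPT_SUCCEEDED event is sent anyway
The state goes from RUNNING to SUCCESS_FINISHING_CONTAINER
Wait for the container to exit (this relies on the NodeManager noticing the exit and reporting it to the RM, which then passes it on to the AM)
Once the AM learns that the container has exited, it checks the exit status: if the container was terminated it sends TaskAttemptEventType.TA_KILL, otherwise it sends TaskAttemptEventType.TA_CONTAINER_COMPLETED
If the event is TA_KILL, the state becomes SUCCESS_CONTAINER_CLEANUP or KILL_CONTAINER_CLEANUP. SUCCESS_CONTAINER_CLEANUP can then move on to SUCCEEDED
If the event is TA_CONTAINER_COMPLETED, the state becomes SUCCEEDED
The event TA_CONTAINER_CLEANED may also arrive. It is only sent under certain conditions, and judging by the code comments it seems to be equivalent to TA_CONTAINER_COMPLETED (annoying)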
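The patched flow described in these steps can be written down as a transition table. This is a toy model in Python, not the patch's Java code: it only encodes the transitions named in the steps above, and the helper names are hypothetical. Because the post says TA_KILL can lead to one of two states, each entry maps to a set of possible next states.

```python
# Toy transition table for the post-patch success path; NOT Hadoop's code.
# (state, event) -> set of possible next states, as described in the steps.
NEW_TRANSITIONS = {
    ("RUNNING", "TA_DONE"): {"SUCCESS_FINISHING_CONTAINER"},
    ("SUCCESS_FINISHING_CONTAINER", "TA_CONTAINER_COMPLETED"): {"SUCCEEDED"},
    ("SUCCESS_FINISHING_CONTAINER", "TA_KILL"): {
        "SUCCESS_CONTAINER_CLEANUP",
        "KILL_CONTAINER_CLEANUP",
    },
    # Same cleanup-to-success step as in the pre-patch flow.
    ("SUCCESS_CONTAINER_CLEANUP", "TA_CONTAINER_CLEANED"): {"SUCCEEDED"},
}


def event_for_container_exit(was_terminated):
    """Pick the event the AM sends once it learns the container has exited."""
    return "TA_KILL" if was_terminated else "TA_CONTAINER_COMPLETED"


def next_states(state, event):
    """Return the possible next states, or an empty set if none is modelled."""
    return NEW_TRANSITIONS.get((state, event), set())


if __name__ == "__main__":
    state = "RUNNING"
    for event in ("TA_DONE", event_for_container_exit(was_terminated=False)):
        (state,) = next_states(state, event)  # exactly one candidate here
        print(event, "->", state)
```

In the normal case the container exits on its own, so the attempt goes RUNNING -> SUCCESS_FINISHING_CONTAINER -> SUCCEEDED without any kill being issued.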
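Registering with TaskAttemptFinishingMonitor sets up a timer. If the container has still not exited when the timeout expires, the monitor raises TA_TIMED_OUT, which moves the state from SUCCESS_FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP, and the container is then cleaned up by killing it.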
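The idea of "register, and fire TA_TIMED_OUT only if the container has not exited in time" can be sketched as follows. This is an illustration in Python, not how Hadoop implements TaskAttemptFinishingMonitor: the class name FinishingMonitorSketch, its methods, and the choice of one threading.Timer per attempt are all assumptions made for this example.

```python
import threading
import time


class FinishingMonitorSketch:
    """Hypothetical stand-in for the finishing monitor described above."""

    def __init__(self, timeout_seconds, on_timeout):
        self._timeout = timeout_seconds
        self._on_timeout = on_timeout  # called as on_timeout(attempt_id, event)
        self._timers = {}
        self._lock = threading.Lock()

    def register(self, attempt_id):
        """Start the countdown when the attempt enters the finishing state."""
        timer = threading.Timer(self._timeout, self._expire, args=(attempt_id,))
        timer.daemon = True
        with self._lock:
            self._timers[attempt_id] = timer
        timer.start()

    def unregister(self, attempt_id):
        """Cancel the countdown once the container is known to have exited."""
        with self._lock:
            timer = self._timers.pop(attempt_id, None)
        if timer is not None:
            timer.cancel()

    def _expire(self, attempt_id):
        # Fire only if the attempt was not unregistered in the meantime.
        with self._lock:
            still_registered = self._timers.pop(attempt_id, None) is not None
        if still_registered:
            self._on_timeout(attempt_id, "TA_TIMED_OUT")


if __name__ == "__main__":
    events = []
    monitor = FinishingMonitorSketch(0.05, lambda a, e: events.append((a, e)))
    monitor.register("attempt_a")   # container never exits: will time out
    monitor.register("attempt_b")   # container exits in time
    monitor.unregister("attempt_b")
    time.sleep(0.3)
    print(events)
```

With this design the kill becomes a fallback: an attempt whose container exits promptly is unregistered and never sees TA_TIMED_OUT, so no 143 is produced for it.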
Failed task attempts get the same treatment, with their own corresponding states.