Container killed by the ApplicationMaster, Exit code is 143

之前发现在map任务里面经常看到Container killed by the ApplicationMaster，挺奇怪，不过任务最终是成功的，就没怎么管。不过最近测试集群跑的任务报143错误，还是重新看一下这个问题。

分析版本：hadoop cdh5.4

错误日志

2015-12-30 17:31:09,994 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1451360650431_0228_m_000000_0 TaskAttempt Transitioned from RUNNING to SUCCESS_CONTAINER_CLEANUP
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1451360650431_0228_01_000002 taskAttempt attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:09,995 INFO [ContainerLauncher #1] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : uaerouter2-vm02:8041
2015-12-30 17:31:10,012 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1451360650431_0228_m_000000_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2015-12-30 17:31:10,020 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1451360650431_0228_m_000000_0
2015-12-30 17:31:10,021 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1451360650431_0228_m_000000 Task Transitioned from RUNNING to SUCCEEDED
2015-12-30 17:31:10,023 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
2015-12-30 17:31:10,895 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:14336, vCores:14>
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold reached. Scheduling reduces.
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: All maps assigned. Ramping up all remaining reduces:1
2015-12-30 17:31:10,898 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:11,906 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1451360650431_0228: ask=1 release= 0 newContainers=0 finishedContainers=1 resourcelimit=<memory:15360, vCores:15> knownNMs=2
2015-12-30 17:31:11,906 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1451360650431_0228_01_000002
2015-12-30 17:31:11,907 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1 AssignedMaps:0 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2015-12-30 17:31:11,908 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1451360650431_0228_m_000000_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

分析一下错误日志

其中一个mapper任务完成，发送事件TaskAttemptEventType.TA_DONE
执行CleanupContainerTransition，取消taskAttemptListener监听，发送事件 ContainerLauncher.EventType.CONTAINER_REMOTE_CLEANUP给containerLauncher
containerLauncher执行kill命令，但判断container状态还没完成，让container kill掉
container还没有退出，被强制kill掉，所以有一个143的错误
TaskAttempt状态由RUNNING 转为 SUCCESS_CONTAINER_CLEANUP
kill了之后发送事件TaskAttemptEventType.TA_CONTAINER_CLEANED
TaskAttempt状态由SUCCESS_CONTAINER_CLEANUP转为SUCCEEDED
这里会有143的错误的原因是没等到container退出，就发起命令kill掉它了。简单的改法是等待几秒后再去kill它，但是到底要等几秒，每个都要等几秒拖慢了集群。还是看大神是怎么改的。

分析补丁
找到 MAPREDUCE-5465 这个补丁。

在这个补丁里，在RUNNING -> SUCCESS_CONTAINER_CLEANUP 状态转换中，插入了一个状态 SUCCESS_FINISHING_CONTAINER （成功的场景，失败有另一个状态）。引入这个状态就是为了等container自己退出。

其中一个mapper任务完成，发送事件TaskAttemptEventType.TA_DONE
向TaskAttemptFinishingMonitor注册(这个类用来避免container一直不退出，之后说)，尽管container这时还没退出，还是发送TaskEventType.T_ATTEMPT_SUCCEEDED事件
状态由RUNNING 转为 SUCCESS_FINISHING_CONTAINER
等待container退出（依赖nodemanager发现container退出，汇报给RM，然后再传给AM）
AM知道container退出后，检查退出状态，如果是被终止的，就发送TaskAttemptEventType.TA_KILL，否则发送TaskAttemptEventType.TA_CONTAINER_COMPLETED
如果事件是TA_KILL，状态则会变成SUCCESS_CONTAINER_CLEANUP或者KILL_CONTAINER_CLEANUP。SUCCESS_CONTAINER_CLEANUP可以再转换为SUCCESSED
如果事件是TA_CONTAINER_COMPLETED，状态就变成SUCCESSED了
另外还会收到事件TA_CONTAINER_CLEANED，有条件才会发这个事件的，看注释，好像是等价于TA_CONTAINER_COMPLETED（好烦）

向TaskAttemptFinishingMonitor注册是这里设定了一个定时器，如果超过时间container都还没有退出，这个Monitor就会发起TA_TIMED_OUT的操作了，会把状态由 SUCCESS_FINISHING_CONTAINER 转为 SUCCESS_CONTAINER_CLEANUP，通过kill的方式清理。

同样，这里还会对失败的taskattempt也做同样的处理，会有对应的状态。

Container killed by the ApplicationMaster, Exit code is 143相关推荐

【异常】Container exited with a non-zero exit code 1 Failing this attempt.Stack trace: ExitCodeException
[异常]Container exited with a non-zero exit code 1 Failing this attempt.Stack trace: ExitCodeException ...
【Flink】Flink Container exited with a non-zero exit code 143
1.概述偶然查询一个环境的日志,发现这个环境有报错 Flink Container exited with a non-zero exit code 143. 2022-01-30 12:58:16 ...
Hadoop之——Hadoop3.x运行自带的WordCount报错Container exited with a non-zero exit code 1.
转载请注明出处:https://blog.csdn.net/l1028386804/article/details/93750832 问题: 今天,基于Hadoop3.2.0搭建了Hadoop集群,对 ...
【Flink】Flink Dump of the process-tree for container Exit code is 143
1.背景 flink任务报错这个如下 2020-09-27 16:38:36,063 INFO org.apache.flink.yarn.YarnResourceManager
Container killed on request. Exit code is 143_Permission denied: user=dr.who, a---Hadoop3.x启动报错记录001
Permission denied: user=dr.who, access=WRITE, inode="/":root:supergroup:drwxr-xr-x 这个错误还 ...
Vue3运行源码-调试报错：Command failed with exit code 128: git rev-parse HEAD
Vue3运行源码-调试报错:Command failed with exit code 128: git rev-parse HEAD 下载源码运行源码运行源码报错成功运行下载源码官网下载源 ...
Android Studio Emulator Process finished with exit code -1073741515 (0xC0000135)
背景准备学习一下其他语言,莫名安装了一些软件,然后卸载了一些软件,隔天再打开Android Studio 的模拟器的时候一直无法打开. Bug Event log中报错如下: Emulator: P ...
python Process finished with exit code -1073741819 (0xC0000005) 解决
运行程序时,Process finished with exit code -1073741819 (0xC0000005) 报错原因:没有 python33.dll 在 c:\WINDOWS\sy ...
Process finished with exit code -1073741819 (0xC0000005)
Process finished with exit code -1073741819 (0xC0000005) pycharm报错:Process finished with exit code - ...

Container killed by the ApplicationMaster, Exit code is 143

Container killed by the ApplicationMaster, Exit code is 143相关推荐

最新文章

热门文章