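Flink on YARN: Container released on a *lost* node
Contents
- Background
- Checking the logs on YARN
Background
After running for some time in Flink on YARN cluster mode, the job suddenly failed.
Searching the logs for the exception turned up "Container released on a *lost* node".
The full error output is shown below.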
2021-12-24 09:43:43,931 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (6/24) (1af69dcda39c4eb77cc0af944aa33f70) switched from RUNNING to FAILED.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-12-24 09:43:43,944 WARN org.apache.flink.yarn.YarnResourceManager - Discard registration from TaskExecutor container_e12_1621343092868_870015_01_000013 at (akka.tcp://flink@ip:42538/user/taskmanager_0) because the framework did not recognize it
2021-12-24 09:43:43,995 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_5.
2021-12-24 09:43:44,012 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_5.
2021-12-24 09:43:44,033 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job t5_parse_classify (e1b87d6b42299c0fc45dc648aa0e20cc) switched from state RUNNING to RESTARTING.
2021-12-24 09:43:44,034 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (6/24) (542a4aa5761420c0bdf76cfe60a6d607) switched from RUNNING to CANCELING.
2021-12-24 09:43:44,037 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 1af69dcda39c4eb77cc0af944aa33f70.
2021-12-24 09:43:44,073 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (6/24) (542a4aa5761420c0bdf76cfe60a6d607) switched from CANCELING to CANCELED.
2021-12-24 09:43:44,073 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 542a4aa5761420c0bdf76cfe60a6d607.
2021-12-24 09:43:44,075 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 542a4aa5761420c0bdf76cfe60a6d607.
2021-12-24 09:43:44,080 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (21/24) (0f92112ca6dc6e8b12d6375006a2269c) switched from RUNNING to FAILED.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_20.
2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_20.
2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (21/24) (812416954373cfd3b8e2ac0d6ab7609a) switched from RUNNING to CANCELING.
2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 0f92112ca6dc6e8b12d6375006a2269c.
2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (21/24) (812416954373cfd3b8e2ac0d6ab7609a) switched from CANCELING to CANCELED.
2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 812416954373cfd3b8e2ac0d6ab7609a.
2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 812416954373cfd3b8e2ac0d6ab7609a.
2021-12-24 09:43:44,082 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (8/24) (493ea4995fe180b2dcc2965254973dd6) switched from RUNNING to FAILED.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-12-24 09:43:44,082 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_7.
2021-12-24 09:43:44,082 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_7.
2021-12-24 09:43:44,083 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job t5_parse_classify (e1b87d6b42299c0fc45dc648aa0e20cc) switched from state RESTARTING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy(FailureRateRestartBackoffTimeStrategy(failuresIntervalMS=300000,backoffTimeMS=10000,maxFailuresPerInterval=3)
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:484)
    at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1703)
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1252)
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1220)
    at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:955)
    at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
    at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
    at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:730)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:710)
    at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:541)
    at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:667)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:274)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    ... 22 more
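The log shows three tasks switching from RUNNING to FAILED within roughly 150 ms, after which recovery is suppressed by FailureRateRestartBackoffTimeStrategy (failuresIntervalMS=300000, maxFailuresPerInterval=3) and the job moves to FAILING. The Python sketch below is a simplified model of that kind of failure-rate check, written only to make the sequence easier to follow; it is not Flink's actual implementation. The function names `failure_times` and `restart_decisions` are hypothetical, and the sample lines are shortened versions of the log entries above.

```python
import re
from collections import deque
from datetime import datetime

# Captures the timestamp of log lines that report a task failure.
FAILED_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*switched from RUNNING to FAILED\."
)


def failure_times(log_lines):
    """Return the timestamps of all 'RUNNING to FAILED' transitions."""
    times = []
    for line in log_lines:
        match = FAILED_LINE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def restart_decisions(times, max_failures=3, interval_ms=300_000):
    """Simplified sliding-window model (an assumption, not Flink's code):
    a restart is refused once max_failures failures fall inside one interval."""
    window = deque()
    decisions = []
    for t in times:
        if len(window) >= max_failures:
            window.popleft()  # keep only the most recent failures
        window.append(t)
        window_full = len(window) >= max_failures
        span_ms = (t - window[0]).total_seconds() * 1000
        decisions.append(not (window_full and span_ms <= interval_ms))
    return decisions


# Shortened sample lines based on the log above.
sample = [
    "2021-12-24 09:43:43,931 INFO ExecutionGraph - Filter (6/24) switched from RUNNING to FAILED.",
    "2021-12-24 09:43:44,033 INFO ExecutionGraph - Job t5_parse_classify switched from state RUNNING to RESTARTING.",
    "2021-12-24 09:43:44,080 INFO ExecutionGraph - Filter (21/24) switched from RUNNING to FAILED.",
    "2021-12-24 09:43:44,082 INFO ExecutionGraph - Filter (8/24) switched from RUNNING to FAILED.",
]
print(restart_decisions(failure_times(sample)))  # [True, True, False]
```

Under this model the first two failures still allow a restart, while the third one within the five-minute window does not, which matches the RESTARTING to FAILING transition in the log. Raising the allowed failure rate would only delay the failure; the lost node remains the underlying cause.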
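Checking the logs on YARN
From the screenshot above, it can be roughly inferred that the failure was caused by a node being lost.
Checking the YARN node list showed that the node had run out of memory (see below).
Time to hand it over to the ops team…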
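Before escalating, the node check described above can also be scripted. The sketch below is one possible approach, assuming the YARN ResourceManager REST API is reachable at `/ws/v1/cluster/nodes` and that each node entry carries `state`, `nodeHostName`, and `availMemoryMB` fields. The ResourceManager address, the 1024 MB threshold, and the function names are hypothetical, and `availMemoryMB` reflects memory not yet allocated by YARN rather than free physical memory, so it is only a rough signal. Whether lost nodes appear in the default response may depend on the Hadoop version.

```python
import json
from urllib.request import urlopen


def fetch_nodes(rm_address):
    """Fetch the node list from the ResourceManager REST API.

    rm_address is a hypothetical "host:port" string, e.g. "rm-host:8088".
    """
    url = f"http://{rm_address}/ws/v1/cluster/nodes"
    with urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return (payload.get("nodes") or {}).get("node", [])


def suspicious_nodes(nodes, min_avail_mb=1024):
    """Return (host, state, availMemoryMB) for nodes that are not RUNNING
    or whose unallocated memory is below min_avail_mb."""
    flagged = []
    for node in nodes:
        state = node.get("state")
        avail = node.get("availMemoryMB", 0)
        if state != "RUNNING" or avail < min_avail_mb:
            flagged.append((node.get("nodeHostName"), state, avail))
    return flagged


# Offline example with made-up node entries (no cluster needed).
example = [
    {"nodeHostName": "worker-1", "state": "RUNNING", "availMemoryMB": 8192},
    {"nodeHostName": "worker-2", "state": "LOST", "availMemoryMB": 0},
    {"nodeHostName": "worker-3", "state": "RUNNING", "availMemoryMB": 256},
]
for host, state, avail in suspicious_nodes(example):
    print(host, state, avail)
```

Against a real cluster, the call would look like `suspicious_nodes(fetch_nodes("rm-host:8088"))`, with the address replaced by the actual ResourceManager web address.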