Contents

  • Background
  • Checking the logs on YARN

Background

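A Flink job running on YARN in cluster mode suddenly failed after it had been running for some time.
Searching the log for the exception turned up "Container released on a *lost* node".
The full error output follows.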
2021-12-24 09:43:43,931 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (6/24) (1af69dcda39c4eb77cc0af944aa33f70) switched from RUNNING to FAILED.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-12-24 09:43:43,944 WARN  org.apache.flink.yarn.YarnResourceManager                     - Discard registration from TaskExecutor container_e12_1621343092868_870015_01_000013 at (akka.tcp://flink@ip:42538/user/taskmanager_0) because the framework did not recognize it
2021-12-24 09:43:43,995 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_5.
2021-12-24 09:43:44,012 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_5.
2021-12-24 09:43:44,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job t5_parse_classify (e1b87d6b42299c0fc45dc648aa0e20cc) switched from state RUNNING to RESTARTING.
2021-12-24 09:43:44,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Custom Source (6/24) (542a4aa5761420c0bdf76cfe60a6d607) switched from RUNNING to CANCELING.
2021-12-24 09:43:44,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Discarding the results produced by task execution 1af69dcda39c4eb77cc0af944aa33f70.
2021-12-24 09:43:44,073 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Custom Source (6/24) (542a4aa5761420c0bdf76cfe60a6d607) switched from CANCELING to CANCELED.
2021-12-24 09:43:44,073 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Discarding the results produced by task execution 542a4aa5761420c0bdf76cfe60a6d607.
2021-12-24 09:43:44,075 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Discarding the results produced by task execution 542a4aa5761420c0bdf76cfe60a6d607.
2021-12-24 09:43:44,080 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (21/24) (0f92112ca6dc6e8b12d6375006a2269c) switched from RUNNING to FAILED.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-12-24 09:43:44,081 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_20.
2021-12-24 09:43:44,081 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_20.
2021-12-24 09:43:44,081 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Custom Source (21/24) (812416954373cfd3b8e2ac0d6ab7609a) switched from RUNNING to CANCELING.
2021-12-24 09:43:44,081 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Discarding the results produced by task execution 0f92112ca6dc6e8b12d6375006a2269c.
2021-12-24 09:43:44,081 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Custom Source (21/24) (812416954373cfd3b8e2ac0d6ab7609a) switched from CANCELING to CANCELED.
2021-12-24 09:43:44,081 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Discarding the results produced by task execution 812416954373cfd3b8e2ac0d6ab7609a.
2021-12-24 09:43:44,081 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Discarding the results produced by task execution 812416954373cfd3b8e2ac0d6ab7609a.
2021-12-24 09:43:44,082 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (8/24) (493ea4995fe180b2dcc2965254973dd6) switched from RUNNING to FAILED.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-12-24 09:43:44,082 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_7.
2021-12-24 09:43:44,082 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_7.
2021-12-24 09:43:44,083 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job t5_parse_classify (e1b87d6b42299c0fc45dc648aa0e20cc) switched from state RESTARTING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy(FailureRateRestartBackoffTimeStrategy(failuresIntervalMS=300000,backoffTimeMS=10000,maxFailuresPerInterval=3)
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:484)
    at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1703)
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1252)
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1220)
    at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:955)
    at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
    at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
    at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:730)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:710)
    at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:541)
    at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:667)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:274)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    ... 22 more
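
The log shows three tasks switching to FAILED within about 150 ms, after which recovery is suppressed by a failure-rate strategy with maxFailuresPerInterval=3 and failuresIntervalMS=300000. The Python sketch below is a simplified model of that kind of sliding-window check, not Flink's actual implementation; the class and function names are made up for illustration, and only the three timestamps and the two parameters are taken from the log above.

```python
from collections import deque
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_millis(stamp):
    """Convert a log timestamp such as '2021-12-24 09:43:43,931' to integer milliseconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return (parsed - EPOCH) // timedelta(milliseconds=1)


class FailureRateModel:
    """Simplified sliding-window model of a failure-rate restart strategy (illustrative only)."""

    def __init__(self, max_failures_per_interval, failures_interval_ms):
        self.max_failures = max_failures_per_interval
        self.interval_ms = failures_interval_ms
        self.failures = deque()  # timestamps of the most recent failures

    def notify_failure(self, now_ms):
        # Keep only the latest max_failures timestamps.
        if len(self.failures) >= self.max_failures:
            self.failures.popleft()
        self.failures.append(now_ms)

    def can_restart(self, now_ms):
        # Restart is allowed until the window is full of recent failures.
        if len(self.failures) < self.max_failures:
            return True
        return now_ms - self.failures[0] > self.interval_ms


def replay(stamps, max_failures=3, interval_ms=300000):
    """Feed failure timestamps to the model and record whether a restart would be allowed."""
    model = FailureRateModel(max_failures, interval_ms)
    decisions = []
    for stamp in stamps:
        now = to_millis(stamp)
        model.notify_failure(now)
        decisions.append(model.can_restart(now))
    return decisions


# Timestamps of the three "switched from RUNNING to FAILED" lines in the log above.
FAILED_AT = [
    "2021-12-24 09:43:43,931",
    "2021-12-24 09:43:44,080",
    "2021-12-24 09:43:44,082",
]

print(replay(FAILED_AT))  # [True, True, False]
```

Under this model the third failure fills the window while the first one is still well inside the 300000 ms interval, which matches the job moving from RESTARTING to FAILING at 09:43:44,083. A single lost node that hosts several tasks can therefore exhaust the failure budget almost instantly.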

Checking the logs on YARN


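From the screenshot above, we can roughly infer that the failure was caused by a node being lost.
Checking the YARN nodes showed why: the node had run out of memory (see below).
Time to hand this over to the ops team…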
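
When escalating, it can help to say which nodes look unhealthy. Node state is available from the `yarn node -list -all` command or from the ResourceManager REST endpoint `/ws/v1/cluster/nodes`. The sketch below filters a response of that shape for nodes that are not RUNNING or are short on memory. The sample payload, the host names, and the 1024 MB threshold are hypothetical; the field names (`state`, `nodeHostName`, `availMemoryMB`) are assumed to match your Hadoop version and should be checked against its documentation.

```python
import json

# Hypothetical sample shaped like the ResourceManager's /ws/v1/cluster/nodes response.
SAMPLE_RESPONSE = """
{"nodes": {"node": [
  {"id": "worker-01:8041", "nodeHostName": "worker-01", "state": "RUNNING",
   "usedMemoryMB": 40960, "availMemoryMB": 24576},
  {"id": "worker-02:8041", "nodeHostName": "worker-02", "state": "RUNNING",
   "usedMemoryMB": 65536, "availMemoryMB": 0},
  {"id": "worker-03:8041", "nodeHostName": "worker-03", "state": "LOST",
   "usedMemoryMB": 0, "availMemoryMB": 0}
]}}
"""


def flag_nodes(payload, min_avail_mb=1024):
    """Return (host, reason) pairs for nodes that are not RUNNING or are low on memory."""
    nodes = (payload.get("nodes") or {}).get("node") or []
    flagged = []
    for node in nodes:
        host = node.get("nodeHostName", node.get("id"))
        state = node.get("state", "UNKNOWN")
        avail = node.get("availMemoryMB", 0)
        if state != "RUNNING":
            flagged.append((host, "state=" + state))
        elif avail < min_avail_mb:
            flagged.append((host, "availMemoryMB=" + str(avail)))
    return flagged


for host, reason in flag_nodes(json.loads(SAMPLE_RESPONSE)):
    print(host, reason)
```

In a real cluster the JSON would come from the ResourceManager rather than from the embedded sample; the filtering logic stays the same.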
