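Hadoop: jobs cannot be submitted while the cluster is idle
I. Problem description
When submitting MR jobs through Hive, I found that even with the queue idle, submitted applications never reached RUNNING and stayed in ACCEPTED. The logs show that the same error was already being reported on June 8 (see below):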
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1573631365527_158284,name=select count(*) from ...ult.hms_per5min_dual(Stage-1),user=root,queue=default,state=FINISHED,trackingUrl=http://stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc:8088/proxy/application_1573631365527_158284/,appMasterHost=stanlee-171-21-hzqsh.node.hzqsh.wacai.sdc,submitTime=1591575015466,startTime=1591575015494,finishTime=1591575030220,finalStatus=SUCCEEDED,memorySeconds=40709,vcoreSeconds=27,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=MAPREDUCE,resourceSeconds=40709 MB-seconds\, 27 vcore-seconds,preemptedResourceSeconds=0 MB-seconds\, 0 vcore-seconds
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000, removing app application_1573631365527_157284 from state store.
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1573631365527_157284 from memory:
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1573631365527_157284
2020-06-08 08:10:37,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
java.lang.NullPointerException
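To see how long the preemption-checker error has been recurring, the ERROR lines can be tallied per day. The following is a minimal sketch, not part of the original troubleshooting: the function name `count_preemption_errors` is hypothetical, and it assumes the log line format shown above, with the date as the first whitespace-separated field.

```shell
# Count SchedulingMonitor ERROR lines per calendar day.
# Reads ResourceManager log text on stdin and prints "<date> <count>" lines.
# Assumes each log line starts with a date such as 2020-06-08.
count_preemption_errors() {
  # -F treats the pattern as a fixed string, so the dots are literal.
  grep -F 'ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor' \
    | awk '{ n[$1]++ } END { for (d in n) print d, n[d] }' \
    | sort
}

# Usage sketch (not executed here; substitute the real ResourceManager log path):
#   count_preemption_errors < /path/to/resourcemanager.log
```

A day with many entries suggests the checker is failing on every run rather than occasionally.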
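II. Cause
The YARN cluster had plenty of resources, so insufficient memory was ruled out as the reason the application was pending.
The latest YARN logs showed that the state of expired applications and the related RM information had not been cleaned up in time. (This data is stored in memory or under a ZooKeeper directory; it is also used to keep the RM highly available and to prevent split-brain.)
The likely cause was that the amount of state and RM information saved in memory or in ZooKeeper for each submitted application had exceeded the ZooKeeper threshold.
I then checked the ZooKeeper-related settings in yarn-site.xml to verify this and found that the thresholds set there had been exceeded, so the state of the expired applications needed to be cleaned up.
1. Check the log (10.1.171.20)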
- vim /data/program/hadoop-3.0.0-cdh6.3.1/logs/hadoop-appweb-resourcemanager-stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc.log
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1573631365527_158284,name=select count(*) from ...ult.hms_per5min_dual(Stage-1),user=root,queue=default,state=FINISHED,trackingUrl=http://stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc:8088/proxy/application_1573631365527_158284/,appMasterHost=stanlee-171-21-hzqsh.node.hzqsh.wacai.sdc,submitTime=1591575015466,startTime=1591575015494,finishTime=1591575030220,finalStatus=SUCCEEDED,memorySeconds=40709,vcoreSeconds=27,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=MAPREDUCE,resourceSeconds=40709 MB-seconds\, 27 vcore-seconds,preemptedResourceSeconds=0 MB-seconds\, 0 vcore-seconds
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000, removing app application_1573631365527_157284 from state store.
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1573631365527_157284 from memory:
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1573631365527_157284
2020-06-08 08:10:37,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
java.lang.NullPointerException
2020-06-08 08:10:40,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker
2. Check the relevant yarn-site.xml configuration: the RM state for applications is stored in ZooKeeper
<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value></property>
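3. Check the thresholds reported in the log at the time of the error, and check the number of applications in ZooKeeper. The count had already exceeded the threshold of 1000, so I concluded that this was why jobs could not be submitted and set about clearing that state. The relevant log messages: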
Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000
Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000
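The checks in steps 1 to 3 can be scripted. The sketch below pulls the logged limits out of a ResourceManager log and reads a property from yarn-site.xml. Both function names are hypothetical. The two property names in the usage comment, `yarn.resourcemanager.max-completed-applications` and `yarn.resourcemanager.state-store.max-completed-applications`, are the YARN settings that, as far as I know, correspond to the `maxCompletedAppsInMemory` and `maxCompletedAppsInStateStore` values in the log; the original post does not name them, so treat that mapping as an assumption.

```shell
# Print the distinct completed-application limits found in ResourceManager log text.
# Reads log text on stdin; prints lines such as "maxCompletedAppsInMemory = 1000".
extract_completed_app_limits() {
  grep -o 'maxCompletedApps[A-Za-z]* = [0-9]*' | sort -u
}

# Print the value of one property from a Hadoop-style XML config file.
# $1 = property name, $2 = path to the XML file.
# Newlines are stripped first so that single-line and multi-line <property>
# blocks both match. The name is used as a regex, so "." matches any character.
get_hadoop_property() {
  tr -d '\n' < "$2" \
    | grep -o "<name>[[:space:]]*$1[[:space:]]*</name>[[:space:]]*<value>[^<]*</value>" \
    | sed -e 's/.*<value>//' -e 's/<\/value>.*//'
}

# Usage sketch (not executed here; paths are placeholders):
#   extract_completed_app_limits < /path/to/resourcemanager.log
#   get_hadoop_property yarn.resourcemanager.store.class /path/to/yarn-site.xml
#   get_hadoop_property yarn.resourcemanager.max-completed-applications /path/to/yarn-site.xml
#   get_hadoop_property yarn.resourcemanager.state-store.max-completed-applications /path/to/yarn-site.xml
```

If a property is not set in yarn-site.xml, `get_hadoop_property` prints nothing and the cluster is using the built-in default.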
III. Solution
1. Log in with zkCli.sh and count the applications under the ZKRMStateStore directory
$ echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh | grep application_ | awk -F , '{print NF}'
1002
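2. Use a script to clean up the invalid and expired applications: prefix each expired application's state node with the ZooKeeper delete command, then delete them in a batch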
echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh | grep application_ | while read item; do echo ${item#*[}; done | while read item; do echo ${item%*]}; done | awk -F ', ' '{ for (i=1;i<=NF;i++) printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n",$i}' > /tmp/deleteNode.txt
3. Run the batch delete
cat /tmp/deleteNode.txt | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh
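The three steps above can also be written as two small functions, so the list of application IDs can be counted and reviewed before anything is deleted. This is a sketch under assumptions, not the original script: the function names and the `RM_APP_ROOT` variable are hypothetical, and it assumes zkCli.sh prints the children as a bracketed, comma-separated list and accepts `rmr`, as in the commands above.

```shell
# Path of the application znodes, as used in the commands above.
RM_APP_ROOT=/rmstore/ZKRMStateRoot/RMAppRoot

# Read zkCli.sh "ls" output on stdin and print one application ID per line.
# Matching the ID pattern directly avoids stripping "[" and "]" by hand.
list_rm_apps() {
  grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' | sort -u
}

# Read application IDs on stdin and print one zkCli "rmr" command per ID.
build_rmr_commands() {
  while read -r app_id; do
    if [ -n "$app_id" ]; then
      printf 'rmr %s/%s\n' "$RM_APP_ROOT" "$app_id"
    fi
  done
}

# Usage sketch (not executed here; assumes zkCli.sh is on PATH):
#   echo "ls $RM_APP_ROOT" | zkCli.sh | list_rm_apps > /tmp/apps.txt
#   wc -l < /tmp/apps.txt                        # how many applications are stored
#   build_rmr_commands < /tmp/apps.txt > /tmp/deleteNode.txt
#   cat /tmp/deleteNode.txt | zkCli.sh           # run only after reviewing the file
```

Because the `ls` output contains every znode under RMAppRoot, not only expired ones, it may be worth removing the IDs of applications that are still running from /tmp/apps.txt before generating the delete commands.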
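IV. After the expired application state was cleaned up, the Hive SQL was resubmitted and the submission succeeded
V. Summary
When a problem occurs, use Hadoop's thorough logging to locate it, or draw on past experience together with the relevant log messages, and then decide on the direction of the fix.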