Contents

Symptom

Investigation

Analysis:

Solution:

Commands:


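Symptom

Today the NameNodes in the test environment exited after an excessively long GC pause. After they were restarted one after another, no active node could be elected.

Investigation

  1. The logs contained no entries related to the ZooKeeper election.
  2. The zkfc process's log had stopped several hours before the problem occurred.
    1. Relevant log: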
2019-08-26 10:47:57,925 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at namenodetest02.bi.10101111.com/10.104.104.128:9001: Call From namenodetest02.bi/10.104.104.128 to namenodetest02.bi.10101111.com:9001 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-08-26 10:47:59,947 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodetest02.bi.10101111.com/10.104.104.128:9001. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2019-08-26 10:47:59,964 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at namenodetest02.bi.10101111.com/10.104.104.128:9001: Call From namenodetest02.bi/10.104.104.128 to namenodetest02.bi.10101111.com:9001 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-08-26 10:48:04,951 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2019-08-26 10:48:04,965 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at namenodetest02.bi.10101111.com/10.104.104.128:9001 entered state: SERVICE_HEALTHY
2019-08-26 10:48:05,174 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.101.22.31:5181,10.104.108.87:5181,10.104.108.88:5181 sessionTimeout=60000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@726d9b25
2019-08-26 10:48:05,963 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.101.22.31/10.101.22.31:5181. Will not attempt to authenticate using SASL (unknown error)
2019-08-26 10:48:09,270 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2019-08-26 10:48:09,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.104.108.88/10.104.108.88:5181. Will not attempt to authenticate using SASL (unknown error)
2019-08-26 10:48:09,839 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.104.108.88/10.104.108.88:5181, initiating session
2019-08-26 10:48:10,281 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.104.108.88/10.104.108.88:5181, sessionid = 0x26c9ac7569130e4, negotiated timeout = 60000
2019-08-26 10:48:10,593 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2019-08-26 10:48:10,670 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2019-08-26 10:48:10,681 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0e6861646f6f7032636c757374657212036e6e311a1e6e616d656e6f64657465737430312e62692e31303130313131312e636f6d20a94628d33e
2019-08-26 10:48:11,508 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at namenodetest01.bi.10101111.com/10.104.104.127:9001
2019-08-26 10:48:12,941 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodetest01.bi.10101111.com/10.104.104.127:9001. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2019-08-26 10:48:12,957 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at namenodetest01.bi.10101111.com/10.104.104.127:9001 standby (unable to connect)
java.net.ConnectException: Call From namenodetest02.bi/10.104.104.128 to namenodetest01.bi.10101111.com:9001 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
    at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 14 more
2019-08-26 10:48:13,056 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2019-08-26 10:48:13,062 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2019-08-26 10:48:13,428 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to namenodetest01.bi.10101111.com...
2019-08-26 10:48:13,483 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to namenodetest01.bi.10101111.com port 22
2019-08-26 10:48:13,904 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established
2019-08-26 10:48:14,170 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string: SSH-2.0-OpenSSH_5.3
2019-08-26 10:48:14,170 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string: SSH-2.0-JSCH-0.1.42
2019-08-26 10:48:14,170 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers: aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2019-08-26 10:48:14,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-ctr is not available.
2019-08-26 10:48:14,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-ctr is not available.
2019-08-26 10:48:14,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-cbc is not available.
2019-08-26 10:48:14,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-cbc is not available.
2019-08-26 10:48:14,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: arcfour256 is not available.
2019-08-26 10:48:14,911 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT sent
2019-08-26 10:48:14,911 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT received
2019-08-26 10:48:14,919 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: server->client aes128-ctr hmac-md5 none
2019-08-26 10:48:14,919 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: client->server aes128-ctr hmac-md5 none
2019-08-26 10:48:15,344 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXDH_INIT sent
2019-08-26 10:48:15,344 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: expecting SSH_MSG_KEXDH_REPLY
2019-08-26 10:48:15,407 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: ssh_rsa_verify: signature true
2019-08-26 10:48:15,465 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Permanently added 'namenodetest01.bi.10101111.com' (RSA) to the list of known hosts.
2019-08-26 10:48:15,465 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS sent
2019-08-26 10:48:15,465 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS received
2019-08-26 10:48:15,496 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_REQUEST sent
2019-08-26 10:48:15,496 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
2019-08-26 10:48:15,497 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: publickey,keyboard-interactive,password
2019-08-26 10:48:15,497 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: publickey
2019-08-26 10:48:15,931 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded (publickey).
2019-08-26 10:48:15,967 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to namenodetest01.bi.10101111.com
2019-08-26 10:48:15,967 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 9001

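  3. This fits with earlier incidents in which the NameNode exited for lack of memory, and in which other processes exited after the NN heap size was increased.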
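
The `Old node exists: 0a0e6861...` line in the zkfc log carries the raw bytes that the previous active NameNode stored in ZooKeeper. They look like a small protobuf message, so a minimal wire-format decoder is enough to read them without any Hadoop libraries. The sketch below handles only the two wire types that occur in this payload. The meaning attached to each field number (nameservice ID, NameNode ID, hostname, RPC port, zkfc port) is an assumption based on Hadoop's `ActiveNodeInfo` message and should be checked against the Hadoop version in use.

```python
# Hex string copied from the "Old node exists:" log line.
OLD_NODE_HEX = (
    "0a0e6861646f6f7032636c757374657212036e6e311a1e6e616d656e6f6465"
    "7465737430312e62692e31303130313131312e636f6d20a94628d33e"
)


def read_varint(buf, pos):
    """Read one protobuf base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_fields(buf):
    """Decode a flat protobuf message into {field_number: value}.

    Only wire type 0 (varint) and wire type 2 (length-delimited, read as
    UTF-8 text) are supported, which is all this payload contains.
    """
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[number] = value
    return fields


if __name__ == "__main__":
    for number, value in sorted(decode_fields(bytes.fromhex(OLD_NODE_HEX)).items()):
        print(number, value)
```

For the line above this yields `hadoop2cluster`, `nn1`, `namenodetest01.bi.10101111.com`, `9001` and `8019` for fields 1 to 5, which agrees with the `Should fence: NameNode at namenodetest01.bi.10101111.com/10.104.104.127:9001` entry that follows it in the log.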

Analysis:

The zkfc process probably hung because the NameNode host ran short of memory, so an active node could not be elected.
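
A hung zkfc shows up as a log file that has stopped advancing, which is how it was noticed here. As a rough sketch, that check can be automated by parsing the last timestamp in the log and comparing it with the current time. The 10-minute threshold and the function names are illustrative assumptions, not Hadoop settings.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the log lines above, e.g. "2019-08-26 10:48:15,967".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
LOG_TS_LENGTH = 23


def last_log_timestamp(lines):
    """Return the timestamp of the last line that starts with one, or None."""
    last = None
    for line in lines:
        try:
            last = datetime.strptime(line[:LOG_TS_LENGTH], LOG_TS_FORMAT)
        except ValueError:
            # Stack-trace lines and other continuation lines carry no timestamp.
            continue
    return last


def is_stale(lines, now, max_age=timedelta(minutes=10)):
    """True if the log has no timestamped entry newer than max_age (assumed threshold)."""
    last = last_log_timestamp(lines)
    return last is None or now - last > max_age


if __name__ == "__main__":
    sample = [
        "2019-08-26 10:48:15,967 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to namenodetest01.bi.10101111.com",
        "2019-08-26 10:48:15,967 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 9001",
    ]
    print(is_stale(sample, now=datetime.now()))
```

In practice the lines would come from the zkfc log file on each NameNode host; its path depends on the installation and is not given in the original notes.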

Solution:

Restart the zkfc process.

Commands:

hadoop-daemon.sh stop zkfc
hadoop-daemon.sh start zkfc

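Follow-up: the same symptom again

The test environment again ended up with both NameNodes in standby. Restarting zkfc on each of the two nodes did not get either NameNode to switch to active.

Solution 2:

Stop the NN on one node, then restart zkfc on the other node.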
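
To confirm the both-standby condition before applying this, and to verify the result afterwards, the HA state of each NameNode can be queried with `hdfs haadmin -getServiceState`. The sketch below wraps that command. `nn1` matches the NameNode ID that appears in the ZooKeeper data logged earlier, while `nn2` is an assumed ID for the second node, and the `run` parameter exists only so the function can be exercised without a cluster.

```python
import subprocess


def get_ha_states(namenode_ids, run=subprocess.run):
    """Query each NameNode's HA state via `hdfs haadmin -getServiceState`.

    Returns {namenode_id: state}. A non-zero exit code (for example when the
    NameNode is down) is reported as "unreachable".
    """
    states = {}
    for nn_id in namenode_ids:
        proc = run(
            ["hdfs", "haadmin", "-getServiceState", nn_id],
            capture_output=True,
            text=True,
        )
        states[nn_id] = proc.stdout.strip() if proc.returncode == 0 else "unreachable"
    return states


def both_standby(states):
    """True when every queried NameNode reports standby."""
    return bool(states) and all(state == "standby" for state in states.values())


if __name__ == "__main__":
    # "nn2" is an assumption; use the IDs configured in dfs.ha.namenodes.<nameservice>.
    current = get_ha_states(["nn1", "nn2"])
    print(current, "both standby:", both_standby(current))
```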

Related error messages:

pported in state standby
2019-10-17 17:06:02,866 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode namenodetest01.bi.10101111.com/10.
2019-10-17 17:06:02,871 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1352)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6369)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:989)
    at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
    at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
--
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)
    at org.apache.hadoop.ipc.Client.call(Client.java:1469)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:148)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-17 17:06:03,881 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user hadoop

The root cause has not been found yet.
