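After the HA cluster was configured and started, the NameNode would not stay up. Right after startup, jps showed the NameNode process, but a minute or two later it was gone. With the JournalNodes stopped, the NameNode ran normally; once the JournalNodes were started, the NameNode died shortly afterwards. The NameNode log showed the following errors: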

2019-10-18 15:32:36,835 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:37,263 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [xxx1:8485, xxx2:8485, xxx3:8485]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1463)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1487)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:212)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 15:32:37,267 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2019-10-18 15:32:37,267 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 15:32:46,837 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:46,838 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:56,841 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)

Cause:

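When start-dfs.sh runs, the default startup order is namenode > datanode > journalnode > zkfc. If the JournalNodes and the NameNode do not run on the same machine, network latency can easily keep the NameNode from connecting to the JournalNodes, so the election cannot complete and the freshly started NameNode suddenly dies. The NameNode does retry while it waits for the JournalNodes to come up, but the number of retries is limited; on a poor network the retries can run out before a connection succeeds.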
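The dependency described above can be checked by hand before starting the NameNode. The following is a minimal Python sketch, not part of Hadoop: it probes each JournalNode's RPC port (8485 in the log above) and reports whether a strict majority accepts TCP connections. The function names `is_reachable` and `quorum_reachable` are hypothetical, and a successful TCP connect only shows that the port is open, not that the JournalNode is healthy.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def quorum_reachable(journalnodes, timeout=3.0):
    """Return True if a strict majority of (host, port) pairs accept connections."""
    up = sum(1 for host, port in journalnodes if is_reachable(host, port, timeout))
    return up >= len(journalnodes) // 2 + 1
```

A call might look like `quorum_reachable([("jn1", 8485), ("jn2", 8485), ("jn3", 8485)])`, where `jn1` to `jn3` are placeholder host names to be replaced with the cluster's own.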

Solutions:

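1. Start the NameNodes manually, which avoids waiting for the JournalNodes across network delays. Once both NameNodes have connected to the JournalNodes and the election has completed, the failure should no longer occur. (Tried; no effect.)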

2. Start the JournalNodes first, then run start-dfs.sh. (Tried; no effect.)

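3. Raise the NameNode's retry count or retry interval for JournalNode connections so that normal startup delays and network latency are tolerated. This is done by changing the IPC parameters in hdfs-site.xml. By default the NameNode retries the connection to a JournalNode 10 times, 1000 ms apart, so these values need to be increased on a poor network. (Tried; no effect.)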

The specific changes are:

<property>
  <name>ipc.client.connect.max.retries</name>
  <value>100</value>
  <description>Indicates the number of retries a client will make to establish a server connection.</description>
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>10000</value>
  <description>Indicates the number of milliseconds a client will wait for before retrying to establish a server connection.</description>
</property>
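As a rough check of what these values buy, the time a client spends waiting between attempts is about the retry count multiplied by the retry interval. The Python sketch below compares the defaults with the values set above; the helper name `retry_window_seconds` is hypothetical, and the figure ignores the time each connection attempt itself takes, so real waits can be longer.

```python
def retry_window_seconds(max_retries, retry_interval_ms):
    """Approximate total time spent sleeping between connection retries."""
    return max_retries * retry_interval_ms / 1000.0


default_window = retry_window_seconds(10, 1000)    # default values
tuned_window = retry_window_seconds(100, 10000)    # values from the properties above
print(default_window, tuned_window)  # prints 10.0 1000.0
```

With the defaults the client gives up after roughly 10 seconds of waiting; with the tuned values it keeps trying for well over 16 minutes.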

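4. Some sources suggest the problem may lie in the /etc/hosts file. On checking, some machines did turn out to have incomplete entries, so I corrected and completed them.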
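A check like this is easy to script. The Python sketch below parses text in /etc/hosts format and lists the expected host names that have no entry; the function names and the host names `node1` to `node3` are hypothetical, and only the two IP addresses are taken from the logs in this post.

```python
def parse_hosts(text):
    """Map each host name or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


def missing_hosts(text, expected):
    """Return the expected host names that have no entry in the hosts text."""
    known = parse_hosts(text)
    return [name for name in expected if name not in known]


sample = """
127.0.0.1    localhost
192.168.80.4 node1   # hypothetical host names
192.168.80.5 node2
"""
print(missing_hosts(sample, ["node1", "node2", "node3"]))  # prints ['node3']
```

On a real machine the text would come from reading /etc/hosts, and the check would be run on every node, since each node needs entries for all the others.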

After all of the above, the connection errors no longer appeared, but the following error showed up instead:

2019-10-18 16:41:41,800 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at /xxx:9000 every 120 seconds.
2019-10-18 16:41:41,813 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at http://xxx:50070
Serving checkpoints at http://xxx:50070
2019-10-18 16:41:41,964 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2019-10-18 16:41:41,965 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 16:41:42,072 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2019-10-18 16:41:42,097 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2019-10-18 16:41:42,161 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 20
2019-10-18 16:41:42,161 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning recovery of unclosed segment starting at txid 1
2019-10-18 16:41:42,225 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare phase complete. Responses:
192.168.80.5:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 3 lastCommittedTxId: 8
192.168.80.4:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 3 lastCommittedTxId: 8
2019-10-18 16:41:42,229 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Using longest log: xxx:8485=segmentState {
  startTxId: 1
  endTxId: 1
  isInProgress: true
}
lastWriterEpoch: 3
lastCommittedTxId: 8
2019-10-18 16:41:42,230 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [xxx1:8485, xxx2:8485, xxx3:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1
endTxId: 1
isInProgress: true
 but logger 192.168.80.5:8485 had seen txid 8 committed
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:455)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1394)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1149)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1655)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1533)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1246)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2019-10-18 16:41:42,233 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2019-10-18 16:41:42,235 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
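The FATAL entry in this log says that the segment chosen for recovery ends at txid 1, while a JournalNode has already recorded txid 8 as committed. The Python sketch below flags that condition from the "Recovery prepare phase complete" response lines. It is only an illustration of the mismatch the error message describes, not Hadoop code; the regular expression and the function name `inconsistent_nodes` are hypothetical and assume the single-line response format shown in this log.

```python
import re

# Matches one JournalNode response line from the recovery prepare phase.
RESPONSE = re.compile(
    r"^(?P<node>\S+:\d+): segmentState \{ startTxId: (?P<start>\d+) "
    r"endTxId: (?P<end>\d+) isInProgress: (?P<in_progress>true|false) \} "
    r"lastWriterEpoch: (?P<epoch>\d+) lastCommittedTxId: (?P<committed>\d+)"
)


def inconsistent_nodes(log_lines):
    """Return (node, endTxId, lastCommittedTxId) for segments that end before the committed txid."""
    found = []
    for line in log_lines:
        m = RESPONSE.match(line.strip())
        if m and int(m["end"]) < int(m["committed"]):
            found.append((m["node"], int(m["end"]), int(m["committed"])))
    return found


lines = [
    "192.168.80.5:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 3 lastCommittedTxId: 8",
    "192.168.80.4:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 3 lastCommittedTxId: 8",
]
print(inconsistent_nodes(lines))
```

Both JournalNodes in this log are reported, since each holds an in-progress segment ending at txid 1 alongside a committed txid of 8.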

Solution:

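Run the following command on the NameNode. It re-initializes the shared edits storage (here, the JournalNodes) from the NameNode's local edit log: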

hdfs namenode -initializeSharedEdits

Then restart the cluster. After that, the NameNode ran normally.

References:

https://my.oschina.net/tearsky/blog/631038

https://blog.csdn.net/u012425536/article/details/79216300
