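Fixing a NameNode startup failure
After the HA cluster was configured and started, the NameNode would not stay up. Right after startup, jps showed the NameNode process, but a minute or two later it was gone. Without the JournalNodes running, the NameNode ran normally; as soon as the JournalNodes were started, the NameNode died shortly afterwards. The NameNode log showed the following errors: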
2019-10-18 15:32:36,835 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:37,263 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [xxx1:8485, xxx2:8485, xxx3:8485]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1463)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1487)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:212)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 15:32:37,267 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2019-10-18 15:32:37,267 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 15:32:46,837 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:46,838 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-10-18 15:32:56,841 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-xxx/xxx:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
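Cause:
When start-dfs.sh runs, the default start order is namenode > datanode > journalnode > zkfc. If the JournalNodes and the NameNode are not on the same machine, network latency can easily keep the NameNode from connecting to the JournalNodes, so no active NameNode can be elected and the freshly started NameNode suddenly dies. The NameNode does retry while it waits for the JournalNodes to come up, but the number of retries is limited, and on a poor network the retries may run out before a connection succeeds.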
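To put numbers on the retry limit, here is a minimal Python sketch that computes how long a fixed-sleep retry policy keeps trying. It uses only values printed in the log above (maxRetries=100, sleepTime=10000 MILLISECONDS, and the 20000ms quorum wait). The function name is made up for illustration, and the calculation counts only the sleep time between attempts, not the time each connection attempt itself takes.

```python
def retry_budget_seconds(max_retries: int, sleep_ms: int) -> float:
    """Upper bound on the time spent sleeping between attempts
    under a fixed-sleep retry policy (hypothetical helper)."""
    return max_retries * sleep_ms / 1000.0


# Values taken from the retry policy printed in the log above.
ipc_budget_s = retry_budget_seconds(100, 10_000)

# The IOException in the same log reports a separate 20000 ms quorum wait.
quorum_wait_s = 20_000 / 1000.0

print(f"IPC retry budget: {ipc_budget_s:.0f} s")
print(f"Quorum wait:      {quorum_wait_s:.0f} s")
print("Quorum wait expires first:", quorum_wait_s < ipc_budget_s)
```

In this log the quorum wait is far shorter than the connection retry budget, so the "Timed out waiting 20000ms" warning can appear while the IPC client is still only a few retries in.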
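Solutions:
1. Start the NameNodes manually, which skips waiting for the JournalNodes over a slow network. Once both NameNodes have connected to the JournalNodes and the election has completed, the failure should not occur. (Tried, no effect.)
2. Start the JournalNodes first, then run start-dfs.sh. (Tried, no effect.)
3. Raise the NameNode's retry count or retry interval for the JournalNodes so that normal startup delays and network latency are tolerated. Change the ipc parameters in hdfs-site.xml that control how many times the NameNode retries its connection to a JournalNode. The defaults are 10 retries at 1000 ms each, so they need to be increased on a poor network. (These two properties are defined in core-default.xml, so core-site.xml is their usual home.) (Tried, no effect.)
The specific changes are: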
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>100</value>
  <description>Indicates the number of retries a client will make to establish a server connection.</description>
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>10000</value>
  <description>Indicates the number of milliseconds a client will wait for before retrying to establish a server connection.</description>
</property>
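4. Some sources suggest the /etc/hosts file may be the problem. On checking, some machines did have incomplete entries, so the missing entries were added.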
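Incomplete /etc/hosts entries are easy to miss by eye, so a quick check run from each NameNode host can help. The following is a rough Python sketch, not a Hadoop tool: the function names are hypothetical, port 8485 is the JournalNode port that appears in the logs, and the host names you pass in should be the ones from your own dfs.namenode.shared.edits.dir setting.

```python
import socket


def check_journalnode(host: str, port: int = 8485, timeout: float = 3.0) -> str:
    """Return 'ok', 'unresolved' or 'unreachable' for one JournalNode address."""
    try:
        # Fails when the name is missing from /etc/hosts and DNS.
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "unresolved"
    try:
        # The name resolves; now see whether anything accepts connections there.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "unreachable"


def check_all(hosts, port: int = 8485) -> dict:
    """Check every JournalNode host and return a host -> status mapping."""
    return {host: check_journalnode(host, port) for host in hosts}
```

Calling check_all with the three JournalNode host names on every NameNode machine gives a quick picture: "unresolved" points at a hosts or DNS gap on that machine, while "unreachable" points at a JournalNode that is not running or a blocked port.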
After all of the above, the connection errors were gone, but a new error appeared:
2019-10-18 16:41:41,800 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at /xxx:9000 every 120 seconds.
2019-10-18 16:41:41,813 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at http://xxx:50070
Serving checkpoints at http://xxx:50070
2019-10-18 16:41:41,964 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2019-10-18 16:41:41,965 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-18 16:41:42,072 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2019-10-18 16:41:42,097 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2019-10-18 16:41:42,161 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 20
2019-10-18 16:41:42,161 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning recovery of unclosed segment starting at txid 1
2019-10-18 16:41:42,225 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare phase complete. Responses:
192.168.80.5:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 3 lastCommittedTxId: 8
192.168.80.4:8485: segmentState { startTxId: 1 endTxId: 1 isInProgress: true } lastWriterEpoch: 3 lastCommittedTxId: 8
2019-10-18 16:41:42,229 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Using longest log: xxx:8485=segmentState {
  startTxId: 1
  endTxId: 1
  isInProgress: true
}
lastWriterEpoch: 3
lastCommittedTxId: 8
2019-10-18 16:41:42,230 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [xxx1:8485, xxx2:8485, xxx3:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 1
endTxId: 1
isInProgress: true
 but logger 192.168.80.5:8485 had seen txid 8 committed
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:336)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:455)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1394)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1149)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1655)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1533)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1246)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2019-10-18 16:41:42,233 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2019-10-18 16:41:42,235 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
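The AssertionError in the log above boils down to a consistency check: the segment chosen for recovery ends at txid 1, yet the JournalNodes report that txid 8 was already committed. The Python sketch below replays that check with the two responses from the log. It is a simplified illustration, not Hadoop's code: the class and function names are invented, and picking the segment with the largest endTxId stands in for the real selection logic.

```python
from dataclasses import dataclass


@dataclass
class PrepareResponse:
    """One JournalNode's answer in the recovery prepare phase (illustrative)."""
    logger: str
    start_txid: int
    end_txid: int
    last_committed_txid: int


def check_recovery(responses):
    """Pick the longest segment, then verify that no logger has seen
    a committed txid beyond the end of that segment."""
    chosen = max(responses, key=lambda r: r.end_txid)
    for r in responses:
        if r.last_committed_txid > chosen.end_txid:
            raise AssertionError(
                f"Decided to synchronize log to txid {chosen.start_txid}-{chosen.end_txid} "
                f"but logger {r.logger} had seen txid {r.last_committed_txid} committed"
            )
    return chosen


# The two responses printed in the log above.
responses = [
    PrepareResponse("192.168.80.5:8485", start_txid=1, end_txid=1, last_committed_txid=8),
    PrepareResponse("192.168.80.4:8485", start_txid=1, end_txid=1, last_committed_txid=8),
]

try:
    check_recovery(responses)
except AssertionError as err:
    print("recovery refused:", err)
```

Read this way, the JournalNodes hold an edit segment that is shorter than what they recorded as committed, which suggests the shared edits on the JournalNodes were out of step with the NameNode's own metadata.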
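Solution:
Run the following command on the NameNode. It initializes the JournalNodes with the edits data from the local NameNode edits directories:
hdfs namenode -initializeSharedEdits
Then restart the cluster. After that, the NameNode ran normally.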
References:
https://my.oschina.net/tearsky/blog/631038?p={{currentPage+1}}
https://blog.csdn.net/u012425536/article/details/79216300