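While setting up a Hadoop HA cluster with QJM recently, I ran into the problem named in this post's title.

There are many blog posts about HA online, but the best reference is really the official documentation, which is already very detailed. So I will not repeat the HA setup here.

This post gives a fix for the org.apache.hadoop.ipc.Client: Retrying connect to server error.

I searched for the error message and did not find a single post that solved it. I am writing this as a note to myself, and as a hint for anyone who hits the same problem.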

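I. Problem Description

HA was configured as planned, but after startup the NameNode did not stay up. Right after startup, jps showed the NameNode process, but a minute or two later it was gone.

After some testing, I found the following two cases:

1) Start the JournalNodes first and then start HDFS: the NameNode starts and runs normally.

2) Start everything with start-dfs.sh: all the services come up, but the NameNode exits after about two minutes. Starting it again on its own with hadoop-daemon.sh start namenode succeeds, and the NameNode then runs stably.

Now look at the NameNode log. Don't be put off by its length; all the clues to the failure are in it: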

2016-03-09 10:50:27,123 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = node1/192.168.56.201

STARTUP_MSG:   args = []

STARTUP_MSG:   version = 2.5.1

STARTUP_MSG:   build = Unknown -r Unknown; compiled by 'root' on 2014-10-20T05:53Z

STARTUP_MSG:   java = 1.7.0_09

************************************************************/

2016-03-09 10:50:27,132 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]

2016-03-09 10:50:27,138 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode []

2016-03-09 10:50:27,465 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties

2016-03-09 10:50:27,623 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).

2016-03-09 10:50:27,623 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started

2016-03-09 10:50:27,625 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is hdfs://hadoopha

2016-03-09 10:50:27,626 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use hadoopha to access this namenode/service.

2016-03-09 10:50:28,048 INFO org.apache.hadoop.hdfs.DFSUtil: Starting web server as: ${dfs.web.authentication.kerberos.principal}

2016-03-09 10:50:28,048 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://node1:50070

2016-03-09 10:50:28,121 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog

2016-03-09 10:50:28,128 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined

2016-03-09 10:50:28,145 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)

2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context hdfs

2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static

2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs

2016-03-09 10:50:28,209 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)

2016-03-09 10:50:28,211 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/*

2016-03-09 10:50:28,268 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 50070

2016-03-09 10:50:28,269 INFO org.mortbay.log: jetty-6.1.26

2016-03-09 10:50:28,580 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: 'signature.secret' configuration not set, using a random value as secret

2016-03-09 10:50:28,648 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@node1:50070

2016-03-09 10:50:28,687 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!

2016-03-09 10:50:28,741 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsLock is fair:true

2016-03-09 10:50:28,802 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000

2016-03-09 10:50:28,802 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true

2016-03-09 10:50:28,805 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000

2016-03-09 10:50:28,807 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: The block deletion will start around 2016 Mar 09 10:50:28

2016-03-09 10:50:28,810 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlocksMap

2016-03-09 10:50:28,810 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:28,813 INFO org.apache.hadoop.util.GSet: 2.0% max memory 966.7 MB = 19.3 MB

2016-03-09 10:50:28,813 INFO org.apache.hadoop.util.GSet: capacity      = 2^21 = 2097152 entries

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=false

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication         = 3

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication             = 512

2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication             = 1

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams      = 2

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: encryptDataTransfer        = false

2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxNumBlocksToLog          = 1000

2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)

2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup          = supergroup

2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled = true

2016-03-09 10:50:28,865 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Determined nameservice ID: hadoopha

2016-03-09 10:50:28,865 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: HA Enabled: true

2016-03-09 10:50:28,866 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: 1.0% max memory 966.7 MB = 9.7 MB

2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: capacity      = 2^20 = 1048576 entries

2016-03-09 10:50:29,174 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: Computing capacity for map cachedBlocks

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: 0.25% max memory 966.7 MB = 2.4 MB

2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: capacity      = 2^18 = 262144 entries

2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033

2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0

2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000

2016-03-09 10:50:29,190 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache on namenode is enabled

2016-03-09 10:50:29,190 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: Computing capacity for map NameNodeRetryCache

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB

2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: capacity      = 2^15 = 32768 entries

2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: ACLs enabled? false

2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: XAttrs enabled? true

2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: Maximum size of an xattr: 16384

2016-03-09 10:50:29,208 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/hadoop/tmp/dfs/name/in_use.lock acquired by nodename 4394@node1

2016-03-09 10:50:29,610 WARN org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded

2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:31,054 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:31,054 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:32,055 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

(N repeated lines omitted here)

2016-03-09 10:50:35,807 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

(N repeated lines omitted here)

2016-03-09 10:50:39,812 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:40,069 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485]. Skipping.

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:

192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)

at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)

at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)

at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:260)

at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1430)

at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1450)

at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:636)

at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:279)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:955)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:700)

at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)

at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1407)

at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1473)

2016-03-09 10:50:40,071 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: No edit log streams selected.

2016-03-09 10:50:40,116 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 1 INodes.

2016-03-09 10:50:40,174 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 0 seconds.

2016-03-09 10:50:40,174 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 0 from /home/hadoop/hadoop/tmp/dfs/name/current/fsimage_0000000000000000000

2016-03-09 10:50:40,184 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)

2016-03-09 10:50:40,185 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups

2016-03-09 10:50:40,185 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 10986 msecs

2016-03-09 10:50:40,408 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to node1:8020

2016-03-09 10:50:40,414 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue

2016-03-09 10:50:40,429 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020

2016-03-09 10:50:40,461 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean

2016-03-09 10:50:40,474 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of blocks under construction: 0

2016-03-09 10:50:40,474 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of blocks under construction: 0

2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 11 secs

2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes

2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks

2016-03-09 10:50:40,536 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting

2016-03-09 10:50:40,539 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting

2016-03-09 10:50:40,542 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode RPC up at: node1/192.168.56.201:8020

2016-03-09 10:50:40,542 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state

2016-03-09 10:50:40,545 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at node5/192.168.56.205:8020 every 120 seconds.

2016-03-09 10:50:40,550 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...

Checkpointing active NN at http://node5:50070

Serving checkpoints at http://node1:50070

2016-03-09 10:50:41,551 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

(N repeated lines omitted here)

2016-03-09 10:50:50,557 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

2016-03-09 10:50:50,561 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:50,626 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:50,676 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

2016-03-09 10:50:50,677 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485]. Skipping.

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:

192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)

at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)

at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)

at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:260)

at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1430)

at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1450)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:212)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)

at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:411)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)

2016-03-09 10:50:50,677 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state

2016-03-09 10:50:50,678 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted

java.lang.InterruptedException: sleep interrupted

at java.lang.Thread.sleep(Native Method)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)

at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:411)

at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)

2016-03-09 10:50:50,682 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state

2016-03-09 10:50:50,684 WARN org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded

2016-03-09 10:50:50,690 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...

2016-03-09 10:50:51,698 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

(N repeated lines omitted here)

2016-03-09 10:51:00,715 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485], stream=null))

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:

192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)

at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)

at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:182)

at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:436)

at org.apache.hadoop.hdfs.server.namenode.JournalSet$7.apply(JournalSet.java:590)

at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)

at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:587)

at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1361)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1068)

at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1624)

at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)

at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)

at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)

at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1502)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1197)

at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)

at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

2016-03-09 10:51:00,717 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

2016-03-09 10:51:00,718 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.56.201

************************************************************/

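II. Problem Analysis

The log looks long, so let's walk through it. Pay attention to the key lines: the Retrying connect to server entries and the QuorumException stack traces.

It is clear that the NameNode fails not because of a configuration error, but because it cannot connect to the JournalNodes.

The JournalNode logs show no problems, so the problem lies with the NameNode, which acts as the JournalNodes' client.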

2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

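Let's analyze the log line above:

The NameNode, acting as a JournalNode client, issues a connection request and it fails. It then tries the other JournalNodes in turn, and those requests fail too, until the maximum number of retries is reached.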
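To make the fields of that log line explicit, here is a minimal Python sketch that extracts the target server, the attempt counter, and the retry policy values from a Retrying connect to server line. The function name `parse_retry_line` and the regular expression are illustrative assumptions written against the line format shown in this log; they are not part of Hadoop.

```python
import re

# Regex written against the log format shown above (an assumption, not a Hadoop API).
RETRY_LINE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)\. "
    r"Already tried (?P<tried>\d+) time\(s\); retry policy is "
    r"RetryUpToMaximumCountWithFixedSleep\(maxRetries=(?P<max_retries>\d+), "
    r"sleepTime=(?P<sleep_ms>\d+) MILLISECONDS\)"
)


def parse_retry_line(line):
    """Return the retry fields of one log line as a dict, or None if it does not match."""
    match = RETRY_LINE.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "ip": match.group("ip"),
        "port": int(match.group("port")),
        "tried": int(match.group("tried")),
        "max_retries": int(match.group("max_retries")),
        "sleep_ms": int(match.group("sleep_ms")),
    }


sample = (
    "2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: "
    "Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); "
    "retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, "
    "sleepTime=1000 MILLISECONDS)"
)
print(parse_retry_line(sample))
```

Running this over the whole NameNode log would show the attempt counter climbing from 0 to 9 for each of node2, node3, and node4 before the QuorumException appears.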

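The experiments show that starting the JournalNodes first, or starting the NameNode a second time, works. This means the JournalNodes were not yet ready when the NameNode had already used up all of its retries.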
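One way to confirm this diagnosis is to check whether the JournalNodes are already accepting TCP connections on port 8485 at the moment the NameNode starts. The following is a rough sketch, assuming plain TCP reachability is a good enough proxy for "JournalNode is ready"; the helper names `is_listening` and `quorum_reachable` are hypothetical, and the majority rule mirrors the "quorum size 2/3" message in the log.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def quorum_reachable(nodes, port=8485, timeout=1.0):
    """Return True if a majority of the given hosts accept connections on port."""
    reachable = sum(1 for host in nodes if is_listening(host, port, timeout))
    needed = len(nodes) // 2 + 1  # 2 out of 3 for this cluster
    return reachable >= needed


# Example call for the cluster in this post (host names taken from the log):
# quorum_reachable(["node2", "node3", "node4"])
```

If this returns False right after start-dfs.sh and True a little later, the JournalNodes are simply slower to come up than the NameNode's retry budget allows.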

III. Solution

Modify the IPC parameters in core-site.xml:

<property>
  <name>ipc.client.connect.max.retries</name>
  <value>100</value>
  <description>Indicates the number of retries a client will make to establish
  a server connection.
  </description>
</property>

<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>10000</value>
  <description>Indicates the number of milliseconds a client will wait for
  before retrying to establish a server connection.
  </description>
</property>

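These two parameters set the retry count and the retry interval for the IPC connection requests that the NameNode makes to the JournalNodes. In my virtual-machine cluster, it takes about 2 minutes before the NameNode can connect to the JournalNodes. Once connected, it runs stably.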
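As a rough check on these values, the retry window can be approximated as the retry count multiplied by the sleep interval. This is a simplified sketch that ignores the time spent inside each connection attempt; the function name `retry_window_seconds` is hypothetical.

```python
def retry_window_seconds(max_retries, retry_interval_ms):
    """Approximate how long the client keeps retrying, ignoring per-attempt connect time."""
    return max_retries * retry_interval_ms / 1000.0


# Values from the log's retry policy: maxRetries=10, sleepTime=1000 ms.
default_window = retry_window_seconds(10, 1000)
# Values from the core-site.xml settings above.
tuned_window = retry_window_seconds(100, 10000)

startup_delay = 120  # roughly the 2 minutes observed on this test cluster

print(default_window, default_window >= startup_delay)
print(tuned_window, tuned_window >= startup_delay)
```

The original policy gives a window of about 10 seconds, which matches the gap between the first retry at 10:50:31 and the QuorumException at 10:50:40 in the log, and is far shorter than the roughly 2 minutes the JournalNodes need. The tuned values give about 1000 seconds, which leaves a wide margin.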

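Note: adjusting the IPC parameters in core-site.xml only helps when the connection fails because the target service has not finished starting. If the target service itself failed to start, tuning the IPC parameters has no effect.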

Reposted from: https://blog.51cto.com/me2xp/1752973
