Background:

During a master-replica switchover drill, we observed that the client application throws errors in the window after the master goes offline, while a slave is being elected as the new master and until the node topology has been fully refreshed.

The default command timeout is 60 s; here it is configured to 300 ms (set via TimeoutOptions, see the client-side configuration at the end):

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 300 millisecond(s)
at io.lettuce.core.ExceptionFactory.createTimeoutException(ExceptionFactory.java:51)
at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:114)
at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:123)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy9.setex(Unknown Source)
at com.xueqiu.infra.redis4.RedisClusterImpl$171.apply(RedisClusterImpl.java:2336)
at com.xueqiu.infra.redis4.RedisClusterImpl$171.apply(RedisClusterImpl.java:2333)
at com.xueqiu.infra.redis4.RedisClusterImpl.executeSync(RedisClusterImpl.java:543)
at com.xueqiu.infra.redis4.RedisClusterImpl.setex(RedisClusterImpl.java:2333)
at com.xueqiu.infra.redis4.RedisMetricsTest.testMutilSet(RedisMetricsTest.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)

During this window the cluster topology goes through the following states:

disconnected -> fail? -> fail -> all-node topology consistent

Below is the server log; it took nearly 30 s for the cluster to complete the failover and return to normal.

Much of the material found online targets the Lettuce bundled with Spring Boot; most authors claim the client's ClusterTopologyRefreshOptions cannot be configured and that a version upgrade is required.

However, with Xueqiu's RedisCluster4 component, the default configuration does refresh the topology automatically after the cluster recovers from a failover (the versions discussed online may differ from ours, or those claims are overstated).

48977:S 24 Feb 11:16:09.477 # Connection with master lost.
48977:S 24 Feb 11:16:09.477 * Caching the disconnected master state.
48977:S 24 Feb 11:16:09.941 * Connecting to MASTER 10.10.200.30:16389
48977:S 24 Feb 11:16:09.941 * MASTER <-> SLAVE sync started
48977:S 24 Feb 11:16:09.941 # Error condition on socket for SYNC: Connection refused
48977:S 24 Feb 11:16:10.943 * Connecting to MASTER 10.10.200.30:16389
48977:S 24 Feb 11:16:10.943 * MASTER <-> SLAVE sync started
……
48977:S 24 Feb 11:16:38.058 # Error condition on socket for SYNC: Connection refused
48977:S 24 Feb 11:16:39.064 * Connecting to MASTER 10.10.200.30:16389
48977:S 24 Feb 11:16:39.064 * MASTER <-> SLAVE sync started
48977:S 24 Feb 11:16:39.064 # Error condition on socket for SYNC: Connection refused
48977:S 24 Feb 11:16:39.713 * FAIL message received from 36ddadb3dbc4a981fe5415c9996754add0f18711 about 486112f7a7d8f52cb2143738a802b49421d0efe0
48977:S 24 Feb 11:16:39.765 # Start of election delayed for 725 milliseconds (rank #0, offset 4548).
48977:S 24 Feb 11:16:40.071 * Connecting to MASTER 10.10.200.30:16389
48977:S 24 Feb 11:16:40.071 * MASTER <-> SLAVE sync started
48977:S 24 Feb 11:16:40.071 # Error condition on socket for SYNC: Connection refused
48977:S 24 Feb 11:16:40.572 # Starting a failover election for epoch 29.
48977:S 24 Feb 11:16:40.620 # Failover election won: I'm the new master.
48977:S 24 Feb 11:16:40.620 # configEpoch set to 29 after successful failover
48977:M 24 Feb 11:16:40.620 # Setting secondary replication ID to f223244de2f9a22b323274f3ac4cabfc61096bbb, valid up to offset: 4549. New replication ID is 0b64827f015621e379f1e0063821c6bcae6ece69
48977:M 24 Feb 11:16:40.620 * Discarding previously cached master state.

However, when the client keeps reading and writing during the switchover, several attempts showed that the cluster takes noticeably longer to recover (the server-side cause of this has not been investigated yet):

48999:S 24 Feb 11:24:24.741 # Connection with master lost.
48999:S 24 Feb 11:24:24.741 * Caching the disconnected master state.
48999:S 24 Feb 11:24:25.222 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:24:25.222 * MASTER <-> SLAVE sync started
48999:S 24 Feb 11:24:25.222 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:24:26.225 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:24:26.225 * MASTER <-> SLAVE sync started
……
48999:S 24 Feb 11:24:55.361 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:24:56.262 * Marking node afc1b251151003a099388c26e9dd3fc90f84e413 as failing (quorum reached).
48999:S 24 Feb 11:24:56.362 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:24:56.362 * MASTER <-> SLAVE sync started
48999:S 24 Feb 11:24:56.362 # Start of election delayed for 535 milliseconds (rank #0, offset 191410).
48999:S 24 Feb 11:24:56.362 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:24:56.966 # Starting a failover election for epoch 30.
48999:S 24 Feb 11:24:57.369 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:24:57.369 * MASTER <-> SLAVE sync started
48999:S 24 Feb 11:24:57.369 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:24:58.370 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:24:58.371 * MASTER <-> SLAVE sync started
……
48999:S 24 Feb 11:25:29.487 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:25:30.490 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:25:30.491 * MASTER <-> SLAVE sync started
48999:S 24 Feb 11:25:30.491 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:25:31.293 # Currently unable to failover: Waiting for votes, but majority still not reached.
48999:S 24 Feb 11:25:31.494 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:25:31.494 * MASTER <-> SLAVE sync started
……
48999:S 24 Feb 11:25:55.580 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:25:56.582 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:25:56.582 * MASTER <-> SLAVE sync started
48999:S 24 Feb 11:25:56.583 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:25:56.983 # Currently unable to failover: Failover attempt expired.
48999:S 24 Feb 11:25:57.586 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:25:57.594 * MASTER <-> SLAVE sync started
48999:S 24 Feb 11:25:57.594 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:25:58.596 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:25:58.596 * MASTER <-> SLAVE sync started
……
48999:S 24 Feb 11:26:55.784 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:26:56.786 * Connecting to MASTER 10.10.200.30:17389
48999:S 24 Feb 11:26:56.786 * MASTER <-> SLAVE sync started
48999:S 24 Feb 11:26:56.786 # Error condition on socket for SYNC: Connection refused
48999:S 24 Feb 11:26:56.987 # Start of election delayed for 547 milliseconds (rank #0, offset 191410).
48999:S 24 Feb 11:26:57.087 # Currently unable to failover: Waiting the delay before I can start a new failover.
48999:S 24 Feb 11:26:57.588 # Starting a failover election for epoch 31.
48999:S 24 Feb 11:26:57.641 # Currently unable to failover: Waiting for votes, but majority still not reached.
48999:S 24 Feb 11:26:57.641 # Failover election won: I'm the new master.
48999:S 24 Feb 11:26:57.642 # configEpoch set to 31 after successful failover
48999:M 24 Feb 11:26:57.642 # Setting secondary replication ID to 0b64827f015621e379f1e0063821c6bcae6ece69, valid up to offset: 191411. New replication ID is ca9d8deacfb6b77a3a2f8edeebda701a0e2ca86c
48999:M 24 Feb 11:26:57.642 * Discarding previously cached master state.

Root-cause analysis:

When the master goes down, the slave election proceeds as follows:

1. The slave notices that its master has entered the FAIL state.
2. Before starting the election, the slave increments its epoch (currentEpoch), then asks the other masters to vote for it by broadcasting a FAILOVER_AUTH_REQUEST packet to every master in the cluster.
3. After requesting votes, the slave waits up to 2 * NODE_TIMEOUT for the results, but no less than 2 seconds regardless of NODE_TIMEOUT.
4. A master that grants its vote replies with FAILOVER_AUTH_ACK and will not vote for another slave of the same master within 2 * NODE_TIMEOUT.
5. The slave discards any FAILOVER_AUTH_ACK whose epoch is smaller than its own. Once it has received FAILOVER_AUTH_ACKs from a majority of the masters, it declares itself the election winner.
6. If the slave does not win within 2 * NODE_TIMEOUT (at least 2 seconds), it abandons this round and starts a new election after 4 * NODE_TIMEOUT (at least 4 seconds).

How to check NODE_TIMEOUT:

127.0.0.1:16389> snblconfig get cluster-node-timeout
1) "cluster-node-timeout"
2) "30000"

As shown, this parameter directly shapes the slave election and thus the cluster's recovery time; neither a very long nor a very short value works well, so it must be tuned to the actual usage scenario of the Redis Cluster.

The election is forcibly delayed by at least 0.5 s to make sure the master's FAIL state has propagated across the whole cluster; otherwise only a few masters might know about it, and a master only votes for the slaves of a master that it considers to be in the FAIL state.

Delay-calculation formula:
DELAY = 500ms + random(0 ~ 500ms) + SLAVE_RANK * 1000ms
SLAVE_RANK is the slave's rank by the amount of data it has already replicated from the master: the lower the rank, the fresher its data. In this way the slave holding the most recent data tends to start the election first (in theory).
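
Below is a minimal Java sketch of this formula, for illustration only (the class and method names are mine, not Redis source code):

import java.util.concurrent.ThreadLocalRandom;

public class ElectionDelaySketch {

    // DELAY = 500ms + random(0 ~ 500ms) + SLAVE_RANK * 1000ms
    static long electionDelayMillis(int slaveRank) {
        long fixedDelay = 500L;                                   // lets the FAIL state propagate cluster-wide
        long jitter = ThreadLocalRandom.current().nextLong(501);  // random(0 ~ 500ms), desynchronizes the slaves
        return fixedDelay + jitter + slaveRank * 1000L;
    }

    public static void main(String[] args) {
        // rank #0, as in the log line "Start of election delayed for 725 milliseconds (rank #0, ...)"
        System.out.println("rank 0: " + electionDelayMillis(0) + " ms");
        System.out.println("rank 2: " + electionDelayMillis(2) + " ms");
    }
}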

Improvement measures:

Server side:

When migrating nodes, do not forcibly shut down or kill the master.

Also, running the CLUSTER FAILOVER command on a replica does complete the node switch, but its data consistency still needs to be verified (some of the master's data may not yet have been replicated to the slave, and that data would be lost).

The correct approach is:

https://redis.io/commands/cluster-setslot (this still needs an online drill; offline testing passed)

With this procedure the cluster completes the node switch; after the master changes, the server answers clients with redirection replies, which trigger the client-side topology refresh. The process is self-adapting.
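
Below is a minimal sketch of that slot-migration flow against the Lettuce sync API (the slot number, node IDs, and the target host/port are placeholders; error handling is omitted):

import java.util.List;

import io.lettuce.core.api.sync.RedisCommands;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class SlotMigrationSketch {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create("redis://10.10.200.30:16389");
        StatefulRedisClusterConnection<String, String> conn = client.connect();
        try {
            int slot = 1234;                       // slot to move (placeholder)
            String sourceId = "<source-node-id>";  // node IDs come from CLUSTER NODES
            String targetId = "<target-node-id>";
            String targetHost = "10.10.200.31";    // placeholder target address
            int targetPort = 16389;

            RedisCommands<String, String> source = conn.getConnection(sourceId).sync();
            RedisCommands<String, String> target = conn.getConnection(targetId).sync();

            // 1. Mark the slot as importing on the target, then migrating on the source.
            target.clusterSetSlotImporting(slot, sourceId);
            source.clusterSetSlotMigrating(slot, targetId);

            // 2. Move the keys in batches; during this phase the source answers
            //    requests for already-moved keys with ASK redirects.
            List<String> keys;
            while (!(keys = source.clusterGetKeysInSlot(slot, 100)).isEmpty()) {
                for (String key : keys) {
                    source.migrate(targetHost, targetPort, key, 0, 5000);
                }
            }

            // 3. Assign the slot to the target. Clients still holding the old
            //    topology get MOVED redirects, which trigger their topology refresh.
            target.clusterSetSlotNode(slot, targetId);
            source.clusterSetSlotNode(slot, targetId);
        } finally {
            conn.close();
            client.shutdown();
        }
    }
}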

Note: unless it is really necessary, do not use the CLUSTER FAILOVER command to switch nodes.

https://redis.io/commands/cluster-failover emphasizes:

Implementation details and notes
CLUSTER FAILOVER, unless the TAKEOVER option is specified, does not execute a failover synchronously. It only schedules a manual failover, bypassing the failure detection stage, so to check if the failover actually happened, CLUSTER NODES or other means should be used in order to verify that the state of the cluster changes some time after the command was sent.

In short: it only schedules a manual failover, bypassing the failure-detection stage, so you must verify afterwards (e.g. via CLUSTER NODES) whether the failover actually happened.
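
A minimal verification sketch with Lettuce, assuming a connection pinned to the replica to be promoted (the node ID is a placeholder):

import io.lettuce.core.api.sync.RedisCommands;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class ManualFailoverSketch {

    public static void main(String[] args) throws InterruptedException {
        RedisClusterClient client = RedisClusterClient.create("redis://10.10.200.30:16389");
        StatefulRedisClusterConnection<String, String> conn = client.connect();
        try {
            // connection pinned to the replica that should be promoted
            RedisCommands<String, String> replica = conn.getConnection("<replica-node-id>").sync();

            // schedules a manual failover; returns OK immediately, the switch happens later
            replica.clusterFailover(false);

            // poll CLUSTER NODES: the "myself" line carries the node's current role
            long deadline = System.currentTimeMillis() + 30_000;
            boolean promoted = false;
            while (!promoted && System.currentTimeMillis() < deadline) {
                for (String line : replica.clusterNodes().split("\n")) {
                    if (line.contains("myself") && line.contains("master")) {
                        promoted = true;
                    }
                }
                if (!promoted) {
                    Thread.sleep(500);
                }
            }
            System.out.println(promoted ? "failover completed" : "failover not observed within 30 s");
        } finally {
            conn.close();
            client.shutdown();
        }
    }
}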

It is acceptable only in rare special cases, such as the initial creation of the cluster.

Client side:

On the client, custom topology-refresh behavior can be configured through ClusterTopologyRefreshOptions.

For example, the README at http://git.snowballfinance.com/lib/redis-cluster4 explains the SDK's design:

import java.time.Duration;

import io.lettuce.core.SocketOptions;
import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        // refresh on adaptive triggers (e.g. MOVED/ASK redirects, reconnects)
        .enableAllAdaptiveRefreshTriggers()
        // enable periodic topology refresh
        .enablePeriodicRefresh(true)
        // refresh interval; tune it to the workload: this is a disaster-recovery
        // safeguard, so refreshing too often hurts P99 while refreshing too
        // rarely hurts the SLA under high QPS
        .refreshPeriod(Duration.ofSeconds(60))
        .build();

ClusterClientOptions options = ClusterClientOptions.builder()
        .autoReconnect(true)
        .pingBeforeActivateConnection(true)
        // synchronous command await time
        .timeoutOptions(TimeoutOptions.builder().fixedTimeout(Duration.ofMillis(300)).build())
        .socketOptions(SocketOptions.builder()
                .keepAlive(true)
                .connectTimeout(Duration.ofMillis(300))
                .build())
        .topologyRefreshOptions(topologyRefreshOptions)
        .build();

That said, the topology-refresh settings can only shorten how long the client works with a stale topology; while the master is down and the slave election is in progress, the exceptions will still occur.

References:

Redis Cluster master outage case: https://blog.csdn.net/ankeway/article/details/100136675/

Official Redis Cluster failover command: https://redis.io/commands/cluster-failover

Redis Cluster specification: https://redis.io/topics/cluster-spec/
