RabbitMQ Split-Brain

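Readers are welcome to support the author's new books, 《深入理解Kafka:核心设计与实践原理》 and 《RabbitMQ实战指南》, and to follow the author's WeChat public account: 朱小厮的博客.

The original article is at: https://honeypps.com/mq/rabbitmq-network-partition-1/

Some RabbitMQ 3.4.x releases can exhibit false network partition detection (which can, in a sense, be called split-brain). This post verifies the behavior experimentally, in the hope of saving readers some detours.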

Preview

Two threads on the rabbitmq-users Google Group (access may require a VPN) describe the split-brain behavior:
https://groups.google.com/forum/#!topic/rabbitmq-users/dt8VFhMb2zM
https://groups.google.com/forum/#!topic/rabbitmq-users/06OQkYtLJd8

The post describes the symptoms as follows:

Hey Folk,
i just set up a rabbitmq cluster:
Three Nodes: Node A | Node B | Node C
All three nodes see each other (same erlang-cookie, mode: pause_minority).
rabbitmqctl cluster_status => shows status of all nodes on every instance.
Every queue is mirrored to the other nodes.
If i shutdown Node B, the following is happening:
* Node A realizes Node B is offline.
* Node A asks Node C for Node B status.
* Node C answers: "I still have connection to Node B."
* Node A shuts down itself.
* Node C realizes some seconds later, that the connection to Node B is no more possible.
From three Nodes only one is left in case of an unexpected outage.
I would like to realize a setup where Node A and C keep the connection even if Node B goes offline.
Is there any way to do this?

Michael Klishin (the second-largest contributor to rabbitmq-server) replied:

A known issue which is partially resolve in 3.4.x releases. 26474 can be related.

(From the RabbitMQ 3.4.2 release notes: "26474 prevent false positive detection of partial partitions (since 3.4.0)". In other words, false network partition detection.)

Simon MacMullen (also a rabbitmq-server contributor) added:

So this is caused by the new partial partition detection in 3.4.x. It
looks like it is too sensitive - C should only reply "yes" if it has
positive confirmation that it can still talk to B, not if the connection
just hasn't failed yet. This will be fixed in 3.4.2.
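
The distinction Simon draws can be illustrated with a toy model. The sketch below is not RabbitMQ's code: the `LinkState` enum and both reply functions are hypothetical names invented here, only to contrast "the connection has not failed yet" with "positive confirmation that the peer is reachable".

```python
from enum import Enum


class LinkState(Enum):
    """Hypothetical view that node C has of its link to node B."""
    CONFIRMED_UP = "confirmed_up"      # a fresh check just succeeded
    NOT_YET_FAILED = "not_yet_failed"  # no failure noticed so far, nothing confirmed
    DOWN = "down"                      # failure already noticed


def reply_too_sensitive(link_to_b: LinkState) -> bool:
    """Assumed model of the 3.4.0/3.4.1 behavior: anything short of a known failure is answered "yes"."""
    return link_to_b is not LinkState.DOWN


def reply_fixed(link_to_b: LinkState) -> bool:
    """Assumed model of the 3.4.2 behavior: answer "yes" only on positive confirmation."""
    return link_to_b is LinkState.CONFIRMED_UP


# Forum scenario: B has just gone down and C has not noticed yet.
state_seen_by_c = LinkState.NOT_YET_FAILED
print(reply_too_sensitive(state_seen_by_c))  # C wrongly claims it can still reach B
print(reply_fixed(state_seen_by_c))          # C declines to confirm
```

In the too-sensitive model, A receives a "yes" from C, concludes that only A has lost B, and (under pause_minority) pauses itself, which matches the sequence described in the forum post.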

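Hypothesis

From this we can hypothesize that RabbitMQ 3.4.0 has a false network partition detection bug and that RabbitMQ 3.4.2 fixes it.
Method: run the same experiment on RabbitMQ 3.4.0, 3.4.1, 3.4.2, and 3.6.0. In each case, configure three nodes A, B, and C as one cluster, then take down C's network and check whether A and B falsely detect a network partition between themselves.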

Verification

Experiment 1

RabbitMQ version: 3.4.0
Node configuration
Three nodes in total, A, B, and C:
A: rabbit@zhuzhonghua2-fqawb
B: rabbit@hiddenzhu-8drdc
C: rabbit@hidden-local
B join_cluster A; C join_cluster A

Check cluster_status (rabbitmqctl cluster_status):

Cluster status of node 'rabbit@zhuzhonghua2-fqawb' ...
[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[]}]
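
Comparing these outputs by eye across several nodes is error-prone. The following is a minimal sketch, assuming the Erlang-term output format shown above, that pulls out running_nodes and partitions with regular expressions. The function name `parse_cluster_status` is made up for this illustration, and the approach is plain pattern matching rather than a real Erlang term parser.

```python
import re

# Matches one quoted Erlang atom such as 'rabbit@hidden-local'.
NODE = re.compile(r"'([^']+)'")


def parse_cluster_status(text: str) -> dict:
    """Extract running_nodes and partitions from `rabbitmqctl cluster_status` output."""
    running = re.search(r"\{running_nodes,\[([^\]]*)\]\}", text)
    running_nodes = NODE.findall(running.group(1)) if running else []

    # Only look at the text after "{partitions," so the nodes list is not matched.
    _, _, tail = text.partition("{partitions,")
    partitions = {
        node: NODE.findall(peers)
        for node, peers in re.findall(r"\{'([^']+)',\[([^\]]*)\]\}", tail)
    }
    return {"running_nodes": running_nodes, "partitions": partitions}


healthy = (
    "[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},"
    "{running_nodes,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']},"
    "{cluster_name,<<\"rabbit@zhuzhonghua2-fqawb\">>},{partitions,[]}]"
)
status = parse_cluster_status(healthy)
print(len(status["running_nodes"]), status["partitions"])
```

An empty partitions dictionary together with all three nodes in running_nodes corresponds to the healthy starting state of each experiment.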

Run service network stop on node C.
Check cluster_status on node A:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[]}]

Check cluster_status on node A again:

Cluster status of node 'rabbit@zhuzhonghua2-fqawb' ...
[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@zhuzhonghua2-fqawb',['rabbit@hiddenzhu-8drdc']}]}]
Check cluster_status on node B:
[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@hiddenzhu-8drdc',['rabbit@zhuzhonghua2-fqawb']}]}]

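Conclusion: [A and B report a network partition between themselves, even though a genuine network partition can only be detected after network connectivity has been restored.]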
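
The observation above can be stated as a simple rule: only C's network was cut, so any partition entry that names a peer other than C is suspect. Below is a minimal sketch of that rule. The function name `suspicious_partitions` and the dictionary layout (node name mapped to the list of peers it considers partitioned) are assumptions made for this illustration.

```python
def suspicious_partitions(partitions: dict, isolated: set) -> dict:
    """Keep only partition entries naming peers whose network was never cut."""
    flagged = {}
    for node, peers in partitions.items():
        unexpected = [p for p in peers if p not in isolated]
        if unexpected:
            flagged[node] = unexpected
    return flagged


isolated = {"rabbit@hidden-local"}  # node C, the only node taken off the network

# Node A's partitions on 3.4.0 while C's network is down.
seen_on_a_340 = {"rabbit@zhuzhonghua2-fqawb": ["rabbit@hiddenzhu-8drdc"]}
print(suspicious_partitions(seen_on_a_340, isolated))

# A partition entry that only names C is expected and is not flagged.
only_c = {"rabbit@zhuzhonghua2-fqawb": ["rabbit@hidden-local"]}
print(suspicious_partitions(only_c, isolated))
```

The first call flags B as an unexpected partition peer of A, which is the false detection this experiment is looking for; the second call returns an empty dictionary.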

Run service network start on node C.
Check cluster_status on node A:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@zhuzhonghua2-fqawb',['rabbit@hidden-local','rabbit@hiddenzhu-8drdc']}]}]

Check cluster_status on node B:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@hiddenzhu-8drdc',['rabbit@zhuzhonghua2-fqawb']}]}]

Check cluster_status on node C:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hidden-local']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@hidden-local',['rabbit@zhuzhonghua2-fqawb']}]}]

Experiment 2

RabbitMQ version: 3.4.1
Node configuration as above (B join_cluster A, C join_cluster A)
Check the node status:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb','rabbit@hidden-local']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[]}]

Run service network stop on node C.
Check cluster_status on node A:

Cluster status of node 'rabbit@zhuzhonghua2-fqawb' ...
[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@zhuzhonghua2-fqawb',['rabbit@hiddenzhu-8drdc']}]}]

Check cluster_status on node B:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@hiddenzhu-8drdc',['rabbit@zhuzhonghua2-fqawb']}]}]

Conclusion: [reproduced]

Run service network start on node C.
Check cluster_status on node A:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@zhuzhonghua2-fqawb',['rabbit@hidden-local','rabbit@hiddenzhu-8drdc']}]}]

Check cluster_status on node B:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@hiddenzhu-8drdc',['rabbit@zhuzhonghua2-fqawb']}]}]

Check cluster_status on node C:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hidden-local']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@hidden-local',['rabbit@zhuzhonghua2-fqawb']}]}]

Experiment 3

RabbitMQ version: 3.4.2 (3.6.0 behaves the same)
Node configuration as above (B join_cluster A, C join_cluster A)
Check the node status:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb','rabbit@hidden-local']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[]}]

Run service network stop on node C.
Check cluster_status on node A:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[]}]

Check cluster_status on node B:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@zhuzhonghua2-fqawb','rabbit@hiddenzhu-8drdc']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[]}]

Conclusion: [not reproduced]

Run service network start on node C.
Check cluster_status on node A:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@zhuzhonghua2-fqawb',['rabbit@hidden-local']}]}]

Check cluster_status on node B:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@zhuzhonghua2-fqawb','rabbit@hiddenzhu-8drdc']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@zhuzhonghua2-fqawb',['rabbit@hidden-local']}]}]

Check cluster_status on node C:

[{nodes,[{disc,['rabbit@hidden-local','rabbit@hiddenzhu-8drdc','rabbit@zhuzhonghua2-fqawb']}]},{running_nodes,['rabbit@hidden-local']},{cluster_name,<<"rabbit@zhuzhonghua2-fqawb">>},{partitions,[{'rabbit@hidden-local',['rabbit@zhuzhonghua2-fqawb']}]}]

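Conclusion

The version hypothesis is largely confirmed. To avoid false network partition detection, readers running RabbitMQ are advised to upgrade and to avoid versions 3.4.0 and 3.4.1.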
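
As a small illustration of this recommendation, the sketch below flags the two versions in which the experiments reproduced the problem. The names `AFFECTED_VERSIONS` and `has_false_partition_bug` are hypothetical, and the set reflects only the versions tested in this article (3.4.0 and 3.4.1 reproduced; 3.4.2 and 3.6.0 did not), not an official list.

```python
# Versions in which the experiments above reproduced false partition detection.
AFFECTED_VERSIONS = {(3, 4, 0), (3, 4, 1)}


def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '3.4.1' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def has_false_partition_bug(version: str) -> bool:
    """True if this article's experiments reproduced the bug on the given version."""
    return parse_version(version) in AFFECTED_VERSIONS


for v in ("3.4.0", "3.4.1", "3.4.2", "3.6.0"):
    print(v, has_false_partition_bug(v))
```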

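Network Partitions

One article on network partitions (RabbitMQ 网络分区问题) introduces the topic like this:

RabbitMQ clusters are not very tolerant of network partitions. Problems arise when the network partitions frequently, the most obvious being split-brain.

The official documentation puts it this way: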

RabbitMQ clusters do not tolerate network partitions well. If you are thinking of clustering across a WAN, don't. You should use federation or the shovel instead.

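This tells us that clustering should not be used across a WAN; federation or the shovel should be used instead.

Even on a LAN, however, network partitions cannot be avoided entirely: a failing network device (for example a repeater or a network card) can also cause one. RabbitMQ then reports: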

Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. This is a dangerous situation. RabbitMQ clusters should not be installed on networks which can experience partitions.

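When a network partition occurs, the nodes in each partition assume that every node outside their own partition is down, and operations on queues, exchanges, and bindings take effect only within the current partition. Under RabbitMQ's default configuration, the cluster does not resolve the partition and recover on its own even after the network is restored. RabbitMQ (3.1+) detects network partitions automatically and provides a configuration option for handling them: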

[{rabbit,[{tcp_listeners,[5672]},{cluster_partition_handling, ignore}]}
].

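RabbitMQ provides four settings (for details see http://blog.csdn.net/u013256816/article/details/73757884):

ignore: the default. Nothing is done when a partition occurs; choose this when the network is considered reliable.
autoheal: after the partitions negotiate, the nodes in the partition with the fewest client connections are restarted to restore the cluster (favors AP in CAP terms; some state is lost).
pause_if_all_down: a node pauses itself when it cannot reach any node in a configured list of nodes.
pause_minority: after a partition occurs, each node checks whether its own partition holds more than half of the cluster's total nodes; if not, the nodes in that partition are paused (favors CP; use an odd total number of nodes).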
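
The pause_minority rule described above comes down to a strict-majority test. Here is a minimal sketch of that arithmetic; the function name `should_pause` and its parameters are invented for this illustration and do not correspond to RabbitMQ's internal code.

```python
def should_pause(partition_size: int, cluster_size: int) -> bool:
    """Pause unless this node's partition holds a strict majority of the cluster.

    partition_size counts the node itself plus every peer it can still reach.
    """
    if not 1 <= partition_size <= cluster_size:
        raise ValueError("partition_size must be between 1 and cluster_size")
    return partition_size * 2 <= cluster_size


# Three-node cluster with C cut off: A and B stay up, C pauses.
print(should_pause(2, 3))  # False
print(should_pause(1, 3))  # True

# Four-node cluster split 2/2: neither half has a majority, so both pause.
print(should_pause(2, 4))  # True
```

The last case shows why an odd total number of nodes is recommended: with an even split, no partition holds more than half of the nodes, so the whole cluster pauses.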

References:
● RabbitMQ official documentation
● Network partitions
● The split-brain problem
