Redis Sentinel: Setting Up a Sentinel Cluster

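This article is based on Redis 5.0.3.

Update 2021-12-24: this tutorial also applies to Redis 6.2.1.

This article builds on an existing Redis cluster and uses the Redis Sentinel mechanism to make that cluster highly available. For how to install a Redis Cluster, see: Redis Cluster Installation (Manual Setup && Setup with the redis-cli Tool)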

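1. What is a sentinel

As the name suggests, a sentinel watches the running state of a Redis system. In Redis the sentinel is called Sentinel. It has two functions:

① Monitor whether the master and slave nodes are running normally

② When a master node fails, automatically promote one of that master's slave nodes to master, completing the master-slave switchover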

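2. Architecture diagram: Redis Cluster with sentinels

Redis Sentinel (sentinel mode) is the high-availability (HA) solution officially recommended by Redis. A Redis sentinel is an independent process. After a sentinel is added to the Redis cluster, the architecture looks like this: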

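From the architecture diagram above it is easy to see that if the sentinel itself goes down, the Redis cluster is no longer highly available. The sentinel is therefore a single point of failure, and a lone sentinel cannot provide high availability for the Redis cluster. So how is the sentinel's single point of failure solved?

To remove it, we run several sentinels that share the monitoring work so that the system stays stable. The sentinels then monitor not only the masters and slaves but also each other. This arrangement is called a sentinel cluster. A sentinel cluster has to solve two problems: ① failure detection for the Redis cluster ② the negotiation mechanism for deciding on a new master node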

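Once a sentinel cluster is introduced, the sentinels also monitor one another, and the Redis cluster architecture looks like this:

Sentinels become related to each other because they watch the same master node. A newly added sentinel and the other sentinels that monitor the same master discover each other through the pub/sub (publish/subscribe) mechanism, which is how the existing sentinels learn about the newcomer. The new sentinel then establishes long-lived connections to the other sentinels in the cluster, and together they maintain the high availability of the Redis cluster.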
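
As a rough illustration of the discovery step: each sentinel periodically publishes a short comma-separated hello message on the `__sentinel__:hello` channel of the masters it monitors, and the other sentinels subscribed to that channel read it. The sketch below only parses such a message and records unseen senders. The class name `SentinelHello`, the helper `discover`, and the sample run ID are assumptions made for this example and are not part of Redis.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SentinelHello:
    """Fields of a hello message, in the order they appear in the payload."""
    sentinel_ip: str
    sentinel_port: int
    sentinel_runid: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> SentinelHello:
    """Split a comma-separated hello payload into its eight fields."""
    parts = payload.split(",")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    return SentinelHello(
        parts[0], int(parts[1]), parts[2], int(parts[3]),
        parts[4], parts[5], int(parts[6]), int(parts[7]),
    )


def discover(known: Dict[str, SentinelHello], payload: str) -> bool:
    """Record the sender; return True only if this sentinel was not known before."""
    hello = parse_hello(payload)
    is_new = hello.sentinel_runid not in known
    known[hello.sentinel_runid] = hello
    return is_new


# Hypothetical payload: sentinel on .202 announcing itself for master001.
SAMPLE = "192.168.204.202,26379,runid-202,1,master001,192.168.204.201,6379,0"
```

Calling `discover({}, SAMPLE)` returns `True` the first time and `False` when the same sentinel announces itself again.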

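3. Detecting master failure in the Redis cluster

The Redis cluster achieves high availability by introducing sentinels. So how is a master failure detected? Each Sentinel node periodically sends heartbeats to the master to check whether it is alive. If the master does not reply correctly within the configured time, the sentinel marks it as "subjectively down". It then asks the other Sentinel nodes in the cluster to confirm that state. Once the number of sentinels that agree reaches quorum (quorum is set in the configuration file), the master is considered "objectively down", and the election of a new master begins.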
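
The two states can be sketched in a few lines. This is a simplified model and not Sentinel's actual code: the function names and the millisecond timestamps are assumptions chosen for the example, and `votes_down` counts the sentinel itself plus every other sentinel that reports the master as down.

```python
def is_subjectively_down(last_valid_reply_ms: int, now_ms: int, down_after_ms: int) -> bool:
    """One sentinel's own view: no valid reply for longer than down-after-milliseconds."""
    return now_ms - last_valid_reply_ms > down_after_ms


def is_objectively_down(votes_down: int, quorum: int) -> bool:
    """Shared view: at least <quorum> sentinels consider the master down."""
    return votes_down >= quorum


# With down-after-milliseconds = 10000 and quorum = 2, as configured later in this article:
DOWN_AFTER_MS = 10000
QUORUM = 2

# The last valid reply arrived at t=0 ms and it is now t=12000 ms.
sdown = is_subjectively_down(last_valid_reply_ms=0, now_ms=12000, down_after_ms=DOWN_AFTER_MS)

# This sentinel plus one other sentinel agree, so the quorum of 2 is reached.
odown = sdown and is_objectively_down(votes_down=2, quorum=QUORUM)
```

Note that a single sentinel reporting the master down (`votes_down=1`) is not enough with a quorum of 2, which is what protects against one sentinel with a bad network link triggering a failover.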

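However, if several sentinels in the cluster find at the same time that the master has reached the "objectively down" state, which sentinel decides which node becomes the new master?

At this point one Sentinel has to be chosen from the sentinel cluster as the leader to make that decision. Sentinel uses a leader election modeled on the Raft consensus algorithm. Like Paxos, and like the ZAB protocol that ZooKeeper uses, Raft is a distributed consensus algorithm based on voting: a Sentinel that wins the votes of more than half of the sentinels becomes the leader and decides which node is promoted to master.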
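
The voting rule can be sketched as follows. This is a minimal model under stated assumptions, not the real implementation: it ignores epochs and timeouts, the sentinel IDs `s1`..`s3` are hypothetical, and the winner must reach both a strict majority of all known sentinels and the configured quorum.

```python
from collections import Counter
from typing import Dict, Optional


def elect_leader(votes: Dict[str, str], total_sentinels: int, quorum: int) -> Optional[str]:
    """Return the winning sentinel ID, or None if nobody has enough votes.

    votes maps each voting sentinel to the candidate it voted for.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    majority = total_sentinels // 2 + 1
    # The leader needs more than half of all sentinels AND at least <quorum> votes.
    if count >= majority and count >= quorum:
        return candidate
    return None


# Three sentinels, quorum 2: s1 collects two of the three votes and wins.
winner = elect_leader({"s1": "s1", "s2": "s1", "s3": "s3"}, total_sentinels=3, quorum=2)

# Every sentinel votes for itself: no majority, so the election must be retried.
no_winner = elect_leader({"s1": "s1", "s2": "s2", "s3": "s3"}, total_sentinels=3, quorum=2)
```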

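4. Configuring Sentinel

This section configures sentinels for a Redis 5.0.3 cluster of 3 masters and 3 slaves deployed on 6 machines. You are free to choose how many sentinels to run, but a single sentinel cannot make the Redis Cluster highly available. Two sentinels are not enough in practice either: a failover needs a leader elected by a majority of the known sentinels, so with two sentinels the loss of either one blocks the failover. Configure at least three. This article configures three sentinels, giving a highly available Redis cluster of 3 masters, 3 slaves and 3 sentinels.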
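
A quick way to reason about the sentinel count: the sentinel.conf comments quoted further down state that a sentinel must be elected by the majority of the known sentinels before it can start a failover. The sketch below computes, for a given number of sentinels, how many of them can fail while a majority is still reachable. The function names are chosen for this example only.

```python
def majority(sentinels: int) -> int:
    """Smallest number of sentinels that is more than half of the total."""
    return sentinels // 2 + 1


def tolerated_failures(sentinels: int) -> int:
    """How many sentinels may be lost while a majority can still be formed."""
    return sentinels - majority(sentinels)


for n in (1, 2, 3, 5):
    print(f"{n} sentinel(s): majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

With one or two sentinels no failure is tolerated, three sentinels survive the loss of one, and five survive the loss of two.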

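After extracting the redis.tar.gz package, you will find a sentinel.conf file in the extracted directory. This file is the sentinel configuration file, as shown in the figure below.

The three sentinels are deployed on the three servers 192.168.204.201, 192.168.204.202 and 192.168.204.203 (the sentinels do not have to run on the servers that host the 3 masters and 3 slaves; you can use separate servers instead). Note: Redis Sentinel (sentinel mode) ships as part of Redis and depends on the Redis installation, so if you deploy a sentinel on a machine outside the Redis cluster, you must install Redis on that machine first.

Let's first look at the contents of the sentinel.conf file (if you prefer, skip ahead to the important part below).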

# Example sentinel.conf

# *** IMPORTANT ***
#
# By default Sentinel will not be reachable from interfaces different than
# localhost, either use the 'bind' directive to bind to a list of network
# interfaces, or disable protected mode with "protected-mode no" by
# adding it to this configuration file.
#
# Before doing that MAKE SURE the instance is protected from the outside
# world via firewalling or other means.
#
# For example you may use one of the following:
#
# bind 127.0.0.1 192.168.1.1
#
# protected-mode no

# port <sentinel-port>
# The port that this sentinel instance will run on
port 26379

# By default Redis Sentinel does not run as a daemon. Use 'yes' if you need it.
# Note that Redis will write a pid file in /var/run/redis-sentinel.pid when
# daemonized.
daemonize no

# When running daemonized, Redis Sentinel writes a pid file in
# /var/run/redis-sentinel.pid by default. You can specify a custom pid file
# location here.
pidfile /var/run/redis-sentinel.pid

# Specify the log file name. Also the empty string can be used to force
# Sentinel to log on the standard output. Note that if you use standard
# output for logging but daemonize, logs will be sent to /dev/null
logfile ""

# sentinel announce-ip <ip>
# sentinel announce-port <port>
#
# The above two configuration directives are useful in environments where,
# because of NAT, Sentinel is reachable from outside via a non-local address.
#
# When announce-ip is provided, the Sentinel will claim the specified IP address
# in HELLO messages used to gossip its presence, instead of auto-detecting the
# local address as it usually does.
#
# Similarly when announce-port is provided and is valid and non-zero, Sentinel
# will announce the specified TCP port.
#
# The two options don't need to be used together, if only announce-ip is
# provided, the Sentinel will announce the specified IP and the server port
# as specified by the "port" option. If only announce-port is provided, the
# Sentinel will announce the auto-detected local IP and the specified port.
#
# Example:
#
# sentinel announce-ip 1.2.3.4

# dir <working-directory>
# Every long running process should have a well-defined working directory.
# For Redis Sentinel to chdir to /tmp at startup is the simplest thing
# for the process to don't interfere with administrative tasks such as
# unmounting filesystems.
dir /tmp

# sentinel monitor <master-name> <ip> <redis-port> <quorum>
#
# Tells Sentinel to monitor this master, and to consider it in O_DOWN
# (Objectively Down) state only if at least <quorum> sentinels agree.
#
# Note that whatever is the ODOWN quorum, a Sentinel will require to
# be elected by the majority of the known Sentinels in order to
# start a failover, so no failover can be performed in minority.
#
# Replicas are auto-discovered, so you don't need to specify replicas in
# any way. Sentinel itself will rewrite this configuration file adding
# the replicas using additional configuration options.
# Also note that the configuration file is rewritten when a
# replica is promoted to master.
#
# Note: master name should not include special characters or spaces.
# The valid charset is A-z 0-9 and the three characters ".-_".
sentinel monitor mymaster 127.0.0.1 6379 2

# sentinel auth-pass <master-name> <password>
#
# Set the password to use to authenticate with the master and replicas.
# Useful if there is a password set in the Redis instances to monitor.
#
# Note that the master password is also used for replicas, so it is not
# possible to set a different password in masters and replicas instances
# if you want to be able to monitor these instances with Sentinel.
#
# However you can have Redis instances without the authentication enabled
# mixed with Redis instances requiring the authentication (as long as the
# password set is the same for all the instances requiring the password) as
# the AUTH command will have no effect in Redis instances with authentication
# switched off.
#
# Example:
#
# sentinel auth-pass mymaster MySUPER--secret-0123passw0rd

# sentinel down-after-milliseconds <master-name> <milliseconds>
#
# Number of milliseconds the master (or any attached replica or sentinel) should
# be unreachable (as in, not acceptable reply to PING, continuously, for the
# specified period) in order to consider it in S_DOWN state (Subjectively
# Down).
#
# Default is 30 seconds.
sentinel down-after-milliseconds mymaster 30000

# sentinel parallel-syncs <master-name> <numreplicas>
#
# How many replicas we can reconfigure to point to the new replica simultaneously
# during the failover. Use a low number if you use the replicas to serve query
# to avoid that all the replicas will be unreachable at about the same
# time while performing the synchronization with the master.
sentinel parallel-syncs mymaster 1

# sentinel failover-timeout <master-name> <milliseconds>
#
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a replica replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted replica).
#
# - The maximum time a failover in progress waits for all the replicas to be
#   reconfigured as replicas of the new master. However even after this time
#   the replicas will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.
#
# Default is 3 minutes.
sentinel failover-timeout mymaster 180000

# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with "1" the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with "2" (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.

# NOTIFICATION SCRIPT
#
# sentinel notification-script <master-name> <script-path>
#
# Call the specified notification script for any sentinel event that is
# generated in the WARNING level (for instance -sdown, -odown, and so forth).
# This script should notify the system administrator via email, SMS, or any
# other messaging system, that there is something wrong with the monitored
# Redis systems.
#
# The script is called with just two arguments: the first is the event type
# and the second the event description.
#
# The script must exist and be executable in order for sentinel to start if
# this option is provided.
#
# Example:
#
# sentinel notification-script mymaster /var/redis/notify.sh

# CLIENTS RECONFIGURATION SCRIPT
#
# sentinel client-reconfig-script <master-name> <script-path>
#
# When the master changed because of a failover a script can be called in
# order to perform application-specific tasks to notify the clients that the
# configuration has changed and the master is at a different address.
#
# The following arguments are passed to the script:
#
# <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
#
# <state> is currently always "failover"
# <role> is either "leader" or "observer"
#
# The arguments from-ip, from-port, to-ip, to-port are used to communicate
# the old address of the master and the new address of the elected replica
# (now a master).
#
# This script should be resistant to multiple invocations.
#
# Example:
#
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh

# SECURITY
#
# By default SENTINEL SET will not be able to change the notification-script
# and client-reconfig-script at runtime. This avoids a trivial security issue
# where clients can set the script to anything and trigger a failover in order
# to get the program executed.

sentinel deny-scripts-reconfig yes

# REDIS COMMANDS RENAMING
#
# Sometimes the Redis server has certain commands, that are needed for Sentinel
# to work correctly, renamed to unguessable strings. This is often the case
# of CONFIG and SLAVEOF in the context of providers that provide Redis as
# a service, and don't want the customers to reconfigure the instances outside
# of the administration console.
#
# In such case it is possible to tell Sentinel to use different command names
# instead of the normal ones. For example if the master "mymaster", and the
# associated replicas, have "CONFIG" all renamed to "GUESSME", I could use:
#
# SENTINEL rename-command mymaster CONFIG GUESSME
#
# After such configuration is set, every time Sentinel would use CONFIG it will
# use GUESSME instead. Note that there is no actual need to respect the command
# case, so writing "config guessme" is the same in the example above.
#
# SENTINEL SET can also be used in order to perform this configuration at runtime.
#
# In order to set a command back to its original name (undo the renaming), it
# is possible to just rename a command to itsef:
#
# SENTINEL rename-command mymaster CONFIG CONFIG

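Now let's configure the Redis sentinel cluster itself. Because we run three sentinels, each sentinel has to monitor every master in the cluster (slaves are discovered automatically). We configure the first sentinel on node 192.168.204.201. The main configuration is as follows (you can use it as is):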

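# Port used by Sentinel
port 26379

# Disable protected mode
protected-mode no

# Run as a daemon (i.e. start in the background)
daemonize yes

# PID file used by the daemon process
pidfile "/var/run/redis-sentinel.pid"

# Log file name. The default is "". The empty string makes Sentinel log to standard output.
# With a file set, you can follow the log with tail -f xxx.log
logfile "/usr/local/lib/redis-5.0.3/redis-sentinel.log"

# Every long-running process should have a well-defined working directory. Changing to /tmp at
# startup is the simplest way to keep Sentinel from interfering with administrative tasks such
# as unmounting filesystems. (The default is "/tmp", so just copy it.)
dir "/tmp"

# Now the important part
# sentinel monitor <master-name> <ip> <redis-port> <quorum>
# Tells Sentinel to monitor the master at ip:port. master-name can be chosen freely. quorum is a
# number: how many sentinels must consider a master down before it is treated as really down.
# Note that the master ip must be the real IP address, not the loopback address (127.0.0.1).
sentinel monitor master001 192.168.204.201 6379 2
sentinel monitor master002 192.168.204.202 6379 2
sentinel monitor master003 192.168.204.203 6379 2

# sentinel down-after-milliseconds <master-name> <milliseconds>
# How long a master must be unresponsive before this sentinel subjectively considers it down.
# The unit is milliseconds and the default is 30 seconds.
sentinel down-after-milliseconds master001 10000
sentinel down-after-milliseconds master002 10000
sentinel down-after-milliseconds master003 10000

# sentinel parallel-syncs <master-name> <numslaves>
# The maximum number of slaves that may sync with the new master at the same time during a
# failover. The smaller the number, the longer the failover takes. The larger the number, the
# more slaves are unavailable at once because of replication. Setting it to 1 (the default)
# ensures that only one slave at a time is unable to serve command requests.
sentinel parallel-syncs master001 1
sentinel parallel-syncs master002 1
sentinel parallel-syncs master003 1

# sentinel failover-timeout <master-name> <milliseconds>
# Failover timeout. If a failover has started but no failover action has taken effect within
# this time, the current sentinel considers the failover failed. The default is 3 minutes, in
# milliseconds.
sentinel failover-timeout master001 180000
sentinel failover-timeout master002 180000
sentinel failover-timeout master003 180000

# Whether to refuse changing the notification and client-reconfig scripts at runtime.
# Refused by default (yes).
sentinel deny-scripts-reconfig yes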
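
If you would rather generate this file than edit it by hand, a small script can render it from the list of masters. The following is a minimal sketch, assuming the same paths and values used in this article. The function name `render_sentinel_conf` and its parameters are made up for the example and are not part of Redis.

```python
from typing import Dict, Tuple


def render_sentinel_conf(
    masters: Dict[str, Tuple[str, int]],
    quorum: int = 2,
    down_after_ms: int = 10000,
    parallel_syncs: int = 1,
    failover_timeout_ms: int = 180000,
    port: int = 26379,
) -> str:
    """Render a sentinel.conf body for the given masters (name -> (ip, port))."""
    lines = [
        f"port {port}",
        "protected-mode no",
        "daemonize yes",
        'pidfile "/var/run/redis-sentinel.pid"',
        'logfile "/usr/local/lib/redis-5.0.3/redis-sentinel.log"',
        'dir "/tmp"',
    ]
    for name, (ip, redis_port) in masters.items():
        # The article advises against the loopback address for the master IP.
        if ip.startswith("127."):
            raise ValueError(f"{name}: use the real IP, not loopback ({ip})")
        # "sentinel monitor" comes first so the other directives refer to a known master.
        lines.append(f"sentinel monitor {name} {ip} {redis_port} {quorum}")
        lines.append(f"sentinel down-after-milliseconds {name} {down_after_ms}")
        lines.append(f"sentinel parallel-syncs {name} {parallel_syncs}")
        lines.append(f"sentinel failover-timeout {name} {failover_timeout_ms}")
    lines.append("sentinel deny-scripts-reconfig yes")
    return "\n".join(lines) + "\n"


MASTERS = {
    "master001": ("192.168.204.201", 6379),
    "master002": ("192.168.204.202", 6379),
    "master003": ("192.168.204.203", 6379),
}
conf_text = render_sentinel_conf(MASTERS)
```

Writing `conf_text` to sentinel.conf produces the same directives as the hand-written file above, grouped per master instead of per directive.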

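Once the configuration is done, copy this configuration file to nodes 192.168.204.202 and 192.168.204.203. Then start the sentinel on each of the three servers with the following command: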

src/redis-sentinel ./sentinel.conf

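The sentinel cluster is now set up.

Note on opening ports:

If you deploy the three Sentinels on three different servers, remember to open the Sentinel port on each of those servers. If the port is not open, the sentinels still cannot monitor each other. For how to open a port, see: Opening a Specific Port on Linux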
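
A simple way to confirm that the Sentinel port is reachable from another machine is to try a TCP connection to it. The sketch below does only that: it checks reachability and does not speak the Redis protocol. The function names are assumptions for this example, and the hosts and port are the ones used in this article.

```python
import socket
from typing import Dict, Iterable


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False


def check_sentinels(hosts: Iterable[str], port: int = 26379) -> Dict[str, bool]:
    """Check the Sentinel port on every host and report the result per host."""
    return {host: port_is_open(host, port) for host in hosts}


SENTINEL_HOSTS = ["192.168.204.201", "192.168.204.202", "192.168.204.203"]
```

Running `check_sentinels(SENTINEL_HOSTS)` from each of the three servers shows which sentinels cannot be reached, which usually points to a firewall rule that is still missing.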

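5. HA test of the Redis cluster

We use master node 201 and slave node 204 as the example.

If you configured background startup, you can follow the sentinel log with tail -f xxx.log. The three sentinels print essentially the same content, so it is enough to look at the log of the sentinel on server 201.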

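Now we manually shut down master node 201. The sentinels automatically change node 204 from the slave role to the master role, as shown in the figure below:

Why is there a period during which the sentinel does not react? It is waiting to see whether the heartbeat to the master times out, and uses that to decide whether the master is down. This timeout is configurable in sentinel.conf (down-after-milliseconds).

Node 204 has now become the master. When we restart the original master on server 201, we find that it has become a slave of the new master, node 204.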
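
When reading the sentinel log during this test, the line to look for is the `+switch-master` event, which carries the master name followed by the old and new address. The sketch below extracts those fields from a log line. The sample line is illustrative only (its timestamp and process ID are made up, not real output), and the function name is an assumption for this example.

```python
import re
from typing import Dict, Optional

# +switch-master <master name> <old ip> <old port> <new ip> <new port>
SWITCH_MASTER = re.compile(r"\+switch-master (\S+) (\S+) (\d+) (\S+) (\d+)")


def parse_switch_master(log_line: str) -> Optional[Dict[str, object]]:
    """Return the failover details if the line is a +switch-master event, else None."""
    match = SWITCH_MASTER.search(log_line)
    if match is None:
        return None
    name, old_ip, old_port, new_ip, new_port = match.groups()
    return {
        "master": name,
        "old": (old_ip, int(old_port)),
        "new": (new_ip, int(new_port)),
    }


# Illustrative log line for the test in this section (201 fails, 204 is promoted).
SAMPLE_LINE = (
    "1234:X 24 Dec 2021 10:00:00.000 # +switch-master master001 "
    "192.168.204.201 6379 192.168.204.204 6379"
)
event = parse_switch_master(SAMPLE_LINE)
```

For the sample line, `event["new"]` is the address of node 204, the node that took over as master.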
