OS: Ubuntu 16.04
DB: PostgreSQL 10.9

pg_auto_failover is an open-source high-availability tool for PostgreSQL from Citus. It currently supports only PostgreSQL 10 and later.

pg_auto_failover is an extension and service for PostgreSQL that monitors and manages automated failover for a Postgres cluster. It is optimized for simplicity and correctness and supports Postgres 10 and newer.

IP plan

192.168.56.101 node1
192.168.56.102 node2
192.168.56.103 node3

node1 is the monitor node.
node2 and node3 are the PostgreSQL primary/standby pair.

monitor

Check the state on node1

$ pg_autoctl show state --pgdata /data/pg10/main/
 Name |   Port | Group |  Node |     Current State |    Assigned State
------+--------+-------+-------+-------------------+------------------
node2 |   5432 |     0 |    10 |         secondary |         secondary
node3 |   5432 |     0 |    11 |           primary |           primary

$ pg_autoctl show uri --formation default --pgdata /data/pg10/main/
postgres://node3:5432,node2:5432/postgres?target_session_attrs=read-write

$ pg_autoctl show uri --pgdata /data/pg10/main
postgres://autoctl_node@node1:5432/pg_auto_failover
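
To check which node is primary from a script rather than by eye, the table printed by pg_autoctl show state can be parsed. The following is a minimal Python sketch, assuming the output keeps the pipe-separated layout shown above. The function names parse_show_state and find_primary are hypothetical helpers, not part of pg_auto_failover.

```python
def parse_show_state(text):
    """Parse the table printed by `pg_autoctl show state` into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, an echoed shell command, and the dashed separator row.
        if not line or line.startswith("$") or set(line) <= set("-+"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def find_primary(rows):
    """Return the node whose current and assigned state are both primary, if any."""
    for row in rows:
        if row["Current State"] == "primary" and row["Assigned State"] == "primary":
            return row["Name"]
    return None


# Sample copied from the output above.
SAMPLE = """\
 Name |   Port | Group |  Node |     Current State |    Assigned State
------+--------+-------+-------+-------------------+------------------
node2 |   5432 |     0 |    10 |         secondary |         secondary
node3 |   5432 |     0 |    11 |           primary |           primary
"""
```

Calling find_primary(parse_show_state(SAMPLE)) on the sample returns the name of the node that is primary in both columns. During a failover no row matches, so the function returns None.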

node3 is currently the primary and node2 is the standby.
Log in to the monitor with psql and issue the failover command.

$ psql postgres://autoctl_node@node1:5432/pg_auto_failover
psql (10.9 (Ubuntu 10.9-1.pgdg16.04+1))
Type "help" for help.

pg_auto_failover=> \dx
                 List of installed extensions
      Name      | Version |   Schema   |         Description
----------------+---------+------------+------------------------------
 pgautofailover | 1.0     | public     | pg_auto_failover
 plpgsql        | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

pg_auto_failover=> set search_path=pgautofailover;
SET
pg_auto_failover=> \d
                    List of relations
     Schema     |       Name        |   Type   |  Owner
----------------+-------------------+----------+----------
 pgautofailover | event             | table    | postgres
 pgautofailover | event_eventid_seq | sequence | postgres
 pgautofailover | event_nodeid_seq  | sequence | postgres
 pgautofailover | formation         | table    | postgres
 pgautofailover | node              | table    | postgres
 pgautofailover | node_nodeid_seq   | sequence | postgres
(6 rows)

pg_auto_failover=> \df+ perform_failover
List of functions
     Schema     |       Name       | Result data type |                          Argument data types                          |  Type  | Volatility | Parallel |  Owner   | Security | Access privileges | Language |   Source code    |                     Description
----------------+------------------+------------------+-----------------------------------------------------------------------+--------+------------+----------+----------+----------+-------------------+----------+------------------+-----------------------------------------------------
 pgautofailover | perform_failover | void             | formation_id text DEFAULT 'default'::text, group_id integer DEFAULT 0 | normal | volatile   | unsafe   | postgres | invoker  |                   | c        | perform_failover | manually failover from the primary to the secondary
(1 row)

postgres=# select pgautofailover.perform_failover();
ERROR:  permission denied for relation node
CONTEXT:  SQL statement "UPDATE pgautofailover.node SET goalstate = $1, statechangetime = now() WHERE nodename = $2 AND nodeport = $3"

Log in to the monitor as the postgres user. Note that the connection goes to localhost.

$ psql postgres://postgres@localhost:5432/pg_auto_failover
psql (10.9 (Ubuntu 10.9-1.pgdg16.04+1))
Type "help" for help.

pg_auto_failover=# select pgautofailover.perform_failover();
 perform_failover
------------------

(1 row)
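
The same call can be scripted. The sketch below builds the psql command line for the perform_failover(formation_id, group_id) signature listed by \df+ above. It assumes it runs on the monitor host, where the postgres user can connect over localhost as in the session above. MONITOR_URI, build_failover_command and run_failover are hypothetical names introduced for illustration.

```python
import re
import subprocess

# Assumption: same connection string as the interactive session above.
MONITOR_URI = "postgres://postgres@localhost:5432/pg_auto_failover"


def build_failover_command(formation="default", group=0, uri=MONITOR_URI):
    """Build the psql argv that calls pgautofailover.perform_failover()."""
    # Only accept simple identifiers so the name is safe to embed in SQL text.
    if not re.fullmatch(r"[A-Za-z0-9_]+", formation):
        raise ValueError(f"unexpected formation name: {formation!r}")
    sql = f"select pgautofailover.perform_failover('{formation}', {int(group)});"
    # -X skips ~/.psqlrc, -c runs a single command and exits.
    return ["psql", "-X", "-c", sql, uri]


def run_failover(**kwargs):
    """Run the failover command; raises CalledProcessError if psql fails."""
    return subprocess.run(build_failover_command(**kwargs), check=True)
```

Keeping the command construction separate from run_failover makes it possible to inspect or log the exact statement before triggering a failover.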

monitor log

09:04:04 INFO  Setting goal state of node3:5432 to draining and node2:5432 to prepare_promotion after a user-initiated failover.
09:04:04 INFO  New state for node3:5432 in formation "default": primary/draining
09:04:04 INFO  New state for node2:5432 in formation "default": secondary/prepare_promotion
09:04:05 INFO  Node node3:5432 reported new state draining
09:04:05 INFO  New state for node3:5432 in formation "default": draining/draining
09:04:09 INFO  Node node2:5432 reported new state prepare_promotion
09:04:09 INFO  New state for node2:5432 in formation "default": prepare_promotion/prepare_promotion
09:04:09 INFO  Setting goal state of node3:5432 to demote_timeout and node2:5432 to stop_replication after node2:5432 converged to prepare_promotion.
09:04:09 INFO  New state for node2:5432 in formation "default": prepare_promotion/stop_replication
09:04:09 INFO  New state for node3:5432 in formation "default": draining/demote_timeout
09:04:10 INFO  Node node2:5432 reported new state stop_replication
09:04:10 INFO  New state for node2:5432 in formation "default": stop_replication/stop_replication
09:04:11 INFO  Node node3:5432 reported new state demote_timeout
09:04:11 INFO  New state for node3:5432 in formation "default": demote_timeout/demote_timeout
09:04:14 INFO  Setting goal state of node2:5432 to wait_primary and node3:5432 to demoted after the demote timeout expired.
09:04:14 INFO  New state for node2:5432 in formation "default": stop_replication/wait_primary
09:04:14 INFO  New state for node3:5432 in formation "default": demote_timeout/demoted
09:04:15 INFO  Node node2:5432 reported new state wait_primary
09:04:15 INFO  New state for node2:5432 in formation "default": wait_primary/wait_primary
09:04:16 INFO  Node node3:5432 reported new state demoted
09:04:16 INFO  New state for node3:5432 in formation "default": demoted/demoted
09:04:16 INFO  Setting goal state of node3:5432 to catchingup after it converged to demotion and node2:5432 converged to wait_primary.
09:04:16 INFO  New state for node3:5432 in formation "default": demoted/catchingup
09:04:19 INFO  Node node3:5432 reported new state catchingup
09:04:19 INFO  New state for node3:5432 in formation "default": catchingup/catchingup
09:04:30 INFO  Setting goal state of node2:5432 to primary and node3:5432 to secondary after node3:5432 caught up.
09:04:30 INFO  New state for node3:5432 in formation "default": catchingup/secondary
09:04:30 INFO  New state for node2:5432 in formation "default": wait_primary/primary
09:04:31 INFO  Node node3:5432 reported new state secondary
09:04:31 INFO  New state for node3:5432 in formation "default": secondary/secondary
09:04:34 INFO  Node node2:5432 reported new state primary
09:04:34 INFO  New state for node2:5432 in formation "default": primary/primary
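
Each "New state for ..." line in the monitor log carries a current/goal state pair, and a node has converged when the two are equal. The Python sketch below extracts those pairs so that the path of each node through the failover can be listed. It assumes the log format shown above. STATE_LINE, parse_transitions and converged_states are hypothetical names.

```python
import re

# Matches lines such as:
# 09:04:05 INFO  New state for node3:5432 in formation "default": draining/draining
STATE_LINE = re.compile(
    r'^(?P<time>\d{2}:\d{2}:\d{2}) INFO\s+New state for (?P<node>\S+) '
    r'in formation "(?P<formation>[^"]+)": (?P<current>\w+)/(?P<goal>\w+)$'
)


def parse_transitions(log_text):
    """Return (time, node, current, goal) tuples for every 'New state' line."""
    transitions = []
    for line in log_text.splitlines():
        match = STATE_LINE.match(line.strip())
        if match:
            transitions.append(
                (match["time"], match["node"], match["current"], match["goal"])
            )
    return transitions


def converged_states(transitions, node):
    """List the states a node actually reached (current equals goal), in order."""
    return [cur for _, name, cur, goal in transitions if name == node and cur == goal]


# A few lines copied from the monitor log above.
SAMPLE_LOG = """\
09:04:04 INFO  New state for node3:5432 in formation "default": primary/draining
09:04:04 INFO  New state for node2:5432 in formation "default": secondary/prepare_promotion
09:04:05 INFO  Node node3:5432 reported new state draining
09:04:05 INFO  New state for node3:5432 in formation "default": draining/draining
09:04:09 INFO  New state for node2:5432 in formation "default": prepare_promotion/prepare_promotion
"""
```

Run against the full monitor log, converged_states gives the sequence of states each node passed through, which makes it easier to compare the monitor's view with the keeper logs on each node.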

keeper log on node2

09:03:48 INFO  Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:03:53 INFO  Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:03:58 INFO  Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:04:04 INFO  Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:04:09 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
09:04:09 INFO  Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:09 INFO  FSM transition from "secondary" to "prepare_promotion": Stop traffic to primary, wait for it to finish draining.
09:04:09 INFO  Transition complete: current state is now "prepare_promotion"
09:04:09 INFO  Calling node_active for node default/10/0 with current state: prepare_promotion, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:09 INFO  FSM transition from "prepare_promotion" to "stop_replication": Prevent against split-brain situations.
09:04:09 INFO  Prevent writes to the promoted standby while the primary is not demoted yet, by making the service incompatible with target_session_attrs = read-write
09:04:09 INFO  Setting default_transaction_read_only to on
09:04:09 INFO  Promoting postgres
09:04:09 INFO  Other node in the HA group is node3:5432
09:04:09 INFO  Create replication slot "pgautofailover_standby"
09:04:09 INFO  Disabling synchronous replication
09:04:09 INFO  Transition complete: current state is now "stop_replication"
09:04:09 INFO  Calling node_active for node default/10/0 with current state: stop_replication, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:14 INFO  Calling node_active for node default/10/0 with current state: stop_replication, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:14 INFO  FSM transition from "stop_replication" to "wait_primary": Confirmed promotion with the monitor
09:04:14 INFO  Setting default_transaction_read_only to off
09:04:14 INFO  Transition complete: current state is now "wait_primary"
09:04:14 INFO  Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:19 INFO  Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:24 INFO  Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
09:04:29 INFO  Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
09:04:34 INFO  Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
09:04:34 INFO  FSM transition from "wait_primary" to "primary": A healthy secondary appeared
09:04:34 INFO  Enabling synchronous replication
09:04:34 INFO  Transition complete: current state is now "primary"
09:04:34 INFO  Calling node_active for node default/10/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:39 INFO  Calling node_active for node default/10/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.

keeper log on node3

09:03:55 INFO  Calling node_active for node default/11/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:00 INFO  Calling node_active for node default/11/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:05 INFO  Calling node_active for node default/11/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:05 INFO  FSM transition from "primary" to "draining": A failover occurred, stopping writes
09:04:05 INFO  Transition complete: current state is now "draining"
09:04:05 INFO  Calling node_active for node default/11/0 with current state: draining, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:10 INFO  Calling node_active for node default/11/0 with current state: draining, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:10 INFO  FSM transition from "draining" to "demote_timeout": Secondary confirms it’s receiving no more writes
09:04:10 INFO  pg_ctl: no server running
09:04:10 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
09:04:10 INFO  Transition complete: current state is now "demote_timeout"
09:04:10 INFO  Calling node_active for node default/11/0 with current state: demote_timeout, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:15 INFO  Calling node_active for node default/11/0 with current state: demote_timeout, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:15 INFO  FSM transition from "demote_timeout" to "demoted": Demote timeout expired
09:04:15 INFO  pg_ctl: no server running
09:04:15 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
09:04:15 INFO  Transition complete: current state is now "demoted"
09:04:15 INFO  Calling node_active for node default/11/0 with current state: demoted, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:15 INFO  FSM transition from "demoted" to "catchingup": A new primary is available. First, try to rewind. If that fails, do a pg_basebackup.
09:04:15 INFO  The primary node returned by the monitor is node2:5432
09:04:15 INFO  Rewinding PostgreSQL to follow new primary node2:5432
09:04:15 INFO  pg_ctl: no server running
09:04:15 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
09:04:15 INFO  Running /usr/bin/pg_rewind --target-pgdata "/data/pg10/main" --source-server " host='node2' port=5432 user='pgautofailover_replicator' dbname='postgres'" --progress ...
$ pg_autoctl run --pgdata /data/pg10/main
09:06:12 INFO  Managing PostgreSQL installation at "/data/pg10/main"
09:06:12 INFO  pg_autoctl service is starting
09:06:12 INFO  Calling node_active for node default/11/0 with current state: catchingup, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:06:12 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
09:06:12 INFO  Transition complete: current state is now "secondary"
09:06:12 INFO  Calling node_active for node default/11/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:06:17 INFO  Calling node_active for node default/11/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.

Check the current PostgreSQL state. It matches expectations.

$ pg_autoctl show state --pgdata /data/pg10/main/
 Name |   Port | Group |  Node |     Current State |    Assigned State
------+--------+-------+-------+-------------------+------------------
node2 |   5432 |     0 |    10 |           primary |           primary
node3 |   5432 |     0 |    11 |         secondary |         secondary

$ pg_autoctl show events --pgdata /data/pg10/main/
                    Event Time |  Formation |   Node |      Current State |     Assigned State | Comment
-------------------------------+------------+--------+--------------------+--------------------+-----------
 2019-07-04 09:04:14.533103+08 |    default |   0/10 |       wait_primary |       wait_primary | Node node2:5432 reported new state wait_primary
 2019-07-04 09:04:15.729724+08 |    default |   0/11 |            demoted |            demoted | Node node3:5432 reported new state demoted
 2019-07-04 09:04:15.729724+08 |    default |   0/11 |            demoted |         catchingup | Setting goal state of node3:5432 to catchingup after it converged to demotion and node2:5432 converged to wait_primary.
 2019-07-04 09:04:19.632981+08 |    default |   0/11 |         catchingup |         catchingup | Node node3:5432 reported new state catchingup
 2019-07-04 09:04:30.184196+08 |    default |   0/11 |         catchingup |          secondary | Setting goal state of node2:5432 to primary and node3:5432 to secondary after node3:5432 caught up.
 2019-07-04 09:04:30.184196+08 |    default |   0/10 |       wait_primary |            primary | Setting goal state of node2:5432 to primary and node3:5432 to secondary after node3:5432 caught up.
 2019-07-04 09:04:30.204477+08 |    default |   0/11 |          secondary |          secondary | Node node3:5432 reported new state secondary
 2019-07-04 09:04:34.699843+08 |    default |   0/10 |            primary |            primary | Node node2:5432 reported new state primary
 2019-07-04 09:06:12.163731+08 |    default |   0/11 |         catchingup |          secondary | Node node3:5432 reported new state catchingup
 2019-07-04 09:06:12.198041+08 |    default |   0/11 |          secondary |          secondary | Node node3:5432 reported new state secondary

References:
https://github.com/citusdata/pg_auto_failover
https://pg-auto-failover.readthedocs.io/en/latest/quickstart.html

https://www.citusdata.com/blog/2019/05/30/introducing-pg-auto-failover/
https://cloudblogs.microsoft.com/opensource/2019/05/06/introducing-pg_auto_failover-postgresql-open-source-extension-automated-failover-high-availability/
