pg_auto_failover, Part 4: Manual Failover
os: ubuntu 16.04
db: postgresql 10.9
pg_auto_failover is an open-source high-availability tool for PostgreSQL from Citus. It currently supports only PostgreSQL 10 and later.
pg_auto_failover is an extension and service for PostgreSQL that monitors and manages automated failover for a Postgres cluster. It is optimized for simplicity and correctness and supports Postgres 10 and newer.
IP plan
192.168.56.101 node1
192.168.56.102 node2
192.168.56.103 node3
node1 is the monitor node
node2 and node3 are the PostgreSQL primary/secondary pair
monitor
Check the state on node1
$ pg_autoctl show state --pgdata /data/pg10/main/
Name | Port | Group | Node | Current State | Assigned State
------+--------+-------+-------+-------------------+------------------
node2 | 5432 | 0 | 10 | secondary | secondary
node3 | 5432 | 0 | 11 | primary | primary

$ pg_autoctl show uri --formation default --pgdata /data/pg10/main/
postgres://node3:5432,node2:5432/postgres?target_session_attrs=read-write

$ pg_autoctl show uri --pgdata /data/pg10/main
postgres://autoctl_node@node1:5432/pg_auto_failover
node3 is currently the primary and node2 is the secondary.
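When this check needs to be scripted, the table printed by pg_autoctl show state can be parsed directly. The following is a minimal sketch in Python, assuming the output has been captured as text in the pipe-separated layout shown above. The helper names parse_state and current_primary are hypothetical and are not part of pg_auto_failover.

```python
# Sample text copied from the `pg_autoctl show state` output above.
STATE_OUTPUT = """\
Name | Port | Group | Node | Current State | Assigned State
------+--------+-------+-------+-------------------+------------------
node2 | 5432 | 0 | 10 | secondary | secondary
node3 | 5432 | 0 | 11 | primary | primary
"""


def parse_state(text):
    """Turn the pipe-separated table into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [col.strip() for col in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        # Skip the horizontal rule made only of '-' and '+'.
        if set(ln.strip()) <= set("-+"):
            continue
        cells = [cell.strip() for cell in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def current_primary(rows):
    """Return the name of the node whose current state is 'primary', or None."""
    for row in rows:
        if row["Current State"] == "primary":
            return row["Name"]
    return None


rows = parse_state(STATE_OUTPUT)
print(current_primary(rows))
```

Running the same function before and after the failover gives a quick way to confirm that the primary role actually moved.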
Log in to the monitor with psql and issue the failover command
$ psql postgres://autoctl_node@node1:5432/pg_auto_failover
psql (10.9 (Ubuntu 10.9-1.pgdg16.04+1))
Type "help" for help.

pg_auto_failover=> \dx
List of installed extensions
Name | Version | Schema | Description
----------------+---------+------------+------------------------------
pgautofailover | 1.0 | public | pg_auto_failover
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)

pg_auto_failover=> set search_path=pgautofailover;
SET
pg_auto_failover=> \d
List of relations
Schema | Name | Type | Owner
----------------+-------------------+----------+----------
pgautofailover | event | table | postgres
pgautofailover | event_eventid_seq | sequence | postgres
pgautofailover | event_nodeid_seq | sequence | postgres
pgautofailover | formation | table | postgres
pgautofailover | node | table | postgres
pgautofailover | node_nodeid_seq | sequence | postgres
(6 rows)

pg_auto_failover=> \df+ perform_failover
List of functions
Schema | Name | Result data type | Argument data types | Type | Volatility | Parallel | Owner | Security | Access privileges | Language | Source code | Description
----------------+------------------+------------------+-----------------------------------------------------------------------+--------+------------+----------+----------+----------+-------------------+----------+------------------+-----------------------------------------------------
pgautofailover | perform_failover | void | formation_id text DEFAULT 'default'::text, group_id integer DEFAULT 0 | normal | volatile | unsafe | postgres | invoker | | c | perform_failover | manually failover from the primary to the secondary
(1 row)

postgres=# select pgautofailover.perform_failover();
ERROR: permission denied for relation node
CONTEXT: SQL statement "UPDATE pgautofailover.node SET goalstate = $1, statechangetime = now() WHERE nodename = $2 AND nodeport = $3"
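The call fails with a permission error on pgautofailover.node, so log in to the monitor as the postgres user instead. Note that the connection uses localhost.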
$ psql postgres://postgres@localhost:5432/pg_auto_failover
psql (10.9 (Ubuntu 10.9-1.pgdg16.04+1))
Type "help" for help.

pg_auto_failover=# select pgautofailover.perform_failover();
perform_failover
------------------

(1 row)
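The \df+ output earlier shows that perform_failover takes two optional arguments, formation_id (default 'default') and group_id (default 0). As a rough sketch, the call can be wrapped in a small Python helper that builds the psql command line with both arguments spelled out. The helper failover_argv is a hypothetical name used only for illustration; it builds the command but does not run it.

```python
import re


def failover_argv(monitor_uri, formation="default", group=0):
    """Build (but do not run) a psql command line that calls perform_failover."""
    # The formation name is interpolated into SQL, so accept only simple names.
    if not re.fullmatch(r"[A-Za-z0-9_]+", formation):
        raise ValueError(f"unexpected formation name: {formation!r}")
    sql = f"select pgautofailover.perform_failover('{formation}', {int(group)});"
    return ["psql", monitor_uri, "-c", sql]


argv = failover_argv("postgres://postgres@localhost:5432/pg_auto_failover")
print(argv)
# To execute it on the monitor host: subprocess.run(argv, check=True)
```

Keeping the command as an argument list avoids shell quoting problems if it is later passed to subprocess.run.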
Monitor log
09:04:04 INFO Setting goal state of node3:5432 to draining and node2:5432 to prepare_promotion after a user-initiated failover.
09:04:04 INFO New state for node3:5432 in formation "default": primary/draining
09:04:04 INFO New state for node2:5432 in formation "default": secondary/prepare_promotion
09:04:05 INFO Node node3:5432 reported new state draining
09:04:05 INFO New state for node3:5432 in formation "default": draining/draining
09:04:09 INFO Node node2:5432 reported new state prepare_promotion
09:04:09 INFO New state for node2:5432 in formation "default": prepare_promotion/prepare_promotion
09:04:09 INFO Setting goal state of node3:5432 to demote_timeout and node2:5432 to stop_replication after node2:5432 converged to prepare_promotion.
09:04:09 INFO New state for node2:5432 in formation "default": prepare_promotion/stop_replication
09:04:09 INFO New state for node3:5432 in formation "default": draining/demote_timeout
09:04:10 INFO Node node2:5432 reported new state stop_replication
09:04:10 INFO New state for node2:5432 in formation "default": stop_replication/stop_replication
09:04:11 INFO Node node3:5432 reported new state demote_timeout
09:04:11 INFO New state for node3:5432 in formation "default": demote_timeout/demote_timeout
09:04:14 INFO Setting goal state of node2:5432 to wait_primary and node3:5432 to demoted after the demote timeout expired.
09:04:14 INFO New state for node2:5432 in formation "default": stop_replication/wait_primary
09:04:14 INFO New state for node3:5432 in formation "default": demote_timeout/demoted
09:04:15 INFO Node node2:5432 reported new state wait_primary
09:04:15 INFO New state for node2:5432 in formation "default": wait_primary/wait_primary
09:04:16 INFO Node node3:5432 reported new state demoted
09:04:16 INFO New state for node3:5432 in formation "default": demoted/demoted
09:04:16 INFO Setting goal state of node3:5432 to catchingup after it converged to demotion and node2:5432 converged to wait_primary.
09:04:16 INFO New state for node3:5432 in formation "default": demoted/catchingup
09:04:19 INFO Node node3:5432 reported new state catchingup
09:04:19 INFO New state for node3:5432 in formation "default": catchingup/catchingup
09:04:30 INFO Setting goal state of node2:5432 to primary and node3:5432 to secondary after node3:5432 caught up.
09:04:30 INFO New state for node3:5432 in formation "default": catchingup/secondary
09:04:30 INFO New state for node2:5432 in formation "default": wait_primary/primary
09:04:31 INFO Node node3:5432 reported new state secondary
09:04:31 INFO New state for node3:5432 in formation "default": secondary/secondary
09:04:34 INFO Node node2:5432 reported new state primary
09:04:34 INFO New state for node2:5432 in formation "default": primary/primary
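Each "New state for ..." line in the monitor log carries a pair of the form current/goal, so the sequence of goal states assigned to each node can be pulled out mechanically. The sketch below assumes the log format is exactly as in the excerpt above (time, level, message); the regular expression and the helper goal_states are illustrative, not anything shipped with pg_auto_failover.

```python
import re
from collections import defaultdict

# A few lines copied from the monitor log above.
MONITOR_LOG = """\
09:04:04 INFO New state for node3:5432 in formation "default": primary/draining
09:04:04 INFO New state for node2:5432 in formation "default": secondary/prepare_promotion
09:04:05 INFO Node node3:5432 reported new state draining
09:04:05 INFO New state for node3:5432 in formation "default": draining/draining
09:04:09 INFO New state for node2:5432 in formation "default": prepare_promotion/prepare_promotion
09:04:09 INFO New state for node2:5432 in formation "default": prepare_promotion/stop_replication
09:04:09 INFO New state for node3:5432 in formation "default": draining/demote_timeout
"""

NEW_STATE = re.compile(
    r'^(?P<time>\d{2}:\d{2}:\d{2}) INFO New state for (?P<node>\S+) '
    r'in formation "(?P<formation>[^"]+)": (?P<current>\w+)/(?P<goal>\w+)$'
)


def goal_states(log_text):
    """Collect, per node, the ordered list of distinct goal states assigned."""
    goals = defaultdict(list)
    for line in log_text.splitlines():
        m = NEW_STATE.match(line.strip())
        if not m:
            continue  # e.g. "Node ... reported new state ..." lines
        seq = goals[m["node"]]
        # Record a goal only when it differs from the previous one.
        if not seq or seq[-1] != m["goal"]:
            seq.append(m["goal"])
    return dict(goals)


print(goal_states(MONITOR_LOG))
```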
Keeper log on node2
09:03:48 INFO Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:03:53 INFO Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:03:58 INFO Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:04:04 INFO Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:04:09 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
09:04:09 INFO Calling node_active for node default/10/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:09 INFO FSM transition from "secondary" to "prepare_promotion": Stop traffic to primary, wait for it to finish draining.
09:04:09 INFO Transition complete: current state is now "prepare_promotion"
09:04:09 INFO Calling node_active for node default/10/0 with current state: prepare_promotion, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:09 INFO FSM transition from "prepare_promotion" to "stop_replication": Prevent against split-brain situations.
09:04:09 INFO Prevent writes to the promoted standby while the primary is not demoted yet, by making the service incompatible with target_session_attrs = read-write
09:04:09 INFO Setting default_transaction_read_only to on
09:04:09 INFO Promoting postgres
09:04:09 INFO Other node in the HA group is node3:5432
09:04:09 INFO Create replication slot "pgautofailover_standby"
09:04:09 INFO Disabling synchronous replication
09:04:09 INFO Transition complete: current state is now "stop_replication"
09:04:09 INFO Calling node_active for node default/10/0 with current state: stop_replication, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:14 INFO Calling node_active for node default/10/0 with current state: stop_replication, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:14 INFO FSM transition from "stop_replication" to "wait_primary": Confirmed promotion with the monitor
09:04:14 INFO Setting default_transaction_read_only to off
09:04:14 INFO Transition complete: current state is now "wait_primary"
09:04:14 INFO Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:19 INFO Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:04:24 INFO Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
09:04:29 INFO Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
09:04:34 INFO Calling node_active for node default/10/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
09:04:34 INFO FSM transition from "wait_primary" to "primary": A healthy secondary appeared
09:04:34 INFO Enabling synchronous replication
09:04:34 INFO Transition complete: current state is now "primary"
09:04:34 INFO Calling node_active for node default/10/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:39 INFO Calling node_active for node default/10/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
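The "FSM transition" lines in the keeper log spell out the path node2 took from secondary to primary. A small Python sketch can extract that path and the time it took. It assumes all timestamps fall on the same day, since the log shows only HH:MM:SS; fsm_transitions is a hypothetical helper name.

```python
import re
from datetime import datetime

# FSM transition lines copied from the node2 keeper log above.
KEEPER_LOG = """\
09:04:09 INFO FSM transition from "secondary" to "prepare_promotion": Stop traffic to primary, wait for it to finish draining.
09:04:09 INFO FSM transition from "prepare_promotion" to "stop_replication": Prevent against split-brain situations.
09:04:14 INFO FSM transition from "stop_replication" to "wait_primary": Confirmed promotion with the monitor
09:04:34 INFO FSM transition from "wait_primary" to "primary": A healthy secondary appeared
"""

FSM = re.compile(r'^(\d{2}:\d{2}:\d{2}) INFO FSM transition from "(\w+)" to "(\w+)"')


def fsm_transitions(log_text):
    """Return (time, from_state, to_state) tuples for every FSM transition line."""
    out = []
    for line in log_text.splitlines():
        m = FSM.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%H:%M:%S")
            out.append((when, m.group(2), m.group(3)))
    return out


steps = fsm_transitions(KEEPER_LOG)
# Full path: the first source state followed by every destination state.
path = [steps[0][1]] + [dst for _, _, dst in steps]
elapsed = (steps[-1][0] - steps[0][0]).total_seconds()
print(" -> ".join(path))
print(elapsed)
```

In this run the promotion path on node2 spans 25 seconds between the first and last transition, most of it spent in wait_primary until node3 rejoined as a healthy secondary.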
Keeper log on node3
09:03:55 INFO Calling node_active for node default/11/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:00 INFO Calling node_active for node default/11/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:05 INFO Calling node_active for node default/11/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
09:04:05 INFO FSM transition from "primary" to "draining": A failover occurred, stopping writes
09:04:05 INFO Transition complete: current state is now "draining"
09:04:05 INFO Calling node_active for node default/11/0 with current state: draining, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:10 INFO Calling node_active for node default/11/0 with current state: draining, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:10 INFO FSM transition from "draining" to "demote_timeout": Secondary confirms it’s receiving no more writes
09:04:10 INFO pg_ctl: no server running
09:04:10 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
09:04:10 INFO Transition complete: current state is now "demote_timeout"
09:04:10 INFO Calling node_active for node default/11/0 with current state: demote_timeout, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:15 INFO Calling node_active for node default/11/0 with current state: demote_timeout, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:15 INFO FSM transition from "demote_timeout" to "demoted": Demote timeout expired
09:04:15 INFO pg_ctl: no server running
09:04:15 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
09:04:15 INFO Transition complete: current state is now "demoted"
09:04:15 INFO Calling node_active for node default/11/0 with current state: demoted, PostgreSQL is not running, sync_state is "", WAL delta is -1.
09:04:15 INFO FSM transition from "demoted" to "catchingup": A new primary is available. First, try to rewind. If that fails, do a pg_basebackup.
09:04:15 INFO The primary node returned by the monitor is node2:5432
09:04:15 INFO Rewinding PostgreSQL to follow new primary node2:5432
09:04:15 INFO pg_ctl: no server running
09:04:15 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
09:04:15 INFO Running /usr/bin/pg_rewind --target-pgdata "/data/pg10/main" --source-server " host='node2' port=5432 user='pgautofailover_replicator' dbname='postgres'" --progress ...
$ pg_autoctl run --pgdata /data/pg10/main
09:06:12 INFO Managing PostgreSQL installation at "/data/pg10/main"
09:06:12 INFO pg_autoctl service is starting
09:06:12 INFO Calling node_active for node default/11/0 with current state: catchingup, PostgreSQL is running, sync_state is "", WAL delta is -1.
09:06:12 INFO FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
09:06:12 INFO Transition complete: current state is now "secondary"
09:06:12 INFO Calling node_active for node default/11/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
09:06:17 INFO Calling node_active for node default/11/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
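The periodic "Calling node_active" lines report what each keeper tells the monitor: its current state, whether PostgreSQL is running, the sync_state, and the WAL delta. A hedged sketch of a parser for those lines follows. The field order formation/node id/group in "default/11/0" is inferred from the show state output (node 11, group 0) rather than taken from documentation, and parse_node_active is a hypothetical helper.

```python
import re

NODE_ACTIVE = re.compile(
    r'^(?P<time>\d{2}:\d{2}:\d{2}) INFO Calling node_active for node '
    r'(?P<formation>[^/]+)/(?P<node_id>\d+)/(?P<group>\d+) with current state: '
    r'(?P<state>\w+), PostgreSQL is (?P<running>running|not running), '
    r'sync_state is "(?P<sync_state>[^"]*)", WAL delta is (?P<wal_delta>-?\d+)\.$'
)


def parse_node_active(line):
    """Parse one node_active log line into a dict, or return None if it does not match."""
    m = NODE_ACTIVE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    return {
        "time": fields["time"],
        "formation": fields["formation"],
        "node_id": int(fields["node_id"]),  # assumed to be the middle field
        "group": int(fields["group"]),      # assumed to be the last field
        "state": fields["state"],
        "running": fields["running"] == "running",
        "sync_state": fields["sync_state"],
        "wal_delta": int(fields["wal_delta"]),
    }


# One line copied from the node3 keeper log above.
sample = (
    '09:04:05 INFO Calling node_active for node default/11/0 with current state: '
    'draining, PostgreSQL is not running, sync_state is "", WAL delta is -1.'
)
print(parse_node_active(sample))
```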
Check the current PostgreSQL state; it matches expectations
$ pg_autoctl show state --pgdata /data/pg10/main/
Name | Port | Group | Node | Current State | Assigned State
------+--------+-------+-------+-------------------+------------------
node2 | 5432 | 0 | 10 | primary | primary
node3 | 5432 | 0 | 11 | secondary | secondary

$ pg_autoctl show events --pgdata /data/pg10/main/
Event Time | Formation | Node | Current State | Assigned State | Comment
-------------------------------+------------+--------+--------------------+--------------------+-----------
2019-07-04 09:04:14.533103+08 | default | 0/10 | wait_primary | wait_primary | Node node2:5432 reported new state wait_primary
2019-07-04 09:04:15.729724+08 | default | 0/11 | demoted | demoted | Node node3:5432 reported new state demoted
2019-07-04 09:04:15.729724+08 | default | 0/11 | demoted | catchingup | Setting goal state of node3:5432 to catchingup after it converged to demotion and node2:5432 converged to wait_primary.
2019-07-04 09:04:19.632981+08 | default | 0/11 | catchingup | catchingup | Node node3:5432 reported new state catchingup
2019-07-04 09:04:30.184196+08 | default | 0/11 | catchingup | secondary | Setting goal state of node2:5432 to primary and node3:5432 to secondary after node3:5432 caught up.
2019-07-04 09:04:30.184196+08 | default | 0/10 | wait_primary | primary | Setting goal state of node2:5432 to primary and node3:5432 to secondary after node3:5432 caught up.
2019-07-04 09:04:30.204477+08 | default | 0/11 | secondary | secondary | Node node3:5432 reported new state secondary
2019-07-04 09:04:34.699843+08 | default | 0/10 | primary | primary | Node node2:5432 reported new state primary
2019-07-04 09:06:12.163731+08 | default | 0/11 | catchingup | secondary | Node node3:5432 reported new state catchingup
2019-07-04 09:06:12.198041+08 | default | 0/11 | secondary | secondary | Node node3:5432 reported new state secondary
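"Matches expectations" here means two things: every node has reached its assigned state, and exactly one node is primary. A minimal Python sketch of that check, with the node tuples typed in by hand from the show state output above, could look like this. The name failover_settled is hypothetical and the criteria are this post's reading of the output, not an official definition.

```python
# (name, current state, assigned state), copied from the final show state output.
FINAL_STATE = [
    ("node2", "primary", "primary"),
    ("node3", "secondary", "secondary"),
]


def failover_settled(nodes):
    """True when every node reached its assigned state and exactly one is primary."""
    converged = all(current == assigned for _, current, assigned in nodes)
    primaries = [name for name, current, _ in nodes if current == "primary"]
    return converged and len(primaries) == 1


print(failover_settled(FINAL_STATE))
```

A snapshot taken in the middle of the failover, for example node2 at stop_replication with wait_primary assigned, would make the function return False.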
References:
https://github.com/citusdata/pg_auto_failover
https://pg-auto-failover.readthedocs.io/en/latest/quickstart.html
https://www.citusdata.com/blog/2019/05/30/introducing-pg-auto-failover/
https://cloudblogs.microsoft.com/opensource/2019/05/06/introducing-pg_auto_failover-postgresql-open-source-extension-automated-failover-high-availability/