Reference documents:

Pgpool-II + Watchdog Setup Example

https://www.pgpool.net/docs/latest/en/html/example-configs.html
https://www.pgpool.net/docs/latest/en/html/example-cluster.html

Features verified in testing:
1. When pgpool fails on one of the nodes, the VIP automatically drifts to the other node, i.e. the watchdog works.
2. When the primary database is shut down, the standby is promoted to primary, i.e. failover works. There does not appear to be a switchover function.
3. The failed former primary has to be rebuilt (either with pgpool's tools or with pg_rewind).

Features not verified:
1. Online recovery with the pgpool tool did not succeed; the pg_rewind command was used instead.

Host information:

test  : 192.168.2.80
test1 : 192.168.2.81
VIP   : 192.168.2.88

PostgreSQL version and configuration

PostgreSQL Version    10.15
port                  5432
$PGDATA               /opt/PostgreSQL/10/data
Archive mode          ON
Replication Slots     Enable     -- not actually configured in this setup
Start automatically   Enable     -- already in place after installing PostgreSQL from source

Pgpool-II version and configuration
Pgpool-II Version    4.2.0
Ports:

9999               Pgpool-II accepts connections
9898                  PCP process accepts connections
9000                  watchdog accepts connections
9694                  UDP port for receiving Watchdog's heartbeat signal
Config file           /opt/pgpool/etc/pgpool.conf
Startup user          postgres   (the default since Pgpool-II 4.1; in Pgpool-II 4.0 and earlier the default startup user is root)
Running mode          streaming replication mode
Watchdog              on (life check method: heartbeat)
Start automatically   Enable     -- not configured here; pgpool is started manually in this example

Scripts

failover:
/opt/pgpool/etc/failover.sh         -- run via failover_command to perform the failover
/opt/pgpool/etc/follow_primary.sh   -- run via follow_primary_command to synchronize the standbys with the new primary after a failover

The contents of failover.sh can be taken from the sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/failover.sh.sample;hb=refs/heads/V4_2_STABLE
The failover script is simple: it promotes the standby to the new primary with pg_ctl promote; the old primary is left untouched. A minimal sketch of that logic follows.
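A minimal sketch under this setup's paths and key name (the official sample linked above adds fuller argument handling and error checks):

#!/bin/bash
# failover.sh (sketch): promote the surviving standby when the primary fails.
# Pgpool-II passes these values via the %-placeholders in failover_command:
# %d %h %p %D %m %H %M %P %r %R %N %S
FAILED_NODE_ID="$1"          # %d
NEW_MAIN_NODE_HOST="$6"      # %H
OLD_PRIMARY_NODE_ID="$8"     # %P
PGHOME=/opt/PostgreSQL/10

# If a standby (not the primary) failed, there is nothing to promote.
if [ "$FAILED_NODE_ID" -ne "$OLD_PRIMARY_NODE_ID" ]; then
    exit 0
fi

# Promote the new main node over passwordless ssh.
ssh -T -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa_pgpool \
    postgres@"$NEW_MAIN_NODE_HOST" \
    "$PGHOME/bin/pg_ctl -D /opt/PostgreSQL/10/data -w promote"
exit $?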

The follow_primary.sh sample: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/follow_primary.sh.sample;hb=refs/heads/V4_2_STABLE
follow_primary.sh must be configured when three PostgreSQL servers are used; with two PostgreSQL servers it is not needed.
Its purpose: with three servers, after a failover one standby is promoted to primary, and the remaining standby then has to be re-synchronized with the new primary, which this script does, mainly via pg_rewind. (With only two servers, once the primary goes down and the standby is promoted there is no standby left, so the script is unnecessary.)

online recovery:
/opt/pgpool/etc/recovery_1st_stage    -- run via recovery_1st_stage_command to recover a standby node
/opt/pgpool/etc/pgpool_remote_start   -- run after recovery_1st_stage_command to start the standby node

The recovery_1st_stage sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/recovery_1st_stage.sample;hb=refs/heads/V4_2_STABLE
This script builds a standby online with pg_basebackup; it is also used to recreate the standby when pg_rewind cannot synchronize its data.
The pgpool_remote_start sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/pgpool_remote_start.sample;hb=refs/heads/V4_2_STABLE
Once the new standby has been created, pgpool_remote_start starts it remotely via pg_ctl, as sketched below.
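A minimal sketch, assuming the same paths and key as elsewhere in this setup (the official sample adds error handling):

#!/bin/bash
# pgpool_remote_start (sketch): start the freshly created standby via ssh.
# Pgpool-II invokes it as: pgpool_remote_start <remote_host> <remote_pgdata>
DEST_NODE_HOST="$1"
DEST_NODE_PGDATA="$2"
PGHOME=/opt/PostgreSQL/10

ssh -T -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa_pgpool \
    postgres@"$DEST_NODE_HOST" \
    "$PGHOME/bin/pg_ctl -w start -D $DEST_NODE_PGDATA"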

Watchdog:
/opt/pgpool/etc/escalation.sh   -- run via wd_escalation_command to switch the active/standby Pgpool-II safely
The escalation.sh sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/escalation.sh.sample;hb=refs/heads/V4_2_STABLE
Its main job is to bring the VIP down on the old node and then bring it up on the new node, as sketched below.
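A minimal sketch under this setup's assumptions (hosts test/test1, VIP 192.168.2.88 on ens33); the official sample loops over all configured pgpool hosts in the same way:

#!/bin/bash
# escalation.sh (sketch): before the new leader acquires the VIP, make sure
# no other pgpool node still holds it.
POOLS=(test test1)      # all pgpool node hostnames in this setup
VIP=192.168.2.88
DEVICE=ens33

for pool in "${POOLS[@]}"; do
    [ "$pool" = "$(hostname)" ] && continue    # only touch the other nodes
    ssh -T -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa_pgpool \
        postgres@"$pool" \
        "/usr/bin/sudo /sbin/ip addr del $VIP/24 dev $DEVICE"
done
exit 0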

Installing the PostgreSQL databases (omitted)

Edit the configuration file $PGDATA/postgresql.conf on server1 (the primary) as follows. Enable wal_log_hints so that pg_rewind can be used. Because the primary may later become a standby, set hot_standby = on.

postgresql.conf settings:

listen_addresses = '*'
archive_mode = on
archive_command = 'cp %p /postgres/archive/%f'
max_wal_senders = 10
max_replication_slots = 10
wal_level = replica
hot_standby = on
wal_log_hints = on

After the primary server is running, the standby could be set up with Pgpool-II's online recovery feature. That step is omitted here: the standby was created manually with pg_basebackup (a sketch follows), and replication runs as the repl user.
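A sketch of the manual build on test1, assuming the paths above and the repl user's password oracle as used later in this example (on PostgreSQL 10 the standby settings go in recovery.conf):

# On test1 (the standby), with PostgreSQL stopped and $PGDATA empty:
pg_basebackup -h 192.168.2.80 -p 5432 -U repl -D /opt/PostgreSQL/10/data -X stream -P

# PostgreSQL 10: create $PGDATA/recovery.conf before starting the standby.
cat > /opt/PostgreSQL/10/data/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=192.168.2.80 port=5432 user=repl password=oracle'
EOF

pg_ctl -D /opt/PostgreSQL/10/data start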

Configure pg_hba.conf

host    replication     repl            192.168.2.80/24         md5
host    replication     repl            192.168.2.81/24         md5

Set up passwordless ssh trust between the nodes for both the root and postgres users (process omitted; similar to the commands below). The setup works when "ssh server1 date" runs without prompting for a password.

[all servers]# cd ~/.ssh
[all servers]# ssh-keygen -t rsa -f id_rsa_pgpool
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server1
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server2
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server3

[all servers]# su - postgres
[all servers]$ cd ~/.ssh
[all servers]$ ssh-keygen -t rsa -f id_rsa_pgpool
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server1
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server2
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server3
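A quick verification with the dedicated key; each command should print the remote date without any password prompt:

ssh -i ~/.ssh/id_rsa_pgpool postgres@server1 date
ssh -i ~/.ssh/id_rsa_pgpool postgres@server2 date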

Edit the .pgpass file to allow passwordless logins to PostgreSQL. Note that the file permissions must be 600, or the file will not take effect.

[postgres@test1 /home/postgres]$more .pgpass
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
[postgres@test1 /home/postgres]$
[postgres@test /home/postgres]$more .pgpass
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
[postgres@test /home/postgres]$
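A sketch of creating the file with the required permissions, using the passwords from this example:

cat > ~/.pgpass <<'EOF'
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
EOF
chmod 600 ~/.pgpass     # libpq silently ignores the file without mode 600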

Setting up automatic startup of Pgpool-II is omitted; in this example it is started manually.

Create the pgpool_node_id files, with ids 0 and 1 respectively.
When the watchdog is enabled, pgpool needs to tell which machine is which, so pgpool_node_id must be set.

[postgres@test /opt/pgpool/etc]$more pgpool_node_id
0
[postgres@test /opt/pgpool/etc]$
[postgres@test1 /opt/pgpool/etc]$more pgpool_node_id
1
[postgres@test1 /opt/pgpool/etc]$
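Equivalently, the files can be created like this:

[postgres@test  /opt/pgpool/etc]$echo 0 > /opt/pgpool/etc/pgpool_node_id
[postgres@test1 /opt/pgpool/etc]$echo 1 > /opt/pgpool/etc/pgpool_node_id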

Note: the pgpool directory contains a pgpool.pid file that records the process id of the running pgpool; do not confuse it with pgpool_node_id, which identifies the node.

[postgres@test /opt/pgpool]$more pgpool.pid
14306
[postgres@test /opt/pgpool]$
[postgres@test1 /opt/pgpool]$more pgpool.pid
16948
[postgres@test1 /opt/pgpool]$

Pgpool-II configuration
Since Pgpool-II 4.2, all configuration parameters are identical on all hosts, so you can edit pgpool.conf on any pgpool node and copy the edited pgpool.conf file to the other pgpool nodes.

Clustering mode (start from the streaming replication sample):

[postgres@test  /opt/pgpool/etc]$cp pgpool.conf.sample-stream pgpool.conf
[postgres@test1 /opt/pgpool/etc]$cp pgpool.conf.sample-stream pgpool.conf

pgpool.conf settings:

listen_addresses = '*'         -- to let Pgpool-II accept all incoming connections
port = 9999                    -- the port the Pgpool-II listener accepts connections on
sr_check_user = 'postgres'     -- streaming replication check user
sr_check_password = 'oracle'   -- if left empty, Pgpool-II first tries to get this user's password from the pool_passwd file before falling back to an empty password
health_check_period = 5
health_check_timeout = 30
health_check_user = 'postgres'
health_check_password = 'oracle'
health_check_max_retries = 3

backend_hostname0 = '192.168.2.80'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/opt/PostgreSQL/10/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'

backend_hostname1 = '192.168.2.81'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/opt/PostgreSQL/10/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'

backend_application_name0 = 'test'
backend_application_name1 = 'test1'

failover_command = '/opt/pgpool/etc/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'  -- remember to adjust the PGHOME path inside the script
follow_primary_command = ''            -- with only two machines this script is unused, so it is left unset

## Specify the failover.sh script to execute after a failover in the failover_command parameter. With three PostgreSQL servers, follow_primary_command must also be set to run after a primary failover; with two PostgreSQL servers, the follow_primary_command setting is not needed.

Configure pcp.conf (not set in this example because the follow_primary_command script is unused; if you need it, it can be configured as follows)

Because the PCP commands used in the follow_primary_command script require user authentication, the username and md5-encrypted password must be specified in pcp.conf in the format "username:encrypted password". Use pg_md5 to create the encrypted password entry for the pgpool user, as follows:

[all servers]# echo 'postgres:'`pg_md5 <PCP password>` >> /opt/pgpool/etc/pcp.conf

# USERID:MD5PASSWD
postgres:a189c633d9995e11bf8607170ec9a4b8     -- the resulting entry looks like this

Because the follow_primary.sh script must execute PCP commands without entering a password, create .pcppass in the home directory of the Pgpool-II startup user (the postgres user) on every server.       -- not set in this example

[all servers]# su - postgres
[all servers]$ echo 'localhost:9898:postgres:<pgpool user password>' > ~/.pcppass
[all servers]$ chmod 600 ~/.pcppass

Note: follow_primary.sh does not support tablespaces; if tablespaces are in use, the script must be adapted.

Pgpool-II online recovery configuration (not used in this example)
Next, to perform online recovery with Pgpool-II, specify the PostgreSQL username and the online recovery command recovery_1st_stage. Because online recovery requires superuser privileges in PostgreSQL, the postgres user is specified in recovery_user. Then create recovery_1st_stage and pgpool_remote_start under the database cluster directory of the PostgreSQL primary server (server1) and add execute permission.   -- pgpool.conf settings:

recovery_user = 'postgres'
recovery_password = 'oracle'
recovery_1st_stage_command = 'recovery_1st_stage'  

Copy the sample files and adjust their permissions:

[postgres@test /opt/pgpool/etc]$cp recovery_1st_stage.sample recovery_1st_stage
[postgres@test /opt/pgpool/etc]$cp pgpool_remote_start.sample pgpool_remote_start

The online recovery feature requires the pgpool_recovery, pgpool_remote_start, and pgpool_switch_xlog functions, so the pgpool_recovery extension must be installed in template1 on PostgreSQL server server1.

[server1]# su - postgres
[server1]$ psql template1 -c "CREATE EXTENSION pgpool_recovery"   -- this installs the pgpool_recovery functions

-- Build and install pgpool_adm, pgpool-recovery, pgpool-regclass, etc. from the source tree:

[root@test /root/.ssh]$cd /postgres/pgpool-II-4.2.4/src/sql
[root@test /postgres/pgpool-II-4.2.4/src/sql]$ls
insert_lock.sql  Makefile  pgpool_adm  pgpool-recovery  pgpool-regclass
[root@test /postgres/pgpool-II-4.2.4/src/sql]$make install
[postgres@test /opt/pgpool/etc]$psql template1 -c "CREATE EXTENSION pgpool_recovery"
CREATE EXTENSION
[postgres@test /opt/pgpool/etc]$

Client authentication configuration
pool_hba authentication is disabled by default; set enable_pool_hba = on to enable it.

host    all         repl        0.0.0.0/0             md5
host    all         postgres    0.0.0.0/0             md5

Configure sudo for the postgres user; this is mainly needed to set the VIP address when the VIP fails over (a quick check follows the sudoers entries).

chmod u+w /etc/sudoers
vi /etc/sudoers
postgres ALL=NOPASSWD: /sbin/ip
postgres ALL=NOPASSWD: /usr/sbin/arping
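A quick check as the postgres user; it should run without a password prompt:

sudo /sbin/ip addr show dev ens33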

Watchdog configuration

use_watchdog = on
delegate_IP = '192.168.2.88'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev ens33 label ens33:0'    -- the NIC in this example is ens33
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev ens33'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I ens33'

Specify the information of all Pgpool-II nodes for the watchdog configuration:

hostname0 = '192.168.2.80'
wd_port0 = 9000
pgpool_port0 = 9999
hostname1 = '192.168.2.81'
wd_port1 = 9000
pgpool_port1 = 9999
wd_lifecheck_method = 'heartbeat'
wd_interval = 10

Specify the information of all Pgpool-II nodes that send and receive heartbeat signals:

heartbeat_hostname0 = '192.168.2.80'
heartbeat_port0 = 9694
heartbeat_device0 = ''
heartbeat_hostname1 = '192.168.2.81'
heartbeat_port1 = 9694
heartbeat_device1 = ''
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
wd_escalation_command = '/opt/pgpool/etc/escalation.sh'    -- adapt the contents of escalation.sh to the actual environment
enable_consensus_with_half_votes = on    -- needed when the number of watchdog nodes is even
pid_file_name = '/opt/pgpool/pgpool.pid'

log_destination = 'stderr'
logging_collector = on
log_directory = '/opt/pgpool/log'
log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 10MB

When Pgpool-II starts, if a pgpool_status file exists, Pgpool-II reads the backend statuses (up/down) from it.
To ignore the pgpool_status file at Pgpool-II startup, add "-D" to the startup options OPTS in /etc/sysconfig/pgpool.
[all servers]# vi /etc/sysconfig/pgpool     -- with a source-compiled install this file does not exist; either empty the pgpool_status file, or start pgpool with the -D option to ignore it
...
OPTS=" -D -n"

Start pgpool: as the postgres user, simply running pgpool starts it (a minimal sketch follows).
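A startup sketch: -D discards the stale pgpool_status file, matching the OPTS above; logging goes through the logging collector configured earlier. Run as postgres on both nodes:

[postgres@test  /home/postgres]$pgpool -D
[postgres@test1 /home/postgres]$pgpool -D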
The log after startup shows the heartbeat is working:

2021-08-31 11:09:21: pid 14310: LOG:  new IPC connection received
2021-08-31 11:09:31: pid 14310: LOG:  new IPC connection received
2021-08-31 11:09:41: pid 14310: LOG:  new IPC connection received
2021-08-31 11:09:52: pid 14310: LOG:  new IPC connection received
2021-08-31 11:10:02: pid 14310: LOG:  new IPC connection received
2021-08-31 11:10:12: pid 14310: LOG:  new IPC connection received
2021-08-31 11:10:22: pid 14310: LOG:  new IPC connection received

IP addresses after startup; the VIP sits on the primary (leader) node:

[postgres@test /home/postgres]$ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.80  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::14f:9066:9e0c:5f6d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:10:c6:5b  txqueuelen 1000  (Ethernet)
        RX packets 65786  bytes 8130219 (7.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 186681  bytes 363112722 (346.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.88  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 00:0c:29:10:c6:5b  txqueuelen 1000  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 68588  bytes 23916273 (22.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 68588  bytes 23916273 (22.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

-- Check the status of the primary and the standby:

[postgres@test /home/postgres]$psql -h 192.168.2.80 -p 9999
Password:
psql.bin (10.15)
Type "help" for help.postgres=# show pool_nodes;node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state| replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+------------------
-+------------------------+---------------------0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 |                  |                        | 2021-08-31 08:55:591       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 |                  |                        | 2021-08-31 08:55:59
(2 rows)postgres=# [postgres@test1 /home/postgres]$psql -h 192.168.2.81 -p 9999
Password:
psql.bin (10.15)
Type "help" for help.postgres=# show pool_nodes;node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state| replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+------------------
-+------------------------+---------------------0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 |                  |                        | 2021-08-31 08:57:041       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 |                  |                        | 2021-08-31 08:57:04
(2 rows)postgres=# 

Checking through the watchdog, the LEADER is 192.168.2.80:

[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.80:9999 Linux test 192.168.2.80

192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 4 LEADER
192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 7 STANDBY
[postgres@test /home/postgres]$

Stop pgpool on 192.168.2.80 and check the watchdog info again: the node on 192.168.2.80 now shows SHUTDOWN, the VIP has drifted to the other machine, and the LEADER is 192.168.2.81.

[postgres@test /home/postgres]$pgpool stop
2021-08-31 11:16:30: pid 26938: LOG:  stop request sent to pgpool. waiting for termination...
.done.
[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.81:9999 Linux test1 192.168.2.81

192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 4 LEADER
192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 10 SHUTDOWN
[postgres@test /home/postgres]$

[postgres@test1 /opt/pgpool/etc]$ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.81  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::626c:15d:e820:43fc  prefixlen 64  scopeid 0x20<link>
        inet6 fe80::14f:9066:9e0c:5f6d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:86:18:b7  txqueuelen 1000  (Ethernet)
        RX packets 186932  bytes 193331959 (184.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 62571  bytes 8315733 (7.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.88  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 00:0c:29:86:18:b7  txqueuelen 1000  (Ethernet)

Start pgpool again and re-check: 192.168.2.81 is still the leader, and 192.168.2.80 has rejoined as standby.

[postgres@test /home/postgres]$pgpool
[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.81:9999 Linux test1 192.168.2.81

192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 4 LEADER
192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 7 STANDBY
[postgres@test /home/postgres]$

Check the PostgreSQL primary/standby status: normal. This is actually independent of what the watchdog shows; the watchdog reports the state of the pgpool cluster, while show pool_nodes reports the database backends.

[postgres@test1 /opt/pgpool/etc]$psql -h 192.168.2.88 -p 9999 -U postgres postgres -c "show pool_nodes"
Password for user postgres:
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 08:57:04
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 08:57:04
(2 rows)

[postgres@test1 /opt/pgpool/etc]$

-- Failover test

With three PostgreSQL servers, when the primary is shut down a new primary is promoted. The shut-down primary becomes a standby; after it is started again it does not become primary but remains a standby, and it must be re-synchronized either through pgpool's online recovery or manually with pg_rewind.

-- Shut down the PostgreSQL database on node 0, i.e. 192.168.2.80. Node 0 now shows as down with role standby, and the new primary has switched to node 1, i.e. 192.168.2.81 is the new primary:

[postgres@test1 /opt/pgpool/etc]$psql -h 192.168.2.88 -p 9999 -U postgres postgres -c "show pool_nodes"
Password for user postgres:
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 12:44:49
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 12:44:49
(2 rows)

[postgres@test1 /opt/pgpool/etc]$

-- Meanwhile, the pgpool logs show:

-- Log on node id 0: the primary node has become 192.168.2.81

2021-08-31 12:44:33: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:43: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 27100: LOG:  failed to connect to PostgreSQL server on "192.168.2.80:5432", getsockopt() failed
2021-08-31 12:44:49: pid 27100: DETAIL:  Operation now in progress
2021-08-31 12:44:49: pid 27100: ERROR:  failed to make persistent db connection
2021-08-31 12:44:49: pid 27100: DETAIL:  connection to host:"192.168.2.80:5432" failed
2021-08-31 12:44:49: pid 27100: LOG:  health check retrying on DB node: 0 (round:1)
2021-08-31 12:44:49: pid 27061: LOG:  signal_user1_to_parent_with_reason(2)
2021-08-31 12:44:49: pid 27058: LOG:  Pgpool-II parent process received SIGUSR1
2021-08-31 12:44:49: pid 27058: LOG:  Pgpool-II parent process received sync backend signal from watchdog
2021-08-31 12:44:49: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 27058: LOG:  leader watchdog has performed failover
2021-08-31 12:44:49: pid 27058: DETAIL:  syncing the backend states from the LEADER watchdog node
2021-08-31 12:44:49: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 27061: LOG:  received the get data request from local pgpool-II on IPC interface
2021-08-31 12:44:49: pid 27061: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27061: DETAIL:  waiting for the reply...
2021-08-31 12:44:49: pid 27058: LOG:  leader watchdog node "192.168.2.81:9999 Linux test1" returned status for 2 backend nodes
2021-08-31 12:44:49: pid 27058: LOG:  backend:0 is set to down status
2021-08-31 12:44:49: pid 27058: DETAIL:  backend:0 is DOWN on cluster leader "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27058: LOG:  primary node:1 on leader watchdog node "192.168.2.81:9999 Linux test1" is different from local primary node:0
2021-08-31 12:44:49: pid 27058: LOG:  primary node was changed after the sync from "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27058: DETAIL:  all children needs to be restarted
2021-08-31 12:44:49: pid 27058: LOG:  child process with pid: 27066 exits with status 0
2021-08-31 12:44:49: pid 27058: LOG:  child process with pid: 27067 exits with status 0

-- On the node id 1 machine, the log shows the new primary was promoted successfully:

2021-08-31 12:44:49: pid 16951: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 16951: LOG:  received the failover indication from Pgpool-II on IPC interface
2021-08-31 12:44:49: pid 16951: LOG:  watchdog is informed of failover start by the main process
2021-08-31 12:44:49: pid 16948: LOG:  starting degeneration. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:49: pid 16948: LOG:  Restart all children
2021-08-31 12:44:49: pid 16948: LOG:  execute command: /opt/pgpool/etc/failover.sh 0 192.168.2.80 5432 /opt/PostgreSQL/10/data 1 192.168.2.81 0 0 5432 /opt/PostgreSQL/10/data 192.168.2.80 5432
+ FAILED_NODE_ID=0
+ FAILED_NODE_HOST=192.168.2.80
+ FAILED_NODE_PORT=5432
+ FAILED_NODE_PGDATA=/opt/PostgreSQL/10/data
+ NEW_MAIN_NODE_ID=1
+ NEW_MAIN_NODE_HOST=192.168.2.81
+ OLD_MAIN_NODE_ID=0
+ OLD_PRIMARY_NODE_ID=0
+ NEW_MAIN_NODE_PORT=5432
+ NEW_MAIN_NODE_PGDATA=/opt/PostgreSQL/10/data
+ OLD_PRIMARY_NODE_HOST=192.168.2.80
+ OLD_PRIMARY_NODE_PORT=5432
+ PGHOME=/opt/PostgreSQL/10
+ REPL_SLOT_NAME=192_168_2_80
+ echo failover.sh: start: failed_node_id=0 failed_host=192.168.2.80 old_primary_node_id=0 new_main_node_id=1 new_main_host=192.168.2.81
failover.sh: start: failed_node_id=0 failed_host=192.168.2.80 old_primary_node_id=0 new_main_node_id=1 new_main_host=192.168.2.81
+ '[' 1 -lt 0 ']'
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.2.81 -i /home/postgres/.ssh/id_rsa_pgpool ls /tmp
Warning: Permanently added '192.168.2.81' (ECDSA) to the list of known hosts.
+ '[' 0 -ne 0 ']'
+ '[' 0 -ne 0 ']'
+ echo failover.sh: primary node is down, promote new_main_node_id=1 on 192.168.2.81.
failover.sh: primary node is down, promote new_main_node_id=1 on 192.168.2.81.
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.2.81 -i /home/postgres/.ssh/id_rsa_pgpool /opt/PostgreSQL/10/bin/pg_ctl -D /opt/PostgreSQL/10/data -w promote
Warning: Permanently added '192.168.2.81' (ECDSA) to the list of known hosts.
waiting for server to promote.... done
server promoted
+ '[' 0 -ne 0 ']'
+ echo failover.sh: end: new_main_node_id=1 on 192.168.2.81 is promoted to a primary
failover.sh: end: new_main_node_id=1 on 192.168.2.81 is promoted to a primary
+ exit 0
2021-08-31 12:44:49: pid 16948: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2021-08-31 12:44:49: pid 16948: LOG:  find_primary_node: primary node is 1
2021-08-31 12:44:49: pid 16948: LOG:  failover: set new primary node: 1
2021-08-31 12:44:49: pid 16948: LOG:  failover: set new main node: 1
2021-08-31 12:44:49: pid 16951: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 16951: LOG:  received the failover indication from Pgpool-II on IPC interface
2021-08-31 12:44:49: pid 16951: LOG:  watchdog is informed of failover end by the main process
2021-08-31 12:44:49: pid 17074: LOG:  worker process received restart request
failover done. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:49: pid 16948: LOG:  failover done. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:50: pid 17073: LOG:  restart request received in pcp child process
2021-08-31 12:44:50: pid 16948: LOG:  PCP child 17073 exits with status 0 in failover()
2021-08-31 12:44:50: pid 16948: LOG:  fork a new PCP child pid 38973 in failover()
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17041 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17042 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17043 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17044 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17045 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17046 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17047 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17048 exits with status 256
2021-08-31 12:44:50: pid 38973: LOG:  PCP process: 38973 started
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17049 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17050 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17051 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17052 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17053 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17054 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17055 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17056 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17057 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17058 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17059 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17060 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17061 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17062 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17063 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17064 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17065 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17066 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17067 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17068 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17069 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17070 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17071 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  worker child process with pid: 17074 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  fork a new worker child process with pid: 38974
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 31462 exits with status 256
2021-08-31 12:44:50: pid 38974: LOG:  process started
2021-08-31 12:44:50: pid 16951: LOG:  new IPC connection received
2021-08-31 12:44:52: pid 16951: LOG:  watchdog received the failover command from remote pgpool-II node "192.168.2.80:9999 Linux test"
2021-08-31 12:44:52: pid 16951: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from 192.168.2.80:9999 Linux test
2021-08-31 12:44:52: pid 16951: LOG:  we have got the consensus to perform the failover
2021-08-31 12:44:52: pid 16951: DETAIL:  1 node(s) voted in the favor
2021-08-31 12:44:52: pid 16951: LOG:  invalid degenerate backend request, node id : 0 status: [3] is not valid for failover
2021-08-31 12:45:00: pid 16951: LOG:  new IPC connection received
2021-08-31 12:45:11: pid 16951: LOG:  new IPC connection received

-- Start the PostgreSQL instance that was shut down:

[postgres@test /home/postgres]$pg_ctl start
waiting for server to start....2021-08-31 12:51:52.130 CST [35222] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-08-31 12:51:52.130 CST [35222] LOG:  listening on IPv6 address "::", port 5432
2021-08-31 12:51:52.133 CST [35222] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-08-31 12:51:52.169 CST [35222] LOG:  redirecting log output to logging collector process
2021-08-31 12:51:52.169 CST [35222] HINT:  Future log output will appear in directory "log".
 done
server started
[postgres@test /home/postgres]$

-- At this point, pg_controldata shows the restarted instance is still "in production". That is normal: nothing told it about the new primary/standby roles. When the old primary was shut down, only the standby was told to promote; the old primary was never reconfigured. Online recovery is needed to repair the primary/standby relationship.

[postgres@test /home/postgres]$pcp_recovery_node -h 192.168.2.88 -p 9898 -U postgres -n 0
Password:
ERROR:  executing recovery, execution of command failed at "1st stage"
DETAIL:  command:"recovery_1st_stage"
[postgres@test /home/postgres]$

pcp_recovery_node failed, and the cause has not been found yet. Judging from the official documentation, the new primary-standby relationship still has to be repaired from the command line. The guess is that pcp_recovery_node calls the recovery_1st_stage script to rebuild the standby with pg_basebackup, then calls pgpool_remote_start to start the standby remotely, and then follow_primary.sh to join the node to pgpool. In short, after the old primary becomes the new standby, some manual handling still seems required; pgpool does not handle it fully automatically.

Here pg_rewind is used to repair the new standby, and then the node is added back with the pgpool tools (the full sequence is recapped after the transcript below).

[postgres@test /opt/pgpool/etc]$pg_ctl stop
waiting for server to shut down.... done
server stopped
[postgres@test /opt/pgpool/etc]$pg_rewind --target-pgdata=$PGDATA --source-server='host=192.168.2.81 port=5432 user=postgres password=oracle' -P
connected to server
servers diverged at WAL location 0/17000098 on timeline 3
rewinding from last common checkpoint at 0/17000028 on timeline 3
reading source file list
reading target file list
reading WAL in target
need to copy 200 MB (total source directory size is 253 MB)
205115/205115 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
[postgres@test /opt/pgpool/etc]$

[postgres@test /opt/PostgreSQL/10/data]$mv recovery.done recovery.conf
[postgres@test /opt/PostgreSQL/10/data]$vi recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=repl password=oracle host=192.168.2.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[postgres@test /home/postgres]$pcp_attach_node -h 192.168.2.80 -p 9898 -U postgres -n 0 -v
Password:
pcp_attach_node -- Command Successful
[postgres@test /home/postgres]$
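The whole manual repair, recapped as one hedged sketch (note the pg_ctl start between editing recovery.conf and pcp_attach_node, which is implied above):

# On the failed old primary (test), as postgres:
pg_ctl stop
pg_rewind --target-pgdata=$PGDATA \
    --source-server='host=192.168.2.81 port=5432 user=postgres password=oracle' -P

# Turn the node back into a standby that follows the new primary.
cd $PGDATA
mv recovery.done recovery.conf    # then point primary_conninfo at 192.168.2.81
pg_ctl start

# Tell pgpool the node is usable again.
pcp_attach_node -h 192.168.2.80 -p 9898 -U postgres -n 0 -v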

-- After attaching, checking with psql -h 192.168.2.80 -p 9999 and psql -h 192.168.2.81 -p 9999 shows one of the nodes as waiting; after restarting pgpool, all nodes are UP.

postgres=# show pool_nodes;
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 13:53:38
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 13:53:38
(2 rows)

postgres=# show pool_nodes;
 node_id |   hostname   | port | status  | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+---------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | waiting | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 13:52:28
 1       | 192.168.2.81 | 5432 | up      | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 12:44:49
(2 rows)

postgres=#

END
