Reference documents:

Pgpool-II + Watchdog Setup Example

https://www.pgpool.net/docs/latest/en/html/example-configs.html
https://www.pgpool.net/docs/latest/en/html/example-cluster.html

Features verified in testing:
1. When pgpool fails on one of the nodes, the VIP automatically drifts to the other node, i.e. the watchdog works.
2. When the primary database is shut down, the standby is promoted to primary, i.e. failover works. There does not appear to be a switchover function.
3. The failed former primary has to be rebuilt (either with pgpool's tools or with pg_rewind).

Features not verified:
1. Online recovery with the pgpool tool did not succeed; the pg_rewind command was used instead.

Host information:

test  : 192.168.2.80
test1 : 192.168.2.81
VIP   : 192.168.2.88

PostgreSQL version and configuration

PostgreSQL Version    10.15
port                  5432
$PGDATA               /opt/PostgreSQL/10/data
Archive mode          ON
Replication Slots     Enable     -- not actually configured in this setup
Start automatically   Enable     -- already in place after installing PostgreSQL from source

Pgpool-II version and configuration
Pgpool-II Version    4.2.0
Ports:

9999               Pgpool-II accepts connections
9898                  PCP process accepts connections
9000                  watchdog accepts connections
9694                  UDP port for receiving Watchdog's heartbeat signal
Config file           /opt/pgpool/etc/pgpool.conf
Startup user          postgres   (the default since Pgpool-II 4.1; in Pgpool-II 4.0 and earlier the default startup user is root)
Running mode          streaming replication mode
Watchdog              on (life check method: heartbeat)
Start automatically   Enable     -- not configured here; pgpool is started manually in this example

Scripts

failover:
/opt/pgpool/etc/failover.sh         -- run via failover_command to perform the failover
/opt/pgpool/etc/follow_primary.sh   -- run via follow_primary_command to synchronize the standbys with the new primary after a failover

The contents of failover.sh can be taken from the sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/failover.sh.sample;hb=refs/heads/V4_2_STABLE
The failover script is simple: it promotes the standby to the new primary with pg_ctl promote; the old primary is left untouched. A minimal sketch of that logic follows.
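A minimal sketch under this setup's paths and key name (the official sample linked above adds fuller argument handling and error checks):

#!/bin/bash
# failover.sh (sketch): promote the surviving standby when the primary fails.
# Pgpool-II passes these values via the %-placeholders in failover_command:
# %d %h %p %D %m %H %M %P %r %R %N %S
FAILED_NODE_ID="$1"          # %d
NEW_MAIN_NODE_HOST="$6"      # %H
OLD_PRIMARY_NODE_ID="$8"     # %P
PGHOME=/opt/PostgreSQL/10

# If a standby (not the primary) failed, there is nothing to promote.
if [ "$FAILED_NODE_ID" -ne "$OLD_PRIMARY_NODE_ID" ]; then
    exit 0
fi

# Promote the new main node over passwordless ssh.
ssh -T -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa_pgpool \
    postgres@"$NEW_MAIN_NODE_HOST" \
    "$PGHOME/bin/pg_ctl -D /opt/PostgreSQL/10/data -w promote"
exit $?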

The follow_primary.sh sample: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/follow_primary.sh.sample;hb=refs/heads/V4_2_STABLE
follow_primary.sh must be configured when three PostgreSQL servers are used; with two PostgreSQL servers it is not needed.
Its purpose: with three servers, after a failover one standby is promoted to primary, and the remaining standby then has to be re-synchronized with the new primary, which this script does, mainly via pg_rewind. (With only two servers, once the primary goes down and the standby is promoted there is no standby left, so the script is unnecessary.)

online recovery:
/opt/pgpool/etc/recovery_1st_stage    -- run via recovery_1st_stage_command to recover a standby node
/opt/pgpool/etc/pgpool_remote_start   -- run after recovery_1st_stage_command to start the standby node

The recovery_1st_stage sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/recovery_1st_stage.sample;hb=refs/heads/V4_2_STABLE
This script builds a standby online with pg_basebackup; it is also used to recreate the standby when pg_rewind cannot synchronize its data.
The pgpool_remote_start sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/pgpool_remote_start.sample;hb=refs/heads/V4_2_STABLE
Once the new standby has been created, pgpool_remote_start starts it remotely via pg_ctl, as sketched below.
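A minimal sketch, assuming the same paths and key as elsewhere in this setup (the official sample adds error handling):

#!/bin/bash
# pgpool_remote_start (sketch): start the freshly created standby via ssh.
# Pgpool-II invokes it as: pgpool_remote_start <remote_host> <remote_pgdata>
DEST_NODE_HOST="$1"
DEST_NODE_PGDATA="$2"
PGHOME=/opt/PostgreSQL/10

ssh -T -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa_pgpool \
    postgres@"$DEST_NODE_HOST" \
    "$PGHOME/bin/pg_ctl -w start -D $DEST_NODE_PGDATA"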

Watchdog:
/opt/pgpool/etc/escalation.sh   -- run via wd_escalation_command to switch the active/standby Pgpool-II safely
The escalation.sh sample:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/escalation.sh.sample;hb=refs/heads/V4_2_STABLE
Its main job is to bring the VIP down on the old node and then bring it up on the new node, as sketched below.
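A minimal sketch under this setup's assumptions (hosts test/test1, VIP 192.168.2.88 on ens33); the official sample loops over all configured pgpool hosts in the same way:

#!/bin/bash
# escalation.sh (sketch): before the new leader acquires the VIP, make sure
# no other pgpool node still holds it.
POOLS=(test test1)      # all pgpool node hostnames in this setup
VIP=192.168.2.88
DEVICE=ens33

for pool in "${POOLS[@]}"; do
    [ "$pool" = "$(hostname)" ] && continue    # only touch the other nodes
    ssh -T -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa_pgpool \
        postgres@"$pool" \
        "/usr/bin/sudo /sbin/ip addr del $VIP/24 dev $DEVICE"
done
exit 0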

Installing the PostgreSQL databases (omitted)

Edit the configuration file $PGDATA/postgresql.conf on server1 (the primary) as follows. Enable wal_log_hints so that pg_rewind can be used. Because the primary may later become a standby, set hot_standby = on.

postgresql.conf settings:

listen_addresses = '*'
archive_mode = on
archive_command = 'cp %p /postgres/archive/%f'
max_wal_senders = 10
max_replication_slots = 10
wal_level = replica
hot_standby = on
wal_log_hints = on

After the primary server is running, the standby could be set up with Pgpool-II's online recovery feature. That step is omitted here: the standby was created manually with pg_basebackup (a sketch follows), and replication runs as the repl user.
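A sketch of the manual build on test1, assuming the paths above and the repl user's password oracle as used later in this example (on PostgreSQL 10 the standby settings go in recovery.conf):

# On test1 (the standby), with PostgreSQL stopped and $PGDATA empty:
pg_basebackup -h 192.168.2.80 -p 5432 -U repl -D /opt/PostgreSQL/10/data -X stream -P

# PostgreSQL 10: create $PGDATA/recovery.conf before starting the standby.
cat > /opt/PostgreSQL/10/data/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=192.168.2.80 port=5432 user=repl password=oracle'
EOF

pg_ctl -D /opt/PostgreSQL/10/data start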

Configure pg_hba.conf

host    replication     repl            192.168.2.80/24         md5
host    replication     repl            192.168.2.81/24         md5

Set up passwordless ssh trust between the nodes for both the root and postgres users (process omitted; similar to the commands below). The setup works when "ssh server1 date" runs without prompting for a password.

[all servers]# cd ~/.ssh
[all servers]# ssh-keygen -t rsa -f id_rsa_pgpool
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server1
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server2
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server3

[all servers]# su - postgres
[all servers]$ cd ~/.ssh
[all servers]$ ssh-keygen -t rsa -f id_rsa_pgpool
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server1
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server2
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server3
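A quick verification with the dedicated key; each command should print the remote date without any password prompt:

ssh -i ~/.ssh/id_rsa_pgpool postgres@server1 date
ssh -i ~/.ssh/id_rsa_pgpool postgres@server2 date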

Edit the .pgpass file to allow passwordless logins to PostgreSQL. Note that the file permissions must be 600, or the file will not take effect.

[postgres@test1 /home/postgres]$more .pgpass
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
[postgres@test1 /home/postgres]$
[postgres@test /home/postgres]$more .pgpass
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
[postgres@test /home/postgres]$
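A sketch of creating the file with the required permissions, using the passwords from this example:

cat > ~/.pgpass <<'EOF'
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
EOF
chmod 600 ~/.pgpass     # libpq silently ignores the file without mode 600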

Setting up automatic startup of Pgpool-II is omitted; in this example it is started manually.

Create the pgpool_node_id files, with ids 0 and 1 respectively.
When the watchdog is enabled, pgpool needs to tell which machine is which, so pgpool_node_id must be set.

[postgres@test /opt/pgpool/etc]$more pgpool_node_id
0
[postgres@test /opt/pgpool/etc]$
[postgres@test1 /opt/pgpool/etc]$more pgpool_node_id
1
[postgres@test1 /opt/pgpool/etc]$
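Equivalently, the files can be created like this:

[postgres@test  /opt/pgpool/etc]$echo 0 > /opt/pgpool/etc/pgpool_node_id
[postgres@test1 /opt/pgpool/etc]$echo 1 > /opt/pgpool/etc/pgpool_node_id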

Note: the pgpool directory contains a pgpool.pid file that records the process id of the running pgpool; do not confuse it with pgpool_node_id, which identifies the node.

[postgres@test /opt/pgpool]$more pgpool.pid
14306
[postgres@test /opt/pgpool]$
[postgres@test1 /opt/pgpool]$more pgpool.pid
16948
[postgres@test1 /opt/pgpool]$

Pgpool-II configuration
Since Pgpool-II 4.2, all configuration parameters are identical on all hosts, so you can edit pgpool.conf on any pgpool node and copy the edited pgpool.conf file to the other pgpool nodes.

Clustering mode (start from the streaming replication sample):

[postgres@test  /opt/pgpool/etc]$cp pgpool.conf.sample-stream pgpool.conf
[postgres@test1 /opt/pgpool/etc]$cp pgpool.conf.sample-stream pgpool.conf

pgpool.conf settings:

listen_addresses = '*'         -- to let Pgpool-II accept all incoming connections
port = 9999                    -- the port the Pgpool-II listener accepts connections on
sr_check_user = 'postgres'     -- streaming replication check user
sr_check_password = 'oracle'   -- if left empty, Pgpool-II first tries to get this user's password from the pool_passwd file before falling back to an empty password
health_check_period = 5
health_check_timeout = 30
health_check_user = 'postgres'
health_check_password = 'oracle'
health_check_max_retries = 3

backend_hostname0 = '192.168.2.80'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/opt/PostgreSQL/10/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'

backend_hostname1 = '192.168.2.81'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/opt/PostgreSQL/10/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'

backend_application_name0 = 'test'
backend_application_name1 = 'test1'

failover_command = '/opt/pgpool/etc/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'  -- remember to adjust the PGHOME path inside the script
follow_primary_command = ''            -- with only two machines this script is unused, so it is left unset

## Specify the failover.sh script to execute after a failover in the failover_command parameter. With three PostgreSQL servers, follow_primary_command must also be set to run after a primary failover; with two PostgreSQL servers, the follow_primary_command setting is not needed.

Configure pcp.conf (not set in this example because the follow_primary_command script is unused; if you need it, it can be configured as follows)

Because the PCP commands used in the follow_primary_command script require user authentication, the username and md5-encrypted password must be specified in pcp.conf in the format "username:encrypted password". Use pg_md5 to create the encrypted password entry for the pgpool user, as follows:

[all servers]# echo 'postgres:'`pg_md5 <PCP password>` >> /opt/pgpool/etc/pcp.conf

# USERID:MD5PASSWD
postgres:a189c633d9995e11bf8607170ec9a4b8     -- the resulting entry looks like this

Because the follow_primary.sh script must execute PCP commands without entering a password, create .pcppass in the home directory of the Pgpool-II startup user (the postgres user) on every server.       -- not set in this example

[all servers]# su - postgres
[all servers]$ echo 'localhost:9898:postgres:<pgpool user password>' > ~/.pcppass
[all servers]$ chmod 600 ~/.pcppass

Note: follow_primary.sh does not support tablespaces; if tablespaces are in use, the script must be adapted.

Pgpool-II online recovery configuration (not used in this example)
Next, to perform online recovery with Pgpool-II, specify the PostgreSQL username and the online recovery command recovery_1st_stage. Because online recovery requires superuser privileges in PostgreSQL, the postgres user is specified in recovery_user. Then create recovery_1st_stage and pgpool_remote_start under the database cluster directory of the PostgreSQL primary server (server1) and add execute permission.   -- pgpool.conf settings:

recovery_user = 'postgres'
recovery_password = 'oracle'
recovery_1st_stage_command = 'recovery_1st_stage'  

Copy the sample files and adjust their permissions:

[postgres@test /opt/pgpool/etc]$cp recovery_1st_stage.sample recovery_1st_stage
[postgres@test /opt/pgpool/etc]$cp pgpool_remote_start.sample pgpool_remote_start

The online recovery feature requires the pgpool_recovery, pgpool_remote_start, and pgpool_switch_xlog functions, so the pgpool_recovery extension must be installed in template1 on PostgreSQL server server1.

[server1]# su - postgres
[server1]$ psql template1 -c "CREATE EXTENSION pgpool_recovery"   -- this installs the pgpool_recovery functions

-- Build and install pgpool_adm, pgpool-recovery, pgpool-regclass, etc. from the source tree:

[root@test /root/.ssh]$cd /postgres/pgpool-II-4.2.4/src/sql
[root@test /postgres/pgpool-II-4.2.4/src/sql]$ls
insert_lock.sql  Makefile  pgpool_adm  pgpool-recovery  pgpool-regclass
[root@test /postgres/pgpool-II-4.2.4/src/sql]$make install
[postgres@test /opt/pgpool/etc]$psql template1 -c "CREATE EXTENSION pgpool_recovery"
CREATE EXTENSION
[postgres@test /opt/pgpool/etc]$

Client authentication configuration
pool_hba authentication is disabled by default; set enable_pool_hba = on to enable it.

host    all         repl        0.0.0.0/0             md5
host    all         postgres    0.0.0.0/0             md5

Configure sudo for the postgres user; this is mainly needed to set the VIP address when the VIP fails over (a quick check follows the sudoers entries).

chmod u+w /etc/sudoers
vi /etc/sudoers
postgres ALL=NOPASSWD: /sbin/ip
postgres ALL=NOPASSWD: /usr/sbin/arping
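A quick check as the postgres user; it should run without a password prompt:

sudo /sbin/ip addr show dev ens33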

Watchdog configuration

use_watchdog = on
delegate_IP = '192.168.2.88'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev ens33 label ens33:0'    -- the NIC in this example is ens33
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev ens33'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I ens33'

Specify the information of all Pgpool-II nodes for the watchdog configuration:

hostname0 = '192.168.2.80'
wd_port0 = 9000
pgpool_port0 = 9999
hostname1 = '192.168.2.81'
wd_port1 = 9000
pgpool_port1 = 9999
wd_lifecheck_method = 'heartbeat'
wd_interval = 10

Specify the information of all Pgpool-II nodes that send and receive heartbeat signals:

heartbeat_hostname0 = '192.168.2.80'
heartbeat_port0 = 9694
heartbeat_device0 = ''
heartbeat_hostname1 = '192.168.2.81'
heartbeat_port1 = 9694
heartbeat_device1 = ''
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
wd_escalation_command = '/opt/pgpool/etc/escalation.sh'    -- adapt the contents of escalation.sh to the actual environment
enable_consensus_with_half_votes = on    -- needed when the number of watchdog nodes is even
pid_file_name = '/opt/pgpool/pgpool.pid'

log_destination = 'stderr'
logging_collector = on
log_directory = '/opt/pgpool/log'
log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 10MB

When Pgpool-II starts, if a pgpool_status file exists, Pgpool-II reads the backend statuses (up/down) from it.
To ignore the pgpool_status file at Pgpool-II startup, add "-D" to the startup options OPTS in /etc/sysconfig/pgpool.
[all servers]# vi /etc/sysconfig/pgpool     -- with a source-compiled install this file does not exist; either empty the pgpool_status file, or start pgpool with the -D option to ignore it
...
OPTS=" -D -n"

Start pgpool: as the postgres user, simply running pgpool starts it (a minimal sketch follows).
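A startup sketch: -D discards the stale pgpool_status file, matching the OPTS above; logging goes through the logging collector configured earlier. Run as postgres on both nodes:

[postgres@test  /home/postgres]$pgpool -D
[postgres@test1 /home/postgres]$pgpool -D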
The log after startup shows the heartbeat is working:

2021-08-31 11:09:21: pid 14310: LOG:  new IPC connection received
2021-08-31 11:09:31: pid 14310: LOG:  new IPC connection received
2021-08-31 11:09:41: pid 14310: LOG:  new IPC connection received
2021-08-31 11:09:52: pid 14310: LOG:  new IPC connection received
2021-08-31 11:10:02: pid 14310: LOG:  new IPC connection received
2021-08-31 11:10:12: pid 14310: LOG:  new IPC connection received
2021-08-31 11:10:22: pid 14310: LOG:  new IPC connection received

IP addresses after startup; the VIP sits on the primary (leader) node:

[postgres@test /home/postgres]$ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.80  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::14f:9066:9e0c:5f6d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:10:c6:5b  txqueuelen 1000  (Ethernet)
        RX packets 65786  bytes 8130219 (7.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 186681  bytes 363112722 (346.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.88  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 00:0c:29:10:c6:5b  txqueuelen 1000  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 68588  bytes 23916273 (22.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 68588  bytes 23916273 (22.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

-- Check the status of the primary and the standby:

[postgres@test /home/postgres]$psql -h 192.168.2.80 -p 9999
Password:
psql.bin (10.15)
Type "help" for help.postgres=# show pool_nodes;node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state| replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+------------------
-+------------------------+---------------------0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 |                  |                        | 2021-08-31 08:55:591       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 |                  |                        | 2021-08-31 08:55:59
(2 rows)postgres=# [postgres@test1 /home/postgres]$psql -h 192.168.2.81 -p 9999
Password:
psql.bin (10.15)
Type "help" for help.postgres=# show pool_nodes;node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state| replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+------------------
-+------------------------+---------------------0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 |                  |                        | 2021-08-31 08:57:041       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 |                  |                        | 2021-08-31 08:57:04
(2 rows)postgres=# 

Checking through the watchdog, the LEADER is 192.168.2.80:

[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.80:9999 Linux test 192.168.2.80

192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 4 LEADER
192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 7 STANDBY
[postgres@test /home/postgres]$

Stop pgpool on 192.168.2.80 and check the watchdog info again: the node on 192.168.2.80 now shows SHUTDOWN, the VIP has drifted to the other machine, and the LEADER is 192.168.2.81.

[postgres@test /home/postgres]$pgpool stop
2021-08-31 11:16:30: pid 26938: LOG:  stop request sent to pgpool. waiting for termination...
.done.
[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.81:9999 Linux test1 192.168.2.81

192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 4 LEADER
192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 10 SHUTDOWN
[postgres@test /home/postgres]$

[postgres@test1 /opt/pgpool/etc]$ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.81  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::626c:15d:e820:43fc  prefixlen 64  scopeid 0x20<link>
        inet6 fe80::14f:9066:9e0c:5f6d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:86:18:b7  txqueuelen 1000  (Ethernet)
        RX packets 186932  bytes 193331959 (184.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 62571  bytes 8315733 (7.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.88  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 00:0c:29:86:18:b7  txqueuelen 1000  (Ethernet)

Start pgpool again and re-check: 192.168.2.81 is still the leader, and 192.168.2.80 has rejoined as standby.

[postgres@test /home/postgres]$pgpool
[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.81:9999 Linux test1 192.168.2.81

192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 4 LEADER
192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 7 STANDBY
[postgres@test /home/postgres]$

Check the PostgreSQL primary/standby status: normal. This is actually independent of what the watchdog shows; the watchdog reports the state of the pgpool cluster, while show pool_nodes reports the database backends.

[postgres@test1 /opt/pgpool/etc]$psql -h 192.168.2.88 -p 9999 -U postgres postgres -c "show pool_nodes"
Password for user postgres:
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 08:57:04
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 08:57:04
(2 rows)

[postgres@test1 /opt/pgpool/etc]$

-- Failover test

With three PostgreSQL servers, when the primary is shut down a new primary is promoted. The shut-down primary becomes a standby; after it is started again it does not become primary but remains a standby, and it must be re-synchronized either through pgpool's online recovery or manually with pg_rewind.

-- Shut down the PostgreSQL database on node 0, i.e. 192.168.2.80. Node 0 now shows as down with role standby, and the new primary has switched to node 1, i.e. 192.168.2.81 is the new primary:

[postgres@test1 /opt/pgpool/etc]$psql -h 192.168.2.88 -p 9999 -U postgres postgres -c "show pool_nodes"
Password for user postgres:
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 12:44:49
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 12:44:49
(2 rows)

[postgres@test1 /opt/pgpool/etc]$

-- Meanwhile, the pgpool logs show:

-- Log on node id 0: the primary node has become 192.168.2.81

2021-08-31 12:44:33: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:43: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 27100: LOG:  failed to connect to PostgreSQL server on "192.168.2.80:5432", getsockopt() failed
2021-08-31 12:44:49: pid 27100: DETAIL:  Operation now in progress
2021-08-31 12:44:49: pid 27100: ERROR:  failed to make persistent db connection
2021-08-31 12:44:49: pid 27100: DETAIL:  connection to host:"192.168.2.80:5432" failed
2021-08-31 12:44:49: pid 27100: LOG:  health check retrying on DB node: 0 (round:1)
2021-08-31 12:44:49: pid 27061: LOG:  signal_user1_to_parent_with_reason(2)
2021-08-31 12:44:49: pid 27058: LOG:  Pgpool-II parent process received SIGUSR1
2021-08-31 12:44:49: pid 27058: LOG:  Pgpool-II parent process received sync backend signal from watchdog
2021-08-31 12:44:49: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 27058: LOG:  leader watchdog has performed failover
2021-08-31 12:44:49: pid 27058: DETAIL:  syncing the backend states from the LEADER watchdog node
2021-08-31 12:44:49: pid 27061: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 27061: LOG:  received the get data request from local pgpool-II on IPC interface
2021-08-31 12:44:49: pid 27061: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27061: DETAIL:  waiting for the reply...
2021-08-31 12:44:49: pid 27058: LOG:  leader watchdog node "192.168.2.81:9999 Linux test1" returned status for 2 backend nodes
2021-08-31 12:44:49: pid 27058: LOG:  backend:0 is set to down status
2021-08-31 12:44:49: pid 27058: DETAIL:  backend:0 is DOWN on cluster leader "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27058: LOG:  primary node:1 on leader watchdog node "192.168.2.81:9999 Linux test1" is different from local primary node:0
2021-08-31 12:44:49: pid 27058: LOG:  primary node was changed after the sync from "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27058: DETAIL:  all children needs to be restarted
2021-08-31 12:44:49: pid 27058: LOG:  child process with pid: 27066 exits with status 0
2021-08-31 12:44:49: pid 27058: LOG:  child process with pid: 27067 exits with status 0

-- On the node id 1 machine, the log shows the new primary was promoted successfully:

2021-08-31 12:44:49: pid 16951: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 16951: LOG:  received the failover indication from Pgpool-II on IPC interface
2021-08-31 12:44:49: pid 16951: LOG:  watchdog is informed of failover start by the main process
2021-08-31 12:44:49: pid 16948: LOG:  starting degeneration. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:49: pid 16948: LOG:  Restart all children
2021-08-31 12:44:49: pid 16948: LOG:  execute command: /opt/pgpool/etc/failover.sh 0 192.168.2.80 5432 /opt/PostgreSQL/10/data 1 192.168.2.81 0 0 5432 /opt/PostgreSQL/10/data 192.168.2.80 5432
+ FAILED_NODE_ID=0
+ FAILED_NODE_HOST=192.168.2.80
+ FAILED_NODE_PORT=5432
+ FAILED_NODE_PGDATA=/opt/PostgreSQL/10/data
+ NEW_MAIN_NODE_ID=1
+ NEW_MAIN_NODE_HOST=192.168.2.81
+ OLD_MAIN_NODE_ID=0
+ OLD_PRIMARY_NODE_ID=0
+ NEW_MAIN_NODE_PORT=5432
+ NEW_MAIN_NODE_PGDATA=/opt/PostgreSQL/10/data
+ OLD_PRIMARY_NODE_HOST=192.168.2.80
+ OLD_PRIMARY_NODE_PORT=5432
+ PGHOME=/opt/PostgreSQL/10
+ REPL_SLOT_NAME=192_168_2_80
+ echo failover.sh: start: failed_node_id=0 failed_host=192.168.2.80 old_primary_node_id=0 new_main_node_id=1 new_main_host=192.168.2.81
failover.sh: start: failed_node_id=0 failed_host=192.168.2.80 old_primary_node_id=0 new_main_node_id=1 new_main_host=192.168.2.81
+ '[' 1 -lt 0 ']'
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.2.81 -i /home/postgres/.ssh/id_rsa_pgpool ls /tmp
Warning: Permanently added '192.168.2.81' (ECDSA) to the list of known hosts.
+ '[' 0 -ne 0 ']'
+ '[' 0 -ne 0 ']'
+ echo failover.sh: primary node is down, promote new_main_node_id=1 on 192.168.2.81.
failover.sh: primary node is down, promote new_main_node_id=1 on 192.168.2.81.
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.2.81 -i /home/postgres/.ssh/id_rsa_pgpool /opt/PostgreSQL/10/bin/pg_ctl -D /opt/PostgreSQL/10/data -w promote
Warning: Permanently added '192.168.2.81' (ECDSA) to the list of known hosts.
waiting for server to promote.... done
server promoted
+ '[' 0 -ne 0 ']'
+ echo failover.sh: end: new_main_node_id=1 on 192.168.2.81 is promoted to a primary
failover.sh: end: new_main_node_id=1 on 192.168.2.81 is promoted to a primary
+ exit 0
2021-08-31 12:44:49: pid 16948: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2021-08-31 12:44:49: pid 16948: LOG:  find_primary_node: primary node is 1
2021-08-31 12:44:49: pid 16948: LOG:  failover: set new primary node: 1
2021-08-31 12:44:49: pid 16948: LOG:  failover: set new main node: 1
2021-08-31 12:44:49: pid 16951: LOG:  new IPC connection received
2021-08-31 12:44:49: pid 16951: LOG:  received the failover indication from Pgpool-II on IPC interface
2021-08-31 12:44:49: pid 16951: LOG:  watchdog is informed of failover end by the main process
2021-08-31 12:44:49: pid 17074: LOG:  worker process received restart request
failover done. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:49: pid 16948: LOG:  failover done. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:50: pid 17073: LOG:  restart request received in pcp child process
2021-08-31 12:44:50: pid 16948: LOG:  PCP child 17073 exits with status 0 in failover()
2021-08-31 12:44:50: pid 16948: LOG:  fork a new PCP child pid 38973 in failover()
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17041 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17042 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17043 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17044 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17045 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17046 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17047 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17048 exits with status 256
2021-08-31 12:44:50: pid 38973: LOG:  PCP process: 38973 started
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17049 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17050 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17051 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17052 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17053 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17054 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17055 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17056 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17057 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17058 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17059 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17060 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17061 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17062 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17063 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17064 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17065 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17066 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17067 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17068 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17069 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17070 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 17071 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  worker child process with pid: 17074 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG:  fork a new worker child process with pid: 38974
2021-08-31 12:44:50: pid 16948: LOG:  child process with pid: 31462 exits with status 256
2021-08-31 12:44:50: pid 38974: LOG:  process started
2021-08-31 12:44:50: pid 16951: LOG:  new IPC connection received
2021-08-31 12:44:52: pid 16951: LOG:  watchdog received the failover command from remote pgpool-II node "192.168.2.80:9999 Linux test"
2021-08-31 12:44:52: pid 16951: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from 192.168.2.80:9999 Linux test
2021-08-31 12:44:52: pid 16951: LOG:  we have got the consensus to perform the failover
2021-08-31 12:44:52: pid 16951: DETAIL:  1 node(s) voted in the favor
2021-08-31 12:44:52: pid 16951: LOG:  invalid degenerate backend request, node id : 0 status: [3] is not valid for failover
2021-08-31 12:45:00: pid 16951: LOG:  new IPC connection received
2021-08-31 12:45:11: pid 16951: LOG:  new IPC connection received

-- Start the PostgreSQL instance that was shut down:

[postgres@test /home/postgres]$pg_ctl start
waiting for server to start....2021-08-31 12:51:52.130 CST [35222] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-08-31 12:51:52.130 CST [35222] LOG:  listening on IPv6 address "::", port 5432
2021-08-31 12:51:52.133 CST [35222] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-08-31 12:51:52.169 CST [35222] LOG:  redirecting log output to logging collector process
2021-08-31 12:51:52.169 CST [35222] HINT:  Future log output will appear in directory "log".
 done
server started
[postgres@test /home/postgres]$

-- At this point, pg_controldata shows the restarted instance is still "in production". That is normal: nothing told it about the new primary/standby roles. When the old primary was shut down, only the standby was told to promote; the old primary was never reconfigured. Online recovery is needed to repair the primary/standby relationship.

[postgres@test /home/postgres]$pcp_recovery_node -h 192.168.2.88 -p 9898 -U postgres -n 0
Password:
ERROR:  executing recovery, execution of command failed at "1st stage"
DETAIL:  command:"recovery_1st_stage"
[postgres@test /home/postgres]$

pcp_recovery_node failed, and the cause has not been found yet. Judging from the official documentation, the new primary-standby relationship still has to be repaired from the command line. The guess is that pcp_recovery_node calls the recovery_1st_stage script to rebuild the standby with pg_basebackup, then calls pgpool_remote_start to start the standby remotely, and then follow_primary.sh to join the node to pgpool. In short, after the old primary becomes the new standby, some manual handling still seems required; pgpool does not handle it fully automatically.

Here pg_rewind is used to repair the new standby, and then the node is added back with the pgpool tools (the full sequence is recapped after the transcript below).

[postgres@test /opt/pgpool/etc]$pg_ctl stop
waiting for server to shut down.... done
server stopped
[postgres@test /opt/pgpool/etc]$pg_rewind --target-pgdata=$PGDATA --source-server='host=192.168.2.81 port=5432 user=postgres password=oracle' -P
connected to server
servers diverged at WAL location 0/17000098 on timeline 3
rewinding from last common checkpoint at 0/17000028 on timeline 3
reading source file list
reading target file list
reading WAL in target
need to copy 200 MB (total source directory size is 253 MB)
205115/205115 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
[postgres@test /opt/pgpool/etc]$

[postgres@test /opt/PostgreSQL/10/data]$mv recovery.done recovery.conf
[postgres@test /opt/PostgreSQL/10/data]$vi recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=repl password=oracle host=192.168.2.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[postgres@test /home/postgres]$pcp_attach_node -h 192.168.2.80 -p 9898 -U postgres -n 0 -v
Password:
pcp_attach_node -- Command Successful
[postgres@test /home/postgres]$
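The whole manual repair, recapped as one hedged sketch (note the pg_ctl start between editing recovery.conf and pcp_attach_node, which is implied above):

# On the failed old primary (test), as postgres:
pg_ctl stop
pg_rewind --target-pgdata=$PGDATA \
    --source-server='host=192.168.2.81 port=5432 user=postgres password=oracle' -P

# Turn the node back into a standby that follows the new primary.
cd $PGDATA
mv recovery.done recovery.conf    # then point primary_conninfo at 192.168.2.81
pg_ctl start

# Tell pgpool the node is usable again.
pcp_attach_node -h 192.168.2.80 -p 9898 -U postgres -n 0 -v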

-- After attaching, checking with psql -h 192.168.2.80 -p 9999 and psql -h 192.168.2.81 -p 9999 shows one of the nodes as waiting; after restarting pgpool, all nodes are UP.

postgres=# show pool_nodes;
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 13:53:38
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 13:53:38
(2 rows)

postgres=# show pool_nodes;
 node_id |   hostname   | port | status  | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+---------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | waiting | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 13:52:28
 1       | 192.168.2.81 | 5432 | up      | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 12:44:49
(2 rows)

postgres=#

END
