Pgpool-II + Watchdog Setup and Testing
References:
Pgpool-II + Watchdog Setup Example
https://www.pgpool.net/docs/latest/en/html/example-configs.html
https://www.pgpool.net/docs/latest/en/html/example-cluster.html
Features verified
1 When pgpool fails on one node, the VIP automatically floats to the other node, i.e. the watchdog works.
2 When the primary database is shut down, the standby is promoted to primary, i.e. failover works. There does not appear to be a switchover function.
3 The failed former primary must be rebuilt (either with pgpool's tools or with pg_rewind).
Features not yet working
1 Online recovery via the pgpool tool did not succeed; the pg_rewind command was used instead.
Host information:
test : 192.168.2.80
test1 : 192.168.2.81
VIP : 192.168.2.88
PostgreSQL version and configuration
PostgreSQL Version 10.15
port 5432
$PGDATA /opt/PostgreSQL/10/data
Archive mode ON
Replication Slots Enable -- not set in this example
Start automatically Enable -- already in place after the PostgreSQL source install
Pgpool-II version and configuration
Pgpool-II Version 4.2.0
Ports:
9999 Pgpool-II accepts connections
9898 PCP process accepts connections
9000 watchdog accepts connections
9694 UDP port for receiving Watchdog's heartbeat signal
Config file /opt/pgpool/etc/pgpool.conf
Pgpool-II startup user: postgres (the default since Pgpool-II 4.1; in Pgpool-II 4.0 and earlier the default startup user is root)
Running mode streaming replication mode
Watchdog on, life check method: heartbeat
Start automatically Enable -- not set in this example
Scripts
failover:
/opt/pgpool/etc/failover.sh -- run via failover_command to perform failover
/opt/pgpool/etc/follow_primary.sh -- run via follow_primary_command to sync a standby with the new primary after failover
For the contents of failover.sh, see:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/failover.sh.sample;hb=refs/heads/V4_2_STABLE
The failover script is simple: it promotes the standby to the new primary with pg_ctl promote; nothing is done to the old primary.
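The promote step the sample script ends up running can be sketched as follows. This is an illustration only: `build_promote_cmd` is a hypothetical helper of mine (the real sample script does considerably more bookkeeping); the PGHOME/PGDATA paths and hosts are the ones used in this setup.

```shell
#!/bin/sh
# Sketch only: reproduce the command failover.sh runs to promote the
# surviving standby. build_promote_cmd is a hypothetical helper;
# PGHOME/PGDATA are the paths from this document.
PGHOME=/opt/PostgreSQL/10
PGDATA=/opt/PostgreSQL/10/data

build_promote_cmd() {
    # $1 = host of the standby to promote (the new main node)
    echo "ssh postgres@$1 $PGHOME/bin/pg_ctl -D $PGDATA -w promote"
}

# With node 0 (the old primary) down, node 1 gets promoted:
build_promote_cmd 192.168.2.81
```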
For the contents of follow_primary.sh, see: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/follow_primary.sh.sample;hb=refs/heads/V4_2_STABLE
follow_primary.sh must be configured when there are three PostgreSQL servers; with two it is unnecessary.
Its job, in a three-server cluster, is this: after failover one standby is promoted to primary, and the remaining standby must be re-synced with the new primary. (With only two servers, once the primary dies and the standby is promoted there is no standby left, so the script is not needed.) The script repairs the new standby mainly via pg_rewind.
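The resync just described can be sketched as the three steps below. `sync_standby_plan` is a hypothetical dry-run helper of mine that only prints the commands, and 192.168.2.82 is an assumed third server for illustration; paths are from this setup.

```shell
#!/bin/sh
# Sketch of follow_primary.sh's core idea: stop the remaining standby,
# resync it against the new primary with pg_rewind, then restart it.
# sync_standby_plan is a hypothetical dry-run helper.
PGHOME=/opt/PostgreSQL/10
PGDATA=/opt/PostgreSQL/10/data

sync_standby_plan() {
    # $1 = remaining standby host, $2 = new primary host
    echo "ssh postgres@$1 $PGHOME/bin/pg_ctl -D $PGDATA -w -m fast stop"
    echo "ssh postgres@$1 $PGHOME/bin/pg_rewind --target-pgdata=$PGDATA --source-server='host=$2 port=5432 user=postgres'"
    echo "ssh postgres@$1 $PGHOME/bin/pg_ctl -D $PGDATA -w start"
}

# A hypothetical third server follows the new primary 192.168.2.81:
sync_standby_plan 192.168.2.82 192.168.2.81
```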
online recovery:
/opt/pgpool/etc/recovery_1st_stage -- run via recovery_1st_stage_command to recover a standby node
/opt/pgpool/etc/pgpool_remote_start -- run after recovery_1st_stage_command to start the standby node
Official sample of the recovery_1st_stage script:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/recovery_1st_stage.sample;hb=refs/heads/V4_2_STABLE
This script creates a standby online using pg_basebackup; it is also used to rebuild a standby from scratch when pg_rewind cannot sync its data.
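The heart of that rebuild is a pg_basebackup pull from the current primary; a minimal sketch follows. `basebackup_cmd` is my own helper name (the real sample script does more); the repl user, hosts, and paths are from this setup.

```shell
#!/bin/sh
# Sketch: recovery_1st_stage essentially rebuilds the standby's data
# directory from the primary. basebackup_cmd is a hypothetical helper
# that only prints the command.
basebackup_cmd() {
    # $1 = current primary host, $2 = destination PGDATA on the standby
    echo "pg_basebackup -h $1 -U repl -p 5432 -D $2 -X stream -P"
}

basebackup_cmd 192.168.2.81 /opt/PostgreSQL/10/data
```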
Official sample of the pgpool_remote_start script:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/pgpool_remote_start.sample;hb=refs/heads/V4_2_STABLE
pgpool_remote_start starts the newly created standby remotely via pg_ctl once it has been built.
Watchdog:
/opt/pgpool/etc/escalation.sh -- run via wd_escalation_command to switch the Active/Standby Pgpool-II safely
Official sample of the escalation.sh script:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob_plain;f=src/sample/scripts/escalation.sh.sample;hb=refs/heads/V4_2_STABLE
Its main job is to bring the VIP down on the old node and then bring it up on the new node.
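That VIP handover can be sketched as below. `vip_down_cmd` and `vip_up_cmd` are hypothetical helpers of mine that only print the commands (a dry run); the VIP and NIC match this setup.

```shell
#!/bin/sh
# Sketch of escalation.sh's idea: make sure the VIP is gone on the
# other node, then add it locally. Helpers only print commands.
VIP=192.168.2.88
DEV=ens33

vip_down_cmd() {
    # $1 = host that may still hold the VIP
    echo "ssh root@$1 /sbin/ip addr del $VIP/24 dev $DEV"
}

vip_up_cmd() {
    echo "/usr/bin/sudo /sbin/ip addr add $VIP/24 dev $DEV label $DEV:0"
}

vip_down_cmd 192.168.2.80
vip_up_cmd
```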
Install the PostgreSQL databases (omitted)
On server1 (the primary), edit the configuration file $PGDATA/postgresql.conf as below. Enable wal_log_hints so pg_rewind can be used. Since the primary may later become a standby, set hot_standby = on.
postgresql.conf settings
listen_addresses = '*'
archive_mode = on
archive_command = 'cp %p /postgres/archive/%f'
max_wal_senders = 10
max_replication_slots = 10
wal_level = replica
hot_standby = on
wal_log_hints = on
After the primary server starts, the official example uses Pgpool-II's online recovery feature to set up the standby -- that step is omitted here: the standby was created manually with pg_basebackup, and its replication runs as the repl user.
Configure pg_hba.conf
host replication repl 192.168.2.80/24 md5
host replication repl 192.168.2.81/24 md5
Set up passwordless ssh trust for both the root and postgres users, with commands like the following (details omitted). When `ssh server1 date` no longer prompts for anything, the setup succeeded.
[all servers]# cd ~/.ssh
[all servers]# ssh-keygen -t rsa -f id_rsa_pgpool
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server1
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server2
[all servers]# ssh-copy-id -i id_rsa_pgpool.pub postgres@server3
[all servers]# su - postgres
[all servers]$ cd ~/.ssh
[all servers]$ ssh-keygen -t rsa -f id_rsa_pgpool
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server1
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server2
[all servers]$ ssh-copy-id -i id_rsa_pgpool.pub postgres@server3
Edit the .pgpass file for passwordless logins to PostgreSQL. Note the permissions must be 600, or the file is ignored.
[postgres@test1 /home/postgres]$more .pgpass
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
[postgres@test1 /home/postgres]$
[postgres@test /home/postgres]$more .pgpass
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
[postgres@test /home/postgres]$
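A quick way to verify the 600 requirement is to check the file mode the way you would on the real file; this sketch writes a throwaway copy in a temp directory.

```shell
#!/bin/sh
# libpq ignores ~/.pgpass unless its mode is 0600 (or stricter).
# Demonstrate with a throwaway copy in a temp directory.
dir=$(mktemp -d)
pgpass="$dir/.pgpass"
cat > "$pgpass" <<'EOF'
*:5432:*:postgres:oracle
*:5432:*:barman:oracle
EOF
chmod 600 "$pgpass"
mode=$(stat -c %a "$pgpass")
echo "mode=$mode"   # must be 600 for the file to take effect
rm -rf "$dir"
```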
Configure Pgpool-II to start automatically (omitted); in this example it is started by hand.
Create the pgpool_node_id files, with ids 0 and 1 respectively.
When the watchdog is enabled, each machine must be identifiable, so pgpool_node_id has to be set.
[postgres@test /opt/pgpool/etc]$more pgpool_node_id
0
[postgres@test /opt/pgpool/etc]$
[postgres@test1 /opt/pgpool/etc]$more pgpool_node_id
1
[postgres@test1 /opt/pgpool/etc]$
Note: the pgpool directory contains a pgpool.pid file recording the pid of the running pgpool process; do not confuse it with pgpool_node_id, which identifies the node.
[postgres@test /opt/pgpool]$more pgpool.pid
14306
[postgres@test /opt/pgpool]$
[postgres@test1 /opt/pgpool]$more pgpool.pid
16948
[postgres@test1 /opt/pgpool]$
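Writing the per-node pgpool_node_id file can be scripted from the hostname. `node_id_for` is a hypothetical helper of mine mapping this document's two hosts to their ids; the file must contain only the node's index in the hostname0/hostname1 list.

```shell
#!/bin/sh
# Sketch: derive the pgpool_node_id value from the hostname.
node_id_for() {
    case "$1" in
        test)  echo 0 ;;
        test1) echo 1 ;;
        *)     echo "unknown host: $1" >&2; return 1 ;;
    esac
}

# On each server you would run:
#   node_id_for "$(hostname)" > /opt/pgpool/etc/pgpool_node_id
node_id_for test
node_id_for test1
```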
Pgpool-II configuration
Since Pgpool-II 4.2, all configuration parameters are identical on all hosts, so you can edit pgpool.conf on any pgpool node and copy the edited file to the other pgpool nodes.
Cluster mode
[postgres@test /opt/pgpool/etc]$cp pgpool.conf.sample-stream pgpool.conf
[postgres@test1 /opt/pgpool/etc]$cp pgpool.conf.sample-stream pgpool.conf
pgpool.conf settings
listen_addresses = '*' -- to let Pgpool-II accept all incoming connections, set listen_addresses = '*'
port = 9999 -- port the pgpool listener listens on
sr_check_user = 'postgres' -- streaming replication check
sr_check_password = 'oracle' -- if left empty, Pgpool-II first tries to get this user's password from the pool_passwd file before using an empty password
health_check_period = 5
health_check_timeout = 30
health_check_user = 'postgres'
health_check_password = 'oracle'
health_check_max_retries = 3
backend_hostname0 = '192.168.2.80'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/opt/PostgreSQL/10/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = '192.168.2.81'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/opt/PostgreSQL/10/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'test'
backend_application_name1 = 'test1'
failover_command = '/opt/pgpool/etc/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S' -- remember to adjust the PGHOME path inside failover.sh
follow_primary_command = '' -- with only two machines this script is unnecessary, so it is left unset
## Specify the failover.sh script to run after failover in the failover_command parameter. With three PostgreSQL servers, follow_primary_command must also be specified, to run after a primary failover; with two PostgreSQL servers, the follow_primary_command setting is not needed.
Configure pcp.conf (not set up in this example since follow_primary_command is unused; if you need it, proceed as follows)
Because the PCP commands used in the follow_primary.sh script require user authentication, specify the user name and md5-encrypted password in pcp.conf in "username:encrypted password" format. Use pg_md5 to create the encrypted password entry for the pgpool user, like this:
[all servers]# echo 'postgres:'`pg_md5 PCP password` >> /opt/pgpool/etc/pcp.conf
# USERID:MD5PASSWD
postgres:a189c633d9995e11bf8607170ec9a4b8 -- the entry looks like this
Because follow_primary.sh must run PCP commands without entering a password, create .pcppass in the home directory of the Pgpool-II startup user (postgres) on every server. -- not set up in this example
[all servers]# su - postgres
[all servers]$ echo 'localhost:9898:postgres:<pgpool user password>' > ~/.pcppass
[all servers]$ chmod 600 ~/.pcppass
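For reference, pg_md5's default output is just the MD5 digest of the password string, so a pcp.conf entry can be reproduced with md5sum when pg_md5 is not at hand. This is a sketch for illustration (use pg_md5 itself on a real system); `pcp_entry` is my own helper name.

```shell
#!/bin/sh
# Build a pcp.conf "username:md5password" entry. By default pg_md5
# prints the plain MD5 digest of the password, which md5sum reproduces.
pcp_entry() {
    # $1 = PCP user name, $2 = plain-text PCP password
    printf '%s:%s\n' "$1" "$(printf '%s' "$2" | md5sum | cut -d' ' -f1)"
}

pcp_entry postgres password
```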
Note: follow_primary.sh does not support tablespaces; if you use tablespaces, you must adapt the script.
Pgpool-II online recovery configuration (not used in this example)
Next, to perform online recovery with Pgpool-II, specify the PostgreSQL user name and the online recovery command recovery_1st_stage. Because online recovery requires superuser privileges in PostgreSQL, the postgres user is specified in recovery_user. Then create recovery_1st_stage and pgpool_remote_start under the database cluster directory of the PostgreSQL primary server (server1) and add execute permission. -- settings in pgpool.conf:
recovery_user = 'postgres'
recovery_password = 'oracle'
recovery_1st_stage_command = 'recovery_1st_stage'
Copy the files and adjust permissions
[postgres@test /opt/pgpool/etc]$cp recovery_1st_stage.sample recovery_1st_stage
[postgres@test /opt/pgpool/etc]$cp pgpool_remote_start.sample pgpool_remote_start
Online recovery needs the pgpool_recovery, pgpool_remote_start, and pgpool_switch_xlog functions, so install the pgpool_recovery extension into template1 on PostgreSQL server server1.
[server1]# su - postgres
[server1]$ psql template1 -c "CREATE EXTENSION pgpool_recovery" -- this installs the pgpool_recovery functions
-- build and install pgpool_adm, pgpool-recovery, pgpool-regclass, etc. first:
[root@test /root/.ssh]$cd /postgres/pgpool-II-4.2.4/src/sql
[root@test /postgres/pgpool-II-4.2.4/src/sql]$ls
insert_lock.sql Makefile pgpool_adm pgpool-recovery pgpool-regclass
[root@test /postgres/pgpool-II-4.2.4/src/sql]$make install
[postgres@test /opt/pgpool/etc]$psql template1 -c "CREATE EXTENSION pgpool_recovery"
CREATE EXTENSION
[postgres@test /opt/pgpool/etc]$
Client authentication configuration
pool_hba authentication is disabled by default; set enable_pool_hba = on to enable it.
host all repl 0.0.0.0/0 md5
host all postgres 0.0.0.0/0 md5
Configure sudo for the postgres user; it is needed to set the VIP address when the VIP floats.
chmod u+w /etc/sudoers
vi /etc/sudoers
postgres ALL=NOPASSWD: /sbin/ip
postgres ALL=NOPASSWD: /usr/sbin/arping
Watchdog configuration
use_watchdog = on
delegate_IP = '192.168.2.88'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev ens33 label ens33:0' -- the NIC in this example is ens33
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev ens33'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I ens33'
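The watchdog substitutes the `$_IP_$` placeholder in these commands with delegate_IP before executing them; the result can be previewed with sed. `render_cmd` is a hypothetical helper of mine; values are from this setup.

```shell
#!/bin/sh
# Sketch: preview if_up_cmd after the watchdog substitutes $_IP_$
# with delegate_IP. render_cmd is a hypothetical helper.
DELEGATE_IP=192.168.2.88
IF_UP_CMD='/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev ens33 label ens33:0'

render_cmd() {
    # \$ in the single-quoted sed pattern matches a literal dollar sign
    printf '%s\n' "$1" | sed 's/\$_IP_\$/'"$DELEGATE_IP"'/g'
}

render_cmd "$IF_UP_CMD"
```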
Specify all Pgpool-II node information for the watchdog configuration
hostname0 = '192.168.2.80'
wd_port0 = 9000
pgpool_port0 = 9999
hostname1 = '192.168.2.81'
wd_port1 = 9000
pgpool_port1 = 9999
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
Specify the Pgpool-II node information for sending and receiving heartbeat signals
heartbeat_hostname0 = '192.168.2.80'
heartbeat_port0 = 9694
heartbeat_device0 = ''
heartbeat_hostname1 = '192.168.2.81'
heartbeat_port1 = 9694
heartbeat_device1 = ''
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
wd_escalation_command = '/opt/pgpool/etc/escalation.sh' -- adjust the contents of escalation.sh to your environment
enable_consensus_with_half_votes = on -- used when the number of nodes is even
pid_file_name = '/opt/pgpool/pgpool.pid'
log_destination = 'stderr'
logging_collector = on
log_directory = '/opt/pgpool/log'
log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 10MB
When Pgpool-II starts, if the pgpool_status file exists, Pgpool-II reads the backend status (up/down) from it.
If you want Pgpool-II to ignore the pgpool_status file at startup, add "-D" to the startup options OPTS in /etc/sysconfig/pgpool.
[all servers]# vi /etc/sysconfig/pgpool -- a source-built install has no such file; either empty the pgpool_status file, or start pgpool with the -D option to ignore it
...
OPTS=" -D -n"
Start pgpool: as the postgres user, simply running pgpool starts it.
The log after startup shows the heartbeat working:
2021-08-31 11:09:21: pid 14310: LOG: new IPC connection received
2021-08-31 11:09:31: pid 14310: LOG: new IPC connection received
2021-08-31 11:09:41: pid 14310: LOG: new IPC connection received
2021-08-31 11:09:52: pid 14310: LOG: new IPC connection received
2021-08-31 11:10:02: pid 14310: LOG: new IPC connection received
2021-08-31 11:10:12: pid 14310: LOG: new IPC connection received
2021-08-31 11:10:22: pid 14310: LOG: new IPC connection received
IP addresses after startup; the VIP is on the primary node.
[postgres@test /home/postgres]$ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.80  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::14f:9066:9e0c:5f6d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:10:c6:5b  txqueuelen 1000  (Ethernet)
        RX packets 65786  bytes 8130219 (7.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 186681  bytes 363112722 (346.2 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.88  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 00:0c:29:10:c6:5b  txqueuelen 1000  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 68588  bytes 23916273 (22.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 68588  bytes 23916273 (22.8 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
-- Check the primary and standby status
[postgres@test /home/postgres]$psql -h 192.168.2.80 -p 9999
Password:
psql.bin (10.15)
Type "help" for help.

postgres=# show pool_nodes;
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 |                   |                        | 2021-08-31 08:55:59
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 |                   |                        | 2021-08-31 08:55:59
(2 rows)

[postgres@test1 /home/postgres]$psql -h 192.168.2.81 -p 9999
Password:
psql.bin (10.15)
Type "help" for help.

postgres=# show pool_nodes;
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 |                   |                        | 2021-08-31 08:57:04
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 |                   |                        | 2021-08-31 08:57:04
(2 rows)
Check via the watchdog; the LEADER is 192.168.2.80.
[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.80:9999 Linux test 192.168.2.80

192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 4 LEADER
192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 7 STANDBY
[postgres@test /home/postgres]$
Stop pgpool on 192.168.2.80 and check the watchdog info again: the state of 192.168.2.80 is now shutdown, the VIP has floated to the other machine, and the LEADER is 192.168.2.81.
[postgres@test /home/postgres]$pgpool stop
2021-08-31 11:16:30: pid 26938: LOG: stop request sent to pgpool. waiting for termination...
.done.
[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.81:9999 Linux test1 192.168.2.81

192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 4 LEADER
192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 10 SHUTDOWN
[postgres@test /home/postgres]$

[postgres@test1 /opt/pgpool/etc]$ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.81  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::626c:15d:e820:43fc  prefixlen 64  scopeid 0x20<link>
        inet6 fe80::14f:9066:9e0c:5f6d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:86:18:b7  txqueuelen 1000  (Ethernet)
        RX packets 186932  bytes 193331959 (184.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 62571  bytes 8315733 (7.9 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens33:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.88  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 00:0c:29:86:18:b7  txqueuelen 1000  (Ethernet)
Start pgpool again and re-check: 192.168.2.81 is the leader and 192.168.2.80 is standby.
[postgres@test /home/postgres]$pgpool
[postgres@test /home/postgres]$pcp_watchdog_info -p 9898 -h 192.168.2.88 -U postgres
Password:
2 YES 192.168.2.81:9999 Linux test1 192.168.2.81

192.168.2.81:9999 Linux test1 192.168.2.81 9999 9000 4 LEADER
192.168.2.80:9999 Linux test 192.168.2.80 9999 9000 7 STANDBY
[postgres@test /home/postgres]$
Check the PostgreSQL primary and standby information: normal. Note that this is unrelated to what the watchdog shows; the watchdog reports the state of the pgpool cluster.
[postgres@test1 /opt/pgpool/etc]$psql -h 192.168.2.88 -p 9999 -U postgres postgres -c "show pool_nodes"
Password for user postgres:
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 08:57:04
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 08:57:04
(2 rows)
[postgres@test1 /opt/pgpool/etc]$
-- failover test
With three PostgreSQL servers, when the primary is shut down a new primary is promoted and the old primary becomes a standby. When that standby is restarted it does not become primary again; it must be recovered online, either through pgpool or manually with pg_rewind.
-- Shut down the PostgreSQL database on node 0, i.e. the one on 192.168.2.80. Node 0 is now shut down with status down and role standby, and the new primary has switched to node 1: 192.168.2.81 is the new primary.
[postgres@test1 /opt/pgpool/etc]$psql -h 192.168.2.88 -p 9999 -U postgres postgres -c "show pool_nodes"
Password for user postgres:
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 12:44:49
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 12:44:49
(2 rows)
[postgres@test1 /opt/pgpool/etc]$
-- Meanwhile, the pgpool logs show:
-- Log on node id 0: the primary node has changed to 192.168.2.81
2021-08-31 12:44:33: pid 27061: LOG: new IPC connection received
2021-08-31 12:44:43: pid 27061: LOG: new IPC connection received
2021-08-31 12:44:49: pid 27100: LOG: failed to connect to PostgreSQL server on "192.168.2.80:5432", getsockopt() failed
2021-08-31 12:44:49: pid 27100: DETAIL: Operation now in progress
2021-08-31 12:44:49: pid 27100: ERROR: failed to make persistent db connection
2021-08-31 12:44:49: pid 27100: DETAIL: connection to host:"192.168.2.80:5432" failed
2021-08-31 12:44:49: pid 27100: LOG: health check retrying on DB node: 0 (round:1)
2021-08-31 12:44:49: pid 27061: LOG: signal_user1_to_parent_with_reason(2)
2021-08-31 12:44:49: pid 27058: LOG: Pgpool-II parent process received SIGUSR1
2021-08-31 12:44:49: pid 27058: LOG: Pgpool-II parent process received sync backend signal from watchdog
2021-08-31 12:44:49: pid 27061: LOG: new IPC connection received
2021-08-31 12:44:49: pid 27058: LOG: leader watchdog has performed failover
2021-08-31 12:44:49: pid 27058: DETAIL: syncing the backend states from the LEADER watchdog node
2021-08-31 12:44:49: pid 27061: LOG: new IPC connection received
2021-08-31 12:44:49: pid 27061: LOG: received the get data request from local pgpool-II on IPC interface
2021-08-31 12:44:49: pid 27061: LOG: get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27061: DETAIL: waiting for the reply...
2021-08-31 12:44:49: pid 27058: LOG: leader watchdog node "192.168.2.81:9999 Linux test1" returned status for 2 backend nodes
2021-08-31 12:44:49: pid 27058: LOG: backend:0 is set to down status
2021-08-31 12:44:49: pid 27058: DETAIL: backend:0 is DOWN on cluster leader "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27058: LOG: primary node:1 on leader watchdog node "192.168.2.81:9999 Linux test1" is different from local primary node:0
2021-08-31 12:44:49: pid 27058: LOG: primary node was changed after the sync from "192.168.2.81:9999 Linux test1"
2021-08-31 12:44:49: pid 27058: DETAIL: all children needs to be restarted
2021-08-31 12:44:49: pid 27058: LOG: child process with pid: 27066 exits with status 0
2021-08-31 12:44:49: pid 27058: LOG: child process with pid: 27067 exits with status 0
-- On the node id 1 machine, the logs show the new primary was promoted successfully
2021-08-31 12:44:49: pid 16951: LOG: new IPC connection received
2021-08-31 12:44:49: pid 16951: LOG: received the failover indication from Pgpool-II on IPC interface
2021-08-31 12:44:49: pid 16951: LOG: watchdog is informed of failover start by the main process
2021-08-31 12:44:49: pid 16948: LOG: starting degeneration. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:49: pid 16948: LOG: Restart all children
2021-08-31 12:44:49: pid 16948: LOG: execute command: /opt/pgpool/etc/failover.sh 0 192.168.2.80 5432 /opt/PostgreSQL/10/data 1 192.168.2.81 0 0 5432 /opt/PostgreSQL/10/data 192.168.2.80 5432
+ FAILED_NODE_ID=0
+ FAILED_NODE_HOST=192.168.2.80
+ FAILED_NODE_PORT=5432
+ FAILED_NODE_PGDATA=/opt/PostgreSQL/10/data
+ NEW_MAIN_NODE_ID=1
+ NEW_MAIN_NODE_HOST=192.168.2.81
+ OLD_MAIN_NODE_ID=0
+ OLD_PRIMARY_NODE_ID=0
+ NEW_MAIN_NODE_PORT=5432
+ NEW_MAIN_NODE_PGDATA=/opt/PostgreSQL/10/data
+ OLD_PRIMARY_NODE_HOST=192.168.2.80
+ OLD_PRIMARY_NODE_PORT=5432
+ PGHOME=/opt/PostgreSQL/10
+ REPL_SLOT_NAME=192_168_2_80
+ echo failover.sh: start: failed_node_id=0 failed_host=192.168.2.80 old_primary_node_id=0 new_main_node_id=1 new_main_host=192.168.2.81
failover.sh: start: failed_node_id=0 failed_host=192.168.2.80 old_primary_node_id=0 new_main_node_id=1 new_main_host=192.168.2.81
+ '[' 1 -lt 0 ']'
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.2.81 -i /home/postgres/.ssh/id_rsa_pgpool ls /tmp
Warning: Permanently added '192.168.2.81' (ECDSA) to the list of known hosts.
+ '[' 0 -ne 0 ']'
+ '[' 0 -ne 0 ']'
+ echo failover.sh: primary node is down, promote new_main_node_id=1 on 192.168.2.81.
failover.sh: primary node is down, promote new_main_node_id=1 on 192.168.2.81.
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres@192.168.2.81 -i /home/postgres/.ssh/id_rsa_pgpool /opt/PostgreSQL/10/bin/pg_ctl -D /opt/PostgreSQL/10/data -w promote
Warning: Permanently added '192.168.2.81' (ECDSA) to the list of known hosts.
waiting for server to promote.... done
server promoted
+ '[' 0 -ne 0 ']'
+ echo failover.sh: end: new_main_node_id=1 on 192.168.2.81 is promoted to a primary
failover.sh: end: new_main_node_id=1 on 192.168.2.81 is promoted to a primary
+ exit 0
2021-08-31 12:44:49: pid 16948: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2021-08-31 12:44:49: pid 16948: LOG: find_primary_node: primary node is 1
2021-08-31 12:44:49: pid 16948: LOG: failover: set new primary node: 1
2021-08-31 12:44:49: pid 16948: LOG: failover: set new main node: 1
2021-08-31 12:44:49: pid 16951: LOG: new IPC connection received
2021-08-31 12:44:49: pid 16951: LOG: received the failover indication from Pgpool-II on IPC interface
2021-08-31 12:44:49: pid 16951: LOG: watchdog is informed of failover end by the main process
2021-08-31 12:44:49: pid 17074: LOG: worker process received restart request
failover done. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:49: pid 16948: LOG: failover done. shutdown host 192.168.2.80(5432)
2021-08-31 12:44:50: pid 17073: LOG: restart request received in pcp child process
2021-08-31 12:44:50: pid 16948: LOG: PCP child 17073 exits with status 0 in failover()
2021-08-31 12:44:50: pid 16948: LOG: fork a new PCP child pid 38973 in failover()
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17041 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17042 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17043 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17044 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17045 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17046 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17047 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17048 exits with status 256
2021-08-31 12:44:50: pid 38973: LOG: PCP process: 38973 started
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17049 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17050 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17051 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17052 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17053 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17054 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17055 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17056 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17057 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17058 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17059 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17060 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17061 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17062 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17063 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17064 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17065 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17066 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17067 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17068 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17069 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17070 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 17071 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: worker child process with pid: 17074 exits with status 256
2021-08-31 12:44:50: pid 16948: LOG: fork a new worker child process with pid: 38974
2021-08-31 12:44:50: pid 16948: LOG: child process with pid: 31462 exits with status 256
2021-08-31 12:44:50: pid 38974: LOG: process started
2021-08-31 12:44:50: pid 16951: LOG: new IPC connection received
2021-08-31 12:44:52: pid 16951: LOG: watchdog received the failover command from remote pgpool-II node "192.168.2.80:9999 Linux test"
2021-08-31 12:44:52: pid 16951: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from 192.168.2.80:9999 Linux test
2021-08-31 12:44:52: pid 16951: LOG: we have got the consensus to perform the failover
2021-08-31 12:44:52: pid 16951: DETAIL: 1 node(s) voted in the favor
2021-08-31 12:44:52: pid 16951: LOG: invalid degenerate backend request, node id : 0 status: [3] is not valid for failover
2021-08-31 12:45:00: pid 16951: LOG: new IPC connection received
2021-08-31 12:45:11: pid 16951: LOG: new IPC connection received
-- Start the PostgreSQL instance that was shut down
[postgres@test /home/postgres]$pg_ctl start
waiting for server to start....2021-08-31 12:51:52.130 CST [35222] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-08-31 12:51:52.130 CST [35222] LOG: listening on IPv6 address "::", port 5432
2021-08-31 12:51:52.133 CST [35222] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-08-31 12:51:52.169 CST [35222] LOG: redirecting log output to logging collector process
2021-08-31 12:51:52.169 CST [35222] HINT: Future log output will appear in directory "log".
 done
server started
[postgres@test /home/postgres]$
-- At this point pg_controldata shows the restarted PostgreSQL is still "in production". (That is expected: nothing told it about the new primary/standby roles. When the primary was shut down, only the standby was told to promote; the old primary was never told what to do.) Online recovery is needed to repair the primary/standby relationship.
[postgres@test /home/postgres]$pcp_recovery_node -h 192.168.2.88 -p 9898 -U postgres -n 0
Password:
ERROR: executing recovery, execution of command failed at "1st stage"
DETAIL: command:"recovery_1st_stage"
[postgres@test /home/postgres]$
pcp_recovery_node failed, and I have not yet found the cause. Judging from the official documentation, the new primary-standby relationship still has to be repaired from the command line. My guess is that pcp_recovery_node calls the recovery_1st_stage script to rebuild the standby with pg_basebackup, then calls pgpool_remote_start to start the standby remotely, and then calls follow_primary.sh to attach the node to pgpool. In short, after the old primary becomes the new standby, manual handling still seems to be required rather than pgpool doing everything automatically.
Here I used pg_rewind to repair the new standby, then used the pgpool tool to attach the node.
[postgres@test /opt/pgpool/etc]$pg_ctl stop
waiting for server to shut down.... done
server stopped
[postgres@test /opt/pgpool/etc]$pg_rewind --target-pgdata=$PGDATA --source-server='host=192.168.2.81 port=5432 user=postgres password=oracle' -P
connected to server
servers diverged at WAL location 0/17000098 on timeline 3
rewinding from last common checkpoint at 0/17000028 on timeline 3
reading source file list
reading target file list
reading WAL in target
need to copy 200 MB (total source directory size is 253 MB)
205115/205115 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
[postgres@test /opt/pgpool/etc]$
[postgres@test /opt/PostgreSQL/10/data]$mv recovery.done recovery.conf
[postgres@test /opt/PostgreSQL/10/data]$vi recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=repl password=oracle host=192.168.2.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[postgres@test /home/postgres]$pcp_attach_node -h 192.168.2.80 -p 9898 -U postgres -n 0 -v
Password:
pcp_attach_node -- Command Successful
[postgres@test /home/postgres]$
-- After attaching, checking with psql -h 192.168.2.80 -p 9999 and psql -h 192.168.2.81 -p 9999 shows one node as waiting; after restarting pgpool, all nodes are UP.
postgres=# show pool_nodes;
 node_id |   hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | up     | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 13:53:38
 1       | 192.168.2.81 | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 13:53:38
(2 rows)
postgres=# show pool_nodes;
 node_id |   hostname   | port | status  | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+---------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.80 | 5432 | waiting | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2021-08-31 13:52:28
 1       | 192.168.2.81 | 5432 | up      | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2021-08-31 12:44:49
(2 rows)
END