os: centos 7.4

postgresql: 9.6.9

etcd: 3.2.18

patroni: 1.4.4

patroni + etcd is a high-availability solution that an engineer from AsiaInfo presented at a PostgreSQL open-source conference.

It is still based on PostgreSQL streaming replication.

IP plan

192.168.56.101 node1 master

192.168.56.102 node2 slave

192.168.56.103 node3 slave

Verifying failover

Stop the master on node1

# systemctl stop postgresql-9.6.service

Patroni on node1 prints output right away

2018-07-11 21:43:52,402 INFO: Lock owner: pg96_101; I am pg96_101

2018-07-11 21:43:52,441 INFO: no action. i am the leader with the lock

2018-07-11 21:44:02,405 WARNING: Postgresql is not running.

2018-07-11 21:44:02,406 INFO: Lock owner: pg96_101; I am pg96_101

2018-07-11 21:44:02,444 INFO: Lock owner: pg96_101; I am pg96_101

2018-07-11 21:44:02,455 INFO: starting as readonly because i had the session lock

2018-07-11 21:44:02,456 INFO: closed patroni connection to the postgresql cluster

2018-07-11 21:44:02,491 INFO: postmaster pid=11705

192.168.56.101:5432 - no response

< 2018-07-11 21:44:02.525 CST > LOG: redirecting log output to logging collector process

< 2018-07-11 21:44:02.525 CST > HINT: Future log output will appear in directory "pg_log".

192.168.56.101:5432 - accepting connections

192.168.56.101:5432 - accepting connections

2018-07-11 21:44:03,555 INFO: Lock owner: pg96_101; I am pg96_101

2018-07-11 21:44:03,555 INFO: establishing a new patroni connection to the postgres cluster

2018-07-11 21:44:03,597 INFO: promoted self to leader because i had the session lock

server promoting

2018-07-11 21:44:03,603 INFO: cleared rewind state after becoming the leader

The log shows that Patroni brought the master back up right away.
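To put a number on "right away", the timestamps in the Patroni log above can be subtracted. Below is a minimal Python sketch. The helper names `parse_ts` and `seconds_between` are made up for illustration, and it assumes the log lines keep the `YYYY-MM-DD HH:MM:SS,mmm` prefix shown above.

```python
from datetime import datetime

# Timestamp prefix used by the Patroni log lines above, e.g. "2018-07-11 21:44:02,405".
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = len("2018-07-11 21:44:02,405")


def parse_ts(log_line: str) -> datetime:
    """Parse the leading timestamp of a Patroni log line."""
    return datetime.strptime(log_line[:TS_LENGTH], TS_FORMAT)


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the elapsed seconds between two log lines."""
    return (parse_ts(second_line) - parse_ts(first_line)).total_seconds()


# Two lines copied from the log above.
down = "2018-07-11 21:44:02,405 WARNING: Postgresql is not running."
up = "2018-07-11 21:44:03,597 INFO: promoted self to leader because i had the session lock"

recovery_seconds = seconds_between(down, up)
print(f"detected down -> promoted again: {recovery_seconds:.3f} s")
```

Note that this only measures the time from Patroni noticing the problem to the promotion. The log lines at 21:43:52 and 21:44:02 suggest Patroni checks roughly every 10 seconds, so the time since the actual `systemctl stop` can be longer.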

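Power off node1

A node losing power is an extreme case, and it is simulated when testing any HA architecture.

Patroni on one of the remaining nodes (pg96_103 here) prints output shortly afterwards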

2018-07-11 21:49:44,632 INFO: Lock owner: pg96_101; I am pg96_103

2018-07-11 21:49:44,632 INFO: does not have lock

2018-07-11 21:49:44,642 INFO: no action. i am a secondary and i am following a leader

2018-07-11 21:49:55,140 INFO: Selected new etcd server http://192.168.56.101:2379

2018-07-11 21:49:57,643 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(, u'Connection to 192.168.56.101 timed out. (connect timeout=2.5)')': /v2/keys/pg96/pg96/?recursive=true

2018-07-11 21:50:00,148 ERROR: Request to server http://192.168.56.101:2379 failed: MaxRetryError(u"HTTPConnectionPool(host=u'192.168.56.101', port=2379): Max retries exceeded with url: /v2/keys/pg96/pg96/?recursive=true (Caused by ConnectTimeoutError(, u'Connection to 192.168.56.101 timed out. (connect timeout=2.5)'))",)

2018-07-11 21:50:00,149 INFO: Reconnection allowed, looking for another server.

2018-07-11 21:50:00,149 INFO: Selected new etcd server http://192.168.56.102:2379

2018-07-11 21:50:00,172 INFO: Lock owner: pg96_101; I am pg96_103

2018-07-11 21:50:00,172 INFO: does not have lock

2018-07-11 21:50:00,191 INFO: no action. i am a secondary and i am following a leader

2018-07-11 21:50:05,137 INFO: Selected new etcd server http://192.168.56.103:2379

2018-07-11 21:50:05,141 INFO: Lock owner: pg96_101; I am pg96_103

2018-07-11 21:50:05,141 INFO: does not have lock

2018-07-11 21:50:05,146 INFO: no action. i am a secondary and i am following a leader

2018-07-11 21:50:15,060 INFO: Got response from pg96_102 http://127.0.0.1:8008/patroni: {"database_system_identifier": "6576484813966394513", "postmaster_start_time": "2018-07-11 17:38:41.768 CST", "timeline": 2, "xlog": {"received_location": 50379696, "replayed_timestamp": "2018-07-11 18:03:34.386 CST", "paused": false, "replayed_location": 50379696}, "patroni": {"scope": "pg96", "version": "1.4.4"}, "state": "running", "role": "replica", "server_version": 90609}

2018-07-11 21:50:15,066 INFO: Got response from pg96_101 http://127.0.0.1:8008/patroni: {"database_system_identifier": "6576484813966394513", "postmaster_start_time": "2018-07-11 17:38:41.768 CST", "timeline": 2, "xlog": {"received_location": 50379696, "replayed_timestamp": "2018-07-11 18:03:34.386 CST", "paused": false, "replayed_location": 50379696}, "patroni": {"scope": "pg96", "version": "1.4.4"}, "state": "running", "role": "replica", "server_version": 90609}

2018-07-11 21:50:15,113 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"

2018-07-11 21:50:15,119 INFO: promoted self to leader by acquiring session lock

server promoting

2018-07-11 21:50:15,190 INFO: cleared rewind state after becoming the leader

2018-07-11 21:50:16,257 INFO: Lock owner: pg96_103; I am pg96_103

2018-07-11 21:50:16,318 INFO: no action. i am the leader with the lock

2018-07-11 21:50:26,254 INFO: Lock owner: pg96_103; I am pg96_103

2018-07-11 21:50:26,279 INFO: no action. i am the leader with the lock

The log contains "server promoting", which means the slave on this node has been promoted to the new master.
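The log above follows a simple pattern: while the leader key in etcd still names pg96_101, the replica does nothing; once the key is gone, the replica takes it and promotes itself. Here is a toy Python model of that idea. It is only a sketch, not Patroni's actual code: `LeaderLock` and `replica_step` are hypothetical names, the lock lives in memory instead of etcd, and `ttl=30` is an assumed value. The real Patroni also appears to ask the other members for their WAL position first (the "Got response from ..." lines), which this sketch leaves out.

```python
from typing import Optional


class LeaderLock:
    """Toy in-memory stand-in for the leader key that Patroni keeps in etcd."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.owner: Optional[str] = None
        self.expires_at = 0.0

    def current_owner(self, now: float) -> Optional[str]:
        """Return the lock owner, or None once the key has expired."""
        if self.owner is not None and now >= self.expires_at:
            self.owner = None
        return self.owner

    def acquire_or_renew(self, member: str, now: float) -> bool:
        """Take the lock if it is free, or refresh it if `member` already owns it."""
        owner = self.current_owner(now)
        if owner is None or owner == member:
            self.owner = member
            self.expires_at = now + self.ttl
            return True
        return False


def replica_step(lock: LeaderLock, me: str, now: float) -> str:
    """One simplified HA-loop iteration on a replica."""
    if lock.current_owner(now) is not None:
        return "no action. i am a secondary and i am following a leader"
    if lock.acquire_or_renew(me, now):
        return "promoted self to leader by acquiring session lock"
    return "no action. another member took the lock first"


lock = LeaderLock(ttl=30)                  # ttl=30 is an assumed value
lock.acquire_or_renew("pg96_101", now=0)   # last renewal before node1 loses power

for t in (10, 20, 30):
    print(t, replica_step(lock, "pg96_103", now=t))
```

With these numbers the replica keeps following at t=10 and t=20 and promotes itself at t=30, once the old leader's key has expired without being renewed.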

Check the Patroni cluster status again

$ patronictl -c /usr/patroni/conf/patroni_postgresql.yml list pg96

+---------+----------+----------------+--------+---------+-----------+
| Cluster |  Member  |      Host      |  Role  |  State  | Lag in MB |
+---------+----------+----------------+--------+---------+-----------+
|   pg96  | pg96_102 | 192.168.56.102 |        | running |    0.0    |
|   pg96  | pg96_103 | 192.168.56.103 | Leader | running |    0.0    |
+---------+----------+----------------+--------+---------+-----------+

Just as expected. Now check the replication status on node3.

select client_addr,
       pg_xlog_location_diff(sent_location, write_location) as write_delay,
       pg_xlog_location_diff(sent_location, flush_location) as flush_delay,
       pg_xlog_location_diff(sent_location, replay_location) as replay_delay
  from pg_stat_replication;

  client_addr   | write_delay | flush_delay | replay_delay
----------------+-------------+-------------+--------------
 192.168.56.102 |           0 |           0 |            0
(1 row)

Haha, node2 is now replicating from node3.

After starting node1 again, check its processes

# ps -ef|grep -i etcd

etcd 996 1 2 21:57 ? 00:00:00 /usr/bin/etcd --name=node1 --data-dir=/var/lib/etcd/node1.etcd --listen-peer-urls=http://192.168.56.101:2380,http://127.0.0.1:2380 --listen-client-urls=http://192.168.56.101:2379,http://127.0.0.1:2379 --initial-advertise-peer-urls=http://192.168.56.101:2380 --advertise-client-urls=http://192.168.56.101:2379 --initial-cluster=node1=http://192.168.56.101:2380,node2=http://192.168.56.102:2380,node3=http://192.168.56.103:2380 --initial-cluster-token=etcd-cluster --initial-cluster-state=new

root 1486 1332 0 21:57 pts/0 00:00:00 grep --color=auto -i etcd

etcd is running, but Patroni did not come up; it needs to be configured later to start with the OS. For now, start PostgreSQL and Patroni manually.

$ mv recovery.done recovery.conf

$ cat recovery.conf

primary_slot_name = 'pg96_101'

standby_mode = 'on'

recovery_target_timeline = 'latest'

primary_conninfo = 'user=replicator password=1qaz2wsx host=192.168.56.103 port=5432 sslmode=prefer sslcompression=1 application_name=pg96_101'

# systemctl start postgresql-9.6.service

$ patroni /usr/patroni/conf/patroni_postgresql.yml

Checking the Patroni cluster status shows that, surprisingly, PostgreSQL on node1 has not joined the cluster.

$ patronictl -c /usr/patroni/conf/patroni_postgresql.yml list pg96

+---------+----------+----------------+--------+---------+-----------+
| Cluster |  Member  |      Host      |  Role  |  State  | Lag in MB |
+---------+----------+----------------+--------+---------+-----------+
|   pg96  | pg96_102 | 192.168.56.102 |        | running |    0.0    |
|   pg96  | pg96_103 | 192.168.56.103 | Leader | running |    0.0    |
+---------+----------+----------------+--------+---------+-----------+

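The log shows the error: replication slot "pg96_101" does not exist

Strange. I had clearly created the pg96_101 slot earlier, yet node3's log says it does not exist. A likely reason is that replication slots are local to each server and are not copied by streaming replication, so a slot created on another node would not exist on the newly promoted node3. Check the slots on node3: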

postgres=# select * from pg_replication_slots;

-[ RECORD 1 ]-------+----------

slot_name | pg96_102

plugin |

slot_type | physical

datoid |

database |

active | t

active_pid | 19347

xmin |

catalog_xmin |

restart_lsn | 0/300C058

confirmed_flush_lsn |

It really is missing. Well, let's try creating it again.

select * from pg_create_physical_replication_slot('pg96_101');
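If this step has to be repeated after every failover, it can be wrapped in a small helper that first checks `pg_replication_slots`, because calling `pg_create_physical_replication_slot` for a name that already exists raises an error. The following is a minimal Python sketch. `ensure_physical_slot` is a hypothetical helper name, and the `%s` placeholders assume a driver with that parameter style, such as psycopg2 (the original post only uses psql).

```python
def ensure_physical_slot(cur, slot_name: str) -> bool:
    """Create a physical replication slot unless it already exists.

    `cur` is a DB-API cursor connected to the current leader.
    Returns True if the slot was created, False if it was already there.
    """
    cur.execute(
        "SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
        (slot_name,),
    )
    if cur.fetchone() is not None:
        return False
    cur.execute("SELECT pg_create_physical_replication_slot(%s)", (slot_name,))
    return True


# Example usage (assumes psycopg2 is installed and node3 is the leader):
#   import psycopg2
#   conn = psycopg2.connect(host="192.168.56.103", dbname="postgres", user="postgres")
#   conn.autocommit = True
#   with conn.cursor() as cur:
#       ensure_physical_slot(cur, "pg96_101")
```

The check and the create are two separate statements, so two sessions running this at the same time could still collide; for a manual recovery step like this one that is probably acceptable.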

The log on node3 shows:

$ tail -n 1000 postgresql-2018-07-11.csv

2018-07-11 22:14:27.247 CST,"replicator","",19982,"192.168.56.101:53204",5b4610c3.4e0e,3,"idle",2018-07-11 22:14:27 CST,5/0,0,ERROR,42704,"replication slot ""pg96_101"" does not exist",,,,,,,,,"pg96_101"

2018-07-11 22:14:27.248 CST,"replicator","",19982,"192.168.56.101:53204",5b4610c3.4e0e,4,"idle",2018-07-11 22:14:27 CST,,0,LOG,00000,"disconnection: session time: 0:00:00.005 user=replicator database= host=192.168.56.101 port=53204",,,,,,,,,"pg96_101"

2018-07-11 22:14:28.367 CST,"postgres","postgres",19929,"[local]",5b461068.4dd9,6,"SELECT",2018-07-11 22:12:56 CST,4/0,0,LOG,00000,"duration: 29.076 ms",,,,,,,,,"psql"

2018-07-11 22:14:30.777 CST,"postgres","postgres",19929,"[local]",5b461068.4dd9,7,"SELECT",2018-07-11 22:12:56 CST,4/0,0,LOG,00000,"duration: 1.060 ms",,,,,,,,,"psql"

2018-07-11 22:14:32.249 CST,,,19984,"192.168.56.101:53206",5b4610c8.4e10,1,"",2018-07-11 22:14:32 CST,,0,LOG,00000,"connection received: host=192.168.56.101 port=53206",,,,,,,,,""

2018-07-11 22:14:32.252 CST,"replicator","",19984,"192.168.56.101:53206",5b4610c8.4e10,2,"authentication",2018-07-11 22:14:32 CST,5/121,0,LOG,00000,"replication connection authorized: user=replicator",,,,,,,,,""

2018-07-11 22:14:32.315 CST,"replicator","",19984,"192.168.56.101:53206",5b4610c8.4e10,3,"streaming 0/300C138",2018-07-11 22:14:32 CST,5/0,0,LOG,00000,"standby ""pg96_101"" is now a synchronous standby with priority 1",,,,,,,,,"pg96_101"

2018-07-11 22:14:36.255 CST,"postgres","postgres",16066,"192.168.56.103:54424",5b45d88a.3ec2,1453,"SELECT",2018-07-11 18:14:34 CST,2/0,0,LOG,00000,"duration: 0.448 ms",,,,,,,,,"Patroni"

2018-07-11 22:14:46.258 CST,"postgres","postgres",16066,"192.168.56.103:54424",5b45d88a.3ec2,1454,"SELECT",2018-07-11 18:14:34 CST,2/0,0,LOG,00000,"duration: 0.568 ms",,,,,,,,,"Patroni"

After a short wait, node1 shows up among the slaves

postgres=# select client_addr,
                  pg_xlog_location_diff(sent_location, write_location) as write_delay,
                  pg_xlog_location_diff(sent_location, flush_location) as flush_delay,
                  pg_xlog_location_diff(sent_location, replay_location) as replay_delay
           from pg_stat_replication;

  client_addr   | write_delay | flush_delay | replay_delay
----------------+-------------+-------------+--------------
 192.168.56.102 |           0 |           0 |            0
 192.168.56.101 |           0 |           0 |            0
(2 rows)
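The delay columns are byte distances between two WAL positions. A position such as 0/300C058 is two hexadecimal numbers: the high 32 bits and the low 32 bits of a 64-bit byte offset. The same arithmetic can be sketched in Python; the function names below are made up for illustration, and the sample values are copied from the output earlier in this post.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual WAL position such as '0/300C058' to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def bytes_to_lsn(offset: int) -> str:
    """Convert a byte offset back to the 'high/low' hexadecimal form."""
    return f"{offset >> 32:X}/{offset & 0xFFFFFFFF:X}"


def lsn_diff(later: str, earlier: str) -> int:
    """Byte distance between two WAL positions, like pg_xlog_location_diff()."""
    return lsn_to_bytes(later) - lsn_to_bytes(earlier)


# Position node1 started streaming from, minus restart_lsn of slot pg96_102.
print(lsn_diff("0/300C138", "0/300C058"))   # 224

# received_location from the Patroni REST response, shown as a plain integer.
print(bytes_to_lsn(50379696))               # 0/300BBB0
```

So a delay of 0 in the query result means the standby has written, flushed, and replayed everything that was sent to it.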

But patronictl still does not show node1.

$ patronictl -c /usr/patroni/conf/patroni_postgresql.yml list pg96

+---------+----------+----------------+--------+---------+-----------+
| Cluster |  Member  |      Host      |  Role  |  State  | Lag in MB |
+---------+----------+----------------+--------+---------+-----------+
|   pg96  | pg96_102 | 192.168.56.102 |        | running |    0.0    |
|   pg96  | pg96_103 | 192.168.56.103 | Leader | running |    0.0    |
+---------+----------+----------------+--------+---------+-----------+

After waiting a while longer, it is OK again

$ patronictl -c /usr/patroni/conf/patroni_postgresql.yml list pg96

+---------+----------+----------------+--------+---------+-----------+
| Cluster |  Member  |      Host      |  Role  |  State  | Lag in MB |
+---------+----------+----------------+--------+---------+-----------+
|   pg96  | pg96_101 | 192.168.56.101 |        | running |    0.0    |
|   pg96  | pg96_102 | 192.168.56.102 |        | running |    0.0    |
|   pg96  | pg96_103 | 192.168.56.103 | Leader | running |    0.0    |
+---------+----------+----------------+--------+---------+-----------+
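Instead of eyeballing the table after every test, the `patronictl list` output can be checked by a script. Below is a rough Python sketch that parses the ASCII table and reports which expected members are missing. The function names are hypothetical, and the sample text is the two-member table from earlier in this post; it assumes the table layout stays as shown here.

```python
from typing import Dict, List


def parse_patronictl_list(text: str) -> List[Dict[str, str]]:
    """Turn the ASCII table printed by `patronictl list` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and blank lines
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *members = rows
    return [dict(zip(header, member)) for member in members]


def missing_members(text: str, expected: List[str]) -> List[str]:
    """Return the expected member names that are absent from the table."""
    present = {row["Member"] for row in parse_patronictl_list(text)}
    return [name for name in expected if name not in present]


sample = """
+---------+----------+----------------+--------+---------+-----------+
| Cluster |  Member  |      Host      |  Role  |  State  | Lag in MB |
+---------+----------+----------------+--------+---------+-----------+
|   pg96  | pg96_102 | 192.168.56.102 |        | running |    0.0    |
|   pg96  | pg96_103 | 192.168.56.103 | Leader | running |    0.0    |
+---------+----------+----------------+--------+---------+-----------+
"""

print(missing_members(sample, ["pg96_101", "pg96_102", "pg96_103"]))  # ['pg96_101']
```

Run against the final three-member table, the same call would return an empty list.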
