Lost connection to MySQL server during query (by quqi99)

问题

neutron designate日志中发现错误"Got lower serial for", 并且创建zone时永远停留在PENDING状态.

解决

minidns中看到下列日志:

var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.016 94713 INFO designate.mdns.handler [req-26d8910d-61d5-4fd6-bc6b-df7acaf36c12 - - - - -] NotFound, refusing. Question was xxx.openstack-au-east-2.oc.xxx.com. IN SOA
var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.024 94713 WARNING designate.mdns.handler [req-fa5518dd-9506-44f8-a2ee-ee7d79ffaa3c - - - - -] ZoneNotFound while handling axfr request. Question was xxx.openstack-au-east-2.oc.xxx.com. IN AXFR: ZoneNotFound: Could not find Zone

根据代码[1], 查询DB时报ZoneNotFound从而导致无法构建SOA无法构建axfr response, 从而导致minidns master与bind9 slave无法做zone transfer, 这样导致bind9 slave的serial number也无法更新, 最终在minidns notify时看到"Got lower serial for"

在发生ZoneNotFound的附近有下列日志:

[req-b48612b0-ed3e-46d9-8510-6634282ef0a2 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']

于是, 用下列测试程序重现了上述DB error,

import time
from sqlalchemy.engine import create_engine
url = 'mysql+pymysql://root@127.0.0.1:3306/mysql'
engine = create_engine(url, pool_recycle=4).connect()
query = 'SELECT NOW();'
while True:print('Q1', engine.execute(query).fetchall())engine.execute('SET wait_timeout=2')time.sleep(3)print('Q2', engine.execute(query).fetchall())

或者使用oslo.db测试程序:

mysql -u root -p -e "SET GLOBAL wait_timeout=5, slow_query_log=on, long_query_time=0.0;"$ cat test.py
import sys
import time
from oslo_config import cfg
from oslo_db import options as db_options
from oslo_db.sqlalchemy import session as db_session
from sqlalchemy.sql.expression import select
_facade = db_session.EngineFacade("mysql://root@localhost/test")
x = _facade.get_session()
print(x.scalar(select([23])))
time.sleep(5)
print(x.scalar(select([23])))

连接线的过期时间(pool_recycle, 连接池里的连接空闲一段时间后自动释放, oslo中默认是3600)不能大于服务器端的wait_timeout时间(这里是2, 默认是8小时). 所以应该设置wait_timeout大于3600, 或者haproxy中的配置大于3600.
wait_timeout是mysql的一个设置，主要是用来断开不使用的数据库连接。当连接空闲的时间达到wait_timeout设置的最大值时，mysql会主动切断这个连接，以供别的客户端连接数据库。这个值一般是28800，也就是8小时。在mysql中可以通过： show variables like “%timeout%”;获取。
另外，当数据库主动切断连接的时候，mysql客户端并不知道这个连接已经被切断，所以程序并不知道其已经无效了，如果mysql客户端再不支持ReConnect，双重的问题叠加在一起就会导致连接池返回无效连接的可能.

其他

mysqldump --single-transaction -u root -p designate --skip-extended-insert > /tmp/designate-$(date +%s).sql

For the error ‘Lost connection to MySQL server during query’,

wait_timeout is a setting of mysql, which is used for server side to disconnect unused client connections proactively. The default is 8 hours(28800), we can check it by ‘show variables like “%timeout%”;’ or ‘juju config mysql wait-timeout’

All the rest of the timeout below should be less than wait_timeout, if they are greater than wait_timeout, the error ‘Lost connection to MySQL server during query’ will happen.
1, the default value of olso’s connection_recycle_time is 3600
2, the timeout in haproxy.cfg

So we need to collect the following info for further analyses.

1, juju config mysql wait-timeout
2, mysql -unova -p -h<MYSQL_IP> -e ‘show variables like “%timeout%”’ #eg: run it in nova-cloud-controller/0
3, juju ssh neutron-api/0 – sudo grep -r ‘connection_recycle_time’ /etc/neutron/
4, juju ssh neutron-api/0 – sudo grep -r ‘timeout’ /etc/haproxy/

一些命令

#This will show which openstack service is using the most mysql connections
select user, count(*) from information_schema.processlist group by user;juju run --application mysql leader-get
juju run --application mysql "mysql -uroot -pChangeMe123 -e \"SELECT IFNULL(usr,'All Users') user,IFNULL(hst,'All Hosts') host,COUNT(1) Connections FROM (SELECT user usr,LEFT(host,LOCATE(':',host) - 1) hst FROM information_schema.processlist WHERE user NOT IN ('system user','root')) A GROUP BY usr,hst WITH ROLLUP;\""

附件

openstack的各服务都有大量的这种错误：
var/log/mysql/error.log:2019-08-29T16:07:45.335406Z 420821 [Note] Aborted connection 420821 to db: ‘keystone’ user: ‘keystone’ host: ‘10.191.5.49’ (Got timeout reading communication packets)

要求客户增大了connect_timeout, net_read_timeout, net_write_timeout, interactive_timeout后使用‘show global status like ‘aborted%’;’看到Aborted_clients的数目仍在增大
show global variables like ‘max_conn%’;
show global variables like ‘%timeout%’;
show global status like ‘aborted%’;
show processlist;

这时发现在21:55:48，所有的designate serveres都收到了mysql错误，通过mysql的查询日志找到21:04到21:16之间mysql收到了40个gnocchi查询，一个查询就要返回500MB的大小, 然后看到了警告：InnoDB: Warning: difficult to find free blocks in the buffer pool (338 search iterations)!"
所以看起来像是IO swamp的问题进而造成查询超时。解决方法：
1，增加innodb_buffer_pool_size (juju中通过dataset-size控制）
2, 删除gnocchi中的相关数据
3, 调查designate中当遇到mysql超时时是不是会反复retry造成对DB的flood

20201014更新

amphora-agent writes heatbeat to udp, then octavia-health-manager gets heatbeat from udp and writes them to DB, finally health_manager gets heartbeat from DB via get_stale_amphora to decide if failover process will be started

当health-manager线程的数目(pgrep -af /usr/bin/octavia-health-manager | wc -l, 160)大于mysql的连接数时(ss -tp | grep -c :mysql, 90)时上面的get_stale_amphora就会报"Lost connection to MySQL server during query"错误(排除wait-timeout!=3600这个因素之后),
sqlalchemy看到这个错误后就会reconnect, 从而产生下面提到的"QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10"这个错误(http://sqlalche.me/e/3o7r). sqlalchemy’s reconnect可能不是replace, 而是删除老连接然后建立新连接, 这次当达到max_pool_size+max_overflow=20个连接后永远无新连接供它reconnect(sqlalchemy使用max_pool_size=10, 当一次使用多于5时, 会用到默认的max_overflow=10这样可以达到15个连接, 但多于20个后不再允许溢出.), 这可能是一个bug - https://github.com/sqlalchemy/sqlalchemy/issues/5308

可能在mysql在做failover时会出现这种情况, 因为new VIP owner没有old TCP connections从而导致之后的old TCP connections to go stale从而导致reconnect.

可能的解决方法:
1, 如果mysql不再failover的话可能这个问题也不再重启了, 这个可以等下列fix
2, as a base leve request a stable backport for the following patch
health-manager的eventlet/greenlet的数目由health_update_threads参数决定, 默认与CPU核数同, 大多数情况下都是短连接会多路复用还没问题, 但有时候也会各种因素让有的变成长连接. charm中一般对于worker threads有0.25x的乘数(LXD容器中最大是4), octavia-charm也应该用这个乘数来减小worker threads的数目 - https://bugs.launchpad.net/charm-octavia/+bug/1889731
3, We could additionally reconfigure the oslo.db [database] section to include a pool overflow limit at least as large as the number of expected threads by worker-multiplier
4, We could additionally open a bug against sqlalchemy that it should allow for connection replacement even when the pool limit is reached

CLI

export MYSQL_PASSWORD=ChangeMe123
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.instances where host='juju-38b529-ovn-6.cloud.sts'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.instance_extra where instance_uuid='0761d1de-7acd-4781-ae8b-f5ba864ab6ec'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.compute_nodes where hypervisor_hostname='p2-bits-cloud-xxx.maas' or hypervisor_hostname='p2-bits-cloud-xxx.maas'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova_api.request_specs where instance_uuid='0761d1de-7acd-4781-ae8b-f5ba864ab6ec'\G\""
pager less -S

mysql backup by usin gjuju

juju run-action --wait mysql/leader backup
juju ssh mysql/leader
sudo cp -r /opt/backups/mysql mysql-backup-$(date +%F)
sudo chown ubuntu -R mysql-backup-$(date +%F)/
juju scp -- -r mysql/leader:~/mysql-backup-$(date +%F) backups/juju run-action --wait nova-cloud-controller/leader archive-data

bring mysql-innodb-cluster back

juju run-action mysql/leader reboot-cluster-from-complete-outage
juju run-action mysql/1 rejoin-instance

mysql-innodb-cluster

在单主模式下，InnoDB集群仅有一个读写服务器实例(即单主节点)。多个从节点是单主的副本。如果主发生故障，则从节点会选举一个作为单主。MySQL Router会检测到这种变化并将客户端应用程序转发到新的单主上。高级用户还可以将集群配置配置成多主模式.

MySQL 5.7 引入了 Group Replication 功能，可以在一组 MySQL 服务器之间实现自动主机选举，形成一主多从结构。经过高级配置后，可以实现多主多从结构。
MySQL InnoDB Cluster是建立在group repalication基础之上的高可用方案，其中MGR是基础. MGR(Mysql Group Replication)有两种模式，一种是Single-Primary，一种是Multi-Primary，即单主或者多主。注意：Multi-Primary模式中，所有的节点都是主节点，都可以同时被读写，看上去这似乎更好，但是因为多主的复杂性，在功能上如果设置了多主模式，则会有一些使用的限制，比如不支持Foreign Keys with Cascading Constraints。
MySQL Router 是一个轻量级透明中间件，可以自动获取上述集群的状态，规划 SQL 语句，分配到合理的 MySQL 后端进行执行。MySQL Router 在InnoDB集群里面主要作用是为数据库集群提供一个虚拟IP
Mysql shell作为客户端提供对集群的管理和访问，支持JS、Python、SQL三种方式。可谓新一代mysql client

安装：

juju add-model mysql
juju add-machine -n 4
juju deploy --to lxd:0 cs:keystone
juju deploy -n 3 --to lxd:1,lxd:2,lxd:3 mysql-innodb-cluster
juju deploy mysql-router keystone-mysql-router
juju add-relation keystone-mysql-router:db-router mysql-innodb-cluster:db-router
juju add-relation keystone-mysql-router:shared-db keystone:shared-db

1, ssh into mysql-innodb-cluster/leader to run 'pkill juju' if the following command is hungjuju run -u mysql-innodb-cluster/leader leader-get mysql.passwd2, log into mysql-innodb shell$mysql-shell.mysqlsh --uri root@localhost:3306 -p$(juju run -u mysql-innodb-cluster/leader leader-get mysql.passwd)
$mysql-shell.mysqlsh
# we have 3 modes (\sql, \py, \jq)\connect localhost:3306\js
mysql-js> var cluster = dba.rebootClusterFromCompleteOutage();
mysql-js> cluster.status();mysql-js> \sqlhttp://shrenikp.blogspot.com/2018/10/mysql-innodb-cluster-troubleshooting.html

更新，上面使用root登录时一定要加sudo

juju run -u mysql/leader leader-get mysql.passwd
juju ssh mysql/leader -- sudo -s
sudo mysql -u root -p
pager less -S#eg:
juju ssh mysql/leader -- sudo mysql --database=octavia --user=root --password=${DB_PASS}
juju ssh mysql/leader -- "sudo mysql --database=octavia --user=root --password=${DB_PASS} \--execute=\"select * from load_balancer;\""

20220505更新

2022-05-04 08:10:05.959 791819 ERROR nova.servicegroup.drivers.db oslo_messaging.rpc.client.RemoteError: Remote error: OperationalError (pymysql.err.OperationalError) (1040, 'Too many connections to MySQL Router')
juju run -a nova-cloud-controller -- sudo ss -Onatp state connected \'sport = 3306\'
juju run -a nova-cloud-controller -- grep max_total_connections /var/lib/mysql/nova-cloud-controller-mysql-router/mysqlrouter.conf
# /var/lib/mysql/nova-cloud-controller-mysql-router/mysqlrouter.conf
[DEFAULT]
...
max_total_connections = 1024
and make sure unattended-upgrades is disabled in all units running mysql-router- https://bugs.launchpad.net/charm-mysql-router/+bug/1971565pass=`juju run --unit mysql/0 'leader-get root-password'`
juju ssh mysql/0 "mysql -uroot -p$pass -e \"show status like 'wsrep_cluster%';\""
juju ssh mysql/0 "mysql -uroot -p$pass -e \"select * from information_schema.processlist order by db, time desc;\""

另一个脚本

#!/bin/bashPASSWD=$(juju run --unit mysql-innodb-cluster/leader leader-get mysql.passwd)
for x in $(seq 0 900)
dodateecho "select * from designate.zone_masters" | juju ssh mysql-innodb-cluster/leader sudo mysql -N -u root -p$PASSWD 2> /dev/nullsleep 1
done

20220901 - backup db

pass=`juju run --unit mysql/0 'leader-get root-password'`
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases keystone --quick --result-file=/tmp/keystone.sql"
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases neutron --quick --result-file=/tmp/neutron.sql"
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases keystone octavia nova --quick --result-file=/tmp/datavases.sql"

Reference

[1] https://github.com/openstack/designate/blob/stable/queens/designate/mdns/handler.py#L233
[2] https://blog.csdn.net/quqi99/article/details/119042087