版权声明:可以任意转载,转载时请务必以超链接形式标明文章原始出处和作者信息及本版权声明 (作者:张华 发表于:2018-11-06)

问题

neutron designate日志中发现错误"Got lower serial for", 并且创建zone时永远停留在PENDING状态.

解决

minidns中看到下列日志:

var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.016 94713 INFO designate.mdns.handler [req-26d8910d-61d5-4fd6-bc6b-df7acaf36c12 - - - - -] NotFound, refusing. Question was xxx.openstack-au-east-2.oc.xxx.com. IN SOA
var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.024 94713 WARNING designate.mdns.handler [req-fa5518dd-9506-44f8-a2ee-ee7d79ffaa3c - - - - -] ZoneNotFound while handling axfr request. Question was xxx.openstack-au-east-2.oc.xxx.com. IN AXFR: ZoneNotFound: Could not find Zone

根据代码[1], 查询DB时报ZoneNotFound从而导致无法构建SOA无法构建axfr response, 从而导致minidns master与bind9 slave无法做zone transfer, 这样导致bind9 slave的serial number也无法更新, 最终在minidns notify时看到"Got lower serial for"

在发生ZoneNotFound的附近有下列日志:

[req-b48612b0-ed3e-46d9-8510-6634282ef0a2 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']

于是, 用下列测试程序重现了上述DB error,

import time
from sqlalchemy.engine import create_engine
url = 'mysql+pymysql://root@127.0.0.1:3306/mysql'
engine = create_engine(url, pool_recycle=4).connect()
query = 'SELECT NOW();'
while True:print('Q1', engine.execute(query).fetchall())engine.execute('SET wait_timeout=2')time.sleep(3)print('Q2', engine.execute(query).fetchall())

或者使用oslo.db测试程序:

mysql -u root -p -e "SET GLOBAL wait_timeout=5, slow_query_log=on, long_query_time=0.0;"$ cat test.py
import sys
import time
from oslo_config import cfg
from oslo_db import options as db_options
from oslo_db.sqlalchemy import session as db_session
from sqlalchemy.sql.expression import select
_facade = db_session.EngineFacade("mysql://root@localhost/test")
x = _facade.get_session()
print(x.scalar(select([23])))
time.sleep(5)
print(x.scalar(select([23])))

连接线的过期时间(pool_recycle, 连接池里的连接空闲一段时间后自动释放, oslo中默认是3600)不能大于服务器端的wait_timeout时间(这里是2, 默认是8小时). 所以应该设置wait_timeout大于3600, 或者haproxy中的配置大于3600.
wait_timeout是mysql的一个设置,主要是用来断开不使用的数据库连接。当连接空闲的时间达到wait_timeout设置的最大值时,mysql会主动切断这个连接,以供别的客户端连接数据库。这个值一般是28800,也就是8小时。在mysql中可以通过: show variables like “%timeout%”;获取。
另外,当数据库主动切断连接的时候,mysql客户端并不知道这个连接已经被切断,所以程序并不知道其已经无效了,如果mysql客户端再不支持ReConnect,双重的问题叠加在一起就会导致连接池返回无效连接的可能.

其他

mysqldump --single-transaction -u root -p designate --skip-extended-insert > /tmp/designate-$(date +%s).sql

For the error ‘Lost connection to MySQL server during query’,

wait_timeout is a setting of mysql, which is used for server side to disconnect unused client connections proactively. The default is 8 hours(28800), we can check it by ‘show variables like “%timeout%”;’ or ‘juju config mysql wait-timeout’

All the rest of the timeout below should be less than wait_timeout, if they are greater than wait_timeout, the error ‘Lost connection to MySQL server during query’ will happen.
1, the default value of olso’s connection_recycle_time is 3600
2, the timeout in haproxy.cfg

So we need to collect the following info for further analyses.

1, juju config mysql wait-timeout
2, mysql -unova -p -h<MYSQL_IP> -e ‘show variables like “%timeout%”’ #eg: run it in nova-cloud-controller/0
3, juju ssh neutron-api/0 – sudo grep -r ‘connection_recycle_time’ /etc/neutron/
4, juju ssh neutron-api/0 – sudo grep -r ‘timeout’ /etc/haproxy/

一些命令

#This will show which openstack service is using the most mysql connections
select user, count(*) from information_schema.processlist group by user;juju run --application mysql leader-get
juju run --application mysql "mysql -uroot -pChangeMe123 -e \"SELECT IFNULL(usr,'All Users') user,IFNULL(hst,'All Hosts') host,COUNT(1) Connections FROM (SELECT user usr,LEFT(host,LOCATE(':',host) - 1) hst FROM information_schema.processlist WHERE user NOT IN ('system user','root')) A GROUP BY usr,hst WITH ROLLUP;\""

附件

openstack的各服务都有大量的这种错误:
var/log/mysql/error.log:2019-08-29T16:07:45.335406Z 420821 [Note] Aborted connection 420821 to db: ‘keystone’ user: ‘keystone’ host: ‘10.191.5.49’ (Got timeout reading communication packets)

要求客户增大了connect_timeout, net_read_timeout, net_write_timeout, interactive_timeout后使用‘show global status like ‘aborted%’;’看到Aborted_clients的数目仍在增大
show global variables like ‘max_conn%’;
show global variables like ‘%timeout%’;
show global status like ‘aborted%’;
show processlist;

这时发现在21:55:48, 所有的designate serveres都收到了mysql错误, 通过mysql的查询日志找到21:04到21:16之间mysql收到了40个gnocchi查询,一个查询就要返回500MB的大小, 然后看到了警告:InnoDB: Warning: difficult to find free blocks in the buffer pool (338 search iterations)!"
所以看起来像是IO swamp的问题进而造成查询超时。解决方法:
1, 增加innodb_buffer_pool_size (juju中通过dataset-size控制)
2, 删除gnocchi中的相关数据
3, 调查designate中当遇到mysql超时时是不是会反复retry造成对DB的flood

20201014更新

amphora-agent writes heatbeat to udp, then octavia-health-manager gets heatbeat from udp and writes them to DB, finally health_manager gets heartbeat from DB via get_stale_amphora to decide if failover process will be started

当health-manager线程的数目(pgrep -af /usr/bin/octavia-health-manager | wc -l, 160)大于mysql的连接数时(ss -tp | grep -c :mysql, 90)时上面的get_stale_amphora就会报"Lost connection to MySQL server during query"错误(排除wait-timeout!=3600这个因素之后),
sqlalchemy看到这个错误后就会reconnect, 从而产生下面提到的"QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10"这个错误(http://sqlalche.me/e/3o7r). sqlalchemy’s reconnect可能不是replace, 而是删除老连接然后建立新连接, 这次当达到max_pool_size+max_overflow=20个连接后永远无新连接供它reconnect(sqlalchemy使用max_pool_size=10, 当一次使用多于5时, 会用到默认的max_overflow=10这样可以达到15个连接, 但多于20个后不再允许溢出.), 这可能是一个bug - https://github.com/sqlalchemy/sqlalchemy/issues/5308

可能在mysql在做failover时会出现这种情况, 因为new VIP owner没有old TCP connections从而导致之后的old TCP connections to go stale从而导致reconnect.

可能的解决方法:
1, 如果mysql不再failover的话可能这个问题也不再重启了, 这个可以等下列fix
2, as a base leve request a stable backport for the following patch
health-manager的eventlet/greenlet的数目由health_update_threads参数决定, 默认与CPU核数同, 大多数情况下都是短连接会多路复用还没问题, 但有时候也会各种因素让有的变成长连接. charm中一般对于worker threads有0.25x的乘数(LXD容器中最大是4), octavia-charm也应该用这个乘数来减小worker threads的数目 - https://bugs.launchpad.net/charm-octavia/+bug/1889731
3, We could additionally reconfigure the oslo.db [database] section to include a pool overflow limit at least as large as the number of expected threads by worker-multiplier
4, We could additionally open a bug against sqlalchemy that it should allow for connection replacement even when the pool limit is reached

CLI

export MYSQL_PASSWORD=ChangeMe123
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.instances where host='juju-38b529-ovn-6.cloud.sts'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.instance_extra where instance_uuid='0761d1de-7acd-4781-ae8b-f5ba864ab6ec'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.compute_nodes where hypervisor_hostname='p2-bits-cloud-xxx.maas' or hypervisor_hostname='p2-bits-cloud-xxx.maas'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova_api.request_specs where instance_uuid='0761d1de-7acd-4781-ae8b-f5ba864ab6ec'\G\""
pager less -S

mysql backup by usin gjuju

juju run-action --wait mysql/leader backup
juju ssh mysql/leader
sudo cp -r /opt/backups/mysql mysql-backup-$(date +%F)
sudo chown ubuntu -R mysql-backup-$(date +%F)/
juju scp -- -r mysql/leader:~/mysql-backup-$(date +%F) backups/juju run-action --wait nova-cloud-controller/leader archive-data

bring mysql-innodb-cluster back

juju run-action mysql/leader reboot-cluster-from-complete-outage
juju run-action mysql/1 rejoin-instance

mysql-innodb-cluster

在单主模式下,InnoDB集群仅有一个读写服务器实例(即单主节点)。多个从节点是单主的副本。如果主发生故障,则从节点会选举一个作为单主。MySQL Router会检测到这种变化并将客户端应用程序转发到新的单主上。高级用户还可以将集群配置配置成多主模式.

  • MySQL 5.7 引入了 Group Replication 功能,可以在一组 MySQL 服务器之间实现自动主机选举,形成一主多从结构。经过高级配置后,可以实现多主多从结构。
  • MySQL InnoDB Cluster是建立在group repalication基础之上的高可用方案,其中MGR是基础. MGR(Mysql Group Replication)有两种模式,一种是Single-Primary,一种是Multi-Primary,即单主或者多主。注意:Multi-Primary模式中,所有的节点都是主节点,都可以同时被读写,看上去这似乎更好,但是因为多主的复杂性,在功能上如果设置了多主模式,则会有一些使用的限制,比如不支持Foreign Keys with Cascading Constraints。
  • MySQL Router 是一个轻量级透明中间件,可以自动获取上述集群的状态,规划 SQL 语句,分配到合理的 MySQL 后端进行执行。MySQL Router 在InnoDB集群里面主要作用是为数据库集群提供一个虚拟IP
  • Mysql shell作为客户端提供对集群的管理和访问,支持JS、Python、SQL三种方式。可谓新一代mysql client

    安装:
juju add-model mysql
juju add-machine -n 4
juju deploy --to lxd:0 cs:keystone
juju deploy -n 3 --to lxd:1,lxd:2,lxd:3 mysql-innodb-cluster
juju deploy mysql-router keystone-mysql-router
juju add-relation keystone-mysql-router:db-router mysql-innodb-cluster:db-router
juju add-relation keystone-mysql-router:shared-db keystone:shared-db
1, ssh into mysql-innodb-cluster/leader to run 'pkill juju' if the following command is hungjuju run -u mysql-innodb-cluster/leader leader-get mysql.passwd2, log into mysql-innodb shell$mysql-shell.mysqlsh --uri root@localhost:3306 -p$(juju run -u mysql-innodb-cluster/leader leader-get mysql.passwd)
$mysql-shell.mysqlsh
# we have 3 modes (\sql, \py, \jq)\connect localhost:3306\js
mysql-js> var cluster = dba.rebootClusterFromCompleteOutage();
mysql-js> cluster.status();mysql-js> \sqlhttp://shrenikp.blogspot.com/2018/10/mysql-innodb-cluster-troubleshooting.html

更新,上面使用root登录时一定要加sudo

juju run -u mysql/leader leader-get mysql.passwd
juju ssh mysql/leader -- sudo -s
sudo mysql -u root -p
pager less -S#eg:
juju ssh mysql/leader -- sudo mysql --database=octavia --user=root --password=${DB_PASS}
juju ssh mysql/leader -- "sudo mysql --database=octavia --user=root --password=${DB_PASS} \--execute=\"select * from load_balancer;\""

20220505更新

2022-05-04 08:10:05.959 791819 ERROR nova.servicegroup.drivers.db oslo_messaging.rpc.client.RemoteError: Remote error: OperationalError (pymysql.err.OperationalError) (1040, 'Too many connections to MySQL Router')
juju run -a nova-cloud-controller -- sudo ss -Onatp state connected \'sport = 3306\'
juju run -a nova-cloud-controller -- grep max_total_connections /var/lib/mysql/nova-cloud-controller-mysql-router/mysqlrouter.conf
# /var/lib/mysql/nova-cloud-controller-mysql-router/mysqlrouter.conf
[DEFAULT]
...
max_total_connections = 1024
and make sure unattended-upgrades is disabled in all units running mysql-router- https://bugs.launchpad.net/charm-mysql-router/+bug/1971565pass=`juju run --unit mysql/0 'leader-get root-password'`
juju ssh mysql/0 "mysql -uroot -p$pass -e \"show status like 'wsrep_cluster%';\""
juju ssh mysql/0 "mysql -uroot -p$pass -e \"select * from information_schema.processlist order by db, time desc;\""

另一个脚本

#!/bin/bashPASSWD=$(juju run --unit mysql-innodb-cluster/leader leader-get mysql.passwd)
for x in $(seq 0 900)
dodateecho "select * from designate.zone_masters" | juju ssh mysql-innodb-cluster/leader sudo mysql -N -u root -p$PASSWD 2> /dev/nullsleep 1
done

20220901 - backup db

pass=`juju run --unit mysql/0 'leader-get root-password'`
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases keystone --quick --result-file=/tmp/keystone.sql"
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases neutron --quick --result-file=/tmp/neutron.sql"
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases keystone octavia nova --quick --result-file=/tmp/datavases.sql"

Reference

[1] https://github.com/openstack/designate/blob/stable/queens/designate/mdns/handler.py#L233
[2] https://blog.csdn.net/quqi99/article/details/119042087

Lost connection to MySQL server during query (by quqi99)相关推荐

  1. python3 mysql错误 pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

    pymysql错误: pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query') pymy ...

  2. Procedure execution failed 2013 - Lost connection to MySQL server during query

    1 错误描写叙述 Procedure execution failed 2013 - Lost connection to MySQL server during query 2 错误原因 由错误描写 ...

  3. android+mysql+server+error_Lost connection to MySQL server during query错误 | 学步园

    关于 Lost connection to MySQL server during query 错误2007年04月11日 星期三 00:56晚上在虚拟机上继续玩 Delphi for PHP,想试试 ...

  4. MySQL Workbench出现:Error Code: 2013. Lost connection to MySQL server during query的问题解决

    MySQL Workbench出现:Error Code: 2013. Lost connection to MySQL server during query的问题解决 参考文章: (1)MySQL ...

  5. Lost connection to MySQL server during query的几种可能分析

    在使用navicat导出查询结果时,发现一段时间后就断开连接了,报错:[Msg] [Exp] 2013 - Lost connection to MySQL server during query 开 ...

  6. 解决Lost connection to MySQL server during query错误方法

    昨天使用Navicat for MySQL导入MySQL数据库的时候,出现了一个严重的错误,Lost connection to MySQL server during query,字面意思就是在查询 ...

  7. 解决: Error Code: 2013. Lost connection to MySQL server during query

    执行SQL语句: update  table_C, table_A  set table_C.colC = table_A.colA where table_A.anotherId= table_C. ...

  8. mysql hy000 2013_ERROR 2013 (HY000): Lost connection to MySQL server during query

    问题描述: Navicat连接远程服务器Mysql报错: ERROR 2013 (HY000): Lost connection to MySQL server during query 过程: Na ...

  9. 解决InterfaceError: 2013: Lost connection to MySQL server during query

    服务器后台报错: InterfaceError: 2013: Lost connection to MySQL server during query 具体信息如下: File "./sr/ ...

最新文章

  1. 亲试白天使:华硕家用级无线路由RT-N11+
  2. Unity面试题精选(2)
  3. 【SpringBoot】 配置文件的拆分
  4. java 记录考勤记录_Java中的记录器– Java记录示例
  5. Nginx Unit 与 LEMUR 新架构
  6. svn commit 问题 POST of '***/!svn/me' 403 forbidden
  7. 前后端分离 -- 深入浅出 Spring Boot + Vue 实现工程项目进度管理系统 Vue不过如此~
  8. IT人论房价 (七) 城市置业乱谈
  9. 云端服务器与传统的独立主机服务器有什么区别
  10. 2023java面试看完这篇笔记薪资和offer稳了!
  11. 如何在ASP.Net Core中使用辅助服务
  12. 机器学习_深度学习毕设题目汇总——语音
  13. 国产系统UOS上的视频监控系统
  14. 阿里云DataWorks介绍
  15. blackjack java gui_求助一个java的题,blackjack, class 分为Card 和Deck,求大神帮我做一个程序...
  16. python水果超市管理系统_项目1-水果超市管理系统-学生
  17. js-Symbol.hasInstance
  18. 互联网晚报 | 11/22星期二 | 威马汽车上海全员工资打折;中国千万资产家庭达206万户;充电宝质量国家抽查不合格率达25%...
  19. 如何避免学计算机掉头发,电脑工作者如何防止脱发
  20. AEC-Q100和ISO26262 ASIL-B双重认证

热门文章

  1. 【洛谷】P2713 罗马游戏
  2. 这8种武器点亮程序员的个人品牌
  3. jQuery简介及选择器
  4. 分享:笔记本花屏的解决方法
  5. 湖南科技大学计算机领导参观老年日间中心,湖南科技大学新闻网
  6. H7-TOOL高速DAPLINK仿真器功能已发布,支持WinUSB和HID方式二合一,含常见问题FAQ(2020-12-07)
  7. 计算机课自我介绍二百字,自我介绍作文200个字
  8. 如何搭建视频通信服务器架构
  9. win10默认浏览器不显示谷歌浏览器_win10系统下google浏览器无法打开网页的解决方法...
  10. 血眼龙王萧沙传-王遗风篇(2)