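For a working stiff, perhaps nothing is more painful than hearing about an outage from someone else. The database had monitoring, alerting, and daily inspections; I had done a whole round of "security work" on it and thought nothing could go wrong. Then, in the middle of a peaceful midday nap, I was told the database had crashed. I was dumbfounded: impossible, there was no alert, and I had logged in to query something just half an hour earlier. How could it have crashed? But when I logged in to the system, reality slapped me in the face: node 1 of the database had crashed while node 2 was normal, and monitoring and alerting had suddenly developed problems of their own, though that is not today's topic. I started the database right away, and it actually came up. So how did it go down? Time to study the alert log.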

### alert.log

Archived Log entry 7660224 added for thread 1 sequence 1179174 ID 0x5ad9d9fa dest 1:
Thu Jan 14 14:10:04 2021
TABLE E2EQA.F_JIAKUAN_DPI_DNS_USER_IP_H: ADDED INTERVAL PARTITION SYS_P5860754 (17860) VALUES LESS THAN (TO_DATE(' 2021-01-14 04:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
Thu Jan 14 14:10:14 2021
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc  (incident=564829):
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/nmsb/nmsb1/incident/incdir_564829/nmsb1_lms4_248229_i564829.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Jan 14 14:10:15 2021
Dumping diagnostic data in directory=[cdmp_20210114141015], requested by (instance=1, osid=248229 (LMS4)), summary=[incident=564829].
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc:
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
Thu Jan 14 14:10:16 2021
Sweep [inc][564829]: completed
Sweep [inc2][564829]: completed
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc:
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
LMS4 (ospid: 248229): terminating the instance due to error 484
Thu Jan 14 14:10:17 2021
opiodr aborting process unknown ospid (144669) as a result of ORA-1092

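The alert.log shows that at 14:10 on January 14 the database hit ORA-00600 [kjmchkiseq:!seq]. The trace file named in the log shows that the error was raised by an LMS process (LMS4), so the next step is to check the trace log.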
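
When an alert.log is long, the same triage can be done mechanically. The following is a minimal sketch, assuming the log uses the timestamp and message layout seen in the excerpt above; the function name `summarize_alert_log` and the regular expressions are illustrative and are not part of any Oracle tool.

```python
import re

# Patterns written against the alert.log excerpt in this article (assumption:
# other Oracle versions may format these lines differently).
TIMESTAMP_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: \[([^\]]*)\]")
TERMINATE_RE = re.compile(r"^(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)")


def summarize_alert_log(lines):
    """Return (timestamp, kind, detail) tuples for ORA-00600 and termination lines."""
    events = []
    current_ts = None
    for raw in lines:
        line = raw.rstrip("\n")
        if TIMESTAMP_RE.match(line):
            current_ts = line  # remember the most recent timestamp header
            continue
        m = ORA600_RE.search(line)
        if m:
            events.append((current_ts, "ORA-00600", m.group(1)))
            continue
        m = TERMINATE_RE.match(line)
        if m:
            detail = f"{m.group(1)} ospid={m.group(2)} error={m.group(3)}"
            events.append((current_ts, "terminate", detail))
    return events


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """Thu Jan 14 14:10:14 2021
Errors in file /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc  (incident=564829):
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
Thu Jan 14 14:10:16 2021
Sweep [inc][564829]: completed
LMS4 (ospid: 248229): terminating the instance due to error 484
""".splitlines()

if __name__ == "__main__":
    for ts, kind, detail in summarize_alert_log(SAMPLE):
        print(ts, kind, detail)
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the function only iterates over lines.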

### nmsb1_lms4_248229_i564829.trc

Dump continued from file: /u01/app/oracle/diag/rdbms/nmsb/nmsb1/trace/nmsb1_lms4_248229.trc
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []

========= Dump for incident 564829 (ORA 600 [kjmchkiseq:!seq]) ========
----- Beginning of Customized Incident Dump(s) -----
MSG [34:KJX_B_CLOSE] [0xad0f9.87a] inc=20 len=144 sender=(2,2) seq=0   <---block 708857 file 2170
     fg=iqb stat=KJUSERSTAT_DONE spnum=18 flg=x2
----- End of Customized Incident Dump(s) -----

Session Wait History:
        elapsed time of 0.646200 sec since last wait
     0: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542447 seq_num=55512 snap_id=1
        wait times: snap=0.000098 sec, exc=0.000098 sec, total=0.000098 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000055 sec of elapsed time
     1: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542446 seq_num=55511 snap_id=1
        wait times: snap=0.001050 sec, exc=0.001050 sec, total=0.001050 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000010 sec of elapsed time
     2: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542445 seq_num=55510 snap_id=1
        wait times: snap=0.000127 sec, exc=0.000127 sec, total=0.000127 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000023 sec of elapsed time
     3: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542444 seq_num=55509 snap_id=1
        wait times: snap=0.000208 sec, exc=0.000208 sec, total=0.000208 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000064 sec of elapsed time
     4: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542443 seq_num=55508 snap_id=1
        wait times: snap=0.000380 sec, exc=0.000380 sec, total=0.000380 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000020 sec of elapsed time
     5: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542442 seq_num=55507 snap_id=1
        wait times: snap=0.000385 sec, exc=0.000385 sec, total=0.000385 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000014 sec of elapsed time
     6: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542441 seq_num=55506 snap_id=1
        wait times: snap=0.000057 sec, exc=0.000057 sec, total=0.000057 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000045 sec of elapsed time
     7: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542440 seq_num=55505 snap_id=1
        wait times: snap=0.000242 sec, exc=0.000242 sec, total=0.000242 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000014 sec of elapsed time
     8: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542439 seq_num=55504 snap_id=1
        wait times: snap=0.000315 sec, exc=0.000315 sec, total=0.000315 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000033 sec of elapsed time
     9: waited for 'gcs remote message'
        waittime=0x1e, poll=0x0, event=0x0
        wait_id=3658542438 seq_num=55503 snap_id=1
        wait times: snap=0.000561 sec, exc=0.000561 sec, total=0.000561 sec
        wait times: max=0.030000 sec
        wait counts: calls=1 os=1
        occurred after 0.000018 sec of elapsed time
----- DDE Diagnostic Information Dump -----
Depth: 1
DDE flags: 0x0
Heap: 0x7f44fbd51ec0
Incident Context pointer in diag: 0x7fff75931748
Incident ID Cache: 0x137d41e5e8 
Invocation Context #: 0
----- Invocation Context Dump -----
Address: 0x7f44fbd55f60
Phase: 3
flags: 0x18E0001
Incident ID: 564829
Error Descriptor: ORA-600 [kjmchkiseq:!seq] [0] [0] [2147483647] [255] [34] [2] [1] [] [] [] []
Error class: 0
Problem Key # of args: 1
Number of actions: 11
----- Incident Context Dump -----
Address: 0x7fff75931748
Incident ID: 564829
Problem Key: ORA 600 [kjmchkiseq:!seq]
Error: ORA-600 [kjmchkiseq:!seq] [0] [0] [2147483647] [255] [34] [2] [1] [] [] [] []
[00]: dbgexExplicitEndInc [diag_dde]
[01]: dbgeEndDDEInvocationImpl [diag_dde]
[02]: dbgeEndDDEInvocation [diag_dde]
[03]: kjmvalidate [RAC_MLMDS]<-- Signaling  
[04]: kjmpbmsg [RAC_MLMDS]
[05]: kjmsm [RAC_MLMDS]
[06]: ksbrdp [background_proc]
[07]: opirip []  
[08]: opidrv []
[09]: sou2o []
[10]: opimai_real []
[11]: ssthrdmain []
[12]: main []
[13]: __libc_start_main []
MD [00]: 'SID'='1423.1' (0x2)
MD [01]: 'ProcId'='18.1' (0x2)
MD [02]: 'Client ProcId'='oracle@dm05db01.bmcc.com.cn.248229_139934305621760' (0x0)

End of KJC Communication Dump
 DEFER MSG QUEUE 0 ON LMS4 IS EMPTY
 SEQUENCES: 
[sidx 0]  0:0.0  1:442149.0
 DEFER MSG QUEUE 1 ON LMS4 IS EMPTY
 SEQUENCES: 
[sidx 1]  0:0.0  1:2147483647.255
 DEFER MSG QUEUE 2 ON LMS4 IS EMPTY
 SEQUENCES: 
[sidx 2]  0:0.0  1:2147483647.255
error 484 detected in background process
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], [2147483647], [255], [34], [2], [1], [], [], [], []
<<<<<<<< kjmchkiseq - check for indirect sequence prior processing. sequence for sendq message must be increasing.
<<<<<<<< 0 = received message seq, 0 = received message wrap, 2147483647 = last message seq, 255 = last message wrap

kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5608<-ksbrdp()+3507<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253 
----- End of Abridged Call Stack Trace -----

*** 2021-01-14 14:10:17.044
LMS4 (ospid: 248229): terminating the instance due to error 484
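
The annotation on the ORA-00600 line in the trace above labels the first four numeric arguments. As a small sketch of that reading, the snippet below pulls the bracketed arguments out of the error line and attaches those labels. The function name `parse_kjmchkiseq_args` and the dictionary keys are my own; only the first four numeric arguments are labelled because the remaining ones are not explained in the trace.

```python
import re

# Labels follow the annotation in the trace: received seq, received wrap,
# last seq, last wrap. The key names themselves are illustrative.
LABELS = ("received_seq", "received_wrap", "last_seq", "last_wrap")


def parse_kjmchkiseq_args(error_line):
    """Extract the bracketed ORA-00600 arguments and label the first four numbers."""
    args = re.findall(r"\[([^\]]*)\]", error_line)
    if not args or args[0] != "kjmchkiseq:!seq":
        raise ValueError("not an ORA-00600 [kjmchkiseq:!seq] line")
    return {label: int(value) for label, value in zip(LABELS, args[1:5])}


# The error line exactly as it appears in the trace.
LINE = (
    "ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [0], [0], "
    "[2147483647], [255], [34], [2], [1], [], [], [], []"
)

if __name__ == "__main__":
    parsed = parse_kjmchkiseq_args(LINE)
    print(parsed)
    # 2147483647 is 2**31 - 1, the largest signed 32-bit integer.
    print("last_seq at 32-bit signed maximum:", parsed["last_seq"] == 2**31 - 1)
```

Read this way, the received message carries seq 0 and wrap 0, while the last recorded message had seq 2147483647 and wrap 255.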

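The trace shows that the LMS process failed while processing a message (the ORA-00600 was raised three times in a row). Under RAC's operating mechanism, when an LMS process receives a bad message it terminates the instance to protect data consistency, which took node 1 out of the cluster. So what caused LMS to fail on the incoming message? Was there a problem with inter-node communication? We collected the OSW (OSWatcher) data to check.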

Ip:
    1845957605880 total packets received
    1373116 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1845955287255 incoming packets delivered
    1789568708375 requests sent out
    5 outgoing packets dropped
    96 dropped because of missing route
    2 fragments dropped after timeout
    279032 reassemblies required
    139515 packets reassembled ok
    2 packet reassembles failed
    139515 fragments received ok
    279030 fragments created
Icmp:
    48861429 ICMP messages received
    8479 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 19166487
        timeout in transit: 2962
        echo requests: 12984085
        echo replies: 16701821
        timestamp request: 60
        address mask request: 99
    37334456 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 7637180
        time exceeded: 2
        echo request: 16713224
        echo replies: 12983990
        timestamp replies: 60
IcmpMsg:
        InType0: 16701821
        InType3: 19166487
        InType8: 12984085
        InType9: 5905
        InType11: 2962
        InType13: 60
        InType17: 99
        InType165: 10
        OutType0: 12983990
        OutType3: 7637180
        OutType8: 16713224
        OutType11: 2
        OutType14: 60
Udp:
    1163780019445 packets received
    7918856 packets to unknown port received.
    942462850 packet receive errors
    1163464113908 packets sent
    RcvbufErrors: 942462850
    IgnoredMulti: 159009153

zzz <01/14/2021 14:10:16> subcount: 111
Ip:
    1845957797979 total packets received
    1373116 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1845955479354 incoming packets delivered
    1789568895187 requests sent out
    5 outgoing packets dropped
    96 dropped because of missing route
    2 fragments dropped after timeout
    279032 reassemblies required
    139515 packets reassembled ok
    2 packet reassembles failed
    139515 fragments received ok
    279030 fragments created
Icmp:
    48861441 ICMP messages received
    8479 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 19166499
        timeout in transit: 2962
        echo requests: 12984085
        echo replies: 16701821
        timestamp request: 60
        address mask request: 99
    37334458 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 7637182
        time exceeded: 2
        echo request: 16713224
        echo replies: 12983990
        timestamp replies: 60
IcmpMsg:
        InType0: 16701821
        InType3: 19166499
        InType8: 12984085
        InType9: 5905
        InType11: 2962
        InType13: 60
        InType17: 99
        InType165: 10
        OutType0: 12983990
        OutType3: 7637182
        OutType8: 16713224
        OutType11: 2
        OutType14: 60
Udp:
    1163780060287 packets received
    7918858 packets to unknown port received.
    942462852 packet receive errors >>>>>>>>>>>>> incr 2
    1163464154791 packets sent
    RcvbufErrors: 942462852
    IgnoredMulti: 159009252
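
Comparing two snapshots by eye is error-prone with counters this large. Below is a minimal sketch that parses the `Udp:` section of two `netstat -s` snapshots and prints the per-counter deltas. The sample values are copied from the two snapshots above (without the hand-written `incr 2` annotation); the function names are illustrative, and the interval between the snapshots is not known because the first one has no `zzz` timestamp header.

```python
import re


def parse_udp_counters(snapshot_text):
    """Return {counter description: value} for the 'Udp:' section of netstat -s output."""
    counters = {}
    in_udp = False
    for line in snapshot_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if re.fullmatch(r"\w+:", stripped):  # section header such as "Ip:" or "Udp:"
            in_udp = stripped == "Udp:"
            continue
        if not in_udp:
            continue
        m = re.match(r"(\d+) (.+)", stripped)  # e.g. "942462850 packet receive errors"
        if m:
            counters[m.group(2).rstrip(".")] = int(m.group(1))
            continue
        m = re.match(r"(\w+): (\d+)", stripped)  # e.g. "RcvbufErrors: 942462850"
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def udp_deltas(before_text, after_text):
    """Subtract the earlier counters from the later ones, key by key."""
    before = parse_udp_counters(before_text)
    after = parse_udp_counters(after_text)
    return {key: after[key] - before[key] for key in after if key in before}


BEFORE = """Udp:
    1163780019445 packets received
    7918856 packets to unknown port received.
    942462850 packet receive errors
    1163464113908 packets sent
    RcvbufErrors: 942462850
    IgnoredMulti: 159009153
"""

AFTER = """Udp:
    1163780060287 packets received
    7918858 packets to unknown port received.
    942462852 packet receive errors
    1163464154791 packets sent
    RcvbufErrors: 942462852
    IgnoredMulti: 159009252
"""

if __name__ == "__main__":
    for name, delta in udp_deltas(BEFORE, AFTER).items():
        print(f"{name}: +{delta}")
```

On these two snapshots the receive-error counter grows by 2 while tens of thousands of datagrams are received, which is the "small amount of UDP loss" weighed in the analysis.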

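Check the database patch level

The database version is Oracle 11.2.0.4.180417 for Exadata.

During the failure window, the IP-level counters (fragments dropped, fragmentation and reassembly) were normal and UDP showed only a small number of dropped packets, so this was not an inter-node communication problem. Searching Metalink (My Oracle Support) further turned up a matching note: Bug 23628685 - RAC instance crashed by lms with ORA-600 [kjmchkiseq:!seq] (Doc ID 23628685.8).

Metalink description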

ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255]
Where:
1 = received message seq
0 = received message wrap
2147483646 = last message seq
255 = last message wrap
Instances have been up for a very long time (in this customers case 592 days) the message header wrap can itself wrap past 255 back to 0.
In this case the next message that has a seq=1 and wrp=0 will fail with a kjmchkiseq:!seq assert as the stored wrp is 255 from the previous message

db alert<sid>.log
-------------------------
Tue Jun 21 09:02:53 2016
Archived Log entry 1100596 added for thread 1 sequence 177386 ID 0x438fd7bc
dest 1:
Tue Jun 21 09:04:59 2016
Errors in file <DIR>/trace/<INSTANCE>_lms1_27192.trc
(incident=1480143):
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255], [34], [0], [1], [], [], [], []
Incident details in:
<DIR>/incident/incdir_1480143/<INSTANCE>_lms1_27192_i1480143.trc
-------------------------------
<sid>_lms1_27192_i1480143.trc
-----------------------
Dump continued from file:
<DIR>/trace/<INSTANCE>_lms1_27192.trc
ORA-00600: internal error code, arguments: [kjmchkiseq:!seq], [1], [0],[2147483646], [255], [34], [0], [1], [], [], [], []
========= Dump for incident 1480143 (ORA 600 [kjmchkiseq:!seq]) ========
----- Beginning of Customized Incident Dump(s) -----
MSG [34:KJX_B_CLOSE] [0x237f3c.c0] inc=4 len=144 sender=(2,2) seq=1
fg=iqb stat=KJUSERSTAT_DONE spnum=15 flg=x0
----- End of Customized Incident Dump(s) -----
STACK TRACE:
------------
Error: ORA-600 [kjmchkiseq:!seq] [1] [0] [2147483646] [255] [34] [0] [1] []
[] [] []
[00]: dbgexExplicitEndInc [diag_dde]
[01]: dbgeEndDDEInvocationImpl [diag_dde]
[02]: dbgeEndDDEInvocation [diag_dde]
[03]: kjmvalidate [RAC_MLMDS]<-- Signaling
[04]: kjmpbmsg [RAC_MLMDS]
[05]: kjmsm [RAC_MLMDS]
[06]: ksbrdp [background_proc]
[07]: opirip []
[08]: opidrv []
[09]: sou2o []
[10]: opimai_real []
[11]: ssthrdmain []
[12]: main []
[13]: __libc_start_main []

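The symptoms are extremely similar to this bug. Unfortunately, Oracle does not explain much about what triggers the bug, and the SR we opened did not yield much more either; the bug can be fixed by applying the patch.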
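
The one mechanism the note does describe is that the wrap counter itself rolls over from 255 back to 0. The toy model below illustrates why a plain "must be increasing" comparison rejects the next message in that situation. It is purely an assumption-based illustration: `naive_is_increasing` is a hypothetical function, not Oracle's code, and the real check inside `kjmvalidate` is not public.

```python
def naive_is_increasing(last, received):
    """Compare (wrap, seq) pairs lexicographically, ignoring wrap rollover.

    Toy model only: it assumes the check treats wrap as the high-order part
    and seq as the low-order part of one increasing counter.
    """
    return tuple(received) > tuple(last)


# (wrap, seq) pairs taken from the two error lines quoted in this article,
# plus one made-up ordinary case for contrast.
CASES = {
    "this incident": {"last": (255, 2147483647), "received": (0, 0)},
    "Doc ID 23628685.8": {"last": (255, 2147483646), "received": (0, 1)},
    "ordinary message (made up)": {"last": (3, 100), "received": (3, 101)},
}

if __name__ == "__main__":
    for name, case in CASES.items():
        ok = naive_is_increasing(case["last"], case["received"])
        verdict = "accepted" if ok else "rejected"
        print(f"{name}: last={case['last']} received={case['received']} -> {verdict}")
```

In this model both real cases are rejected because a stored wrap of 255 is compared with a received wrap of 0, which matches the note's description of the assert and the very long instance uptime it mentions.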
