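Editor's note: This article walks through the troubleshooting of a strange failure in which an Oracle cluster could not start after a hardware power loss.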

I. Problem Description

Symptom: after a hardware power loss, the Oracle cluster fails to start.

[root@rac2 ~]# crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@rac2 ~]# crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

II. Troubleshooting

Checking the cluster components shows that ora.asm is OFFLINE:

[root@rac2 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac2
ora.crf
      1        ONLINE  ONLINE       rac2
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  ONLINE       rac2
ora.cssdmonitor
      1        ONLINE  ONLINE       rac2
ora.ctssd
      1        ONLINE  ONLINE       rac2                     OBSERVER
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        ONLINE  ONLINE       rac2
ora.evmd
      1        ONLINE  INTERMEDIATE rac2
ora.gipcd
      1        ONLINE  ONLINE       rac2
ora.gpnpd
      1        ONLINE  ONLINE       rac2
ora.mdnsd
      1        ONLINE  ONLINE       rac2
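When the resource list is long, it can help to filter it mechanically. The following is a minimal sketch, assuming the usual two-line layout of `crsctl stat res -t -init` (a resource name line followed by an indented line holding the instance number, TARGET and STATE). The function name `unhealthy_resources` and the sample text are illustrative and not part of any Oracle tool.

```python
import re

# A few rows copied from the listing above, one resource per two lines.
SAMPLE = """\
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  ONLINE       rac2
ora.diskmon
      1        OFFLINE OFFLINE
ora.evmd
      1        ONLINE  INTERMEDIATE rac2
"""


def unhealthy_resources(text):
    """Return (name, target, state) for every resource whose STATE differs from its TARGET."""
    results = []
    name = None
    for line in text.splitlines():
        if line.startswith("ora."):
            # Resource name line; the status follows on the next line.
            name = line.strip()
            continue
        match = re.match(r"\s+\d+\s+(\S+)\s+(\S+)", line)
        if match and name:
            target, state = match.groups()
            if target != state:
                results.append((name, target, state))
    return results


for name, target, state in unhealthy_resources(SAMPLE):
    print(f"{name}: target={target} state={state}")
```

On this sample the filter reports ora.asm, ora.crsd and ora.evmd. ora.diskmon is skipped because its target is itself OFFLINE.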

The Grid Infrastructure alert log indicates that the disk group was not mounted:

[ohasd(4329)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2018-05-08 04:12:24.940:
[cssd(4576)]CRS-1707:Lease acquisition for node rac2 number 2 completed
2018-05-08 04:12:26.188:
[cssd(4576)]CRS-1605:CSSD voting file is online: /dev/asmdisk/oraasm-OCR_0000; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2018-05-08 04:12:28.723:
[cssd(4576)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2018-05-08 04:12:30.617:
[ctssd(4660)]CRS-2401:The Cluster Time Synchronization Service started on host rac2.
2018-05-08 04:12:30.617:
[ctssd(4660)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
2018-05-08 04:12:32.348:
[ohasd(4329)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2018-05-08 04:12:32.348:
[ohasd(4329)]CRS-2769:Unable to failover resource 'ora.diskmon'.

The ASM alert log reports ORA-00600 [kfrValAcd30]:

NOTE: GMON heartbeating for grp 2
GMON querying group 2 at 6 for pid 23, osid 5727
NOTE: cache opening disk 0 of grp 2: DATA_0000 path:/dev/asmdisk/oraasm-ASM_0000
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 2: DATA_0001 path:/dev/asmdisk/oraasm-ASM_0001
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: cache opening disk 2 of grp 2: DATA_0002 path:/dev/asmdisk/oraasm-ASM_0002
NOTE: F1X0 found on disk 2 au 2 fcn 0.0
NOTE: cache opening disk 3 of grp 2: DATA_0003 path:/dev/asmdisk/oraasm-ASM_0003
NOTE: cache mounting (first) normal redundancy group 2/0x877A96CD (DATA)
* allocate domain 2, invalid = TRUE
NOTE: attached to recovery domain 2
NOTE: starting recovery of thread=1 ckpt=8.390 group=2 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_5727.trc  (incident=50111):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_50111/+ASM2_ora_5727_i50111.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_5727.trc:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
Abort recovery for domain 2
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
NOTE: cache dismounting (clean) group 2/0x877A96CD (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
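The ORA-00600 argument list is easier to compare with other log lines once it is split into fields. The sketch below is only a parsing aid: the helper names `ora600_args` and `recovery_ckpt` are made up for this example, and the article does not document what each argument means. All the code shows is that arguments 4 and 5 ([8], [390]) coincide with the `ckpt=8.390` value in the "starting recovery" line.

```python
import re

ERROR_LINE = (
    "ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], "
    "[8], [390], [9], [390], [], [], [], [], []"
)
NOTE_LINE = "NOTE: starting recovery of thread=1 ckpt=8.390 group=2 (DATA)"


def ora600_args(line):
    """Return the non-empty bracketed arguments of an ORA-00600 line, or None if it is not one."""
    match = re.search(r"ORA-00600: internal error code, arguments: (.*)", line)
    if not match:
        return None
    return [arg for arg in re.findall(r"\[([^\]]*)\]", match.group(1)) if arg]


def recovery_ckpt(line):
    """Return the (seq, blk) pair from a 'ckpt=SEQ.BLK' token, or None if absent."""
    match = re.search(r"ckpt=(\d+)\.(\d+)", line)
    return (int(match.group(1)), int(match.group(2))) if match else None


args = ora600_args(ERROR_LINE)
print(args)                      # the seven non-empty arguments
print(recovery_ckpt(NOTE_LINE))  # checkpoint quoted by the mount attempt
```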

Inspect the incident trace file +ASM2_ora_5727_i50111.trc:

kfdp_query(DATA): 6
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdp_query()+530<-kfdPstSyncPriv()+585<-kfgFinalizeMount()+1630<-kfgscFinalize()+1433<-kfgForEachKfgsc()+285<-kfgsoFinalize()+135<-kfgFinalize()+398<-kfxdrvMount()+5558<-kfxdrvEntry()+2207<-opiexe()+20624<-opiosq0()+3932<-kpooprx()+274<-kpoal8()+842<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----
2018-05-17 21:54:44.321623 : Start recovery for domain=2, valid=0, flags=0x4
NOTE: starting recovery of thread=1 ckpt=8.390 group=2 (DATA)
Incident 50111 created, dump file: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_50111/+ASM2_ora_5727_i50111.trc
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
recovery of group DATA failed due to the following error(s):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
ORA-15017: diskgroup "ASM" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
2018-05-17 21:54:45.016907 : Abort recovery for domain 2, flags = 0x4
2018-05-17 21:54:45.017799 : kjb_abort_recovery: domain flags=0x0, valid=0
NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
kfdp_dismount(): 7
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdp_dismountInt()+389<-kfdp_dismount()+11<-kfgTermCache()+347<-kfgRecoverDismount()+373<-kfgRecoverMount()+277<-kfgscDelete()+2742<-kssdel()+155<-kfgscFinalize()+1211<-kfgForEachKfgsc()+285<-kfgsoFinalize()+135<-kfgFinalize()+398<-kfxdrvMount()+5558<-kfxdrvEntry()+2207<-opiexe()+20624<-opiosq0()+3932<-kpooprx()+274<-kpoal8()+842<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----
ASM name of disk 0x9f1d9488 (2:0:DATA_0000:/dev/asmdisk/oraasm-ASM_0000) is being cleared
ASM name of disk 0x9f1d9808 (2:1:DATA_0001:/dev/asmdisk/oraasm-ASM_0001) is being cleared
ASM name of disk 0x9f1d9108 (2:2:DATA_0002:/dev/asmdisk/oraasm-ASM_0002) is being cleared
ASM name of disk 0x9f1d8d88 (2:3:DATA_0003:/dev/asmdisk/oraasm-ASM_0003) is being cleared
NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
kfdp_dismount(): 8
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdp_dismountInt()+389<-kfdp_dismount()+11<-kfgTermCache()+347<-kfgRecoverDismount()+373<-kfgRecoverMount()+277<-kfgscDelete()+2742<-kssdct()+258<-kfgsoDelete()+721<-kssdel()+155<-kssdch()+6849<-ksures()+52<-opiosq0()+4721<-kpooprx()+274<-kpoal8()+842<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----

A search on MOS (My Oracle Support) also turned up several bugs:

Bug 27288230 - ORA-600: [kfrvalacd30] in ASM SMON (Doc ID 27288230.8)
Bug 19064132 : ORA-600 [KFRVALACD30] MOUNTING DISKGROUP AFTER RESTARTING SERVERS
ORA-600 [KFRVALACD30] in ASM (Doc ID 2123013.1)

The bug is present in 11.2.0.1, 11.2.0.3, 11.2.0.4, 12.1 and 18.1.

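Oracle attributes this to a storage or OS problem that left the metadata in the ASM ACD blocks inconsistent, and it may have corrupted both the primary and the mirror extents of the ASM metadata. Such corruption can cause a rebalance to hang or fail repeatedly, or prevent the disk group from being mounted.

strace shows the ORA 600 [kfrValAcd30] error as well, but yields no other useful information: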

write(25, "string_value", 12)           = 12
write(25, ">", 1)                       = 1
write(25, "ORA 600 [kfrValAcd30]", 21)  = 21
write(25, "</", 2)                      = 2
write(25, "string_value", 12)           = 12
write(25, ">", 1)                       = 1
write(25, "</", 2)                      = 2
write(25, "pkey_dbgristih", 14)         = 14
write(25, ">", 1)                       = 1
write(25, "\n", 1)                      = 1
write(25, "<", 1)                       = 1
write(25, "creatm_dbgristih", 16)       = 16
write(25, ">", 1)                       = 1
write(25, "<", 1)                       = 1
write(25, "time", 4)                    = 4
write(25, ">", 1)                       = 1
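Output like this is hard to read because the process emits the text in tiny `write()` calls. A small helper can stitch the payloads for one file descriptor back together. This is a minimal sketch, assuming strace's default line format with complete (untruncated) string arguments; the function name `reassemble` is hypothetical.

```python
import ast
import re

# Matches: write(FD, "payload", LEN) = RET
# Strings that strace truncated (shown as "..."... ) are deliberately not matched.
WRITE_RE = re.compile(r'write\((\d+), ("(?:[^"\\]|\\.)*"), \d+\)\s*=\s*\d+')

SAMPLE = r'''
write(25, "string_value", 12)           = 12
write(25, ">", 1)                       = 1
write(25, "ORA 600 [kfrValAcd30]", 21)  = 21
write(25, "</", 2)                      = 2
write(25, "string_value", 12)           = 12
write(25, ">", 1)                       = 1
write(25, "</", 2)                      = 2
write(25, "pkey_dbgristih", 14)         = 14
write(25, ">", 1)                       = 1
write(25, "\n", 1)                      = 1
'''


def reassemble(strace_text, fd):
    """Concatenate the payloads of all write() calls to the given file descriptor."""
    parts = []
    for match in WRITE_RE.finditer(strace_text):
        if int(match.group(1)) == fd:
            # The quoted payload uses C-style escapes that Python also
            # understands (\n, \t, octal), so literal_eval can decode it.
            parts.append(ast.literal_eval(match.group(2)))
    return "".join(parts)


print(reassemble(SAMPLE, 25))
```

For the lines above, the result is the fragment of an XML incident record that carries the error string, which matches the article's finding that strace adds nothing beyond the error itself.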

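Put simply, ASM's Active Change Directory (ACD) is the redo log for ASM metadata. Note that every ASM instance has its own ACD area, so a two-node RAC carries 84 MB of ACD in total (42 MB per instance).

The ACD mainly records the thread number, block number, opcode, sequence, record length and similar information.

When a running ASM instance crashes suddenly, ASM uses the ACD on restart to perform "instance recovery", which is what allows the instance to start normally again.

If the ACD itself is corrupted, the disk group cannot be mounted.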
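To make the idea concrete, here is a toy model of an ACD as a list of change records plus a checkpoint. It is purely conceptual and is not Oracle's on-disk format or recovery algorithm. The class `ChangeRecord`, the helpers, and especially the rule "replay every record at or after the checkpoint" are assumptions chosen for illustration. The 42 MB per-instance figure is the one implied by the 84 MB total for two nodes.

```python
from dataclasses import dataclass

ACD_MB_PER_INSTANCE = 42  # implied by the article: 84 MB for a two-node RAC


@dataclass
class ChangeRecord:
    """Hypothetical record holding the kinds of fields the ACD is said to store."""
    thread: int
    seq: int
    blk: int
    opcode: int
    length: int


def acd_total_mb(instances):
    """Total ACD space for a cluster with the given number of ASM instances."""
    return ACD_MB_PER_INSTANCE * instances


def records_to_replay(records, thread, ckpt_seq, ckpt_blk):
    """Toy rule (an assumption): replay one thread's records at or after the checkpoint."""
    return [
        r for r in records
        if r.thread == thread and (r.seq, r.blk) >= (ckpt_seq, ckpt_blk)
    ]


log = [
    ChangeRecord(thread=1, seq=8, blk=389, opcode=212, length=52),
    ChangeRecord(thread=1, seq=8, blk=390, opcode=212, length=52),
    ChangeRecord(thread=1, seq=8, blk=391, opcode=212, length=52),
    ChangeRecord(thread=2, seq=8, blk=400, opcode=212, length=52),
]

print(acd_total_mb(2))
print([r.blk for r in records_to_replay(log, thread=1, ckpt_seq=8, ckpt_blk=390)])
```

In this model a mount would start reading at the checkpoint and apply whatever follows. If the records at that position cannot be validated, there is nothing safe to replay, and that is the situation in which the mount fails.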

Use kfed to read the metadata. First, locate the AU that holds the Active Change Directory:

[grid@rac2 ~]$  kfed read  /dev/dm-0  aun=2 blkn=3|grep .xptr.au |head -1
kfffde[0].xptr.au:                    4 ; 0x4a0: 0x00000004

The output indicates that the ACD metadata and data should be located in AU 4.

Read AU 4:

[grid@rac2 ~]$  kfed read /dev/dm-0 aun=4 blkn=1|more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            8 ; 0x002: KFBTYP_CHNGDIR
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       1 ; 0x004: blk=1
kfbh.block.obj:                       3 ; 0x008: file=3
kfbh.check:                    17400326 ; 0x00c: 0x01098206
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfracdb.aba.seq:                      2 ; 0x000: 0x00000002
kfracdb.aba.blk:                      0 ; 0x004: 0x00000000
kfracdb.ents:                         2 ; 0x008: 0x0002
kfracdb.ub2spare:                     0 ; 0x00a: 0x0000
kfracdb.lge[0].valid:                 1 ; 0x00c: V=1 B=0 M=0
kfracdb.lge[0].chgCount:              1 ; 0x00d: 0x01
kfracdb.lge[0].len:                  52 ; 0x00e: 0x0034
kfracdb.lge[0].kfcn.base:             1 ; 0x010: 0x00000001
kfracdb.lge[0].kfcn.wrap:             0 ; 0x014: 0x00000000
kfracdb.lge[0].bcd[0].kfbl.blk:       0 ; 0x018: blk=0
kfracdb.lge[0].bcd[0].kfbl.obj:       4 ; 0x01c: file=4
kfracdb.lge[0].bcd[0].kfcn.base:      0 ; 0x020: 0x00000000
kfracdb.lge[0].bcd[0].kfcn.wrap:      0 ; 0x024: 0x00000000
kfracdb.lge[0].bcd[0].oplen:          4 ; 0x028: 0x0004  --- length, similar to LEN in a logfile dump
kfracdb.lge[0].bcd[0].blkIndex:       0 ; 0x02a: 0x0000
kfracdb.lge[0].bcd[0].flags:         28 ; 0x02c: F=0 N=0 F=1 L=1 V=1 A=0 C=0
kfracdb.lge[0].bcd[0].opcode:       212 ; 0x02e: 0x00d4  -- opcode, similar to the opcode numbers of update/delete/insert operations in a database instance
kfracdb.lge[0].bcd[0].kfbtyp:         9 ; 0x030: KFBTYP_COD_BGO  -- operation type, similar to update/delete/insert types in a database instance
kfracdb.lge[0].bcd[0].redund:        17 ; 0x031: SCHE=0x1 NUMB=0x1  -- redundancy level: 17 = unprotected, 18 = mirror, 19 = high
kfracdb.lge[0].bcd[0].pad:        63903 ; 0x032: 0xf99f
kfracdb.lge[0].bcd[0].KFRCOD_CRASH:   1 ; 0x034: 0x00000001
kfracdb.lge[0].bcd[0].au[0]:         46 ; 0x038: 0x0000002e
kfracdb.lge[0].bcd[0].disks[0]:       0 ; 0x03c: 0x0000
kfracdb.lge[1].valid:                 1 ; 0x040: V=1 B=0 M=0
kfracdb.lge[1].chgCount:              1 ; 0x041: 0x01

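Fields such as check are hash values that are refreshed every 3 seconds. The most likely explanation is that the sudden power loss prevented the cached information from being written to disk.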

[grid@rac2 ~]$  kfed read /dev/dm-2 aun=4 blkn=0|grep ckpt
kfracdc.ckpt.seq:                     8 ; 0x018: 0x00000008
kfracdc.ckpt.blk:                   390 ; 0x01c: 0x00000186
[grid@rac2 ~]$  kfed read /dev/dm-0 aun=4 blkn=0|grep ckpt
kfracdc.ckpt.seq:                    70 ; 0x018: 0x00000046
kfracdc.ckpt.blk:                   255 ; 0x01c: 0x000000f
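Comparing checkpoint values by eye across several disks is error-prone, so a tiny parser can help. The sketch below assumes the `name: decimal ; offset: detail` layout that kfed prints in the listings above. The names `parse_kfed` and `ckpt` are illustrative, not part of kfed.

```python
import re

# name:   <decimal value> ; <hex offset>: <rendered detail>
KFED_LINE = re.compile(r"^(\S+):\s+(-?\d+)\s*;\s*(0x[0-9a-fA-F]+):\s*(.*)$")

DM2 = """\
kfracdc.ckpt.seq:                     8 ; 0x018: 0x00000008
kfracdc.ckpt.blk:                   390 ; 0x01c: 0x00000186
"""
DM0 = """\
kfracdc.ckpt.seq:                    70 ; 0x018: 0x00000046
kfracdc.ckpt.blk:                   255 ; 0x01c: 0x000000f
"""


def parse_kfed(text):
    """Map each kfed field name to its decimal value; unparseable lines are skipped."""
    fields = {}
    for line in text.splitlines():
        match = KFED_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def ckpt(fields):
    """Return the ACD checkpoint as a (seq, blk) pair."""
    return fields["kfracdc.ckpt.seq"], fields["kfracdc.ckpt.blk"]


print("dm-2:", ckpt(parse_kfed(DM2)))
print("dm-0:", ckpt(parse_kfed(DM0)))
```

The pair read from /dev/dm-2 (seq 8, blk 390) is the same checkpoint quoted in the mount attempt (`ckpt=8.390`) and in the ORA-00600 arguments, which ties the failing recovery to this block.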

III. Resolution

Repair the block manually with kfed merge:

kfed read /dev/dm-2 aun=4 blkn=0 >acd.txt
kfed merge /dev/dm-2 aun=4 blkn=0  text=acd.txt
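Because kfed merge writes directly to the disk, it can be worth generating the two commands from one set of parameters so that the device, AU and block number cannot drift apart. The following dry-run sketch only builds the command strings and executes nothing. The function names are hypothetical and it uses only the kfed arguments that appear above (`aun=`, `blkn=`, `text=`). The article does not show whether acd.txt was edited between the two steps, so the sketch does not model any edit either.

```python
import shlex


def kfed_read_cmd(device, aun, blkn):
    """Argument list for dumping one metadata block as text."""
    return ["kfed", "read", device, f"aun={aun}", f"blkn={blkn}"]


def kfed_merge_cmd(device, aun, blkn, text_file):
    """Argument list for writing a text dump back into the same block."""
    return ["kfed", "merge", device, f"aun={aun}", f"blkn={blkn}", f"text={text_file}"]


def plan(device, aun, blkn, text_file):
    """Return the shell lines for the dump and merge steps (dry run, nothing is executed)."""
    read_line = shlex.join(kfed_read_cmd(device, aun, blkn)) + " > " + shlex.quote(text_file)
    merge_line = shlex.join(kfed_merge_cmd(device, aun, blkn, text_file))
    return [read_line, merge_line]


for line in plan("/dev/dm-2", 4, 0, "acd.txt"):
    print(line)
```

Printing the plan first gives a chance to review exactly which block on which device will be rewritten before anything is run as the grid user.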

Then mount the disk group manually:

SQL>  alter diskgroup data mount;
 alter diskgroup data mount
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8],
[390], [9], [390], [], [], [], [], []

SQL> alter diskgroup data mount;
Diskgroup altered.

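If kfed merge cannot repair the damage, contact Enmotech emergency support (400-660-8755) promptly so that the data files can be recovered from the underlying ASM disks with ODU, salvaging as much data as possible.

Original article on 墨天轮 (modb.pro): https://www.modb.pro/db/26884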

