oracle WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK问题分析

服务概述

财务系统出现业务卡顿，数据库实例2遇到>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误导致业务HANG住。

对此HANG的原因进行分析：

故障发生时，未有主机监控数据，从可以获取的数据库AWR中，可以发现在故障发生前一小时，已经出现有较多cursor: pin S wait on X Latch: row cache objects等待；而从AWR报告中也可以看到问题时段的shared pool 从4800M收缩为了2900M，查看SGA组件变动使用情况，可以发现将近2个GB是被KGH: NO ACCESS所使用，即处于shrik/grow的中间状态。

此时ALERT日志中开始出现有>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误，查看此时的System State dumped TRACE文件，核心进程在等待latch: row cache objects资源，根据故障时ALERT日志中systemstate dump prod2_ora_3191.trc信息来看，以典型的LCK0进程为例，此进程在等待和kghfrunp(释放shared pool中内存)有关的latch,此进程的历史等待事件中也大量存在'latch: shared pool'、latch: row cache objects；kghfrunp(kghfrunp相关的功能是需要从shared pool中释放空间相关)有关的latch。综合考虑,从现在得到的信息来看,很可能发生问题时shared pool的使用出现了问题。

RAC数据库实例HANG分析

4.1 数据库实例日志的分析

4.1.1故障节点数据库实例Alert日志

节点1数据库实例的Alert日志

Fri May 31 12:11:03 2019Incremental checkpoint up to RBA [0x16b38.55e5.0], current log tail at RBA [0x16b38.7949.0]Fri May 31 12:31:15 2019Incremental checkpoint up to RBA [0x16b38.d76f.0], current log tail at RBA [0x16b38.d7a1.0]Fri May 31 12:38:24 2019GES: Potential blocker (pid=4099) on resource CI-0000001E-00000002;enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace fileFri May 31 12:49:04 2019Beginning log switch checkpoint up to RBA [0x16b39.2.10], SCN: 6163266145639Thread 2 advanced to log sequence 92985Current log# 8 seq# 92985 mem# 0: +CWDATA/prod/onlinelog/group_8.525.896335299Current log# 8 seq# 92985 mem# 1: +CWDATA/prod/onlinelog/group_8.628.896335301Fri May 31 12:49:09 2019LNS: Standby redo logfile selected for thread 2 sequence 92985 for destination LOG_ARCHIVE_DEST_2Fri May 31 12:49:09 2019Completed checkpoint up to RBA [0x16b39.2.10], SCN: 6163266145639Fri May 31 12:51:21 2019Incremental checkpoint up to RBA [0x16b39.19.0], current log tail at RBA [0x16b39.19.0]Fri May 31 12:54:11 2019GES: Potential blocker (pid=4099) on resource CI-00000046-00000002;enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace fileFri May 31 13:02:58 2019GES: Potential blocker (pid=3991) on resource DR-00000000-00000000;enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace fileFri May 31 13:11:23 2019Incremental checkpoint up to RBA [0x16b39.e1.0], current log tail at RBA [0x16b39.e1.0]Fri May 31 13:14:13 2019GES: Potential blocker (pid=4099) on resource CI-0000001E-00000002;enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace fileFri May 31 13:25:34 2019>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=287System State dumped to trace file /oracle/product/10.2.0/db/rdbms/log/prod2_ora_3191.trcFri May 31 13:31:25 2019Incremental checkpoint up to RBA [0x16b39.1ad.0], current log tail at RBA [0x16b39.1ad.0]Fri May 31 13:33:01 2019WARNING: inbound connection timed out (ORA-3136)Fri May 31 13:33:01 2019WARNING: inbound connection timed out (ORA-3136)Fri May 31 13:33:01 2019WARNING: inbound connection timed out (ORA-3136)

对故障时间段的alert日志进行分析，可以发现当时节点1的ALERT日志中出现大量如下告警信息：

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=***

4.2 对dump信息的分析

节点2的ALERT日志中提到了大量System State dumped文件，对此类TRACE文件进行了分析。如下：

使用发生问题后产生的systemstate dump prod2_ora_3191.trc。

[oracle@bys1 ~]$ awk -f ass109.awk prod2_ora_3191.trcStarting Systemstate 1.............................................Ass.Awk Version 1.0.9 - Processing prod2_ora_3191.trcSystem State 1~~~~~~~~~~~~~~~~1:                                      2:  waiting for 'pmon timer'            wait3:  waiting for 'DIAG idle wait'        wait4:  waiting for 'rdbms ipc message'     wait5:  waiting for 'rdbms ipc message'     wait6:  waiting for 'ges remote message'    wait7:  waiting for 'gcs remote message'    wait8:  waiting for 'gcs remote message'    wait9:  waiting for 'gcs remote message'    wait10: waiting for 'gcs remote message'    wait11: waiting for 'gcs remote message'    wait12: waiting for 'gcs remote message'    wait13: waiting for 'gcs remote message'    wait14: waiting for 'gcs remote message'    wait15: waiting for 'gcs remote message'    wait16: waiting for 'gcs remote message'    wait17: waiting for 'gcs remote message'    wait18: waiting for 'gcs remote message'    wait19: waiting for 'gcs remote message'    wait20: waiting for 'gcs remote message'    wait21: waiting for 'gcs remote message'    wait22: waiting for 'gcs remote message'    wait23: waiting for 'gcs remote message'    wait24: waiting for 'gcs remote message'    wait25: waiting for 'gcs remote message'    wait26: waiting for 'gcs remote message'    wait27: waiting for 'rdbms ipc message'     wait28: waiting for 'rdbms ipc message'     wait29: waiting for 'rdbms ipc message'     wait30: waiting for 'rdbms ipc message'     wait31: waiting for 'rdbms ipc message'     wait32: waiting for 'rdbms ipc message'     wait33: waiting for 'rdbms ipc message'     wait34: waiting for 'rdbms ipc message'     wait35: waiting for 'rdbms ipc message'     wait36: waiting for 'rdbms ipc message'     wait37: waiting for 'rdbms ipc message'     wait38: waiting for 'rdbms ipc message'     wait39: waiting for 'rdbms ipc message'     wait40: waiting for 'latch: row cache objects'[Latch 33a5d5bd0] wait41: waiting for 'cursor: pin S wait on X' wait42: waiting for 'latch: shared pool'   [Latch 33ba89b70] wait43: waiting for 'rdbms ipc reply'       wait44: waiting for 'rdbms ipc message'     wait45: waiting for 'latch: row cache objects'[Latch 33a5d5bd0] waitBlockers~~~~~~~~Above is a list of all the processes. If they are waiting for a resourcethen it will be given in square brackets. Below is a summary of thewaited upon resources, together with the holder of that resource.Notes:~~~~~o A process id of '???' implies that the holder was not found in thesystemstate.Resource Holder StateLatch 33a5d5bd0    ??? BlockerLatch 33ba89b70    ??? BlockerObject Names~~~~~~~~~~~~Latch 33a5d5bd0 Child row cache objects       Latch 33ba89b70 Child library cache           167446 Lines Processed.

latch: row cache objects等待具体信息

LCK0进程在等待和kghfrunp(释放shared pool中内存)有关的latch,此进程的历史等待事件中也大量存在'latch: shared pool'、latch: row cache objects

PROCESS 45:----------------------------------------SO: 0x357240660, type: 2, owner: (nil), flag: INIT/-/-/0x00(process) Oracle pid=45, calls cur/top: 0x33472fae8/0x33472fae8, flag: (6) SYSTEMint error: 0, call error: 0, sess error: 0, txn error 0(post info) last post received: 0 0 24last post received-location: ksasndlast process to post me: 35224fdd8 39 0last post sent: 0 0 87last post sent-location: kjmdrmchk: drm_hb acklast process posted by me: 355269da8 1 6(latch info) wait_event=0 bits=0Location from where call was made: kqrbip:waiting for 33a5d5bd0 Child row cache objects level=4 child#=16Location from where latch is held: kghfrunp: clatch: wait:  ====>>> 可以看到这个latch是因为kghfrunp(KGH: Ask client to free unpinned space)相关的功能发起的,而这个功能是需要从shared pool中释放空间有关Context saved from call: 0state=busy, wlstate=freewaiters [orapid (seconds since: put on list, posted, alive check)]:50 (1, 1559280342, 1)45 (1, 1559280342, 1)waiter count=2gotten 540727585 times wait, failed first 2224239 sleeps 386931gotten 3781484 times nowait, failed: 419225possible holder pid = 223 ospid=30412on wait list for 33a5d5bd0Process Group: DEFAULT, pseudo proc: 0x3552a2308O/S info: user: oracle, term: UNKNOWN, ospid: 4099OSD pid info: Unix process pid: 4099, image: oracle@findbb (LCK0)Short stack dump: ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+111<-<0x311720eb10>

此进程的历史等待事件：

----------------------------------------SO: 0x3553c5998, type: 4, owner: 0x357240660, flag: INIT/-/-/0x00(session) sid: 1537 trans: (nil), creator: 0x357240660, flag: (51) USR/- BSY/-/-/-/-/-DID: 0000-002D-00000003, short-term DID: 0000-0000-00000000txn branch: (nil)oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYSwaiting for 'latch: row cache objects' blocking sess=0x(nil) seq=30395 wait_time=0 seconds since wait started=0address=33a5d5bd0, number=c7, tries=2Dumping Session Wait Historyfor 'latch: row cache objects' count=1 wait_time=41address=33a5d5bd0, number=c7, tries=1for 'latch: row cache objects' count=1 wait_time=422982address=33a5d5bd0, number=c7, tries=0for 'latch: shared pool' count=1 wait_time=28address=600ebf80, number=d5, tries=0for 'latch: shared pool' count=1 wait_time=57address=600ebf80, number=d5, tries=0for 'latch: row cache objects' count=1 wait_time=135277address=33a5d5bd0, number=c7, tries=0for 'latch: shared pool' count=1 wait_time=29address=600ebf80, number=d5, tries=1for 'latch: shared pool' count=1 wait_time=37address=600ebf80, number=d5, tries=0for 'latch: shared pool' count=1 wait_time=78address=600ebf80, number=d5, tries=0for 'latch: row cache objects' count=1 wait_time=47585address=33a5d5bd0, number=c7, tries=0for 'latch: row cache objects' count=1 wait_time=45842address=33a5d5bd0, number=c7, tries=0temporary object counter: 0----------------------------------------UOL used : 0 locks(used=0, free=0)KGX Atomic Operation Log 0x35afaaab8Mutex (nil)(0, 0) idn 7fff00000000 oper NONECursor Pin uid 1537 efd 3 whr 11 slp 0KGX Atomic Operation Log 0x35afaab00Mutex (nil)(0, 0) idn 7fff00000000 oper NONELibrary Cache uid 1537 efd 0 whr 0 slp 0KGX Atomic Operation Log 0x35afaab48Mutex (nil)(0, 0) idn 7fff00000000 oper NONELibrary Cache uid 1537 efd 0 whr 0 slp 0----------------------------------------SO: 0x33c58c458, type: 41, owner: 0x3553c5998, flag:

4.3 数据库AWR信息分析

数据库实例在11点时已经HANG住，未生成AWR快照。因此使用9-10点的快照来查看故障发生前是否存在异常。

从AWR可以看到shared pool 从4800M收缩为了2900M，查看SGA组件变动使用情况，可以发现将近2个GB是被KGH: NO ACCESS所使用，即处于shrik/grow的中间状态。

总结与建议

针对此次RAC数据库实例问题，总结及建议如下：

5.1 问题描述与分析

此套RAC上运行的数据库实例，数据库版本为10.2.0.3.0；

其中数据库实例2遇到>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误导致业务HANG住。

对此HANG的原因进行分析：

5.2 后续处理方法与建议

1. 数据库SGA内存参数的优化

当前节点1主机内存大小为32GB，节点2由于更换过硬件设备，内存为128GB。

因此之前出于两节点参数统一未对节点2的内存参数进行增加，使用的是sga_target= 12884901888，即12GB。

建议将两节点内存统计设置为SGA设置为节点1内存的一半大小，即16GB。（注意此操作需要重启数据库需要安排停机维护时间窗口进行）

2.数据库buffer cache及shared pool组件优化

建议设置最小的buffer cache及shared pool的值

这是用来设置下限值，使SGA自动管理时不会调整到低于此值。具体值应结合历史AWR报告中Report Summary部分的Cache Sizes大小及Advisory Statistics 部分建议的 Buffer Pool Advisory/Shared Pool Advisory值。使用的命令如下：（注意在非业务高峰时或安排停机维护时间窗口进行）

SQL> ALTER SYSTEM SET DB_CACHE_SIZE=n SCOPE=SPFILE;

SQL> ALTER SYSTEM SET SHARED_POOL_SIZE=m SCOPE=SPFILE;

建议可以设置为BUFFER CACHE 10GB，SHARED POOL 5GB。

3.关闭DRM相关参数

10G

_gc_affinity_time=0

_gc_undo_affinity=FALSE

这2个参数是静态参数，也就是说必须要重启实例才能生效。

实际上可以设置另外2个动态的隐含参数，来达到这个目的。按下面的值设置这2个参数之后，不能完全算是禁止/关闭了DRM，而是从”事实上“关闭了DRM。

--可以设置更高的值。

_gc_affinity_limit=250

_gc_affinity_minimum=10485760

alter system set "_gc_affinity_limit"=250;

alter system set "_gc_affinity_minimum"=10485760;

甚至可以将以上2个参数值设置得更大。这2个参数是立即生效的，在所有的节点上设置这2个参数之后，系统不再进行DRM。

4.建议继续加强对于RAC集群、数据库运行情况、主机资源使用的监控。再次出现类似异常时可以手动收集HANG和Systemstate dump信息来辅助分析，以协助排查。