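Yesterday a colleague ran into an OCR access timeout on ASM in a 2-node cluster running AIX 7.1. Node2 could no longer access the OCR normally. The alert_asm.log on Node2 showed the following: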

Reference: ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1)

Thu Aug 21 17:24:06 2014
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.
Thu Aug 21 17:24:06 2014
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 0.2095165706 (GRID_0000) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 1.2095165707 (GRID_0001) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 2.2095165708 (GRID_0002) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 3.2095165709 (GRID_0003) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (24903780) initiating offline of disk 4.2095165710 (GRID_0004) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
GMON checking disk modes for group 3 at 10 for pid 35, osid 24903780
ERROR: no read quorum in group: required 3, found 0 disks
NOTE: checking PST for grp 3 done.

....

NOTE: initiating PST update: grp = 3, dsk = 0/0x7ce1b10a, mask = 0x6a, op = clear
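To see at a glance which disks and groups were affected, the warning lines can be tallied with a short script. This is a minimal sketch in Python that assumes the message format shown in the excerpt above. The function name `count_pst_waits` is hypothetical and not part of any Oracle tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
PST_WAIT = re.compile(
    r"WARNING: Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)\."
)


def count_pst_waits(lines):
    """Return a Counter keyed by (group, disk) of PST write-wait warnings."""
    counts = Counter()
    for line in lines:
        match = PST_WAIT.search(line)
        if match:
            _secs, disk, group = match.groups()
            counts[(int(group), int(disk))] += 1
    return counts


# A few lines copied from the alert log excerpt above.
sample = [
    "WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.",
    "WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.",
    "WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.",
    "WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.",
    "NOTE: checking PST: grp = 3",
]

for (group, disk), n in sorted(count_pst_waits(sample).items()):
    print(f"group {group} disk {disk}: {n} warning(s)")
```

In practice the `lines` argument would be the open alert log file; here a small list keeps the example self-contained.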

crs.log

2014-08-21 20:36:04.495: [  OCRRAW][9264]proprior: Retrying buffer read from another mirror for disk group [+GRID] for block at offset [6909952]
2014-08-21 20:36:04.495: [  OCRASM][9264]proprasmres: Total 0 mirrors detected
2014-08-21 20:36:04.495: [  OCRASM][9264]proprasmres: Only 1 mirror found in this disk group.
2014-08-21 20:36:04.495: [  OCRASM][9264]proprasmres: Need to invoke checkdg. Mirror #0 has an invalid buffer.
2014-08-21 20:36:04.595: [  OCRASM][9264]proprasmres: kgfoControl returned error [8]
[  OCRASM][9264]SLOS : SLOS: cat=8, opn=kgfoCkDG01, dep=15032, loc=kgfokge

2014-08-21 20:36:04.595: [  OCRASM][9264]ASM Error Stack : ORA-27091: unable to queue I/O
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-06512: at line 4

2014-08-21 20:36:04.595: [  OCRRAW][9264]proprior: ASM re silver returned [22]
2014-08-21 20:36:04.597: [  OCRRAW][9264]fkce:2: problem [22] reading the tnode 6909952
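When the crs.log is long, pulling out the ORA- codes makes the error stack easier to spot. A minimal sketch in Python, assuming only that the codes appear as `ORA-` followed by five digits, as they do in the excerpt above; the helper name `ora_codes` is hypothetical.

```python
import re

# ORA- error codes as they appear in the log excerpt: "ORA-" plus five digits.
ORA_CODE = re.compile(r"ORA-\d{5}")


def ora_codes(text):
    """Return the distinct ORA- codes in order of first appearance."""
    seen = []
    for code in ORA_CODE.findall(text):
        if code not in seen:
            seen.append(code)
    return seen


# Lines copied from the crs.log excerpt above.
excerpt = """\
2014-08-21 20:36:04.595: [  OCRASM][9264]ASM Error Stack : ORA-27091: unable to queue I/O
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-06512: at line 4
"""

print(ora_codes(excerpt))  # prints ['ORA-27091', 'ORA-15078', 'ORA-06512']
```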

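From the two logs we can see that access to the diskgroup holding the OCR timed out. The MOS note attributes this to a mismatch between the AIX rw_timeout setting and ASM's own I/O timeout for requests to storage: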

Symptoms
Normal or high redundancy diskgroup is dismounted with these WARNING messages.

//ASM alert.log

Mon Jul 01 09:10:47 2013
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
....
GMON dismounting group 6 at 72 for pid 44, osid 8782162

Cause
Generally, these messages appear in the ASM alert log in the following situation:
Delayed ASM PST heartbeats on ASM disks in a normal or high redundancy diskgroup cause the ASM instance to dismount the diskgroup. By default, the wait is 15 seconds.
Heartbeat delays are largely ignored for external redundancy diskgroups.
The ASM instance stops issuing further PST heartbeats until PST revalidation succeeds,
but the heartbeat delays do not directly dismount an external redundancy diskgroup.

The ASM disk could go into unresponsiveness, normally in the following scenarios:

+    Some of the paths of the physical paths of the multipath device are offline or lost
+    During path 'failover' in a multipath set up
+    Server load, or any sort of storage/multipath/OS maintenance
Doc ID 10109915.8 describes Bug 10109915 (the fix for this bug introduced the underscore parameter). The issue there is the lack of an OS/storage tunable timeout mechanism in the case of a hung NFS server/filer; _asm_hbeatiowait helps by setting the timeout.

Solution
1]    Check with OS and Storage admin that there is disk unresponsiveness.

2]    Possibly keep the disk responsiveness to below 15 seconds.

This will depend on various factors like
+    Operating System
+    Presence of Multipath ( and Multipath Type )
+    Any kernel parameter

So you need to find out, what is the 'maximum' possible disk unresponsiveness for your set up.
For example, on AIX  rw_timeout  setting affects this and defaults to 30 seconds.
Another example is Linux with native multipathing. In such set up, number of physical paths and  polling_interval value in multipath.conf file, will dictate this maximum disk unresponsiveness.
So for your set up ( combination of OS / multipath / storage ), you need to find out this.

3]    If you cannot keep the disk unresponsiveness below 15 seconds, then the parameter below can be set in the ASM instance (on all nodes of the RAC):

_asm_hbeatiowait
Set it to 200.

Run below in asm instance to set desired value for _asm_hbeatiowait

alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';

Two points from the note matter for this case.

First, on AIX the timeout for I/O requests to storage (rw_timeout) defaults to 30 seconds.

Second, ASM waits 15 seconds by default for PST (Partner and Status Table) heartbeats. If a heartbeat on the ASM disks of a normal or high redundancy diskgroup is delayed beyond 15 seconds, the ASM instance dismounts the diskgroup, which is what the crs log shows. A diskgroup configured with external redundancy is not dismounted directly by these delays; normal and high redundancy diskgroups are, so unfortunately we were hit by exactly this.

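In the end we fixed it by setting _asm_hbeatiowait on all nodes to a value greater than 30 seconds and restarting the services on every node so the change took effect.

alter system set "_asm_hbeatiowait"=<value greater than 30> scope=spfile sid='*';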
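The reasoning behind the chosen value can be written down as a simple check: the ASM heartbeat wait has to be longer than the longest disk stall the OS layer can produce. A minimal sketch in Python, using the two defaults quoted from the MOS note; the constant and function names are hypothetical, and the real maximum stall for a given OS / multipath / storage combination has to be determined separately.

```python
# Defaults quoted in the MOS note (seconds).
ASM_DEFAULT_HBEAT_WAIT = 15  # default PST heartbeat wait in ASM
AIX_DEFAULT_RW_TIMEOUT = 30  # default rw_timeout on AIX


def hbeat_wait_covers(hbeat_wait_secs, max_disk_unresponsive_secs):
    """True if ASM waits longer than the longest expected disk stall."""
    return hbeat_wait_secs > max_disk_unresponsive_secs


# With both defaults, ASM gives up before AIX has timed out the I/O.
print(hbeat_wait_covers(ASM_DEFAULT_HBEAT_WAIT, AIX_DEFAULT_RW_TIMEOUT))

# The value suggested in the note (200) is well above the AIX default.
print(hbeat_wait_covers(200, AIX_DEFAULT_RW_TIMEOUT))
```

The comparison is strict on purpose: a heartbeat wait equal to rw_timeout would still leave ASM and the OS timing out at the same moment.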
