2017-07-17 DBA日记，凭直觉发现CRSDG的磁盘问题及处理

1 问题:

2017-07-14 8：00左右，收到短信报警“数据库db1的安装目录/u01可用空间低于5%”，是什么原因导致的呢？

2 分析：

2.1 具体描述:2017-07-14 8：00左右, 数据库db1的安装目录/u01可用空间低于5%

2.2 收集数据：

ssh登录到db1所在服务器使用命令,显示那个日志目录占用的空间最多，并逐层目录查找:

cd /u01

du -sk *

查找结果，发现在asm日志目录.\trace空间占用最大
进入到trace目录，发现有一个trace文件大小为10G，文件改变时间为当前时间。
除了这个trace文件外，有一个alert_+ASM1.log的文件更新时间也是当前时间。

2.3 分析

大小为10g的trace文件《《〈〈〈〈〈〈应该是这次报警的元凶。
这个大小为10G的文件是慢慢累积而成，还是由于慢个事件迅速增长到10G。
还有一个值得在意的地方就是trace文件与alert文件的更新时间不断变化。

2.4 结论

基本可以断定清除10g的trace文件能解除报警，但未必能发现报警代表的更深层次原因。
采用后续行动，进一步分析trace文件的容量是迅速生成还是慢慢累积的？

3. 第二次分析：

3.1 具体描述：大小为10G的trace文件，是什么原因生成的？

3.2 收集数据：

打开大小为10g的trace文件，查看内容，由于文件太大，打开失败。
打开alert文件，查看内容，发现如下错误信息：

NOTE: Attempting voting file relocation on diskgroup CRSDG

NOTE: Failed voting file relocation on diskgroup CRSDG

NOTE: Attempting voting file relocation on diskgroup CRSDG

NOTE: Failed voting file relocation on diskgroup CRSDG

NOTE: Attempting voting file relocation on diskgroup CRSDG

NOTE: Failed voting file relocation on diskgroup CRSDG

NOTE: Attempting voting file relocation on diskgroup CRSDG

NOTE: Failed voting file relocation on diskgroup CRSDG

检查这些日志的最早生成时间及持续多长，频率有多大，信息如下：

发现自2017-05-16 HP存储出现故障后，该报警信息不断出现。

持续至：2017-07-14

频率：每20秒报一次警。

查看crsdg，asm disk状态：select * from v$asm_disk; 发现有两个磁盘的mount_status:IGNORED

3.3 分析

在2017-05-16， HP存储出现故障时，CRSDG的磁盘由3个可用，变成只得一个可用。即使用存储故障修复后，oracle asm也不能自动重新识别，导致集群的voting file重新放置失败。
IGNORED - Disk is present in the system but is ignored by Oracle ASM because

of one of the following:

The disk is detected by the system library but is ignored because an

Oracle ASM library discovered the same disk

Oracle ASM has determined that the membership claimed by the disk

header is no longer valid

进一步确认，查看support.oracle.com，其中文章“V$ASM_DISK View Shows Some Disk Header Status as IGNORED and Group Number as "0". (文档 ID 1299866.1)”给出很详细的说明，摘录如下：

This is not a normal situation but can happen in the following scenario.

1] If there are more than 2 disks dropped forcefully from the same diskgroup created with normal redundancy while the diskgroup is still mounted, these disks still show "MEMBER" and group number as "0". If a new disk is added to the diskgroup or existing disks are added in different order, there is a possibility that the same disk number and disk name can be assigned to the ones that was assigned to one of dropped disks which has not been added yet.

This situation can happen either when one of SAN failgroup crash or when one of cell node from Exadata crash.

2] In RAC environment, ASM disk added to a existing diskgroup should be seen from all nodes. If one of nodes doesn't see the disk being added, the operation will fail with "ORA-15075 disk(s) are not visible cluster-wide" but the disk being added keeps the new disk number and disk name showing "MEMBER" status. After this, a new disk is added to the RAC node successfully, chances are the new disk gets the same disk number and diskname with the one that was failed in the first place.

The following similar error occurs when these disks showing IGNORED status are attempted to be added an existing diskgroup.

SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' ;

alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1'

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15033: disk '/dev/oracleasm/disks/ASM1' belongs to diskgroup "DATA"

3] Wrong multipath configuration where multiple single device path are visbile in asm_diskstring.

那么，到底crsdg那些offline了磁盘现在发生了什么事呢？

4. 第三次分析

4.1 具体描述：crsdg那些offline的磁盘是否出现硬件故障。

4.2 收集数据：使用root用户，执行命令ioscan -fnkCdisk 检查磁盘的路径状态，发现有的磁盘路径状态是NO_HW

4.3 分析：

NO_HW 此地址的硬件不再响应

ioscan 中的 NO_HW 表示当计算机引导时，该设备正在响应，但是现在没有响应。

NO_HW 可能来自一个有问题的设备，或者一个已经移动删除的设备。

4.4 结论：

CRSDG中ignore的磁盘，在os层、存储层上已经被识别为无效设备。

5. 解决方案：

5.1 根据oracle的官方文档，制定如下的解决方案：

修复损坏的路径，让OS层重新识别
执行alter diskgroup crsdg add disk 'path' force;

5.2 但是由于当时处理该事件没有进行充分的调研oracle官方处理方法，在与存储管理员确认后，采用重启主机的方法，修复路径，然后重启DB实例，发现还是没有自动识别磁盘，于是采立即新增了两个磁盘，并划入到crsdg中，最后alert日志中的报错“Failed voting file relocation on diskgroup CRSDG”消除。并将ignore的磁盘权限及用户组变更回原始状态，如root sys ，然后由存储管理员回收。

6. 引用

V$ASM_DISK View Shows Some Disk Header Status as IGNORED and Group Number as "0". (文档 ID 1299866.1) 转到底部转到底部

In this Document

Symptoms

Cause

Solution

References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.2.0.1 to 11.2.0.2 [Release 10.2 to 11.2]

Information in this document applies to any platform.

SYMPTOMS

1] V$ASM_DISK view shows disk header status as IGNORED and group number as "0".

2] Background ASM trace file corresponding sqlplus shows the following similar message.

Ignoring dsk because it is a duplicatekfdsk:0xb7b91620 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

disk: num: 21/47626513311654 grp: 0/47622597378048 compat: 10.1.0.0.0 dbcompat:10.1.0.0.0

fg: path: /dev/oracleasm/disks/ASM1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

mnt: O hdr: M mode: v v(--) p(r-) a(-) d(-) sta: N flg: 1001

slot 65535 ddeslot 65535 numslots 65535 dtype 0 enc 0 part 0 flags 0

kfts: 2013/03/02 10:11:53.307000

kfts: 2013/11/15 01:18:09.263000

pcnt: 0 ()

kfkid: 0xb7bbf940, kfknm: , status: IDENTIFIED

fob: (KSFD)ba77c470, magic: bebe ausize: 0

kfdds: dn=21 inc=3915933606 dsk=0xb7b91620 usrp=0x2b50f0992118

kfkds 0x2b50f08b07f8, kfkid 0xb7bbf940, magic abbe, libnum 0, bpau 0, fob 0xba77dec0

Ignoring dsk because it is a duplicatekfdsk:0xb7b919a0

CAUSE

This is not a normal situation but can happen in the following scenario.

This situation can happen either when one of SAN failgroup crash or when one of cell node from Exadata crash.

The following similar error occurs when these disks showing IGNORED status are attempted to be added an existing diskgroup.

SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' ;

alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1'

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15033: disk '/dev/oracleasm/disks/ASM1' belongs to diskgroup "DATA"

3] Wrong multipath configuration where multiple single device path are visbile in asm_diskstring.

SOLUTION

IGNORED status can be seen for ASM disks which are not part of any currently mounted diskgroup but it shows the same disk header information with one of disks that are currently mounted such as disk number, disk name and diskgroup and the status as "MEMBER" and group number as "0".

Check whether there are at last 2 ASM disks showing the same disk header information using kfed described in Appendix

1] When the diskgroup from which the disk has been dropped is currently mounted, try to add the disk showing "IGNORED" status using force option to the original diskgroup from which the disk has been dropped. Please see Document:946213.1 - for details how to add a disk back to the original diskgroup.

Example)

SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' force ;

Diskgroup altered

2] When the diskgroup from which the disk has been dropped is not currrently mounted, v$asm_disk shows the correct disk as MEMBER by comparing creation time from disk directory . If a disk showing the same disk number and has a different creation time from disk directory, the status is seen as IGNORED.

example) /dev/oracleasm/disks/ASM1 and /dev/oacleasm/disks/ASM2 disks below show the same disk header information from kfed output.

SQL>@diskifno

G_N D_N NAME FAILGROUP M_STATU H_STATUS MO_STAT STATE PATH M_DATE

---- ---- ----------- ---------- ------- ---------- ------- -------- ------------------------- --------------------

0 0 CLOSED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM2 2011/02/25 21:07:49

0 1 IGNORED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM1 2011/02/25 18:08:43

0 2 CLOSED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM4 2011/02/25 20:59:01

0 3 CLOSED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM3 2011/02/25 21:07:49

1 0 PLAY_0000 PLAY_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM5 2011/02/20 20:52:31

3 0 KYLE_0000 KYLE_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM6 2011/02/20 20:52:44

2-1. ASM discards the disk showing IGNORED status automatically by comparing creation time from ASM metadata called disk directory when mounting DATA diskgroup.

SQL> alter diskgroup data mount ;

Diskgroup altered.

Note that /dev/oracleasm/disks/ASM1 has been excluded from DATA diskgroup.

SQL> @diskinfo

G_N D_N NAME FAILGROUP M_STATU H_STATUS MO_STAT STATE PATH M_DATE

---- ---- ----------- ---------- ------- ---------- ------- -------- ------------------------- --------------------

0 1 IGNORED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM1 2011/02/25 18:08:43 <<<<<<<<<<<<<<<<<<< Here, ASM1 has been excluded from DATA diskgroup

1 0 PLAY_0000 PLAY_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM5 2011/02/20 20:52:31

2 0 DATA_0000 F1 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM2 2011/02/25 21:09:47

2 1 DATA_0001 F2 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM4 2011/02/25 21:09:47

2 2 DATA_0002 F2 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM3 2011/02/25 21:09:47

3 0 KYLE_0000 KYLE_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM6 2011/02/20 20:52:44

2-2. And then the disk showing IGNORED can be added using force option described in step 1.

Appendix

1. How to check whether there are duplication disks showing the same disk header information.

o Check whether kfed executable exist in your <ASM_HOME/bin>

o Rebuild kfed if kfed executable doesn't exist in $GRID_HOME/bin using the following way.

$cd $ORACLE_HOME/rdbms/lib

$make -f ins_rdbms.mk ikfed

o Take a kfed output for the disks. If the following information of the header is the same, the 2 disks are considered to be the same.

Note mntstmp.hi and mntstmp.lo (disk mount time) can be different to be the same disk in this exercise.

2. diskinfo.sql

set linesize 200

col g_name format a10

col g_n format 99

col d_n format 999

col m_status format a7

col mo_status format a7

col h_status format a11

col name format a30

col path format a45

col failgroup format a15

select g.group_number g_n,

g.disk_number d_n,

g.name name,

g.failgroup,

g.mount_status m_status,

g.header_status h_status,

g.mode_status mo_status,

g.path ,

to_char(g.mount_date, 'YYYY/MM/DD HH24:MI:SS') m_date

from v$asm_disk g

order by g_n, d_n

2017-07-17 DBA日记，凭直觉发现CRSDG的磁盘问题及处理相关推荐

2017.07.17
今天上午主要熟悉了开发环境,主要事Git命令,明白了Git命令在公司开发环境搭架的一个整体过程,首先再次划分一下GitHub上,我们公司的开发分支.包括:master(版本发布分支),dev(开发分支 ...
DBA日记：上海的RAC宕机
--转自网络,很好的一个案例供大家细品! 今天晚上上海的雷总突然打电话过来,说有件事需要我们帮下忙.我问他是什么事,他说是一个客户的系统宕机的问题,最好能够尽快过来一下.我说没问题,明天一早就派工程师 ...
DBA日记上海的RAC宕机（转自网络）
今天晚上上海的雷总突然打电话过来,说有件事需要我们帮下忙.我问他是什么事,他说是一个客户的系统宕机的问题,最好能够尽快过来一下.我说没问题,明天一早就派工程师过去.老雷一听就急了,老白不是我不信任你的 ...
CCAI 2017 | 德国DFKI科技总监Hans Uszkoreit：如何用机器学习和知识图谱来实现商业智能化？原2017.07.25AI科技大本营文/CSDN大琦 7 月22 - 2
CCAI 2017 | 德国DFKI科技总监Hans Uszkoreit:如何用机器学习和知识图谱来实现商业智能化? 原2017.07.25AI科技大本营文/CSDN大琦 7 月22 - 23 日, ...
2017版MySQL DBA核心课程-第1-16部完整-老男孩-专题视频课程
2017版MySQL DBA核心课程-第1-16部完整-12443人已学习课程介绍 linux培训班部分内容,培训班学员无需购买 2017版老男孩MySQL数据库视频课程-大集合 2 ...
【第五组】头脑风暴+核心竞争力+NABCD+个人（用例+功能+技术说明书）最后修改时间 2017.07.13...
2017.07.13版因为对之前版本做了较多修改,所以重新发了,并且在博客下方保留原有版本作为记录. 头脑风暴结果: 刚开始我们无法确定要做一个什么样的应用程序,总结之前可视化课程的作业,我们提出了 ...
2017.8.17 开始了我的QT 学习。
Ctrl+H ........................................................................水平布局 Ctrl+L ......... ...
win10 如何配置 java jdk1.8环境变量（2017.8.17 ）jdk1.8.0_144
win10 如何配置 java jdk 环境变量 2017.8.17 本篇还适用于 windows server 2012. windows server 2014+ 一.安装下载 jdk 64位 ...
javaScript ie8 不支持 new Date(2017-07);只支持new Date(2017/07/01)
今天看以前代码的时候看到的JS的注释,记得当时调了老半天,ie8以上以及谷歌,火狐没有这个问题. //ie8 不支持 new Date("2017-07");只支持new Dat ...

2017-07-17 DBA日记，凭直觉发现CRSDG的磁盘问题及处理

2017-07-17 DBA日记，凭直觉发现CRSDG的磁盘问题及处理相关推荐

最新文章

热门文章