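1 Problem:
At about 08:00 on 2017-07-14 an SMS alert arrived: "Free space in /u01, the installation directory of database db1, is below 5%". What caused it?
2 Analysis:
2.1 Description: at about 08:00 on 2017-07-14, free space in /u01, the installation directory of database db1, dropped below 5%.
2.2 Data collection:
  • Logged in over ssh to the server hosting db1 and ran the commands below to see which log directory used the most space, then drilled down one directory level at a time: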
cd /u01
du -sk *
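  • Result: the trace directory under the ASM log directory used the most space.
  • Inside the trace directory there was a single 10 GB trace file whose modification time was the current time.
  • Besides this trace file, a file named alert_+ASM1.log also had a modification time equal to the current time.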
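The directory drill-down described above is easier to follow when the du output is sorted by size. A minimal sketch, assuming a POSIX shell with du, sort and head; the function name largest_entries is made up for illustration and is not part of any Oracle tooling:

```shell
# largest_entries DIR [N]: print the N largest entries directly under DIR,
# sizes in KB, largest first. Hypothetical helper for illustration only.
largest_entries() {
    dir=${1:-.}
    count=${2:-10}
    # du -s: one total per entry; -k: sizes in kilobytes.
    # Errors (e.g. unreadable subdirectories) are discarded here.
    ( cd "$dir" && du -sk * 2>/dev/null | sort -rn | head -n "$count" )
}
```

Calling largest_entries /u01, then repeating it on the top entry of each result, reproduces the level-by-level search.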
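2.3 Analysis
  • The 10 GB trace file is most likely the culprit behind this alert.
  • Did this 10 GB file accumulate slowly, or did a single event make it grow to 10 GB quickly?
  • Also worth noting: the modification times of the trace file and the alert file keep changing.
2.4 Conclusion
  • Removing the 10 GB trace file will almost certainly clear the alert, but it will not necessarily reveal the deeper cause behind the alert.
  • Follow-up action: find out whether the trace file grew quickly or accumulated slowly.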
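One way to answer the "fast or slow" question is to sample the file size twice and compare. A minimal sketch, assuming a POSIX shell where the fifth column of ls -ln is the size in bytes; both function names are hypothetical:

```shell
# file_size_bytes FILE: size in bytes, read from the fifth column of `ls -ln`.
# This avoids reading the file itself, which matters for a 10 GB trace.
file_size_bytes() {
    ls -ln "$1" | awk '{print $5}'
}

# growth_bytes FILE SECONDS: bytes added to FILE during SECONDS seconds.
growth_bytes() {
    before=$(file_size_bytes "$1")
    sleep "$2"
    after=$(file_size_bytes "$1")
    echo $((after - before))
}
```

For example, growth_bytes on the trace file with an interval of 60 gives the bytes written in one minute. A steady small number points to slow accumulation, while a large one points to an event that is still writing heavily.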
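3. Second analysis:
3.1 Description: what produced the 10 GB trace file?
3.2 Data collection:
  • Tried to open the 10 GB trace file to inspect its contents; the file was too large and could not be opened.
  • Opened the alert file and found the following messages: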
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
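  • Checked when these messages first appeared, how long they had lasted, and how often they occurred:
They have appeared continuously since the HP storage failure on 2017-05-16.
Lasting until: 2017-07-14
Frequency: one message every 20 seconds.
  • Checked the state of the CRSDG ASM disks with select * from v$asm_disk; two disks showed mount_status IGNORED.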
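The log checks above can be sketched with standard tools. This is an illustration under assumptions, not the commands actually used during the incident: it assumes a POSIX shell with grep and tail, and all three function names are hypothetical.

```shell
# count_failed_relocations LOG DISKGROUP: number of lines in LOG that report
# a failed voting file relocation for DISKGROUP.
count_failed_relocations() {
    # -F: fixed-string match, -c: print the number of matching lines
    grep -F -c "Failed voting file relocation on diskgroup $2" "$1"
}

# expected_messages DAYS INTERVAL_SECONDS: how many messages a constant
# interval would produce over DAYS full days.
expected_messages() {
    echo $(( $1 * 86400 / $2 ))
}

# last_lines FILE [N]: print the end of a file that is too large for an editor.
last_lines() {
    tail -n "${2:-50}" "$1"
}
```

From 2017-05-16 to 2017-07-14 is 59 days, so if the 20-second interval held the whole time, expected_messages 59 20 gives 254880 failed attempts. Comparing that with the count from count_failed_relocations on alert_+ASM1.log is a quick sanity check on the observed frequency.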
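3.3 Analysis
  • When the HP storage failed on 2017-05-16, the number of usable disks in CRSDG dropped from three to one. Even after the storage fault was repaired, Oracle ASM could not re-recognize the disks automatically, so relocation of the cluster voting file kept failing.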
  • IGNORED - Disk is present in the system but is ignored by Oracle ASM because of one of the following:
    o The disk is detected by the system library but is ignored because an Oracle ASM library discovered the same disk
    o Oracle ASM has determined that the membership claimed by the disk header is no longer valid
  • For further confirmation, searched support.oracle.com. The article "V$ASM_DISK View Shows Some Disk Header Status as IGNORED and Group Number as "0". (Doc ID 1299866.1)" explains this in detail. Excerpt:
This is not a normal situation but can happen in the following scenario.
1] If there are more than 2 disks dropped forcefully from the same diskgroup created with normal redundancy while the diskgroup is still mounted, these disks still show "MEMBER" and group number as "0". If a new disk is added to the diskgroup or existing disks are added in different order, there is a possibility that the same disk number and disk name can be assigned to the ones that was assigned to one of dropped disks which has not been added yet.
This situation can happen either when one of SAN failgroup crash or when one of cell node from Exadata crash.
2] In RAC environment, ASM disk added to a existing diskgroup should be seen from all nodes. If one of nodes doesn't see the disk being added, the operation will fail with "ORA-15075 disk(s) are not visible cluster-wide" but the disk being added keeps the new disk number and disk name showing "MEMBER" status. After this, a new disk is added to the RAC node successfully, chances are the new disk gets the same disk number and diskname with the one that was failed in the first place.
The following similar error occurs when these disks showing IGNORED status are attempted to be added an existing diskgroup.
SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' ;
alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15033: disk '/dev/oracleasm/disks/ASM1' belongs to diskgroup "DATA"
3] Wrong multipath configuration where multiple single device path are visbile in asm_diskstring.
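  • So what has actually happened to the CRSDG disks that went offline?
4. Third analysis
4.1 Description: have the offline CRSDG disks suffered a hardware failure?
4.2 Data collection: as root, ran ioscan -fnkCdisk to check the state of the disk paths; some disk paths were in state NO_HW.
4.3 Analysis:
NO_HW: the hardware at this address is no longer responding.
In ioscan output, NO_HW means the device was responding when the machine booted but is not responding now.
NO_HW can come from a faulty device, or from a device that has been moved or removed.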
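With many LUNs the NO_HW entries are easy to miss by eye, so filtering the ioscan output helps. A minimal sketch, assuming the usual ioscan -f column order (Class, I, H/W Path, Driver, S/W State, H/W Type, Description), which puts the state in the fifth field; the function name no_hw_paths is hypothetical:

```shell
# no_hw_paths: read `ioscan -f` style output on stdin and print the hardware
# path of every disk whose S/W state is NO_HW.
# Assumes field 1 is the class, field 3 the H/W path, field 5 the S/W state.
no_hw_paths() {
    awk '$1 == "disk" && $5 == "NO_HW" { print $3 }'
}
```

Usage on the affected host would be ioscan -fnkCdisk | no_hw_paths, run as root.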
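4.4 Conclusion:
The IGNORED disks in CRSDG are already identified as invalid devices at the OS and storage layers.
5. Solution:
5.1 Based on Oracle's official documentation, the following plan was drawn up:
  • Repair the broken paths so that the OS recognizes the devices again.
  • Run alter diskgroup crsdg add disk 'path' force;
5.2 However, Oracle's official procedure had not been researched thoroughly when the incident was handled, so a different route was taken. After confirming with the storage administrator, the host was rebooted to repair the paths and the DB instance was then restarted, but the disks were still not recognized automatically. Two new disks were therefore added immediately and assigned to CRSDG, after which the alert-log error "Failed voting file relocation on diskgroup CRSDG" stopped. The ownership and group of the IGNORED disks were then changed back to their original values (for example root sys), and the storage administrator reclaimed the disks.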
6. Reference
V$ASM_DISK View Shows Some Disk Header Status as IGNORED and Group Number as "0". (Doc ID 1299866.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 11.2.0.2 [Release 10.2 to 11.2]
Information in this document applies to any platform.
SYMPTOMS
1] V$ASM_DISK view shows disk header status as IGNORED and group number as "0".
2] Background ASM trace file corresponding sqlplus shows the following similar message.
Ignoring dsk because it is a duplicatekfdsk:0xb7b91620 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
disk: num: 21/47626513311654 grp: 0/47622597378048 compat: 10.1.0.0.0 dbcompat:10.1.0.0.0
fg: path: /dev/oracleasm/disks/ASM1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
mnt: O hdr: M mode: v v(--) p(r-) a(-) d(-) sta: N flg: 1001
slot 65535 ddeslot 65535 numslots 65535 dtype 0 enc 0 part 0 flags 0
kfts: 2013/03/02 10:11:53.307000
kfts: 2013/11/15 01:18:09.263000
pcnt: 0 ()
kfkid: 0xb7bbf940, kfknm: , status: IDENTIFIED
fob: (KSFD)ba77c470, magic: bebe ausize: 0
kfdds: dn=21 inc=3915933606 dsk=0xb7b91620 usrp=0x2b50f0992118
kfkds 0x2b50f08b07f8, kfkid 0xb7bbf940, magic abbe, libnum 0, bpau 0, fob 0xba77dec0
Ignoring dsk because it is a duplicatekfdsk:0xb7b919a0
CAUSE
This is not a normal situation but can happen in the following scenario.
1] If there are more than 2 disks dropped forcefully from the same diskgroup created with normal redundancy while the diskgroup is still mounted, these disks still show "MEMBER" and group number as "0". If a new disk is added to the diskgroup or existing disks are added in different order, there is a possibility that the same disk number and disk name can be assigned to the ones that was assigned to one of dropped disks which has not been added yet.
This situation can happen either when one of SAN failgroup crash or when one of cell node from Exadata crash.
2] In RAC environment, ASM disk added to a existing diskgroup should be seen from all nodes. If one of nodes doesn't see the disk being added, the operation will fail with "ORA-15075 disk(s) are not visible cluster-wide" but the disk being added keeps the new disk number and disk name showing "MEMBER" status. After this, a new disk is added to the RAC node successfully, chances are the new disk gets the same disk number and diskname with the one that was failed in the first place.
The following similar error occurs when these disks showing IGNORED status are attempted to be added an existing diskgroup.
SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' ;
alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15033: disk '/dev/oracleasm/disks/ASM1' belongs to diskgroup "DATA"
3] Wrong multipath configuration where multiple single device path are visbile in asm_diskstring.
SOLUTION
IGNORED status can be seen for ASM disks which are not part of any currently mounted diskgroup but it shows the same disk header information with one of disks that are currently mounted such as disk number, disk name and diskgroup and the status as "MEMBER" and group number as "0".
Check whether there are at least 2 ASM disks showing the same disk header information using kfed, as described in the Appendix.
1] When the diskgroup from which the disk has been dropped is currently mounted, try to add the disk showing "IGNORED" status using force option to the original diskgroup from which the disk has been dropped. Please see Document:946213.1 - for details how to add a disk back to the original diskgroup.
Example)
SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' force ;
Diskgroup altered
2] When the diskgroup from which the disk has been dropped is not currently mounted, v$asm_disk shows the correct disk as MEMBER by comparing creation time from disk directory. If a disk showing the same disk number has a different creation time from disk directory, the status is seen as IGNORED.
example) /dev/oracleasm/disks/ASM1 and /dev/oracleasm/disks/ASM2 disks below show the same disk header information from kfed output.
SQL> @diskinfo
 G_N  D_N NAME        FAILGROUP  M_STATU H_STATUS   MO_STAT STATE    PATH                      M_DATE
---- ---- ----------- ---------- ------- ---------- ------- -------- ------------------------- --------------------
   0    0                        CLOSED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM2 2011/02/25 21:07:49
   0    1                        IGNORED MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM1 2011/02/25 18:08:43
   0    2                        CLOSED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM4 2011/02/25 20:59:01
   0    3                        CLOSED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM3 2011/02/25 21:07:49
   1    0 PLAY_0000   PLAY_0000  CACHED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM5 2011/02/20 20:52:31
   3    0 KYLE_0000   KYLE_0000  CACHED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM6 2011/02/20 20:52:44
2-1. ASM discards the disk showing IGNORED status automatically by comparing creation time from ASM metadata called disk directory when mounting DATA diskgroup.
SQL> alter diskgroup data mount ;
Diskgroup altered.
Note that /dev/oracleasm/disks/ASM1 has been excluded from DATA diskgroup.
SQL> @diskinfo
 G_N  D_N NAME        FAILGROUP  M_STATU H_STATUS   MO_STAT STATE    PATH                      M_DATE
---- ---- ----------- ---------- ------- ---------- ------- -------- ------------------------- --------------------
   0    1                        IGNORED MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM1 2011/02/25 18:08:43 <<< Here, ASM1 has been excluded from DATA diskgroup
   1    0 PLAY_0000   PLAY_0000  CACHED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM5 2011/02/20 20:52:31
   2    0 DATA_0000   F1         CACHED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM2 2011/02/25 21:09:47
   2    1 DATA_0001   F2         CACHED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM4 2011/02/25 21:09:47
   2    2 DATA_0002   F2         CACHED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM3 2011/02/25 21:09:47
   3    0 KYLE_0000   KYLE_0000  CACHED  MEMBER     ONLINE  NORMAL   /dev/oracleasm/disks/ASM6 2011/02/20 20:52:44
2-2. And then the disk showing IGNORED can be added using force option described in step 1.
Appendix
1. How to check whether there are duplication disks showing the same disk header information.
o Check whether kfed executable exist in your <ASM_HOME/bin>
o Rebuild kfed if kfed executable doesn't exist in $GRID_HOME/bin using the following way.
$cd $ORACLE_HOME/rdbms/lib
$make -f ins_rdbms.mk ikfed
o Take a kfed output for the disks. If the following information of the header is the same, the 2 disks are considered to be the same.
Note mntstmp.hi and mntstmp.lo (disk mount time) can be different to be the same disk in this exercise.
ex) $ kfed read /dev/oracleasm/disks/ASM1 | egrep 'dsknum|dskname|grpname|fgname|hdrsts|mntstmp.hi|mntstmp.lo'
2. diskinfo.sql
set linesize 200
col g_name format a10
col g_n format 99
col d_n format 999
col m_status format a7
col mo_status format a7
col h_status format a11
col name format a30
col path format a45
col failgroup format a15
select g.group_number g_n,
g.disk_number d_n,
g.name name,
g.failgroup,
g.mount_status m_status,
g.header_status h_status,
g.mode_status mo_status,
g.path ,
to_char(g.mount_date, 'YYYY/MM/DD HH24:MI:SS') m_date
from v$asm_disk g
order by g_n, d_n
/