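Editor's note: In a 2-node RAC ADG environment on 11g R2 (11.2.0.3), node 1 (the redo apply node) crashed because of a hardware fault, and at the same time the applications connected to instance 2 were disconnected as well (both instances had been open).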

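Today, in a 2-node RAC ADG environment on 11g R2 (11.2.0.3), node 1 (the redo apply node) crashed because of a hardware fault, yet the applications on instance 2 were disconnected as well (both instances had been open). The ADG standby serves some read-only workloads. The database alert log on node 2 showed no sign of anyone manually closing the instance; it had switched to the MOUNT state on its own. Isn't RAC supposed to be highly available? Why should the death of one node drag the other node down with it?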

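Checking instance 2 at this point shows that its state is actually "mount". I wonder how many people know that the database really does have an alter database close command. However, once a database has been closed manually within the lifetime of an instance, it cannot be opened again and the instance has to be restarted (PDBs excepted). And as mentioned above, the instance 2 alert log showed no sign of a manual close. An excerpt follows.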

Alert log of DG (standby) instance 2:

2020-09-03 00:10:39.601000 +08:00
ORA-01555 caused by SQL statement below (SQL ID: 4snkhx5vxrmv2, Query Duration=7340 sec, SCN: 0x0f46.c04d2402):
select....
2020-09-04 11:35:29.504000 +08:00
Archived Log entry 105312 added for thread 2 sequence 200520 ID 0x1fcb56a7 dest 1:
2020-09-08 14:16:17.954000 +08:00
Reconfiguration started (old inc 22, new inc 24)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
LMS 0: 23 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 2: 34 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 1: 19 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
2020-09-08 14:16:19.005000 +08:00
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
2020-09-08 14:16:22.014000 +08:00
ARC1: Becoming the active heartbeat ARCH
ARC1: Becoming the active heartbeat ARCH
2020-09-08 14:16:23.328000 +08:00
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
2020-09-08 14:16:24.368000 +08:00
Reconfiguration complete
Recovery session aborted due to instance crash
Close the database due to aborted recovery session
SMON: disabling tx recovery
2020-09-08 14:16:54.955000 +08:00
Stopping background process MMNL
Stopping background process MMON
2020-09-08 14:17:26.530000 +08:00
Background process MMON not dead after 30 seconds
Killing background process MMON
2020-09-08 14:18:04.907000 +08:00
Starting background process MMON
MMON started with pid=27, OS id=18743
Starting background process MMNL
MMNL started with pid=1865, OS id=18745
CLOSE: killing server sessions.
2020-09-08 14:18:07.003000 +08:00
Active process 3858 user 'grid' program 'oracle@anbob2'
Active process 14847 user 'grid' program 'oracle@anbob2'
Active process 3435 user 'grid' program 'oracle@anbob2'
Active process 25029 user 'grid' program 'oracle@anbob2'
Active process 9789 user 'grid' program 'oracle@anbob2'
Active process 23815 user 'grid' program 'oracle@anbob2'
...
Active process 24285 user 'itmuser' program 'oracle@anbob2 (TNS V1-V3)'
Active process 10045 user 'grid' program 'oracle@anbob2'
Active process 24229 user 'grid' program 'oracle@anbob2'
CLOSE: all sessions shutdown successfully.
SMON: disabling tx recovery
SMON: disabling cache recovery
2020-09-08 14:19:06.638000 +08:00
2020-09-08 14:26:56.608000 +08:00
alter database recover managed standby database using current logfile disconnect from session
Attempt to start background Managed Standby Recovery process (tbcsc2)
MRP0 started with pid=73, OS id=9746
MRP0: Background Managed Standby Recovery process started (tbcsc2)
2020-09-08 14:27:01.712000 +08:00
started logmerger process
Managed Standby Recovery starting Real Time Apply
2020-09-08 14:27:08.171000 +08:00
Parallel Media Recovery started with 32 slaves
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Recovery of Online Redo Log: Thread 2 Group 41 Seq 201120 Reading mem 0
Mem# 0: /dev/yyc_oravg04/ryyc_redo41
Media Recovery Log /yyc1_arch/arch_1_288357_920590168.arc
Completed: alter database recover managed standby database using current logfile disconnect from session
2020-09-08 14:27:29.221000 +08:00
-- Redo could not be applied because the archived logs from host 1 were missing, so we cancelled redo apply and aborted the database
alter database recover managed standby database cancel
2020-09-08 14:31:57.640000 +08:00
WARNING: inbound connection timed out (ORA-3136)
2020-09-08 14:33:30.958000 +08:00
Shutting down instance (abort)

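Only the line "Close the database due to aborted recovery session" gives a reason: the database was closed because the recovery session was terminated. This is in fact expected behavior for RAC ADG. I have to grumble here that the titles of Oracle MOS documents are written for Oracle engineers or specialists and are baffling to ordinary users. For example, the note about the alert log path changing in 12c is titled "12.1.0.2 Oracle Clusterware Diagnostic and Alert Log Moved to ADR (Doc ID 1915729.1)", and reading the contents of operating system files from inside the database is called "external tables". For this problem the title is not that far off, but it is still not easy to find. "Active Data Guard RAC Standby – Apply Instance Node Failure Impact (Doc ID 1357597.1)" gives a clear explanation.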

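In short, if the instance that is applying redo terminates abnormally, all other instances that are OPEN READ ONLY are closed. In a RAC ADG environment, when the apply instance crashes in the middle of applying redo, it leaves cache fusion locks behind on the surviving instances, which can cause queries to see inconsistent data. The database therefore has to be closed and reopened to bring the buffer cache and the datafiles back to a consistent state. If DG Broker is configured, this can be done automatically (for versions later than 11.2.0.2). Without DG Broker, simply open the database manually and then manually issue the redo apply command so that apply continues on a surviving node.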
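To tell this automatic close apart from a manual one in a long alert log, the two messages quoted in the excerpt above can be searched for mechanically. The following is a minimal Python sketch, not an Oracle tool: the function name `find_forced_close` and the sample text are illustrative assumptions, and the timestamp pattern is inferred only from the excerpt in this article.

```python
import re

# Timestamp lines as they appear in the alert log excerpt above,
# e.g. "2020-09-08 14:16:24.368000 +08:00" (pattern inferred from that excerpt).
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ [+-]\d{2}:\d{2}")

# Messages quoted from the excerpt that mark the automatic close.
MARKERS = (
    "Recovery session aborted due to instance crash",
    "Close the database due to aborted recovery session",
)


def find_forced_close(alert_log_text):
    """Return (timestamp, message) pairs for each marker found.

    The timestamp is the most recent timestamp line seen before the
    message, or None if no timestamp line preceded it.
    """
    hits = []
    last_ts = None
    for raw in alert_log_text.splitlines():
        line = raw.strip()
        m = TS_RE.match(line)
        if m:
            last_ts = m.group(0)
            # Keep any text that is fused onto the timestamp line.
            line = line[m.end():].strip()
        for marker in MARKERS:
            if marker in line:
                hits.append((last_ts, marker))
    return hits


# Sample taken from the excerpt above.
sample = """2020-09-08 14:16:24.368000 +08:00
Reconfiguration complete
Recovery session aborted due to instance crash
Close the database due to aborted recovery session
SMON: disabling tx recovery
"""

for ts, msg in find_forced_close(sample):
    print(ts, "|", msg)
```

If the function returns hits on a surviving standby instance, the move to MOUNT was triggered by the aborted recovery session rather than by someone issuing a close.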

The relevant explanation from the MOS note:

Symptoms
In an Active Data Guard RAC standby, if the redo apply instance crashes, all other instances of that standby that were open Read Only will be closed and returned to the MOUNT state. This disconnects all readers of the Active Data Guard standby.

Cause
In an Active Data Guard RAC standby, if the redo apply instance crashes in the middle of media recovery, it leaves the RAC cache fusion locks on the surviving instances and the data files on disk in an in-flux state. In such a situation, queries on the surviving instances can potentially see inconsistent data. To resolve this in-flux state, the entire standby database is closed. Upon opening the first instance after such a close, the buffer caches and datafiles are made consistent again.

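Version 12.1 introduced a new feature, "ADG instance recovery", which addresses the problem of the other instances being closed when the redo apply instance crashes. From 12.1 onward, a surviving ADG instance automatically performs ADG instance recovery to guarantee data consistency. This can be seen in the instance's alert log as messages such as "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery". The instances then stay in the open read only state, and applications on the ADG standby are no longer interrupted. If DG Broker is configured, it also automatically starts MRP on a surviving instance so that redo apply continues. This feature has been backported to 11.2.0.4, provided that a recent PSU containing the fixes for bugs 18331944 and 19516448 is installed and the hidden parameter "_adg_instance_recovery=TRUE" is set (the default is to close the surviving instances).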

But from 12.1, when the apply instance crashes in the middle of applying changes, one of the remaining open instances is automatically posted to do "ADG instance recovery". We can see this ADG instance recovery by checking the alert log for messages like "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery". If DG Broker is enabled, the Broker will start the MRP on any of the surviving instances.
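The two behaviors described in this article leave different traces in the alert log of a surviving instance, so a quick check can tell which one occurred. The sketch below is only an illustration under stated assumptions: the function name `classify_apply_crash` and the returned labels are made up for this example, the message strings are the ones quoted in this article, and real alert log lines may carry extra text around them, which is why plain substring matching is used. It looks at the whole text at once, so it should be fed only the portion of the log that follows the crash.

```python
# Message strings quoted in this article.
CLOSE_MSG = "Close the database due to aborted recovery session"
BEGIN_MSG = "Beginning ADG Instance Recovery"
DONE_MSG = "Completed ADG Instance Recovery"


def classify_apply_crash(alert_log_text):
    """Classify what a surviving standby instance did after the apply instance crashed.

    The labels are hypothetical names chosen for this sketch:
      closed_to_mount         the database was closed (behavior seen on 11.2.0.3 here)
      adg_recovery_completed  ADG instance recovery began and completed
      adg_recovery_started    ADG instance recovery began, no completion message yet
      no_marker_found         none of the quoted messages is present
    """
    if CLOSE_MSG in alert_log_text:
        return "closed_to_mount"
    if BEGIN_MSG in alert_log_text:
        if DONE_MSG in alert_log_text:
            return "adg_recovery_completed"
        return "adg_recovery_started"
    return "no_marker_found"


# Lines taken from the excerpt in this article.
old_behavior = (
    "Recovery session aborted due to instance crash\n"
    "Close the database due to aborted recovery session\n"
)
print(classify_apply_crash(old_behavior))
```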

Original article on 墨天轮: https://www.modb.pro/db/32021

