1. Overview

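In an earlier article:

Some Concepts and Principles of RAC

http://blog.csdn.net/tianlesoftware/article/details/5331067

I mentioned that OCSSD is the most critical Clusterware process: if it fails, the system reboots. OCSSD provides the CSS (Cluster Synchronization Service) service. CSS monitors the cluster state in real time through several heartbeat mechanisms and provides basic cluster services such as split-brain protection.

CSS has two heartbeat mechanisms: the Network Heartbeat over the private network, and the Disk Heartbeat through the Voting Disk.

Each heartbeat has a maximum allowed delay. For the Disk Heartbeat this delay is called IOT (I/O Timeout); for the Network Heartbeat it is called MC (Misscount). Both parameters are measured in seconds, and by default IOT is greater than MC. By default Oracle determines both values automatically, and changing them is not recommended.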

The current values can be checked with the following commands:

$ crsctl get css disktimeout

$ crsctl get css misscount

For example:

[oracle@rac1 ~]$ crsctl get css disktimeout

200

[oracle@rac1 ~]$ crsctl get css misscount

60

These are the default values of the two parameters.

2. Related articles on MOS

How to start/stop the 10g CRS ClusterWare [ID 309542.1]

10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout [ID 284752.1]

CSS Timeout Computation in Oracle Clusterware [ID 294430.1]

RAC Assurance Support Team: RAC and Oracle Clusterware Starter Kit and Best Practices (Generic) [ID 810394.1]

2.1 Steps to modify CSS misscount:

1)Shut down CRS on all but one node. For exact steps use Note 309542.1

2)Execute crsctl as root to modify the misscount:

$ORA_CRS_HOME/bin/crsctl set css misscount <n>

where <n> is the maximum i/o latency to the voting disk +1 second

3)Reboot the node where adjustment was made

4)Start all other nodes shutdown in step 1

With Patch:4896338 for 10.2.0.1 there are two additional settings that can be tuned. This change is incorporated into the 10.2.0.2 and 10.1.0.6 patchsets.

The following are only relevant on 10.2.0.1 with Patch:4896338. In addition to misscount, CSS now has two more parameters:

1) reboottime (default 3 seconds) - the amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted (i.e. how long it takes for the machine to completely shut down when you do a reboot).

2) disktimeout (default 200 seconds) - the maximum amount of time allowed for a voting file I/O to complete; if this time is exceeded the voting disk will be marked as offline. Note that this is also the amount of time that will be required for initial cluster formation, i.e. when no nodes have previously been up and in a cluster.

$CRS_HOME/bin/crsctl set css reboottime <r> [-force]  (<r> is seconds)

$CRS_HOME/bin/crsctl set css disktimeout <d> [-force]  (<d> is seconds)

Confirm the new css misscount setting via ocrdump.

2.2 CSS Timeout Computation in Oracle Clusterware

2.2.1 MISSCOUNT DEFINITION AND DEFAULT VALUES

The CSS misscount parameter represents the maximum time, in seconds, that a network heartbeat can be missed before entering into a cluster reconfiguration to evict the node. The following are the default values of the misscount parameter, in seconds, for each version when using Oracle Clusterware*:

OS         10g (R1 & R2)   11g
Linux      60              30
Unix       30              30
VMS        30              30
Windows    30              30

* The CSS misscount default value when using vendor (non-Oracle) clusterware is 600 seconds. This is to allow the vendor clusterware ample time to resolve any possible split brain scenarios.

On AIX platforms with HACMP, starting with 10.2.0.3 BP#1, the misscount is 30. This is documented in Note 551658.1.
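The defaults above can be captured as a small lookup, which is handy when checking whether a cluster is still running with stock settings. This is only a sketch in Python: the function name, the "10g"/"11g" keys, and the `vendor_clusterware` flag are illustrative choices rather than anything provided by Oracle, and the AIX/HACMP exception is deliberately not modeled.

```python
# Default CSS misscount values in seconds, transcribed from the table above.
# The key names ("10g", "11g") are labels chosen for this sketch.
DEFAULT_MISSCOUNT = {
    "Linux":   {"10g": 60, "11g": 30},
    "Unix":    {"10g": 30, "11g": 30},
    "VMS":     {"10g": 30, "11g": 30},
    "Windows": {"10g": 30, "11g": 30},
}

# Default when vendor (non-Oracle) clusterware is in use, per the footnote.
VENDOR_CLUSTERWARE_MISSCOUNT = 600


def default_misscount(os_name, release, vendor_clusterware=False):
    """Return the documented default misscount in seconds.

    Assumption: the AIX + HACMP exception (30 seconds from 10.2.0.3 BP#1)
    is not handled here.
    """
    if vendor_clusterware:
        return VENDOR_CLUSTERWARE_MISSCOUNT
    return DEFAULT_MISSCOUNT[os_name][release]


if __name__ == "__main__":
    print(default_misscount("Linux", "10g"))
    print(default_misscount("Linux", "11g"))
    print(default_misscount("Unix", "10g", vendor_clusterware=True))
```

Comparing this value with the output of `crsctl get css misscount` shows whether someone has changed the setting from its default.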

2.2.2 CSS HEARTBEAT MECHANISMS AND THEIR INTERRELATIONSHIP

The synchronization services component (CSS) of the Oracle Clusterware maintains two heartbeat mechanisms:

1.) the disk heartbeat to the voting device and

2.) the network heartbeat across the interconnect

which establish and confirm valid node membership in the cluster.

Both of these heartbeat mechanisms have an associated timeout value. The disk heartbeat has an internal I/O timeout interval (DTO, Disk TimeOut), in seconds, within which an I/O to the voting disk must complete. The misscount parameter (MC), as stated above, is the maximum time, in seconds, that a network heartbeat can be missed. The disk heartbeat I/O timeout interval is directly related to the misscount parameter setting. There has been some variation in this relationship between versions, as described below:

Version                                       Disk I/O timeout
9.x.x.x                                       NOTE: MISSCOUNT WAS A DIFFERENT ENTITY IN THIS RELEASE
10.1.0.2                                      No one should be on this version
10.1.0.3                                      DTO = MC - 15 seconds
10.1.0.4                                      DTO = MC - 15 seconds
10.1.0.4 + unpublished Bug 3306964            DTO = MC - 3 seconds
10.1.0.4 with CRS II Merge patch              DTO = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration
10.1.0.5                                      IOT = MC - 3 seconds
10.2.0.1 + fix for unpublished Bug 4896338    IOT = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration
10.2.0.2                                      Same as above (10.2.0.1 with patch for Bug:4896338)
10.1 - 11.1                                   During node join and leave (reconfiguration) in a cluster, the Short Disk TimeOut (SDTO) is used, which in all versions is SDTO = MC - reboottime (reboottime is usually 3 seconds)

Misscount drives cluster membership reconfigurations and directly affects the availability of the cluster. In most cases, the default settings for MC should be acceptable. Modifying the default value of misscount not only influences the timeout interval for the I/O to the voting disk, but also influences the tolerance for missed network heartbeats across the interconnect.
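The version-dependent relationships above can be sketched as a small Python function. This is an illustration, not Oracle's implementation: the version labels are strings invented for this example, and for patched releases the sketch uses the SDTO formula (MC - reboottime) during reconfiguration, which is one reading of the "Misscount seconds" wording in the table.

```python
# Seconds subtracted from misscount on releases that have no separate
# disktimeout parameter. The keys are labels invented for this sketch.
PRE_PATCH_OFFSET = {
    "10.1.0.3": 15,
    "10.1.0.4": 15,
    "10.1.0.4+3306964": 3,   # 10.1.0.4 plus unpublished Bug 3306964
    "10.1.0.5": 3,
}

# Releases where disktimeout applies in steady state (labels are hypothetical).
PATCHED_VERSIONS = {"10.1.0.4+merge", "10.2.0.1+4896338", "10.2.0.2"}


def short_disk_timeout(misscount, reboottime=3):
    """SDTO = MC - reboottime, used during node join/leave."""
    return misscount - reboottime


def voting_disk_timeout(version, misscount, disktimeout=200,
                        reboottime=3, reconfiguring=False):
    """Return the voting-disk I/O timeout in seconds for a version label."""
    if version in PRE_PATCH_OFFSET:
        return misscount - PRE_PATCH_OFFSET[version]
    if version in PATCHED_VERSIONS:
        if reconfiguring:
            # Assumption: the short timeout applies while reconfiguring.
            return short_disk_timeout(misscount, reboottime)
        return disktimeout
    raise ValueError("no documented relationship for version " + repr(version))


if __name__ == "__main__":
    print(voting_disk_timeout("10.1.0.3", 60))
    print(voting_disk_timeout("10.2.0.2", 60))
    print(voting_disk_timeout("10.2.0.2", 60, reconfiguring=True))
```

The point of the sketch is the asymmetry it makes visible: on the older releases a change to misscount moves the disk timeout with it, while on patched releases the steady-state disk timeout is independent of misscount.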

2.2.3 LONG LATENCIES TO THE VOTING DISKS

If I/O latencies to the voting disk are greater than the default DTO calculations noted above, the cluster may experience CSS node evictions depending on (a) the Oracle Clusterware (CRS) version, (b) whether the merge patch has been applied and (c) the state of the cluster. More details on this are covered in the section "Change in Behavior with CRS Merge PATCH (4896338 on 10.2.0.1)".

These latencies can be attributed to any number of problems in the I/O subsystem or problems with any component in the I/O path. The following is a non-exhaustive list of reported problems which resulted in CSS node eviction due to latencies to the voting disk longer than the default Oracle Clusterware I/O timeout value (DTO):

1. QLogic HBA cards with a Link Down Timeout greater than the default misscount.

2. Bad cables to the SAN/storage array that affect I/O latencies.

3. SAN switch (like Brocade) failover latency greater than the default misscount.

4. EMC Clariion Array, when trespassing the SP to the backup SP takes longer than the default misscount.

5. EMC PowerPath path error detection and I/O repost and redirect taking longer than the default misscount.

6. NetApp Cluster (CFO) failover latency greater than the default misscount.

7. Sustained high CPU load which affects the CSSD disk ping monitoring thread.

8. Poor SAN network configuration that creates latencies in the I/O path.

The most common problems relate to multi-path I/O software drivers, and the reconfiguration times resulting from a failure in the I/O path. Hardware and (re)configuration issues that introduce these latencies should be corrected. Incompatible failover times with underlying OS, network or storage hardware or software may be addressed given a complete understanding of the considerations listed below.

Misscount should NOT be modified to work around the above-mentioned issues. Oracle support recommends that you apply the latest patchset, which changes the CSS behaviour. More details are covered in the next section.

2.2.4 Change in Behavior with Bug:4896338 applied on top of 10.2.0.1

Starting with 10.2.0.1 + Bug:4896338, CSS will not evict the node from the cluster due to (DTO) I/O to the voting disk taking more than misscount seconds, unless it is during the initial cluster formation or slightly before reconfiguration.

So if we have N nodes in a cluster and one of the nodes takes more than misscount seconds to access the voting disk, the node will not be evicted as long as the access to the voting disk is completed within disktimeout seconds. Consequently, with this patch there is no need to increase the misscount at all.

Additionally, this merge patch introduces Disktimeout, which is the amount of time that a lack of disk ping to the voting disk(s) will be tolerated.

Note: applying the patch will not change your value for Misscount.

The table below explains the conditions under which the eviction will occur:

Network Ping                          Disk Ping                                                              Reboot
Completes within misscount seconds    Completes within misscount seconds                                     N
Completes within misscount seconds    Takes more than misscount seconds but less than disktimeout seconds    N
Completes within misscount seconds    Takes more than disktimeout seconds                                    Y
Takes more than misscount seconds     Completes within misscount seconds                                     Y

* By default Misscount is less than Disktimeout seconds.
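The eviction table reduces to a two-condition check, sketched here in Python. The function and parameter names are made up for illustration. It covers only the steady-state case with the patch applied, not initial cluster formation or reconfiguration, and it assumes the 10g Linux defaults shown earlier.

```python
def node_reboots(network_ping_s, disk_ping_s, misscount=60, disktimeout=200):
    """Return True if the table above predicts an eviction/reboot.

    network_ping_s: seconds taken by the network heartbeat
    disk_ping_s:    seconds taken by the voting-disk heartbeat I/O
    Assumption: steady state on 10.2.0.1 + Bug 4896338 or later.
    """
    if network_ping_s > misscount:
        # A network heartbeat missed for longer than misscount always evicts.
        return True
    # The disk heartbeat only evicts once it exceeds disktimeout.
    return disk_ping_s > disktimeout


if __name__ == "__main__":
    # The four rows of the table, using sample timings in seconds.
    for net, disk in [(5, 5), (5, 120), (5, 250), (90, 5)]:
        print(net, disk, "Y" if node_reboots(net, disk) else "N")
```

The second row is the case the patch changes: a disk ping that takes 120 seconds is slower than misscount, yet the node survives because it is still within disktimeout.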

2.2.5 CONSIDERATIONS WHEN CHANGING MISSCOUNT FROM THE DEFAULT VALUE

1. Customers drive SLA and cluster availability. The customer ultimately defines Service Levels and availability for the cluster. Before recommending any change to misscount, the full impact of that change should be described and the impact to cluster availability measured.

2. Customers may have timeout and retry logic in their applications. The impact of delaying reconfiguration may cause 'artificial' timeouts of the application, reconnect failures and subsequent logon storms.

3. Misscount timeout values are version dependent and are subject to change. As we have seen, misscount calculations vary between releases and between versions within a release. Creating a false dependency on the misscount calculation in one version may not be appropriate for later versions.

4. Internal I/O timeout interval (DTO) algorithms may change in later releases. As stated above, there is a direct relationship between the internal I/O timeout interval and misscount. This relationship is subject to change in later releases.

5. An increase in misscount to compensate for I/O latencies directly affects reconfiguration times for network failures. The network heartbeat is the primary indicator of connectivity within the cluster. Misscount is the tolerance level of missed 'check ins' that trigger cluster reconfiguration. Increasing misscount will prolong the time to take corrective action in the event of network failure or other anomalies affecting the availability of a node in the cluster. This directly affects cluster availability.

6. Changing misscount to work around voting disk latencies must be reverted once the underlying disk latency is corrected. The customer needs to document the change and set the parameter back to the default when the underlying storage I/O latency is resolved.

7. Do not change default misscount values if you are running Vendor Clusterware along with Oracle Clusterware. The default values for misscount should not be changed when using vendor clusterware. Modifying misscount in this environment may cause clusterwide outages and potential corruptions.

8. Changing the misscount parameter incurs a clusterwide outage. As noted below, the customer will need to schedule a clusterwide outage to make this change.

9. Changing misscount should not be used to compensate for poor configurations or faulty hardware.

10. Cluster and RDBMS availability are directly affected by high misscount settings.

11. In the case of stretched clusters and stretched storage systems, a site failure in which we lose one storage system and N nodes puts the cluster into a reconfiguration state, and CSS then reverts to the ShortDiskTimeOut (SDTO) value as the internal I/O timeout for the voting disks. Several cases are known with stretched clusters where, when a site failure happens, the storage failover cannot complete within SDTO. If the I/O to the voting disks is blocked for more than SDTO, the result is node evictions on the surviving side.

To change MISSCOUNT back to the default, please refer to Note:284752.1.
THIS IS THE ONLY SUPPORTED METHOD. NOT FOLLOWING THIS METHOD RISKS EVICTIONS AND/OR CORRUPTING THE OCR.

10g Release 2 MIRRORED VOTING DISKS AND VENDOR MULTIPATHING SOLUTIONS

Oracle RAC 10g Release 2 allows for multiple voting disks so that the customer does not have to rely on a multipathing solution from a storage vendor. You can have n voting disks (up to 31), where n = m*2+1 and m is the number of disk failures you want to survive. Oracle recommends that each voting disk be on a separate physical disk.
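The n = m*2+1 rule is easy to turn into a pair of helpers. This is a minimal Python sketch with invented function names; the upper bound of 31 is simply the figure quoted in the text above.

```python
MAX_VOTING_DISKS = 31  # upper bound as quoted in the text above


def voting_disks_needed(failures_to_survive):
    """Return n = 2*m + 1 voting disks for m tolerated disk failures."""
    if failures_to_survive < 0:
        raise ValueError("failures_to_survive must be non-negative")
    n = 2 * failures_to_survive + 1
    if n > MAX_VOTING_DISKS:
        raise ValueError("would need more than %d voting disks" % MAX_VOTING_DISKS)
    return n


def failures_tolerated(voting_disks):
    """Inverse of the rule: how many disk failures n voting disks survive."""
    if voting_disks < 1:
        raise ValueError("at least one voting disk is required")
    return (voting_disks - 1) // 2


if __name__ == "__main__":
    for m in (0, 1, 2):
        print(m, "failure(s) ->", voting_disks_needed(m), "voting disks")
```

Note that an even count buys nothing under this rule: `failures_tolerated(4)` is the same as `failures_tolerated(3)`, which is why the formula always produces an odd number.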

-------------------------------------------------------------------------------------------------------

Blog:http://blog.csdn.net/tianlesoftware

Weibo: http://weibo.com/tianlesoftware

Email: dvd.dba@gmail.com

