First, a few words about heartbeats in RAC. Oracle Clusterware uses two kinds of heartbeat:

1. Disk heartbeat (voting device) - IOT

2. Network heartbeat (across the interconnect) - misscount

The disk heartbeat here refers to the voting disk heartbeat. As we all know, the voting disk is the arbitration (quorum) disk:



The Voting Disk is used by the Oracle cluster manager in various layers.

The Node Monitor (NM) uses the Voting Disk for the Disk Heartbeat, which is

essential in the detection and resolution of cluster "split brain".

NM monitors the Voting Disk for other competing sub-clusters and uses it for the

eviction phase. Hence the availability of the Voting Disk is critical for the

operation of the Oracle Cluster Manager.

The shared volumes created for the OCR and the voting disk should be configured

using RAID to protect against media failure. This requires the use of an external

cluster volume manager, cluster file system, or storage hardware that provides

RAID protection.

Disk heartbeat:

Each node writes a disk heartbeat to each voting disk once per second. Each node

reads its kill block once per second; if the kill block is overwritten, the node commits

suicide. During reconfig (join or leave), CSSD monitors all nodes and determines whether

a node has a disk heartbeat, including those with no network heartbeat. If there is no disk

heartbeat within the I/O timeout (MissCount during cluster reconfiguration), the node is declared dead.
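
The dead-node rule can be sketched as a small check. This is only an illustration under stated assumptions, not Oracle's implementation: the function name, its parameters, and the example values (misscount 60, disktimeout 200) are hypothetical, and the model simply uses misscount as the I/O timeout during a reconfiguration and disktimeout otherwise.

```python
def disk_heartbeat_expired(seconds_since_last_heartbeat, in_reconfig,
                           misscount, disktimeout):
    """Return True if a node would be declared dead on disk-heartbeat grounds.

    Assumed model: during a reconfiguration (node join or leave) the short
    timeout (misscount) applies; in normal operation disktimeout applies.
    """
    io_timeout = misscount if in_reconfig else disktimeout
    return seconds_since_last_heartbeat > io_timeout


# Example values only: a node whose last disk heartbeat is 90 seconds old.
print(disk_heartbeat_expired(90, in_reconfig=False, misscount=60, disktimeout=200))
print(disk_heartbeat_expired(90, in_reconfig=True, misscount=60, disktimeout=200))
```

With these example values the same 90-second gap is tolerated in normal operation but not during a reconfiguration, because the shorter timeout applies there.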

The voting disk needs to be mirrored; should it become unavailable, the cluster will come down.

If an I/O error is reported immediately on access to the vote disk, we immediately mark the vote disk

as offline so it isn't at the party anymore. So now we have (in our case) just two voting disks available.

We do keep retrying access to that dead disk, and if it becomes available again and the data is

uncorrupted we mark it online again. If a second vote disk suffers an I/O error in the window

during which the first disk is marked offline, we no longer have quorum. Bang, reboot.
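
The offline/online bookkeeping described here can be modelled in a few lines. This is a toy sketch, not Oracle code: the class name, method names, and disk names are all made up for illustration, and quorum is taken to mean that more voting disks are online than offline.

```python
class VotingDiskSet:
    """Toy model of one node's view of its voting disks."""

    def __init__(self, names):
        # Every configured voting disk starts out online.
        self.online = {name: True for name in names}

    def report_io_error(self, name):
        # An I/O error on access marks the disk offline immediately.
        self.online[name] = False

    def report_retry_ok(self, name, data_uncorrupted=True):
        # A successful retry brings the disk back only if its data is intact.
        if data_uncorrupted:
            self.online[name] = True

    def has_quorum(self):
        # Quorum: strictly more disks online than offline.
        online_count = sum(self.online.values())
        return online_count > len(self.online) - online_count


disks = VotingDiskSet(["vote1", "vote2", "vote3"])
disks.report_io_error("vote1")
print(disks.has_quorum())  # two of three disks are still online
disks.report_io_error("vote2")
print(disks.has_quorum())  # only one of three left: the node would reboot
```

If the first disk had come back online before the second one failed, the node would have kept its quorum throughout.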


Voting files are used by CSS to ensure data integrity of the database by detecting and resolving network

problems that could lead to a split-brain, so must be accessible at all times.  There are other techniques

used by other cluster managers, like quorum server, and quorum disks which function differently, but serve

the same purpose.

Note that a majority of vote disks, i.e. N/2 + 1, must be accessible by each node to ensure that all pairs

of nodes have at least one voting file that they both see, which allows proper resolution of network issues;

this is to address the possible complaint that 2 voting files provide redundancy, so a third should not be necessary.


During normal processing, each node writes a disk heartbeat once per second and also reads its kill block

once per second. When the kill block indicates that the node has been evicted, the node exits, causing a node

reboot. As long as we have enough voting disks online, the node can survive, but when the number of offline

voting disks is greater than or equal to the number of online voting disks, the Cluster Communication Service

daemon will fail resulting in a reboot. The rationale for this is that as long as each node is required to

have a majority of voting disks online, there is guaranteed to be one voting disk that both nodes in a 2

node pair can see.
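
The majority argument can be checked by brute force. The sketch below is only an illustration (the function names are made up): it computes N/2 + 1 with integer division and verifies that any two majority-sized sets of voting disks share at least one disk. Larger sets are supersets of these, so they overlap as well.

```python
from itertools import combinations


def majority(n_disks):
    # N/2 + 1, using integer division.
    return n_disks // 2 + 1


def every_pair_shares_a_disk(n_disks):
    """Check that any two majority-sized sets of voting disks intersect."""
    views = [set(c) for c in combinations(range(n_disks), majority(n_disks))]
    return all(a & b for a in views for b in views)


for n in (2, 3, 5):
    print(n, majority(n), every_pair_shares_a_disk(n))
```

Note that with 2 voting disks the majority is 2, so losing either disk costs the node its quorum, while with 3 disks the majority is still 2 and one disk can fail.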



misscount: the maximum time for which network heartbeats may be missed (in seconds)



10g (R1 & R2)















The disk heartbeat timeout is disktimeout; from version 10.2.0.1 onward (with patch 4896338 applied) its default value is 200s.


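-- SDTO, short for Short Disk TimeOut, is the time allowed for the cluster to reconfigure when a node is added or removed.

-- LDTO (Long Disk TimeOut) is the time within which a voting disk I/O must complete during normal RAC operation before it is considered timed out.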




So, under what circumstances is a node evicted? The conditions are as follows:

· Node is not pinging via the network heartbeat

· Node is not pinging the Voting disk

· Node is hung/busy and is unable to perform either of the earlier tasks

Based on the description in the documentation, the node reboot conditions can be summarized in the following table:

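Network Ping                | Disk Ping                                       | Reboot
Completes within misscount  | Completes within misscount                      | N
Completes within misscount  | Exceeds misscount but is less than disktimeout  | N
Completes within misscount  | Exceeds disktimeout                             | Y
Exceeds misscount           | Completes within disktimeout                    | Y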
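
The reboot conditions can be read as a small decision function. This is a sketch under stated assumptions rather than Oracle's actual logic: the function name and the example timings are hypothetical, and it assumes that a node reboots when the network ping exceeds misscount or the disk ping exceeds disktimeout, while a disk ping between misscount and disktimeout is tolerated.

```python
def node_reboots(network_ping_s, disk_ping_s, misscount, disktimeout):
    """Return True if the node is expected to reboot under the assumed rules."""
    network_ok = network_ping_s <= misscount   # network heartbeat arrived in time
    disk_ok = disk_ping_s <= disktimeout       # voting disk I/O completed in time
    return not (network_ok and disk_ok)


# Example values only: misscount = 60 s, disktimeout = 200 s.
cases = [(1, 1), (1, 100), (1, 250), (70, 1)]
for network_ping, disk_ping in cases:
    print(network_ping, disk_ping, node_reboots(network_ping, disk_ping, 60, 200))
```

The second case is the interesting one: a disk ping of 100 seconds is slower than misscount but still inside disktimeout, so the node stays up.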



$ORA_CRS_HOME/bin/crsctl set css misscount <n>

where <n> is the maximum i/o latency to the voting disk +1 second.

On versions where patch 4896338 has been applied, the following operations are also needed:

$CRS_HOME/bin/crsctl set css reboottime <r> [-force]  (<r> is seconds)

$CRS_HOME/bin/crsctl set css disktimeout <d> [-force]  (<d> is seconds)
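
To make the "maximum i/o latency + 1 second" rule concrete, here is a small helper that only builds the command lines as strings and executes nothing. It is an illustrative sketch: the function name, its parameters, and the example values are assumptions, a single home variable is used for all three commands, and the optional -force flag is left out.

```python
import math


def css_commands(max_vote_io_latency_s, reboottime_s=None, disktimeout_s=None,
                 crs_home="$ORA_CRS_HOME"):
    """Build crsctl command strings for the CSS timeouts (nothing is run)."""
    # misscount = maximum voting disk I/O latency + 1 second, in whole seconds.
    misscount = math.ceil(max_vote_io_latency_s) + 1
    cmds = [f"{crs_home}/bin/crsctl set css misscount {misscount}"]
    if reboottime_s is not None:
        cmds.append(f"{crs_home}/bin/crsctl set css reboottime {reboottime_s}")
    if disktimeout_s is not None:
        cmds.append(f"{crs_home}/bin/crsctl set css disktimeout {disktimeout_s}")
    return cmds


# Example values only: 59 s worst-case latency, reboottime 3 s, disktimeout 200 s.
for line in css_commands(59, reboottime_s=3, disktimeout_s=200):
    print(line)
```

Printing the commands rather than running them leaves room to review the values before anything is changed on the cluster.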



Do not change default misscount values if you are running Vendor Clusterware along with Oracle Clusterware.

The default values for misscount should not be changed when using vendor clusterware. Modifying misscount in

this environment may cause clusterwide outages and potential corruptions.


10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout

CSS Timeout Computation in Oracle Clusterware

Reconfiguring the CSS disktimeout of 10gR2 Clusterware for Proper LUN Failover of the

Dell MD3000i iSCSI Storage [ID 462616.1]

How to start/stop the 10g CRS ClusterWare [ID 309542.1]


