os: centos 7.6
db: oracle 19.3

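The OCSSD process is the most critical Clusterware process. If it fails, the system reboots. It provides the CSS (Cluster Synchronization Service) service.
The CSS service monitors cluster state in real time through several heartbeat mechanisms and provides basic cluster services such as split-brain protection.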

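One is the network heartbeat; its delay is called MC (Misscount).
The other is the disk heartbeat; its delay is called IOT (I/O Timeout).
Both heartbeats have a maximum delay, measured in seconds. By default, Misscount < Disktimeout.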

crsctl get css

# su - grid
$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css
Usage:
  crsctl get css <parameter>
    Displays the value of a Cluster Synchronization Services parameter

    clusterguid
    disktimeout
    misscount
    reboottime
    noautorestart
    priority

  crsctl get css ipmiaddr
    Displays the IP address of the local IPMI device as set in the Oracle registry.
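
The three timeout parameters covered in this post can be queried in one pass. The following is only a sketch: the wrapper name show_css_timeouts and the CRSCTL variable are made up for illustration, and the default path is simply the Grid home used in the commands in this post.

```shell
#!/bin/sh
# Path taken from the commands in this post; override CRSCTL if your
# Grid home differs (assumption: crsctl lives under <grid_home>/bin).
CRSCTL="${CRSCTL:-/u01/app/grid/product/19.0.0/grid_1/bin/crsctl}"

# Hypothetical wrapper (not an Oracle tool): query the three timeout
# parameters one after another with "crsctl get css <parameter>".
show_css_timeouts() {
    for param in misscount disktimeout reboottime; do
        "$CRSCTL" get css "$param"
    done
}

# Run as the grid user on a cluster node:
# show_css_timeouts
```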


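misscount defaults to 30 s: if the interconnect latency between cluster nodes exceeds 30 s, Oracle considers that a split brain has occurred between the nodes and evicts the faulty node from the cluster.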

Every second, a sending thread in cssd sends a network TCP heartbeat to itself and to all other nodes. A receiving thread in ocssd.bin receives the heartbeat.
If a network packet is dropped or corrupted, the TCP error correction mechanism retransmits the packet.
Oracle itself does not retransmit. In ocssd.log you will see a WARNING message about a missing heartbeat if a node has not received a heartbeat from another node for 15 seconds (50% of misscount). Another warning is logged if the heartbeat from the same node has been missing for 22 seconds (75% of misscount), and another at 27 seconds (90% of misscount). When the heartbeat has been missing for the full misscount of 30 seconds (100%), the node is evicted.
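
The warning points described above are plain percentages of misscount. A minimal sketch of that arithmetic, assuming integer truncation (which reproduces the 15, 22 and 27 second figures quoted above); the function name misscount_thresholds is hypothetical and is not part of crsctl:

```shell
#!/bin/sh
# Hypothetical helper (not an Oracle tool): print the seconds at which
# the missed-heartbeat warnings appear (50%, 75%, 90% of misscount)
# followed by the eviction point (100%), using integer arithmetic.
misscount_thresholds() {
    mc="$1"
    echo "$((mc * 50 / 100)) $((mc * 75 / 100)) $((mc * 90 / 100)) $mc"
}

# With the default misscount of 30:
misscount_thresholds 30   # prints: 15 22 27 30
```

For a misscount of 60 the same arithmetic gives 30, 45, 54 and 60 seconds.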

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
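
For scripting, the numeric value can be pulled out of the CRS-4678 line. This is only a sketch that assumes the message keeps the exact wording shown above; css_value is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read crsctl output on stdin and print the value
# from a "CRS-4678: Successful get <name> <value> ..." line.
# Assumes the message layout shown in this post; other lines are ignored.
css_value() {
    awk '$1 == "CRS-4678:" && $3 == "get" { print $5 }'
}

# Example using the output line from above (no cluster needed):
# echo 'CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.' | css_value
```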


$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl set css misscount 60



disktimeout defaults to 200 s: if the ocssd process takes longer than 200 s to update a voting disk, Oracle considers that voting disk offline.

A thread in ocssd.bin updates the voting disk every second.
If a node does not update the voting disks for 200 seconds, it’s evicted.
However, ocssd.bin on the local node has logic to bring the node down if it gets I/O errors on the majority of the voting disks. Also, a CRS reconfiguration happens when the missed heartbeat time reaches 27 seconds, and the local node is rebooted. As a result, you rarely see an eviction caused by voting disk failure on [...] (this is more common in [...]), because ocssd.bin aborts the node before it gets evicted by another node when writing to the voting disk is the problem.

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
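
With both values known, the default relationship Misscount < Disktimeout noted earlier can be checked in a script. A minimal sketch; css_timeouts_ok is a hypothetical name, and the values are passed in by hand rather than read from the cluster.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit status 0) when misscount is strictly
# less than disktimeout, both given in seconds.
css_timeouts_ok() {
    misscount="$1"
    disktimeout="$2"
    [ "$misscount" -lt "$disktimeout" ]
}

# With the default values reported in this post:
if css_timeouts_ok 30 200; then
    echo "misscount < disktimeout"
fi
```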


$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl set css disktimeout 300


reboottime defaults to 3 s: the amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted.

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css reboottime
CRS-4678: Successful get reboottime 3 for Cluster Synchronization Services.


$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl set css reboottime 10


