OS: CentOS 7.6
DB: Oracle 19.3

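OCSSD is the most critical Clusterware process: if it fails, the system reboots. It provides the CSS (Cluster Synchronization Service) service.
CSS monitors cluster state in real time through several heartbeat mechanisms and provides basic cluster services such as split-brain protection.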

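CSS provides two heartbeat mechanisms:
The network heartbeat, whose delay is called MC (Misscount).
The disk heartbeat, whose delay is called IOT (I/O Timeout).
Both heartbeats have a maximum delay, measured in seconds. By default, Misscount < Disktimeout.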

crsctl get css

# su - grid
$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css
Usage:
  crsctl get css <parameter>
    Displays the value of a Cluster Synchronization Services parameter

    clusterguid
    disktimeout
    misscount
    reboottime
    noautorestart
    priority

  crsctl get css ipmiaddr
    Displays the IP address of the local IPMI device as set in the Oracle registry.

misscount

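Node state is checked over the private network. If the private network cannot communicate normally for a certain period, a split brain results; a master is chosen by election and a node is evicted.
The default is 30s: if the interconnect delay between cluster nodes exceeds 30s, Oracle considers a split brain to have occurred and evicts the failed node from the cluster.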

Every second, a sending thread in ocssd.bin sends a network TCP heartbeat to itself and to all other nodes. A receiving thread in ocssd.bin receives the heartbeats.
If a network packet is dropped or corrupted, TCP's error-correction mechanism retransmits it.
Oracle itself does not retransmit. In ocssd.log you will see a WARNING about a missing heartbeat if a node has not received a heartbeat from another node for 15 seconds (50% of misscount). Another warning is logged when the same node has been missing for 22 seconds (75% of misscount), and another at 27 seconds (90% of misscount). When the heartbeat has been missing for 100% of misscount (30 seconds), the node is evicted.
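
The warning points quoted above scale with misscount, so it can help to see where they land for a different value. The following is only a sketch of that arithmetic, not how ocssd computes it internally; the function name heartbeat_thresholds is made up for illustration, and the 50/75/90/100 percentages are taken from the quoted text.

```shell
#!/usr/bin/env bash
# heartbeat_thresholds MISSCOUNT
# Prints the elapsed-seconds marks for the 50% / 75% / 90% / 100% stages
# described in the quoted text. Hypothetical helper, for illustration only.
heartbeat_thresholds() {
  local misscount=$1 pct
  for pct in 50 75 90 100; do
    # Integer division truncates, which matches the 22s figure for 75% of 30.
    printf '%s%%=%ss\n' "$pct" $(( misscount * pct / 100 ))
  done
}

heartbeat_thresholds 30
```

With the default of 30 this prints 15s, 22s, 27s and 30s, the same figures as in the text. Calling it with 60 shows where the warnings would fall after raising misscount.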

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.

Changes take effect immediately; no component restart is required.

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl set css misscount 60

disktimeout

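Each node updates the voting disks once per second. The shared voting disks are used to check the disk heartbeat.
If, for the current node, fewer voting disks are offline than online, the node survives.
If the number of offline voting disks is greater than or equal to the number online, Clusterware considers the disk heartbeat to have failed; the failed node is evicted from the cluster and an automatic repair process runs.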
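
The survival rule above is a simple comparison, and a small sketch makes the boundary case (offline equal to online) explicit. The function name node_survives is hypothetical, and this only restates the rule as written here; it is not Clusterware's actual code.

```shell
#!/usr/bin/env bash
# node_survives OFFLINE ONLINE
# Returns 0 (survives) when fewer voting disks are offline than online,
# and 1 (evicted) when offline >= online, per the rule stated above.
# Hypothetical helper, for illustration only.
node_survives() {
  local offline=$1 online=$2
  [ "$offline" -lt "$online" ]
}

# Walk through a three-voting-disk configuration.
for offline in 0 1 2 3; do
  online=$(( 3 - offline ))
  if node_survives "$offline" "$online"; then
    echo "offline=$offline online=$online -> survives"
  else
    echo "offline=$offline online=$online -> evicted"
  fi
done
```

With three voting disks, losing one is tolerated and losing two is not.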

The default is 200s: if the ocssd process takes longer than 200s to update a voting disk, Oracle considers that voting disk offline.

A thread in ocssd.bin updates the voting disk every second.
If a node does not update the voting disks for 200 seconds, it’s evicted.
However, ocssd.bin on the local node has logic to bring the node down if it gets I/O errors on a majority of the voting disks. In addition, a CRS reconfiguration happens when the missed network heartbeat reaches 27 seconds, and the local node is rebooted. As a result, you rarely see an eviction caused by voting disk failure on 10.2.0.4 (it is more common on 10.2.0.1), because ocssd.bin aborts the node before another node can evict it when writing to the voting disk is the problem.

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
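
Since the default relationship is Misscount < Disktimeout, it can be handy to check that it still holds after changing either value. The sketch below extracts the number from the CRS-4678 line and compares the two. It assumes the message format is exactly as shown in this post; css_value is a made-up helper name, and the sample lines are copied from the output above rather than read from a live cluster.

```shell
#!/usr/bin/env bash
# css_value PARAM
# Reads `crsctl get css PARAM` output on stdin and prints the numeric value.
# Assumes the CRS-4678 message format shown in this post. Hypothetical helper.
css_value() {
  sed -n "s/^CRS-4678: Successful get $1 \([0-9][0-9]*\) for .*/\1/p"
}

# Sample lines copied from the output shown in this post.
mc=$(echo 'CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.' | css_value misscount)
dt=$(echo 'CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.' | css_value disktimeout)

if [ "$mc" -lt "$dt" ]; then
  echo "OK: misscount ($mc) < disktimeout ($dt)"
else
  echo "WARNING: misscount ($mc) >= disktimeout ($dt)"
fi
```

On a real node, the echo lines would be replaced by piping crsctl get css misscount and crsctl get css disktimeout into css_value.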

Changes take effect immediately; no component restart is required.

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl set css disktimeout 300

reboottime

The maximum time allowed for a node to begin rebooting after it has been evicted from the cluster.
The default is 3s.
Default 3 seconds: the amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted.

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl get css reboottime
CRS-4678: Successful get reboottime 3 for Cluster Synchronization Services.

Changes take effect immediately; no component restart is required.

$ /u01/app/grid/product/19.0.0/grid_1/bin/crsctl set css reboottime 10

A reboot occurs in the following situations:

References:
<< crsctl get css >>
https://docs.oracle.com/en/database/oracle/oracle-database/19/cwadd/oracle-clusterware-control-crsctl-utility-reference.html#GUID-C2258EC1-B92B-4423-9974-A5BAEA458D48

<< CSS Timeout Computation in Oracle Clusterware (Doc ID 294430.1) >>
https://support.oracle.com/epmos/faces/SearchDocDisplay?_adf.ctrl-state=d6h2lbhkp_9&_afrLoop=175931758698127#PURPOSE

<< 12c Grid Infrastructure Quick Reference (Doc ID 1517182.1) >>
https://support.oracle.com/epmos/faces/SearchDocDisplay?_adf.ctrl-state=d6h2lbhkp_9&_afrLoop=175931758698127#aref_section210
