Simulating a Private-Network Failure That Brings a Node Down and Prevents It from Starting

  • Purpose
  • Analysis
    • GI alert log
    • OS log
    • ocssd.log
  • References

Purpose

This article simulates a private-network (interconnect) problem that brings a cluster node down, and then walks through the resulting logs. The failure is injected by shutting down the private interface on one node:

# ifconfig eth1 down
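
Before pulling the interface down, it is worth confirming which interface Grid Infrastructure treats as the cluster interconnect, and keeping the restore command at hand. A minimal sketch, assuming the eth0 (public) / eth1 (interconnect) layout of this test environment:

    # List the network interfaces registered with the clusterware (run as the grid
    # owner); 11gR2 prints "<interface> <subnet> global public|cluster_interconnect".
    oifcfg getif

    # Inject the failure on one node, as root (the step taken in this test):
    ifconfig eth1 down

    # Bring the interconnect back up after the test:
    ifconfig eth1 up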

Analysis

GI alert log

<GI_HOME>/log/<node_name>/alert<node_name>.log

  • Node 1

    2019-01-16 13:44:16.839
    [cssd(28522)]CRS-1612:Network communication with node rac2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.400 seconds
    2019-01-16 13:44:23.855
    [cssd(28522)]CRS-1611:Network communication with node rac2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.380 seconds
    2019-01-16 13:44:28.865
    [cssd(28522)]CRS-1610:Network communication with node rac2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.370 seconds
    

    Network communication with node 2 (rac2) has been missing for 50%, then 75%, then 90% of the timeout interval; rac1 announces that it will remove rac2 from the cluster in roughly 14, 7, and finally 2 seconds.

    2019-01-16 13:44:31.243
    [cssd(28522)]CRS-1607:Node rac2 is being evicted in cluster incarnation 322157220; details at (:CSSNM00007:) in /u01/11.2.0/grid/log/rac1/cssd/ocssd.log.
    2019-01-16 13:44:54.468
    [ohasd(28301)]CRS-8011:reboot advisory message from host: rac2, component: mo093358, with time stamp: L-2019-01-16-13:44:53.124
    [ohasd(28301)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
    2019-01-16 13:45:02.236
    [cssd(28522)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
    2019-01-16 13:45:02.393
    [ctssd(28691)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
    2019-01-16 13:45:27.304
    [crsd(28825)]CRS-5504:Node down event reported for node 'rac2'.
    2019-01-16 13:45:32.086
    [crsd(28825)]CRS-2773:Server 'rac2' has been removed from pool 'Generic'.
    2019-01-16 13:45:32.086
    [crsd(28825)]CRS-2773:Server 'rac2' has been removed from pool 'ora.orcl'.
    2019-01-16 13:46:37.328
    [ctssd(28691)]CRS-2406:The Cluster Time Synchronization Service timed out on host rac1. Details in /u01/11.2.0/grid/log/rac1/ctssd/octssd.log.
    

    Node rac2 is evicted from the cluster and told to reboot; the cluster reconfiguration completes and rac2 is removed from the server pools.

  • Node 2

    2019-01-16 13:44:17.512
    [cssd(24201)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.400 seconds
    2019-01-16 13:44:24.529
    [cssd(24201)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 7.380 seconds
    2019-01-16 13:44:29.539
    [cssd(24201)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.370 seconds
    

    The mirror image on node 2: network communication with node 1 (rac1) has been missing for 50%, then 75%, then 90% of the timeout interval, with removal announced in roughly 14, 7, and 2 seconds.

    2019-01-16 13:44:31.915
    [cssd(24201)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/11.2.0/grid/log/rac2/cssd/ocssd.log.
    2019-01-16 13:44:32.025
    [cssd(24201)]CRS-1608:This node was evicted by node 1, rac1; details at (:CSSNM00005:) in /u01/11.2.0/grid/log/rac2/cssd/ocssd.log.
    2019-01-16 13:47:33.459
    [ohasd(2892)]CRS-2112:The OLR service started on node rac2.
    2019-01-16 13:47:34.870
    [ohasd(2892)]CRS-8011:reboot advisory message from host: rac2, component: ag125511, with time stamp: L-2019-01-15-16:20:15.283
    [ohasd(2892)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
    2019-01-16 13:47:35.804
    [ohasd(2892)]CRS-8011:reboot advisory message from host: rac2, component: mo093358, with time stamp: L-2019-01-16-13:44:53.124
    [ohasd(2892)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
    

    The messages above show that node 2 could not communicate with the other node in the cluster and went down to preserve cluster integrity; node 2 was evicted by node 1.
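
When analyzing an eviction like this one, the relevant CRS messages can be pulled straight out of the GI alert log. A quick grep sketch, assuming the node 1 log path shown earlier in this article:

    # Extract the heartbeat-loss and eviction messages from node 1's GI alert log.
    # CRS-1610/1611/1612 are the 90%/75%/50% heartbeat warnings, CRS-1607/1608/1609
    # report the eviction itself, and CRS-1601 marks the completed reconfiguration.
    grep -E "CRS-16(01|07|08|09|10|11|12)" /u01/11.2.0/grid/log/rac1/alertrac1.log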

OS log

/var/log/messages

  • Node 1

    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Cluster Membership change - Current  incarn 0x19
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Nodes in cluster:
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS]   Node 1 (IP 0xa0a0a0a)
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Node count 1
    Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration started.
    Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration completed.
    Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration completed.
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Cluster Membership change setup complete
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.187 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.187 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.187 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.172 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.172 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.172 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for fe80::a00:27ff:feb1:3373 on eth1.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 10.10.10.10 on eth1.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for fe80::a00:27ff:fe2e:e4e on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.172 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.186 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.182 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Host name conflict, retrying with <rac1-3>
    
  • Node 2
    No messages were logged during the failure window.
    

    From the messages above, the cluster has reconfigured, and node 2's VIP and the SCAN IP have been re-registered on node 1's eth0 interface.
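
To confirm where the failed-over addresses ended up, the VIP and SCAN resources can also be checked from the surviving node. A short sketch, assuming the rac1/rac2 names used in this environment:

    # Where is node 2's VIP running now, and where are the SCAN VIPs and listeners?
    srvctl status vip -n rac2
    srvctl status scan
    srvctl status scan_listener

    # Cross-check at the OS level on node 1: the failed-over addresses should show
    # up as secondary addresses on eth0.
    ip addr show eth0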

ocssd.log

<GI_HOME>/log/<node_name>/cssd/ocssd.log

  • Node 1

    2019-01-16 13:44:16.839: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 50% heartbeat fatal, removal in 14.400 seconds
    2019-01-16 13:44:16.839: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) is impending reconfig, flag 394254, misstime 15600
    

    These messages show that node 1 has detected node 2 missing network heartbeats for a sustained period. If this continues, a cluster reconfiguration will start in 14.400 seconds; in 11gR2 the misscount is 30 s and can be queried with crsctl get css misscount (a quick sketch follows the node 2 entry below).

  • Node 2

    2019-01-16 13:44:17.512: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, removal in 14.400 seconds
    2019-01-16 13:44:17.512: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) is impending reconfig, flag 132108, misstime 15600
    

    Node 2 logs the same kind of messages about node 1. In a two-node cluster this is expected; once the missed network heartbeats reach the misscount, the cluster has to reconfigure.
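
The CSS timeouts referenced above can be read directly from the clusterware. A quick sketch; the misscount value is the one quoted in the text, and the other two timeouts are listed for completeness:

    # Network heartbeat timeout (misscount); 30 s is the 11gR2 default for
    # non-Exadata clusters, matching the behaviour seen in these logs.
    crsctl get css misscount

    # Disk heartbeat timeout against the voting files (the long disk timeout).
    crsctl get css disktimeout

    # Time a node is allowed to complete its reboot after an eviction.
    crsctl get css reboottime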

  • Node 1

    2019-01-16 13:44:16.839: [    CSSD][1218210112]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
    2019-01-16 13:44:16.926: [    CSSD][1249679680]clssnmvSchedDiskThreads: DiskPingThread for voting file /dev/raw/raw1 sched delay 950 > margin 750 cur_ms 59003974 lastalive 59003024
    2019-01-16 13:44:18.914: [    CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:18.914: [    CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssscSelect: cookie accept request 0xa430f48
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmAllocProc: (0xad8a000) allocated
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmClientConnectMsg: properties of cmProc 0xad8a000 - 1,2,3,4
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmClientConnectMsg: Connect from con(0x5a2bd) proc(0xad8a000) pid(9116) version 11:2:1:4, properties: 1,2,3,4
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmClientConnectMsg: msg flags 0x0000
    2019-01-16 13:44:20.947: [    CSSD][1099692352]clssgmExecuteClientRequest: Node name request from client ((nil))
    2019-01-16 13:44:20.950: [    CSSD][1099692352]clssgmDeadProc: proc 0xad8a000
    2019-01-16 13:44:20.950: [    CSSD][1099692352]clssgmDestroyProc: cleaning up proc(0xad8a000) con(0x5a2bd) skgpid  ospid 9116 with 0 clients, refcount 0
    2019-01-16 13:44:20.950: [    CSSD][1099692352]clssgmDiscEndpcl: gipcDestroy 0x5a2bd
    2019-01-16 13:44:23.855: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 75% heartbeat fatal, removal in 7.380 seconds
    2019-01-16 13:44:23.917: [    CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:23.917: [    CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:28.865: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 90% heartbeat fatal, removal in 2.370 seconds, seedhbimpd 1
    2019-01-16 13:44:28.921: [    CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:28.921: [    CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:30.731: [    CSSD][1099692352]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 850 > margin 750 cur_ms 59017784 lastalive 59016934
    2019-01-16 13:44:31.242: [    CSSD][1218210112]clssnmPollingThread: Removal started for node rac2 (2), flags 0x6040e, state 3, wt4c 0
    2019-01-16 13:44:31.242: [    CSSD][1218210112]clssnmDiscHelper: rac2, node(2) connection failed, endp (0x331), probe(0x100000000), ninf->endp 0x331
    2019-01-16 13:44:31.242: [    CSSD][1218210112]clssnmDiscHelper: node 2 clean up, endp (0x331), init state 5, cur state 5
    

    Node 1 considers node 2 to have been continuously missing network heartbeats and decides to remove it from the cluster.
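
Because the eviction countdown is driven purely by missed network heartbeats, a manual check of interconnect reachability is the natural first step when this happens outside of a deliberate test. A sketch, assuming the private addresses 10.10.10.10 / 10.10.10.20 and interface eth1 from this environment:

    # From node 1, check whether node 2's private address still answers over the
    # interconnect (addresses taken from the gipc endpoints in these logs).
    ping -c 3 -I eth1 10.10.10.20

    # Confirm the local interconnect interface is up and still carries its address.
    ip link show eth1
    ip addr show eth1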

  • Node 2

    2019-01-16 13:44:17.512: [    CSSD][1222531392]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
    2019-01-16 13:44:18.247: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 840 > margin 750 cur_ms 20599214 lastalive 20598374
    2019-01-16 13:44:19.232: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20600194 lastalive 20599374
    2019-01-16 13:44:20.232: [    CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20601194 lastalive 20600374
    2019-01-16 13:44:21.239: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20602204 lastalive 20601374
    2019-01-16 13:44:22.236: [    CSSD][1233021248]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:22.236: [    CSSD][1233021248]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:22.243: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20603204 lastalive 20602374
    2019-01-16 13:44:23.238: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20604204 lastalive 20603374
    2019-01-16 13:44:24.238: [    CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20605204 lastalive 20604374
    2019-01-16 13:44:24.529: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, removal in 7.380 seconds
    2019-01-16 13:44:25.237: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20606204 lastalive 20605374
    2019-01-16 13:44:26.246: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 840 > margin 750 cur_ms 20607214 lastalive 20606374
    2019-01-16 13:44:27.230: [    CSSD][1233021248]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:27.230: [    CSSD][1233021248]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:27.232: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20608194 lastalive 20607374
    2019-01-16 13:44:28.231: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20609194 lastalive 20608374
    2019-01-16 13:44:29.236: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20610194 lastalive 20609374
    2019-01-16 13:44:29.539: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 90% heartbeat fatal, removal in 2.370 seconds, seedhbimpd 1
    2019-01-16 13:44:30.241: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20611204 lastalive 20610384
    2019-01-16 13:44:31.229: [    CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 20612194 lastalive 20611384
    2019-01-16 13:44:31.914: [    CSSD][1222531392]clssnmPollingThread: Removal started for node rac1 (1), flags 0x2040c, state 3, wt4c 0
    2019-01-16 13:44:31.914: [    CSSD][1222531392]clssnmDiscHelper: rac1, node(1) connection failed, endp (0x256), probe(0x100000000), ninf->endp 0x256
    2019-01-16 13:44:31.914: [    CSSD][1222531392]clssnmDiscHelper: node 1 clean up, endp (0x256), init state 5, cur state 5
    

    Node 2 likewise considers node 1 to have been continuously missing network heartbeats and decides to remove it from the cluster.

  • Node 1

    2019-01-16 13:44:31.242: [GIPCXCPT][1218210112]gipcInternalDissociate: obj 0xab2fe10 [0000000000000331] { gipcEndpoint : localAddr 'gipc://rac1:3f93-7c9d-5dcd-3d8d#10.10.10.10#43951', remoteAddr 'gipc://rac2:nm_raccluster#10.10.10.20#56183', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x5e89, pidPeer 0, flags 0x2616, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
    2019-01-16 13:44:31.242: [GIPCXCPT][1218210112]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3215]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 0xab2fe10 [0000000000000331] { gipcEndpoint : localAddr 'gipc://rac1:3f93-7c9d-5dcd-3d8d#10.10.10.10#43951', remoteAddr 'gipc://rac2:nm_raccluster#10.10.10.20#56183', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x5e89, pidPeer 0, flags 0x2616, usrFlags 0x0 }, flags 0x0
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: Initiating sync 322157220
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssscUpdateEventValue: NMReconfigInProgress  val 1, changes 3
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 322157220
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmSetupAckWait: Ack message type (11)
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmSetupAckWait: node(1) is ALIVE
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmSendSync: syncSeqNo(322157220), indicating EXADATA fence initialization complete
    2019-01-16 13:44:31.242: [    CSSD][1239189824]List of nodes that have ACKed my sync: NULL
    
  • Node 2

    2019-01-16 13:44:31.914: [GIPCXCPT][1222531392]gipcInternalDissociate: obj 0x4ab8b60 [0000000000000256] { gipcEndpoint : localAddr 'gipc://rac2:34a4-9438-68a9-4123#10.10.10.20#56183', remoteAddr 'gipc://10.10.10.10:3f4c-bc6d-9424-f406#10.10.10.10#43951', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x6f6a, pidPeer 0, flags 0x102616, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
    2019-01-16 13:44:31.914: [GIPCXCPT][1222531392]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3215]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 0x4ab8b60 [0000000000000256] { gipcEndpoint : localAddr 'gipc://rac2:34a4-9438-68a9-4123#10.10.10.20#56183', remoteAddr 'gipc://10.10.10.10:3f4c-bc6d-9424-f406#10.10.10.10#43951', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x6f6a, pidPeer 0, flags 0x102616, usrFlags 0x0 }, flags 0x0
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: Initiating sync 322157220
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssscUpdateEventValue: NMReconfigInProgress  val 1, changes 9
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 322157220
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmSetupAckWait: Ack message type (11)
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmSetupAckWait: node(2) is ALIVE
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmSendSync: syncSeqNo(322157220), indicating EXADATA fence initialization complete
    2019-01-16 13:44:31.914: [    CSSD][1243511104]List of nodes that have ACKed my sync: NULL
    

    Each node sends its reconfiguration (sync) message to the nodes it still believes are healthy, but in a two-node cluster there is no other node left to receive it. The status of each node therefore has to be decided from the disk heartbeat information.
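
Since the decision now falls back to the disk heartbeat, it helps to know which voting files the cluster is using. A quick sketch, assuming the /dev/raw/raw1 device visible in the DiskPingThread messages above:

    # List the voting files used for the disk heartbeat.
    crsctl query css votedisk

    # Basic read check of the voting device (read-only, as root).
    dd if=/dev/raw/raw1 of=/dev/null bs=1M count=1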

  • Node 1

    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmCheckDskInfo: Checking disk info...
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmCheckSplit: Node 2, rac2, is alive, DHB (1547617471, 20612054) more than disk timeout of 27000 after the last NHB (1547617441, 20582204)
  • Node 2

    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckDskInfo: Checking disk info...
    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckSplit: Node 1, rac1, is alive, DHB (1547617471, 59018214) more than disk timeout of 27000 after the last NHB (1547617441, 58988964)

    A split-brain situation appears to be imminent: each node can see from the disk heartbeats that the other is still alive. The cluster now has to apply its split-brain resolution rules to decide which node leaves the cluster.

    Split-brain resolution rule:
    In an Oracle cluster, split-brain means that the network heartbeat between some nodes is lost while their disk heartbeats remain healthy. When this happens, the cluster splits into sub-clusters (cohorts) and must reconfigure. The basic rule is that the cohort with more nodes survives; if the cohorts are the same size, the cohort containing the lowest-numbered node survives.
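
Because a tie between two single-node cohorts is broken by the lowest node number, it is useful to know each node's number up front. A quick sketch:

    # Show cluster node names with their node numbers; with equal-sized cohorts,
    # the cohort containing the lowest-numbered node (rac1, node 1 here) survives.
    olsnodes -n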

  • Node 1

    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmCheckDskInfo: My cohort: 1
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmEvict: Start
    2019-01-16 13:44:31.243: [    CSSD][1239189824](:CSSNM00007:)clssnmrEvict: Evicting node 2, rac2, from the cluster in incarnation 322157220, node birth incarnation 322157217, death incarnation 322157220, stateflags 0x64000
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmrFenceSage: Fenced node rac2, number 2, with EXADATA, handle 1239186920
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmSendShutdown: req to node 2, kill time 59018294
    
  • Node 2

    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckDskInfo: My cohort: 2
    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckDskInfo: Surviving cohort: 1
    2019-01-16 13:44:31.915: [    CSSD][1243511104](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2
    

    To avoid split-brain, node 2 leaves the cluster and the node is rebooted. Starting with 11.2.0.2, thanks to the rebootless restart feature, the node itself is no longer rebooted; instead the clusterware restarts its own stack.
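
Once node 2 comes back (or, on 11.2.0.2 and later, once the stack has been restarted in place), the state of the clusterware can be verified before digging any further. A minimal sketch:

    # Health of the local CRS/CSS/EVM daemons on the restarted node.
    crsctl check crs

    # Cluster-wide view: both rac1 and rac2 should report online once node 2 rejoins.
    crsctl check cluster -all

    # Lower-stack (init) resources; useful when the stack itself fails to start.
    crsctl stat res -t -init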

References

《RAC核心技术详解》
