IPC Send timeout/node eviction etc with high packet reassembles failure

RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Generic Linux

SYMPTOMS

On Red Hat Enterprise Linux, or Oracle Linux running the Red Hat Compatible Kernel, the database instance or node fails with the following messages after an upgrade to 6.6:

Fri May 01 03:05:48 2015
IPC Send timeout detected. Receiver ospid 28660 [oracle@xxxxx (LMS0)]
Fri May 01 03:05:48 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_lms0_28660.trc:
IPC Send timeout detected. Receiver ospid 28670 [oracle@xxxxx (LMS1)]
Fri May 01 03:05:53 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_lms1_28670.trc:
Fri May 01 03:06:00 2015
IPC Send timeout detected. Receiver ospid 31414 [oracle@xxxxx (PZ98)]
Fri May 01 03:06:00 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_pz98_31414.trc:
Fri May 01 03:06:13 2015
IPC Send timeout detected. Receiver ospid 1835 [oracle@xxxxx (PZ97)]
Fri May 01 03:06:13 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_pz97_1835.trc:
Fri May 01 03:06:43 2015
Fri May 01 03:06:43 2015
Received an instance abort message from instance 1Received an instance abort message from instance 1

Please check instance 1 alert and LMON trace files for detail.Please check instance 1 alert and LMON trace files for detail.

LMS0 (ospid: 28660): terminating the instance due to error 481

Fri May 01 03:06:43 2015

System state dump requested by (instance=3, osid=28660 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_diag_28625.trc

While this is happening, "netstat -s" shows a sharp jump in the "packet reassembles failed" counter:

==>> before the issue, the following number is more or less stable or increasing slowly
6817 packet reassembles failed
....
==>> in 30 minutes it increased by 50
6867 packet reassembles failed
==>> now the issue is happening and in 10 seconds it increased by 7533 - 6867 = 666
7533 packet reassembles failed
==>> in another 10 seconds it increased by 9630 - 7533 = 2097
9630 packet reassembles failed
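
The sampling shown above can be scripted. The following is a minimal sketch, assuming the counter appears in the "netstat -s" output as a line of the form "<number> packet reassembles failed", as in the samples above. The function names reasm_fails and reasm_delta are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any Oracle or OS tooling).

# Read `netstat -s` output on stdin and print the
# "packet reassembles failed" counter; prints 0 if the line is not present.
reasm_fails() {
    awk '/packet reassembles failed/ { n = $1 } END { print n + 0 }'
}

# Print the increase between two counter samples: reasm_delta BEFORE AFTER
reasm_delta() {
    echo $(( $2 - $1 ))
}

# Example use on a live system (10-second interval, as in the samples above):
#   before=$(netstat -s | reasm_fails)
#   sleep 10
#   after=$(netstat -s | reasm_fails)
#   reasm_delta "$before" "$after"
```

A delta that stays small over long intervals matches the "before the issue" pattern above, while a delta of several hundred or more within 10 seconds matches the failure pattern.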

Other symptoms could be:

1. node eviction

2. after an instance/node eviction, the instance/node cannot rejoin the cluster until the node where "packet reassembles failed" is increasing has been rebooted

CAUSE

RHEL 6.6 includes a few ipfrag fixes and increases the default ipfrag_*_thresh values:

cat /proc/sys/net/ipv4/ipfrag_low_thresh
3145728
cat /proc/sys/net/ipv4/ipfrag_high_thresh
4194304

However, the issue still occurs. For Oracle Linux running the Red Hat Compatible Kernel, the issue was tracked in the bug below, which was later closed as 'Not a bug':

BUG 21036841 - LCOV5/7/17 SERVER CRASHED AFTER PATCH UPGRADE AND KERNEL UPGRADE

SOLUTION

The workaround is to enable jumbo frames.

Increase the values of the following kernel parameters as shown:

net.ipv4.ipfrag_high_thresh = 16M
net.ipv4.ipfrag_low_thresh = 15M

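These values are given in MB. The kernel parameters themselves are set in bytes, so 16M corresponds to 16777216 and 15M to 15728640.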
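
Because the /proc entries shown in the CAUSE section hold byte counts, the MB figures above need converting before they are applied. The following is a minimal sketch of that conversion; mb_to_bytes and ipfrag_settings are hypothetical helper names introduced here, and 1 MB is assumed to mean 1024 * 1024 bytes.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Convert a size in MB to bytes (assumes 1 MB = 1024 * 1024 bytes).
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print sysctl.conf-style lines: ipfrag_settings HIGH_MB LOW_MB
ipfrag_settings() {
    printf 'net.ipv4.ipfrag_high_thresh = %s\n' "$(mb_to_bytes "$1")"
    printf 'net.ipv4.ipfrag_low_thresh = %s\n' "$(mb_to_bytes "$2")"
}

# Example:
#   ipfrag_settings 16 15
# prints:
#   net.ipv4.ipfrag_high_thresh = 16777216
#   net.ipv4.ipfrag_low_thresh = 15728640
```

One common way to make such settings persistent is to add the printed lines to /etc/sysctl.conf and load them as root with "sysctl -p". Check this against local configuration practice before applying it.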