RHEL 6.6: IPC Send timeout/node eviction etc. with high packet reassembly failures (Doc ID 2008933.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Generic Linux

SYMPTOMS

On Red Hat Enterprise Linux, or Oracle Linux running the Red Hat Compatible Kernel, after an upgrade to 6.6 the database or node fails with messages such as:

Fri May 01 03:05:48 2015
IPC Send timeout detected. Receiver ospid 28660 [oracle@xxxxx (LMS0)]
Fri May 01 03:05:48 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_lms0_28660.trc:
IPC Send timeout detected. Receiver ospid 28670 [oracle@xxxxx (LMS1)]
Fri May 01 03:05:53 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_lms1_28670.trc:
Fri May 01 03:06:00 2015
IPC Send timeout detected. Receiver ospid 31414 [oracle@xxxxx (PZ98)]
Fri May 01 03:06:00 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_pz98_31414.trc:
Fri May 01 03:06:13 2015
IPC Send timeout detected. Receiver ospid 1835 [oracle@xxxxx (PZ97)]
Fri May 01 03:06:13 2015
Errors in file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_pz97_1835.trc:
Fri May 01 03:06:43 2015
Fri May 01 03:06:43 2015
Received an instance abort message from instance 1Received an instance abort message from instance 1

Please check instance 1 alert and LMON trace files for detail.Please check instance 1 alert and LMON trace files for detail.

LMS0 (ospid: 28660): terminating the instance due to error 481

Fri May 01 03:06:43 2015

System state dump requested by (instance=3, osid=28660 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file <ORACLE_BASE>/diag/rdbms/xrcovd/<dbname>/trace/<sid>_diag_28625.trc

While this is happening, "netstat -s" shows a sharp jump in the "packet reassembles failed" counter:

==>> before the issue, the following number is more or less stable or increasing slowly
6817 packet reassembles failed
....
==>> in 30 minutes it increased by 50
6867 packet reassembles failed
==>> now the issue is happening and in 10 seconds it increased by 7533 - 6867 = 666
7533 packet reassembles failed
==>> in another 10 seconds it increased by 9630 - 7533 = 2097
9630 packet reassembles failed
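
The comparison above can be automated. The following is a minimal sketch, not part of the original note: it extracts the "packet reassembles failed" counter from saved "netstat -s" output and computes the increase between consecutive samples. The function names and the threshold of 100 are hypothetical, and the check ignores how far apart the samples were taken, so the sampling interval has to be kept in mind when reading the result.

```python
import re

# Matches the counter line as printed by `netstat -s`,
# e.g. "    6817 packet reassembles failed".
REASM_FAILED = re.compile(
    r"^[ \t]*(\d+)[ \t]+packet reassembles failed[ \t]*$", re.MULTILINE
)


def reassembly_failures(netstat_output):
    """Return the 'packet reassembles failed' counter from one snapshot."""
    match = REASM_FAILED.search(netstat_output)
    if match is None:
        raise ValueError("counter 'packet reassembles failed' not found")
    return int(match.group(1))


def deltas(samples):
    """Return the increase between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def is_spiking(samples, threshold=100):
    """Hypothetical check: True if any interval grew by more than threshold."""
    return any(d > threshold for d in deltas(samples))


if __name__ == "__main__":
    # Counter values taken from the snapshots shown above.
    samples = [6817, 6867, 7533, 9630]
    print(deltas(samples))
    print(is_spiking(samples))
```

Run against the four counter values from this note, the script reports the same increases as the annotations (50, 666 and 2097) and flags the series as spiking.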

Other symptoms could be:

1. node eviction

2. after an instance/node eviction, the instance/node cannot rejoin the cluster until the node where "packet reassembles failed" is happening has been rebooted

CAUSE

RHEL 6.6 includes a few ipfrag fixes and increased the default ipfrag_*_thresh values:

cat /proc/sys/net/ipv4/ipfrag_low_thresh
3145728
cat /proc/sys/net/ipv4/ipfrag_high_thresh
4194304
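
The two values above are byte counts (3145728 bytes is 3 MiB and 4194304 bytes is 4 MiB). As a small illustrative sketch, the same files can be read and converted programmatically. The function names are hypothetical, and the directory is a parameter so the code can be pointed at a copy of the files when not running on the affected host.

```python
from pathlib import Path


def read_ipfrag_thresholds(base="/proc/sys/net/ipv4"):
    """Read ipfrag_low_thresh and ipfrag_high_thresh (in bytes) from base."""
    values = {}
    for name in ("ipfrag_low_thresh", "ipfrag_high_thresh"):
        values[name] = int((Path(base) / name).read_text().strip())
    return values


def to_mib(num_bytes):
    """Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    base = Path("/proc/sys/net/ipv4")
    # Only attempt the read where the kernel exposes these files (Linux).
    if (base / "ipfrag_high_thresh").exists():
        for name, value in read_ipfrag_thresholds(base).items():
            print(f"{name} = {value} bytes ({to_mib(value):g} MiB)")
```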

  

However, the issue still happens. For Oracle Linux running the Red Hat Compatible Kernel, the issue was tracked in the bug below, which was later closed as 'Not a bug':

BUG 21036841 - LCOV5/7/17 SERVER CRASHED AFTER PATCH UPGRADE AND KERNEL UPGRADE

SOLUTION

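The workaround is to enable jumbo frames

or

increase the values of the following kernel parameters:

net.ipv4.ipfrag_high_thresh = 16M
net.ipv4.ipfrag_low_thresh = 15M

The values above are given in MB. The kernel parameters themselves take byte counts, as in the /proc output shown under CAUSE, so 16M corresponds to 16777216 and 15M to 15728640.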
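
Because the ipfrag_*_thresh values shown by /proc earlier in this note are byte counts, the 16M and 15M settings may need to be written out in bytes when they are applied. The sketch below shows the arithmetic, assuming "M" means 1024 * 1024 bytes. The function names are hypothetical, and the output format assumes the lines are destined for /etc/sysctl.conf (loaded with "sysctl -p"), which this note does not state explicitly.

```python
def mib_to_bytes(mib):
    """Convert MiB to bytes, assuming 1 MiB = 1024 * 1024 bytes."""
    return mib * 1024 * 1024


def sysctl_lines(high_mib=16, low_mib=15):
    """Render the two settings from this note as byte-valued sysctl lines."""
    if low_mib >= high_mib:
        raise ValueError("low threshold must be below the high threshold")
    return [
        f"net.ipv4.ipfrag_high_thresh = {mib_to_bytes(high_mib)}",
        f"net.ipv4.ipfrag_low_thresh = {mib_to_bytes(low_mib)}",
    ]


if __name__ == "__main__":
    for line in sysctl_lines():
        print(line)
```

With the default arguments this produces 16777216 for the high threshold and 15728640 for the low threshold.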
