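Cross-node job fails when run over the IB network
1. Symptom: the customer's HPC job computes correctly over the Gigabit Ethernet network, but fails with the following errors when it runs over the IB (InfiniBand) network: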
psolid.x 0000000000519C2D xmp_init_ 91 xmp_init.F
psolid.x 00000000005164BF pamcsm_ 88 pamcsm.F
psolid.x 0000000000515D90 MAIN__ 26 pcrash.F
psolid.x 0000000000515D1C Unknown Unknown Unknown
libc.so.6 00007FCB4A9B4C36 Unknown Unknown Unknown
psolid.x 0000000000515C29 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00007F4E28BFC7E0 Unknown Unknown Unknown
libibverbs.so.1 00007F4E23E4BC79 Unknown Unknown Unknown
libibverbs.so.1 00007F4E23E4CC08 Unknown Unknown Unknown
libmpi.so.1 00007F4E2669A8CF Unknown Unknown Unknown
libmpi.so.1 00007F4E2654CC45 Unknown Unknown Unknown
libmpi.so.1 00007F4E26514B3B Unknown Unknown Unknown
libmpi.so.1 00007F4E26516935 Unknown Unknown Unknown
libmpi.so.1 00007F4E26517BCF Unknown Unknown Unknown
libmpi.so.1 00007F4E265AD3BC Unknown Unknown Unknown
libmpm_platform-9 00007F4E26867F63 MPM_Mod_F_Init 23 MPM_Mod_F_Init.c
libmpm.so 00007F4E30B5209A mpi_init_ 44 MPM_Lib_F_Init.c
psolid.x 00000000005F55AB mpp_init_ 68 mpp_init.F
psolid.x 0000000000519C2D xmp_init_ 91 xmp_init.F
psolid.x 00000000005164BF pamcsm_ 88 pamcsm.F
psolid.x 0000000000515D90 MAIN__ 26 pcrash.F
psolid.x 0000000000515D1C Unknown Unknown Unknown
libc.so.6 00007F4E28890C36 Unknown Unknown Unknown
psolid.x 0000000000515C29 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00007FFBB1E877E0 Unknown Unknown Unknown
libibverbs.so.1 00007FFBAD0D6C79 Unknown Unknown Unknown
libibverbs.so.1 00007FFBAD0D7C08 Unknown Unknown Unknown
libmpi.so.1 00007FFBAF9258CF Unknown Unknown Unknown
libmpi.so.1 00007FFBAF7D7C45 Unknown Unknown Unknown
libmpi.so.1 00007FFBAF79FB3B Unknown Unknown Unknown
libmpi.so.1 00007FFBAF7A1935 Unknown Unknown Unknown
libmpi.so.1 00007FFBAF7A2BCF Unknown Unknown Unknown
libmpi.so.1 00007FFBAF8383BC Unknown Unknown Unknown
libmpm_platform-9 00007FFBAFAF2F63 MPM_Mod_F_Init 23 MPM_Mod_F_Init.c
libmpm.so 00007FFBB9DDD09A mpi_init_ 44 MPM_Lib_F_Init.c
psolid.x 00000000005F55AB mpp_init_ 68 mpp_init.F
psolid.x 0000000000519C2D xmp_init_ 91 xmp_init.F
psolid.x 00000000005164BF pamcsm_ 88 pamcsm.F
psolid.x 0000000000515D90 MAIN__ 26 pcrash.F
psolid.x 0000000000515D1C Unknown Unknown Unknown
libc.so.6 00007FFBB1B1BC36 Unknown Unknown Unknown
psolid.x 0000000000515C29 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00007F9653FB07E0 Unknown Unknown Unknown
libibverbs.so.1 00007F964F1FFC79 Unknown Unknown Unknown
libibverbs.so.1 00007F964F200C08 Unknown Unknown Unknown
libmpi.so.1 00007F9651A4E8CF Unknown Unknown Unknown
ssh_keysign: no reply
key_sign failed
psolid.x: Rank 0:2: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:2: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:2: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:2: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:0: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:0: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:0: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:0: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:3: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:3: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:3: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:3: MPI_Init: Internal Error: Processes cannot connect to rdma device
psolid.x: Rank 0:1: MPI_Init: Could not pin pre-pinned rdma region 0
psolid.x: Rank 0:1: MPI_Init: hpmp_rdmaregion_alloc() failed
psolid.x: Rank 0:1: MPI_Init: make_world_rdmaenvelope() failed
psolid.x: Rank 0:1: MPI_Init: Internal Error: Processes cannot connect to rdma device
MPI Application rank 0 exited before MPI_Finalize() with status 1
MPI Application rank 2 exited before MPI_Finalize() with status 1
pamcrash : Error :
==============================================================================
This process has exited with a nonzero exit code, indicating an error
termination.
You may have some unmerged files left behind like VW331-4CS_K_SAD_China-NCAP-MDB_51_40_v045_xxx.{LIS,msg}
in /CAE/home/tpbrls/pam2014.3_test_new directory, containing some relevant informations regarding this error
condition.
Please refer to your documentation, or contact you technical support for this
merging purpose.
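2. Solution: at first we suspected that some library files had not been installed, but the cause turned out to be a resource limit. The message "Could not pin pre-pinned rdma region" indicates that MPI could not lock (pin) memory for RDMA, and the amount of memory a process may lock is capped by the memlock limit. After appending the following two lines to the end of /etc/security/limits.conf and rebooting, the problem was resolved.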
admin:~ # cat /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
Note: memlock means "max locked-in-memory address space (KB)".
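
To confirm that the new limit is really in effect for the processes that run the job, the memlock limit can be read from inside a process. The following is a minimal sketch, not part of the original fix: it uses Python's standard `resource` module, and the helper names `format_limit` and `memlock_is_unlimited` are hypothetical. It assumes a Linux node and should be run on every compute node, ideally through the same launch path the MPI job uses, because a limit seen in an interactive login shell may differ from the one a remotely started process receives.

```python
import resource


def format_limit(value):
    """Render an rlimit value as 'unlimited' or as a size in KB."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit() reports bytes, while limits.conf expresses memlock in KB.
    return f"{value // 1024} KB"


def memlock_is_unlimited():
    """Return True when both the soft and hard memlock limits are unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print(f"memlock soft={format_limit(soft)} hard={format_limit(hard)}")
    if not memlock_is_unlimited():
        # Hedged hint only: a limited value is a likely, not certain, cause.
        print("memlock is limited; pinning RDMA memory in MPI_Init may fail")
```

If the script still reports a limited value after the change, the two lines in /etc/security/limits.conf have probably not been applied to that session yet.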