Introduction

While training a model with TensorFlow, I ran into the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node AllReduceGrads/NcclAllReduce (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/graph_builder/utils.py:160) with these attrs: [reduction="sum", shared_name="c0", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'
  [[AllReduceGrads/NcclAllReduce]]
Errors may have originated from an input operation.
Input Source operations connected to node AllReduceGrads/NcclAllReduce:
  tower0/gradients/AddN_373 (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/train/tower.py:276)
terminate called without an active exception
terminate called recursively
terminate called recursively
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
Aborted (core dumped)

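The error occurred only when training in parallel on multiple GPUs. Training on a single GPU raised no error, and the selected GPU showed about 135 MB of memory in use. Given the "Registered devices: [CPU, XLA_CPU, XLA_GPU]" line above, the single-GPU run was probably not executing on a regular GPU device either; it simply never needed the NcclAllReduce op.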

Environment

  • OS: Ubuntu 16.04
  • CUDA: 10.1
  • cuDNN: 8.0.2
  • tensorflow-gpu:1.14.0

Cause and Fix

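The error comes down to an environment configuration problem. Scrolling up from the error shown above, the earlier output contained the following messages: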

2020-08-14 13:58:07.324004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324205: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324311: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324508: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324614: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-14 13:58:07.324666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
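
When the training log is long, these dlopen messages are easy to miss. The following is a minimal sketch, not part of the original setup, that pulls the names of the libraries TensorFlow failed to load out of a saved log. The function names `missing_libraries` and `gpu_registration_skipped` are hypothetical helpers, and the patterns assume the message wording shown in the log above.

```python
import re

# Matches dso_loader messages of the form seen in the log above:
#   Could not dlopen library 'libcudart.so.10.0'; dlerror: ...
DLOPEN_FAILURE = re.compile(r"Could not dlopen library '([^']+)'")


def missing_libraries(log_text):
    """Return library names TensorFlow failed to dlopen, in order, without duplicates."""
    names = []
    for name in DLOPEN_FAILURE.findall(log_text):
        if name not in names:
            names.append(name)
    return names


def gpu_registration_skipped(log_text):
    """Return True if the log reports that GPU device registration was skipped."""
    return "Skipping registering GPU devices" in log_text


# Shortened sample lines copied from the log above (assumed wording).
sample_log = (
    "I dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: ...\n"
    "I dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: ...\n"
    "W gpu_device.cc:1663] Cannot dlopen some GPU libraries. "
    "Skipping registering GPU devices...\n"
)

print(missing_libraries(sample_log))
print(gpu_registration_skipped(sample_log))
```

The version suffix in each reported name (here `.10.0` and `.7`) shows which CUDA and cuDNN releases this TensorFlow build expects.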

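The log messages above show that TensorFlow could not find the libcu*.so.10.0 libraries (or libcudnn.so.7), so it skipped registering the GPU devices. That is why no GPU kernel was available for NcclAllReduce. The root cause is a CUDA version mismatch: I had installed CUDA 10.1, whereas TensorFlow 1.14 requires CUDA 10.0. The fix is to change either the CUDA version or the TensorFlow version. TensorFlow publishes a table of matching TensorFlow and CUDA versions: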

Official documentation: https://www.tensorflow.org/install/source?hl=zh-cn

According to that version table, tensorflow_gpu-1.14.0 corresponds to CUDA 10.0. In the end I changed my CUDA version, which solved the problem.
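
Before launching a long training job, it can help to confirm that the dynamic loader can actually find the libraries TensorFlow will ask for. The sketch below is one way to do that with Python's standard `ctypes` module. The list `REQUIRED_LIBS` is copied from the dlopen messages in the log above, and the helper name `unloadable` is hypothetical. It assumes the script runs in the same shell environment (same `LD_LIBRARY_PATH`) as the training job.

```python
import ctypes

# Library names taken from the dlopen messages in the log above,
# i.e. what this TensorFlow 1.14 build tried to load.
REQUIRED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def unloadable(libs):
    """Return the subset of libs that the dynamic loader cannot open."""
    failed = []
    for name in libs:
        try:
            ctypes.CDLL(name)  # uses the same search path as TensorFlow's loader
        except OSError:
            failed.append(name)
    return failed


failed = unloadable(REQUIRED_LIBS)
if failed:
    print("Cannot load:", ", ".join(failed))
else:
    print("All listed CUDA libraries can be loaded.")
```

If any name is reported, either the matching CUDA or cuDNN version is not installed or its `lib64` directory is missing from `LD_LIBRARY_PATH`.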
