While running the model I hit the error below (it is very long, so only the key parts are kept). From searching online I learned that this error usually means there is not enough GPU memory, which can be caused by an overly large batch_size or by the card being occupied by other services. I then looked through the source code and happened to notice that the default value of n_gpu used in the code was 4; after changing it to 1 and rerunning, the code executed successfully.
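As a side note on why the n_gpu setting matters: the traceback further down shows op names like model_2/h2/mlp/Pow placed on /gpu:0, i.e. a third copy of the model graph ended up on the single physical Tesla P4, which is consistent with the script building one model replica per configured GPU. Below is a minimal sketch of that kind of command-line option and how I would override it at launch time instead of editing the source; the --n_gpu flag name and its default of 4 follow what I saw in train.py, but the surrounding code is only an illustration, not the repo's actual implementation.

```python
import argparse

parser = argparse.ArgumentParser()
# A default of 4 replicas is what caused the OOM on my single Tesla P4;
# passing --n_gpu 1 (or lowering this default) matches the fix described above.
parser.add_argument('--n_gpu', type=int, default=4)
args = parser.parse_args()

print('building %d model replica(s)' % args.n_gpu)
```

With an option like this, the fix becomes a launch-time change, e.g. `python train.py --n_gpu 1`, rather than a source edit.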

Combining the resources I found online with this experiment, the causes of this problem can be summarized as follows (a small mitigation sketch follows the list):

  1. the batch_size is too large;
  2. other models are already occupying GPU resources;
  3. the configured number of GPUs does not match the actual hardware (it is set too high).
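Here is a minimal sketch of the first two mitigations for a TensorFlow 1.x session (matching the 1.x-style log below). The device index and batch size are placeholders I picked for illustration, not values from the original training script.

```python
import os
import tensorflow as tf  # TensorFlow 1.x style API, as in the log below

# Cause 2: expose only one idle card to this process;
# check occupancy first with `nvidia-smi`.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Cause 1: shrink the batch size (placeholder value).
batch_size = 8

# Optionally let the allocator grow GPU memory on demand instead of
# reserving almost all of the 7.43GiB up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build the graph and run training with the smaller batch_size here
```

Cause 3 is what actually happened in my case, and fixing the n_gpu value alone was enough. The trimmed error log follows.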
2019-03-16 18:59:38.881528: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881535: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881540: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:39.005554: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-16 18:59:39.005820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla P4
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:00:07.0
Total memory: 7.43GiB
Free memory: 7.32GiB
2019-03-16 18:59:39.005851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2019-03-16 18:59:39.005858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y
2019-03-16 18:59:39.005868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P4, pci bus id: 0000:00:07.0)
0%|                                                    | 0/46 [00:00<?, ?it/s]
2019-03-16 19:00:05.441385: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:05.441859: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:05.462553: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
2019-03-16 19:00:05.462905: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.462917: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112
2019-03-16 19:00:05.463019: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.463075: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:05.463170: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:05.463596: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:05.484133: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
2019-03-16 19:00:05.484464: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.484475: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112
2019-03-16 19:00:05.484576: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.484592: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:05.530899: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.61MiB.  Current allocation summary follows.
2019-03-16 19:00:05.531407: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 3.61MiB was 2.00MiB, Chunk State:
2019-03-16 19:00:05.553057: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
2019-03-16 19:00:05.553394: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.553404: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112
2019-03-16 19:00:05.553505: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.553531: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,768]
2019-03-16 19:00:05.553668: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.61MiB.  Current allocation summary follows.
2019-03-16 19:00:05.554103: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 3.61MiB was 2.00MiB, Chunk State:
2019-03-16 19:00:05.574314: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
2019-03-16 19:00:05.574638: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.574666: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112
2019-03-16 19:00:05.574770: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.574786: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[1232,768]
2019-03-16 19:00:15.484765: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:15.485248: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:15.506609: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
2019-03-16 19:00:15.506956: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:15.506968: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  7463944192
InUse:                  7462422528
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112
2019-03-16 19:00:15.507082: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:15.507112: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:25.507333: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:25.507912: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:25.527807: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
2019-03-16 19:00:25.528034: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:25.528044: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  7463944192
InUse:                  7462422528
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112
2019-03-16 19:00:25.528124: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:25.528148: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
Traceback (most recent call last):
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,77,3072]
	 [[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
	 [[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 433, in <module>
    cost, _ = sess.run([clf_loss, train], {X_train:xmb, M_train:mmb, Y_train:ymb})
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,77,3072]
	 [[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
	 [[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'model_2/h2/mlp/Pow', defined at:
  File "train.py", line 397, in <module>
    train, logits, clf_losses, lm_losses = mgpu_train(X_train, M_train, Y_train)
  File "train.py", line 203, in mgpu_train
    clf_logits, clf_losses, lm_losses = model(*xs, train=True, reuse=do_reuse)
  File "train.py", line 172, in model
    h = block(h, 'h%d'%layer, train=train, scale=True)
  File "train.py", line 145, in block
    m = mlp(n, 'mlp', nx*4, train=train)
  File "train.py", line 135, in mlp
    h = act(conv1d(x, 'c_fc', n_state, 1, train=train))
  File "train.py", line 23, in gelu
    return 0.5*x*(1+tf.tanh(math.sqrt(2/math.pi)*(x+0.044715*tf.pow(x, 3))))
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 544, in pow
    return gen_math_ops._pow(x, y, name=name)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1533, in _pow
    result = _op_def_lib.apply_op("Pow", x=x, y=y, name=name)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,77,3072]
	 [[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
	 [[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
