1. non-finite loss, ending training tensor(nan, device='cuda:0', 2. 'LogSoftmaxBackward', 3. Function 'MulB

Error 1: WARNING: non-finite loss, ending training tensor(nan, device='cuda:0', grad_

Error 2: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Error 3: Function 'MulBackward0' returned nan values in its 0th output

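Reference 1: If you run into this, try switching to a different dataset. I spent two days on it: I changed the PyTorch version, tried everything I could think of, and kept suspecting the network structure, modifying it again and again without success. My loss function was cross-entropy, and I also tried focal loss. Both rely on log, and log(0) is -inf, which quickly turns the loss into nan. After switching to another dataset, training worked. When I finally looked at the images in the original dataset, some were almost entirely white, with values very close to 0, which may have been the cause.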
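As a minimal sketch of the log(0) problem described above, the snippet below shows how a hand-written loss built on `torch.log` can turn into nan, and how a small clamp avoids it. The helper `find_suspect_images`, the epsilon `1e-8`, and the threshold `1e-3` are assumptions made for illustration, not part of the original training code; PyTorch's built-in cross-entropy uses a numerically stable log-softmax, so this mainly concerns custom losses.

```python
import torch

# log(0) is -inf, and 0 * -inf is nan, so a hand-written cross-entropy
# breaks as soon as a predicted probability reaches exactly 0.
p = torch.tensor([0.0, 1.0])        # predicted probabilities
target = torch.tensor([0.0, 1.0])   # one-hot target
naive = -(target * torch.log(p)).sum()                  # nan
safe = -(target * torch.log(p.clamp_min(1e-8))).sum()   # finite


def find_suspect_images(batch, std_threshold=1e-3):
    """Return indices of images that are non-finite or almost constant.

    Hypothetical helper: `batch` is assumed to have shape (N, C, H, W).
    """
    flat = batch.flatten(start_dim=1)
    non_finite = ~torch.isfinite(flat).all(dim=1)
    nearly_constant = flat.std(dim=1) < std_threshold
    return torch.nonzero(non_finite | nearly_constant).flatten().tolist()
```

Running `find_suspect_images` over a few batches from the data loader is one way to spot blank or nearly uniform images before blaming the network structure.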

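Reference 2: Try adding torch.autograd.set_detect_anomaly(True) at the very top of the training script, train.py, so that the error comes with a more detailed explanation. In my case one module was at fault: with that module removed the code ran normally, and with it added the problem came back. The message means that some value was modified in place; it even wishes you good luck at the end, haha. I tried every fix I found online, including setting inplace=False, and still got:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 480, 14, 14]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!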
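To illustrate what anomaly detection adds, here is a small sketch on a toy computation (not the model from this post). The forward value is finite, but the gradient of sqrt at 0 is 0/0, so nan only appears during the backward pass; with anomaly detection enabled, autograd raises an error that names the backward function responsible. The function name `backward_with_anomaly_detection` is made up for this example, and `set_detect_anomaly` is used as a context manager here so the setting does not leak into other code.

```python
import torch


def backward_with_anomaly_detection():
    """Run a tiny backward pass that yields nan and return the error text."""
    x = torch.zeros(1, requires_grad=True)
    with torch.autograd.set_detect_anomaly(True):
        # Forward result is 0.0 (finite); the nan is created in backward,
        # where the gradient of sqrt at 0 evaluates to 0 / 0.
        y = (torch.sqrt(x) * 0.0).sum()
        try:
            y.backward()
        except RuntimeError as err:
            return str(err)
    return None


message = backward_with_anomaly_detection()
print(message)
```

The printed message has the same shape as Errors 2 and 3 above ("Function '...' returned nan values in its 0th output"), only with a different function name. Anomaly detection slows training noticeably, so it is usually switched off again once the faulty operation has been found.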

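After adding torch.autograd.set_detect_anomaly(True), the more detailed error output is shown below. You can use the reported location to decide what to delete or modify. The traceback shows where my error occurred; after removing that part, the code ran normally.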

C:\Users\dyh\.conda\envs\dyh_torch2\python.exe M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py
5232 images were found in the dataset.
4187 images for training.
1045 images for validation.
Using 12 dataloader workers every process
0%| | 0/262 [00:00<?, ?it/s]loss= tensor(0.6223, device='cuda:0', grad_fn=)
C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\autograd\__init__.py:173: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
  File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 127, in <module>
    main(opt)
  File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 84, in main
    mean_loss = train_one_epoch(model=model,
  File "M:\第三个分类\第三个分类\Test8_densenet\utils.py", line 129, in train_one_epoch
    pred = model(images.to(device))
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "M:\第三个分类\第三个分类\Test8_densenet\model_three_path.py", line 308, in forward
    features_sk3 = self.sk3(left2)
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "M:\第三个分类\第三个分类\Test8_densenet\sk_model.py", line 41, in forward
    fea = conv(x).unsqueeze(dim=1)
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\functional.py", line 1457, in relu
    result = torch.relu(input)
 (Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
0%| | 0/262 [00:54<?, ?it/s]
Traceback (most recent call last):
  File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 127, in <module>
    main(opt)
  File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 84, in main
    mean_loss = train_one_epoch(model=model,
  File "M:\第三个分类\第三个分类\Test8_densenet\utils.py", line 133, in train_one_epoch
    loss.backward()
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 480, 14, 14]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Process finished with exit code 1
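
The in-place error in the traceback above can be reproduced on a toy tensor. In that traceback the ReLU itself ran out of place (`result = torch.relu(input)`), which suggests that some later operation overwrote its output; the sketch below assumes that kind of situation, with `out += 1.0` standing in for whatever the offending module actually did. The function `run` and its `in_place` flag are hypothetical names for this example only.

```python
import torch


def run(in_place):
    """Return the autograd error text, or None if backward succeeds."""
    x = torch.randn(4, requires_grad=True)
    out = torch.relu(x)     # ReLU keeps its output for the backward pass
    if in_place:
        out += 1.0          # overwrites the saved output (version 0 -> 1)
    else:
        out = out + 1.0     # out-of-place: the saved output stays untouched
    try:
        out.sum().backward()
    except RuntimeError as err:
        return str(err)
    return None


failing = run(in_place=True)    # "... modified by an inplace operation ..."
working = run(in_place=False)   # None
```

When hunting for the real culprit, it may help to look for `+=`, `*=`, indexed assignment, or methods ending in an underscore (such as `add_`) applied to an activation output in the module named by the forward traceback, and to replace them with their out-of-place forms.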
