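Notes on training Faster R-CNN (Linux, PyTorch) on a custom dataset

Preparation

Downloading the source code

Setting up the environment

Building a VOC-format dataset

Layout of the data directory

Training

Compiling the CUDA extensions

Pretrained model

Editing pascal_voc.py

Running training

Problems encountered

Main references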

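Preparation

Downloading the source code

Faster R-CNN source for PyTorch 0.4.0: GitHub - jwyang/faster-rcnn.pytorch: A faster pytorch implementation of faster r-cnn
Faster R-CNN source for PyTorch 1.0.0: GitHub - jwyang/faster-rcnn.pytorch at pytorch-1.0

Setting up the environment

In the directory that contains requirements.txt, install the required libraries with: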

pip install -r requirements.txt

After that it is best to downgrade scipy, for example to 1.2.1; otherwise an error may show up later:

pip uninstall scipy
pip install scipy==1.2.1

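Building a VOC-format dataset

To build your own VOC-format dataset, see: Converting the Open Images v4 dataset to VOC format (blog of ZZZZ_Y_, CSDN)

Create a symbolic link that points at the dataset, so the data does not have to be copied into the path the project expects and take up extra disk space: Creating symbolic links on Windows and Linux (blog of ZZZZ_Y_, CSDN)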
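The linking step can also be done from Python. This is a minimal sketch, not part of the original project: link_dataset is a hypothetical helper, and the data/VOCdevkit2007 target is assumed from the directory layout described in this post. The demo runs on throwaway directories so it is safe to try.

```python
import os
import tempfile
from pathlib import Path


def link_dataset(dataset_dir, link_path):
    """Create a symbolic link at link_path that points at dataset_dir."""
    dataset_dir = Path(dataset_dir).resolve()
    link_path = Path(link_path)
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if link_path.is_symlink():
        link_path.unlink()  # replace a stale link instead of failing
    os.symlink(dataset_dir, link_path, target_is_directory=True)
    return link_path


# Demo with throwaway directories; in practice dataset_dir would be the real
# dataset and link_path something like data/VOCdevkit2007 (assumed layout).
demo_root = Path(tempfile.mkdtemp())
real_data = demo_root / "storage" / "VOCdevkit2007"
real_data.mkdir(parents=True)
link = link_dataset(real_data, demo_root / "project" / "data" / "VOCdevkit2007")
print(link, "->", os.readlink(link))
```

Only the link lives inside the project; the images stay where they are.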

Layout of the data directory

data
├─VOCdevkit2007
│  └─VOC2007
│      ├─Annotations
│      ├─ImageSets
│      │  ├─Layout
│      │  ├─Main
│      │  └─Segmentation
│      ├─JPEGImages
│      ├─SegmentationClass
│      └─SegmentationObject
└─pretrained_model

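pretrained_model holds the pretrained model,

Annotations holds the xml label files,

JPEGImages holds the jpg image files,

and the Main folder under ImageSets holds the txt files for the training, validation and test sets; each line in them is an image ID (the file name without its extension).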
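The split files in ImageSets/Main can be generated with a short script. The following is only a sketch under my own assumptions: write_split_files is a hypothetical helper, the 80/20 ratio and the seed are arbitrary, and it writes just trainval.txt and test.txt. The demo builds a temporary VOC2007 skeleton instead of touching real data.

```python
import random
import tempfile
from pathlib import Path


def write_split_files(annotations_dir, main_dir, trainval_ratio=0.8, seed=0):
    """Write trainval.txt and test.txt listing image IDs (file stems)."""
    ids = sorted(p.stem for p in Path(annotations_dir).glob("*.xml"))
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_trainval = int(len(ids) * trainval_ratio)
    splits = {"trainval": sorted(ids[:n_trainval]), "test": sorted(ids[n_trainval:])}
    main_dir = Path(main_dir)
    main_dir.mkdir(parents=True, exist_ok=True)
    for name, split_ids in splits.items():
        (main_dir / f"{name}.txt").write_text("".join(f"{i}\n" for i in split_ids))
    return splits


# Demo on a temporary VOC2007 skeleton with ten empty annotation files.
voc_root = Path(tempfile.mkdtemp()) / "VOC2007"
(voc_root / "Annotations").mkdir(parents=True)
for n in range(10):
    (voc_root / "Annotations" / f"{n:06d}.xml").touch()
splits = write_split_files(voc_root / "Annotations", voc_root / "ImageSets" / "Main")
print({name: len(ids) for name, ids in splits.items()})
```

Taking the IDs from the Annotations folder means every listed image is guaranteed to have a label file.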

Training

Compiling the CUDA extensions

cd lib
python setup.py build develop

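Pretrained model

The pretrained model has to be placed in the pretrained_model folder.

Editing pascal_voc.py

Change the detection classes in faster-rcnn.pytorch/lib/datasets/pascal_voc.py to your own.

Class names must be lowercase!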
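A sketch of why the case matters, under the assumption (worth checking in your copy of pascal_voc.py) that the loader lowercases and strips each name read from the xml before looking it up in the class table. The class names below are hypothetical placeholders, and label_index is my own illustration rather than a function from the project.

```python
# Hypothetical class names for illustration; replace them with your own labels.
# '__background__' has to stay first so that it gets index 0.
CLASSES = ('__background__', 'cat', 'dog', 'traffic light')

class_to_ind = dict(zip(CLASSES, range(len(CLASSES))))


def label_index(xml_name):
    """Map a class name read from an xml annotation to its integer label."""
    # The lookup key is lowercased, so an uppercase entry in CLASSES
    # could never be matched and would raise KeyError.
    return class_to_ind[xml_name.lower().strip()]


print(label_index('Traffic Light'))
print(len(CLASSES))
```

With this kind of lookup, the xml files may use any capitalization, but the class tuple itself has to be all lowercase.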

Running training

CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset pascal_voc --net res101 --bs 4 --nw 4 --lr 0.005 --lr_decay_step 5 --cuda --epochs 50

Testing

Test with the following command:

CUDA_VISIBLE_DEVICES=9 python test_net.py --dataset pascal_voc --net res101 --checksession 1 --checkepoch 1 --checkpoint 6755 --cuda
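The three --check* flags together pick one saved checkpoint file. Here is a minimal sketch for checking that the flags match a file on disk. It assumes the naming scheme faster_rcnn_<session>_<epoch>_<checkpoint>.pth under <load_dir>/<net>/<dataset>/, which is my reading of the upstream scripts and should be verified against your own models folder; checkpoint_path is a hypothetical helper.

```python
import os


def checkpoint_path(load_dir, net, dataset, session, epoch, checkpoint):
    """Build the checkpoint file path implied by the command-line flags.

    Assumption: files are saved as
    <load_dir>/<net>/<dataset>/faster_rcnn_<session>_<epoch>_<checkpoint>.pth
    """
    name = 'faster_rcnn_{}_{}_{}.pth'.format(session, epoch, checkpoint)
    return os.path.join(load_dir, net, dataset, name)


path = checkpoint_path('models', 'res101', 'pascal_voc', 1, 1, 6755)
print(path)
print(os.path.exists(path))  # False means the flags do not match a saved file
```

If the file does not exist, list the models folder and copy the three numbers from an actual file name.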

demo.py

Run demo.py with the following command:

CUDA_VISIBLE_DEVICES=0 python demo.py --net res101 --checksession 1 --checkepoch 4 --checkpoint 13512 --cuda --load_dir models

This fails with RuntimeError: Error(s) in loading state_dict for resnet:

RuntimeError: Error(s) in loading state_dict for resnet:
    size mismatch for RCNN_cls_score.weight: copying a param with shape torch.Size([16, 2048]) from checkpoint, the shape in current model is torch.Size([21, 2048]).
    size mismatch for RCNN_cls_score.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([21]).
    size mismatch for RCNN_bbox_pred.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([84, 2048]).
    size mismatch for RCNN_bbox_pred.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([84]).

The checkpoint was trained with 16 classes (background included), while demo.py still builds the model for the 21 default PASCAL VOC classes, so the error persists until the two agree.

The fix is to change the detection classes in demo.py so that they match the ones used for training.
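The numbers in the size-mismatch message follow directly from the class count. This is a small sketch to make the arithmetic explicit; head_shapes is a hypothetical helper, and the 4 x num_classes rule is inferred from the shapes in the error (64 = 4 x 16, 84 = 4 x 21), not read from the model code.

```python
def head_shapes(num_classes, feature_dim=2048):
    """Expected weight shapes of the two detection heads.

    num_classes counts '__background__' as a class. feature_dim=2048 matches
    the ResNet-101 shapes reported in the error message.
    """
    return {
        'RCNN_cls_score.weight': (num_classes, feature_dim),
        'RCNN_bbox_pred.weight': (4 * num_classes, feature_dim),
    }


print(head_shapes(16))  # checkpoint: 15 object classes + background
print(head_shapes(21))  # default PASCAL VOC: 20 object classes + background
```

So a checkpoint trained on 15 object classes can only be loaded into a model built with the same 16-entry class list.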

The server was short on resources because everyone was using it, so I could not finish testing (test + demo) that day.

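Problems encountered

1. Compilation fails with invalid command 'develop'

usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help
error: invalid command 'develop'

Reference: python setup.py develop · Issue #92 · django-extensions/django-extensions · GitHub

The develop command is provided by setuptools, not by distutils. In setup.py, replace

from distutils.core import setup

with

from setuptools import setup

Then activate the virtual environment in the server terminal, cd into the right directory, and run the following commands in order: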

conda activate your_env_name
cd lib
python setup.py build develop

Compilation now starts!

2. Along the way you may hit can't import imread; downgrading scipy fixes it, for example to 1.2.1

pip uninstall scipy
pip install scipy==1.2.1

3. libstdc++.so.6: version `GLIBCXX_3.4.30' not found

ImportError: /home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/cv2/cv2.abi3.so)

List the GLIBCXX versions supported by the system's libstdc++.so.6:

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX

The highest version listed there is 3.4.30.

The GLIBCXX versions supported by the libstdc++.so.6 in the anaconda environment:

strings /home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6 | grep GLIBCXX

In the anaconda environment the highest version is 3.4.29, but 3.4.30 is required.
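The same comparison can be scripted instead of reading the strings output by eye. This is a rough sketch of my own, not something the post originally used: glibcxx_versions is a hypothetical helper that scans the raw bytes of a library for GLIBCXX_x.y.z tags. The demo uses a fake file; point it at a real libstdc++.so.6 to compare the system and anaconda copies.

```python
import re
import tempfile


def glibcxx_versions(lib_path):
    """Return the sorted GLIBCXX version tuples found in a shared library."""
    with open(lib_path, 'rb') as f:
        data = f.read()
    found = set(re.findall(rb'GLIBCXX_(\d+(?:\.\d+)+)', data))
    # Compare as integer tuples so that 3.4.29 sorts after 3.4.9.
    return sorted(tuple(int(part) for part in v.decode().split('.')) for v in found)


# Demo on a fake file; point lib_path at a real libstdc++.so.6 in practice.
with tempfile.NamedTemporaryFile(suffix='.so', delete=False) as tmp:
    tmp.write(b'\x00GLIBCXX_3.4.9\x00GLIBCXX_3.4.29\x00GLIBCXX_3.4\x00junk')
versions = glibcxx_versions(tmp.name)
print('highest:', '.'.join(map(str, versions[-1])))
```

Sorting numerically matters here: a plain string sort would rank 3.4.9 above 3.4.29.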

Reference: Ubuntu anaconda error version `GLIBCXX_3.4.30' not found - Death_Knight (cnblogs.com)

Look at the libstdc++.so.6 files in the anaconda environment (run these from the environment's lib directory, which the path in the error above resolves to: /home/zy/anaconda3/envs/pytorch/lib):

ls libstdc++.so
ls -al libstdc++.so
ls -al libstdc++.so.6
ls -al libstdc++.so.6.0.29

Look at the libstdc++.so.6 file in the system library path:

ls -al /usr/lib/x86_64-linux-gnu/libstdc++.so.6

At this point libstdc++.so and libstdc++.so.6 in the anaconda environment both link to libstdc++.so.6.0.29.

Repoint both links in the anaconda environment at the system copy:

rm libstdc++.so
rm libstdc++.so.6
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30 libstdc++.so
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30 libstdc++.so.6

Checking again shows that the links now point at version 6.0.30.

There is also the following alternative; the method above worked for me, so I did not try it:

(Solved) Import error Version `GLIBCXX_3.4.22' not found (blog of 可可与鱼, CSDN)

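4. Running again, the previous error is gone, but a new one appears

ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.

Reference fix: [Solved] ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead. (blog of ShuqiaoS, CSDN)

It turned out I was using the 0.4.0 code, which is why so many problems came up. I now have to switch to the 1.0.1 code... quite frustrating to discover, after all that effort, that the version was 0.4.0 all along.

For 0.4.0, see Faster RCNN environment setup (blog of 吾人为学, CSDN); as far as I can tell it still uses 0.4.0.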

5. Running fails with ImportError: cannot import name '_mask'

Reference: ImportError: cannot import name '_mask' · Issue #410 · jwyang/faster-rcnn.pytorch · GitHub

Activate the virtual environment, then install the COCO API inside the data directory with the following commands:

cd data
git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI
make

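6. New problem: TypeError: load() missing 1 required positional argument: 'Loader'

Cause: newer PyYAML releases (6.0 here) no longer accept the old yaml.load() call without a Loader argument.

Option 1: replace the call with one of these three forms (the CLoader variant needs PyYAML built with the LibYAML bindings):

yaml.load(file, Loader=yaml.FullLoader)
yaml.safe_load(file)
yaml.load(file, Loader=yaml.CLoader)

Option 2: downgrade PyYAML from 6.0 to 5.4.1 (this is what I did; it also felt like the most convenient fix)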

pip uninstall pyyaml
pip install pyyaml==5.4.1

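No more errors at last, what a relief.

7. Oh no, another error

ValueError: Caught ValueError in DataLoader worker process 1.

ValueError: operands could not be broadcast together with shapes (683,1024,4) (1,1,3) (683,1024,4)

I had been training on several GPUs; after switching to a single GPU the error went away, but why were rpn_cls, rpn_box and the other losses all nan?

In the end the same error came back. I also noticed that my dataset files were all named with 5 digits, whereas they should use 6 digits.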

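8. After that change: assert (boxes[:, 2] >= boxes[:, 0]).all() AssertionError

Reference: (Linux) Faster RCNN-pytorch1.0 object detection 2: training on your own dataset, GPU, PyCharm, training notes (blog of chao_xy, CSDN)

In lib/datasets/pascal_voc.py, function _load_pascal_annotation():

Remove every "- 1" after xmin, ymin, xmax and ymax. The subtraction shifts 1-based VOC coordinates to 0-based; when an annotation already contains a coordinate of 0, the result goes negative and the box can end up invalid.

In lib/datasets/imdb.py, function append_flipped_images():

To repair reversed boxes, add the following right after the line boxes[:, 2] = widths[i] - oldx1 - 1 (indent it to match the surrounding loop):

aboxes = boxes.copy()  # copy, so the swap below reads the original x1
for b in range(len(boxes)):
    if boxes[b][2] < boxes[b][0]:
        boxes[b][0] = boxes[b][2]
        boxes[b][2] = aboxes[b][0]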
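The flip-and-repair step can be tried in isolation with NumPy. This is a standalone sketch of the idea, not the project's code: flip_boxes is a hypothetical function, and it works on a signed integer copy so that the subtraction cannot wrap around.

```python
import numpy as np


def flip_boxes(boxes, width):
    """Mirror [x1, y1, x2, y2] boxes horizontally inside an image of `width`."""
    flipped = boxes.astype(np.int64)  # astype copies; signed type avoids wrap-around
    old_x1 = flipped[:, 0].copy()
    old_x2 = flipped[:, 2].copy()
    flipped[:, 0] = width - old_x2 - 1
    flipped[:, 2] = width - old_x1 - 1
    # Repair any box whose corners ended up reversed by swapping x1 and x2.
    bad = flipped[:, 2] < flipped[:, 0]
    flipped[bad, 0], flipped[bad, 2] = flipped[bad, 2], flipped[bad, 0]
    assert (flipped[:, 2] >= flipped[:, 0]).all()
    return flipped


boxes = np.array([[10, 5, 50, 40],    # well-formed box
                  [70, 5, 30, 40]])   # x1 > x2, as a broken annotation might have
print(flip_boxes(boxes, width=100))
```

The swap only hides the symptom; boxes that trigger it usually point at annotations worth fixing at the source.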

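9. Then this error appeared

    roidb[i]['img_id'] = imdb.image_id_at(i)
IndexError: list index out of range

Reference: roidb[i]['image'] = imdb.image_path_at(i) · Issue #79 · rbgirshick/fast-rcnn · GitHub

This is probably caused by a stale cache file. Deleting the cache file for the training data under fast-rcnn-master/data/cache/ (here, the project's own data/cache/ folder) and retrying solved it!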
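Clearing the cache can be scripted so that it is easy to repeat after every dataset change. This is a minimal sketch, assuming the cached roidb files are the *.pkl files directly inside the cache folder; clear_roidb_cache is a hypothetical helper and the file name in the demo is only an example.

```python
import tempfile
from pathlib import Path


def clear_roidb_cache(cache_dir):
    """Delete every *.pkl file directly inside cache_dir; return removed names."""
    removed = []
    for pkl in sorted(Path(cache_dir).glob('*.pkl')):
        pkl.unlink()
        removed.append(pkl.name)
    return removed


# Demo on a temporary directory; the real cache is assumed to be data/cache.
cache = Path(tempfile.mkdtemp())
(cache / 'voc_2007_trainval_gt_roidb.pkl').touch()  # example cache file name
(cache / 'notes.txt').touch()
removed = clear_roidb_cache(cache)
print(removed)
```

Other files in the folder are left alone, and the cache is simply rebuilt on the next training run.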

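10. The error ValueError: operands could not be broadcast together with shapes (1024,717,4) (1,1,3) (1024,717,4) came back, so this problem had not actually been solved

Probably some of the images have 4 channels (RGB plus alpha), which cannot be combined with the 3-channel (1,1,3) operand; keeping only the three RGB channels is enough.

Reference: ValueError: operands could not be broadcast together with shapes (441,786,4) (1,1,3) (441,786,4) · Issue #599 · jwyang/faster-rcnn.pytorch · GitHub

Insert the following before line 39 of lib/model/utils/blob.py:

if im.shape[2] == 4:
    im = im[:, :, :3]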
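The failure and the fix are easy to reproduce with NumPy alone. This is a small sketch under my own assumptions: the mean values are placeholders, and drop_alpha is a hypothetical helper that does the same slicing as the two lines above.

```python
import numpy as np

# Per-channel means with shape (1, 1, 3), as in the error message; the values
# here are placeholders for illustration.
pixel_means = np.array([[[100.0, 110.0, 120.0]]])


def drop_alpha(im):
    """Return the image without its alpha channel if it has four channels."""
    if im.ndim == 3 and im.shape[2] == 4:
        return im[:, :, :3]
    return im


rgba = np.zeros((683, 1024, 4), dtype=np.float32)
try:
    rgba - pixel_means  # 4 channels against 3: cannot broadcast
except ValueError as err:
    print('without the fix:', err)

print((drop_alpha(rgba) - pixel_means).shape)
```

Converting the offending images to RGB once, before training, would avoid the per-image check altogether.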

Training now runs normally:

CUDA_VISIBLE_DEVICES=3,4 python trainval_net.py --dataset pascal_voc --net res101 --bs 4 --nw 4 --lr 0.005 --lr_decay_step 5 --cuda --epochs 50

11. Yet another error

RuntimeError: Caught RuntimeError in DataLoader worker process 3.

RuntimeError: The expanded size of the tensor (1200) must match the existing size (0) at non-singleton dimension 1. Target sizes: [600, 1200, 3]. Tensor sizes: [600, 0, 3]

1439
[session 1][epoch  1][iter 3800/6756] loss: 0.7875, lr: 5.00e-03
            fg/bg=(10/1014), time cost: 53.474536
            rpn_cls: 0.1268, rpn_box: 0.0292, rcnn_cls: 0.0855, rcnn_box 0.0142
Traceback (most recent call last):
  File "trainval_net.py", line 310, in <module>
    data = next(data_iter)
  File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zy/faster-rcnn.pytorch-pytorch-1.0/lib/roi_data_layer/roibatchLoader.py", line 177, in __getitem__
    padding_data[:, :data_width, :] = data[0]
RuntimeError: The expanded size of the tensor (1200) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [600, 1200, 3].  Tensor sizes: [600, 0, 3]

Reference: "RuntimeError: The expanded size of the tensor (1200) must match the existing size (1199) at non-singleton dimension 1. Target sizes: [600, 1200, 3]. Tensor sizes: [600, 1199, 3] " · Issue #629 · jwyang/faster-rcnn.pytorch · GitHub

Try deleting the pkl files under fast-rcnn-master/data/cache/.

Errors during testing:

1. AttributeError: 'NoneType' object has no attribute 'text'

  File "/home/zy/faster-rcnn.pytorch-pytorch-1.0/lib/datasets/voc_eval.py", line 22, in parse_rec
    obj_struct['pose'] = obj.find('pose').text
AttributeError: 'NoneType' object has no attribute 'text'

Fix: comment out this line in the source file: obj_struct['pose'] = obj.find('pose').text

It turned out that my xml files contain none of pose, truncated and difficult, so I commented out all three.

Reference: AttributeError: 'NoneType' object has no attribute 'text' · Issue #580 · rbgirshick/py-faster-rcnn · GitHub

2. KeyError: 'difficult'

    difficult = np.array([x['difficult'] for x in R]).astype(np.bool)
KeyError: 'difficult'

Fix: edit faster-rcnn.pytorch-pytorch-1.0/lib/datasets/voc_eval.py and comment out the code that uses difficult.

References: Training an object detection model with py-faster-rcnn (blog of liuyan20062010, CSDN)

and AttributeError: 'NoneType' object has no attribute 'text' · Issue #580 · rbgirshick/py-faster-rcnn · GitHub

The places I changed:

def parse_rec(filename):
    """ Parse a PASCAL VOC xml file """
    tree = ET.parse(filename)
    objects = []
    for obj in tree.findall('object'):
        obj_struct = {}
        obj_struct['name'] = obj.find('name').text
        # obj_struct['pose'] = obj.find('pose').text                  # commented out
        # obj_struct['truncated'] = int(obj.find('truncated').text)   # commented out
        # obj_struct['difficult'] = int(obj.find('difficult').text)   # commented out
        bbox = obj.find('bndbox')
        obj_struct['bbox'] = [int(bbox.find('xmin').text),
                              int(bbox.find('ymin').text),
                              int(bbox.find('xmax').text),
                              int(bbox.find('ymax').text)]
        objects.append(obj_struct)
    return objects

    # extract gt objects for this class
    class_recs = {}
    npos = 0
    for imagename in imagenames:
        R = [obj for obj in recs[imagename] if obj['name'] == classname]
        bbox = np.array([x['bbox'] for x in R])
        # difficult = np.array([x['difficult'] for x in R]).astype(np.bool)   # commented out
        difficult = 0                                                          # added
        det = [False] * len(R)
        # npos = npos + sum(~difficult)                                        # commented out
        # Still count the ground-truth boxes; with npos left at 0 the recall
        # would be computed against zero positives.
        npos = npos + len(R)
        class_recs[imagename] = {'bbox': bbox,
                                 'difficult': difficult,
                                 'det': det}

            if ovmax > ovthresh:
                # if not R['difficult'][jmax]:    # commented out
                # Keep the tp/fp bookkeeping; only the difficult check goes,
                # otherwise every detection is counted as a false positive.
                if not R['det'][jmax]:
                    tp[d] = 1.
                    R['det'][jmax] = 1
                else:
                    fp[d] = 1.
            else:
                fp[d] = 1.
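Instead of commenting the fields out, the parser can fall back to defaults when a tag is missing, which leaves the rest of voc_eval.py untouched. This is a sketch of that alternative, not what I actually ran: parse_rec_with_defaults is a hypothetical replacement, and the defaults ('Unspecified', 0, 0) are my own choice.

```python
import io
import xml.etree.ElementTree as ET


def parse_rec_with_defaults(source):
    """Parse a PASCAL VOC xml file, tolerating missing optional tags."""
    tree = ET.parse(source)
    objects = []
    for obj in tree.findall('object'):
        bbox = obj.find('bndbox')
        objects.append({
            'name': obj.find('name').text,
            # findtext returns the default when the tag is absent.
            'pose': obj.findtext('pose', default='Unspecified'),
            'truncated': int(obj.findtext('truncated', default='0')),
            'difficult': int(obj.findtext('difficult', default='0')),
            'bbox': [int(bbox.find(tag).text)
                     for tag in ('xmin', 'ymin', 'xmax', 'ymax')],
        })
    return objects


# Minimal annotation without pose, truncated or difficult tags.
sample = b"""<annotation><object><name>cat</name>
<bndbox><xmin>1</xmin><ymin>2</ymin><xmax>30</xmax><ymax>40</ymax></bndbox>
</object></annotation>"""
print(parse_rec_with_defaults(io.BytesIO(sample)))
```

Because every object then carries a difficult value of 0, the original difficult-based code paths keep working as written.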

Main references

Faster-RCNN.pytorch setup and usage in detail (for PyTorch 1.0 and later) (blog of Yale曼陀罗, CSDN)

Training your own dataset with faster-rcnn.pytorch (full version) - Wind·Chaser (cnblogs.com)

Faster RCNN (Pytorch) setup notes and troubleshooting (blog of cc__cc__, CSDN)

(Linux) Faster RCNN-pytorch1.0 object detection 2: training on your own dataset, GPU, PyCharm, training notes (blog of chao_xy, CSDN)

Faster-RCNN.pytorch setup and usage, personal notes (for PyTorch 1.0 and later) (blog of 璐璐不是小胖纸er, CSDN)