TorchVision Object Detection Finetuning Tutorial

1. Finetuning a pre-trained Mask R-CNN model

We will show how to use the new features in torchvision to train an instance segmentation model on a custom dataset.

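2. Defining the dataset
We use the Penn-Fudan Database for Pedestrian Detection and Segmentation. The dataset should inherit from the standard torch.utils.data.Dataset class and implement __len__ and __getitem__.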

import os
import numpy as np
import torch
from PIL import Image


class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        # convert the PIL Image into a numpy array
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
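The mask-to-box step inside __getitem__ can be tried in isolation. The following is a minimal sketch on a hypothetical 5x6 toy mask (not taken from Penn-Fudan), using only NumPy; mask_to_boxes is an illustrative helper name, not part of torchvision.

```python
import numpy as np


def mask_to_boxes(mask):
    """Split a color-encoded mask into binary masks and [xmin, ymin, xmax, ymax] boxes."""
    obj_ids = np.unique(mask)
    obj_ids = obj_ids[1:]  # drop id 0, assumed to be the background
    # broadcasting compares the (H, W) mask against each id -> (N, H, W) booleans
    masks = mask == obj_ids[:, None, None]
    boxes = []
    for m in masks:
        ys, xs = np.where(m)
        boxes.append([int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())])
    return masks, boxes


# hypothetical toy mask: two instances (ids 1 and 2) on a background of zeros
toy_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
], dtype=np.uint8)

toy_masks, toy_boxes = mask_to_boxes(toy_mask)
print(toy_masks.shape)  # one binary mask per instance
print(toy_boxes)
```

As in the dataset class, the box corners are the extreme pixel indices of each instance, so every box tightly encloses its mask.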


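__getitem__ should return an image and a target dict with the following fields:
image: a PIL Image of size (H, W)
boxes (FloatTensor[N, 4]): the coordinates of the N bounding boxes in [x0, y0, x1, y1] format, ranging from 0 to W and from 0 to H
labels (Int64Tensor[N]): the label for each bounding box. 0 is reserved for the background class
image_id (Int64Tensor[1]): an image identifier. It should be unique across all images in the dataset, and it is used during evaluation
area (Tensor[N]): the area of each bounding box. During evaluation with the COCO metric, it is used to separate the metric scores for small, medium and large boxes.
iscrowd (UInt8Tensor[N]): instances with iscrowd = True are ignored during evaluation.
(optional) masks (UInt8Tensor[N, H, W]): the segmentation mask for each object
(optional) keypoints (FloatTensor[N, K, 3]): for each of the N objects, the K keypoints in [x, y, visibility] format that define the object. visibility = 0 means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint depends on the data representation, and you should probably adapt references/detection/transforms.py for your new keypoint representation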
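To illustrate how the area field feeds into the COCO metric, here is a minimal sketch that computes box areas and assigns a size bucket. The 32² and 96² pixel thresholds are the ones COCO evaluation uses for small, medium and large objects; the helper names are hypothetical and the handling of values exactly on a boundary is simplified.

```python
def box_area(box):
    """Area of a box given as [xmin, ymin, xmax, ymax]."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) * (ymax - ymin)


def coco_size_bucket(area):
    """Assign the COCO-style size bucket for a box area in pixels."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# hypothetical boxes, not taken from the dataset
example_boxes = [[10, 10, 30, 40], [0, 0, 50, 80], [5, 5, 205, 305]]
areas = [box_area(b) for b in example_boxes]
buckets = [coco_size_bucket(a) for a in areas]
print(list(zip(areas, buckets)))
```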

3. Defining the model

In this tutorial, we use Mask R-CNN.

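3.1 Finetuning from a pre-trained model

There are two common situations in which you might want to modify a torchvision detection model. The first is when we want to start from a pre-trained model and just finetune the last layer.
Suppose you want to start from a model pre-trained on COCO and finetune it for your particular classes. Here is one possible way of doing it: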

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# replace the classifier with a new one, that has
# num_classes which is user-defined
num_classes = 2  # 1 class (person) + background
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

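In our case, we want to finetune from a pre-trained model because our dataset is very small, so we will follow the first approach.

Here we also want to compute the instance segmentation masks, so we will use Mask R-CNN: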

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor


def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)

    return model


3.2 Supplement: replacing the backbone

The other is when we want to replace the backbone of the model with a different one (for faster predictions, for example).

import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# load a pre-trained model for classification and return
# only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN needs to know the number of
# output channels in a backbone. For mobilenet_v2, it's 1280
# so we need to add it here
backbone.out_channels = 1280

# let's make the RPN generate 5 x 3 anchors per spatial
# location, with 5 different sizes and 3 different aspect
# ratios. We have a Tuple[Tuple[int]] because each feature
# map could potentially have different sizes and
# aspect ratios
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# let's define what are the feature maps that we will
# use to perform the region of interest cropping, as well as
# the size of the crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to
# be [0]. More generally, the backbone should return an
# OrderedDict[Tensor], and in featmap_names you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)

# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
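To see where "5 x 3 anchors per spatial location" comes from, the sketch below enumerates the anchor shapes in plain Python. It assumes that an aspect ratio means height / width and that each anchor keeps an area of size², which is how I understand torchvision's AnchorGenerator; anchor_shapes is a hypothetical helper, not a library function, and it skips the rounding the library applies.

```python
import math


def anchor_shapes(sizes, aspect_ratios):
    """Return (width, height) for every size / aspect-ratio combination."""
    shapes = []
    for ratio in aspect_ratios:
        h_ratio = math.sqrt(ratio)   # assumed convention: ratio = height / width
        w_ratio = 1.0 / h_ratio
        for size in sizes:
            shapes.append((size * w_ratio, size * h_ratio))
    return shapes


shapes = anchor_shapes((32, 64, 128, 256, 512), (0.5, 1.0, 2.0))
print(len(shapes))  # number of anchors per spatial location
for width, height in shapes:
    print(f"{width:7.1f} x {height:7.1f}")
```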

4. Training and evaluation functions

# Download TorchVision repo to use some files from
# references/detection
git clone
cd vision
git checkout v0.3.0
cp references/detection/ ../
cp references/detection/ ../
cp references/detection/ ../
cp references/detection/ ../
cp references/detection/ ../

In references/detection/, we provide a number of helper functions to simplify training and evaluating detection models. Here, we will use references/detection/, references/detection/utils.py and references/detection/. Just copy them to your folder and use them here.

Let’s write some helper functions for data augmentation / transformation, which leverage the functions in references/detection that we have just copied:

from engine import train_one_epoch, evaluate
import utils
import transforms as T


def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
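When an image is flipped horizontally, the ground truth has to be flipped with it, which is why the copied transforms operate on (image, target) pairs rather than on the image alone. A minimal sketch of the box part, in plain Python with a hypothetical helper hflip_boxes (the masks would be mirrored along the width axis in the same step):

```python
def hflip_boxes(boxes, image_width):
    """Mirror [xmin, ymin, xmax, ymax] boxes around the vertical image axis."""
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        # the old right edge becomes the new left edge, and vice versa
        flipped.append([image_width - xmax, ymin, image_width - xmin, ymax])
    return flipped


original = [[10, 20, 50, 80]]
flipped = hflip_boxes(original, image_width=100)
print(flipped)
```

Flipping twice returns the original boxes, which is a quick sanity check for any custom augmentation of this kind.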

Note that we do not need to add mean/std normalization or image rescaling to the data transforms, as those are handled internally by the Mask R-CNN model.
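The internal rescaling can be approximated as follows. This is a sketch under assumptions: min_size=800 and max_size=1333 are, to my knowledge, the defaults of the torchvision detection models, but check your installed version; resized_shape is a hypothetical helper and the library's exact rounding may differ.

```python
def resized_shape(height, width, min_size=800, max_size=1333):
    """Scale so the short side reaches min_size unless the long side would exceed max_size."""
    scale = min(min_size / min(height, width), max_size / max(height, width))
    return round(height * scale), round(width * scale)


# hypothetical input sizes given as (height, width)
print(resized_shape(300, 400))
print(resized_shape(500, 400))
print(resized_shape(1000, 400))  # here the long side is the limiting factor
```

The aspect ratio is preserved in every case, so boxes and masks only need to be scaled by the same factor.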



model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)
# For training
images, targets = next(iter(data_loader))
images = list(image for image in images)
targets = [{k: v for k, v in t.items()} for t in targets]
output = model(images, targets)   # in training mode, returns a dict of losses
# For inference
model.eval()
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
predictions = model(x)           # returns predictions
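Detection images differ in size and each target holds a different number of boxes, so a batch cannot simply be stacked into one tensor. As far as I know, the copied utils.collate_fn transposes the batch instead; the sketch below reimplements that idea with placeholder strings standing in for image tensors.

```python
def collate_fn(batch):
    # turn [(img1, tgt1), (img2, tgt2)] into ((img1, img2), (tgt1, tgt2))
    return tuple(zip(*batch))


# placeholder samples: strings instead of tensors, targets with 1 and 2 objects
batch = [
    ("image_a", {"labels": [1]}),
    ("image_b", {"labels": [1, 1]}),
]
images, targets = collate_fn(batch)
print(images)
print(targets)
```

This is why the training loop receives a tuple of images and a tuple of target dicts rather than a single batched tensor.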

Instantiate the datasets and data loaders:

# use our dataset and defined transformations
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

# split the dataset in train and test set
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

Now instantiate the model and the optimizer:

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has two classes only - background and person
num_classes = 2

# get the model using our helper function
model = get_model_instance_segmentation(num_classes)
# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)
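The effect of StepLR with step_size=3 and gamma=0.1 can be written out by hand. The sketch below uses the closed form lr = base_lr * gamma ** (epoch // step_size) in plain Python; step_lr is a hypothetical helper, and it ignores any warmup the training helper may apply on top of the scheduler.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in effect during a given (0-indexed) epoch."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(0.005, epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

With 10 epochs the rate is divided by 10 three times, after epochs 2, 5 and 8.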

Train the model for 10 epochs:

# let's train it for 10 epochs
num_epochs = 10

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
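In training mode, the detection model returns a dict of individual losses rather than predictions, and, as I understand the reference engine.py, train_one_epoch sums them into one scalar before calling backward. A minimal sketch of that reduction with made-up float values in place of tensors (the key names are the ones Mask R-CNN reports; the numbers are purely illustrative):

```python
# hypothetical loss values for one training iteration
loss_dict = {
    "loss_classifier": 0.25,
    "loss_box_reg": 0.5,
    "loss_mask": 1.0,
    "loss_objectness": 0.125,
    "loss_rpn_box_reg": 0.125,
}

# every head contributes equally to the scalar that is optimized
total_loss = sum(loss for loss in loss_dict.values())
print(total_loss)
```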

Let's look at what the model actually predicts on a test image:

# pick one image from the test set
img, _ = dataset_test[0]
# put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])

Printing the prediction shows that we have a list of dictionaries. Each element of the list corresponds to a different image. As we have a single image, there is a single dictionary in the list. The dictionary contains the predictions for the image we passed. In this case, we can see that it contains boxes, labels, masks and scores as fields.

[{'boxes': tensor([[ 61.7920,  35.8468, 196.2695, 328.1466],
                   [276.3983,  21.7483, 291.1403,  73.4649],
                   [ 79.1629,  42.9354, 201.3314, 207.8434]], device='cuda:0'),
  'labels': tensor([1, 1, 1], device='cuda:0'),
  'masks': tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     ...,
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.]]],

                   [[[0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     ...,
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.]]],

                   [[[0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     ...,
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.],
                     [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0'),
  'scores': tensor([0.9994, 0.8378, 0.0524], device='cuda:0')}]
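The third detection in the output above has a score of only 0.0524, so in practice one usually keeps only confident detections. A minimal sketch, with NumPy arrays standing in for the tensors and a hypothetical filter_by_score helper; the 0.5 threshold is an arbitrary assumption, and the same boolean indexing should work on torch tensors.

```python
import numpy as np


def filter_by_score(prediction, threshold=0.5):
    """Keep only the detections whose score exceeds the threshold."""
    keep = prediction["scores"] > threshold
    return {name: values[keep] for name, values in prediction.items()}


# boxes, labels and scores copied from the printed prediction (masks omitted)
example = {
    "boxes": np.array([[61.7920, 35.8468, 196.2695, 328.1466],
                       [276.3983, 21.7483, 291.1403, 73.4649],
                       [79.1629, 42.9354, 201.3314, 207.8434]]),
    "labels": np.array([1, 1, 1]),
    "scores": np.array([0.9994, 0.8378, 0.0524]),
}

confident = filter_by_score(example)
print(confident["scores"])
print(confident["boxes"].shape)
```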

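To inspect the image, convert the tensor, which is scaled to 0-1 and stored in [C, H, W] layout, back to a PIL image: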



Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())

And let’s now visualize the top predicted segmentation mask. The masks are predicted as [N, 1, H, W], where N is the number of predictions, and are probability maps between 0-1.

Image.fromarray(prediction[0]['masks'][0, 0].mul(255).byte().cpu().numpy())
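Because the masks are probability maps, a hard segmentation requires a threshold. The sketch below binarizes a hypothetical 3x3 probability map with NumPy; binarize_mask and the 0.5 cut-off are assumptions for illustration, not part of the tutorial code.

```python
import numpy as np


def binarize_mask(prob_map, threshold=0.5):
    """Turn a [H, W] probability map into a 0/1 mask."""
    return (prob_map > threshold).astype(np.uint8)


# hypothetical probabilities for a tiny 3x3 region
prob_map = np.array([
    [0.05, 0.40, 0.10],
    [0.60, 0.95, 0.70],
    [0.20, 0.55, 0.30],
])

binary = binarize_mask(prob_map)
print(binary)
print(int(binary.sum()), "pixels assigned to the instance")
```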

