matterport/Mask_RCNN

  • model.py is where the network itself is built
  • utils.py contains the anchor-generation part; the functions involved are mainly the ones below:

RPN part

scales:(32, 64, 128, 256, 512)

ratios:(0.5, 1, 2)

feature_shapes:[[256, 256], [128, 128], [64, 64], [32, 32], [16, 16]]

feature_strides:[4, 8, 16, 32, 64]

anchor_stride:1

generate_pyramid_anchors() calls generate_anchors() in a loop; scales, feature_shapes, and feature_strides are matched up one-to-one as they are passed into generate_anchors().

The input image is 1024x1024, so after 4x downsampling we get (256, 256); likewise 8x gives (128, 128). That is the correspondence between feature_strides and feature_shapes. scales captures the fact that an anchor of the same size on different FPN levels corresponds to a different box size on the input image: on the (256, 256) map it produces 32x32 anchors, while on the (128, 128) map, because the anchor size on the feature map stays the same but the receptive field doubles, it produces 64x64 anchors, and so on.
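As a quick sanity check of these numbers (a minimal sketch, assuming only NumPy), the per-level anchor counts and the total for a 1024x1024 input work out as follows:

import numpy as np

feature_shapes = np.array([[256, 256], [128, 128], [64, 64], [32, 32], [16, 16]])
ratios_per_location = 3  # (0.5, 1, 2)

# anchors per level = height * width * number of ratios (anchor_stride = 1)
per_level = feature_shapes[:, 0] * feature_shapes[:, 1] * ratios_per_location
print(per_level)        # [196608  49152  12288   3072    768]
print(per_level.sum())  # 261888 anchors in total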

generate_anchors(): with generate_anchors(32, [0.5, 1, 2], [256, 256], 4, 1), the anchor on the 256x256 feature map is 8x8 in feature-map units, which maps back to 32x32 on the input image. shifts_y and shifts_x slide a window over the elements of the 256x256 feature map with step anchor_stride; with anchor_stride set to 1 there are 256*256 positions, and with three ratios there are 256*256*3 anchors in total. Multiplying by feature_stride maps the positions back onto the input image. Finally, each box's top-left and bottom-right corners are computed from its center coordinates and its height/width.

def generate_anchors(scales, ratios, shape, feature_stride, anchor_stride):
    """scales: 1D array of anchor sizes in pixels. Example: [32, 64, 128]
    ratios: 1D array of anchor ratios of width/height. Example: [0.5, 1, 2]
    shape: [height, width] spatial shape of the feature map over which
        to generate anchors.
    feature_stride: Stride of the feature map relative to the image in pixels.
    anchor_stride: Stride of anchors on the feature map. For example, if the
        value is 2 then generate anchors for every other feature map pixel.
    """
    # Get all combinations of scales and ratios
    # e.g. 32 and [0.5, 1, 2]
    scales, ratios = np.meshgrid(np.array(scales), np.array(ratios))
    scales = scales.flatten()  # array([32, 32, 32])
    ratios = ratios.flatten()  # array([ 0.5,  1. ,  2. ])

    # Enumerate heights and widths from scales and ratios
    heights = scales / np.sqrt(ratios)  # array([ 45.254834,  32.,  22.627417])
    widths = scales * np.sqrt(ratios)   # array([ 22.627417,  32. ,  45.254834])

    # Enumerate shifts in feature space, e.g. anchor_stride = 1, feature_stride = 4
    # here, shape = (256, 256)
    # shifts_y.shape = (256, ), shifts_x.shape = (256, )
    shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride
    shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride
    # after meshgrid:
    # shifts_x.shape = (256, 256)--->[[0, 4, ..., 1020], [0, 4, ..., 1020], ...]
    # shifts_y.shape = (256, 256)--->[[0, 0, ..., 0], [4, 4, ..., 4], ...]
    shifts_x, shifts_y = np.meshgrid(shifts_x, shifts_y)

    # Enumerate combinations of shifts, widths, and heights
    # box_widths.shape = (256*256, 3)----> 3 is [22.627417,  32. ,  45.254834]
    # box_centers_x.shape = (256*256, 3)---> the 256*256 rows enumerate all shifted centers
    box_widths, box_centers_x = np.meshgrid(widths, shifts_x)
    box_heights, box_centers_y = np.meshgrid(heights, shifts_y)

    # Reshape to get a list of (y, x) and a list of (h, w)
    # (256*256*3, 2)
    box_centers = np.stack([box_centers_y, box_centers_x], axis=2).reshape([-1, 2])
    box_sizes = np.stack([box_heights, box_widths], axis=2).reshape([-1, 2])

    # Convert to corner coordinates (y1, x1, y2, x2)
    boxes = np.concatenate([box_centers - 0.5 * box_sizes,
                            box_centers + 0.5 * box_sizes], axis=1)
    # if generate_anchors(32, [0.5, 1, 2], [256, 256], 4, 1), return boxes.shape = (196608, 4)
    return boxes


def generate_pyramid_anchors(scales, ratios, feature_shapes, feature_strides,
                             anchor_stride):
    """Generate anchors at different levels of a feature pyramid. Each scale
    is associated with a level of the pyramid, but each ratio is used in
    all levels of the pyramid.

    Returns:
    anchors: [N, (y1, x1, y2, x2)]. All generated anchors in one array. Sorted
        with the same order of the given scales. So, anchors of scale[0] come
        first, then anchors of scale[1], and so on.

    scales = (32, 64, 128, 256, 512)
    ratios = [0.5, 1, 2]
    here, feature_shapes * feature_strides = [[1024, 1024], [1024, 1024], ...], input image is 1024x1024
    feature_shapes = [[256 256]
                      [128 128]
                      [ 64  64]
                      [ 32  32]
                      [ 16  16]]
    feature_strides = [4, 8, 16, 32, 64]
    anchor_stride = 1
    """
    # Anchors
    # [anchor_count, (y1, x1, y2, x2)]
    anchors = []
    for i in range(len(scales)):
        anchors.append(generate_anchors(scales[i], ratios, feature_shapes[i],
                                        feature_strides[i], anchor_stride))
    return np.concatenate(anchors, axis=0)
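For reference, a quick shape check matching the numbers above (a sketch, assuming the repo's utils.py is importable as a top-level module, as in the original matterport layout):

import numpy as np
import utils  # utils.py from the matterport repo; adjust the import to your checkout

# Anchors for the P2 level only: 32-pixel scale on the 256x256 feature map (stride 4)
boxes = utils.generate_anchors(32, [0.5, 1, 2], [256, 256], 4, 1)
print(boxes.shape)   # (196608, 4) = 256*256*3 anchors as (y1, x1, y2, x2)
print(boxes[:3])     # the three ratio variants centered on the first feature-map cell

# All pyramid levels at once
anchors = utils.generate_pyramid_anchors(
    (32, 64, 128, 256, 512), [0.5, 1, 2],
    np.array([[256, 256], [128, 128], [64, 64], [32, 32], [16, 16]]),
    [4, 8, 16, 32, 64], 1)
print(anchors.shape)  # (261888, 4)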

Class ProposalLayer in model.py implements the Proposal layer of the network, sitting between the RPN and the R-CNN head. Its inputs are rpn_probs and rpn_bbox (annotated in the code), and its output shape is (None, self.proposal_count, 4): the number of candidate boxes and their 4 coordinates. The ProposalLayer processes all proposals produced by the RPN: it first keeps the top-k candidates by foreground probability, then applies non-maximum suppression to further reduce their number. One detail worth noting: the RPN's predicted rpn_bbox values (dy, dx, log(dh), log(dw)) are offsets, used to refine the original coordinates (y1, x1, y2, x2) of the selected anchors; the function that applies them is apply_box_deltas_graph(anchors, deltas).

class ProposalLayer(KE.Layer):
    """Receives anchor scores and selects a subset to pass as proposals
    to the second stage. Filtering is done based on anchor scores and
    non-max suppression to remove overlaps. It also applies bounding
    box refinment detals to anchors.

    Inputs:
        rpn_probs: [batch, anchors, (bg prob, fg prob)]
        rpn_bbox: [batch, anchors, (dy, dx, log(dh), log(dw))]

    Returns:
        Proposals in normalized coordinates [batch, rois, (y1, x1, y2, x2)]
    """

    def __init__(self, proposal_count, nms_threshold, anchors,
                 config=None, **kwargs):
        """proposal_count = 2000 or 1000 below, nms_threshold=0.7
        anchors: [N, (y1, x1, y2, x2)] anchors defined in image coordinates
        """
        super(ProposalLayer, self).__init__(**kwargs)
        self.config = config
        self.proposal_count = proposal_count
        self.nms_threshold = nms_threshold
        self.anchors = anchors.astype(np.float32)

    def call(self, inputs):
        # inputs is [rpn_class, rpn_bbox]
        # Box Scores. Use the foreground class confidence. [Batch, num_rois, 1]
        scores = inputs[0][:, :, 1]
        # Box deltas [batch, num_rois, 4]
        deltas = inputs[1]
        # RPN_BBOX_STD_DEV = np.array([0.08, 0.08, 0.17, 0.17])
        # RPN_BBOX_STD_MEANS = np.array([0.02, 0.02, 0.01, 0.02])
        deltas = (deltas + np.reshape(self.config.RPN_BBOX_STD_MEANS, [1, 1, 4])) \
            * np.reshape(self.config.RPN_BBOX_STD_DEV, [1, 1, 4])
        # Base anchors
        anchors = self.anchors

        # Improve performance by trimming to top anchors by score
        # and doing the rest on the smaller subset.
        # self.anchors generated by utils.generate_pyramid_anchors
        # anchor.shape = [anchor_num, 4]
        pre_nms_limit = min(6000, self.anchors.shape[0])
        ix = tf.nn.top_k(scores, pre_nms_limit, sorted=True,
                         name="top_anchors").indices
        scores = utils.batch_slice([scores, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        deltas = utils.batch_slice([deltas, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        anchors = utils.batch_slice(ix, lambda x: tf.gather(anchors, x),
                                    self.config.IMAGES_PER_GPU,
                                    names=["pre_nms_anchors"])

        # Apply deltas to anchors to get refined anchors.
        # [batch, N, (y1, x1, y2, x2)]
        boxes = utils.batch_slice([anchors, deltas],
                                  lambda x, y: apply_box_deltas_graph(x, y),
                                  self.config.IMAGES_PER_GPU,
                                  names=["refined_anchors"])

        # Clip to image boundaries. [batch, N, (y1, x1, y2, x2)]
        height, width = self.config.IMAGE_SHAPE[:2]
        window = np.array([0, 0, height, width]).astype(np.float32)
        boxes = utils.batch_slice(boxes,
                                  lambda x: clip_boxes_graph(x, window),
                                  self.config.IMAGES_PER_GPU,
                                  names=["refined_anchors_clipped"])

        # Filter out small boxes
        # According to Xinlei Chen's paper, this reduces detection accuracy
        # for small objects, so we're skipping it.

        # Normalize dimensions to range of 0 to 1.
        normalized_boxes = boxes / np.array([[height, width, height, width]])

        # Non-max suppression
        def nms(normalized_boxes, scores):
            indices = tf.image.non_max_suppression(
                normalized_boxes, scores, self.proposal_count,
                self.nms_threshold, name="rpn_non_max_suppression")
            proposals = tf.gather(normalized_boxes, indices)
            # Pad if needed
            padding = tf.maximum(self.proposal_count - tf.shape(proposals)[0], 0)
            proposals = tf.pad(proposals, [(0, padding), (0, 0)])
            return proposals

        proposals = utils.batch_slice([normalized_boxes, scores], nms,
                                      self.config.IMAGES_PER_GPU)
        return proposals

    def compute_output_shape(self, input_shape):
        return (None, self.proposal_count, 4)
def apply_box_deltas_graph(boxes, deltas):
    """Applies the given deltas to the given boxes.
    boxes: [N, 4] where each row is y1, x1, y2, x2
    deltas: [N, 4] where each row is [dy, dx, log(dh), log(dw)]
    """
    # Convert to y, x, h, w
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    # Apply deltas
    center_y += deltas[:, 0] * height
    center_x += deltas[:, 1] * width
    height *= tf.exp(deltas[:, 2])
    width *= tf.exp(deltas[:, 3])
    # Convert back to y1, x1, y2, x2
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    y2 = y1 + height
    x2 = x1 + width
    result = tf.stack([y1, x1, y2, x2], axis=1, name="apply_box_deltas_out")
    return result
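To make the refinement concrete, here is the same delta math applied to a single box in plain NumPy (an illustrative sketch with made-up numbers, not repo code):

import numpy as np

anchor = np.array([100., 100., 200., 200.])                # (y1, x1, y2, x2), a 100x100 anchor
delta = np.array([0.1, -0.2, np.log(1.5), np.log(0.5)])    # (dy, dx, log(dh), log(dw))

h, w = anchor[2] - anchor[0], anchor[3] - anchor[1]
cy, cx = anchor[0] + 0.5 * h, anchor[1] + 0.5 * w

cy += delta[0] * h       # 150 + 0.1 * 100 = 160
cx += delta[1] * w       # 150 - 0.2 * 100 = 130
h *= np.exp(delta[2])    # 100 * 1.5 = 150
w *= np.exp(delta[3])    # 100 * 0.5 = 50

refined = np.array([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w])
print(refined)           # [ 85. 105. 235. 155.]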

Class PyramidROIAlign is the pyramid ROIAlign layer. Its inputs are boxes and feature maps. The boxes are the output of the ProposalLayer, i.e., the selected candidate boxes, with shape [batch, num_boxes, (y1, x1, y2, x2)]; these coordinates have already been adjusted by the rpn_bbox offset predictions and normalized relative to the input image. The feature maps are taken from the feature pyramid: a formula decides which pyramid level each box is assigned to, so boxes of different sizes end up mapping to regions of roughly the same size on their assigned level, even though their receptive fields differ. The corresponding region is then cropped from that feature map and the ROIAlign operation is applied, returning a 7x7 result.
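The level-assignment formula referred to here is Equation 1 of the FPN paper, k = k0 + log2(sqrt(w*h)/224) with k0 = 4, clipped to [2, 5]. A small numeric check (a sketch using pixel sizes; inside the layer the coordinates are normalized, hence the extra 224/sqrt(image_area) factor in the code below):

import numpy as np

def fpn_level(box_h, box_w, k0=4):
    # Equation 1 of the FPN paper, clipped to the P2..P5 range
    k = k0 + np.log2(np.sqrt(box_h * box_w) / 224.0)
    return int(np.clip(np.round(k), 2, 5))

print(fpn_level(224, 224))   # 4  -> a 224x224 ROI goes to P4, as the code comment says
print(fpn_level(112, 112))   # 3  -> half the size, one level lower
print(fpn_level(448, 448))   # 5
print(fpn_level(32, 32))     # 2  (clipped at the bottom of the pyramid)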

The call() method here is invoked through the __call__(self, inputs, **kwargs) hook implemented in keras.engine.Layer. [detailed explanation]

class PyramidROIAlign(KE.Layer):
    """Implements ROI Pooling on multiple levels of the feature pyramid.

    Params:
    - pool_shape: [height, width] of the output pooled regions. Usually [7, 7]
    - image_shape: [height, width, chanells]. Shape of input image in pixels

    Inputs:
    - boxes: [batch, num_boxes, (y1, x1, y2, x2)] in normalized
             coordinates. Possibly padded with zeros if not enough
             boxes to fill the array.
    - Feature maps: List of feature maps from different levels of the pyramid.
                    Each is [batch, height, width, channels]

    Output:
    Pooled regions in the shape: [batch, num_boxes, height, width, channels].
    height = width = 7
    The width and height are those specific in the pool_shape in the layer
    constructor.
    """
    # rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized coordinates.
    # feature_maps: List of feature maps from diffent layers of the pyramid, [P2, P3, P4, P5].
    #               Each has a different resolution [batch, height, width, channels]
    # x = PyramidROIAlign([pool_size, pool_size], image_shape,
    #                     name="roi_align_classifier")([rois] + feature_maps)

    def __init__(self, pool_shape, image_shape, **kwargs):
        super(PyramidROIAlign, self).__init__(**kwargs)
        self.pool_shape = tuple(pool_shape)
        self.image_shape = tuple(image_shape)

    def call(self, inputs):
        # Crop boxes [batch, num_boxes, (y1, x1, y2, x2)] in normalized coords
        # inputs = [rois, *feature_maps] = [[batch, num_rois, (y1, x1, y2, x2)], batch, height, width, channels]
        boxes = inputs[0]

        # Feature Maps. List of feature maps from different level of the
        # feature pyramid. Each is [batch, height, width, channels]
        feature_maps = inputs[1:]

        # Assign each ROI to a level in the pyramid based on the ROI area.
        y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
        h = y2 - y1
        w = x2 - x1
        # Equation 1 in the Feature Pyramid Networks paper. Account for
        # the fact that our coordinates are normalized here.
        # e.g. a 224x224 ROI (in pixels) maps to P4
        image_area = tf.cast(
            self.image_shape[0] * self.image_shape[1], tf.float32)
        roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
        # roi_level limited between 2 and 5
        roi_level = tf.minimum(5, tf.maximum(
            2, 4 + tf.cast(tf.round(roi_level), tf.int32)))
        # the dimension 2's value is 1, and delete it
        roi_level = tf.squeeze(roi_level, 2)

        # Loop through levels and apply ROI pooling to each. P2 to P5.
        pooled = []
        box_to_level = []
        for i, level in enumerate(range(2, 6)):
            ix = tf.where(tf.equal(roi_level, level))
            level_boxes = tf.gather_nd(boxes, ix)

            # Box indicies for crop_and_resize.
            box_indices = tf.cast(ix[:, 0], tf.int32)

            # Keep track of which box is mapped to which level
            box_to_level.append(ix)

            # Stop gradient propogation to ROI proposals
            level_boxes = tf.stop_gradient(level_boxes)
            box_indices = tf.stop_gradient(box_indices)

            # Crop and Resize
            # From Mask R-CNN paper: "We sample four regular locations, so
            # that we can evaluate either max or average pooling. In fact,
            # interpolating only a single value at each bin center (without
            # pooling) is nearly as effective."
            #
            # Here we use the simplified approach of a single value per bin,
            # which is how it's done in tf.crop_and_resize()
            # Result: [batch * num_boxes, pool_height, pool_width, channels]
            pooled.append(tf.image.crop_and_resize(
                feature_maps[i], level_boxes, box_indices, self.pool_shape,
                method="bilinear"))

        # Pack pooled features into one tensor
        pooled = tf.concat(pooled, axis=0)

        # Pack box_to_level mapping into one array and add another
        # column representing the order of pooled boxes
        box_to_level = tf.concat(box_to_level, axis=0)
        box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1)
        box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range],
                                 axis=1)

        # Rearrange pooled features to match the order of the original boxes
        # Sort box_to_level by batch then box index
        # TF doesn't have a way to sort by two columns, so merge them and sort.
        sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]
        ix = tf.nn.top_k(sorting_tensor, k=tf.shape(
            box_to_level)[0]).indices[::-1]
        ix = tf.gather(box_to_level[:, 2], ix)
        pooled = tf.gather(pooled, ix)

        # Re-add the batch dimension
        pooled = tf.expand_dims(pooled, 0)
        return pooled

    def compute_output_shape(self, input_shape):
        return input_shape[0][:2] + self.pool_shape + (input_shape[1][-1], )
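The reordering at the end of call() — sort by batch index first, then by box index, using a single merged integer key — can be mimicked in NumPy like this (an illustrative sketch; np.argsort replaces the top_k-then-reverse trick used in the TF graph):

import numpy as np

# box_to_level columns: [batch_index, box_index_within_batch, position_in_pooled]
box_to_level = np.array([
    [0, 2, 0],   # pooled in per-level order, but we want the original box order back
    [1, 0, 1],
    [0, 0, 2],
    [1, 1, 3],
    [0, 1, 4],
])

# Merge the two sort keys into one integer key, exactly as the layer does
sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]
order = np.argsort(sorting_tensor)    # batch 0 boxes 0,1,2 first, then batch 1 boxes 0,1
ix = box_to_level[order, 2]
print(ix)                             # [2 4 0 1 3] -> indices used to gather `pooled`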

Network head

(Figure: Mask R-CNN head architecture; image from Zhihu.)

This function implements the left part of the network structure in the figure above, i.e., the classification and regression predictions of the head. It uses keras.layers.TimeDistributed(): [the output of the PyramidROIAlign method, x, has shape (batch, N, height, width, channel). This is a 5D tensor. We want to apply 2D convolution to x but Keras' Conv2D only accepts 4D tensors and the second dimension of x (i.e., N) is technically the batch dimension for the Conv2D operation. This is where we can use TimeDistributed layers. The output shape of x after the first ReLU operation is (batch, N, 1, 1, 1024) for reference — quoted from here]. In short, the input here is a 5-D tensor, but Conv2D can only handle 4-D tensors, so TimeDistributed treats the N dimension as time steps and applies Conv2D to each (height, width, channel) slice independently; the final output is (batch, N, 1, 1, 1024).
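A minimal, self-contained illustration of this TimeDistributed pattern (a sketch using tf.keras; the repo itself uses standalone Keras imported as KL, but TimeDistributed behaves the same way):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers as KL

batch, N, pool, channels = 2, 5, 7, 256
x = np.random.rand(batch, N, pool, pool, channels).astype("float32")

# Conv2D alone expects 4D input; wrapping it in TimeDistributed applies the same
# convolution independently to each of the N ROIs, keeping (batch, N, ...) intact.
fc_as_conv = KL.TimeDistributed(KL.Conv2D(1024, (pool, pool), padding="valid"))
y = fc_as_conv(x)
print(y.shape)   # (2, 5, 1, 1, 1024), matching the (batch, N, 1, 1, 1024) mentioned above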

def fpn_classifier_graph(rois, feature_maps,
                         image_shape, pool_size, num_classes):
    """Builds the computation graph of the feature pyramid network classifier
    and regressor heads.

    rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized
          coordinates.
    feature_maps: List of feature maps from diffent layers of the pyramid,
                  [P2, P3, P4, P5]. Each has a different resolution.
    image_shape: [height, width, depth]
    pool_size: The width of the square feature map generated from ROI Pooling.
    num_classes: number of classes, which determines the depth of the results

    Returns:
        logits: [N, NUM_CLASSES] classifier logits (before softmax)
        probs: [N, NUM_CLASSES] classifier probabilities
        bbox_deltas: [N, (dy, dx, log(dh), log(dw))] Deltas to apply to
                     proposal boxes
    """
    # ROI Pooling
    # Shape: [batch, num_boxes, pool_height, pool_width, channels]
    x = PyramidROIAlign([pool_size, pool_size], image_shape,
                        name="roi_align_classifier")([rois] + feature_maps)

    # Attent part 2_1
    # attent_x = KL.TimeDistributed(KL.Conv2D(2*1024, (pool_size, pool_size), padding="valid"),
    #                               name="mrcnn_attent_pre")(x)
    # attent_x = KL.TimeDistributed(KL.Reshape((1024, 2)))(attent_x)
    # attent_x = KL.Lambda(lambda x: tf.nn.softmax(x), name="mrcnn_softmax_attent")(attent_x)
    # attent_object, attent_back = KL.Lambda(lambda x: tf.split(x, [1, 1], -1))(attent_x)
    # attent_object = KL.Lambda(lambda x: tf.tile(x, [1, 1, 1, num_classes - 1]))(attent_object)
    # print(K.int_shape(attent_object))
    # attent_x = KL.Lambda(lambda x: tf.concat(x, -1))([attent_back, attent_object])

    # Two 1024 FC layers (implemented with Conv2D for consistency)
    x = KL.TimeDistributed(KL.Conv2D(1024, (pool_size, pool_size), padding="valid"),
                           name="mrcnn_class_conv1")(x)
    x = KL.TimeDistributed(BatchNorm(axis=3), name='mrcnn_class_bn1')(x)
    x = KL.Activation('relu')(x)
    # the output of the PyramidROIAlign method, x, has shape (batch, N, height, width, channel).
    # This is a 5D tensor. We want to apply 2D convolution to x but Keras' Conv2D only accepts
    # 4D tensors and the second dimension of x (i.e., N) is technically the batch dimension for
    # the Conv2D operation. This is where we can use TimeDistributed layers.
    # The output shape of x after the first ReLU operation is (batch, N, 1, 1, 1024) for reference
    x = KL.TimeDistributed(KL.Conv2D(1024, (1, 1)),
                           name="mrcnn_class_conv2")(x)
    x = KL.TimeDistributed(BatchNorm(axis=3),
                           name='mrcnn_class_bn2')(x)
    x = KL.Activation('relu')(x)

    # (batch, N, 1, 1, 1024)--->(batch, N, 1024), put into FC layer
    shared = KL.Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2),
                       name="pool_squeeze")(x)
    # shared = KL.TimeDistributed(KL.Reshape((1024, 1)))(shared)
    # shared = KL.Multiply()([attent_x, shared])
    # shared = KL.TimeDistributed(KL.Flatten())(shared)
    # shared = KL.TimeDistributed(KL.Reshape((-1, 1)))(shared)

    # Classifier head
    # mrcnn_class_logits = KL.TimeDistributed(KL.LocallyConnected1D(1, 1024, strides=1024),
    #                                         name='mrcnn_class_logits')(shared)
    # mrcnn_class_logits = KL.Lambda(lambda x: K.squeeze(x, -1), name="logits_squeeze")(mrcnn_class_logits)
    mrcnn_class_logits = KL.TimeDistributed(KL.Dense(num_classes),
                                            name='mrcnn_class_logits')(shared)
    mrcnn_probs = KL.TimeDistributed(KL.Activation("softmax"),
                                     name="mrcnn_class")(mrcnn_class_logits)

    # BBox head
    # [batch, boxes, num_classes * (dy, dx, log(dh), log(dw))]
    # mrcnn_bbox = KL.TimeDistributed(KL.LocallyConnected1D(4, 1024, strides=1024),
    #                                 name='mrcnn_bbox_fc')(shared)
    x = KL.TimeDistributed(KL.Dense(num_classes * 4, activation='linear'),
                           name='mrcnn_bbox_fc')(shared)
    # Reshape to [batch, boxes, num_classes, (dy, dx, log(dh), log(dw))]
    s = K.int_shape(x)
    mrcnn_bbox = KL.Reshape((s[1], num_classes, 4), name="mrcnn_bbox")(x)

    return mrcnn_class_logits, mrcnn_probs, mrcnn_bbox

This part is the mask prediction. Its network structure is easy to follow against the figure above; the final output has shape (batch, num_rois, 28, 28, 80), where 80 is num_classes.

def build_fpn_mask_graph(rois, feature_maps,
                         image_shape, pool_size, num_classes):
    """Builds the computation graph of the mask head of Feature Pyramid Network.

    rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized
          coordinates.
    feature_maps: List of feature maps from diffent layers of the pyramid,
                  [P2, P3, P4, P5]. Each has a different resolution.
    image_shape: [height, width, depth]
    pool_size: The width of the square feature map generated from ROI Pooling.
    num_classes: number of classes, which determines the depth of the results

    Returns: Masks [batch, roi_count, height, width, num_classes]
    """
    # ROI Pooling
    # Shape: [batch, boxes, pool_height, pool_width, channels]
    x = PyramidROIAlign([pool_size, pool_size], image_shape,
                        name="roi_align_mask")([rois] + feature_maps)

    # Conv layers
    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv1")(x)
    x = KL.TimeDistributed(BatchNorm(axis=3),
                           name='mrcnn_mask_bn1')(x)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv2")(x)
    x = KL.TimeDistributed(BatchNorm(axis=3),
                           name='mrcnn_mask_bn2')(x)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv3")(x)
    x = KL.TimeDistributed(BatchNorm(axis=3),
                           name='mrcnn_mask_bn3')(x)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv4")(x)
    x = KL.TimeDistributed(BatchNorm(axis=3),
                           name='mrcnn_mask_bn4')(x)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2DTranspose(256, (2, 2), strides=2, activation="relu"),
                           name="mrcnn_mask_deconv")(x)
    x = KL.TimeDistributed(KL.Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid"),
                           name="mrcnn_mask")(x)
    return x

Loss functions:

The two loss functions on the RPN side

def smooth_l1_loss(y_true, y_pred):
    """Implements Smooth-L1 loss.
    y_true and y_pred are typicallly: [N, 4], but could be any shape.
    """
    diff = K.abs(y_true - y_pred)
    less_than_one = K.cast(K.less(diff, 1.0), "float32")
    loss = (less_than_one * 0.5 * diff**2) + (1 - less_than_one) * (diff - 0.5)
    return loss


def rpn_class_loss_graph(rpn_match, rpn_class_logits):
    """RPN anchor classifier loss.

    rpn_match: [batch, anchors, 1]. Anchor match type. 1=positive,
               -1=negative, 0=neutral anchor.
    rpn_class_logits: [batch, anchors, 2]. RPN classifier logits for FG/BG.
    """
    # Squeeze last dim to simplify
    # [batch, anchors, 1]--->[batch, anchors], the value is 1 or 0 represent the fg/bg
    # a high dimension is for a explicit expression
    rpn_match = tf.squeeze(rpn_match, -1)
    # Get anchor classes. Convert the -1/+1 match to 0/1 values.
    # anchor_class.shape = (batch, anchors)
    anchor_class = K.cast(K.equal(rpn_match, 1), tf.int32)
    # Positive and Negative anchors contribute to the loss,
    # but neutral anchors (match value = 0) don't.
    indices = tf.where(K.not_equal(rpn_match, 0))
    # Pick rows that contribute to the loss and filter out the rest.
    rpn_class_logits = tf.gather_nd(rpn_class_logits, indices)
    # according the indices slice the anchor_class,
    # the output.shape = indices.shape[:-1] + params.shape[indices.shape[-1]:]
    anchor_class = tf.gather_nd(anchor_class, indices)
    # Crossentropy loss
    loss = K.sparse_categorical_crossentropy(target=anchor_class,
                                             output=rpn_class_logits,
                                             from_logits=True)
    loss = K.switch(tf.size(loss) > 0, K.mean(loss), tf.constant(0.0))
    return loss


def rpn_bbox_loss_graph(config, target_bbox, rpn_match, rpn_bbox):
    """Return the RPN bounding box loss graph.

    config: the model config object.
    target_bbox: [batch, max positive anchors, (dy, dx, log(dh), log(dw))].
                 Uses 0 padding to fill in unsed bbox deltas.
    rpn_match: [batch, anchors, 1]. Anchor match type. 1=positive,
               -1=negative, 0=neutral anchor.
    rpn_bbox: [batch, anchors, (dy, dx, log(dh), log(dw))]
    """
    # Positive anchors contribute to the loss, but negative and
    # neutral anchors (match value of 0 or -1) don't.
    rpn_match = K.squeeze(rpn_match, -1)
    indices = tf.where(K.equal(rpn_match, 1))

    # Pick bbox deltas that contribute to the loss
    rpn_bbox = tf.gather_nd(rpn_bbox, indices)

    # Trim target bounding box deltas to the same length as rpn_bbox.
    batch_counts = K.sum(K.cast(K.equal(rpn_match, 1), tf.int32), axis=1)
    target_bbox = batch_pack_graph(target_bbox, batch_counts,
                                   config.IMAGES_PER_GPU)

    # TODO: use smooth_l1_loss() rather than reimplementing here
    #       to reduce code duplication
    diff = K.abs(target_bbox - rpn_bbox)
    less_than_one = K.cast(K.less(diff, 1.0), "float32")
    loss = (less_than_one * 0.5 * diff**2) + (1 - less_than_one) * (diff - 0.5)

    loss = K.switch(tf.size(loss) > 0, K.mean(loss), tf.constant(0.0))
    return loss
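A quick numeric check of the two regimes of smooth_l1_loss (a sketch in plain NumPy): for |d| < 1 the loss is 0.5*d^2, otherwise |d| - 0.5, so it is quadratic near zero and linear for large errors.

import numpy as np

def smooth_l1(diff):
    diff = np.abs(diff)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)

print(smooth_l1(np.array([0.1, 0.5, 1.0, 2.0])))
# [0.005 0.125 0.5   1.5  ]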

The three R-CNN losses, for the cls, bbox, and mask heads respectively:

def mrcnn_class_loss_graph(target_class_ids, pred_class_logits,
                           active_class_ids):
    """Loss for the classifier head of Mask RCNN.

    target_class_ids: [batch, num_rois]. Integer class IDs. Uses zero
                      padding to fill in the array.
    pred_class_logits: [batch, num_rois, num_classes]
    active_class_ids: [batch, num_classes]. Has a value of 1 for
                      classes that are in the dataset of the image, and 0
                      for classes that are not in the dataset.
    """
    target_class_ids = tf.cast(target_class_ids, 'int64')

    # Find predictions of classes that are not in the dataset.
    # pred_class_ids = tf.argmax(pred_class_logits, axis=2)
    # TODO: Update this line to work with batch > 1. Right now it assumes all
    #       images in a batch have the same active_class_ids
    # pred_active = [tf.gather(active_class_ids[i], pred_class_ids[i]) for i in range(2)]
    # pred_active = tf.stack(pred_active, axis=0)

    # Loss
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=target_class_ids, logits=pred_class_logits)

    # Erase losses of predictions of classes that are not in the active
    # classes of the image.
    # loss = loss * pred_active

    # Computer loss mean. Use only predictions that contribute
    # to the loss to get a correct mean.
    loss = tf.reduce_mean(loss)  # / tf.reduce_sum(pred_active)
    return loss


def mrcnn_bbox_loss_graph(target_bbox, target_class_ids, pred_bbox):
    """Loss for Mask R-CNN bounding box refinement.

    target_bbox: [batch, num_rois, (dy, dx, log(dh), log(dw))]
    target_class_ids: [batch, num_rois]. Integer class IDs.
    pred_bbox: [batch, num_rois, num_classes, (dy, dx, log(dh), log(dw))]
    """
    # Reshape to merge batch and roi dimensions for simplicity.
    target_class_ids = K.reshape(target_class_ids, (-1,))
    target_bbox = K.reshape(target_bbox, (-1, 4))
    pred_bbox = K.reshape(pred_bbox, (-1, K.int_shape(pred_bbox)[2], 4))

    # Only positive ROIs contribute to the loss. And only
    # the right class_id of each ROI. Get their indicies.
    positive_roi_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_roi_class_ids = tf.cast(
        tf.gather(target_class_ids, positive_roi_ix), tf.int64)
    indices = tf.stack([positive_roi_ix, positive_roi_class_ids], axis=1)

    # Gather the deltas (predicted and true) that contribute to loss
    target_bbox = tf.gather(target_bbox, positive_roi_ix)
    pred_bbox = tf.gather_nd(pred_bbox, indices)

    # Smooth-L1 Loss
    loss = K.switch(tf.size(target_bbox) > 0,
                    smooth_l1_loss(y_true=target_bbox, y_pred=pred_bbox),
                    tf.constant(0.0))
    loss = K.mean(loss)
    loss = K.reshape(loss, [1, 1])
    return loss


def mrcnn_mask_loss_graph(target_masks, target_class_ids, pred_masks):
    """Mask binary cross-entropy loss for the masks head.

    target_masks: [batch, num_rois, height, width].
                  A float32 tensor of values 0 or 1. Uses zero padding to fill array.
    target_class_ids: [batch, num_rois]. Integer class IDs. Zero padded.
    pred_masks: [batch, proposals, height, width, num_classes] float32 tensor
                with values from 0 to 1.
    """
    # Reshape for simplicity. Merge first two dimensions into one.
    target_class_ids = K.reshape(target_class_ids, (-1,))
    mask_shape = tf.shape(target_masks)
    target_masks = K.reshape(target_masks, (-1, mask_shape[2], mask_shape[3]))
    pred_shape = tf.shape(pred_masks)
    pred_masks = K.reshape(pred_masks,
                           (-1, pred_shape[2], pred_shape[3], pred_shape[4]))
    # Permute predicted masks to [N, num_classes, height, width]
    pred_masks = tf.transpose(pred_masks, [0, 3, 1, 2])

    # Only positive ROIs contribute to the loss. And only
    # the class specific mask of each ROI.
    positive_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_class_ids = tf.cast(
        tf.gather(target_class_ids, positive_ix), tf.int64)
    indices = tf.stack([positive_ix, positive_class_ids], axis=1)

    # Gather the masks (predicted and true) that contribute to loss
    y_true = tf.gather(target_masks, positive_ix)
    y_pred = tf.gather_nd(pred_masks, indices)

    # Compute binary cross entropy. If no positive ROIs, then return 0.
    # shape: [batch, roi, num_classes]
    loss = K.switch(tf.size(y_true) > 0,
                    K.binary_crossentropy(target=y_true, output=y_pred),
                    tf.constant(0.0))
    loss = K.mean(loss)
    loss = K.reshape(loss, [1, 1])
    return loss
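The indexing pattern shared by the bbox and mask losses — picking, for each positive ROI, only the prediction belonging to its own class via tf.gather_nd — is easier to see in isolation (a sketch with tiny tensors, run in TF 2.x eager mode rather than the TF1 graphs the repo uses):

import numpy as np
import tensorflow as tf

# 4 ROIs, 3 classes (0 = background), one scalar "prediction" per (roi, class)
pred = tf.constant(np.arange(12).reshape(4, 3), dtype=tf.float32)
target_class_ids = tf.constant([0, 2, 1, 0], dtype=tf.int64)   # ROIs 1 and 2 are positive

positive_ix = tf.where(target_class_ids > 0)[:, 0]             # [1, 2]
positive_class_ids = tf.gather(target_class_ids, positive_ix)  # [2, 1]
indices = tf.stack([positive_ix, positive_class_ids], axis=1)  # [[1, 2], [2, 1]]

print(tf.gather_nd(pred, indices).numpy())   # [5. 7.] -> pred[1, 2] and pred[2, 1]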

Data processing: Dataset Generator

load_image_gt(): loads the ground-truth information for an image. dataset is an instance of class CocoDataset(Dataset); load_image(image_id) loads the image as an (h, w, c) array, and load_bbox(image_id) loads the bbox positions and the corresponding class_ids from the annotations, as its return values show. utils.resize_image() resizes the image according to the configured minimum and maximum dimensions: first scale, then pad with zeros. utils.resize_bbox() transforms the bboxes accordingly: it only scales the boxes, then adds the resized image's top-left padding offsets to the top-left and bottom-right coordinates, because the padding shifts the boxes' relative positions.

def load_image_gt(dataset, config, image_id, augment=False):
    """Load and return ground truth data for an image (image, mask, bounding boxes).

    augment: If true, apply random image augmentation. Currently, only
             horizontal flipping is offered.
    use_mini_mask: If False, returns full-size masks that are the same height
        and width as the original image. These can be big, for example
        1024x1024x100 (for 100 instances). Mini masks are smaller, typically,
        224x224 and are generated by extracting the bounding box of the
        object and resizing it to MINI_MASK_SHAPE.

    Returns:
    image: [height, width, 3]
    shape: the original shape of the image before resizing and cropping.
    class_ids: [instance_count] Integer class IDs
    bbox: [instance_count, (y1, x1, y2, x2)]
    mask: [height, width, instance_count]. The height and width are those
        of the image unless use_mini_mask is True, in which case they are
        defined in MINI_MASK_SHAPE.
    """
    # Load image and mask
    # dataset is the instance of class CocoDataset(Dataset)
    image = dataset.load_image(image_id)
    shape = image.shape
    bbox, class_ids = dataset.load_bbox(image_id)
    image, window, scale, padding = utils.resize_image(
        image,
        min_dim=config.IMAGE_MIN_DIM,
        max_dim=config.IMAGE_MAX_DIM,
        padding=config.IMAGE_PADDING)

    # Bounding boxes. Note that some boxes might be all zeros
    # if the corresponding mask got cropped out.
    # bbox: [num_instances, (y1, x1, y2, x2)]
    h, w, _ = image.shape
    bbox = utils.resize_bbox(bbox, scale, padding)

    # Random horizontal flips.
    if augment:
        if random.randint(0, 1):
            image = np.fliplr(image)
            bbox[:, [3, 1]] = h - bbox[:, [1, 3]]

    # Active classes
    # Different datasets have different classes, so track the
    # classes supported in the dataset of this image.
    active_class_ids = np.zeros([dataset.num_classes], dtype=np.int32)
    source_class_ids = dataset.source_class_ids[dataset.image_info[image_id]["source"]]
    active_class_ids[source_class_ids] = 1

    # Image meta data
    image_meta = compose_image_meta(image_id, shape, window, active_class_ids)

    return image, image_meta, class_ids, bbox
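To illustrate what resize_bbox has to do (a sketch of the idea described above, not the repo's implementation): after resizing by scale and padding the image to a square, box coordinates must be scaled and then shifted by the top/left padding.

import numpy as np

def resize_bbox_sketch(bbox, scale, padding):
    """bbox: [N, (y1, x1, y2, x2)]; padding: ((top, bottom), (left, right), (0, 0))."""
    top_pad, left_pad = padding[0][0], padding[1][0]
    bbox = bbox * scale
    bbox += np.array([top_pad, left_pad, top_pad, left_pad])
    return bbox

# A 480x640 (HxW) image scaled by 1.6 to 768x1024, then padded 128 px top and bottom
# to reach 1024x1024
boxes = np.array([[100., 50., 300., 250.]])
print(resize_bbox_sketch(boxes, 1.6, ((128, 128), (0, 0), (0, 0))))
# [[288.  80. 608. 400.]]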

build_rpn_targets(): the anchors argument is the full set of anchors, and gt_class_ids and gt_boxes correspond to each other, i.e., each label matches its gt box. The function computes the IoU of every anchor with every gt box. One detail is crowd_ix = np.where(gt_class_ids < 0)[0]: the COCO dataset contains "crowd" annotations — when several objects overlap heavily, the annotation may carry a negative class ID, and how such annotations are used is left to your own code. Here, if such gt boxes exist, the IoU of every anchor with these negatively labelled crowd boxes is computed, the maximum per anchor is taken, and anchors whose maximum crowd IoU is below the 0.001 threshold are marked True (no_crowd_bool); in effect, anchors that intersect a crowd box are excluded from becoming negative samples.

For positive samples the crowd handling has no effect; the positive threshold is 0.7. In the end 256 anchors are sampled. Among the selected anchors, positives are capped at 128; any excess is randomly reset to 0, i.e., neutral and unused. The remainder are filled with negatives. So among all anchors there are x (<=128) positive anchors and y (<=256-x, though negatives can usually fill the quota up to 256) negative anchors, and the rest are unused.

Of the 256 selected anchors, several positive anchors may correspond to the same gt box, but each anchor corresponds to exactly one gt box. For each positive anchor, its offsets relative to its gt box are computed following the formula in the paper: dy = (gt_cy - a_cy)/a_h, dx = (gt_cx - a_cx)/a_w, dh = log(gt_h/a_h), dw = log(gt_w/a_w). These offsets are the regression targets fed directly into the smooth L1 loss.

def build_rpn_targets(image_shape, anchors, gt_class_ids, gt_boxes, config):
    """Given the anchors and GT boxes, compute overlaps and identify positive
    anchors and deltas to refine them to match their corresponding GT boxes.

    anchors: [num_anchors, (y1, x1, y2, x2)], is the proposal anchor
    gt_class_ids: [num_gt_boxes] Integer class IDs. is the proposal anchor-->real label id
    gt_boxes: [num_gt_boxes, (y1, x1, y2, x2)]

    Returns:
    rpn_match: [N] (int32) matches between anchors and GT boxes.
               1 = positive anchor, -1 = negative anchor, 0 = neutral
    rpn_bbox: [N, (dy, dx, log(dh), log(dw))] Anchor bbox deltas.
    """
    # RPN Match: 1 = positive anchor, -1 = negative anchor, 0 = neutral
    # rpn_match.shape = (num_anchors, )
    rpn_match = np.zeros([anchors.shape[0]], dtype=np.int32)
    # RPN bounding boxes: [max anchors per image, (dy, dx, log(dh), log(dw))]
    # config.RPN_TRAIN_ANCHORS_PER_IMAGE = 256
    # rpn_bbox.shape = (256, 4)
    rpn_bbox = np.zeros((config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4))

    # Handle COCO crowds
    # A crowd box in COCO is a bounding box around several instances. Exclude
    # them from training. A crowd box is given a negative class ID.
    crowd_ix = np.where(gt_class_ids < 0)[0]
    if crowd_ix.shape[0] > 0:
        # Filter out crowds from ground truth class IDs and boxes
        non_crowd_ix = np.where(gt_class_ids > 0)[0]
        crowd_boxes = gt_boxes[crowd_ix]
        gt_class_ids = gt_class_ids[non_crowd_ix]
        gt_boxes = gt_boxes[non_crowd_ix]
        # Compute overlaps with crowd boxes [anchors, crowds]
        crowd_overlaps = utils.compute_overlaps(anchors, crowd_boxes)
        crowd_iou_max = np.amax(crowd_overlaps, axis=1)
        no_crowd_bool = (crowd_iou_max < 0.001)
    else:
        # All anchors don't intersect a crowd
        no_crowd_bool = np.ones([anchors.shape[0]], dtype=bool)

    # Compute overlaps [num_anchors, num_gt_boxes]
    overlaps = utils.compute_overlaps(anchors, gt_boxes)

    # Match anchors to GT Boxes
    # If an anchor overlaps a GT box with IoU >= 0.7 then it's positive.
    # If an anchor overlaps a GT box with IoU < 0.3 then it's negative.
    # Neutral anchors are those that don't match the conditions above,
    # and they don't influence the loss function.
    # However, don't keep any GT box unmatched (rare, but happens). Instead,
    # match it to the closest anchor (even if its max IoU is < 0.3).
    #
    # 1. Set negative anchors first. They get overwritten below if a GT box is
    # matched to them. Skip boxes in crowd areas.
    # overlaps.shape = (num_anchor, num_gt_box)
    anchor_iou_argmax = np.argmax(overlaps, axis=1)
    anchor_iou_max = overlaps[np.arange(overlaps.shape[0]), anchor_iou_argmax]
    rpn_match[(anchor_iou_max < 0.3) & (no_crowd_bool)] = -1
    # 2. Set an anchor for each GT box (regardless of IoU value).
    # TODO: If multiple anchors have the same IoU match all of them
    gt_iou_argmax = np.argmax(overlaps, axis=0)
    rpn_match[gt_iou_argmax] = 1
    # 3. Set anchors with high overlap as positive.
    rpn_match[anchor_iou_max >= 0.7] = 1

    # Subsample to balance positive and negative anchors
    # Don't let positives be more than half the anchors
    ids = np.where(rpn_match == 1)[0]
    # RPN_TRAIN_ANCHORS_PER_IMAGE = 256
    # make the positive <= 128 and the negative <= 128; the rest are anchors
    # that contribute nothing to the loss
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2)
    if extra > 0:
        # Reset the extra ones to neutral
        ids = np.random.choice(ids, extra, replace=False)
        rpn_match[ids] = 0
    # Same for negative proposals
    ids = np.where(rpn_match == -1)[0]
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE -
                        np.sum(rpn_match == 1))
    if extra > 0:
        # Rest the extra ones to neutral
        ids = np.random.choice(ids, extra, replace=False)
        rpn_match[ids] = 0

    # For positive anchors, compute shift and scale needed to transform them
    # to match the corresponding GT boxes.
    ids = np.where(rpn_match == 1)[0]
    ix = 0  # index into rpn_bbox
    # TODO: use box_refinment() rather than duplicating the code here
    for i, a in zip(ids, anchors[ids]):
        # Closest gt box (it might have IoU < 0.7)
        gt = gt_boxes[anchor_iou_argmax[i]]

        # Convert coordinates to center plus width/height.
        # GT Box
        gt_h = gt[2] - gt[0]
        gt_w = gt[3] - gt[1]
        gt_center_y = gt[0] + 0.5 * gt_h
        gt_center_x = gt[1] + 0.5 * gt_w
        # Anchor
        a_h = a[2] - a[0]
        a_w = a[3] - a[1]
        a_center_y = a[0] + 0.5 * a_h
        a_center_x = a[1] + 0.5 * a_w

        # Compute the bbox refinement that the RPN should predict.
        rpn_bbox[ix] = [
            (gt_center_y - a_center_y) / a_h,
            (gt_center_x - a_center_x) / a_w,
            np.log(gt_h / a_h),
            np.log(gt_w / a_w),
        ]
        # Normalize
        rpn_bbox[ix] -= config.RPN_BBOX_STD_MEANS
        rpn_bbox[ix] /= config.RPN_BBOX_STD_DEV
        ix += 1
    return rpn_match, rpn_bbox
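A worked example of the refinement targets computed in the loop above (a sketch; the final normalization with RPN_BBOX_STD_MEANS/RPN_BBOX_STD_DEV is left out):

import numpy as np

anchor = np.array([96., 96., 160., 160.])   # 64x64 anchor, (y1, x1, y2, x2)
gt = np.array([100., 90., 180., 170.])      # 80x80 ground-truth box

a_h, a_w = anchor[2] - anchor[0], anchor[3] - anchor[1]
a_cy, a_cx = anchor[0] + 0.5 * a_h, anchor[1] + 0.5 * a_w
gt_h, gt_w = gt[2] - gt[0], gt[3] - gt[1]
gt_cy, gt_cx = gt[0] + 0.5 * gt_h, gt[1] + 0.5 * gt_w

target = np.array([(gt_cy - a_cy) / a_h,    # dy = (140 - 128) / 64 = 0.1875
                   (gt_cx - a_cx) / a_w,    # dx = (130 - 128) / 64 = 0.03125
                   np.log(gt_h / a_h),      # log(dh) = log(80/64) ≈ 0.223
                   np.log(gt_w / a_w)])     # log(dw) = log(80/64) ≈ 0.223
print(target)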
