pytorchOCR: PSENet

Paper link
Official code

I won't walk through the paper here, since there are plenty of write-ups online. This post only covers the project code.

Label generation

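Borrowing the figure from the paper: as it shows, several binary maps have to be generated (the number is configurable; the paper uses 6), in which text regions are white (1) and the background is black (0). The map with the largest white area is the text segmentation map, and the smallest is what the paper calls the kernel map. This is what makes it possible to separate adjacent text instances.
The relevant code is in ptocr/dataloader/DetLoad/MakeSegMap.py: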

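# Imports used by this excerpt
import numpy as np
import Polygon as plg
import pyclipper

def shrink(self, bboxes, rate, max_shr=20):
    # The squared rate enters the offset formula below.
    rate = rate * rate
    shrinked_bboxes = []
    for bbox in bboxes:
        area = plg.Polygon(bbox).area()
        peri = self.perimeter(bbox)
        pco = pyclipper.PyclipperOffset()
        pco.AddPath(bbox, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        # Inward offset in pixels, rounded to the nearest integer and capped at max_shr.
        offset = min((int)(area * (1 - rate) / (peri + 0.001) + 0.5), max_shr)
        shrinked_bbox = pco.Execute(-offset)
        # Fall back to the original box if shrinking removed the polygon entirely.
        if len(shrinked_bbox) == 0:
            shrinked_bboxes.append(bbox)
            continue
        shrinked_bbox = np.array(shrinked_bbox)[0]
        shrinked_bbox = np.array(shrinked_bbox)
        # Fall back as well if fewer than three vertices are left.
        if shrinked_bbox.shape[0] <= 2:
            shrinked_bboxes.append(bbox)
            continue
        shrinked_bboxes.append(shrinked_bbox)
    return np.array(shrinked_bboxes)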

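This function shrinks each annotated box at the given rate and returns the shrunk boxes. The segmentation maps are then drawn with OpenCV.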
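To make the shrink amount concrete, here is a minimal pure-Python sketch of the offset arithmetic used in shrink, together with one way the per-map rates might be chosen. The linear rate schedule and the minimum scale of 0.4 are assumptions based on the paper's formulation, not read from this repository, and kernel_rates, polygon_area, polygon_perimeter and shrink_offset are hypothetical helper names.

```python
import math

def kernel_rates(n, min_scale=0.4):
    """Assumed linear schedule r_i = 1 - (1 - m) * (n - i) / (n - 1), i = 1..n."""
    return [1.0 - (1.0 - min_scale) * (n - i) / (n - 1) for i in range(1, n + 1)]

def polygon_area(pts):
    """Shoelace formula for a simple polygon given as a list of (x, y) tuples."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def polygon_perimeter(pts):
    """Sum of the edge lengths, closing the polygon back to its first vertex."""
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:] + pts[:1]))

def shrink_offset(pts, rate, max_shr=20):
    """Same arithmetic as the offset line in shrink()."""
    rate = rate * rate
    area = polygon_area(pts)
    peri = polygon_perimeter(pts)
    return min(int(area * (1 - rate) / (peri + 0.001) + 0.5), max_shr)

# A 100 x 20 text box and the offsets for six rates from 0.4 to 1.0.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
for r in kernel_rates(6):
    print(round(r, 2), shrink_offset(box, r))
```

For this box, a rate of 1.0 gives an offset of 0 (the full text region) and a rate of 0.4 gives an offset of 7 pixels, so the smallest kernel is pulled well inside the original outline.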

Model walkthrough

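The detector is segmentation-based. The paper uses an FPN as the segmentation network with ResNet-50 as the backbone; see
ptocr/model/backbone/det_resnet.py. Part of the code is shown below: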

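def forward(self, x):
    # Stem: conv, BN, ReLU and max-pool.
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    # Four residual stages; every stage output is returned for the FPN head.
    x2 = self.layer1(x)
    x3 = self.layer2(x2)
    x4 = self.layer3(x3)
    x5 = self.layer4(x4)
    return x2, x3, x4, x5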

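The backbone returns four feature maps (x2, x3, x4, x5) at 1/4, 1/8, 1/16 and 1/32 of the input resolution. These four maps go into ptocr/model/head/det_FPNHead.py, shown below.
This part defines the layers that fuse the FPN maps from different depths: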

self.toplayer = ConvBnRelu(in_channels[-1], inner_channels, kernel_size=1, stride=1,padding=0,bias=bias)  # Reduce channels
# Smooth layers
self.smooth1 = ConvBnRelu(inner_channels, inner_channels, kernel_size=3, stride=1, padding=1,bias=bias)
self.smooth2 = ConvBnRelu(inner_channels, inner_channels, kernel_size=3, stride=1, padding=1,bias=bias)
self.smooth3 = ConvBnRelu(inner_channels, inner_channels, kernel_size=3, stride=1, padding=1,bias=bias)
# Lateral layers
self.latlayer1 = ConvBnRelu(in_channels[-2], inner_channels, kernel_size=1, stride=1, padding=0,bias=bias)
self.latlayer2 = ConvBnRelu(in_channels[-3], inner_channels, kernel_size=1, stride=1, padding=0,bias=bias)
self.latlayer3 = ConvBnRelu(in_channels[-4], inner_channels, kernel_size=1, stride=1, padding=0,bias=bias)
# Out map
self.conv_out = ConvBnRelu(inner_channels * 4, inner_channels, kernel_size=3, stride=1, padding=1,bias=bias)

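The config YAML must set in_channels and inner_channels. in_channels lists the channel counts of the backbone outputs at each scale (x2, x3, x4, x5), so if you change the backbone, update it to match. inner_channels can be chosen freely, but it is usually adjusted to suit the backbone.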
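As a rough illustration of what in_channels has to match, the sketch below lists the per-stage output channels of the standard ResNet definitions and the spatial size of each map for a given input. RESNET_STAGE_CHANNELS and feature_shapes are hypothetical names, and a customized backbone in the repository may use different channel counts.

```python
# Output channels of layer1..layer4 in the standard ResNet definitions.
RESNET_STAGE_CHANNELS = {
    "resnet18": [64, 128, 256, 512],
    "resnet34": [64, 128, 256, 512],
    "resnet50": [256, 512, 1024, 2048],
}
STRIDES = [4, 8, 16, 32]  # downsampling factor of x2, x3, x4, x5

def feature_shapes(backbone, height, width):
    """Return (channels, h, w) for x2..x5, assuming height and width divide by 32."""
    channels = RESNET_STAGE_CHANNELS[backbone]
    return [(c, height // s, width // s) for c, s in zip(channels, STRIDES)]

for shape in feature_shapes("resnet50", 640, 640):
    print(shape)
```

With ResNet-50 the head would be given in_channels of [256, 512, 1024, 2048]; switching to ResNet-18 would require [64, 128, 256, 512].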

def forward(self, x):
    c2, c3, c4, c5 = x
    # Top-down pathway: reduce channels, upsample the deeper map, add, then smooth.
    p5 = self.toplayer(c5)
    c4 = self.latlayer1(c4)
    p4 = upsample_add(p5, c4)
    p4 = self.smooth1(p4)
    c3 = self.latlayer2(c3)
    p3 = upsample_add(p4, c3)
    p3 = self.smooth2(p3)
    c2 = self.latlayer3(c2)
    p2 = upsample_add(p3, c2)
    p2 = self.smooth3(p2)
    # Bring every level to the size of p2 and concatenate along the channel axis.
    p3 = upsample(p3, p2)
    p4 = upsample(p4, p2)
    p5 = upsample(p5, p2)
    fuse = torch.cat((p2, p3, p4, p5), 1)
    fuse = self.conv_out(fuse)
    return fuse
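The forward pass above calls two helpers, upsample and upsample_add, whose definitions are not shown in this post. Here is a minimal sketch of what they might look like, assuming bilinear interpolation through F.interpolate; the repository's actual implementation may use a different mode.

```python
import torch
import torch.nn.functional as F

def upsample(x, ref):
    """Resize x to the spatial size of ref (assumed: bilinear interpolation)."""
    return F.interpolate(x, size=ref.shape[2:], mode="bilinear", align_corners=False)

def upsample_add(x, y):
    """Resize the deeper map x to the size of y and add the two element-wise."""
    return upsample(x, y) + y

# Toy maps with 4 channels at strides 32, 16, 8 and 4 of a 64 x 64 input.
p5 = torch.ones(1, 4, 2, 2)
c4 = torch.ones(1, 4, 4, 4)
c3 = torch.ones(1, 4, 8, 8)
c2 = torch.ones(1, 4, 16, 16)

p4 = upsample_add(p5, c4)
p3 = upsample_add(p4, c3)
p2 = upsample_add(p3, c2)
fuse = torch.cat((p2, upsample(p3, p2), upsample(p4, p2), upsample(p5, p2)), 1)
print(fuse.shape)
```

The concatenated tensor has four times the per-level channel count, which is why conv_out takes inner_channels * 4 input channels.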

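The FPN forward pass interpolates each deeper map up to the size of the next shallower one and fuses the two, then concatenates the maps from all scales, as described in the paper. That completes the FPN part. The result then goes into ptocr/model/segout/det_PSE_segout.py: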

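class SegDetector(nn.Module):
    def __init__(self, inner_channels=256, classes=7):
        super(SegDetector, self).__init__()
        # 1x1 convolution mapping the fused features to one channel per output map.
        self.binarize = nn.Conv2d(inner_channels, classes, 1, 1, 0)

    def forward(self, x, img):
        x = self.binarize(x)
        # Resize the maps to the size of the input image.
        x = upsample(x, img)
        if self.training:
            # Channel 0 is the text map, the remaining channels are the kernel maps.
            pre_batch = dict(pre_text=x[:, 0])
            pre_batch['pre_kernel'] = x[:, 1:]
            return pre_batch
        return x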

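This produces the segmentation maps and interpolates them to the size of the input image. There are 7 output maps: map 0 is the largest and corresponds to the full text regions, and the maps shrink progressively from there. The kernel is the smallest one, map 6, and its role is to separate densely packed text.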
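As a small illustration of this channel layout, the sketch below splits a 7-channel output into the text map and the kernel maps and binarizes them. It runs NumPy on random data; the sigmoid, the 0.5 threshold and the masking of the kernels by the text map are assumptions made for illustration, not taken from the repository's inference code, and split_and_binarize is a hypothetical name.

```python
import numpy as np

def split_and_binarize(logits, threshold=0.5):
    """logits: (7, H, W) array. Returns a boolean text map and boolean kernel maps."""
    score = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
    text = score[0] > threshold                # channel 0: full text regions
    kernels = (score[1:] > threshold) & text   # channels 1..6, kept inside the text map
    return text, kernels

rng = np.random.default_rng(0)
logits = rng.normal(size=(7, 8, 8))
text, kernels = split_and_binarize(logits)
print(text.shape, kernels.shape)
```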

Loss

The loss is the Dice loss commonly used in segmentation, defined in ptocr/model/loss/basical_loss.py:

class DiceLoss(nn.Module):
    def __init__(self, eps=1e-6):
        super(DiceLoss, self).__init__()
        self.eps = eps

    def forward(self, pre_score, gt_score, train_mask):
        # Flatten every sample to a vector of pixels.
        pre_score = pre_score.contiguous().view(pre_score.size()[0], -1)
        gt_score = gt_score.contiguous().view(gt_score.size()[0], -1)
        train_mask = train_mask.contiguous().view(train_mask.size()[0], -1)
        # Zero out the pixels excluded by the mask.
        pre_score = pre_score * train_mask
        gt_score = gt_score * train_mask
        # Dice coefficient per sample: 2 * |P * G| / (|P|^2 + |G|^2).
        a = torch.sum(pre_score * gt_score, 1)
        b = torch.sum(pre_score * pre_score, 1) + self.eps
        c = torch.sum(gt_score * gt_score, 1) + self.eps
        d = (2 * a) / (b + c)
        dice_loss = torch.mean(d)
        return 1 - dice_loss

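The loss takes three inputs: the 7 maps predicted by the network, the 7 label maps, and the train_mask for these maps. The train_mask excludes some pixels from the loss computation, so their contribution to the loss is 0.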
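A tiny worked example may help show the effect of train_mask. This is a NumPy re-implementation of the same formula for a single flattened map, written only for illustration; dice_loss is a hypothetical name and the inputs are made up.

```python
import numpy as np

def dice_loss(pred, gt, mask, eps=1e-6):
    """NumPy version of the Dice loss above for a single flattened map."""
    pred = pred * mask
    gt = gt * mask
    a = np.sum(pred * gt)
    b = np.sum(pred * pred) + eps
    c = np.sum(gt * gt) + eps
    return 1.0 - 2.0 * a / (b + c)

gt = np.array([1.0, 1.0, 0.0, 0.0])
mask = np.array([1.0, 1.0, 1.0, 0.0])   # the last pixel is ignored

perfect = dice_loss(np.array([1.0, 1.0, 0.0, 0.0]), gt, mask)
noisy_ignored = dice_loss(np.array([1.0, 1.0, 0.0, 1.0]), gt, mask)
wrong = dice_loss(np.array([0.0, 0.0, 1.0, 0.0]), gt, mask)
print(perfect, noisy_ignored, wrong)
```

The second prediction is wrong only at the masked pixel, so its loss is identical to that of the perfect prediction, while a prediction with no overlap at all gets a loss of 1.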
OHEM (online hard example mining) is also used, as follows:

def ohem_single(score, gt_text, training_mask):
    # Positive pixels that are not excluded by the training mask.
    pos_num = (int)(np.sum(gt_text > 0.5)) - (int)(np.sum((gt_text > 0.5) & (training_mask <= 0.5)))
    if pos_num == 0:
        # selected_mask = gt_text.copy() * 0 # may be not good
        selected_mask = training_mask
        selected_mask = selected_mask.reshape(1, selected_mask.shape[0], selected_mask.shape[1]).astype('float32')
        return selected_mask
    # Keep at most three negative pixels per positive pixel.
    neg_num = (int)(np.sum(gt_text <= 0.5))
    neg_num = (int)(min(pos_num * 3, neg_num))
    if neg_num == 0:
        selected_mask = training_mask
        selected_mask = selected_mask.reshape(1, selected_mask.shape[0], selected_mask.shape[1]).astype('float32')
        return selected_mask
    # Threshold = score of the neg_num-th highest-scoring negative pixel.
    neg_score = score[gt_text <= 0.5]
    neg_score_sorted = np.sort(-neg_score)
    threshold = -neg_score_sorted[neg_num - 1]
    # Select hard negatives and all positives, restricted to the training mask.
    selected_mask = ((score >= threshold) | (gt_text > 0.5)) & (training_mask > 0.5)
    selected_mask = selected_mask.reshape(1, selected_mask.shape[0], selected_mask.shape[1]).astype('float32')
    return selected_mask

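This keeps the negative pixels that rank highest by loss, with a positive-to-negative ratio of 1:3. If there are 3 positive pixels, 9 negative pixels are selected: the 9 with the largest loss. In the code the ranking is done on the predicted score, since a negative pixel with a higher score has a larger loss.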
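The selection can be traced on a small made-up example. The sketch below is a compact NumPy version of the core of ohem_single without the early-exit branches, so it assumes at least one positive and one negative pixel; select_mask and the input values are hypothetical.

```python
import numpy as np

def select_mask(score, gt_text, training_mask, ratio=3):
    """Core selection of ohem_single; assumes at least one positive and one negative."""
    pos_num = int(np.sum((gt_text > 0.5) & (training_mask > 0.5)))
    neg_num = min(pos_num * ratio, int(np.sum(gt_text <= 0.5)))
    neg_score = score[gt_text <= 0.5]
    # Score of the neg_num-th highest-scoring negative pixel.
    threshold = -np.sort(-neg_score)[neg_num - 1]
    return ((score >= threshold) | (gt_text > 0.5)) & (training_mask > 0.5)

score = np.array([[0.9, 0.8, 0.1, 0.7],
                  [0.2, 0.6, 0.3, 0.05]])
gt_text = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 0]])
training_mask = np.ones_like(gt_text)

selected = select_mask(score, gt_text, training_mask)
print(selected.astype(int))
```

One positive pixel allows 3 negatives, so the three highest-scoring background pixels (0.8, 0.7 and 0.6) are kept alongside the positive one. Because the comparison is score >= threshold, ties at the threshold can let slightly more than neg_num negatives through.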

Note: all figures in this post are from the paper.