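Deep Learning: Text Detection

This article summarizes and analyzes commonly used text detection models. Some of them I have actually run; others are analyzed from the papers and the related code. If you spot a mistake, corrections are welcome.

The following sections analyze each model in detail.

CTPN in Detail

Code: https://github.com/xiaofengShi/CHINESE-OCR

CTPN is currently a very widely used detection model for printed text.

CTPN is derived from Faster R-CNN. The table below compares the two.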

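Component	Faster R-CNN	CTPN
Base net	VGG16, VGG19, ResNet	VGG16; other CNN backbones also work
RPN prediction	Generated by convolution on the base net's prediction layer	A bidirectional RNN after the base net, followed by an FC layer
ROI	Built for object detection, a multi-class task; includes the ROI stage with its class loss and box regression	Text extraction is a binary task; there is no ROI stage or class loss, and the objectness loss and box regression are computed only at the RPN layer
Anchor	9 anchors in total: 3 aspect ratios x 3 scales	Fixed anchor width with 10 different heights
Batch	Only one sample per training step	Only one sample per training step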

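From CTPN's network design, it typically uses a pretrained VGG net and only detects horizontal text, so it is generally suited to printed text in standard layouts. Adding angle information to the box regression makes it possible to detect rotated text, as the EAST model does.

Code analysis

Network model

Let's look directly at CTPN's network code.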


# Imports assumed from the repository layout (TensorFlow 1.x API)
import tensorflow as tf

from .network import Network
from ..fast_rcnn.config import cfg


class VGGnet_train(Network):
    # Inherits from Network; for Network see:
    # https://github.com/xiaofengShi/CHINESE-OCR/blob/master/ctpn/lib/networks/network.py
    def __init__(self, trainable=True):
        self.inputs = []
        self.data = tf.placeholder(tf.float32, shape=[None, None, None, 3], name='data')
        self.im_info = tf.placeholder(tf.float32, shape=[None, 3], name='im_info')
        self.gt_boxes = tf.placeholder(tf.float32, shape=[None, 5], name='gt_boxes')
        self.gt_ishard = tf.placeholder(tf.int32, shape=[None], name='gt_ishard')
        self.dontcare_areas = tf.placeholder(tf.float32, shape=[None, 4], name='dontcare_areas')
        self.keep_prob = tf.placeholder(tf.float32)
        self.layers = dict({'data': self.data, 'im_info': self.im_info, 'gt_boxes': self.gt_boxes,
                            'gt_ishard': self.gt_ishard, 'dontcare_areas': self.dontcare_areas})
        self.trainable = trainable
        self.setup()

    def setup(self):
        # For text proposals there are 2 classes: text and background
        n_classes = cfg.NCLASSES
        # Initial anchor size; the paper uses 16
        anchor_scales = cfg.ANCHOR_SCALES
        _feat_stride = [16, ]

        # base net is vgg16
        # (feed/conv/max_pool are helper methods defined in Network)
        (self.feed('data')
             .conv(3, 3, 64, 1, 1, name='conv1_1')
             .conv(3, 3, 64, 1, 1, name='conv1_2')
             .max_pool(2, 2, 2, 2, padding='VALID', name='pool1')
             .conv(3, 3, 128, 1, 1, name='conv2_1')
             .conv(3, 3, 128, 1, 1, name='conv2_2')
             .max_pool(2, 2, 2, 2, padding='VALID', name='pool2')
             .conv(3, 3, 256, 1, 1, name='conv3_1')
             .conv(3, 3, 256, 1, 1, name='conv3_2')
             .conv(3, 3, 256, 1, 1, name='conv3_3')
             .max_pool(2, 2, 2, 2, padding='VALID', name='pool3')
             .conv(3, 3, 512, 1, 1, name='conv4_1')
             .conv(3, 3, 512, 1, 1, name='conv4_2')
             .conv(3, 3, 512, 1, 1, name='conv4_3')
             .max_pool(2, 2, 2, 2, padding='VALID', name='pool4')
             .conv(3, 3, 512, 1, 1, name='conv5_1')
             .conv(3, 3, 512, 1, 1, name='conv5_2')
             .conv(3, 3, 512, 1, 1, name='conv5_3'))

        # RPN
        # Convolve the previous feature map to produce a 512-channel feature map
        (self.feed('conv5_3').conv(3, 3, 512, 1, 1, name='rpn_conv/3x3'))

        # The last conv feature map has shape batch*h*w*512
        # Single-layer bidirectional LSTM, as in the original
        (self.feed('rpn_conv/3x3').Bilstm(512, 128, 512, name='lstm_o'))
        # After the BiLSTM the output shape is (N, H, W, 512)

        """
        Similar to Faster R-CNN. CTPN's RPN uses a bidirectional LSTM plus fully
        connected layers to obtain the predicted objectness probabilities and
        regression boxes, whereas Faster R-CNN generates them by convolution
        from the last layer of the base net.
        The LSTM output is used to compute the position offsets and class
        probabilities (object or not; the object category is not judged).
        Input shape is (N, H, W, 512), output shape is (N, H, W, int(d_o)).
        This layer can be treated as the final feature map in object detection.
        rpn_bbox_pred -- over the h*w grid, 4 position offsets per anchor
        rpn_cls_score -- over the h*w grid, 2 confidence scores per anchor,
                         judging whether it is an object
        """
        (self.feed('lstm_o').lstm_fc(512, len(anchor_scales) * 10 * 4, name='rpn_bbox_pred'))
        (self.feed('lstm_o').lstm_fc(512, len(anchor_scales) * 10 * 2, name='rpn_cls_score'))

        # generating training labels on the fly
        # output: rpn_labels(HxWxA, 2) rpn_bbox_targets(HxWxA, 4)
        #         rpn_bbox_inside_weights rpn_bbox_outside_weights
        # Label every anchor and compute its regression target (also in delta
        # form), plus the inside and outside weights
        (self.feed('rpn_cls_score', 'gt_boxes', 'gt_ishard', 'dontcare_areas', 'im_info')
             .anchor_target_layer(_feat_stride, anchor_scales, name='rpn-data'))

        # shape is (1, H, W, Ax2) -> (1, H, WxA, 2)
        # Apply softmax to the scores to get values between 0 and 1
        (self.feed('rpn_cls_score')
             .spatial_reshape_layer(2, name='rpn_cls_score_reshape')
             .spatial_softmax(name='rpn_cls_prob'))

        '''
        # the below is the rcnn net model from faster_rcnn
        # What follows is the ROI pooling part of Faster R-CNN
        (self.feed('rpn_cls_prob')
             .spatial_reshape_layer(len(anchor_scales) * 10 * 2, name='rpn_cls_prob_reshape'))
        self.feed('rpn_cls_prob_reshape', 'rpn_bbox_pred', 'im_info').proposal_layer(
            _feat_stride, anchor_scales, 'TRAIN', name='rpn_rois')
        (self.feed('rpn_rois', 'gt_boxes').proposal_target_layer(n_classes, name='roi-data'))
        # ========= RCNN ============
        (self.feed('conv5_3', 'roi-data')
             .roi_pool(7, 7, 1.0/16, name='pool_5')
             .fc(4096, name='fc6')
             .dropout(0.5, name='drop6')
             .fc(4096, name='fc7')
             .dropout(0.5, name='drop7')
             .fc(n_classes, relu=False, name='cls_score')
             .softmax(name='cls_prob'))
        (self.feed('drop7').fc(n_classes*4, relu=False, name='bbox_pred'))
        '''

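As shown, CTPN's network structure is adapted from Faster R-CNN. A VGG net extracts image features. The last feature map, of shape $[N, H, W, C]$, is reshaped to $[NH, W, C]$ so that each row becomes a sequence. A BLSTM then produces $[NH, W, 2D]$, where $D$ is the number of hidden units of the one-directional RNN. This is reshaped to $[NHW, 2D]$, mapped by a fully connected layer to $[NHW, C]$, and finally reshaped back to $[N, H, W, C]$. In this step the RNN links the CNN features along the width of the feature map. Next, the lstm_fc function produces the class prediction and the bounding-box regression for the anchors. Each point on this feature map generates A anchors, and each anchor has a class prediction and a box regression prediction: for class prediction each grid point produces 2A outputs, and for box regression each grid point produces 4A outputs.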
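To make the reshaping chain concrete, here is a small shape trace. It is only a sketch: the function name `ctpn_shape_trace` is hypothetical, and it assumes SAME-padded convolutions with four VALID 2x2 max-pools (total stride 16), D = 128 and a 512-unit FC output as in the `Bilstm(512, 128, 512)` call, and 10 anchors per location.

```python
def ctpn_shape_trace(n, img_h, img_w, hidden=128, fc_out=512, num_anchors=10):
    """Return (name, shape) pairs for the tensors in the CTPN head."""
    # Four 2x2 max-pools with VALID padding floor-divide H and W by 2 each.
    h, w = img_h, img_w
    for _ in range(4):
        h, w = h // 2, w // 2
    c = 512  # channels after rpn_conv/3x3
    return [
        ("conv feature map", (n, h, w, c)),
        ("rows as sequences", (n * h, w, c)),
        ("BLSTM output", (n * h, w, 2 * hidden)),
        ("flattened for FC", (n * h * w, 2 * hidden)),
        ("FC output", (n * h * w, fc_out)),
        ("reshaped back", (n, h, w, fc_out)),
        ("rpn_cls_score", (n, h, w, num_anchors * 2)),
        ("rpn_bbox_pred", (n, h, w, num_anchors * 4)),
    ]


for name, shape in ctpn_shape_trace(1, 600, 900):
    print(f"{name:>18}: {shape}")
```

For a 600x900 input this gives a 37x56 feature map, so every row of 56 positions is fed to the BLSTM as one sequence.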

The network structure is shown below.

[Figure: CTPN model structure]

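Anchor generation and selection

Within the whole model, the AnchorGen step deserves a detailed explanation; this is the well-known RPN. It is explained below together with the code: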


# -*- coding:utf-8 -*-
import numpy as np
import numpy.random as npr

from ..fast_rcnn.config import cfg
from bbox import bbox_overlaps, bbox_intersections

DEBUG = False


# Generate the basic anchor boxes
def generate_basic_anchors(sizes, base_size=16):
    base_anchor = np.array([0, 0, base_size - 1, base_size - 1], np.int32)
    anchors = np.zeros((len(sizes), 4), np.int32)
    index = 0
    for h, w in sizes:
        anchors[index] = scale_anchor(base_anchor, h, w)
        index += 1
    return anchors


# Build an anchor of the requested height and width around the base anchor
def scale_anchor(anchor, h, w):
    x_ctr = (anchor[0] + anchor[2]) * 0.5
    y_ctr = (anchor[1] + anchor[3]) * 0.5
    scaled_anchor = anchor.copy()
    scaled_anchor[0] = x_ctr - w / 2  # xmin
    scaled_anchor[2] = x_ctr + w / 2  # xmax
    scaled_anchor[1] = y_ctr - h / 2  # ymin
    scaled_anchor[3] = y_ctr + h / 2  # ymax
    return scaled_anchor


# Generate the anchor boxes
# Fixed width, varying heights
def generate_anchors(base_size=16, ratios=[0.5, 1, 2],
                     scales=2 ** np.arange(3, 6)):
    heights = [11, 16, 23, 33, 48, 68, 97, 139, 198, 283]
    widths = [16]
    sizes = []
    for h in heights:
        for w in widths:
            sizes.append((h, w))
    return generate_basic_anchors(sizes)


# Convert between anchors and ground truth, in the same way as the paper
def bbox_transform(ex_rois, gt_rois):
    """
    computes the distance from ground-truth boxes to the given boxes, normed by their size
    :param ex_rois: n * 4 numpy array, anchor boxes
    :param gt_rois: n * 4 numpy array, ground-truth boxes
    :return: deltas: n * 4 numpy array, ground-truth boxes
    """
    ex_widths = ex_rois[:, 2] - ex_rois[:, 0] + 1.0  # anchor width
    ex_heights = ex_rois[:, 3] - ex_rois[:, 1] + 1.0  # anchor height
    ex_ctr_x = ex_rois[:, 0] + 0.5 * ex_widths  # anchor center x
    ex_ctr_y = ex_rois[:, 1] + 0.5 * ex_heights  # anchor center y

    assert np.min(ex_widths) > 0.1 and np.min(ex_heights) > 0.1, \
        'Invalid boxes found: {} {}'.format(
            ex_rois[np.argmin(ex_widths), :], ex_rois[np.argmin(ex_heights), :])

    gt_widths = gt_rois[:, 2] - gt_rois[:, 0] + 1.0  # gt_box width
    gt_heights = gt_rois[:, 3] - gt_rois[:, 1] + 1.0  # gt_box height
    gt_ctr_x = gt_rois[:, 0] + 0.5 * gt_widths  # gt_box center x
    gt_ctr_y = gt_rois[:, 1] + 0.5 * gt_heights  # gt_box center y

    # warnings.catch_warnings()
    # warnings.filterwarnings('error')
    targets_dx = (gt_ctr_x - ex_ctr_x) / ex_widths  # (gt_c_x-a_c_x)
    targets_dy = (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = np.log(gt_widths / ex_widths)
    targets_dh = np.log(gt_heights / ex_heights)

    targets = np.vstack((targets_dx, targets_dy, targets_dw, targets_dh)).transpose()
    return targets


# Generate the anchors and their training targets
def anchor_target_layer(rpn_cls_score, gt_boxes, gt_ishard, dontcare_areas, im_info,
                        _feat_stride=[16, ], anchor_scales=[16, ]):
    """
    Assign anchors to ground-truth targets. Produces anchor classification
    labels and bounding-box regression targets.
    Parameters
    ----------
    rpn_cls_score: (1, H, W, Ax2) bg/fg scores of previous conv layer
    gt_boxes: (G, 5) vstack of [x1, y1, x2, y2, class]
    gt_ishard: (G, 1), 1 or 0 indicates difficult or not
    dontcare_areas: (D, 4), some areas may contain small objs but no labelling. D may be 0
    im_info: a list of [image_height, image_width, scale_ratios]
    _feat_stride: the downsampling ratio of feature map to the original input image
    anchor_scales: the scales to the basic_anchor (basic anchor is [16, 16])
    ----------
    Returns
    ----------
    rpn_labels : (HxWxA, 1), for each anchor, 0 denotes bg, 1 fg, -1 dontcare
    rpn_bbox_targets: (HxWxA, 4), distances of the anchors to the gt_boxes (may contain some transform)
                      that are the regression objectives
    rpn_bbox_inside_weights: (HxWxA, 4) weights of each box, mainly accepts hyper param in cfg
    rpn_bbox_outside_weights: (HxWxA, 4) used to balance the fg/bg,
                              because the numbers of bgs and fgs may differ significantly
    """
    # anchors is the [x_min,y_min,x_max,y_max]
    # Generate the basic anchors, 10 in total
    _anchors = generate_anchors(scales=np.array(anchor_scales))
    _num_anchors = _anchors.shape[0]  # 10 anchors

    # allow boxes to sit over the edge by a small amount
    _allowed_border = 0

    # Image info: height, width and scale ratio
    im_info = im_info[0]

    # Place the anchors on the feature map and add the shifts to get the
    # anchors' actual coordinates in the image
    """
    Algorithm:
    for each (H, W) location i
        generate A (here 10) anchor boxes centered on cell i
        apply predicted bbox deltas at cell i to each of the A anchors
    filter out-of-image anchors
    measure GT overlap
    """
    assert rpn_cls_score.shape[0] == 1, \
        'Only single item batches are supported'

    # map of shape (..., H, W)
    height, width = rpn_cls_score.shape[1:3]  # feature-map height and width

    # 1. Generate proposals from bbox deltas and shifted anchors
    shift_x = np.arange(0, width) * _feat_stride
    shift_y = np.arange(0, height) * _feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)  # in W H order
    # Offsets between feature-map cells and anchors in the real image
    # shifts forms a grid, shape [height*width, 4]
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                        shift_x.ravel(), shift_y.ravel())).transpose()
    A = _num_anchors  # 10 anchors
    K = shifts.shape[0]  # feature-map width times height
    # Generate A anchors for every point of the feature map, shape is [K,A,4]
    all_anchors = (_anchors.reshape((1, A, 4)) +
                   shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    all_anchors = all_anchors.reshape((K * A, 4))  # shape is (K*A,4)
    # A anchors per feature-map point
    total_anchors = int(K * A)

    # only keep anchors inside the image
    # Anchors vary in size, so anchors near the edges may cross the image
    # boundary. Drop those and keep the indices (into all_anchors) of the
    # anchors that lie fully inside the image.
    # anchors[:]=[x_min,y_min,x_max,y_max]
    inds_inside = np.where(
        (all_anchors[:, 0] >= -_allowed_border) &
        (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)  # height
    )[0]

    # keep only inside anchors
    anchors = all_anchors[inds_inside, :]
    # The anchors are now ready
    # --------------------------------------------------------------
    # label: 1 is positive, 0 is negative, -1 is dont care
    # (A)
    labels = np.empty((len(inds_inside),), dtype=np.float32)
    labels.fill(-1)  # initialize all labels to -1

    # overlaps between the anchors and the gt boxes
    # overlaps (ex, gt), shape is A x G
    # IoU = intersection area / union area of anchor box and ground-truth box,
    # used to decide whether an anchor is a positive sample
    # overlaps shape is [anchor.shape[0], gt_box.shape[0]]
    overlaps = bbox_overlaps(
        np.ascontiguousarray(anchors, dtype=np.float64),
        np.ascontiguousarray(gt_boxes, dtype=np.float64))
    # For each anchor, the gt box with the largest overlap
    argmax_overlaps = overlaps.argmax(axis=1)
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]
    # For each gt box, the anchor(s) with the largest overlap
    gt_argmax_overlaps = overlaps.argmax(axis=0)
    gt_max_overlaps = overlaps[gt_argmax_overlaps,
                               np.arange(overlaps.shape[1])]
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

    if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:
        # assign bg labels first so that positive labels can clobber them
        # Anchors with overlap below 0.3 are negatives, label 0
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0
    # -----------------------------------#
    # Positives: IoU above 0.7, plus the anchor with the highest IoU per gt
    # fg label: for each gt, anchor with highest overlap
    labels[gt_argmax_overlaps] = 1
    # fg label: above threshold IOU
    # Overlap of at least 0.7 counts as foreground
    labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1

    if cfg.TRAIN.RPN_CLOBBER_POSITIVES:
        # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

    # preclude dontcare areas
    # dontcare areas are not considered for now
    if dontcare_areas is not None and dontcare_areas.shape[0] > 0:
        # intersec shape is D x A
        intersecs = bbox_intersections(
            np.ascontiguousarray(dontcare_areas, dtype=np.float64),  # D x 4
            np.ascontiguousarray(anchors, dtype=np.float64)  # A x 4
        )
        intersecs_ = intersecs.sum(axis=0)  # A x 1
        labels[intersecs_ > cfg.TRAIN.DONTCARE_AREA_INTERSECTION_HI] = -1

    # Hard samples are not considered for now
    # preclude hard samples that are highly occlusioned, truncated or difficult to see
    if cfg.TRAIN.PRECLUDE_HARD_SAMPLES and gt_ishard is not None and gt_ishard.shape[0] > 0:
        assert gt_ishard.shape[0] == gt_boxes.shape[0]
        gt_ishard = gt_ishard.astype(int)
        gt_hardboxes = gt_boxes[gt_ishard == 1, :]
        if gt_hardboxes.shape[0] > 0:
            # H x A
            hard_overlaps = bbox_overlaps(
                np.ascontiguousarray(gt_hardboxes, dtype=np.float64),  # H x 4
                np.ascontiguousarray(anchors, dtype=np.float64))  # A x 4
            hard_max_overlaps = hard_overlaps.max(axis=0)  # (A)
            labels[hard_max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = -1
            max_intersec_label_inds = hard_overlaps.argmax(axis=1)  # H x 1
            labels[max_intersec_label_inds] = -1

    # subsample positive labels if we have too many
    # Cap the positives at 128; the excess is set to dontcare
    # TODO this may need revisiting later: with character fragments the
    # number of positives is large.
    num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE)
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(
            fg_inds, size=(len(fg_inds) - num_fg), replace=False)  # randomly drop some positives
        labels[disable_inds] = -1  # set to -1

    # subsample negative labels if we have too many
    # Positives plus negatives total 256, with at most 128 positives.
    # If there are fewer than 128 positives, negatives fill the gap up to 256.
    num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(
            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1
        # print "was %s inds, disabling %s, now %s inds" % (
        # len(bg_inds), len(disable_inds), np.sum(labels == 0))

    # Labels are done; now compute the rpn-box regression targets
    # --------------------------------------------------------------
    bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # Targets are the offsets between each anchor and its matched gt box
    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])

    # Inside weights: 1 for foreground, 0 otherwise
    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    bbox_inside_weights[labels == 1, :] = np.array(cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:
        # Uniform weights here: positives get 1, negatives get 0
        # uniform weighting of examples (given non-uniform sampling)
        # num_examples = np.sum(labels >= 0) + 1
        # positive_weights = np.ones((1, 4)) * 1.0 / num_examples
        # negative_weights = np.ones((1, 4)) * 1.0 / num_examples
        positive_weights = np.ones((1, 4))  # foreground is 1
        negative_weights = np.zeros((1, 4))  # background is 0
    else:
        assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
                (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
        positive_weights = (cfg.TRAIN.RPN_POSITIVE_WEIGHT /
                            (np.sum(labels == 1)) + 1)
        negative_weights = ((1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) /
                            (np.sum(labels == 0)) + 1)
    # Outside weights: 1 for foreground, 0 for background
    # bbox_outside_weights starts at 0; positions with label 0 get the
    # negative weight and positions with label 1 get the positive weight
    bbox_outside_weights[labels == 1, :] = positive_weights
    bbox_outside_weights[labels == 0, :] = negative_weights

    # map up to original set of anchors
    # The out-of-image anchors were dropped earlier; add them back now
    # inds_inside holds the indices into the original anchors
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)  # these anchors get label -1 (dontcare)
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)  # their targets are 0 (no value)
    bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)  # filled with 0
    bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)  # filled with 0

    # labels
    labels = labels.reshape((1, height, width, A))
    rpn_labels = labels

    # bbox_targets
    bbox_targets = bbox_targets.reshape((1, height, width, A * 4))
    rpn_bbox_targets = bbox_targets

    # bbox_inside_weights
    bbox_inside_weights = bbox_inside_weights.reshape((1, height, width, A * 4))
    rpn_bbox_inside_weights = bbox_inside_weights

    # bbox_outside_weights
    bbox_outside_weights = bbox_outside_weights.reshape((1, height, width, A * 4))
    rpn_bbox_outside_weights = bbox_outside_weights

    rpn_data = (rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights)
    return rpn_data


# Restore the full anchor set after the out-of-image anchors were excluded
def _unmap(data, count, inds, fill=0):
    """ Unmap a subset of item (data) back to the original set of items (of
    size count) """
    if len(data.shape) == 1:
        ret = np.empty((count,), dtype=np.float32)
        ret.fill(fill)
        ret[inds] = data
    else:
        ret = np.empty((count,) + data.shape[1:], dtype=np.float32)
        ret.fill(fill)
        ret[inds, :] = data
    return ret


# Compute the box offsets between anchors and gt
def _compute_targets(ex_rois, gt_rois):
    """Compute bounding-box regression targets for an image."""
    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 5
    return bbox_transform(ex_rois, gt_rois[:, :4]).astype(np.float32, copy=False)

The bbox functions are written in Cython (a .pyx file):


import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t


# Compute IoU
def bbox_overlaps(
        np.ndarray[DTYPE_t, ndim=2] boxes,
        np.ndarray[DTYPE_t, ndim=2] query_boxes):
    """
    Parameters
    ----------
    boxes: (N, 4) ndarray of float, anchor boxes
    query_boxes: (K, 4) ndarray of float, ground-truth boxes,
                 [x_min,y_min,x_max,y_max,class]
    Returns
    -------
    overlaps: (N, K) ndarray of overlap between boxes and query_boxes
    """
    cdef unsigned int N = boxes.shape[0]
    cdef unsigned int K = query_boxes.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] overlaps = np.zeros((N, K), dtype=DTYPE)
    cdef DTYPE_t iw, ih, box_area
    cdef DTYPE_t ua
    cdef unsigned int k, n
    for k in range(K):
        box_area = (
            (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
            (query_boxes[k, 3] - query_boxes[k, 1] + 1)
        )
        for n in range(N):
            # Horizontal intersection; iw is positive if one exists
            iw = (
                min(boxes[n, 2], query_boxes[k, 2]) -
                max(boxes[n, 0], query_boxes[k, 0]) + 1
            )
            if iw > 0:
                # Vertical intersection
                ih = (
                    min(boxes[n, 3], query_boxes[k, 3]) -
                    max(boxes[n, 1], query_boxes[k, 1]) + 1
                )
                if ih > 0:
                    # If the boxes intersect, compute the union area
                    ua = float(
                        (boxes[n, 2] - boxes[n, 0] + 1) *
                        (boxes[n, 3] - boxes[n, 1] + 1) +
                        box_area - iw * ih
                    )
                    # intersection area / union area
                    overlaps[n, k] = iw * ih / ua
    return overlaps


# Intersection of boxes and query_boxes as a fraction of the query box area
def bbox_intersections(
        np.ndarray[DTYPE_t, ndim=2] boxes,
        np.ndarray[DTYPE_t, ndim=2] query_boxes):
    """
    For each query box compute the intersection ratio covered by boxes
    ----------
    Parameters
    ----------
    boxes: (N, 4) ndarray of float
    query_boxes: (K, 4) ndarray of float
    Returns
    -------
    overlaps: (N, K) ndarray of intersec between boxes and query_boxes
    """
    cdef unsigned int N = boxes.shape[0]
    cdef unsigned int K = query_boxes.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] intersec = np.zeros((N, K), dtype=DTYPE)
    cdef DTYPE_t iw, ih, box_area
    cdef DTYPE_t ua
    cdef unsigned int k, n
    for k in range(K):
        box_area = (
            (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
            (query_boxes[k, 3] - query_boxes[k, 1] + 1)
        )
        for n in range(N):
            iw = (
                min(boxes[n, 2], query_boxes[k, 2]) -
                max(boxes[n, 0], query_boxes[k, 0]) + 1
            )
            if iw > 0:
                ih = (
                    min(boxes[n, 3], query_boxes[k, 3]) -
                    max(boxes[n, 1], query_boxes[k, 1]) + 1
                )
                if ih > 0:
                    intersec[n, k] = iw * ih / box_area
    return intersec

The comments in the code already explain the details. The anchor generation logic lives in anchor_target_layer.py.
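As a compact illustration of what `generate_anchors` and the shift grid produce, the following pure-Python sketch rebuilds the 10 base anchors and tiles them over a small feature map. The helper names `make_ctpn_anchors` and `shift_anchors` are hypothetical; `int()` is used because it truncates toward zero, which is assumed to match how the float coordinates are stored into the `int32` array in the listing above.

```python
def make_ctpn_anchors(base_size=16,
                      heights=(11, 16, 23, 33, 48, 68, 97, 139, 198, 283)):
    """Base anchors (x_min, y_min, x_max, y_max): fixed width, 10 heights."""
    x_ctr = y_ctr = (base_size - 1) * 0.5  # center of [0, 0, 15, 15] is 7.5
    w = base_size
    anchors = []
    for h in heights:
        anchors.append((int(x_ctr - w / 2), int(y_ctr - h / 2),
                        int(x_ctr + w / 2), int(y_ctr + h / 2)))
    return anchors


def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Tile the base anchors over every feature-map cell (row-major order)."""
    shifted = []
    for row in range(feat_h):
        for col in range(feat_w):
            dx, dy = col * stride, row * stride
            for x1, y1, x2, y2 in anchors:
                shifted.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return shifted


base = make_ctpn_anchors()
print(base[0], base[-1])                # shortest and tallest base anchor
print(len(shift_anchors(base, 2, 3)))   # 2 * 3 cells * 10 anchors
```

Every anchor spans exactly 16 pixels horizontally, which is why the model predicts text as a sequence of fixed-width vertical slices.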

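[Figure: anchors]

First, A anchors are generated at every cell of the feature map from the configured anchor heights and width. Some of these anchors cross the boundary of the original image, as the figure above shows; they are removed first, and the indices of the kept anchors within the full anchor set are recorded. IoU is then computed between the remaining inside anchors and the ground truth (when an anchor and a gt box intersect, IoU is the intersection area divided by the area of their union). Two rules decide which anchors are positive: an anchor whose IoU with a gt box reaches the threshold of 0.7 is positive; and for each gt box, the anchor with the highest IoU against it is also positive, so this step selects one anchor per gt box (more only on ties). Anchors whose best IoU is below 0.3 are negatives, and the rest are ignored. The positives and negatives are then subsampled to the configured counts, and the offsets between each anchor and its matched gt box are computed and used as the box regression labels. The offset computation between an anchor and a gt box is illustrated below.

[Figure: anchor and ground truth]

In the figure, red is the ground truth and black is the anchor box. First compute the center coordinates, width and height of the two rectangles; the formulas are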

$$
\begin{aligned}
target_x &= (GT_x - AN_x) / AN_{width} \\
target_y &= (GT_y - AN_y) / AN_{height} \\
target_w &= \log(GT_{width} / AN_{width}) \\
target_h &= \log(GT_{height} / AN_{height})
\end{aligned}
$$
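A worked instance of these four formulas, as a minimal sketch: `encode_box` is a hypothetical single-box version of `bbox_transform` from the listing, using the same `+ 1.0` width and height convention.

```python
import math


def encode_box(anchor, gt):
    """Regression target (dx, dy, dw, dh) for one anchor / gt pair.

    Boxes are (x_min, y_min, x_max, y_max) with inclusive pixel coordinates.
    """
    aw = anchor[2] - anchor[0] + 1.0
    ah = anchor[3] - anchor[1] + 1.0
    acx = anchor[0] + 0.5 * aw
    acy = anchor[1] + 0.5 * ah

    gw = gt[2] - gt[0] + 1.0
    gh = gt[3] - gt[1] + 1.0
    gcx = gt[0] + 0.5 * gw
    gcy = gt[1] + 0.5 * gh

    return ((gcx - acx) / aw, (gcy - acy) / ah,
            math.log(gw / aw), math.log(gh / ah))


# A 16x16 anchor against a gt box of the same width but twice the height:
# the center moves down by half an anchor height and the height doubles.
print(encode_box((0, 0, 15, 15), (0, 0, 15, 31)))
```

Here the x offset and width target are 0, the y offset is 0.5, and the height target is log 2.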

The whole flow is shown in the figure below.

[Figure: CTPN anchor generation flow]
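The labeling rules in this flow can be condensed into a few lines. This is a simplified pure-Python sketch, not the repository code: `iou` and `label_anchors` are hypothetical names, the thresholds 0.7 and 0.3 follow the comments in the listing, subsampling to 256 anchors is left out, and the `best > 0` guard is an addition so that a gt box overlapping nothing does not mark every anchor positive.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes, inclusive pixels."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) per anchor."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    best_per_anchor = [max(row) for row in overlaps]
    labels = [-1] * len(anchors)
    # Negatives first, so positives can overwrite them.
    for i, m in enumerate(best_per_anchor):
        if m < neg_thr:
            labels[i] = 0
    # Rule 2: for each gt box, its best-matching anchor(s) are positive.
    for j in range(len(gt_boxes)):
        best = max(row[j] for row in overlaps)
        if best > 0:  # guard added in this sketch
            for i, row in enumerate(overlaps):
                if row[j] == best:
                    labels[i] = 1
    # Rule 1: any anchor at or above the positive threshold is positive.
    for i, m in enumerate(best_per_anchor):
        if m >= pos_thr:
            labels[i] = 1
    return labels


anchors = [(0, 0, 15, 15), (0, 0, 15, 31), (100, 100, 115, 115), (0, 8, 15, 23)]
print(label_anchors(anchors, [(0, 0, 15, 30)]))
```

In this toy case the second anchor nearly coincides with the gt box and becomes positive, the far-away anchor is negative, and the two anchors with IoU near 0.5 are ignored.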

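Summary

This concludes my personal reading of the CTPN network structure alongside the code. The model was proposed in 2016 and is clearly heavily influenced by Faster R-CNN. CTPN has the following characteristics:

It uses a VGG base network, so ImageNet-pretrained weights are generally needed to make training faster and more stable.
The BiLSTM prevents the model from computing in parallel the way a CNN does, which hurts speed.
Anchors have a fixed width and varying heights, so they only suit horizontal text; changing the anchor design can make them compatible with vertical text as well.
The anchor width in the model is 16, so the detection granularity is bounded by this setting and box boundaries may be imprecise.
Because it uses the same anchor generation and prediction scheme as Faster R-CNN, the predicted values must be inverse-transformed at inference time to obtain the target boxes.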
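The inverse transform mentioned in the last point can be sketched as follows. `decode_box` is a hypothetical helper; it subtracts 1 from the far edges so that it exactly inverts the `+ 1.0` width convention of `bbox_transform`, which is a choice made for this sketch (some implementations omit that correction).

```python
import math


def decode_box(anchor, deltas):
    """Apply predicted (dx, dy, dw, dh) to an anchor to recover a box."""
    aw = anchor[2] - anchor[0] + 1.0
    ah = anchor[3] - anchor[1] + 1.0
    acx = anchor[0] + 0.5 * aw
    acy = anchor[1] + 0.5 * ah

    dx, dy, dw, dh = deltas
    cx = dx * aw + acx          # undo the center normalization
    cy = dy * ah + acy
    w = math.exp(dw) * aw       # undo the log-scale size encoding
    h = math.exp(dh) * ah

    return (cx - 0.5 * w, cy - 0.5 * h,
            cx + 0.5 * w - 1.0, cy + 0.5 * h - 1.0)


# Deltas meaning "same x and width, center half a height lower, twice as tall".
print(decode_box((0, 0, 15, 15), (0.0, 0.5, 0.0, math.log(2))))
```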

EAST

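Key ideas of the paper

It proposes a two-stage text detection pipeline, FCN + NMS, which removes the intermediate errors accumulated by multi-step pipelines and reduces detection time.
The model can detect at the word level as well as at the text-line level, and the detected shape can be a rotated rectangle (RBOX) or a general quadrilateral (QUAD).
It filters predicted boxes with Locality-Aware NMS.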
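To illustrate the Locality-Aware NMS idea, here is a simplified sketch: neighbouring predictions are first merged row by row with score-weighted averaging, and only the merged results go through standard NMS. For brevity it assumes axis-aligned boxes `(x1, y1, x2, y2, score)` instead of rotated quadrilaterals, boxes already sorted in row-first order, and a threshold of 0.5; all function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2, ...)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def weighted_merge(g, p):
    """Average coordinates weighted by score; the scores add up."""
    score = g[4] + p[4]
    coords = [(g[i] * g[4] + p[i] * p[4]) / score for i in range(4)]
    return (*coords, score)


def standard_nms(boxes, thr):
    keep = []
    for b in sorted(boxes, key=lambda box: box[4], reverse=True):
        if all(iou(b, k) <= thr for k in keep):
            keep.append(b)
    return keep


def locality_aware_nms(boxes, thr=0.5):
    merged, prev = [], None
    for g in boxes:  # assumed to be in row-first order
        if prev is not None and iou(g, prev) > thr:
            prev = weighted_merge(g, prev)
        else:
            if prev is not None:
                merged.append(prev)
            prev = g
    if prev is not None:
        merged.append(prev)
    return standard_nms(merged, thr)


dets = [(0, 0, 10, 10, 0.9), (1, 0, 11, 10, 0.6), (50, 50, 60, 60, 0.8)]
print(locality_aware_nms(dets))
```

Because dense per-pixel predictions from nearby pixels overlap heavily, the merging pass shrinks the candidate set a great deal before the quadratic standard NMS runs.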

The network structure is shown below.

[Figure: EAST model]

Pipeline

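First, a general-purpose network (PVANet in the paper; VGG16, ResNet and others can be used in practice) serves as the base net for feature extraction.

A few notes on PVANet: it mainly improves on VGG for object detection, targeting the base network of Faster R-CNN, and comprises the mCReLU, Inception and hyper-feature structures.

[Figure: PVANet]

The base network used in the paper is PVANet's base network, with the parameters shown below.

[Figure: PVANet parameters]

The mCReLU and Inception structures are shown below.

[Figure: PVANet mCReLU and Inception structures]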
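Based on the backbone above, feature maps are taken from different layers (their sizes are $\frac{1}{32}$, $\frac{1}{16}$, $\frac{1}{8}$ and $\frac{1}{4}$ of the input image), which gives feature maps at different scales. The purpose is to cope with the drastic variation in text-line scale: early stages (larger feature maps) can predict small text lines, and late stages (smaller feature maps) can predict large ones.
Feature merging layer: the extracted features are merged following the U-Net approach. Starting from the top feature of the extraction network, features are merged step by step according to the corresponding rules, enlarging the feature map at each step.
Output layer: it contains the text score and the text geometry. The geometry is either RBOX or QUAD. RBOX predicts the distances from the current point to the four sides of the gt box plus the angle $\theta$ of the gt box relative to the positive x direction of the image, five values in total, $(d_1, d_2, d_3, d_4, \theta)$. QUAD predicts the coordinates of the four corner points of the gt box, eight values in total. RBOX is illustrated below.
[Figure: EAST RBOX]

In the figure, $d_i$ is the distance from the current point to a side of the gt box. Knowing the distances from a fixed point to the four sides of a rectangle tells us where the rectangle is and how large it is, which determines the rectangle.

[Figure: EAST RBOX and QUAD outputs]

As shown, RBOX outputs 5 predicted values while QUAD outputs 8.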
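To show how five RBOX values pin down a rectangle, this sketch recovers the four corners from a pixel position and $(d_1, d_2, d_3, d_4, \theta)$. The side ordering (d1 top, d2 right, d3 bottom, d4 left), the image coordinate system with y pointing down, and the sign of the rotation are assumptions of this sketch; real implementations may use different conventions. `rbox_to_corners` is a hypothetical name.

```python
import math


def rbox_to_corners(x, y, d1, d2, d3, d4, theta):
    """Corners (top-left, top-right, bottom-right, bottom-left) of the box.

    (x, y) is the predicting pixel; d1..d4 are its distances to the top,
    right, bottom and left sides; theta rotates the box around (x, y).
    """
    # Corner offsets relative to the pixel, before rotation.
    offsets = [(-d4, -d1), (d2, -d1), (d2, d3), (-d4, d3)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t)
            for dx, dy in offsets]


# With theta = 0 the result is an axis-aligned box around the pixel.
print(rbox_to_corners(10, 10, 2, 3, 4, 5, 0.0))
```

Whatever the angle, the box width is d2 + d4 and its height is d1 + d3, which is the fact the IoU loss relies on.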

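The layers g and h are computed as given by the formulas in the figure.

g is the unpooling layer. Each application doubles the size of the feature map and serves mainly as upsampling. The paper upsamples with bilinear interpolation rather than deconvolution, which reduces computation but may lower the model's expressive power.
The upsampled feature map is merged with the f layer of the same size from the downsampling path, and a 1x1 conv reduces the number of channels after merging.
A 3x3 conv then produces the feature map of this stage.
These operations are repeated 3 times, and the final output has 32 channels.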
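The merging steps can be followed with a small shape trace. This sketch only tracks tensor shapes, not values. The function name `east_merge_trace` is hypothetical, the backbone channel counts passed in the example are made-up placeholders, and the per-stage output channels (128, 64, 32) are an assumption chosen to end at the 32 channels stated above. The input size is assumed to be divisible by 32.

```python
def east_merge_trace(img_h, img_w, f_channels, out_channels=(128, 64, 32)):
    """Trace (name, (H, W, C), channels_after_concat) through the merge branch.

    f_channels: channels of the backbone maps at 1/32, 1/16, 1/8, 1/4 scale.
    """
    h, w, c = img_h // 32, img_w // 32, f_channels[0]
    trace = [("h1", (h, w, c), None)]           # h1 is the deepest feature map
    for i in range(1, 4):
        h, w = h * 2, w * 2                     # g: unpooling doubles H and W
        concat_c = c + f_channels[i]            # concatenate with same-size f
        c = out_channels[i - 1]                 # conv1x1 reduces, conv3x3 keeps
        trace.append((f"h{i + 1}", (h, w, c), concat_c))
    return trace


for name, shape, concat_c in east_merge_trace(512, 512, (512, 256, 128, 64)):
    print(name, shape, "concat channels:", concat_c)
```

For a 512x512 input the final merged map is 128x128 with 32 channels, i.e. one prediction per 4x4 pixel block.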

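After the feature maps are merged, the prediction outputs are produced: 5 or 8 values depending on the box format.

Loss computation

The total loss consists of a classification loss and a regression loss:

$$L = L_s + \lambda_g L_g$$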

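For the classification loss the paper uses a balanced cross-entropy loss:

$$
L_s = \text{balanced-xent}(\hat{Y}, Y) = -\beta\, Y \log \hat{Y} - (1 - \beta)(1 - Y) \log(1 - \hat{Y}),
\qquad \text{where } \beta = 1 - \frac{\sum_{y \in Y} y}{|Y|}
$$

Here $\hat{Y}$ is the prediction and $Y$ is the label. Compared with ordinary cross-entropy, the balanced version reweights positive and negative samples: $\beta$ is the fraction of negative pixels, so the rarer positive pixels receive the larger weight.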
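A small numeric sketch of the balanced cross-entropy: `balanced_xent` is a hypothetical helper working on flat lists of 0/1 labels and predicted probabilities. Averaging over pixels and clipping probabilities with `eps` are choices made for this sketch.

```python
import math


def balanced_xent(labels, preds, eps=1e-7):
    """Return (beta, loss) for 0/1 labels and predicted probabilities."""
    n = len(labels)
    beta = 1.0 - sum(labels) / n  # fraction of negative pixels
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += (-beta * y * math.log(p)
                  - (1.0 - beta) * (1 - y) * math.log(1.0 - p))
    return beta, total / n


# One text pixel among four: the positive term is weighted by 0.75,
# each negative term by 0.25.
beta, loss = balanced_xent([1, 0, 0, 0], [0.9, 0.1, 0.2, 0.3])
print(beta, loss)
```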

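For the $L_g$ loss, since the RBOX geometry contains 5 predicted values, $(d_1, d_2, d_3, d_4, \theta)$, the loss is

$$
\begin{aligned}
L_g &= L_{AABB} + \lambda_\theta L_\theta \\
\text{where}\quad L_{AABB} &= -\log \mathrm{IoU}(\hat{R}, R^*) = -\log \frac{|\hat{R} \cap R^*|}{|\hat{R} \cup R^*|} \\
L_\theta &= 1 - \cos(\hat{\theta} - \theta^*)
\end{aligned}
$$

For the IoU loss, the paper computes the width and height of the intersection region as

$$
\begin{aligned}
w_i &= \min(\hat{d}_2, d_2^*) + \min(\hat{d}_4, d_4^*) \\
h_i &= \min(\hat{d}_1, d_1^*) + \min(\hat{d}_3, d_3^*)
\end{aligned}
$$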

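In fact this way of computing the intersection is problematic, as analyzed below.

[Figure: EAST IoU]

In the figure above, red is the gt box and blue is the prediction. If the angle is ignored, the formula is correct. It is exact only when the two boxes share the same angle; once the predicted and gt angles differ, the formula for the intersection area is no longer correct.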
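The geometry loss can be written out directly from the formulas. This sketch computes the intersection exactly as the min-based formula does, so it ignores the angle when measuring overlap and inherits the limitation discussed above. `rbox_geometry_loss` is a hypothetical name, the default `lambda_theta=10.0` is an assumed value, and all distances are assumed to be positive.

```python
import math


def rbox_geometry_loss(pred, gt, lambda_theta=10.0):
    """L_g = L_AABB + lambda_theta * L_theta for one pixel.

    pred and gt are (d1, d2, d3, d4, theta) tuples.
    """
    d1p, d2p, d3p, d4p, theta_p = pred
    d1g, d2g, d3g, d4g, theta_g = gt

    area_p = (d1p + d3p) * (d2p + d4p)
    area_g = (d1g + d3g) * (d2g + d4g)
    w_i = min(d2p, d2g) + min(d4p, d4g)   # intersection width
    h_i = min(d1p, d1g) + min(d3p, d3g)   # intersection height
    inter = w_i * h_i
    union = area_p + area_g - inter

    l_aabb = -math.log(inter / union)
    l_theta = 1.0 - math.cos(theta_p - theta_g)
    return l_aabb + lambda_theta * l_theta


# A prediction half the size of the gt box in each direction: IoU = 4 / 16.
print(rbox_geometry_loss((1, 1, 1, 1, 0.0), (2, 2, 2, 2, 0.0)))
```

With identical geometry the loss is 0; in the example the IoU is 0.25, so the loss equals log 4.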
