Updated 2019.12.10 (unfinished)

3. The Proposed Method

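First, Figure 1 gives an overview of our two-stage method:
In the first stage, adding SF-Net and MDA-Net makes the feature map carry more feature information and less noise. Because the angle parameter is position-sensitive, this stage still regresses horizontal boxes.
In the second stage, an improved five-parameter regression and a rotation non-maximum suppression (R-NMS) operation on each proposal yield the final detection results under arbitrary rotations.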

3.1 Finer Sampling and Feature Fusion Network (SF-Net)

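In our analysis, there are two main obstacles to detecting small objects: insufficient object feature information and inadequate anchor samples. The cause is that, owing to the pooling layers, small objects lose most of their feature information in the deep layers. Meanwhile, the larger sampling stride of high-level feature maps tends to skip small objects altogether, which results in insufficient sampling.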

3.1.1 Feature fusion

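It is generally accepted that low-level feature maps preserve the location information of small objects, while high-level feature maps contain higher-level semantic cues. Feature pyramid networks (FPN) [23], Top-Down Modulation (TDM) [35], and Reverse connection with Objectness prior Networks (RON) [21] are common feature fusion methods that combine high-level and low-level feature maps in different ways.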

3.1.2 Finer sampling

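Insufficient and imbalanced training samples hurt detection performance. By introducing the expected max overlapping (EMO) score, the authors of [45] compute the expected max intersection over union (IoU) between anchors and objects. They find that the smaller the anchor stride ($S_A$), the higher the EMO score, which statistically improves the average max IoU over all objects.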
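Figure 2 shows the results of sampling small objects with strides of 16 and 8, respectively. A smaller $S_A$ samples more high-quality anchors that capture small objects well, which helps both detector training and inference.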
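The effect of the anchor stride can be illustrated with a small experiment. The sketch below is not the EMO computation from [45]; it is a simplified stand-in that assumes a single square anchor per grid position and a square object, and all names (`mean_max_iou`, `obj_size`, `anchor_size`, `extent`) are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def square(cx, cy, size):
    """Axis-aligned square centered at (cx, cy)."""
    half = size / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def mean_max_iou(stride, obj_size=12, anchor_size=12, extent=64):
    """Average, over many object positions, of the best IoU with any anchor.

    Anchors are placed on a regular grid with the given stride (assumption:
    one square anchor per grid position).
    """
    anchors = [square(ax, ay, anchor_size)
               for ax in range(0, extent + 1, stride)
               for ay in range(0, extent + 1, stride)]
    best = []
    for ox in range(16, 48):          # object centers on a 1-pixel grid
        for oy in range(16, 48):
            obj = square(ox, oy, obj_size)
            best.append(max(iou(obj, a) for a in anchors))
    return sum(best) / len(best)


if __name__ == "__main__":
    print("stride 16:", mean_max_iou(16))
    print("stride  8:", mean_max_iou(8))
```

In this setup the stride-8 grid contains every stride-16 anchor position, so the best IoU per object can only stay the same or improve, and the average is strictly higher for stride 8.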

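Based on the above analysis, we design the finer sampling and feature fusion network (SF-Net), shown in Figure 3. In anchor-based detection frameworks, the value of $S_A$ equals the reduction factor of the feature map relative to the original image. In other words, $S_A$ can only be a power of 2. SF-Net solves this problem by changing the size of the feature map, which makes the choice of $S_A$ more flexible and allows more adaptive sampling. To reduce network parameters, SF-Net fuses only C3 and C4 from ResNet [16], balancing semantic information and location information while ignoring other, less relevant features. In simple terms, the first branch of SF-Net upsamples C4 so that its $S_A = S$, where S is the expected anchor stride. The second branch upsamples C3 to the same size. We then pass C3 through an inception structure to expand its receptive field and add semantic information. The inception structure contains convolution kernels of various aspect ratios to capture the diversity of object shapes. Finally, the new feature map F3 is obtained by element-wise addition of the two branches.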
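Table 1 lists the detection accuracy and training overhead on DOTA under different $S_A$. We find that the best $S_A$ depends on the specific dataset, especially on the size distribution of its small objects. In this paper, S is generally set to 6 as a trade-off between accuracy and speed.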
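Because SF-Net resizes the feature maps, the anchor stride no longer has to be a power of 2. A minimal sketch of the resulting sizes, assuming the resized map has `ceil(image_size / S)` cells per side (the rounding rule and the function names are assumptions, not taken from the paper). Note that the repository code shown further down simply resizes C4 to the spatial shape of C3 rather than to an arbitrary stride.

```python
import math


def target_feature_size(img_h, img_w, anchor_stride):
    """Spatial size a feature map would need so that one cell spans
    `anchor_stride` input pixels (rounding up is an assumption)."""
    return math.ceil(img_h / anchor_stride), math.ceil(img_w / anchor_stride)


def anchor_count(img_h, img_w, anchor_stride, anchors_per_location):
    """Total number of anchors generated on the resized feature map."""
    h, w = target_feature_size(img_h, img_w, anchor_stride)
    return h * w * anchors_per_location


if __name__ == "__main__":
    for stride in (16, 8, 6):
        print(stride, target_feature_size(800, 800, stride),
              anchor_count(800, 800, stride, anchors_per_location=9))
```

The 800x800 input and 9 anchors per location are placeholder values chosen only to show how quickly the anchor count grows as the stride shrinks, which is the training overhead that Table 1 weighs against accuracy.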

The code for this part is in ./libs/networks/resnet.py (test part):

def resnet_base(img_batch, scope_name, is_training=True):
    '''
    this code is derived from light-head rcnn.
    https://github.com/zengarden/light_head_rcnn
    It is convenient to freeze blocks. So we adapt this mode.
    '''
    if scope_name == 'resnet_v1_50':
        middle_num_units = 6
    elif scope_name == 'resnet_v1_101':
        middle_num_units = 23
    else:
        raise NotImplementedError('We only support resnet_v1_50 or resnet_v1_101 or mobilenetv2. '
                                  'Check your network name.')

    blocks = [resnet_v1_block('block1', base_depth=64, num_units=3, stride=2),
              resnet_v1_block('block2', base_depth=128, num_units=4, stride=2),
              # use stride 1 for the last conv4 layer.
              resnet_v1_block('block3', base_depth=256, num_units=middle_num_units, stride=1)]
              # when use fpn, stride list is [1, 2, 2]

    with slim.arg_scope(resnet_arg_scope(is_training=False)):
        with tf.variable_scope(scope_name, scope_name):
            # Do the first few layers manually, because 'SAME' padding can behave inconsistently
            # for images of different sizes: sometimes 0, sometimes 1
            net = resnet_utils.conv2d_same(img_batch, 64, 7, stride=2, scope='conv1')
            net = tf.pad(net, [[0, 0], [1, 1], [1, 1], [0, 0]])
            net = slim.max_pool2d(net, [3, 3], stride=2, padding='VALID', scope='pool1')

    not_freezed = [False] * cfgs.FIXED_BLOCKS + (4-cfgs.FIXED_BLOCKS)*[True]
    # Fixed_Blocks can be 1~3

    with slim.arg_scope(resnet_arg_scope(is_training=(is_training and not_freezed[0]))):
        C2, end_points_C2 = resnet_v1.resnet_v1(net,
                                                blocks[0:1],
                                                global_pool=False,
                                                include_root_block=False,
                                                scope=scope_name)

    with slim.arg_scope(resnet_arg_scope(is_training=(is_training and not_freezed[1]))):
        C3, end_points_C3 = resnet_v1.resnet_v1(C2,
                                                blocks[1:2],
                                                global_pool=False,
                                                include_root_block=False,
                                                scope=scope_name)

    with slim.arg_scope(resnet_arg_scope(is_training=(is_training and not_freezed[2]))):
        C4, _ = resnet_v1.resnet_v1(C3,
                                    blocks[2:3],
                                    global_pool=False,
                                    include_root_block=False,
                                    scope=scope_name)

        if cfgs.ADD_FUSION:
            C3_shape = tf.shape(end_points_C3['{}/block2/unit_3/bottleneck_v1'.format(scope_name)])
            C4 = tf.image.resize_bilinear(C4, (C3_shape[1], C3_shape[2]))
            _C3 = slim.conv2d(end_points_C3['{}/block2/unit_3/bottleneck_v1'.format(scope_name)],
                              1024, [3, 3],
                              trainable=is_training,
                              weights_initializer=cfgs.INITIALIZER,
                              activation_fn=tf.nn.relu,
                              scope='C3_conv3x3')
            C4 += _C3
            # C4_shape[1 90 160 1024]

3.2 Multi-Dimensional Attention Network

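Due to the complexity of real-world data such as aerial images, the proposals provided by the RPN may introduce a large amount of noise, as shown in Figure 4b. Excessive noise can overwhelm the object information, and the boundaries between objects become blurred (see Figure 4a), which leads to more missed detections and false alarms. It is therefore necessary to enhance object cues and weaken non-object information. Many attention structures [18, 17, 37, 38] have been proposed to deal with occlusion, noise, and blur. However, most of these methods are unsupervised, which makes it hard to guide the network toward a specific purpose.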
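To capture small objects against complex backgrounds more effectively, we design a supervised multi-dimensional attention learner (MDA-Net), shown in Figure 5. Specifically, in the pixel attention network, the feature map F3 passes through an inception structure with convolution kernels of different ratios, and a two-channel saliency map is then learned through a convolution operation (see Figure 4d). The two channels of the saliency map represent the foreground and background scores, respectively. A Softmax operation is applied to the saliency map, and one channel is selected and multiplied with F3. The result is a new information feature map A3, as shown in Figure 4c. Note that after Softmax the values of the saliency map lie in [0, 1], so it reduces noise and relatively enhances object information. Because the saliency map is continuous, non-object information is not removed completely, which helps retain some context information and improves robustness. To guide the network through this process, we use supervised learning. First, a binary map is easily obtained from the ground truth and used as the label (Figure 4e). The cross-entropy loss between the binary map and the saliency map is then used as the attention loss. In addition, we use SENet [18] as an auxiliary channel attention network with a reduction ratio of 16.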
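The pixel attention step can be sketched in plain Python on nested lists. This is an illustration only: the inception structure and the convolution that produce the saliency logits are omitted, which channel index counts as foreground is a convention that this sketch leaves to the caller, and the names `pixel_attention` and `attention_loss` are hypothetical.

```python
import math


def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def pixel_attention(features, saliency, channel=0):
    """Scale every feature vector by one softmax channel of the saliency map.

    features: H x W x C nested lists; saliency: H x W x 2 logits.
    """
    out = []
    for feat_row, sal_row in zip(features, saliency):
        out_row = []
        for feat, logits in zip(feat_row, sal_row):
            weight = softmax2(*logits)[channel]   # value in [0, 1]
            out_row.append([weight * v for v in feat])
        out.append(out_row)
    return out


def attention_loss(saliency, labels):
    """Mean pixel-wise softmax cross-entropy.

    labels: H x W class indices (0 or 1) derived from the binary mask.
    """
    total, count = 0.0, 0
    for sal_row, lab_row in zip(saliency, labels):
        for logits, label in zip(sal_row, lab_row):
            total -= math.log(softmax2(*logits)[label])
            count += 1
    return total / count


if __name__ == "__main__":
    features = [[[2.0, 4.0], [1.0, 3.0]]]          # 1 x 2 x 2
    saliency = [[[0.0, 0.0], [10.0, -10.0]]]       # 1 x 2 x 2 logits
    print(pixel_attention(features, saliency))
    print(attention_loss(saliency, [[1, 0]]))
```

Because the weights stay strictly between 0 and 1, background features are attenuated rather than zeroed, which is the context-preserving behavior the paragraph above describes.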

The code for this part is in ./libs/networks/resnet.py (remaining part).

It continues the code above:

        if cfgs.ADD_ATTENTION:
            with tf.variable_scope('build_C4_attention', regularizer=slim.l2_regularizer(cfgs.WEIGHT_DECAY)):
                add_heatmap(tf.expand_dims(tf.reduce_mean(C4, axis=-1), axis=-1), 'add_attention_before')
                C4_attention_layer = build_attention(C4, is_training)
                C4_attention = tf.nn.softmax(C4_attention_layer)
                C4_attention = C4_attention[:, :, :, 0]
                C4_attention = tf.expand_dims(C4_attention, axis=-1)
                add_heatmap(C4_attention, 'C4_attention')
                C4 = tf.multiply(C4_attention, C4)
                # C4 = SE_C4 * C4
                add_heatmap(tf.expand_dims(tf.reduce_mean(C4, axis=-1), axis=-1), 'add_attention_after')

    C4 = tf.Print(C4, [tf.shape(C4)], summarize=10, message='C4_shape')
    if cfgs.ADD_ATTENTION:
        return C4, C4_attention_layer
    else:
        return C4

3.3 Rotation Branch

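The RPN provides coarse proposals for the second stage. To speed up the RPN, during training we feed the 12,000 highest-scoring regression boxes into NMS and keep 2,000 of them as proposals. At test time, NMS extracts 300 proposals from 10,000 regression boxes.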
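The top-k selection followed by NMS can be sketched as follows. This is a plain-Python illustration of greedy NMS on horizontal boxes, not the TensorFlow call used in the repository; the function name `select_proposals` and its parameters are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, pre_nms_top_n, post_nms_top_n, iou_threshold):
    """Keep the best `pre_nms_top_n` boxes by score, then run greedy NMS
    and return at most `post_nms_top_n` indices into `boxes`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_n]
    keep = []
    for i in order:
        # A box survives if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
            if len(keep) == post_nms_top_n:
                break
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(select_proposals(boxes, scores, 12000, 2000, 0.5))
```

With the training settings quoted above, `pre_nms_top_n` would be 12,000 and `post_nms_top_n` 2,000. The IoU threshold of 0.5 in the usage example is a placeholder.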

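In the second stage, we use five parameters (x, y, w, h, θ) to represent an arbitrarily oriented rectangle. θ lies in $[-\pi/2, 0)$ and is defined as the acute angle to the x-axis, and the side of the rectangle that forms this angle is denoted w. This definition is consistent with OpenCV. For such rotated boxes, computing IoU on axis-aligned bounding boxes can give an inaccurate IoU for skewed, overlapping boxes and in turn degrade the box prediction. To address this, we propose an implementation of skew IoU computation [29] based on triangulation. We use rotation non-maximum suppression (R-NMS) as a post-processing operation based on skew IoU. Because object shapes in the dataset are diverse, we set different R-NMS thresholds for different categories. In addition, to make full use of the pre-trained ResNet weights, we replace the two fully connected layers fc6 and fc7 with the C5 block and a global average pooling (GAP) layer.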
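To make the skew IoU concrete, here is a minimal sketch. It does not use the triangulation approach of [29]; it swaps in Sutherland-Hodgman convex polygon clipping plus the shoelace area formula, which gives the same overlap for convex rectangles. Angles are in radians, w is assumed to lie along the rotated x-axis, and all function names are hypothetical.

```python
import math


def rbox_to_corners(x, y, w, h, theta):
    """Corner points of a rotated box, in counter-clockwise order."""
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        dx, dy = sx * w / 2.0, sy * h / 2.0
        corners.append((x + dx * c - dy * s, y + dx * s + dy * c))
    return corners


def _cross(a, b, p):
    """Positive when p lies to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def _clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex CCW `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inputs, output = output, []
        for j in range(len(inputs)):
            p, q = inputs[j - 1], inputs[j]
            dp, dq = _cross(a, b, p), _cross(a, b, q)
            if (dp >= 0) != (dq >= 0):      # edge p -> q crosses the clip line
                t = dp / (dp - dq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if dq >= 0:                     # q is on the inner side
                output.append(q)
    return output


def _area(poly):
    """Polygon area by the shoelace formula (0 for an empty polygon)."""
    total = 0.0
    for j in range(len(poly)):
        p, q = poly[j - 1], poly[j]
        total += p[0] * q[1] - q[0] * p[1]
    return abs(total) / 2.0


def skew_iou(box_a, box_b):
    """IoU of two rotated boxes given as (x, y, w, h, theta)."""
    inter = _area(_clip(rbox_to_corners(*box_a), rbox_to_corners(*box_b)))
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(skew_iou((0, 0, 2, 2, 0.0), (0, 0, 2, 2, -math.pi / 4)))
```

An R-NMS step would follow the usual greedy NMS loop with `skew_iou` in place of the horizontal IoU and a per-category threshold.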
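The regression targets of the rotated bounding box are defined as [equation image missing in the source],
where x, y, w, h, and θ denote the box's center coordinates, width, height, and angle, respectively. The variables x, $x_a$, and x' refer to the ground-truth box, the anchor box, and the predicted box, respectively (likewise for y, w, h, and θ).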
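For reference, a parameterization consistent with these definitions is written out below. It is a reconstruction from the symbol definitions and the standard anchor-based box encoding, with an additive angle offset, and not a copy of the original equation image. $t$ denotes the ground-truth targets and $t'$ the predicted offsets.

```latex
\begin{aligned}
t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, \\
t_w &= \log(w/w_a),   & t_h &= \log(h/h_a),   & t_\theta &= \theta - \theta_a, \\
t'_x &= (x' - x_a)/w_a, & t'_y &= (y' - y_a)/h_a, \\
t'_w &= \log(w'/w_a),   & t'_h &= \log(h'/h_a),   & t'_\theta &= \theta' - \theta_a.
\end{aligned}
```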

The code for this part is in ./libs/networks/build_whole_network.py (test part):

def build_whole_detection_network(self, input_img_batch, gtboxes_r_batch, gtboxes_h_batch, mask_batch):
        img_shape = tf.shape(input_img_batch)
        # 1. build base network
        if cfgs.ADD_ATTENTION:
            feature_to_cropped, C4_attention_layer = self.build_base_network(input_img_batch)
        else:
            feature_to_cropped = self.build_base_network(input_img_batch)
        # feature_to_cropped = A3  
        rpn_input = feature_to_cropped

build_base_network is the SF-Net and MDA-Net code shown earlier; it produces A3.

        # 2. build rpn
        with tf.variable_scope('build_rpn',regularizer=slim.l2_regularizer(cfgs.WEIGHT_DECAY)):
            rpn_cls_score, rpn_box_pred = build_rpn(rpn_input, self.num_anchors_per_location, self.is_training)
            rpn_box_pred = tf.reshape(rpn_box_pred, [-1, 4])
            rpn_cls_score = tf.reshape(rpn_cls_score, [-1, 2])
            rpn_cls_prob = slim.softmax(rpn_cls_score, scope='rpn_cls_prob')

build_rpn is listed after this function.

        # 3. generate_anchors
        featuremap_height, featuremap_width = tf.shape(feature_to_cropped)[1], tf.shape(feature_to_cropped)[2]
        featuremap_height = tf.cast(featuremap_height, tf.float32)
        featuremap_width = tf.cast(featuremap_width, tf.float32)

        anchors = anchor_utils.make_anchors(base_anchor_size=cfgs.BASE_ANCHOR_SIZE_LIST[0],
                                            anchor_scales=cfgs.ANCHOR_SCALES, anchor_ratios=cfgs.ANCHOR_RATIOS,
                                            featuremap_height=featuremap_height,
                                            featuremap_width=featuremap_width,
                                            stride=cfgs.ANCHOR_STRIDE,
                                            name="make_anchors_forRPN")
        # 4. postprocess rpn proposals. such as: decode, clip, NMS
        with tf.variable_scope('postprocess_RPN'):
            rois, roi_scores = postprocess_rpn_proposals(rpn_bbox_pred=rpn_box_pred,
                                                         rpn_cls_prob=rpn_cls_prob,
                                                         img_shape=img_shape,
                                                         anchors=anchors,
                                                         is_training=self.is_training)
            # rois shape [-1, 4]

postprocess_rpn_proposals is listed after this function.

        # -------------------------------------------------------------------------------------------------------------#
        #                                            Fast-RCNN                                                         #
        # -------------------------------------------------------------------------------------------------------------#

        # 5. build Fast-RCNN
        rois = tf.Print(rois, [tf.shape(rois)], 'rois shape', summarize=10)
        bbox_pred_h, cls_score_h, bbox_pred_r, cls_score_r = build_fastrcnn(feature_to_cropped=feature_to_cropped,
                                                                            rois=rois,
                                                                            img_shape=img_shape,
                                                                            base_network_name=self.base_network_name,
                                                                            is_training=self.is_training)
        # bbox_pred shape: [-1, 4*(cls_num+1)].
        # cls_score shape: [-1, cls_num+1]
        cls_prob_h = slim.softmax(cls_score_h, 'cls_prob_h')
        cls_prob_r = slim.softmax(cls_score_r, 'cls_prob_r')


        #  6. postprocess_fastrcnn
        if not self.is_training:
            final_boxes_h, final_scores_h, final_category_h = self.postprocess_fastrcnn_h(rois=rois,
                                                                                          bbox_ppred=bbox_pred_h,
                                                                                          scores=cls_prob_h,
                                                                                          img_shape=img_shape)
            final_boxes_r, final_scores_r, final_category_r = self.postprocess_fastrcnn_r(rois=rois,
                                                                                          bbox_ppred=bbox_pred_r,
                                                                                          scores=cls_prob_r,
                                                                                          img_shape=img_shape)
            return final_boxes_h, final_scores_h, final_category_h, final_boxes_r, final_scores_r, final_category_r

def build_base_network(self, input_img_batch):
        if self.base_network_name.startswith('resnet_v1'):
            return resnet.resnet_base(input_img_batch, scope_name=self.base_network_name, is_training=self.is_training)  # code from Sections 3.1 and 3.2
        elif self.base_network_name.startswith('MobilenetV2'):
            return mobilenet_v2.mobilenetv2_base(input_img_batch, is_training=self.is_training)
        else:
            raise ValueError('Sry, we only support resnet or mobilenet_v2')

# ./libs/networks/layer.py
def build_rpn(inputs, num_anchors_per_location, is_training):
    rpn_conv3x3 = slim.conv2d(inputs, 512, [cfgs.KERNEL_SIZE, cfgs.KERNEL_SIZE],
                              trainable=is_training,
                              weights_initializer=cfgs.INITIALIZER,
                              activation_fn=tf.nn.relu,
                              scope='rpn_conv/3x3')
    rpn_cls_score = slim.conv2d(rpn_conv3x3, num_anchors_per_location * 2, [1, 1], stride=1,
                                trainable=is_training, weights_initializer=cfgs.INITIALIZER,
                                activation_fn=None,
                                scope='rpn_cls_score')
    rpn_box_pred = slim.conv2d(rpn_conv3x3, num_anchors_per_location * 4, [1, 1], stride=1,
                               trainable=is_training, weights_initializer=cfgs.BBOX_INITIALIZER,
                               activation_fn=None,
                               scope='rpn_bbox_pred')
    return rpn_cls_score, rpn_box_pred

def postprocess_rpn_proposals(rpn_bbox_pred, rpn_cls_prob, img_shape, anchors, is_training):

    if is_training:
        pre_nms_topN = cfgs.RPN_TOP_K_NMS_TRAIN
        post_nms_topN = cfgs.RPN_MAXIMUM_PROPOSAL_TARIN
        nms_thresh = cfgs.RPN_NMS_IOU_THRESHOLD
    else:
        pre_nms_topN = cfgs.RPN_TOP_K_NMS_TEST
        post_nms_topN = cfgs.RPN_MAXIMUM_PROPOSAL_TEST
        nms_thresh = cfgs.RPN_NMS_IOU_THRESHOLD

    cls_prob = rpn_cls_prob[:, 1]
    # 1. decode boxes
    decode_boxes = encode_and_decode.decode_boxes(encode_boxes=rpn_bbox_pred,
                                                  reference_boxes=anchors,
                                                  scale_factors=cfgs.ANCHOR_SCALE_FACTORS)

    # 2. clip to img boundaries
    decode_boxes = boxes_utils.clip_boxes_to_img_boundaries(decode_boxes=decode_boxes,
                                                            img_shape=img_shape)

    # 3. get top N to NMS
    if pre_nms_topN > 0:
        pre_nms_topN = tf.minimum(pre_nms_topN, tf.shape(decode_boxes)[0], name='avoid_unenough_boxes')
        cls_prob, top_k_indices = tf.nn.top_k(cls_prob, k=pre_nms_topN)
        decode_boxes = tf.gather(decode_boxes, top_k_indices)

    # 4. NMS
    keep = tf.image.non_max_suppression(
        boxes=decode_boxes,
        scores=cls_prob,
        max_output_size=post_nms_topN,
        iou_threshold=nms_thresh)

    final_boxes = tf.gather(decode_boxes, keep)
    final_probs = tf.gather(cls_prob, keep)

    return final_boxes, final_probs

3.4 Loss Function

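The multi-task loss is defined as [equation image missing in the source],
where $N$ is the number of proposals, $t_n$ is the label of the object, $p_n$ is the probability distribution over the classes computed by Softmax, and $t_n'$ is a binary value ($t_n' = 1$ for foreground and $t_n' = 0$ for background; background boxes are not regressed). $v_{∗j}'$ is the predicted offset vector and $v_{∗j}$ is the ground-truth target vector. $u_{ij}$ and $u_{ij}'$ are the label and the prediction of a mask pixel, respectively. IoU is the overlap between the predicted box and the ground truth. The hyperparameters $λ_1$, $λ_2$, $λ_3$ control the trade-off. The classification loss $L_{cls}$ is Softmax cross-entropy. The regression loss $L_{reg}$ is the smooth L1 loss defined in [11], and the attention loss $L_{att}$ is pixel-wise Softmax cross-entropy.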
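For reference, a loss consistent with these symbol definitions and with the IoU factor discussed in this section can be written as below. It is a reconstruction, not a copy of the original equation image; in particular, normalizing the attention term by the saliency map height $h$ and width $w$ is an assumption, since the text does not define those symbols.

```latex
\begin{aligned}
L ={}& \frac{\lambda_1}{N}\sum_{n=1}^{N} t_n' \sum_{j\in\{x,y,w,h,\theta\}}
      \frac{L_{reg}(v_{nj}', v_{nj})}{|L_{reg}(v_{nj}', v_{nj})|}\,\bigl|-\log(IoU)\bigr| \\
     &+ \frac{\lambda_2}{h \times w}\sum_{i}^{h}\sum_{j}^{w} L_{att}(u_{ij}', u_{ij})
      + \frac{\lambda_3}{N}\sum_{n=1}^{N} L_{cls}(p_n, t_n)
\end{aligned}
```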
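In particular, the rotation angle has a boundary problem, as shown in Figure 6. The figure shows an ideal form of regression (the blue box rotates counterclockwise to become the red box), but because the angle is periodic, the loss in this case is very large. The model therefore has to regress in other, more complex ways (for example, rotating the blue box clockwise while scaling w and h), which makes regression harder, as shown in Figure 7a.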
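To better solve this problem, we introduce the IoU constant factor $\frac{|-\log(IoU)|}{|L_{reg}(v_{nj}',v_{nj})|}$ into the traditional smooth L1 loss, as shown in Equation 3. In the boundary case, the loss function is approximately $|-\log(IoU)| \approx 0$, which eliminates the sudden increase in loss, as shown in Figure 7b. The new regression loss can be divided into two parts: $\frac{L_{reg}(v_{nj}',v_{nj})}{|L_{reg}(v_{nj}',v_{nj})|}$ determines the direction of gradient propagation, and $|-\log(IoU)|$ determines the magnitude of the gradient. In addition, optimizing localization accuracy with IoU is consistent with IoU-dominated metrics, and it is more direct and effective than coordinate regression.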
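The forward value of this IoU-scaled loss can be sketched in a few lines. This is an illustration under stated assumptions: the smooth L1 terms are summed over the five offsets before the factor is applied, `eps` guards against division by zero and log of zero, and the function names are hypothetical. In a training framework the factor would be treated as a constant (no gradient flows through it), so that only the smooth L1 part sets the gradient direction.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 loss for a single scalar offset."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def iou_smooth_l1(pred_offsets, target_offsets, iou, eps=1e-6):
    """Smooth L1 rescaled so its value follows |-log(IoU)|.

    pred_offsets / target_offsets: the five offsets (x, y, w, h, theta).
    iou: skew IoU between the decoded predicted box and the ground truth.
    """
    reg = sum(smooth_l1(p, t) for p, t in zip(pred_offsets, target_offsets))
    # Constant factor |-log(IoU)| / |L_reg|; it carries the magnitude only.
    factor = abs(-math.log(max(iou, eps))) / max(reg, eps)
    return reg * factor


if __name__ == "__main__":
    # Boundary case: the angle offset differs a lot, yet the boxes overlap well.
    pred = [0.0, 0.0, 0.0, 0.0, 1.5]
    target = [0.0, 0.0, 0.0, 0.0, 0.0]
    plain = sum(smooth_l1(p, t) for p, t in zip(pred, target))
    print("plain smooth L1:", plain)
    print("IoU-smooth L1  :", iou_smooth_l1(pred, target, iou=0.95))
```

In the usage example the plain smooth L1 value is dominated by the angle term, while the rescaled value stays small because the overlap is high, which is the behavior described for Figure 7b.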