PrimA6D: Rotational Primitive Reconstruction
for Enhanced and Robust 6D Pose Estimation
Abstract
This paper presents a 6D object pose estimation method based on rotational primitive prediction, taking a single image as input. We address the 6D pose estimation of a known object relative to the camera from a single image under occlusion. Many recent state-of-the-art (SOTA) two-step approaches have exploited image keypoint extraction followed by PnP regression for pose estimation. Instead of relying on a bounding box or keypoints on the object, we propose to learn an orientation-induced primitive so as to achieve pose estimation accuracy regardless of the object size. We leverage a Variational AutoEncoder (VAE) to learn this underlying primitive and its associated keypoints. The keypoints inferred from the reconstructed primitive are then used for PnP-based orientation regression. Finally, we compute the translation in a separate localization module to complete the full 6D pose estimation. When evaluated on public datasets, the proposed method yields a notable improvement over existing approaches on the LINEMOD, Occlusion LINEMOD, and YCB-Video datasets. We further provide a synthetic-only training case showing performance comparable to existing methods that require real images in the training phase.

I. INTRODUCTION
6D object pose estimation aims to recover the orientation and translation of an object relative to the camera. Achieving an accurate, fast, and robust solution to 6D pose estimation has been essential for augmented reality [1], robotic tasks [2, 3, 4], and autonomous vehicles [5]. Although RGB-D cameras are widely used, image-based pose estimation remains essential when an RGB-D camera is not applicable.

Recent advances in deep learning-based object pose estimation have been groundbreaking. Direct end-to-end 6D pose regression suffered from the non-linearity of the rotation, and studies tackled this issue through orientation representation [6] or by combining Perspective-n-Point (PnP) and random sample consensus (RANSAC). The latter approaches reported current state-of-the-art (SOTA) performance by leveraging a two-step approach for pose estimation [1, 7, 8]. These approaches extract keypoints from an RGB image via deep learning and compute the object's 6D pose from 2D-3D correspondences (e.g., using the PnP algorithm).
We were also partially motivated by the finding in [9] that an Augmented AutoEncoder (AAE) can implicitly represent a rotation when trained from a 3D CAD model. The main advantage of this approach is that no manual annotation is needed to prepare ground-truth pose-labeled training images. While [9] only solved for orientation retrieval over a discretized orientation set, we reconstruct a primitive from which keypoints are extracted under occlusion and complete a full orientation regression.

Fig. 1: We train a VAE to reconstruct the object and the associated primitive, followed by keypoint extraction over the inferred primitive. Keypoints are extracted from each corner plus the center, which coincides with the object center (i.e., 21 keypoints in total).

This paper solves for direct 6D pose regression by introducing the novel primitive descriptor shown in Fig. 1. The proposed solution, primitive-associated 6D (PrimA6D) pose estimation, combines a direct and holistic understanding of an image with keypoint-based PnP for the final orientation regression. More specifically, we newly introduce a primitive decoder in addition to the VAE to increase the discriminability of the orientation inference. Via this step, like the aforementioned studies, we aim to find reliable keypoints in an image for pose regression, extracted from the reconstructed primitives. Differing from previous methods, our method presents the following contributions.

• We propose a novel 6D object pose estimation network called PrimA6D, which introduces a rotation primitive reconstruction and its associated keypoints to enhance the orientation inference.
• The proposed estimation scheme mitigates the manual effort of preparing label-annotated images for training. The proposed primitive is straightforward to generate for a given rotation from the 3D CAD model; the trained VAE allows us to learn from synthetic images via domain randomization, robust to occlusion and symmetry.
• We verify a meaningful performance improvement in 6D object pose estimation compared to existing SOTA 6D pose regression methods. In terms of ADD(-S), the proposed method achieves 97.62% on LINEMOD (PVNet 86.27%), 59.77% on Occlusion LINEMOD (PVNet 40.77%), and 94.43% on YCB-Video (PVNet 73.78%).
• The proposed method is more generalizable than existing methods, showing less performance variance between datasets. This superior generalization capability is essential for robotic manipulation and automation tasks.
Fig. 2: The reconstruction stage of PrimA6D. In the reconstruction stage, an encoder using ResNext50 [10] and a reconstruction decoder consisting of fully deconvolutional layers form the VAE. We reconstruct the object and the primitive in separate decoders. To increase the discriminability, we additionally introduce an adversarial loss and train using a GAN.
II. RELATED WORK

PnP-based approaches establish 2D-3D correspondences from keypoints or patches and solve for the pose from these correspondences. Keypoint-based approaches define specific features to infer the pose from an image, such as the corner points of the 3D bounding box. These approaches chiefly involve two steps: first, the 2D keypoints of an object in the image are computed, and then the object pose is regressed using the PnP algorithm. In BB8 [11], the authors predicted the 2D projections of the corner points of the object's 3D bounding box. Then, they computed the object pose from the 2D-3D correspondences using the PnP algorithm. Similarly, [12] additionally added the central point of the object to the keypoints of BB8 [11]. Unfortunately, [11, 12] revealed a weakness to occlusion, which prevents defining the correct bounding box of an object. To alleviate this issue, PVNet [1] predicted unit vectors that point to predefined keypoints for each pixel on the object. By computing the locations of these keypoints using unit-vector voting, the authors substantially improved the accuracy. However, these keypoint-based approaches may suffer when an object lacks texture, whereas the proposed primitive can be robustly reconstructed even for small and textureless objects.
Part-based approaches extract parts of an object called patches for pose regression. This strategy is frequently used with depth data. For instance, [13] jointly predicted the object labels and object coordinates for every pixel, then computed the object pose from the predicted 2D-3D correspondence matching. In [14], the authors extracted heatmaps of multiple keypoints from local patches and then computed the object pose using the PnP algorithm. Depth data may not always be available, and we propose using a single RGB image as input. Recently, [15] exploited an AutoEncoder (AE) to generate an image representing the 3D coordinate of each pixel. Together with the reconstructed image containing 3D information, the expected error per pixel was also estimated to sort out meaningful 2D-3D correspondences.

Unlike the aforementioned multi-step approaches, holistic approaches utilize the overall shape of the object appearing in an image to directly regress the object pose relative to the camera. In Deep-6DPose [16], the authors found the regions where the objects were located through a Region Proposal Network (RPN). Despite solving for both object detection and 6D pose estimation, their method was vulnerable to small and symmetrical objects since it relied on the detected regions. In SSD-6D [17], the authors used a discretized 6D pose space to represent the discriminable viewpoints. SSD-6D is a Single Shot Multibox Detector (SSD)-style pose estimation model that extends the work of [18]. Their method had a faster inference time than other methods, though it sacrificed accuracy.
Incorporating a mask has been widely adopted. For example, [2, 19, 3, 20] utilized a mask of the object to solve the pose estimation task. The well-known PoseCNN [19] required an Iterated Closest Point (ICP) refinement for better accuracy. Stemming from PoseCNN, [20] enabled real-time performance with RGB data. The authors of [19, 20] also reported meaningful results, but their works were sensitive to object occlusion. Tackling the occlusion issue, SilhoNet [2] constructed an occlusion-free mask, even from the occluded region, through a CNN. This type of mask-level restoration enhanced the performance under occlusion; however, it yielded low accuracy for symmetrical objects. Unlike these methods relying on masks, we exploit the reconstructed object not only to regress the translation but also to further tackle the occlusion issue.
All the abovementioned methods require true pose-labeled images for training. Another stream of studies has focused on leveraging latent codes via an AAE, without the need for labeled training images, to solve this issue. Without requiring pose labels for object 6D pose estimation, [9] utilized the latent code generated from the object recovery process using an AE. The object pose was determined by a similarity comparison with a predefined latent codebook for each pose, but the accuracy was dominated by the size of the codebook with its discretized orientation classes. We overcome this accuracy limitation and achieve substantially improved performance from a synthetic dataset.

Fig. 3: The rotation and translation inference. We use keypoints learned from the reconstructed primitive for the orientation estimation using PnP. In the translation inference stage, the network input is created using the output of the reconstruction decoder through object relocalization.

III. METHOD

Given an RGB image, we compute the 6D pose of an object relative to the camera in terms of the relative transform T^C_O between the camera coordinate frame C and the object coordinate frame O. The 6D pose is represented by quaternion and translation vectors. Similar to [15] and [9], the proposed method can be accompanied by other detection modules such as M2Det [21]. We use an image cropped with a bounding box detected by such methods as the input to our network.
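For concreteness, the sketch below (not part of the original paper) shows how the quaternion-and-translation output maps to the homogeneous transform T^C_O; the (x, y, z, w) quaternion convention and the SciPy helper are our assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(quat_xyzw, t):
    """Assemble T^C_O from a unit quaternion (x, y, z, w) and a translation (Tx, Ty, Tz)."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat(quat_xyzw).as_matrix()  # quaternion -> 3x3 rotation
    T[:3, 3] = t
    return T

# Example: identity rotation, object 0.5 m in front of the camera.
T_C_O = pose_to_matrix([0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.5])
```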
A. Training Set Preparation
As indicated earlier, our aim is to enable 6D pose regression without requiring real images. For each pair of a 3D CAD object and an associated camera pose, we prepare a training sample consisting of a bounding box, a rendered primitive, the target keypoints (the 20 corner points and a center point), and a rendered training image. The primitive images represent the rotation of the object, having a different color per axis as in Fig. 1.
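To illustrate how such target keypoints could be generated for each rendered pose, the following sketch projects the 21 primitive keypoints (20 corners plus the center) with a pinhole model; the keypoint layout, intrinsics, and function names are hypothetical and only indicate the kind of annotation produced automatically from the CAD model and camera pose.

```python
import numpy as np

def project_keypoints(X_obj, R, t, K):
    """Project (N, 3) object-frame keypoints into (N, 2) pixel coordinates."""
    X_cam = X_obj @ R.T + t          # object frame -> camera frame
    uvw = X_cam @ K.T                # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division

# Hypothetical primitive keypoints: 20 corner points plus the center as the 21st point.
corners = np.random.uniform(-0.05, 0.05, size=(20, 3))
keypoints_3d = np.vstack([corners, np.zeros((1, 3))])
K = np.array([[572.4, 0.0, 325.3],   # example intrinsics, LINEMOD-like values
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]])
keypoints_2d = project_keypoints(keypoints_3d, np.eye(3), np.array([0.0, 0.0, 0.6]), K)
```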
B. Object and Primitive Reconstruction
Using the training data, the first phase of the proposed method consists of object and primitive reconstruction, as shown in Fig. 2. We train a VAE having one encoder and two reconstruction decoders. This step is partially similar to the AAE in [9]. In [9], the authors verified that an AAE trained with a geometric augmentation technique learns a representation of the object's orientation and generates a latent code encoding the object orientation. We adopt the same technique in the training phase by rendering the training images with various poses of the object. The training images are augmented using the domain randomization technique [22] to bridge the gap between real and rendered images. Augmented images are fed into the encoder, and clean target images are obtained through the reconstruction decoders using the four losses described below. Differing from [9], our training strategy is to use a GAN by introducing a discriminator for the reconstructed object and primitive.
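A minimal PyTorch sketch of the reconstruction stack described above: a ResNext50 encoder producing a latent Gaussian and two deconvolutional decoders for the object and the primitive. The latent size, decoder depth, and 128x128 output resolution are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class Decoder(nn.Module):
    """Fully deconvolutional decoder: latent vector -> 128x128 RGB image (assumed size)."""
    def __init__(self, z_dim=128):
        super().__init__()
        self.fc = nn.Linear(z_dim, 256 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64x64
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 128x128
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 8, 8)
        return self.deconv(h)

class PrimA6DVAE(nn.Module):
    """One encoder and two decoders (object and primitive), trained as a VAE."""
    def __init__(self, z_dim=128):
        super().__init__()
        backbone = torchvision.models.resnext50_32x4d(pretrained=True)  # use weights=... on newer torchvision
        backbone.fc = nn.Identity()          # keep the 2048-d pooled feature
        self.encoder = backbone
        self.fc_mu = nn.Linear(2048, z_dim)
        self.fc_logvar = nn.Linear(2048, z_dim)
        self.obj_decoder = Decoder(z_dim)
        self.prim_decoder = Decoder(z_dim)

    def forward(self, x):
        feat = self.encoder(x)
        mu, logvar = self.fc_mu(feat), self.fc_logvar(feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.obj_decoder(z), self.prim_decoder(z), mu, logvar
```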

1) Object Reconstruction Loss: The first loss focuses on the reconstruction of an object by comparing pixel-wise errors. To prevent overfitting in the reconstruction stage, we use a top-k pixel-wise L2 loss.

Here, e_(i) is a function that extracts the pixel with the i-th largest error [23], and x and x̂ denote the target and predicted image pixels, respectively; we use K = 128 in this paper. The resulting latent code contains the orientation, but it is not discriminative enough for regression. To increase the discriminability, we introduce the rotation primitive decoder to enhance the code with respect to orientation.
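Since the loss equation itself is not reproduced in this text, the following is a hedged PyTorch sketch of the described top-k pixel-wise L2 loss (the mean reduction over the K selected pixels is an assumption):

```python
import torch

def topk_l2_loss(x_hat, x, k=128):
    """Top-k pixel-wise L2 loss: average the k largest per-pixel squared errors.

    x_hat, x : (B, C, H, W) predicted and target images.
    """
    err = ((x_hat - x) ** 2).sum(dim=1)   # (B, H, W) per-pixel squared L2 error
    err = err.flatten(start_dim=1)        # (B, H*W)
    topk, _ = torch.topk(err, k, dim=1)   # k largest errors per image
    return topk.mean()
```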

2) Primitive Reconstruction Loss: The rendered primitive image becomes the prediction target of the primitive decoder. Using the color-coded axes in the primitive image, we design a color-based axis-aligning loss function for the primitive decoder. When computing the loss over the channels, we consider a per-pixel intensity difference d_c for each channel c ∈ C = {R, G, B}.

Here, S_c measures the misalignment of each axis in terms of the per-channel pixel intensity difference. C_c measures the overall misalignment per channel and discerns between channels by penalizing the channel showing a larger error when the primitive reconstruction loss L_P is constructed. The weight α is a constant (α = 5 in this paper).

3) Overall VAE Loss: Using these two reconstruction losses together with the Kullback-Leibler (KL) divergence loss, we train the VAE to generate both an object image and a primitive image. By adopting the VAE, the network encodes an input image x into the parameters of a Gaussian distribution q(z|x). In doing so, we minimize the KL divergence between q(z|x) and N(0, I). The overall loss function for the reconstruction stack combines these two reconstruction losses with the KL divergence loss.
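The combined loss equation is not reproduced here; as a stand-in, the sketch below gives the closed-form KL term for a diagonal Gaussian against N(0, I) and a placeholder combination with the two reconstruction losses (the relative weighting is an assumption).

```python
import torch

def kl_divergence(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch."""
    return (-0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=1)).mean()

def reconstruction_stack_loss(loss_obj, loss_prim, mu, logvar, beta=1.0):
    """Object + primitive reconstruction losses plus the KL term; beta is a placeholder weight."""
    return loss_obj + loss_prim + beta * kl_divergence(mu, logvar)
```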

4) Adversarial Loss: We further aim to improve the quality of the reconstruction using the training strategy of a GAN. For instance, the reconstructed primitive sometimes becomes ambiguous, as shown in Fig. 5, and may result in incorrect keypoint learning. Regarding this issue, introducing an adversarial loss improved the primitive/object prediction accuracy and thereby the overall pose inference performance. For the target primitive/object image X and the reconstructed image X̂, we consider an adversarial loss to train with a GAN [24].
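The exact formulation from [24] is not shown in this text; as one common choice, a standard binary cross-entropy adversarial loss over the target image X and the reconstruction X̂ could look as follows (given purely as an assumption, with D a convolutional discriminator):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, X, X_hat):
    """Train D to score target images X as real (1) and reconstructions X_hat as fake (0)."""
    real_logits = D(X)
    fake_logits = D(X_hat.detach())  # detach so only the discriminator is updated here
    real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real + fake

def generator_adversarial_loss(D, X_hat):
    """Adversarial term added to the VAE reconstruction losses for the generator update."""
    logits = D(X_hat)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```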

C. Orientation Regression via PnP
From the reconstructed primitive, we estimate the rotation by first learning the keypoints from it. Using the keypoints generated during training set preparation, we adopt ResNet18 [25] and extract the keypoints of the reconstructed primitive with a pixel-wise L2 loss.

Since the keypoints consist of the corners and the center point of the object, which is the same as the primitive center, we utilize this estimated center point, C = x_c = (u_c, v_c), for the translation estimation.
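A sketch of the orientation regression step using OpenCV's PnP solver on the 21 extracted keypoints; whether the authors use plain PnP, EPnP, or a RANSAC variant is not stated in this section, so the solver flag is an assumption.

```python
import cv2
import numpy as np

def orientation_from_keypoints(kps_2d, kps_3d, K):
    """Recover the object rotation from 2D-3D keypoint correspondences via PnP.

    kps_2d : (21, 2) keypoints extracted from the reconstructed primitive
    kps_3d : (21, 3) corresponding 3D keypoints of the primitive model
    K      : (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec = cv2.solvePnP(
        kps_3d.astype(np.float64), kps_2d.astype(np.float64),
        K.astype(np.float64), distCoeffs=None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)  # axis-angle vector -> 3x3 rotation matrix
    return R
```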

D. Translation Regression
Given the estimated orientation, we then infer the 3D translation by computing the scene depth T_z of the object center (i.e., the primitive center). The reconstructed image is relocated into the original image using the bounding box of the object. This process is called object relocalization (Fig. 3). As the sample illustration in Fig. 4 shows, the reconstruction occurs without assuming that the bounding box is centered and is thereby robust to inaccurate object detection. The reconstruction phase handles an inaccurate bounding box location by properly reconstructing the object even at an off-centered position. The inflation scaling factor κ was chosen to handle detected bounding boxes with an Intersection over Union (IoU) of 0.75. A small κ fails to handle inaccuracy in the object detection phase, and a large κ might deteriorate the performance with increased empty space near the borders. Because this relocalization replaces an object with a reconstructed object, the translation can be regressed over occlusion-free objects. From this relocalized full image, we again use ResNet18 to regress the scene depth T_z to the object center.

Using T_z, the bounding box, and the camera intrinsic matrix K, we compute the T_x and T_y components of the translation from the estimated object center (u_c, v_c) as follows. The focal length f and principal point (u_p, v_p) are obtained from the calibration matrix, while the bounding box is known when preparing the training data.
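The back-projection implied by the last paragraph, written as a short sketch: T_x and T_y follow from the regressed depth T_z, the estimated center (u_c, v_c), and the intrinsics (splitting the focal length into fx and fy is our assumption; the paper refers to a single f).

```python
import numpy as np

def translation_from_depth(Tz, uc, vc, K):
    """Recover (Tx, Ty, Tz) from the regressed depth Tz and the object center (uc, vc):
       Tx = (uc - up) * Tz / fx,  Ty = (vc - vp) * Tz / fy."""
    fx, fy = K[0, 0], K[1, 1]
    up, vp = K[0, 2], K[1, 2]
    return np.array([(uc - up) * Tz / fx, (vc - vp) * Tz / fy, Tz])
```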

IV. EXPERIMENT
In this section, we evaluate the proposed method using three different datasets.
A. Dataset
In total, three datasets were used for the evaluation: LINEMOD [28], Occlusion LINEMOD [28], and YCB-Video [19]. The LINEMOD and Occlusion LINEMOD datasets are widely exploited benchmarks for 6D object pose estimation and cover various challenges, including occlusion and texture-less objects. The LINEMOD dataset comprises 13 objects, with 1,312 rendered images each for training and about 1,200 images per object for testing. The Occlusion LINEMOD dataset shares the training images with the LINEMOD dataset. For the test, it provides pose information for eight occluded objects. A recently released dataset, the YCB-Video dataset, is composed of 21 high-quality 3D models and offers 92 video sequences with per-frame annotations. These video sequences include various lighting conditions, capture noise, and occlusion.

B. Training Details
We sample 50,000 object poses and render 50,000 training images for each object along with the associated primitives. The rendered target images are further processed using the domain randomization proposed in [22]. Our networks are trained mainly on synthetic images, with an additional 180 real images for each of the three dataset types. All networks are trained with the Adam optimizer on a Titan V. The ResNet backbones are initialized with the pre-trained weights provided in PyTorch. All networks are trained with a batch size of 50 and a 0.0001 learning rate for 40 epochs.
C. Evaluation Metrics
(i) 2D projection error metric: To evaluate the pose in terms of the 2D projection error, we use the same metric as in [1] and measure the average pixel distance between the projections of the 3D model points and the image pixels. The estimate is considered correct if the error is smaller than 5 pixels.
(ii) 3D error metric: Similar to [19], two metrics are chosen for evaluation. We choose the average distance metric (ADD) [28], which computes the distance between the 3D model points transformed by the ground-truth pose and by the predicted pose, using the ADD(-S) metric. When the average distance is less than 10% of the 3D model diameter, the predicted 6D pose is considered correct. However, this ADD metric is not suitable for symmetric objects; for symmetric objects, the average distance is computed to the closest model points.
(iii) Pose estimation metric for the original LINEMOD dataset: We directly measure the mean absolute error (MAE) for rotation and translation in addition to the other metrics.

TABLE I: The accuracy in terms of the ADD(-S) metric for our method and the baseline methods on the LINEMOD dataset. Symmetric objects are marked with an asterisk (*) and small objects with a dagger (†). PoseCNN [19] provides only the average value.

TABLE II: The accuracy in terms of the 2D projection error (left) and the mean absolute error (MAE) (right) for our method and the baseline methods on the LINEMOD dataset. Symmetric objects are marked with an asterisk (*) and small objects with a dagger (†). Using the networks and weights provided by the authors, we evaluate the mean absolute error. Tekin does not provide weights for the phone model. Symmetric objects are excluded from the MAE evaluation because evaluating a symmetric object requires a shape comparison rather than a specific pose comparison.

TABLE III: The accuracy in terms of the ADD(-S) metric (left) and 2D projection error (right) for our method and the baseline methods on the Occlusion LINEMOD dataset. All values except Pix2Pose and HybridPose are imported from [1]. Symmetric objects are marked with an asterisk (*) and small objects with a dagger (†).
D. 6D Object Pose Estimation Evaluation
We evaluate the performance of our model on the three datasets and compare it against other SOTA methods from the holistic, keypoint-based, and part-based categories. We evaluate our method trained only on synthetic images (PrimA6D-S) and trained with 180 additional real images (PrimA6D-SR). The other existing methods are trained using additional real images. Qualitative results for all three datasets are shown in Fig. 7 and Fig. 8; please also refer to prima6d.mp4. We exclude the comparison with the case of no primitive decoder because the error is excessively large when no primitive decoder is used.
1) LINEMOD Dataset Results: Table I and Table II present the evaluation for all 13 objects in the LINEMOD dataset. Among the keypoint- and part-based methods, PVNet is a recently released method with SOTA performance and is therefore used as the baseline for comparison.

On the LINEMOD dataset, PrimA6D-SR outperformed PVNet in terms of the ADD(-S) score and presented comparable results for the 2D projection metric. This different behavior can be understood from the MAE metric in Table II. Provided the object center is projected properly into the image, the 2D projection metric is affected more by the accuracy of the rotation estimation, whereas ADD(-S) depends more critically on the translation inference performance. As can be seen, PVNet shows higher performance than PrimA6D in terms of the rotation MAE metric, which leads to the performance difference in the 2D projection metric. On the contrary, PrimA6D-S outperforms PVNet in the translation MAE metric, which yields the performance difference in the ADD(-S) score. Furthermore, PrimA6D(-S/-SR) performs well on small and textureless objects, which are weaknesses of PVNet, such as ape and cat. In particular, PrimA6D(-S/-SR) excels in the translation estimation.

TABLE IV: The accuracy in terms of the ADD(-S) metric (left) and 2D projection error (right) for our method and the baseline methods on the YCB-Video dataset. Symmetric objects are marked with an asterisk (*) and small objects with a dagger (†).

2) Occlusion LINEMOD Dataset Results: Table III lists the performance on the Occlusion LINEMOD dataset. Here, we only evaluate the performance of models that do not have a refinement step. As can be seen in the two tables, PrimA6D(-S/-SR) outperformed PVNet in terms of the ADD(-S) score and presented comparable results in terms of the 2D projection metric. As PrimA6D(-S/-SR) reconstructs the corresponding object and primitive, we can affirm a notable improvement in the occluded case. Moreover, PrimA6D(-S/-SR) demonstrates striking accuracy for objects that are difficult to recognize by other methods due to their small size, such as cat and holepuncher.
3) YCB-Video Dataset Results: We further evaluate the pose estimation performance over the YCB dataset, as shown in Table IV. PrimA6D(-S/-SR) achieved enhanced inference capability even when trained solely on the synthetic dataset (PrimA6D-S). The proposed method is capable of inferring the 6D pose of objects that are difficult to recognize by other methods, such as wood block.
Another notable point for the YCB dataset is the overall performance drop from LINEMOD; all approaches, including PVNet and ours, show lower accuracy on YCB than on LINEMOD. This is because of the challenging and realistic nature of the YCB dataset. Despite the challenging dataset, the proposed method shows the smallest performance drop from LINEMOD both in terms of ADD(-S) and 2D projection error, even when trained solely on synthetic data. This reveals the generalization capability of the proposed method, which other methods lack.
E. Ablation Studies
We further evaluate the proposed method by conducting ablation studies in terms of the effect of real images, the effect of training with a GAN, and the bounding box accuracy.
1) Effect of Real Images: The proposed method was designed to exploit synthetic data from a CAD model. However, considering the potential discrepancy between CAD and real data, additional weight refinement using real training images shows a very meaningful performance improvement. We train the network only on synthetic data (PrimA6D-S) and compare the performance improvement from adding real images to the training set (PrimA6D-SR). This comparison is provided for the three datasets. Obviously, we witness accuracy improvement in all metrics by adding real images to the training set.

2) Training with GAN: The adversarial loss and training strategy using a GAN provide a detailed refinement of the object and primitive reconstruction. As can be seen in Fig. 5, the reconstructed result shows a blurry and obscure reconstruction when a GAN is not used. This indicates a potential ambiguity in the rotation and translation estimation. Additional training with a GAN enables the VAE to build discriminability. Also, regarding symmetrical objects with ambiguity along one axis, incorporating adversarial training can resolve the potential axis ambiguity in the reconstructed primitive and completely form three orthogonal axes. In the experiment, the qualitative results confirmed that the ambiguous axis in the primitive becomes a clear axis orthogonal to the other two. Testing over Occlusion LINEMOD in Table III reveals this substantial improvement from the adversarial loss.

3) Effect of Bounding Box Accuracy: The input to our network is a cropped image from the object detection module, so the PrimA6D estimation performance may depend on the detection accuracy. We examined the effect of the detected bounding box accuracy by reducing the IoU with respect to the original bounding box in terms of ADD(-S), since this metric is more sensitive to the translation estimation accuracy. As shown in Table V, the performance drops when we reduce the IoU from 1.0 to 0.75. Comparing Table V against Table I and Table II, PrimA6D-SR still outperforms all other existing methods despite the performance drop. More notably, PrimA6D-S, trained only on synthetic data, shows remarkable results on the Occlusion LINEMOD data even with the reduced IoU.

F. Inference Time
Since our method is used after object detection, the total inference time is also affected by the runtime of the object detection model. For example, M2Det [21], which we use, runs at 33.4 frames per second (fps). For the 6D object pose estimation of a single object in an image, the total process takes 31 ms on average using a GTX 1080Ti GPU. The overall 6D pose estimation supports 16 to 17 fps.
G. Failure Cases
Fig. 6 shows failure cases when one axis is entirely obscured by the other two axes of the rotation primitive. This hidden primitive axis prevents the algorithm from extracting the correct keypoints, yielding a larger rotation error. In future work, we aim to include estimated depth from the primitive to resolve this issue.


TABLE V: Performance variation in terms of the ADD(-S) metric according to the IoU, showing the accuracy with respect to the bounding box. PVNet results are copied from Table I and Table II for easy comparison.

V. CONCLUSION
This paper reported on 6D pose estimation from a single RGB image by introducing novel primitive learning associated with each object. The paper presented a substantial improvement in pose estimation, even in the occluded case, and comparable performance even when using only synthetic images for training. The proposed method was validated using three public benchmark datasets, yielding SOTA performance for occluded and small objects.
