Consensus-based Optimization for 3D Human Pose Estimation in Camera Coordinates (2019)

Abstract

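The paper proposes 3D pose estimation in camera coordinates, which allows 2D annotated data and 3D poses to be combined effectively and extends naturally to multi-view fusion. The problem is split into two parts: predicting the pose in the image plane, in pixels, and predicting the absolute depth, in millimeters. For multi-view prediction from uncalibrated images, it proposes a consensus-based optimization algorithm, which requires a single monocular training procedure.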

Introduction

[7, 40, 49, 4, 2] are relative pose estimation methods: the root joint is centered at the origin and the remaining joints are estimated relative to that center. This limitation hinders generalization to multi-view scenarios, since predictions are not in camera coordinates. When estimations are relative to camera coordinates, predicted human poses can be easily projected from one view to another, as illustrated in Fig. 1.
[38, 18] treat the task as a regression problem and map the input image directly to a predicted pose in millimeters. The drawbacks are: (1) the same distance in pixels can correspond to different distances in millimeters, and the camera intrinsic parameters cannot be learned directly; (2) by predicting 3D poses directly in millimeters, the abundant images with 2D poses annotated in pixels cannot be easily exploited, since this 2D data has no associated 3D information. In addition, relative poses predicted from one camera cannot be easily projected into a different view, making it more difficult to handle occlusion in multi-view scenarios.
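
A short worked example may make drawback (1) concrete. Assuming an ideal pinhole camera with focal length $f$ in pixels (the symbols and numbers here are illustrative and not taken from the paper), a horizontal offset of $\Delta u$ pixels observed at depth $z$ corresponds to a metric offset $\Delta x$ of

$$\Delta x = \frac{\Delta u \, z}{f}$$

With $f = 1000$ px and $\Delta u = 100$ px, this gives $\Delta x = 200$ mm at $z = 2000$ mm but $\Delta x = 500$ mm at $z = 5000$ mm. The same pixel distance therefore maps to different millimeter distances unless the depth and the focal length are both accounted for.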

Main contributions

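1. The drawbacks above are addressed by recasting 3D estimation from a different perspective: predict coordinates (u, v) in the image plane, in pixels, and the absolute depth in millimeters.
2. 2D pose estimation and absolute depth estimation have both been addressed [3, 10, 8, 16], including on the absolute depth benchmark NYUv2 [24], but the two problems are usually treated as unrelated. Recasting 3D estimation as 2D plus depth makes it possible to combine 2D and 3D data effectively.
3. Handling occlusion: predictions from different views are combined through a learned consensus-based optimization, with all estimates expressed in camera coordinates.
4. Idea: take on the harder problem of absolute 3D estimation in order to overcome the limitations of relative 3D estimation. Predictions are made with respect to a static reference, such as the camera position, rather than the person's root joint.
5. Approach: (1) propose an absolute 3D human pose estimation method from monocular cameras; (2) propose a consensus-based optimization for multi-view absolute 3D human pose estimation from uncalibrated images, which requires only a single monocular training procedure.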

Related work

The relevant prior work covers monocular (relative and absolute) and multi-view 3D human pose estimation; see [35].

Monocular relative 3D human pose estimation

1. [38, 21, 37, 47, 28] directly predict relative 3D poses from images, which requires the model to learn a complex projection from 2D pixels to millimeters in three dimensions.
2. [17, 31, 43, 40, 21, 7] use 2D data directly during training by first learning a 2D pose estimator and then lifting 3D poses from the 2D estimations.

Monocular absolute 3D human pose estimation

1. [49] infers the distance to the camera assuming a normalized and constant body size, which is an unrealistic assumption.
2. [26] proposed predicting the depth of body joints individually. The drawback of this method is that it struggles to capture the human body structure, since errors in the estimated depth of individual joints can degrade the final pose.
3. [23] proposed a multi-person absolute pose estimation method that predicts the absolute distance from the person to the camera based on the area of the cropped 2D bounding box. [9, 8] show that not only the size of objects is important: the position of an object in the image is also an informative cue for predicting its depth.
4. This paper combines three kinds of information to predict the distance of the root joint: (1) the size of the bounding box (including its ratio); (2) the position in the image; (3) deep convolutional features that provide additional visual cues.

Multi-view 3D human pose estimation

For challenging cases of occlusion or cluttered backgrounds, multiple views can be decisive for disambiguating uncertain positions of body joints.
1. [4, 2, 6, 5, 12] explore the classical concept of pictorial structures from multi-view images.
2. [33, 29, 27] use deep neural networks to estimate relative 3D poses from a set of 2D predictions from different views. [29] proposed collecting 3D poses from 2D multi-view images, which are then used to learn a second model that performs 3D estimation. Since these methods estimate 3D from multi-view 2D only, they often require both intrinsic and extrinsic parameters, with the exception of [33], which estimates the calibration.
3. Drawbacks of existing methods: current multi-view approaches are still completely dependent on the camera intrinsic parameters and often require a complete calibration setup, which can be prohibitive in some circumstances. Available methods are also limited to the inference of 3D from multiple 2D predictions, requiring multi-view datasets for training.
4. Advantages of this paper: (1) it combines predictions from multiple calibrated cameras while requiring a single monocular training procedure; (2) it estimates camera calibration, both intrinsic and extrinsic, from multi-view images by a consensus-based optimization, without retraining the model.

Proposed method

1. Goal and approach. [Goal] Predict 3D human poses in absolute coordinates with respect to the camera position. [Method] (1) Predict each body joint in image pixel coordinates and in absolute depth, orthogonal to the image plane, in millimeters; (2) project the predicted pixel coordinates and depth to the world, considering a pinhole camera model.
2. The problem is decomposed into relative 3D pose estimation and absolute depth estimation.
[Motivation] A well cropped bounding box around the person is better for predicting the pose than the full image frame, since a better resolution can be attained and the person's scale is handled automatically by the image crop.
[Two key issues] (1) Human body structure constraints: because the person's position varies, learning the body structure directly from absolute coordinates is difficult, so providing a separate loss on the relative depth of each joint helps the network learn the human body structure. (2) Absolute depth estimation: estimating absolute depth from monocular images is difficult, especially from cropped regions. [9] shows that neural networks rely on both pictorial cues and geometric information to predict depth.
[Solution] (1) Predict 3D poses relative to a cropped region centered on the person, which makes it easier for the network to encode the human body structure; (2) predict absolute depth from a combination of local pictorial cues and the global position and size of the cropped region.
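
The pinhole projection step mentioned in the method can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the intrinsics `fx`, `fy`, `cx`, `cy` (focal lengths and principal point, in pixels) are assumptions introduced here, and lens distortion is ignored.

```python
import numpy as np


def backproject_uvd(uv, depth_mm, fx, fy, cx, cy):
    """Back-project pixel coordinates plus absolute depth to camera coordinates.

    uv:        (J, 2) pixel coordinates (u, v) of J joints.
    depth_mm:  (J,) absolute depth of each joint along the optical axis, in mm.
    fx, fy, cx, cy: assumed pinhole intrinsics, in pixels.
    Returns a (J, 3) array of (x, y, z) in millimeters.
    """
    uv = np.asarray(uv, dtype=float)
    z = np.asarray(depth_mm, dtype=float)
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def project_xyz(xyz, fx, fy, cx, cy):
    """Inverse operation: project camera-space points (mm) back to pixels."""
    xyz = np.asarray(xyz, dtype=float)
    u = fx * xyz[:, 0] / xyz[:, 2] + cx
    v = fy * xyz[:, 1] / xyz[:, 2] + cy
    return np.stack([u, v], axis=1)


# Illustrative values only: two joints, both 4 m from the camera.
pose_mm = backproject_uvd([[500, 500], [600, 400]], [4000, 4000],
                          fx=1000, fy=1000, cx=500, cy=500)
```

Because the output is expressed in camera coordinates, it can later be moved into another camera's frame with a rotation and a translation, which is what the multi-view part of the paper relies on.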

Network architecture

U-Nets [25] are widely used for human pose estimation due to their multi-scale processing capabilities, and ResNet [11] is well suited to producing CNN features. Because both precise pose predictions and informative visual features for absolute depth estimation are needed, the paper proposes the ResNet-U network: a ResNet cut at block 4f as backbone, followed by 2 U-blocks, as shown in Fig. 2. A few fully connected layers regress the absolute depth $\hat z_a$ and the confidence scores $\hat c$. Together these implement the function $F$.

3D human pose regression

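1. [Goal] Estimate the 3D pose relative to the cropped bounding box. Concretely, given the cropped image information within $\Omega$, predict the pixel coordinates $(\hat u_i,\hat v_i)$ in the image plane. Because it is hard to predict absolute depth from an arbitrarily cropped region, at this stage the depth of each body joint is predicted relative to the person's position. [Decomposed into two problems] Image plane pose estimation and body joints depth estimation.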
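
As a rough sketch of how crop-relative predictions could be turned back into full-image quantities, the snippet below assumes (hypothetically) that the network outputs joint coordinates normalized to [0, 1] within the crop and a per-joint depth offset in millimeters relative to the person. Neither the normalization nor the function names come from the notes above; they are chosen only to illustrate the bookkeeping.

```python
import numpy as np


def crop_to_image_uv(uv_norm, bbox):
    """Map crop-normalized joint coordinates to full-image pixel coordinates.

    uv_norm: (J, 2) coordinates in [0, 1] relative to the crop (assumed format).
    bbox:    (x0, y0, w, h) of the crop in full-image pixels.
    """
    uv_norm = np.asarray(uv_norm, dtype=float)
    x0, y0, w, h = bbox
    return uv_norm * np.array([w, h], dtype=float) + np.array([x0, y0], dtype=float)


def absolute_joint_depth(rel_depth_mm, person_depth_mm):
    """Add the person's absolute depth to each joint's relative depth (mm)."""
    return np.asarray(rel_depth_mm, dtype=float) + float(person_depth_mm)


# Illustrative values: a 50x80 px crop whose top-left corner is at (100, 200).
uv_px = crop_to_image_uv([[0.5, 0.5], [0.0, 1.0]], bbox=(100, 200, 50, 80))
z_mm = absolute_joint_depth([0.0, -150.0, 200.0], person_depth_mm=4000.0)
```

Keeping the pose relative to the crop means the network sees the person at a roughly constant scale, while the bounding box carries the information needed to place the prediction back in the full image.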

Relative UVD pose estimation

Absolute depth estimation

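Once the joint pixel coordinates and the depths relative to the person's position have been estimated, the absolute depth of the person with respect to the camera can be predicted. Two sources of information are used: (1) the bounding box position and size; (2) deep visual features.
The position and size of the bounding box provide global information about where the person is in the image and at what scale. Visual features extracted from the bounding box region, based on ResNet features, provide informative visual cues for refining the estimate of the absolute distance to the person.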
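
One way to picture the input to the depth regressor is as a concatenation of a small geometric descriptor and pooled visual features. The sketch below is an assumption-laden illustration: the descriptor layout, the global average pooling, and the names `bbox_descriptor` and `depth_regressor_input` are hypothetical and are not taken from the paper.

```python
import numpy as np


def bbox_descriptor(bbox, image_size):
    """Encode bounding box position, size, and aspect ratio.

    bbox:       (x0, y0, w, h) in pixels.
    image_size: (image_width, image_height) in pixels.
    Returns [center_x, center_y, width, height] normalized by the image size,
    followed by the width/height ratio of the box.
    """
    x0, y0, w, h = bbox
    img_w, img_h = image_size
    cx = (x0 + w / 2.0) / img_w
    cy = (y0 + h / 2.0) / img_h
    return np.array([cx, cy, w / img_w, h / img_h, w / h], dtype=float)


def depth_regressor_input(bbox, image_size, feature_map):
    """Concatenate the box descriptor with globally average-pooled features.

    feature_map: (H, W, C) array standing in for backbone features of the crop.
    """
    pooled = np.asarray(feature_map, dtype=float).mean(axis=(0, 1))
    return np.concatenate([bbox_descriptor(bbox, image_size), pooled])


# Illustrative values: a dummy 4x4x8 feature map in place of real ResNet features.
vec = depth_regressor_input((100, 200, 200, 400), (1000, 1000), np.ones((4, 4, 8)))
```

A vector like `vec` would then be passed to a few fully connected layers that output the absolute depth; those layers are omitted here.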

Absolute 3D pose reconstruction


Consensus-based optimization

1. One of the main advantages of estimating absolute instead of relative 3D poses is the possibility of projecting the predictions from one camera to another, simply by applying a rotation and a translation.
2. The paper proposes a consensus-based algorithm that can be applied to estimate both intrinsic and extrinsic parameters, resulting in a completely uncalibrated multi-view approach.
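
To illustrate the "rotation and translation" idea, the sketch below aligns the absolute 3D pose predicted in one camera with the pose predicted in another using a weighted Kabsch (orthogonal Procrustes) fit. This is a stand-in technique chosen for illustration, not the paper's consensus-based optimization: it recovers only a relative rotation and translation, does not estimate intrinsics, and the use of joint confidence scores as weights is an assumption.

```python
import numpy as np


def weighted_rigid_align(src, dst, weights=None):
    """Find R (3x3) and t (3,) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (J, 3) corresponding joints in two camera frames, in mm.
    weights:  optional (J,) non-negative weights, e.g. joint confidences.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.ones(len(src)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids and centered point sets.
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    S = src - mu_s
    D = dst - mu_d

    # Weighted cross-covariance and its SVD.
    H = (w[:, None] * S).T @ D
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t


def to_other_view(pose, R, t):
    """Express a (J, 3) pose in the other camera's coordinate frame."""
    return np.asarray(pose, dtype=float) @ R.T + t
```

Given poses `pose_a` and `pose_b` of the same person from two cameras, `R, t = weighted_rigid_align(pose_a, pose_b, conf)` followed by `to_other_view(pose_a, R, t)` would bring the first prediction into the second camera's frame, where the two estimates can be compared or combined.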

Body joint confidence scores


Experiment

Source: CSDN

Author: qq_43452156

Link: https://blog.csdn.net/qq_43452156/article/details/104668992
