SEFD : Learning to Distill Complex Pose and Occlusion


ChangHee Yang*1 Kyeongbo Kong*2 SungJun Min*1,3


Dongyoon Wee4 Ho-Deok Jang4 Geonho Cha4 SukJu Kang†1


1 Sogang University, 2 Pusan National University, 3 Samsung Electronics, 4 NAVER Cloud Corp


Main Paper                                 Main Code

My Image

Qualitative comparison between the baseline 3DCrowdNet and the proposed method, which distills SMPL overlapping edge features.
(a) and (b) show complex poses, and (c) and (d) show occluded situations.



Feature Distillation Learning


Conventional distillation methods for human pose estimation aim to lighten the student model. In contrast, our feature distillation transfers the ground-truth-based feature representations of the teacher model to the student model, so that the student can handle real-world conditions where ground truth is unavailable. It is also designed to reduce the structural gap between a simple edge map and the SMPL overlapping edge.




Abstract


This paper addresses the problem of three-dimensional (3D) human mesh estimation in complex poses and occluded situations. Although many improvements have been made in 3D human mesh estimation using the two-dimensional (2D) pose with occlusion between humans, occlusion from complex poses and other objects remains a consistent problem. Therefore, we propose the novel Skinned Multi-Person Linear (SMPL) Edge Feature Distillation (SEFD), which demonstrates robustness to complex poses and occlusions without increasing the number of parameters compared to the baseline model. The model generates an SMPL overlapping edge, similar to the ground truth, that contains the target person's boundary and occlusion information, and then performs feature distillation to a simple edge map. We also perform experiments on various benchmarks and demonstrate fidelity both qualitatively and quantitatively. Extensive experiments show that our method outperforms the state-of-the-art method by 2.8% in MPJPE and 1.9% in MPVPE on the 3DPW benchmark dataset in the presence of a domain gap. Our method is also superior on the 3DPW-OCC, 3DPW-PC, RH-Dataset, OCHuman, CrowdPose, and LSP datasets, in which occlusion, complex poses, and domain gaps exist.




SMPL Edge Feature Distillation (SEFD)


The figure below shows the overall flow of our method. It consists of four components: the Input Stage, the SMPL Edge Generator, Teacher Model Training, and Student Model Training. To train the Teacher Model, an SMPL edge map is first generated by the SMPL Edge Generator in the Input Stage. The generated SMPL edge map is then concatenated with the input image and used to train the Teacher Model. Once the Teacher Model is trained, only its encoder is used to train the encoder of the Student Model through feature distillation. The input to the Student Model is obtained by passing the input image through a simple edge detector (e.g., a Canny edge detector).


My Image
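As a minimal sketch of the Input Stage, the snippet below builds the teacher and student inputs by concatenating the image with an edge channel. The function names, Canny thresholds, and 4-channel layout are our own illustration and are not taken from the released code.

```python
import cv2
import numpy as np
import torch

def canny_edge_map(image_bgr, low=100, high=200):
    """Simple edge detector used for the student input (thresholds are illustrative)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edge = cv2.Canny(gray, low, high)
    return edge.astype(np.float32) / 255.0          # H x W in [0, 1]

def to_model_input(image_bgr, edge_map):
    """Concatenate the RGB image with a single-channel edge map -> 4-channel tensor."""
    img = torch.from_numpy(image_bgr).float().permute(2, 0, 1) / 255.0   # 3 x H x W
    edge = torch.from_numpy(edge_map).float().unsqueeze(0)               # 1 x H x W
    return torch.cat([img, edge], dim=0)                                 # 4 x H x W

# Example with a dummy image and a precomputed SMPL overlapping edge map:
image = np.zeros((256, 256, 3), dtype=np.uint8)
smpl_edge = np.zeros((256, 256), dtype=np.float32)             # from the SMPL Edge Generator
teacher_input = to_model_input(image, smpl_edge)               # teacher: image + SMPL edge
student_input = to_model_input(image, canny_edge_map(image))   # student: image + Canny edge
```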

To elaborate further, the SMPL Edge Generator consists of Projection, Edge Detection, Adaptive Dilation, and Overlap; an explanation of adaptive dilation is provided below. For the feature distillation loss, we use a log-softmax loss, which we found to work well, and for the feature connection we distill the 3rd and 4th feature maps.
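One plausible reading of this log-softmax loss is a KL-style objective between softmax-normalized teacher and student feature maps; the exact normalization axis and weighting used in SEFD may differ, so treat the following as a sketch rather than the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

def log_softmax_distill_loss(student_feat, teacher_feat, dim=1):
    """KL-divergence-style distillation between softmax-normalized feature maps.

    student_feat / teacher_feat: (B, C, H, W) feature maps from matching encoder stages.
    """
    log_p_student = F.log_softmax(student_feat, dim=dim)
    p_teacher = F.softmax(teacher_feat.detach(), dim=dim)    # teacher is frozen
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Distill only the 3rd and 4th feature maps of the encoder:
# distill_loss = sum(log_softmax_distill_loss(s, t)
#                    for s, t in zip(student_feats[2:4], teacher_feats[2:4]))
```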
We have added a GIF animation below for your understanding. Please check it out if you need further clarification.

My Image

The above image shows the SMPL edge generator performing adaptive dilation. The SMPL edge generator consists of four stages in total: mesh-to-image projection, edge detection, adaptive dilation, and overlap. The following image illustrates the motivation behind adaptive dilation.


My ImageMy Image

The image shows, from left to right, the input image, the mesh result for the pseudo ground truth in MSCOCO, the result of the Canny edge, the result of the Canny edge with dilation kernel 5, and the result of the Canny edge with dilation kernel 9. From result (c), we can observe that, even at a very small scale, the human form and pose can still be inferred without dilation. However, with a kernel size of 5 or more, the structural information of the person is distorted. Therefore, we realized the need to adjust the dilation adaptively from small to large scales, and we solved this problem by selecting the dilation kernel according to the area of the bounding box, as shown in the adaptive dilation table.
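Below is a rough sketch of the edge detection, adaptive dilation, and overlap stages, assuming each person's SMPL mesh has already been projected into a binary silhouette mask. The area thresholds and kernel sizes here are illustrative placeholders, not the values from the paper's table.

```python
import cv2
import numpy as np

def adaptive_kernel(bbox_area, image_area):
    """Pick a dilation kernel size from the relative bounding-box area.

    The thresholds and kernels below are illustrative; the paper selects the
    kernel from a lookup table over bounding-box area (see the table below).
    """
    ratio = bbox_area / image_area
    if ratio < 0.05:
        return 1            # very small person: no dilation
    elif ratio < 0.15:
        return 3
    elif ratio < 0.30:
        return 5
    return 9

def smpl_overlap_edge(person_masks, person_bboxes, image_shape):
    """Edge detection + adaptive dilation per person, then overlap into one map.

    person_masks: list of H x W uint8 silhouettes (values 0/1) obtained by
    projecting each person's SMPL mesh into the image (projection is assumed done).
    person_bboxes: list of (x, y, w, h) boxes matching the masks.
    """
    h, w = image_shape
    overlap = np.zeros((h, w), dtype=np.uint8)
    for mask, (x, y, bw, bh) in zip(person_masks, person_bboxes):
        edge = cv2.Canny(mask * 255, 100, 200)                  # edge detection
        k = adaptive_kernel(bw * bh, h * w)                     # adaptive dilation
        if k > 1:
            edge = cv2.dilate(edge, np.ones((k, k), np.uint8))
        overlap = np.maximum(overlap, edge)                     # overlap stage
    return overlap

# Example with one dummy person silhouette:
mask = np.zeros((256, 256), np.uint8)
mask[64:192, 96:160] = 1
edge_map = smpl_overlap_edge([mask], [(96, 64, 64, 128)], (256, 256))
```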


My Image

The above image was created to aid understanding of the training process through a video. When training the teacher model, a GT SMPL edge map is created using the SMPL edge generator. The input image is then concatenated with the GT SMPL edge map, and the teacher model is trained on it. Once the teacher model is adequately trained, only the teacher encoder is used to train the student model. For the student, the input image is converted into a noisy edge map using a simple edge detector and concatenated with the input image for training. During this process, only the 3rd and 4th feature maps from the encoder are distilled using the log-softmax loss. Through this process, unnecessary boundaries in the noisy edges are suppressed, and only the boundaries deemed necessary by the teacher model are used to train the student model.
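A compact sketch of the distillation step described above, using a toy encoder as a stand-in for the real backbone. Only the 3rd and 4th feature maps are distilled; the baseline's mesh-regression losses are omitted for brevity, and all module and variable names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def log_softmax_distill_loss(s, t):
    # Same KL-style log-softmax loss sketched earlier.
    return F.kl_div(F.log_softmax(s, dim=1), F.softmax(t.detach(), dim=1),
                    reduction="batchmean")

class TinyEncoder(nn.Module):
    """Stand-in for the backbone encoder; returns the four stage-wise feature maps."""
    def __init__(self, in_ch=4):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for c in (16, 32, 64, 128):
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True)))
            prev = c

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

teacher = TinyEncoder().eval()           # already trained with image + GT SMPL edge
student = TinyEncoder()                  # trained with image + noisy Canny edge
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

teacher_input = torch.randn(2, 4, 256, 256)   # image concatenated with GT SMPL edge map
student_input = torch.randn(2, 4, 256, 256)   # image concatenated with noisy Canny edge map

with torch.no_grad():
    t_feats = teacher(teacher_input)
s_feats = student(student_input)

# Distill only the 3rd and 4th feature maps; in the full model this term is
# added to the usual mesh-regression losses of the baseline.
distill_loss = sum(log_softmax_distill_loss(s, t)
                   for s, t in zip(s_feats[2:4], t_feats[2:4]))
optimizer.zero_grad()
distill_loss.backward()
optimizer.step()
```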





Adaptive Dilation Results


The table above shows the dilation kernel size for different bounding box areas. We used a histogram to determine the criteria for adaptive dilation, and the results are shown in the figure below. The blue color represents the result obtained when the dilation kernel size is fixed to 5, and the red color represents the result obtained when adaptive dilation is applied. Compared to the blue edge map, we can see that adaptive dilation better captures structural information for small-scale objects.

My Image
My Image
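To make the histogram-based criteria concrete, here is a minimal sketch of how bounding-box-area statistics can be turned into candidate thresholds. The data and percentile choices below are dummies; the actual bin edges and the kernel assigned to each bin are the ones listed in the table above.

```python
import numpy as np

# bbox_areas: bounding-box areas (in pixels) of the people in the training set,
# e.g. collected from the pseudo-GT annotations; dummy data is used here.
bbox_areas = np.random.lognormal(mean=9.0, sigma=1.0, size=10_000)

# Build a histogram of the areas and derive candidate thresholds from it.
counts, bin_edges = np.histogram(bbox_areas, bins=50)
thresholds = np.percentile(bbox_areas, [25, 50, 75])
print("histogram bins:", len(counts), "candidate area thresholds:", thresholds)
```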



3DPW Benchmark and Occlusion & Complex Pose Dataset Results


My Image

This is Table 4 from the main paper, which categorizes methods into those that use the 3DPW training set and those that do not. Models that do not use the training set are listed separately to show how well they perform in the presence of a domain gap. Our model shows superior results in this table. Furthermore, a comparison of our model with the current SOTA model, CLIFF, is shown in the table below, which compares their robustness to occlusion and complex poses when a domain gap exists.

My Image

The table above is Table 5 in the main paper, which consists of datasets for occlusion and complex poses. The occlusion datasets are 3DPW-OCC, 3DPW-PC, RH-Dataset (RH-D), OCHuman, and CrowdPose, while the complex pose dataset is LSP. The compared methods all target occlusion, including OCHMR, Liu et al., VisDB, and 3DCrowdNet. CLIFF is included to assess the difference in accuracy between our method and the current SOTA in the presence of a domain gap. Our method shows substantially better performance than the other methods.




Demo Results


My ImageMy Image
My ImageMy Image
My ImageMy Image
My ImageMy Image

Our demo video features footage sourced from copyright-free platforms such as Videovo and Pixabay. In particular, we included clips that showcase our model's ability to handle complex poses: in videos of people dancing and exercising, our model recovers the subjects' meshes appropriately.




Visualization Results


My Image

The above image compares our method with other state-of-the-art methods such as I2L-MeshNet, SPIN, and 3DCrowdNet. Only 3DCrowdNet, which considers occlusion, and our method produce plausible results. In the first row, although 3DCrowdNet detects the meshes plausibly, it fails to properly reconstruct the person at the very back, whereas our method reconstructs all three individuals plausibly. In the second row, when the person at the front performs a complex pose, 3DCrowdNet misses it, whereas our method captures it. In the last row, due to the complex pose, 3DCrowdNet fails to resolve the hands being projected forward, while our method produces a visually plausible result with the hands positioned behind the individuals.



Other Visualization Results on CrowdPose


My Image My Image My Image

The above images show, from left to right, the input image, the input 2D pose, 3DCrowdNet, and our method. All of these images are from the CrowdPose test set and demonstrate that our method is more robust to complex poses and occlusion than 3DCrowdNet.




Other Visualization Results on 3DPW


My Image

These results compare 3D human mesh reconstruction on the 3DPW test set. The first row shows the input image, the 2D pose, and the SMPL model edges. The second row compares the GT mesh (ground truth), the baseline mesh, and SEFD's mesh. The third row shows a detailed comparison of the differences between the meshes. These results show that our method performs well on the 3DPW test set and handles complex poses and occlusions better than 3DCrowdNet.
