Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing

Anonymous CVPR submission

Paper ID : 8271

Human-object interaction (HOI) image editing using generated skeletons: We synthesize humans interacting with objects in an initial image using the automated object-interactive diffuser. (a) An initial image to edit. (b) The sequential process of synthesizing a human image with object-interactive skeletons using textual conditions. (c) The final result image with the skeleton map. Our method generates high-quality object-interactive skeleton maps, which plug easily into skeleton-guided generative models for HOI image editing.

Human-Object-Interaction Image Editing

Traditional image editing technologies have advanced significantly, but they remain limited in Human-Object-Interaction (HOI) image editing. Recent methods that use skeleton-guidance maps (e.g., ControlNet, HumanSD, Uni-ControlNet, T2I-Adapter) to generate or edit images can perform HOI image editing more seamlessly. However, no existing technology generates or edits object-interactive skeletons; we propose one for the first time.

Abstract

Recently, there have been remarkable advances in image editing tasks. Nevertheless, existing image editing models are not designed for Human-Object Interaction (HOI) image editing. Some approaches (e.g., ControlNet) employ skeleton guidance to offer precise representations of humans, showing better results in HOI image editing. However, with conventional methods the HOI skeleton guidance must be created manually. This paper proposes the object-interactive diffuser with associative attention, which considers both the interaction with objects and the joint graph structure, automating the generation of HOI skeleton guidance. Additionally, we propose the HOI loss with a novel scaling parameter and demonstrate its effectiveness in generating skeletons that interact better with objects. To evaluate generated object-interactive skeletons, we propose two metrics: top-N accuracy and skeleton probabilistic distance. Our framework integrates the object-interactive diffuser, which generates object-interactive skeletons, with previous methods, demonstrating outstanding results in HOI image editing. Finally, we present the potential of our framework beyond HOI image editing, with applications to human-to-human interaction, skeleton editing, and 3D mesh optimization.

Person-in-Place Framework

Overview of the proposed framework: Our framework takes as input an image cropped by a person bounding box, together with an object bounding box. (Left) These are used to extract image features and object features. (Middle) The extracted features serve as the image and object conditioning, respectively, in our object-interactive diffuser. Using these conditionings, the object-interactive diffuser attends to the object-joint and joint-joint relationships and then generates a denoised skeleton through the diffusion process. (Right) The synthesized skeleton, together with the image masked by the person bounding box, is used to edit the image with an off-the-shelf inpainting model.
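The input preparation described in this overview can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `crop_and_mask`, the `(x0, y0, x1, y1)` pixel box format, and the choice of zero as the fill value for masked pixels are all assumptions made for the example.

```python
import numpy as np

def crop_and_mask(image, person_box):
    """Prepare the two image inputs named in the overview (hypothetical helper).

    image:      (H, W, C) array.
    person_box: (x0, y0, x1, y1) in pixels, exclusive at x1/y1 (assumed format).
    Returns the person crop, a boolean inpainting mask, and the masked image.
    """
    x0, y0, x1, y1 = person_box
    # Crop fed to the feature extractor.
    crop = image[y0:y1, x0:x1].copy()
    # Region the inpainting model is asked to fill.
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    # Original image with the person region blanked out (zero fill is an assumption).
    masked_image = image.copy()
    masked_image[mask] = 0
    return crop, mask, masked_image
```

The masked image and the generated skeleton map would then be passed together to a skeleton-guided inpainting model; that step depends on the chosen model and is omitted here.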

The denoising process first estimates the correlation between the object and the joints, and then considers the relationships among the joints themselves using a GNN. After that, the object conditioning is used to predict which joints are most likely to interact with the object. In this figure, the pixels located inside the snowboard have higher attention scores on joints such as the hands or feet than on other joints. The figure visualizes which joint has the greatest association with the features corresponding to the selected pixels in the image, colored red, yellow, and orange. The size of each circle indicates the degree of association.
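The two-stage idea above (object-joint correlation followed by joint-joint propagation) can be illustrated with a generic sketch. It is not the paper's associative attention: it uses plain scaled dot-product cross-attention and a single mean-aggregation graph step, and the function names, feature shapes, and residual sum are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def object_joint_then_graph(joint_feats, object_feats, adjacency):
    """Generic sketch: cross-attention, then one graph propagation step.

    joint_feats:  (J, D) one feature vector per skeleton joint (assumed).
    object_feats: (P, D) features of P pixels inside the object box (assumed).
    adjacency:    (J, J) 0/1 skeleton graph without self-loops.
    """
    num_joints, dim = joint_feats.shape
    # Stage 1: object-joint correlation. Each joint attends over object pixels.
    scores = joint_feats @ object_feats.T / np.sqrt(dim)   # (J, P)
    attn = softmax(scores, axis=-1)                        # rows sum to 1
    attended = attn @ object_feats                         # (J, D)
    # Stage 2: joint-joint relationship. Average over each joint and its neighbours.
    a_hat = adjacency + np.eye(num_joints)
    a_norm = a_hat / a_hat.sum(axis=1, keepdims=True)
    propagated = a_norm @ (joint_feats + attended)         # (J, D)
    return propagated, attn
```

For a visualization like the one described, each column of `scores` could be normalized over joints to show which joint a given object pixel is most associated with.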

Quantitative HOI image editing Results

Results comparing our framework to previous image editing models: Our framework outperforms the others on the image-quality metrics FID and KID and on CS, which measures prompt-to-image alignment. Editing models fall into two categories: text-free and text-guided. Our framework belongs to the text-guided category, and our performance analysis indicates that using our object-interactive skeleton for inpainting yields the best results. Moreover, our approach outperforms SDXL-inpainting, the upgraded version of the Stable Diffusion-based method.
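For readers unfamiliar with FID, the Fréchet distance underlying it compares the mean and covariance of feature vectors from real and generated images. The sketch below shows only that formula; it assumes the feature statistics have already been computed (standard FID uses Inception network features, which are not reproduced here), and the function name is illustrative rather than taken from any evaluation toolkit.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 Tr((sigma1 sigma2)^(1/2))
    """
    diff = mu1 - mu2
    # Tr of the matrix square root equals the sum of square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # positive semi-definite inputs; clip guards against round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

Lower values indicate that the generated-image feature distribution is closer to the real one.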

Quantitative Object Interactive Skeleton Results

Table 2 compares results for the proposed loss scale. Applying our loss scale improves object-interactive skeleton evaluation performance across all models. Table 3 compares the proposed associative attention mechanism with alternative modules: our method significantly outperforms conventional attention and attention + GNN (Graph Neural Network).

Visualization Results

(Top) Qualitative results when generating a single person using CoModGAN, Instruct-Pix2Pix, and Stable-Diffusion Inpainting (SD-Inpainting). (Top left) The other models generate incomplete humans or no humans at all. (Top right) Even when humans are generated, they are misaligned or non-interactive. (Bottom) Image editing with SD-Inpainting, SDXL-Inpainting, and ours. The other models did not generate a human even with skeleton guidance.

Additional Visualization Results

Comparison between the attention mechanism and the associative attention mechanism: With our associative attention mechanism, the overall shape of the skeleton becomes more natural, as shown in the images of a baby sitting on a bed and a man sitting on a yacht. Our associative attention mechanism also generates more object-interactive poses (e.g., swinging, sitting, typing), because the joints of the skeleton approach the object through the propagation process. Even in scenarios with multiple objects, a natural skeleton is generated that interacts with the specified object.

Comparison of CoModGAN, Instruct-Pix2Pix, SD-Inpainting, and ours: We use a person bounding box, an object bounding box, and a text prompt for HOI image editing. In most of the visualized cases, CoModGAN and Instruct-Pix2Pix perform poorly at generating natural humans. Our results are more object-interactive than those of SD-Inpainting, as shown in the cases of a woman holding a polka-dotted umbrella and a man surfing the waves. For example, for the prompt 'A woman in pajamas using her laptop on the stove top in the kitchen', CoModGAN and SD-Inpainting did not generate even a human shape, and Instruct-Pix2Pix failed to preserve the original image, while our method generated a natural woman that matches the text prompt.

Performance on multi-person images: This figure shows HOI-edited images of multiple people using SD-Inpainting, SDXL-Inpainting, and ours. In the first and second columns, SD-Inpainting and SDXL-Inpainting fail to generate a human when the person bounding box is small. In contrast, our method generates natural HOI images regardless of the size of the person bounding box, since it uses object-interactive skeletons. As shown in the fourth column, both adults and children sitting on the sofa are generated naturally with our method.