Text-Based Video Generation With Human Motion and Controllable Camera
Anonymous CVEU submission
Paper ID: 32
The figure illustrates the difference between T2V generation with and without pose guidance. When pose guidance is available, it helps the model produce accurate and suitable frames. Without pose guidance, however, generating appropriate frames becomes difficult, leading to problems with subject scale and temporal consistency.
Our Contribution
Our contribution is a framework that enables T2V generation from various viewpoints. Existing T2V generation suffers from the problem that subjects do not appear properly without pose guidance. To address this, we develop a framework that allows T2V content to be created from multiple viewing angles or perspectives. By incorporating a range of viewpoints, our framework aims to enhance the overall quality and effectiveness of T2V generation.
Abstract
As expectations for generative models have risen recently, Text-to-Video models have been actively studied. Existing Text-to-Video models have difficulty generating complex movements such as human motions: they often produce unintended motions and render the subject at an incorrect scale. In order to improve the quality of videos that include human motion, we propose a two-stage framework. In the first stage, a Text-driven Human Motion Generation network generates 3D human motion from the input text prompt, and the resulting 3D motion sequence is projected to a 2D skeleton format. In the second stage, a Skeleton-Guided Text-to-Video Generation module generates a video in which the motion of the subject is well represented. In addition, because the human motion generated in the first stage is 3D rather than 2D, we can manipulate the camera viewpoint and angle to generate the video we want. We demonstrate that the proposed framework outperforms existing Text-to-Video models both quantitatively and qualitatively. To the best of our knowledge, our framework is the first method that uses Text-driven Human Motion Generation networks to improve videos containing human motion.
Text-to-Video Generation with Explicit Camera Control
The model architecture is depicted in the figure below. When a prompt is input, the Text-to-Human-Motion network generates a 3D mesh representation. Simultaneously, the desired viewpoint is set with the Camera Pose Module (CPM), which produces a 2D skeleton. The text-to-video network then generates the corresponding video. Notably, the CPM adjusts the skeleton by applying an appropriate downward tilt, as shown in the accompanying figure, resulting in a 2D projected skeleton.
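To make the projection step concrete, below is a minimal sketch assuming the 3D human motion is available as per-frame joint positions; the function name project_joints, the pinhole-camera model, and all default values are illustrative assumptions rather than the paper's actual CPM implementation.

```python
import numpy as np

def project_joints(joints_3d, tilt_deg=0.0, focal=1000.0, cx=256.0, cy=256.0, cam_dist=5.0):
    """Project 3D joints (N, 3) to 2D pixel coordinates with a camera tilted downward.

    Illustrative sketch only: a pinhole camera whose extrinsic rotation tilts it
    about the x-axis by `tilt_deg` degrees.
    """
    t = np.deg2rad(tilt_deg)
    # Extrinsic rotation about the x-axis (tilts the camera downward).
    R = np.array([[1.0, 0.0,        0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t),  np.cos(t)]])
    # Rotate joints into the camera frame and place them in front of the camera.
    cam = joints_3d @ R.T
    cam[:, 2] += cam_dist
    # Pinhole intrinsics: scale by the focal length and shift to the principal point.
    x = focal * cam[:, 0] / cam[:, 2] + cx
    y = focal * cam[:, 1] / cam[:, 2] + cy
    return np.stack([x, y], axis=1)  # (N, 2) 2D skeleton keypoints
```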
Our model is further elaborated in the diagram below. When a prompt is entered, the Text-to-Motion model generates a human motion sequence, which is fed into the CPM module to apply the desired camera movement and produce the corresponding 2D skeleton. The resulting skeleton, along with the prompt, is then processed by the video model to generate the output video. This ensures that the specified camera movement is correctly applied, resulting in high-quality output videos.
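As a rough illustration of how a camera movement could be applied over a whole motion sequence, the sketch below linearly interpolates the camera tilt across frames and projects each pose with the project_joints helper from the previous sketch; the function name apply_camera_path and the interpolation scheme are assumptions, not the paper's implementation.

```python
import numpy as np

def apply_camera_path(motion_3d, start_tilt=0.0, end_tilt=30.0):
    """motion_3d: (T, N, 3) array of 3D joints per frame.

    Linearly interpolates the camera tilt from start_tilt to end_tilt degrees
    across the sequence and projects each frame with project_joints (defined
    in the previous sketch), yielding a (T, N, 2) skeleton sequence.
    """
    num_frames = motion_3d.shape[0]
    tilts = np.linspace(start_tilt, end_tilt, num_frames)
    return np.stack([project_joints(pose, tilt_deg=tilt)
                     for pose, tilt in zip(motion_3d, tilts)])
```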
Quantitative Results
The table compares action classification accuracy (AC), frame consistency (FC), and CLIPScore (CS) with and without pose guidance. Overall, the results indicate that accuracy is considerably higher with pose guidance. This suggests that pose guidance is advantageous when generating videos: it improves the accuracy and consistency of the actions and enhances the overall quality of the generated frames.
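Since the exact metric implementations are not spelled out here, the following sketch shows one common way to compute frame consistency and CLIPScore with the Hugging Face transformers CLIP model: frame consistency as the mean cosine similarity between CLIP embeddings of consecutive frames, and CLIPScore as the mean text-frame similarity. The definitions used for the reported numbers may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_metrics(frames, prompt):
    """frames: list of PIL images from one generated video; prompt: its text prompt."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # L2-normalize the projected embeddings before taking cosine similarities.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)  # (T, d)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)    # (1, d)
    frame_consistency = (img[:-1] * img[1:]).sum(-1).mean().item()  # consecutive-frame similarity
    clip_score = (img @ txt.T).mean().item()                        # text-frame similarity
    return frame_consistency, clip_score
```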
The next table shows the results obtained by adjusting the camera rotation and the skeleton scale. Notably, when viewed from a top-down perspective, performance increases significantly compared to the default setting, with accuracy rising from 33.8% to 86.7%. Most of the other actions also show improved performance, although a few exhibit a slight decrease.
For the lateral view, the performance differences are negligible except for the "kick" action, which shows a substantial increase, reaching 53.8% and 86.7%. For the other actions there is a slight decline in performance.
In terms of scale adjustments, the majority of actions experienced a decrease in performance.
Overall, the table suggests that video quality can be further enhanced by considering various camera angles or scales. By changing the viewpoint dynamically, it becomes possible to generate high-quality videos even when the camera perspective is altered.
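For illustration, adjusting the skeleton scale can be as simple as rescaling the projected 2D keypoints about their centroid; the helper below (scale_skeleton) is a hypothetical example, not the exact procedure used for the reported results.

```python
import numpy as np

def scale_skeleton(keypoints_2d, scale):
    """Scale 2D skeleton keypoints (N, 2) about their centroid.

    scale > 1 enlarges the subject in the frame, scale < 1 shrinks it.
    """
    center = keypoints_2d.mean(axis=0, keepdims=True)
    return center + scale * (keypoints_2d - center)
```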
The last table compares text-to-motion models. T2M-GPT and MDM were evaluated, and T2M-GPT achieves higher performance, confirming that T2M-GPT, the state-of-the-art (SOTA) motion generation model, also yields better results within our framework.
Demo Results
We incorporate various camera movement techniques, including zoom, rotation, and translation, to enhance the visual dynamics of the videos. The accompanying GIF results demonstrate the quality of the generated videos; the smooth transitions illustrate the effectiveness of our camera movement implementation.
Visualization Results
The image above shows the results obtained by changing the camera's viewing position, and the approach yields convincing outcomes.
Camera Movements
The images below display the outcomes of various camera movements, including rotation, translation, and zoom. These results were achieved by adjusting the camera's intrinsic and extrinsic parameters. Overall, the majority of the results appear to be appropriate and satisfactory.
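To make the intrinsic/extrinsic adjustments concrete, here is a minimal sketch of a standard pinhole projection in which zoom corresponds to scaling the focal length in the intrinsic matrix, while rotation and translation are extrinsic parameters; the function name and default values are illustrative assumptions, not the settings used for these results.

```python
import numpy as np

def project_with_camera(joints_3d, focal=1000.0, cx=256.0, cy=256.0,
                        R=np.eye(3), t=np.array([0.0, 0.0, 5.0])):
    """Pinhole projection of (N, 3) joints with intrinsics K and extrinsics [R | t].

    Zoom-in     : increase `focal` (scales the intrinsic matrix K).
    Rotation    : change `R` (e.g. a downward tilt or a lateral rotation matrix).
    Translation : change `t` (moves the camera relative to the subject).
    """
    K = np.array([[focal, 0.0,   cx],
                  [0.0,   focal, cy],
                  [0.0,   0.0,   1.0]])
    cam = joints_3d @ R.T + t          # world -> camera coordinates
    pix = cam @ K.T                    # apply intrinsics
    return pix[:, :2] / pix[:, 2:3]    # perspective divide -> (N, 2) pixel coordinates
```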