X2C: ENABLING REALISTIC HUMAN-TO-HUMANOID FACIAL EXPRESSION IMITATION

A video demonstration showcases the realistic imitation capabilities of our proposed framework, X2CNet, in which the correspondence between the expression representation in image space and the robot's action space is learned from the X2C dataset. This framework validates the value of our dataset in advancing research on realistic humanoid facial expression imitation. Notably, our dataset and imitation framework are applicable to multiple humanoid robots with different facial appearances.

Examples of realistic humanoid imitation. Different individuals express a wide range of facial expressions, with nuances reflected in features such as frowning, gaze direction, eye openness, nose wrinkling, and mouth openness. These nuanced human facial expressions extend beyond the canonical emotions and can be regarded either as blends of different canonical emotions or as a single emotion with varying intensity. The humanoid robot, Ameca, mimics every detail, resulting in realistic imitation.

Abstract

The ability to imitate realistic facial expressions is essential for humanoid robots in affective human–robot communication. Achieving this requires modeling two correspondences: between human and humanoid expressions, and between humanoid expressions and robot control values. During training, we predict control values from humanoid expression images, while at execution time the control values drive the robot to reproduce expressions. Progress in this area has been limited by the lack of datasets containing diverse humanoid expressions with precise control annotations. We introduce X2C (Expression to Control), a large-scale dataset of 100,000 <image, control value> pairs, where each image depicts a humanoid robot expression annotated with 30 ground-truth control values. Building on this resource, we propose X2CNet, a framework that transfers expression dynamics from human to humanoid faces and learns the mapping between humanoid expressions and control values. X2CNet enables in-the-wild imitation across diverse human performers and establishes a strong baseline for this task. Real-world robot experiments validate our framework and demonstrate the potential of X2C to advance realistic human-to-humanoid facial expression imitation.
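As a minimal sketch of what an X2C-style pair looks like in code, the snippet below wraps an image path together with its 30 control values. The CSV layout, field names, and helper are illustrative assumptions, not the dataset's official loading API.

```python
# Illustrative loader for <image, control value> pairs (file layout is an assumption).
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import List

NUM_CONTROLS = 30  # each humanoid expression image carries 30 ground-truth control values


@dataclass
class X2CSample:
    image_path: Path       # rendered humanoid facial expression
    controls: List[float]  # 30 control values annotating that expression


def load_pairs(annotation_csv: Path, image_dir: Path) -> List[X2CSample]:
    """Read hypothetical rows of the form: image_name, v1, ..., v30."""
    samples = []
    with open(annotation_csv, newline="") as f:
        for row in csv.reader(f):
            image_name, *values = row
            controls = [float(v) for v in values]
            assert len(controls) == NUM_CONTROLS
            samples.append(X2CSample(image_dir / image_name, controls))
    return samples
```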

The X2C Dataset

Demonstration of X2C dataset examples. Each example in the X2C dataset consists of: (1) an image depicting the virtual robot, shown in the middle; and (2) the corresponding control values, visualized at the bottom. In these visualizations, the height of each blue bar represents the magnitude of the corresponding value, while the orange dots indicate the values in the neutral state.
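The control-value visualization described above (blue bars for the current values, orange dots for the neutral state) can be reproduced roughly as follows. The synthetic values and the flat 0.5 neutral vector are placeholders for illustration only, not the robot's actual neutral pose.

```python
# Sketch of the per-sample control-value plot: bars = current values, dots = neutral state.
import matplotlib.pyplot as plt
import numpy as np


def plot_controls(controls, neutral, title="Control values"):
    idx = np.arange(len(controls))
    plt.figure(figsize=(10, 2.5))
    plt.bar(idx, controls, color="tab:blue", label="current value")
    plt.scatter(idx, neutral, color="tab:orange", zorder=3, label="neutral state")
    plt.xlabel("control index")
    plt.ylabel("value")
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()


# Example call with synthetic data (illustration only).
rng = np.random.default_rng(0)
plot_controls(rng.uniform(0, 1, 30), np.full(30, 0.5))
```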

Dataset Collection

The pipeline for dataset collection. We first curate humanoid facial expression animations covering all basic emotions and beyond. Images and their corresponding control values are then sampled at the same timestamps (e.g., if an image is sampled at t = 2.0, its control value annotation is also sampled at t = 2.0) to obtain temporally aligned pairs.
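A small sketch of the temporal alignment step is shown below: for each sample timestamp, the frame and the control vector active at that instant are paired. The data structures and the nearest-at-or-after lookup are assumptions made for illustration, not the actual collection tooling.

```python
# Pair frames with control vectors sampled at the same timestamps.
import bisect


def sample_aligned_pairs(frame_times, frames, control_times, control_values, sample_times):
    """For each timestamp t, pick the first frame/control entry at or after t
    (clamped to the last entry). Both time lists are assumed to be sorted."""
    pairs = []
    for t in sample_times:
        f_idx = min(bisect.bisect_left(frame_times, t), len(frames) - 1)
        c_idx = min(bisect.bisect_left(control_times, t), len(control_values) - 1)
        pairs.append((frames[f_idx], control_values[c_idx]))
    return pairs
```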

Control Values and Distributions

An illustration of the correspondence between control values and control units. In the control value visualization, the first 4 values control the brow movements, the next 4 control eyelid motions, and so on for the other units.
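The index grouping from the caption can be written down as a simple lookup table. Only the brow and eyelid groups are stated above; the remaining units are left as a placeholder rather than guessed.

```python
# Index grouping based only on the caption: controls 0-3 drive the brows, 4-7 the eyelids.
# The full unit-to-index mapping for the remaining controls is not specified here.
CONTROL_UNITS = {
    "brow": range(0, 4),
    "eyelid": range(4, 8),
    # ... remaining expression-relevant units cover indices 8-29
}


def controls_for_unit(controls, unit):
    """Select the control values belonging to one facial unit."""
    return [controls[i] for i in CONTROL_UNITS[unit]]
```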
Value distributions of 30 controls. Controls for different expression-relevant units are indicated by different colors.

Framework

An overview of X2CNet, the proposed imitation framework. The first module captures facial expression subtleties from humans, while the mapping network learns the correspondence between various humanoid expressions and their underlying control values using the X2C dataset.
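The snippet below is a schematic of the two-module design described in this caption, not the paper's actual architecture: a stand-in encoder extracts an expression representation from a human face image, and a stand-in mapping network predicts the 30 robot control values. All layer choices and dimensions are assumptions.

```python
# Schematic two-module pipeline: expression encoder -> control-value mapper.
import torch
import torch.nn as nn


class ExpressionEncoder(nn.Module):
    """Stand-in for the module that captures facial expression subtleties from humans."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, face_image):
        return self.backbone(face_image)


class ControlMapper(nn.Module):
    """Stand-in for the mapping network trained on X2C <image, control value> pairs."""
    def __init__(self, feat_dim=128, num_controls=30):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_controls)
        )

    def forward(self, expression_feat):
        return self.head(expression_feat)


encoder, mapper = ExpressionEncoder(), ControlMapper()
controls = mapper(encoder(torch.randn(1, 3, 224, 224)))  # tensor of shape (1, 30)
```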