The ability to imitate realistic facial expressions is essential for humanoid robots in affective human–robot communication. Achieving this requires modeling two correspondences: between human and humanoid expressions, and between humanoid expressions and robot control values. During training, we predict control values from humanoid expression images; at execution time, the predicted control values drive the robot to reproduce the corresponding expressions. Progress in this area has been limited by the lack of datasets containing diverse humanoid expressions with precise control annotations. We introduce X2C (Expression to Control), a large-scale dataset of 100,000 (image, control value) pairs, where each image depicts a humanoid robot expression annotated with 30 ground-truth control values. Building on this resource, we propose X2CNet, a framework that transfers expression dynamics from human to humanoid faces and learns the mapping between humanoid expressions and control values. X2CNet enables in-the-wild imitation across diverse human performers and establishes a strong baseline for this task. Real-world robot experiments validate our framework and demonstrate the potential of X2C to advance realistic human-to-humanoid facial expression imitation.
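One way to formalize this pipeline (the notation and the specific regression objective below are illustrative assumptions, not necessarily those adopted in the paper) is as two mappings: a human-to-humanoid expression transfer $g$ and an image-to-control regressor $f_\theta$,
\[
g: x_h \mapsto x_r, \qquad f_\theta: x_r \mapsto c \in \mathbb{R}^{30},
\]
where $x_h$ is a human expression image, $x_r$ the corresponding humanoid expression image, and $c$ the vector of 30 control values. Given the X2C pairs $\{(x_r^{(i)}, c^{(i)})\}_{i=1}^{N}$ with $N = 100{,}000$, $f_\theta$ could be fit with a standard regression loss such as
\[
\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} \left\| f_\theta\!\left(x_r^{(i)}\right) - c^{(i)} \right\|_2^2,
\]
and at execution time the robot would be driven by $\hat{c} = f_\theta\!\left(g(x_h)\right)$.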