BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

WACV 2026 (Oral)

Abstract

Text-to-motion generation has considerably advanced with part-based autoregressive models. Traditional unidirectional approaches are limited by their inability to access future tokens, leading to constrained temporal coherence and suboptimal motion quality. Furthermore, autoregressive models are challenging to apply to motion editing tasks. Recently, bidirectional autoregressive models have been proposed, integrating past and future contexts to enhance consistency. In this work, we introduce the first model to combine part-based generation with bidirectional autoregressive methods. This approach leverages detailed control over individual parts alongside rich temporal context, with the added advantage of applicability to motion editing tasks. However, it can cause parts to rely too heavily on each other, as each part must account for expanded contextual information. This reliance can result in tangled motion sequences and compounding small errors in both directions along the sequence. To resolve these issues, we propose Partial Occlusion, a stochastic training technique that probabilistically occludes specific motion part information, encouraging the model to learn robust representations under partial context. We combine these contributions into BiPO. Our model achieves superior performance in FID on HumanML3D compared to previous part-based methods and sets a new state-of-the-art.

BiPO's Architecture

BiPO Architecture Diagram

BiPO is a part-based bidirectional autoregressive network that generates coherent and controllable human motions from textual descriptions. Unlike existing approaches that either rely solely on unidirectional generation or lack per-part controllability, BiPO represents the human body as multiple parts (root, backbone, arms, and legs), each generated through its own transformer-based pipeline. The parts reference one another through a Selective Part Coordination Layer, which lets the model maintain global motion coherence while allowing fine-grained control at the part level. To prevent parts from becoming overly reliant on one another, BiPO employs Partial Occlusion, which randomly occludes certain part tokens during training. This encourages each part's pipeline to learn robust representations under partial context, resulting in more stable and natural full-body motion synthesis.
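As a rough illustration of Partial Occlusion, the sketch below builds the cross-part context that one part sees during a single training step and hides each other part with some probability. It is a minimal sketch, not the authors' implementation: the function name `occlude_context`, the placeholder id `OCCLUDED_ID`, the per-part (rather than per-token) granularity, and the probability value are all assumptions made for illustration.

```python
import random

# Body parts as named in the text above.
PARTS = ("root", "backbone", "arms", "legs")
OCCLUDED_ID = -1  # hypothetical placeholder id standing in for an occluded token


def occlude_context(part_tokens, target_part, p_occlude, rng):
    """Return the tokens of the other parts that `target_part` may attend to.

    Each other part is hidden independently with probability `p_occlude`
    (assumed granularity: the whole part is occluded at once).
    """
    context = {}
    for part, tokens in part_tokens.items():
        if part == target_part:
            continue  # a part does not need itself as cross-part context
        if rng.random() < p_occlude:
            context[part] = [OCCLUDED_ID] * len(tokens)
        else:
            context[part] = list(tokens)
    return context


# Toy token sequences, four tokens per part.
example_tokens = {part: [i, i + 1, i + 2, i + 3] for i, part in enumerate(PARTS)}
example_context = occlude_context(example_tokens, "arms", 0.3, random.Random(0))
```

Under this assumed scheme, the pipeline for the arms sometimes has to predict its tokens without seeing, say, the legs, which is one way to discourage over-reliance on any single other part.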

Inference

During inference, BiPO employs a two-phase decoding strategy to refine the motion output. In the first phase, each body part’s motion tokens are generated in a unidirectional manner, guided by the textual prompt. This step provides a coarse yet contextually appropriate sequence. In the second phase, BiPO applies a bidirectional autoregressive refinement: it selectively masks and regenerates certain tokens while considering both past and future context. By integrating the updated tokens from all parts, the model corrects potential inconsistencies and improves the temporal smoothness and structural integrity of the motion. The result is a refined motion sequence that faithfully represents the provided text.
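The two phases can be sketched for a single part as follows. This is an illustrative sketch under stated assumptions rather than the model's actual decoder: `predict` is a hypothetical callable standing in for the network, returning a token and a confidence for one position, and re-masking the lowest-confidence positions is an assumed selection rule that the text above does not specify.

```python
MASK = None  # marker for a position that has not been generated yet


def two_phase_decode(length, predict, remask_ratio=0.25):
    """Decode `length` tokens: a left-to-right pass, then a bidirectional refinement."""
    tokens = [MASK] * length
    confidence = [0.0] * length

    # Phase 1: unidirectional pass. Everything to the right is still MASK,
    # so each prediction can only use past context.
    for t in range(length):
        tokens[t], confidence[t] = predict(tokens, t)

    # Phase 2: re-mask a subset (assumed rule: lowest confidence first) and
    # regenerate it while the remaining past and future tokens stay visible.
    n_remask = int(length * remask_ratio)
    remask = sorted(range(length), key=lambda t: confidence[t])[:n_remask]
    for t in remask:
        tokens[t] = MASK
    for t in sorted(remask):
        tokens[t], confidence[t] = predict(tokens, t)
    return tokens


def toy_predict(tokens, t):
    """Stand-in predictor: uses the right neighbour only when it is available."""
    left = tokens[t - 1] if t > 0 and tokens[t - 1] is not MASK else 0
    has_right = t + 1 < len(tokens) and tokens[t + 1] is not MASK
    if has_right:
        return (left + tokens[t + 1]) // 2, 0.9  # both sides visible
    return left + 1, 0.5  # past context only


decoded = two_phase_decode(8, toy_predict)
```

In the full model, each part would run a loop like this while also reading the other parts' updated tokens; that coordination is omitted here to keep the sketch short.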

Qualitative Test

Qualitative evaluations show that BiPO produces fluid human motions that capture nuanced details of the textual description. Compared to prior methods, BiPO demonstrates clearer transitions, balanced posture control, and fine-grained articulation in the limbs. These tests illustrate how the model’s architecture and training strategy combine to yield motions that look realistic and remain semantically aligned with the input text.

Motion Editing

Beyond straightforward text-to-motion generation, BiPO excels in motion editing tasks, where only partial information (such as a starting pose or a midpoint sequence) is provided. Through its bidirectional inference process and robust part-based representation, BiPO can fill in missing segments, extend a partially completed motion, or adjust the trajectory to accommodate new textual constraints. Whether predicting the next steps of a dancer after a given pose or completing the beginning and end of a walking sequence, BiPO maintains consistency, style, and semantic fidelity to the user’s directions. This adaptability opens new possibilities for iterative refinement and user-guided editing of 3D human motion data.
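One way to picture these editing tasks is as different choices of which token positions are kept from the input and which are regenerated. The helper below is a hypothetical illustration (the name `editing_mask` and the span convention are assumptions, not part of BiPO's released interface) that builds such a mask for three of the cases described above.

```python
def editing_mask(length, known_spans):
    """Return a list with True at positions to generate and False at positions to keep.

    `known_spans` holds half-open (start, end) index ranges supplied by the user.
    """
    regenerate = [True] * length
    for start, end in known_spans:
        for t in range(max(start, 0), min(end, length)):
            regenerate[t] = False
    return regenerate


# Continue a motion from a given start (first 3 positions are known).
continuation = editing_mask(10, [(0, 3)])
# Fill in a missing middle segment between a known start and a known end.
in_between = editing_mask(10, [(0, 3), (7, 10)])
# Complete the beginning and end around a known midpoint sequence.
around_middle = editing_mask(10, [(3, 7)])
```

A bidirectional decoder can then regenerate only the positions marked True while conditioning on the kept tokens on both sides, which a purely left-to-right decoder cannot do for the last two cases.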

Ablation Study

To evaluate the effectiveness of BiPO’s core components, we conducted ablation studies that remove or alter one component at a time. Removing the bidirectional autoregressive mechanism led to less coherent global motion and reduced fidelity to the text prompt, while omitting Partial Occlusion resulted in models that relied too heavily on other parts’ tokens, producing entangled or inconsistent motions.

BibTeX


@article{hong2024bipo,
  title={BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis},
  author={Hong, Seong-Eun and Lim, Soobin and Hwang, Juyeong and Chang, Minwook and Kang, Hyeongyeop},
  journal={arXiv preprint arXiv:2412.00112},
  year={2024}
}