BiManiBench
A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

1Tsinghua University, 2The University of Hong Kong, 3HKUST, 4Beijing Innovation Center of Humanoid Robotics
*Equal Contribution | Corresponding Authors

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and benchmarking their robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatiotemporal coordination required for bimanual tasks such as lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark that evaluates MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates uniquely bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that, despite proficient high-level reasoning, MLLMs struggle with dual-arm spatial grounding and control, frequently producing mutual interference and sequencing errors. These findings suggest that the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research on inter-arm collision avoidance and fine-grained temporal sequencing.

Overview

BiManiBench is the first hierarchical benchmark specifically designed to systematically evaluate the bimanual coordination capabilities of Multimodal Large Language Models (MLLMs). While current research in embodied AI has made significant strides in single-arm manipulation, bimanual coordination remains a formidable challenge. It requires more than just parallel execution; it demands rigorous spatiotemporal synchronization and dynamic role assignment to navigate complex kinematic constraints and prevent self-collisions. BiManiBench addresses this critical gap by providing a dedicated platform to analyze how foundation models manage the unique complexities of dual-arm physical interaction.

BiManiBench Framework Overview

Figure 1: The hierarchical evaluation framework of BiManiBench, deconstructing bimanual coordination into three tiers of abstraction.

As illustrated in Figure 1, our benchmark features a comprehensive three-tier evaluation framework that deconstructs bimanual tasks into different levels of abstraction. Tier 1 (Dual-Arm Spatial Reasoning) assesses fundamental workspace awareness and arm allocation. Tier 2 (High-Level Action Planning) evaluates long-horizon reasoning under diverse coordination modes, including independent parallel tasks and complex sequential collaborative manipulation. Tier 3 (Low-Level End-Effector Control) tests the model's ability to directly generate fine-grained, 16-dimensional continuous actions for precise bimanual synchronization. This hierarchical design allows researchers to isolate specific failure modes and distinguish between perceptual hallucinations and planning deficiencies.

BiManiBench Agent Pipeline

Figure 2: The vision-driven agent pipeline designed for structured multimodal perception and reasoning.

The core of our evaluation is supported by a vision-driven agent pipeline designed for structured multimodal perception and reasoning (Figure 2). The agent processes diverse inputs—including multi-view observations (main and third-person views), language instructions, and task-specific auxiliary information—to bridge the gap between perception and action. Within each planning step, the MLLM functions as a central "brain" that generates a visual state description, performs internal reasoning and reflection, and formulates a language-based plan before outputting a structured, executable action in JSON format. This iterative closed-loop process ensures that the agent can adapt its coordination strategy based on the evolving environment state.
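To make this concrete, the minimal sketch below shows what the structured output of a single planning step could look like. The field names (visual_state_description, reasoning, reflection, plan, action) and the primitive labels are illustrative assumptions, not the exact schema used by BiManiBench.

```python
import json

# Hypothetical output of one planning step; field names and primitives are
# illustrative, not BiManiBench's exact schema.
step_output = {
    "visual_state_description": "Red block near the table's left edge; pot handle within right-arm reach.",
    "reasoning": "The left arm is closer to the red block, so it should grasp it; the right arm keeps the pot steady.",
    "reflection": "The previous grasp missed; re-estimate the block position before closing the gripper.",
    "plan": "Left arm grasps the red block while the right arm stabilizes the pot.",
    "action": {
        "left_arm": {"primitive": "Grasp", "target": "red_block"},
        "right_arm": {"primitive": "Hold", "target": "pot_handle"},
    },
}

# A downstream executor would parse this JSON and dispatch one primitive per arm.
print(json.dumps(step_output, indent=2))
```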

Through an extensive empirical study of over 30 state-of-the-art models—including proprietary systems like GPT-5, Gemini, and Claude—our results reveal a significant "reasoning-actuation gap." While modern MLLMs demonstrate proficiency in high-level strategic planning, they frequently struggle with fragile spatial grounding and precise dual-arm control. By pinpointing these bottlenecks, BiManiBench provides a foundational framework and diagnostic tool for the community to develop more robust, versatile robotic agents capable of human-like physical coordination.

Hierarchical Evaluation Examples

Tier 1: Dual-Arm Spatial Reasoning

This tier assesses fundamental spatial awareness and the ability to perform dynamic arm assignment. Given a visual observation, the model must determine the optimal manipulator while navigating strict kinematic constraints and limited reachability.
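For intuition only, the toy sketch below shows one way reachability could drive arm allocation in this tier; the workspace geometry (base positions, reach radius) and the allocate_arm helper are hypothetical and are not part of the benchmark's evaluation code.

```python
import math

# Made-up workspace geometry for a dual-arm setup (meters).
ARM_BASES = {"left": (-0.3, 0.0), "right": (0.3, 0.0)}
REACH_RADIUS = 0.85

def reachable(arm, target_xy):
    """True if the target lies inside the arm's (simplified, circular) workspace."""
    return math.dist(ARM_BASES[arm], target_xy) <= REACH_RADIUS

def allocate_arm(target_xy):
    """Pick an arm that can reach the target, preferring the closer one."""
    candidates = [arm for arm in ARM_BASES if reachable(arm, target_xy)]
    if not candidates:
        raise ValueError("Target is outside both workspaces")
    return min(candidates, key=lambda arm: math.dist(ARM_BASES[arm], target_xy))

print(allocate_arm((0.6, 0.4)))  # -> "right" under these toy parameters
```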

High Quality Reasoning

High-quality reasoning: Precise grounding and optimal arm allocation.

Medium Quality Reasoning

Average-quality reasoning: Valid logic but with minor spatial ambiguity.

Low Quality Reasoning

Low-quality reasoning: Significant visual hallucinations and planning failures.

Tier 2: High-Level Action Planning

This tier evaluates logical reasoning and task decomposition in long-horizon scenarios. The model acts as a strategic planner, outputting an ordered sequence of atomic primitives (e.g., Grasp, Place).
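As a rough illustration (not the benchmark's exact action vocabulary), a Tier 2 plan for a handover-style task might be a short list of per-arm primitives like the one below; only Grasp and Place are named in the text above, and the remaining primitives and object names are assumptions.

```python
# Hypothetical Tier 2 plan: one atomic primitive per arm at every step.
handover_plan = [
    {"left_arm": ("Grasp", "block"),          "right_arm": ("Idle", None)},
    {"left_arm": ("MoveTo", "handover_pose"), "right_arm": ("MoveTo", "handover_pose")},
    {"left_arm": ("Hold", "block"),           "right_arm": ("Grasp", "block")},
    {"left_arm": ("Release", "block"),        "right_arm": ("Hold", "block")},
    {"left_arm": ("Retract", None),           "right_arm": ("Place", "target_zone")},
]

for step, actions in enumerate(handover_plan, start=1):
    print(step, actions["left_arm"], actions["right_arm"])
```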

Task Success: handover_block (Sequential coordination)

Tier 3: Low-Level End-Effector Control

This is the most challenging tier, requiring precise motor control. The agent directly generates continuous 16-dimensional actions (a 7-DoF pose and a 1-DoF gripper state per arm) for bimanual synchronization.
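The minimal sketch below packs one such action, assuming the 7-DoF pose is an xyz position plus a unit quaternion and the gripper state is a scalar in [0, 1]; the exact ordering and conventions are assumptions, not the benchmark's specification.

```python
import numpy as np

def pack_bimanual_action(left_pose, left_grip, right_pose, right_grip):
    """Concatenate two 7-DoF poses and two gripper scalars into one 16-D action."""
    action = np.concatenate([left_pose, [left_grip], right_pose, [right_grip]])
    assert action.shape == (16,)
    return action

# Assumed layout per arm: x, y, z, qx, qy, qz, qw (pose) + gripper opening.
left_pose = np.array([0.35, 0.20, 0.15, 0.0, 0.0, 0.0, 1.0])
right_pose = np.array([0.35, -0.20, 0.15, 0.0, 0.0, 0.0, 1.0])
action = pack_bimanual_action(left_pose, 1.0, right_pose, 0.0)  # 1.0 = open, 0.0 = closed (assumed)
print(action.shape)  # (16,)
```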

Task Success: stack_blocks_two (Precise motor control)

Hierarchical Evaluation Results

Table 1: Dual-Arm Spatial Reasoning Results

Success scores across three scenario settings: Sparse, Dense, and Cluttered. "Avg." represents the overall mean performance.
Models | Sparse | Dense | Cluttered | Avg.
Gemini-2.0-flash | 95.45 | 98.69 | 92.00 | 95.38
Gemini-2.5-flash | 95.77 | 96.76 | 92.88 | 95.13
Gemini-2.5-pro | 96.14 | 96.77 | 92.12 | 95.01
Claude-sonnet-4.5 | 96.12 | 94.78 | 92.23 | 94.38
GPT-5 | 94.73 | 95.13 | 92.97 | 94.28
GLM-4.5V | 91.48 | 97.77 | 93.00 | 94.08
Qwen3-VL-32B-Instruct | 94.47 | 95.77 | 91.77 | 94.00
Claude-sonnet-4 | 94.13 | 94.46 | 92.88 | 93.82
Claude-sonnet-3.7 | 93.46 | 95.11 | 91.94 | 93.51
InternVL3-78B | 92.80 | 97.07 | 90.16 | 93.34
Ovis2-34B | 94.78 | 92.78 | 90.45 | 92.67
GPT-4.1 | 93.43 | 92.48 | 91.76 | 92.55
Ovis2-16B | 94.07 | 91.74 | 88.00 | 91.27
Qwen3-VL-235B-A22B-Instruct | 86.82 | 93.50 | 90.33 | 90.22
InternVL3.5-38B | 89.48 | 91.45 | 86.75 | 89.23
GPT-4o | 89.02 | 91.13 | 87.10 | 89.08
Qwen3-VL-30B-A3B-Instruct | 85.50 | 91.13 | 88.98 | 88.54
InternVL3-38B | 81.82 | 92.13 | 89.85 | 87.94
InternVL2.5-78B | 87.21 | 86.45 | 89.37 | 87.68
Llama-4-Scout-17B-16E-Instruct | 85.49 | 87.75 | 86.16 | 86.47
Gemma-3-27b-it | 92.40 | 81.12 | 85.78 | 86.43
Qwen2.5-VL-32B-Instruct | 85.16 | 86.08 | 87.38 | 86.21
InternVL2.5-38B | 79.16 | 85.47 | 85.99 | 83.54
InternVL2.5-8B | 87.48 | 78.96 | 81.81 | 82.75
InternVL3-8B | 79.53 | 69.79 | 86.79 | 78.70
Ovis2.5-9B | 72.79 | 78.12 | 73.13 | 74.68
Qwen2.5-VL-7B-Instruct | 75.20 | 65.83 | 79.34 | 73.46
Gemma-3-12b-it | 80.09 | 57.17 | 70.22 | 69.16
Llama-3.2-11B-Vision-Instruct | 54.64 | 53.62 | 54.01 | 54.09

Table 2: High-Level Action Planning Results

Average success rate (%) on independent parallel manipulation tasks (P1, P2, R1, R2, S1, S2) and sequential collaborative manipulation tasks (H1–H3, P3–P7), together with the overall average across all tasks.
Models | Independent Parallel Avg. | Sequential Collaborative Avg. | Total Avg.
Gemini-2.5-Pro | 71.33 | 69.38 | 70.21
GPT-5 | 76.67 | 59.75 | 67.00
Gemini-2.5-flash | 67.17 | 59.00 | 62.50
GPT-4.1 | 78.50 | 42.88 | 58.14
Claude-sonnet-4 | 67.00 | 46.63 | 55.36
Claude-sonnet-3.7 | 69.00 | 45.00 | 55.29
Qwen3-VL-235B-A22B-Instruct | 58.67 | 50.88 | 54.21
InternVL3-38B | 57.50 | 49.38 | 52.86
Qwen3-VL-32B-Instruct | 54.67 | 50.88 | 52.50
Qwen2.5-VL-32B-Instruct | 52.67 | 50.13 | 51.21
Gemini-2.0-flash | 62.83 | 41.25 | 50.50
GPT-4o | 52.33 | 45.50 | 48.43
InternVL3-78B | 56.33 | 33.63 | 43.36
InternVL2.5-38B | 45.33 | 33.00 | 38.29
Ovis2-34B | 45.50 | 31.75 | 37.64
InternVL2.5-78B | 47.83 | 29.51 | 37.36
InternVL3.5-38B | 41.50 | 33.13 | 36.71
Qwen2.5-VL-72B-Instruct | 28.60 | 37.25 | 33.92
Ovis2-16B | 27.50 | 24.88 | 26.00
Ovis2.5-9B | 17.83 | 28.75 | 24.07
Qwen3-VL-30B-A3B-Instruct | 19.83 | 26.25 | 23.50
Gemma-3-27b-it | 27.17 | 19.50 | 22.79
Llama-4-Scout-17B-16E-Instruct | 10.67 | 29.75 | 21.57
Gemma-3-12b-it | 20.33 | 13.88 | 16.64
Llama-3.2-11B-Vision-Instruct | 6.50 | 20.63 | 14.57
InternVL3-8B | 13.83 | 10.38 | 11.86
InternVL2.5-8B | 2.67 | 1.25 | 1.86
Qwen2.5-VL-7B-Instruct | 1.67 | 1.25 | 1.43

Table 3: Low-Level Manipulation Performance

Success rate (%) on specific manipulation tasks.
Models | Place8 | Place9 | Place10 | Grab1 | Stack3 | Avg.
GPT-5 | 66 | 83 | 50 | 79 | 56 | 66.80
Gemini-2.5-Pro | 82 | 61 | 39 | 81 | 38 | 60.20
Gemini-2.5-flash | 74 | 48 | 13 | 84 | 49 | 53.60
InternVL3-78B | 8 | 50 | 0 | 79 | 1 | 27.60
Claude-sonnet-4.5 | 17 | 13 | 6 | 89 | 2 | 25.40
Qwen3-VL-235B-A22B-Instruct | 41 | 28 | 9 | 46 | 2 | 25.20
Gemma-3-27b-it | 8 | 13 | 3 | 7 | 0 | 6.20
Llama-4-Scout-17B-16E-Instruct | 1 | 0 | 0 | 29 | 0 | 6.00

Error Analysis

GPT-5 Error Distribution

(a) GPT-5

Gemini Error Distribution

(b) Gemini-2.5-Pro

Comparison of error type distributions. Visualization of failure modes for (a) GPT-5 and (b) Gemini-2.5-Pro. Inner rings represent primary error categories (Perceptual vs. Planning), while outer rings detail specific failure modes such as misjudgment or sequencing errors. Detailed definitions are provided in the Appendix.

We analyzed failure modes for GPT-5 and Gemini-2.5-Pro, excluding environmental noise. As the error distributions above illustrate, the primary bottleneck for GPT-5 is perceptual (54%), largely driven by Task State Estimation Misjudgment (39%). Furthermore, it exhibits a notable inability to strictly adhere to prompt-specified execution parameters, categorized as Action Parameter Inconsistency (23%).

Conversely, while Gemini-2.5-Pro follows prompt constraints more reliably, it is significantly more limited by complex planning logic (56%). Its main hurdles are Action Sequencing (31%) and Bimanual Conflict (24%), indicating deeper struggles with the temporal and spatial synchronization essential for sophisticated dual-arm coordination.

BibTeX

@article{wu2026bimanibench,
  title     = {BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models},
  author    = {Wu, Xin and Liang, Zhixuan and Ma, Yue and Hu, Mengkang and Qin, Zhiyuan and Li, Xiu},
  journal   = {arXiv preprint arXiv:2602.08392},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.08392}
}