A Harness Framework for Embodied Manipulation

Guava: Effective and Universal Harness for Embodied Manipulation

Haowen Liu^1,* Xirui Li^1,* Shaoxiong Yao² Peng Shi³ Tianyi Zhou⁴ Jia-Bin Huang¹ Furong Huang¹ Jiayuan Mao^5,6

¹ University of Maryland College Park ² University of Illinois Urbana-Champaign ³ University of Waterloo ⁴ Mohamed bin Zayed University of Artificial Intelligence ⁵ University of Pennsylvania ⁶ Amazon FAR

^* Equal contribution

PDF Paper arX arXiv { } Code (coming soon) @ BibTeX

Overview

We present Guava, a harness framework for embodied tool use. Through systematic exploration of the design space, we identify three key ingredients of an effective harness: iterative perception–reasoning–action loops, semantic action abstractions, and multimodal observations. Using this harness, we distill embodied manipulation into a 4B open-source model with fewer than 2K simulation trajectories — matching frontier proprietary models in both simulation and the real world, with strong zero-shot generalization to unseen objects, novel instructions, and long-horizon tasks.

Ingredients for an effective harness

Existing harness systems often rely on one-shot code generation, domain-specific pipelines coupled with powerful frontier models, which makes robust long-horizon behavior and failure recovery expensive and brittle. By exploring the design space of agent workflows, action spaces, and observation spaces, we reframe the design problem around three reusable ingredients that consistently matter for embodied manipulation:

Iterative perception–reasoning–action loops

ReAct-style loops let the agent adapt to execution outcomes and recover from failures, rather than committing to a single open-loop plan.
Semantic action abstractions

Tools that encapsulate manipulation skills at a semantic level free the language model to focus on task decomposition and planning instead of low-level control.
Multimodal observations

Rich visual and structured observations provide the environmental context that embodied reasoning requires.

Guava-Agent-4B

Building on these principles, we ask whether an effective harness can serve as a universal interface for embodied manipulation across model scales — including small open-source ones. We distill embodied tool-use behaviors into a small open-source model (Qwen3.5-4B), using fewer than 2K trajectories collected entirely in simulation and perform supervised fine-tuning.

Data generation pipeline: simulation environment, a frontier VLM, and the resulting reasoning + tool-call traces. — **Data engine.** Randomized simulation scenes plus error perturbations are paired with a frontier VLM. The pipeline yields interleaved reasoning and tool-call traces used to distill the 4B student model.

Data gallery

place the can in the box

pick up the orange

move the hotdog near the donut

pick up the hidden cube under the cup

push basket to the left

remove the red cube from the tray

push cereal box to the left

open drawer

set the table

Results

Guava-Agent-4B achieves the highest overall zero-shot real-world success rates on both in-distribution (86%) and out-of-distribution (92%) tasks, outperforming all baselines.

Bar chart comparing guava-agent-4b against CaP-Agent0, GPT 5.4, and Qwen3.5-4B on in-distribution and out-of-distribution real-world tasks. — **Real-world success rates.** The 4B Guava agent matches or beats frontier proprietary models in average success across both in-distribution and OOD tasks, while running on a compact open-source backbone.

Our method generalizes zero-shot across a broad range of tasks that requires semantic and geometric reasoning. It is able to handle long-horizon tasks, and is robust to distractors, OOD tasks and setups.

place all red objects into basket

set the table by putting bowl on plate and spoon next to it

place utensils on tray and food into basket

move the hotdog away from the donut

push basket to the right

arrange objects from left to right by increasing size

close drawer

stack the red cube on the green cube

find the hidden cube under the cups and pick it up

Failure recovery

Guava interleaves observation, reasoning, and tool execution on every step — so when something goes wrong, the agent can see and revise the plan. We observe successful recovery from previously unseen failures, as recovery emerges from reasoning over execution feedback beyond memorizing predefined correction patterns.

When the carrot is moved during execution, Guava detects the displacement and re-attemps grasping

Guava recovers from unseen real-world control errors e.g., joint limits and unreachable poses by generating corrective actions before retrying

RL fine-tuning

We further finetune Guava-Agent-4B with GRPO on the two challenging, OOD long-horizon manipulation tasks that require substantially more multi-step reasoning, error recovery, and action planning. While the SFT policy struggles on both shell game (6.7%) and place all red objects in basket (0.0%), RL fine-tuning improves performance to 60.0% and 93.3%, respectively. The large gains suggest that training on challenging long-horizon tasks with sparse success rewards effectively strengthens recovery behaviors and enables the policy to better handle off-trajectory states.

BibTeX

@misc{liu2026guavaeffectiveuniversalharness,
      title={Guava: An Effective and Universal Harness for Embodied Manipulation},
      author={Haowen Liu and Xirui Li and Shaoxiong Yao and Peng Shi and Tianyi Zhou and Jia-Bin Huang and Furong Huang and Jiayuan Mao},
      year={2026},
      eprint={2606.18363},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.18363},
}