Guava

A Harness Framework for Embodied Manipulation

Guava: Effective and Universal Harness for Embodied Manipulation

1 University of Maryland College Park 2 University of Illinois Urbana-Champaign 3 University of Waterloo 4 Mohamed bin Zayed University of Artificial Intelligence 5 University of Pennsylvania 6 Amazon FAR

* Equal contribution

Guava overview: harness-based embodied agent that uses tools for perception, planning, and control.
Guava overview. A harness combining iterative reasoning, semantic tools, and multimodal observations distills frontier-VLM capabilities into a compact 4B model deployable zero-shot in the real world.

Overview

We present Guava, a harness framework for embodied tool use. Through systematic exploration of the design space, we identify three key ingredients of an effective harness: iterative perception–reasoning–action loops, semantic action abstractions, and multimodal observations. Using this harness, we distill embodied manipulation into a 4B open-source model with fewer than 2K simulation trajectories — matching frontier proprietary models in both simulation and the real world, with strong zero-shot generalization to unseen objects, novel instructions, and long-horizon tasks.

Ingredients for an effective harness

Existing harness systems often rely on one-shot code generation or on domain-specific pipelines coupled with powerful frontier models, which makes robust long-horizon behavior and failure recovery expensive and brittle. By exploring the design space of agent workflows, action spaces, and observation spaces, we reframe the design problem around three reusable ingredients that consistently matter for embodied manipulation:

  1. Iterative perception–reasoning–action loops

    ReAct-style loops let the agent adapt to execution outcomes and recover from failures, rather than committing to a single open-loop plan.

  2. Semantic action abstractions

    Tools that encapsulate manipulation skills at a semantic level free the language model to focus on task decomposition and planning instead of low-level control.

  3. Multimodal observations

    Rich visual and structured observations provide the environmental context that embodied reasoning requires.
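
The three ingredients above can be pictured as a single control loop. The following is a minimal sketch, not Guava's implementation: the names `Step`, `run_episode`, `observe`, `policy`, and the `pick` tool are hypothetical, and the toy world in `demo` stands in for a real robot or simulator. The policy sees a fresh observation on every step, issues one semantic tool call, and receives the tool's result (including any error) as feedback for the next step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One loop iteration: the reasoning text, the tool call, and its result."""
    thought: str
    tool: str
    args: dict
    result: str


# Hypothetical policy signature: (instruction, observation, history) -> (thought, tool, args).
Policy = Callable[[str, str, List[Step]], Tuple[str, str, dict]]


def run_episode(
    instruction: str,
    observe: Callable[[], str],
    policy: Policy,
    tools: Dict[str, Callable[..., str]],
    max_steps: int = 10,
) -> List[Step]:
    """Run a perception-reasoning-action loop until the policy emits 'done'."""
    history: List[Step] = []
    for _ in range(max_steps):
        observation = observe()  # fresh observation before every decision
        thought, tool, args = policy(instruction, observation, history)
        if tool == "done":
            break
        try:
            result = tools[tool](**args)  # semantic skill, e.g. pick(obj="orange")
        except Exception as exc:
            # Failures are fed back as text so the next step can react to them.
            result = f"error: {exc}"
        history.append(Step(thought, tool, args, result))
    return history


def demo() -> List[Step]:
    """Toy world (an assumption for illustration): the first grasp always slips."""
    world = {"holding": None, "attempts": 0}

    def observe() -> str:
        return f"holding={world['holding']}"

    def pick(obj: str) -> str:
        world["attempts"] += 1
        if world["attempts"] == 1:
            raise RuntimeError("grasp slipped")
        world["holding"] = obj
        return "ok"

    def policy(instruction: str, observation: str, history: List[Step]):
        if observation == "holding=orange":
            return ("The orange is in hand.", "done", {})
        return ("The orange is not in hand; try grasping.", "pick", {"obj": "orange"})

    return run_episode("pick up the orange", observe, policy, {"pick": pick})
```

In this toy run the first `pick` call fails, the failure is recorded in the history, and the second call succeeds because the policy re-observes the scene instead of following a fixed open-loop plan. A real policy would be a vision-language model and the observation would include images and structured state rather than a string.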

Guava-Agent-4B

Building on these principles, we ask whether an effective harness can serve as a universal interface for embodied manipulation across model scales, including small open-source ones. We distill embodied tool-use behaviors into a small open-source model (Qwen3.5-4B) through supervised fine-tuning on fewer than 2K trajectories collected entirely in simulation.

Data generation pipeline: simulation environment, a frontier VLM, and the resulting reasoning + tool-call traces.
Data engine. Randomized simulation scenes plus error perturbations are paired with a frontier VLM. The pipeline yields interleaved reasoning and tool-call traces used to distill the 4B student model.

Data gallery

place the can in the box
pick up the orange
move the hotdog near the donut
pick up the hidden cube under the cup
push basket to the left
remove the red cube from the tray
push cereal box to the left
open drawer
set the table

Results

Guava-Agent-4B achieves the highest overall zero-shot real-world success rates on both in-distribution (86%) and out-of-distribution (92%) tasks, outperforming all baselines.

Bar chart comparing Guava-Agent-4B against CaP-Agent0, GPT 5.4, and Qwen3.5-4B on in-distribution and out-of-distribution real-world tasks.
Real-world success rates. The 4B Guava agent matches or beats frontier proprietary models in average success across both in-distribution and OOD tasks, while running on a compact open-source backbone.

Our method generalizes zero-shot across a broad range of tasks that require semantic and geometric reasoning. It handles long-horizon tasks and is robust to distractors and to OOD tasks and setups.

place all red objects into basket
set the table by putting bowl on plate and spoon next to it
place utensils on tray and food into basket
move the hotdog away from the donut
push basket to the right
arrange objects from left to right by increasing size
close drawer
stack the red cube on the green cube
find the hidden cube under the cups and pick it up

Failure recovery

Guava interleaves observation, reasoning, and tool execution on every step, so when something goes wrong, the agent can see the failure and revise its plan. We observe successful recovery from previously unseen failures, because recovery emerges from reasoning over execution feedback rather than from memorizing predefined correction patterns.

When the carrot is moved during execution, Guava detects the displacement and re-attempts grasping.
Guava recovers from unseen real-world control errors (e.g., joint limits and unreachable poses) by generating corrective actions before retrying.

RL fine-tuning

We further fine-tune Guava-Agent-4B with GRPO on two challenging OOD long-horizon manipulation tasks that require substantially more multi-step reasoning, error recovery, and action planning. While the SFT policy struggles on both the shell game (6.7%) and "place all red objects in basket" (0.0%), RL fine-tuning improves success rates to 60.0% and 93.3%, respectively. The large gains suggest that training on challenging long-horizon tasks with sparse success rewards effectively strengthens recovery behaviors and enables the policy to better handle off-trajectory states.

Comparison of SFT vs RL fine-tuning success rates on long-horizon tasks.
RL fine-tuning on long-horizon tasks with sparse rewards substantially improves over the SFT baseline.
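
For readers unfamiliar with GRPO, the core idea is that each rollout's reward is normalized against the other rollouts sampled for the same task, so no learned value function is needed. The sketch below shows only that normalization step under sparse binary success rewards. It is an illustration under stated assumptions, not the training code used here: the function name `group_relative_advantages` and the `eps` constant are hypothetical, and the population standard deviation is one common choice (implementations differ on this detail).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the mean and std of its own group.

    `rewards` holds one scalar per rollout of the same task, e.g. 1.0 for
    task success and 0.0 for failure. The list must be non-empty.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see the note above
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One success among four rollouts of the same task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

With one success among four rollouts, the successful rollout receives a positive advantage and the failures receive negative ones. When every rollout in a group fails (or every one succeeds), all advantages are zero and the group contributes no gradient signal, which is one reason tasks with near-zero initial success are hard to improve with sparse rewards alone.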

BibTeX

@misc{liu2026guavaeffectiveuniversalharness,
      title={Guava: An Effective and Universal Harness for Embodied Manipulation},
      author={Haowen Liu and Xirui Li and Shaoxiong Yao and Peng Shi and Tianyi Zhou and Jia-Bin Huang and Furong Huang and Jiayuan Mao},
      year={2026},
      eprint={2606.18363},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.18363},
}