CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation


CATNAV generates embodiment-aware traversability costmaps in a zero-shot manner using a multimodal LLM, reuses prior risk assessments through a visuosemantic cache, and selects the safest path via VLM-based trajectory reasoning.

Abstract

Navigating unstructured environments requires assessing traversal risk relative to a robot's physical capabilities, a challenge that varies across embodiments. We present CATNAV, a cost-aware traversability navigation framework that leverages multimodal LLMs for zero-shot, embodiment-aware costmap generation without task-specific training. We introduce a visuosemantic caching mechanism that detects scene novelty and reuses prior risk assessments for semantically similar frames, reducing online VLM queries by 85.7%. Furthermore, we introduce a VLM-based trajectory selection module that evaluates proposals through visual reasoning to choose the safest path given behavioral constraints. We evaluate CATNAV on a quadruped robot across indoor and outdoor unstructured environments, comparing against state-of-the-art vision-language-action baselines. Across five navigation tasks, CATNAV achieves a higher average goal-reaching rate (68% vs. 58%) and 33% fewer behavioral constraint violations.

Method

CATNAV turns a single multimodal LLM into an embodiment-aware navigation stack with no task-specific training. The system makes three contributions:

A zero-shot, embodiment-aware costmap generation framework that uses VLM semantic-consequence reasoning to infer per-object traversal risk conditioned on the robot's morphology and locomotion modality.
A visuosemantic caching mechanism that uses CLIP embeddings and a vector store to detect semantically recurrent scenes and reuse prior risk assessments, cutting online VLM queries by 85.7% and with them the associated query latency.
A VLM-based trajectory reasoning module that visually evaluates multi-proposal paths overlaid on the RGB image, selecting the safest trajectory given behavioral constraints and robot capabilities.
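
The caching contribution can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each frame has already been encoded to an embedding vector (CLIP in the paper; plain Python lists here) and uses a single nearest neighbor with a cosine-similarity cutoff. The names `RiskCache`, `assess_frame`, and `query_vlm`, and the threshold value `0.9`, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class RiskCache:
    """Hypothetical in-memory stand-in for the vector store."""
    threshold: float = 0.9  # assumed similarity cutoff, not taken from the paper
    entries: list = field(default_factory=list)  # (embedding, risk_assessment) pairs

    def lookup(self, embedding):
        """Return the nearest neighbor's assessment, or None if the scene is novel."""
        best_score, best_risk = -1.0, None
        for cached_embedding, risk in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_risk = score, risk
        return best_risk if best_score >= self.threshold else None

    def add(self, embedding, risk):
        self.entries.append((list(embedding), risk))


def assess_frame(embedding, cache, query_vlm):
    """Reuse a cached assessment when possible; otherwise query the VLM and store it.

    Returns (risk_assessment, queried), where queried is True if the VLM was called.
    """
    cached = cache.lookup(embedding)
    if cached is not None:
        return cached, False
    risk = query_vlm(embedding)
    cache.add(embedding, risk)
    return risk, True
```

In this sketch, a frame whose embedding is close to a stored one skips the VLM call entirely, which is how a cache of this kind would reduce online queries. A real vector store would replace the linear scan with an approximate nearest-neighbor index.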

At runtime, a novelty check (k-NN over CLIP embeddings) decides whether to query the LLM or reuse a cached risk assessment. Risk costs are segmented with CLIPSeg, projected into a 3D risk point cloud, and collapsed into a 2D occupancy grid. A TRRT planner proposes four candidate paths, which a second VLM query evaluates visually to choose the safest route under natural-language behavioral instructions.
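
The step that collapses the 3D risk point cloud into a 2D grid can be sketched as follows. This is an illustrative assumption about one reasonable way to do it, not the method documented by the authors: each point is binned by its x and y coordinates, the height is ignored, and each cell keeps the maximum risk seen. The function name `collapse_to_grid`, the cell size, and the choice of `0.0` for unobserved cells are all hypothetical.

```python
def collapse_to_grid(points, cell_size, width, height, origin=(0.0, 0.0)):
    """Collapse (x, y, z, risk) points into a height x width grid.

    Each cell stores the maximum risk of the points that fall inside it.
    Cells with no points keep 0.0 (an assumed default for unobserved space).
    Points outside the grid bounds are dropped.
    """
    grid = [[0.0] * width for _ in range(height)]
    for x, y, _z, risk in points:
        col = int((x - origin[0]) // cell_size)
        row = int((y - origin[1]) // cell_size)
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = max(grid[row][col], risk)
    return grid


# Example: four points on a 3 x 2 grid with 0.5 m cells.
points = [
    (0.1, 0.1, 0.0, 0.2),   # low-risk ground point
    (0.2, 0.3, 0.5, 0.9),   # high-risk point above the same cell
    (1.2, 0.1, 0.0, 0.4),
    (-0.5, 0.0, 0.0, 1.0),  # outside the grid, ignored
]
grid = collapse_to_grid(points, cell_size=0.5, width=3, height=2)
```

Taking the maximum per cell is a conservative choice: one risky point is enough to mark the cell as costly. An average or a height-filtered maximum would be equally plausible alternatives.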

CATNAV system architecture — **System overview.** Novelty check → embodiment-aware costmap construction → multi-proposal TRRT planning → VLM trajectory reasoning.

Results

68% average goal-reaching rate, up from 58% for OmniVLA

33% fewer behavioral constraint violations

85.7% reduction in online VLM scene queries

86.5% visuosemantic cache utilization

We evaluate CATNAV on a Unitree Go1 quadruped (ZED 2i stereo camera, Jetson Orin, GNSS) across five indoor and outdoor navigation tasks, comparing against the OmniVLA vision-language-action baseline over 10 trials each.

Qualitative costmap results — **Qualitative costmaps.** Across diverse indoor and outdoor scenes, CATNAV assigns high traversal cost to risky semantics (obstacles, crops, hazards) and low cost to walkable terrain. Costmaps recovered from cached risk assessments closely match those from fresh VLM queries, showing the cache preserves quality while cutting queries.

Trajectory comparisons across tasks (CATNAV vs. baseline)

Footpath navigation tasks — **Tasks 1–2: outdoor footpath.** CATNAV (green) keeps to the commanded side of the path and reaches the goal, while the baseline (blue) drifts off the walkable surface.

Dynamic obstacle task — **Task 3: dynamic human crossing.** CATNAV reasons about the crossing pedestrian and reroutes safely rather than driving through the person's path.

Bench avoidance task — **Task 4: bench avoidance.** Embodiment-aware costs mark the benches as non-traversable, so CATNAV plans a clear path around them to the goal.

Indoor paper avoidance task — **Task 5: indoor paper avoidance.** Given the behavioral instruction to avoid the paper on the floor, CATNAV selects the trajectory that respects the constraint.

VLM query frequency under caching — **Caching efficiency.** VLM query frequency across caching configurations — the visuosemantic cache cuts online queries by 85.7%.
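
As a quick check on what the 85.7% figure implies, the arithmetic below converts a relative reduction into the fraction of queries that remain. The baseline query count is not stated on this page, so the function names and the interpretation of the reduction as relative to an uncached configuration are assumptions.

```python
import math


def remaining_query_fraction(reduction):
    """Fraction of baseline VLM queries still issued after a relative reduction."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return 1.0 - reduction


def baseline_queries_per_remaining_query(reduction):
    """How many baseline queries each remaining query stands in for, on average."""
    fraction = remaining_query_fraction(reduction)
    return math.inf if fraction == 0.0 else 1.0 / fraction


fraction = remaining_query_fraction(0.857)
ratio = baseline_queries_per_remaining_query(0.857)
print(f"{fraction:.3f} of baseline queries remain")
print(f"about 1 query for every {ratio:.1f} baseline queries")
```

Under that reading, an 85.7% reduction leaves roughly one VLM query for every seven that would otherwise have been issued.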

BibTeX

arXiv preprint

@misc{potnis2026catnavcachedvisionlanguagetraversability,
      title={CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation},
      author={Aditya Potnis and Francisco Affonso and Shreya Gummadi and Naveen Kumar Uppalapati and Girish Chowdhary},
      year={2026},
      eprint={2603.22800},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.22800},
}