Safety-Critical AI for Robotics

Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models

High success rates on navigation-related tasks do not necessarily translate into reliable decision making. We evaluate whether modern LLMs and VLMs can be trusted for safety-critical spatial decisions.

1Dongguk University · 2Sungkyunkwan University · 3Carnegie Mellon University

*Equal contribution

In an emergency-evacuation task, Gemini-2.5 Flash directs users to important documents (32%) or a server room (1%) instead of the exit.

  • 93% — GPT-5's success rate on unknown-cell maps; the remaining 7% still included invalid paths with constraint violations.
  • 67% — Gemini-2.5 Flash's success rate on hard emergency evacuation, underperforming Gemini-2.0 Flash, which reached 100%.
  • 32% — share of hard emergency prompts where Gemini-2.5 Flash prioritized document retrieval over evacuation.

Abstract

High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. Our results show that important decision-making failures can persist even when overall performance is strong, underscoring the need for failure-focused analysis to understand model limitations and guide future progress. In a path-planning setting with unknown cells, GPT-5 achieved a high success rate of 93%, yet the remaining cases still included invalid paths. We also find that newer models are not always more reliable than their predecessors. In reasoning under safety-relevant information, Gemini-2.5 Flash achieved only 67% on the challenging emergency-evacuation task, underperforming Gemini-2.0 Flash, which reached 100% under the same condition. Across all evaluations, models exhibited structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions. These findings show that foundation models still exhibit substantial failures in navigation-related decision making and require fine-grained evaluation before they can be trusted.

Evaluation Framework

Overview of the three evaluation settings and representative input formats. The figure summarizes reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. In the prompts for reasoning under safety-relevant information, red text indicates phrases related to task difficulty, and blue text indicates important contextual clues.

Core Contributions

Six Diagnostic Tasks

Fine-grained evaluation of navigation-related decision making under complete, incomplete, and safety-relevant information.

Recurring Failure Modes

Unstable spatial grounding, structural breakdown, hallucinated reasoning, explicit constraint violations, and unsafe choices in emergency scenarios.

Newer Is Not Always Safer

Newer model versions do not consistently preserve safety-aligned behavior—Gemini-2.5 Flash underperformed Gemini-2.0 Flash on emergency evacuation.

Beyond Aggregate Accuracy

Strong performance does not guarantee reliable decision making—failure-focused analysis is essential before foundation models can be trusted.

Main Quantitative Results

Success rates (%) of LLMs on the map-based tasks and the reasoning-under-safety-relevant-information tasks. Each model was tested 100 times per task, with temperature and top-p set to 1.

Task                          | Gemini-2.5 Flash | Gemini-2.0 Flash | GPT-5 | GPT-4o | Llama-3-8b
Map-Based Tasks
Complete (Easy)               | 66  | 100 | 100 | 80  | 0
Complete (Normal)             | 93  | 0   | 100 | 0   | 0
Complete (Hard)               | 73  | 0   | 100 | 0   | 0
Unknown — Map 1               | 90  | 0   | 100 | 0   | 0
Unknown — Map 2               | 56  | 0   | 93  | 0   | 0
Reasoning under Safety-Relevant Information
Orientation-Tracking (Easy)   | 98  | 99  | 98  | 94  | 7
Orientation-Tracking (Normal) | 100 | 72  | 82  | 66  | 12
Orientation-Tracking (Hard)   | 100 | 42  | 100 | 53  | 51
Emergency Evacuation (Easy)   | 100 | 100 | 100 | 100 | 100
Emergency Evacuation (Hard)   | 67  | 100 | 100 | 98  | 46

Dot plot showing success rates of five LLMs across Easy, Normal, Hard complete maps and two unknown-cell maps

Complete + Unknown Map Tasks

GPT-5 achieved 100% success across all complete maps and remained the most stable performer on unknown-cell maps. Gemini-2.0 Flash and GPT-4o exhibited abrupt collapse once map complexity increased—dropping from 100% and 80% on Easy to 0% on Normal and Hard.

This pattern reveals that degradation is often non-gradual, shifting from near-success to total failure under small increases in spatial complexity.

Structural Integrity Failure

Llama-3-8b achieved 0% across all maps. It not only failed to produce continuous paths but also used invalid symbols and failed to preserve the input map structure itself, producing collapsed, disorganized outputs. These failures indicate a severe breakdown in structural preservation, not simple path-planning errors.
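The structural-preservation failure described above can be checked mechanically. The sketch below is a hypothetical checker, not the paper's actual harness: it assumes an illustrative symbol set ('#' wall, '.' free, '*' path) and verifies that an output map keeps the input's shape and walls, overwriting only free cells with the path symbol.

```python
# Hypothetical structural-preservation check: the output map must keep the
# input's dimensions and walls, and may only overwrite free cells with the
# path symbol. Symbols are illustrative, not the paper's exact format.
def preserves_structure(input_map: str, output_map: str,
                        wall: str = "#", free: str = ".", path: str = "*") -> bool:
    in_rows = input_map.strip("\n").splitlines()
    out_rows = output_map.strip("\n").splitlines()
    # Structural collapse: row count or any row length differs.
    if [len(r) for r in in_rows] != [len(r) for r in out_rows]:
        return False
    for in_row, out_row in zip(in_rows, out_rows):
        for a, b in zip(in_row, out_row):
            if a == wall and b != wall:
                return False  # a wall was deleted or moved
            if a == free and b not in (free, path):
                return False  # invalid symbol introduced on a free cell
    return True

preserves_structure("###\n#.#\n###", "###\n#*#\n###")  # → True
preserves_structure("###\n#.#\n###", "###\n#@#\n###")  # → False (invalid symbol)
```

A checker of this kind separates genuine path-planning errors from the deeper failure mode seen here, where the map itself is not reproduced.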

Side-by-side comparison: red boxes show Llama-3-8b collapsed map outputs with invalid paths, green boxes show correct answers with valid continuous paths

Egocentric Sequence Reasoning

We evaluate reasoning under incomplete spatial information using 100 short egocentric image sequences from indoor and outdoor navigation trajectories. Two complementary tasks are tested:

  • Turn-Direction Inference: Infer the turning direction from an ordered 5-frame sequence.
  • Missing-Frame Selection: Select the correct missing intermediate frame from two visually similar candidates.

Models span API-based (Gemini, GPT) and open-source (LLaVA, Qwen, InternVL) families, from 3B to 14B parameters.

Success rates (%) on the egocentric sequence reasoning tasks.

Model                   | Turn | Missing
API Models
Gemini-2.5 Flash        | 51%  | 68%
Gemini-2.0 Flash        | 53%  | 12%
GPT-5                   | 64%  | 92%
GPT-4o                  | 50%  | 54%
Open-Source Models
LLaVA-v1.6-vicuna-13B   | 37%  | 24%
LLaVA-v1.6-vicuna-7B    | 39%  | 23%
LLaVA-v1.6-mistral-7B   | 39%  | 59%
LLaVA-v1.5-7B           | 48%  | 10%
Qwen2.5-VL-7B-Instruct  | 52%  | 52%
Qwen2.5-VL-3B-Instruct  | 44%  | 54%
Qwen2.5-Omni-7B         | 52%  | 58%
InternVL3-14B           | 49%  | 67%

Sycophantic Bias in Turn-Direction Inference

Models exhibited a strong bias toward answering “right” regardless of the actual turning direction, resulting in accuracy rates mostly around 40–60%. Because “right” often carries an affirmative meaning, models may have favored it over more neutral alternatives.
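A directional bias of this kind can be quantified by comparing the model's rate of "right" answers against the ground-truth rate of right turns. The sketch below uses made-up illustrative data, not the paper's predictions:

```python
# Simple bias probe for the pattern described above: if the model's rate of
# "right" answers far exceeds the rate of actual right turns, accuracy near
# chance can reflect a directional prior rather than spatial reasoning.
# The predictions and labels here are illustrative, not real experiment data.
def right_bias(predictions, labels):
    pred_right = sum(p == "right" for p in predictions) / len(predictions)
    true_right = sum(y == "right" for y in labels) / len(labels)
    return pred_right - true_right  # > 0 indicates an answer bias toward "right"

preds = ["right", "right", "right", "left", "right", "right"]
truth = ["left",  "right", "left",  "left", "right", "left"]
right_bias(preds, truth)  # → 0.5 (5/6 predicted "right" vs 2/6 actual)
```

A bias score near zero with low accuracy points to genuine reasoning failure, while a large positive score points to the affirmative-answer prior hypothesized above.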

Hallucination in Missing-Frame Selection

Most models’ accuracy was close to random, suggesting failure to grasp spatial context. Hallucinated cases included inventing nonexistent options like “(C)” or “(J)”, incorrectly judging temporal continuity, or refusing to answer altogether.
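Scoring this task requires distinguishing wrong guesses from hallucinated or unparseable responses. A minimal, assumed answer-parsing scheme (the regex and category names are illustrative, not the paper's actual scorer) might look like:

```python
import re

# Illustrative scorer for a two-option task such as missing-frame selection:
# any answer outside the valid set {A, B} — including invented options like
# "(C)" or "(J)" — is counted as a hallucination rather than a wrong guess,
# and responses with no option at all are counted as refusals.
def classify_answer(response: str, valid=("A", "B")) -> str:
    match = re.search(r"\(([A-Z])\)", response)
    if match is None:
        return "refusal_or_unparseable"
    letter = match.group(1)
    return letter if letter in valid else "hallucinated_option"

classify_answer("The missing frame is (B).")  # → "B"
classify_answer("I would pick option (J).")   # → "hallucinated_option"
classify_answer("I cannot determine this.")   # → "refusal_or_unparseable"
```

Separating these categories matters because near-random accuracy alone cannot distinguish a model that guesses between valid options from one that invents options or refuses outright.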

Path Planning with Unknown Cells: Detailed Analysis

Constraint-Aware Reasoning and Safe Adaptation (GPT-5)

GPT-5 achieved 100% on Unknown Map 1 and 93% on Map 2. In all Map 1 trials, it explicitly stated “I assume that unknown cells ? is not passable,” demonstrating a stable safety-first bias. When the goal became unreachable under this assumption in Map 2, GPT-5 correctly responded “No path exists under this assumption” in 27% of runs.

However, among the 7% of Map 2 runs that failed, two involved diagonal movement, an explicitly prohibited action. This highlights a critical insight: high accuracy does not imply safety. In practical robotic settings, such violations may directly lead to unsafe or physically infeasible behavior.
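Violations like these are detectable with a simple rule check. The sketch below is a hypothetical validator under assumed conventions (a list-of-strings grid, '#' for walls, '?' for unknown cells, paths as (row, col) coordinates), not the paper's evaluation code:

```python
# Hypothetical path validator for the constraints described above: each step
# must move exactly one cell in a cardinal direction (diagonals prohibited),
# and no cell on the path may be a wall ('#') or an unknown cell ('?').
# Grid encoding and symbols are assumptions for illustration.
def violations(grid: list[str], path: list[tuple[int, int]]) -> list[str]:
    problems = []
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            problems.append(f"non-cardinal step {(r1, c1)} -> {(r2, c2)}")
    for r, c in path:
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            problems.append(f"out of bounds at {(r, c)}")
        elif grid[r][c] in "#?":
            problems.append(f"enters '{grid[r][c]}' cell at {(r, c)}")
    return problems

grid = ["S.?",
        ".#.",
        "..G"]
violations(grid, [(0, 0), (1, 1), (2, 2)])
# flags the two diagonal steps and the '#' cell at (1, 1)
```

Running every model-proposed path through such a validator, rather than only checking goal arrival, is what surfaces the constraint violations hidden inside nominally successful runs.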

Partial Alignment, Fragile Consistency

Gemini-2.5 Flash adopted the “not passable” assumption in 97% of Map 1 runs, but its success rate dropped to 57% on Map 2, with frequent failures such as obstacle traversal and map collapse.

These results indicate that although the model could imitate safety-oriented reasoning, it failed to maintain constraint consistency once uncertainty was introduced. Llama-3-8b showed the same collapse pattern, failing entirely on maps with unknown cells.

Reasoning under Safety-Relevant Information

Radar summary of model performance across orientation-tracking difficulty levels and emergency-evacuation tasks: GPT-5 and Gemini-2.5 Flash lead on orientation tracking, while Gemini-2.0 Flash leads on emergency evacuation.

Response distribution of Gemini-2.5 Flash on the hard emergency-evacuation task: 67% emergency exit, 32% professor's office, 1% server room. The model directs users to the professor's office or the server room instead of the exit.

Critical Failure Rate

In the emergency-evacuation experiment, Gemini-2.5 Flash directed users toward the professor’s office in 32% of trials, prioritizing document retrieval over evacuation. In 1% of trials, the model instructed users to head to the server room—a location never mentioned in the prompt. This hallucinated reasoning may further increase potential risk, as the server room is itself a high-risk area with potential explosion hazards.

The latest LLMs do not always outperform their predecessors. On the hard emergency-evacuation task, Gemini-2.5 Flash scored 33 percentage points lower than Gemini-2.0 Flash (67% vs. 100%), suggesting that post-training adaptation may introduce safety-alignment drift, whereby capabilities reinforced during later optimization do not consistently preserve previously learned safety-relevant behavior. In contrast, GPT-4o refused to respond to safety-critical prompts, while Gemini-2.5 Flash produced confident yet hazardous responses.

Emergency-evacuation scenario (fire): Gemini-2.5 Flash suggests the professor's office or server room instead of immediate evacuation, revealing failures in safety prioritization and contextual grounding.

Supplementary: Back-of-the-Building Task

Four failure types: (a) structural collapse with missing topology, (b) directional error failing to reach building rear, (c) constraint violation with path intersecting obstacles, (d) waypoint error at transition points

We tested GPT-4o, Claude Opus 4.1, and Claude Sonnet 4 with an identical instruction: “Navigate the robot to the back of the building.” The task required inferring the robot’s position within the scene, transforming a first-person viewpoint into a top-down layout, and generating a coherent map that links visual perception with spatial reasoning.

The tested models showed limited ability to establish stable spatial correspondences between the visual scene and the generated map. Most produced partially plausible layouts but failed to consistently identify the correct orientation, preserve the structural integrity of the building, or maintain feasible trajectories.

As shown in the figure, these results indicate recurring breakdowns in visual–spatial grounding and constraint adherence: (a) structural collapse, (b) directional error, (c) constraint violation, and (d) waypoint error.

Discussion

Our findings indicate a clear gap between overall task performance and reliable robotic decision making. Models were often competent when spatial structure was explicit and constraints were easy to satisfy, yet this competence did not consistently carry over to settings that required inference from incomplete context, stable visual–spatial grounding, or prioritization of safety under competing cues. The transition from solving the task to solving it safely and reliably remains fragile.

This gap became evident through qualitatively different forms of model breakdown, including structural collapse in symbolic maps, hallucinated reasoning in sequence and emergency scenarios, violations of explicit movement constraints, and unsafe choices under goal-conflicting prompts. Some newer models also did not behave more safely than earlier ones under the same prompt, suggesting that gains in general capability do not automatically translate into more reliable safety prioritization. In their current form, foundation models are better viewed as assistive reasoning components than as autonomous decision makers.

Takeaways

  • Six diagnostic tasks for fine-grained evaluation of navigation-related decision making under complete, incomplete, and safety-relevant information.
  • Recurring failure modes identified in current LLMs and VLMs, including unstable spatial grounding, structural breakdown, hallucinated reasoning, constraint violations, and unsafe choices in emergency scenarios.
  • Newer models are not always safer — Gemini-2.5 Flash underperformed its predecessor on emergency evacuation, suggesting safety-alignment drift.
  • Strong performance does not guarantee reliable decision making — the key question is not only whether a model can solve a task, but whether it can support decisions that remain reliable when safety is at stake.
  • Failure-centered evaluation is essential before foundation models are deployed in safety-critical robotic systems.

BibTeX

@article{han2025beforewetrust,
  title={Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models},
  author={Han, Jua and Seo, Jaeyoon and Min, Jungbin and Choi, Sieun and Seo, Huichan and Kim, Jihie and Oh, Jean},
  year={2025},
  url={https://cmubig.github.io/before-we-trust-them/}
}