High success rates on navigation-related tasks do not necessarily translate into reliable decision making. We evaluate whether modern LLMs and VLMs can be trusted for safety-critical spatial decisions.
1Dongguk University · 2Sungkyunkwan University · 3Carnegie Mellon University
*Equal contribution
93%
GPT-5 on unknown-cell maps—yet the remaining 7% still included invalid paths with constraint violations.
67%
Gemini-2.5 Flash on hard emergency evacuation, underperforming Gemini-2.0 Flash, which reached 100%.
32%
Hard emergency prompts where Gemini-2.5 Flash prioritized document retrieval over evacuation.
High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. Our results show that important decision-making failures can persist even when overall performance is strong, underscoring the need for failure-focused analysis to understand model limitations and guide future progress. In a path-planning setting with unknown cells, GPT-5 achieved a high success rate of 93%, yet the remaining cases still included invalid paths. We also find that newer models are not always more reliable than their predecessors. In reasoning under safety-relevant information, Gemini-2.5 Flash achieved only 67% on the challenging emergency-evacuation task, underperforming Gemini-2.0 Flash, which reached 100% under the same condition. Across all evaluations, models exhibited structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions. These findings show that foundation models still exhibit substantial failures in navigation-related decision making and require fine-grained evaluation before they can be trusted.
Overview of the three evaluation settings and representative input formats. The figure summarizes reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. In the prompts for reasoning under safety-relevant information, red text indicates phrases related to task difficulty, and blue text indicates important contextual clues.
Fine-grained evaluation of navigation-related decision making under complete, incomplete, and safety-relevant information.
Unstable spatial grounding, structural breakdown, hallucinated reasoning, explicit constraint violations, and unsafe choices in emergency scenarios.
Newer model versions do not consistently preserve safety-aligned behavior—Gemini-2.5 Flash underperformed Gemini-2.0 Flash on emergency evacuation.
Strong performance does not guarantee reliable decision making—failure-focused analysis is essential before foundation models can be trusted.
| Task | Gemini-2.5 Flash | Gemini-2.0 Flash | GPT-5 | GPT-4o | Llama-3-8b |
|---|---|---|---|---|---|
| Map-Based Task (Success rate %) | |||||
| Complete (Easy) | 66 | 100 | 100 | 80 | 0 |
| Complete (Normal) | 93 | 0 | 100 | 0 | 0 |
| Complete (Hard) | 73 | 0 | 100 | 0 | 0 |
| Unknown — Map 1 | 90 | 0 | 100 | 0 | 0 |
| Unknown — Map 2 | 56 | 0 | 93 | 0 | 0 |
| Reasoning under Safety-Relevant Information (Success rate %) | |||||
| Orientation-Tracking (Easy) | 98 | 99 | 98 | 94 | 7 |
| Orientation-Tracking (Normal) | 100 | 72 | 82 | 66 | 12 |
| Orientation-Tracking (Hard) | 100 | 42 | 100 | 53 | 51 |
| Emergency Evacuation (Easy) | 100 | 100 | 100 | 100 | 100 |
| Emergency Evacuation (Hard) | 67 | 100 | 100 | 98 | 46 |
Each model was tested 100 times per task, with temperature and top-p set to 1.
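For concreteness, a minimal sketch of this protocol, assuming an OpenAI-style chat API; the model name, task prompt, and success checker are placeholders, not the paper's actual harness.

```python
# Sketch of the evaluation loop: 100 independent trials per task,
# temperature and top-p both fixed at 1 (OpenAI-style API assumed).
from typing import Callable
from openai import OpenAI

client = OpenAI()

def run_trials(model: str, prompt: str,
               is_success: Callable[[str], bool],
               n_trials: int = 100) -> float:
    """Return the success rate of one model on one task prompt."""
    successes = 0
    for _ in range(n_trials):
        response = client.chat.completions.create(
            model=model,  # e.g. "gpt-4o" (placeholder)
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
            top_p=1.0,
        )
        # `is_success` is a task-specific checker, e.g. a path validator.
        if is_success(response.choices[0].message.content):
            successes += 1
    return successes / n_trials
```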
GPT-5 achieved 100% success across all complete maps and remained the most stable performer on unknown-cell maps. Gemini-2.0 Flash and GPT-4o exhibited abrupt collapse once map complexity increased—dropping from 100% and 80% on Easy to 0% on Normal and Hard.
This pattern reveals that degradation is often non-gradual, shifting from near-success to total failure under small increases in spatial complexity.
Llama-3-8b achieved 0% across all maps. It not only failed to produce continuous paths but also used invalid symbols and failed to preserve the input map structure itself, producing collapsed, disorganized outputs. These failures indicate a severe breakdown in structural preservation, not simple path-planning errors.
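These distinct failure modes suggest scoring outputs with separate structural and path checks rather than a single pass/fail judgment. A sketch of such checks, under an assumed grid encoding ('#' wall, 'S' start, 'G' goal, '*' path cell) that is ours, not the paper's exact format:

```python
# Hypothetical grid encoding: '#' wall, 'S' start, 'G' goal,
# '*' a cell the model marked as part of its path.

def preserves_structure(input_map: list[str], output_map: list[str]) -> bool:
    """Structural preservation: same shape, and every wall stays a wall."""
    if len(input_map) != len(output_map):
        return False
    for src, out in zip(input_map, output_map):
        if len(src) != len(out):
            return False
        if any(s == '#' and o != '#' for s, o in zip(src, out)):
            return False
    return True

def path_is_valid(output_map: list[str]) -> bool:
    """The marked path must 4-connect S to G, with no gaps or diagonal hops."""
    cells = {(r, c) for r, row in enumerate(output_map)
             for c, ch in enumerate(row) if ch in 'SG*'}
    starts = [p for p in cells if output_map[p[0]][p[1]] == 'S']
    goals = [p for p in cells if output_map[p[0]][p[1]] == 'G']
    if len(starts) != 1 or len(goals) != 1:
        return False  # start/goal missing or duplicated
    frontier, seen = [starts[0]], {starts[0]}
    while frontier:
        r, c = frontier.pop()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in cells and nxt not in seen:  # orthogonal steps only
                seen.add(nxt)
                frontier.append(nxt)
    return seen == cells  # every marked cell reachable from S without diagonals
```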
We evaluate reasoning under incomplete spatial information using 100 short egocentric image sequences from indoor and outdoor navigation trajectories. Two complementary tasks are tested: predicting the turning direction between frames (Turn) and identifying the missing frame in a sequence (Missing).
Models span API-based (Gemini, GPT) and open-source (LLaVA, Qwen, InternVL) families, with the open-source models ranging from 3B to 14B parameters.
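A sketch of a single Turn query, again assuming an OpenAI-style API with image inputs; the prompt wording and option labels are illustrative, not the exact ones used in the benchmark.

```python
import base64
from openai import OpenAI

client = OpenAI()

def ask_turn_direction(model: str, frame_paths: list[str]) -> str:
    """Query a VLM with an egocentric frame sequence (hypothetical prompt)."""
    content = [{"type": "text", "text":
                "These frames are consecutive egocentric views along a walk. "
                "Which way does the camera turn next? Answer (A) left or (B) right."}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=1.0,
        top_p=1.0,
    )
    return response.choices[0].message.content
```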
| Model | Turn (accuracy) | Missing (accuracy) |
|---|---|---|
| API Models | ||
| Gemini-2.5 Flash | 51% | 68% |
| Gemini-2.0 Flash | 53% | 12% |
| GPT-5 | 64% | 92% |
| GPT-4o | 50% | 54% |
| Open-Source Models | ||
| LLaVA-v1.6-vicuna-13B | 37% | 24% |
| LLaVA-v1.6-vicuna-7B | 39% | 23% |
| LLaVA-v1.6-mistral-7B | 39% | 59% |
| LLaVA-v1.5-7B | 48% | 10% |
| Qwen2.5-VL-7B-Instruct | 52% | 52% |
| Qwen2.5-VL-3B-Instruct | 44% | 54% |
| Qwen2.5-Omni-7B | 52% | 58% |
| InternVL3-14B | 49% | 67% |
Models exhibited a strong bias toward answering “right” regardless of the actual turning direction, resulting in accuracy rates mostly around 40–60%. Because “right” often carries an affirmative meaning, models may have favored it over more neutral alternatives.
Most models’ accuracy was close to random, suggesting failure to grasp spatial context. Hallucinated cases included inventing nonexistent options like “(C)” or “(J)”, incorrectly judging temporal continuity, or refusing to answer altogether.
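Both patterns, the directional bias and the out-of-option answers, can be surfaced by tallying parsed responses against the valid label set; a sketch, assuming lettered multiple-choice answers:

```python
import re
from collections import Counter

VALID_OPTIONS = {"A", "B"}  # assumed label set for the Turn task

def tally_answers(responses: list[str]) -> tuple[Counter, int, int]:
    """Count chosen options, hallucinated labels, and unparsable answers."""
    counts, hallucinated, refusals = Counter(), 0, 0
    for text in responses:
        match = re.search(r"\(([A-Z])\)", text)
        if match is None:
            refusals += 1      # no option label parsed (refusal or free-form answer)
        elif match.group(1) in VALID_OPTIONS:
            counts[match.group(1)] += 1
        else:
            hallucinated += 1  # invented option such as "(C)" or "(J)"
    return counts, hallucinated, refusals
```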
GPT-5 achieved 100% on Unknown Map 1 and 93% on Map 2. In all Map 1 trials, it explicitly stated “I assume that unknown cells ? is not passable,” demonstrating a stable safety-first bias. When the goal became unreachable under this assumption in Map 2, GPT-5 correctly responded “No path exists under this assumption” in 27% of runs.
However, two Map 2 failures (7%) involved diagonal movement—an explicitly prohibited action. This highlights a critical insight: high accuracy does not imply safety. In practical robotic settings, such violations may directly lead to unsafe or physically infeasible behaviors.
Gemini-2.5 Flash adopted the “not passable” assumption in 97% of Map 1 runs, but its success rate dropped to 56% on Map 2, with frequent failures such as obstacle traversal and map collapse.
These results indicate that although the model could imitate safety-oriented reasoning, it failed to maintain constraint consistency once uncertainty was introduced. Llama-3-8b showed the same collapse pattern, failing entirely on maps with unknown cells.
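GPT-5's conservative reading corresponds to a reachability check that blocks every unknown cell; a sketch using the same assumed encoding as above, with '?' marking unknown cells:

```python
from collections import deque

def reachable_under_conservative_assumption(grid: list[str]) -> bool:
    """BFS from S to G, treating '?' (unknown) and '#' (wall) as blocked."""
    start = next((r, c) for r, row in enumerate(grid)
                 for c, ch in enumerate(row) if ch == 'S')  # assumes one 'S'
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == 'G':
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    and grid[nr][nc] not in '#?' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False  # mirrors "No path exists under this assumption"
```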
Radar summary of model performance across orientation-tracking difficulty levels and emergency evacuation tasks.
Response distribution of Gemini-2.5 Flash on the hard emergency-evacuation task. The model directs users to the professor’s office (32%) or the server room (1%) instead of the emergency exit.
In the emergency-evacuation experiment, Gemini-2.5 Flash directed users toward the professor’s office in 32% of trials, prioritizing document retrieval over evacuation. In 1% of trials, the model instructed users to head to the server room, a location never mentioned in the prompt. This hallucinated reasoning may compound the risk, as a server room is itself a high-risk area with potential explosion hazards.
The latest LLMs do not always outperform their predecessors. On the hard emergency-evacuation task, Gemini-2.5 Flash scored 33 percentage points below Gemini-2.0 Flash (67% vs. 100%), suggesting that post-training adaptation may introduce safety-alignment drift, whereby capabilities reinforced during later optimization do not consistently preserve previously learned safety-relevant behavior. In contrast, GPT-4o refused to respond to some safety-critical prompts, while Gemini-2.5 Flash produced confident yet hazardous responses.
Emergency-evacuation scenario: Gemini-2.5 Flash suggests unsafe destinations instead of immediate evacuation, revealing failures in safety prioritization and contextual grounding.
We tested GPT-4o, Claude Opus 4.1, and Claude Sonnet 4 with an identical instruction: “Navigate the robot to the back of the building.” The task required inferring the robot’s position within the scene, transforming a first-person viewpoint into a top-down layout, and generating a coherent map that links visual perception with spatial reasoning.
The tested models showed limited ability to establish stable spatial correspondences between the visual scene and the generated map. Most produced partially plausible layouts but failed to consistently identify the correct orientation, preserve the structural integrity of the building, or maintain feasible trajectories.
As shown in the figure, these results indicate recurring breakdowns in visual–spatial grounding and constraint adherence: (a) structural collapse, (b) directional error, (c) constraint violation, and (d) waypoint error.
Our findings indicate a clear gap between overall task performance and reliable robotic decision making. Models were often competent when spatial structure was explicit and constraints were easy to satisfy, yet this competence did not consistently carry over to settings that required inference from incomplete context, stable visual–spatial grounding, or prioritization of safety under competing cues. The transition from solving the task to solving it safely and reliably remains fragile.
This gap became evident through qualitatively different forms of model breakdown, including structural collapse in symbolic maps, hallucinated reasoning in sequence and emergency scenarios, violations of explicit movement constraints, and unsafe choices under goal-conflicting prompts. Some newer models also did not behave more safely than earlier ones under the same prompt, suggesting that gains in general capability do not automatically translate into more reliable safety prioritization. In their current form, foundation models are better viewed as assistive reasoning components than as autonomous decision makers.
@article{han2025beforewetrust,
title={Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models},
author={Han, Jua and Seo, Jaeyoon and Min, Jungbin and Choi, Sieun and Seo, Huichan and Kim, Jihie and Oh, Jean},
year={2025},
url={https://cmubig.github.io/before-we-trust-them/}
}