High success rates on navigation-related tasks do not necessarily translate into reliable decision making. We evaluate whether modern LLMs and VLMs can be trusted for safety-critical spatial decisions.
1Dongguk University · 2Sungkyunkwan University · 3Carnegie Mellon University
*Equal contribution
93%
GPT-5 on unknown-cell maps—yet the remaining 7% still included invalid paths with constraint violations.
67%
Gemini-2.5 Flash on hard emergency evacuation, underperforming Gemini-2.0 Flash, which reached 100%.
32%
Hard emergency prompts where Gemini-2.5 Flash prioritized document retrieval over evacuation.
High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. Our results show that important decision-making failures can persist even when overall performance is strong, underscoring the need for failure-focused analysis to understand model limitations and guide future progress. In a path-planning setting with unknown cells, GPT-5 achieved a high success rate of 93%, yet the remaining cases still included invalid paths. We also find that newer models are not always more reliable than their predecessors. In reasoning under safety-relevant information, Gemini-2.5 Flash achieved only 67% on the challenging emergency-evacuation task, underperforming Gemini-2.0 Flash, which reached 100% under the same condition. Across all evaluations, models exhibited structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions. These findings show that foundation models still exhibit substantial failures in navigation-related decision making and require fine-grained evaluation before they can be trusted.
Overview of the three evaluation settings and representative input formats. The figure summarizes reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. In the prompts for reasoning under safety-relevant information, red text indicates phrases related to task difficulty, and blue text indicates important contextual clues.
Fine-grained evaluation of navigation-related decision making under complete, incomplete, and safety-relevant information.
Unstable spatial grounding, structural breakdown, hallucinated reasoning, explicit constraint violations, and unsafe choices in emergency scenarios.
Newer model versions do not consistently preserve safety-aligned behavior—Gemini-2.5 Flash underperformed Gemini-2.0 Flash on emergency evacuation.
Strong performance does not guarantee reliable decision making—failure-focused analysis is essential before foundation models can be trusted.
| Task | Gemini-2.5 Flash | Gemini-2.0 Flash | GPT-5 | GPT-4o | Llama-3-8b |
|---|---|---|---|---|---|
| Map-Based Task (Success rate %) | |||||
| Complete (Easy) | 66 | 100 | 100 | 80 | 0 |
| Complete (Normal) | 93 | 0 | 100 | 0 | 0 |
| Complete (Hard) | 73 | 0 | 100 | 0 | 0 |
| Unknown — Map 1 | 90 | 0 | 100 | 0 | 0 |
| Unknown — Map 2 | 56 | 0 | 93 | 0 | 0 |
| Reasoning under Safety-Relevant Information (Success rate %) | |||||
| Orientation-Tracking (Easy) | 98 | 99 | 98 | 94 | 7 |
| Orientation-Tracking (Normal) | 100 | 72 | 82 | 66 | 12 |
| Orientation-Tracking (Hard) | 100 | 42 | 100 | 53 | 51 |
| Emergency Evacuation (Easy) | 100 | 100 | 100 | 100 | 100 |
| Emergency Evacuation (Hard) | 67 | 100 | 100 | 98 | 46 |
Each model was tested 100 times per task, with temperature and top-p set to 1.
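For concreteness, a minimal sketch of this protocol, assuming an OpenAI-style chat API; the model name, task prompt, and success checker are placeholders, not the paper's actual harness.

```python
# Sketch of the evaluation loop: 100 independent trials per task,
# temperature and top-p both fixed at 1 (OpenAI-style API assumed).
from typing import Callable
from openai import OpenAI

client = OpenAI()

def run_trials(model: str, prompt: str,
               is_success: Callable[[str], bool],
               n_trials: int = 100) -> float:
    """Return the success rate of one model on one task prompt."""
    successes = 0
    for _ in range(n_trials):
        response = client.chat.completions.create(
            model=model,  # e.g. "gpt-4o" (placeholder)
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
            top_p=1.0,
        )
        # `is_success` is a task-specific checker, e.g. a path validator.
        if is_success(response.choices[0].message.content):
            successes += 1
    return successes / n_trials
```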
GPT-5 achieved 100% success across all complete maps and remained the most stable performer on unknown-cell maps. Gemini-2.0 Flash and GPT-4o exhibited abrupt collapse once map complexity increased—dropping from 100% and 80% on Easy to 0% on Normal and Hard.
This pattern reveals that degradation is often non-gradual, shifting from near-success to total failure under small increases in spatial complexity.
Llama-3-8b achieved 0% across all maps. It not only failed to produce continuous paths but also used invalid symbols and failed to preserve the input map structure itself, producing collapsed, disorganized outputs. These failures indicate a severe breakdown in structural preservation, not simple path-planning errors.
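These distinct failure modes suggest scoring outputs with separate structural and path checks rather than a single pass/fail judgment. A sketch of such checks, under an assumed grid encoding ('#' wall, 'S' start, 'G' goal, '*' path cell) that is ours, not the paper's exact format:

```python
# Hypothetical grid encoding: '#' wall, 'S' start, 'G' goal,
# '*' a cell the model marked as part of its path.

def preserves_structure(input_map: list[str], output_map: list[str]) -> bool:
    """Structural preservation: same shape, and every wall stays a wall."""
    if len(input_map) != len(output_map):
        return False
    for src, out in zip(input_map, output_map):
        if len(src) != len(out):
            return False
        if any(s == '#' and o != '#' for s, o in zip(src, out)):
            return False
    return True

def path_is_valid(output_map: list[str]) -> bool:
    """The marked path must 4-connect S to G, with no gaps or diagonal hops."""
    cells = {(r, c) for r, row in enumerate(output_map)
             for c, ch in enumerate(row) if ch in 'SG*'}
    starts = [p for p in cells if output_map[p[0]][p[1]] == 'S']
    goals = [p for p in cells if output_map[p[0]][p[1]] == 'G']
    if len(starts) != 1 or len(goals) != 1:
        return False  # start/goal missing or duplicated
    frontier, seen = [starts[0]], {starts[0]}
    while frontier:
        r, c = frontier.pop()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in cells and nxt not in seen:  # orthogonal steps only
                seen.add(nxt)
                frontier.append(nxt)
    return seen == cells  # every marked cell reachable from S without diagonals
```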
We evaluate reasoning under incomplete spatial information using 100 short egocentric image sequences from indoor and outdoor navigation trajectories. Two complementary tasks are tested: predicting the turning direction between frames (Turn) and identifying the missing frame in a sequence (Missing).
Models span API-based (Gemini, GPT) and open-source (LLaVA, Qwen, InternVL) families, with the open-source models ranging from 3B to 14B parameters.
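A sketch of a single Turn query, again assuming an OpenAI-style API with image inputs; the prompt wording and option labels are illustrative, not the exact ones used in the benchmark.

```python
import base64
from openai import OpenAI

client = OpenAI()

def ask_turn_direction(model: str, frame_paths: list[str]) -> str:
    """Query a VLM with an egocentric frame sequence (hypothetical prompt)."""
    content = [{"type": "text", "text":
                "These frames are consecutive egocentric views along a walk. "
                "Which way does the camera turn next? Answer (A) left or (B) right."}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=1.0,
        top_p=1.0,
    )
    return response.choices[0].message.content
```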
| Model | Turn (accuracy) | Missing (accuracy) |
|---|---|---|
| API Models | ||
| Gemini-2.5 Flash | 51% | 68% |
| Gemini-2.0 Flash | 53% | 12% |
| GPT-5 | 64% | 92% |
| GPT-4o | 50% | 54% |
| Open-Source Models | ||
| LLaVA-v1.6-vicuna-13B | 37% | 24% |
| LLaVA-v1.6-vicuna-7B | 39% | 23% |
| LLaVA-v1.6-mistral-7B | 39% | 59% |
| LLaVA-v1.5-7B | 48% | 10% |
| Qwen2.5-VL-7B-Instruct | 52% | 52% |
| Qwen2.5-VL-3B-Instruct | 44% | 54% |
| Qwen2.5-Omni-7B | 52% | 58% |
| InternVL3-14B | 49% | 67% |
Models exhibited a strong bias toward answering “right” regardless of the actual turning direction, resulting in accuracy rates mostly around 40–60%. Because “right” often carries an affirmative meaning, models may have favored it over more neutral alternatives.
Most models’ accuracy was close to random, suggesting failure to grasp spatial context. Hallucinated cases included inventing nonexistent options like “(C)” or “(J)”, incorrectly judging temporal continuity, or refusing to answer altogether.
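Both patterns, the directional bias and the out-of-option answers, can be surfaced by tallying parsed responses against the valid label set; a sketch, assuming lettered multiple-choice answers:

```python
import re
from collections import Counter

VALID_OPTIONS = {"A", "B"}  # assumed label set for the Turn task

def tally_answers(responses: list[str]) -> tuple[Counter, int, int]:
    """Count chosen options, hallucinated labels, and unparsable answers."""
    counts, hallucinated, refusals = Counter(), 0, 0
    for text in responses:
        match = re.search(r"\(([A-Z])\)", text)
        if match is None:
            refusals += 1      # no option label parsed (refusal or free-form answer)
        elif match.group(1) in VALID_OPTIONS:
            counts[match.group(1)] += 1
        else:
            hallucinated += 1  # invented option such as "(C)" or "(J)"
    return counts, hallucinated, refusals
```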
GPT-5 achieved 100% on Unknown Map 1 and 93% on Map 2. In all Map 1 trials, it explicitly stated “I assume that unknown cells ? is not passable,” demonstrating a stable safety-first bias. When the goal became unreachable under this assumption in Map 2, GPT-5 correctly responded “No path exists under this assumption” in 27% of runs.
However, two Map 2 failures (7%) involved diagonal movement—an explicitly prohibited action. This highlights a critical insight: high accuracy does not imply safety. In practical robotic settings, such violations may directly lead to unsafe or physically infeasible behaviors.
Gemini-2.5 Flash adopted the “not passable” assumption in 97% of Map 1 runs, but its success rate dropped to 56% on Map 2, with frequent failures such as obstacle traversal and map collapse.
These results indicate that although the model could imitate safety-oriented reasoning, it failed to maintain constraint consistency once uncertainty was introduced. Llama-3-8b showed the same collapse pattern, failing entirely on maps with unknown cells.
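GPT-5's conservative reading corresponds to a reachability check that blocks every unknown cell; a sketch using the same assumed encoding as above, with '?' marking unknown cells:

```python
from collections import deque

def reachable_under_conservative_assumption(grid: list[str]) -> bool:
    """BFS from S to G, treating '?' (unknown) and '#' (wall) as blocked."""
    start = next((r, c) for r, row in enumerate(grid)
                 for c, ch in enumerate(row) if ch == 'S')  # assumes one 'S'
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == 'G':
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    and grid[nr][nc] not in '#?' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False  # mirrors "No path exists under this assumption"
```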
Radar summary of model performance across orientation-tracking difficulty levels and emergency evacuation tasks.
Response distribution of Gemini-2.5 Flash on the hard emergency-evacuation task. The model directs users to the professor’s office (32%) or the server room (1%) instead of the emergency exit.
In the emergency-evacuation experiment, Gemini-2.5 Flash directed users toward the professor’s office in 32% of trials, prioritizing document retrieval over evacuation. In 1% of trials, the model instructed users to head to the server room, a location never mentioned in the prompt. This hallucinated reasoning may compound the risk, as a server room is itself a high-risk area with potential explosion hazards.
The latest LLMs do not always outperform their predecessors. On the hard emergency-evacuation task, Gemini-2.5 Flash scored 33 percentage points below Gemini-2.0 Flash (67% vs. 100%), suggesting that post-training adaptation may introduce safety-alignment drift, whereby capabilities reinforced during later optimization do not consistently preserve previously learned safety-relevant behavior. In contrast, GPT-4o refused to respond to some safety-critical prompts, while Gemini-2.5 Flash produced confident yet hazardous responses.
Emergency-evacuation scenario: Gemini-2.5 Flash suggests unsafe destinations instead of immediate evacuation, revealing failures in safety prioritization and contextual grounding.
We tested GPT-4o, Claude Opus 4.1, and Claude Sonnet 4 with an identical instruction: “Navigate the robot to the back of the building.” The task required inferring the robot’s position within the scene, transforming a first-person viewpoint into a top-down layout, and generating a coherent map that links visual perception with spatial reasoning.
The tested models showed limited ability to establish stable spatial correspondences between the visual scene and the generated map. Most produced partially plausible layouts but failed to consistently identify the correct orientation, preserve the structural integrity of the building, or maintain feasible trajectories.
As shown in the figure, these results indicate recurring breakdowns in visual–spatial grounding and constraint adherence: (a) structural collapse, (b) directional error, (c) constraint violation, and (d) waypoint error.
Our findings indicate a clear gap between overall task performance and reliable robotic decision making. Models were often competent when spatial structure was explicit and constraints were easy to satisfy, yet this competence did not consistently carry over to settings that required inference from incomplete context, stable visual–spatial grounding, or prioritization of safety under competing cues. The transition from solving the task to solving it safely and reliably remains fragile.
This gap became evident through qualitatively different forms of model breakdown, including structural collapse in symbolic maps, hallucinated reasoning in sequence and emergency scenarios, violations of explicit movement constraints, and unsafe choices under goal-conflicting prompts. Some newer models also did not behave more safely than earlier ones under the same prompt, suggesting that gains in general capability do not automatically translate into more reliable safety prioritization. In their current form, foundation models are better viewed as assistive reasoning components than as autonomous decision makers.
@article{han2025beforewetrust,
title={Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models},
author={Han, Jua and Seo, Jaeyoon and Min, Jungbin and Choi, Sieun and Seo, Huichan and Kim, Jihie and Oh, Jean},
year={2025},
url={https://cmubig.github.io/before-we-trust-them/}
}