A single unsafe instruction can be catastrophic in embodied environments. We evaluate whether modern LLMs and VLMs can be trusted for safety-critical spatial decisions.
1Dongguk University · 2Sungkyunkwan University · 3Carnegie Mellon University
*Equal contribution
32%
Hard emergency prompts where Gemini-2.5 Flash prioritized document retrieval over evacuation.
1%
Runs where a hallucinated server-room route was suggested during fire-evacuation decision making.
0%
Success observed on several map variants for weaker baselines under increased spatial complexity.
One mistake by an AI system in a safety-critical setting can cost lives. Large Language Models (LLMs) are increasingly integral to robotics as decision-making tools, powering applications from navigation to human-robot interaction. However, robots carry a physical dimension of risk: a single wrong instruction can directly endanger human safety. This highlights the urgent need to systematically evaluate how LLMs perform in scenarios where even minor errors are catastrophic.

In our qualitative evaluation of LLM-based decision-making (e.g., a fire evacuation scenario), we identified several critical failure cases that expose the dangers of deploying these models in safety-critical settings. Based on these observations, we designed seven tasks that provide complementary quantitative assessments. The tasks are divided into complete information, incomplete information, and Safety-Oriented Spatial Reasoning (SOSR) formats, where the SOSR tasks are defined through natural language instructions.

Complete information tasks use fully specified ASCII maps, enabling direct evaluation under explicit conditions. Unlike images, ASCII maps minimize ambiguity in interpretation and align directly with the textual modality of LLMs, allowing us to isolate spatial reasoning and path-planning abilities while keeping evaluation transparent and reproducible. Incomplete information tasks require models to infer missing directional or movement context from a given sequence, allowing us to evaluate whether they correctly capture spatial continuity or instead hallucinate. SOSR tasks use natural language questions to test whether LLMs can make safe decisions in scenarios where even a single error may be life-threatening; because the information is provided only as natural language, the model must fully infer the spatial context.

We evaluate LLMs and Vision-Language Models (VLMs) on these tasks to measure their spatial reasoning ability and safety reliability.
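To make the complete-information setup concrete, the sketch below shows how an ASCII-map navigation answer can be scored mechanically. This is our own minimal illustration, not the benchmark's released evaluation code: the map layout, the `S`/`G`/`#` symbols, and the `U/D/L/R` move encoding are all hypothetical.

```python
# Illustrative ASCII-map scorer (hypothetical map and move format,
# not the benchmark's actual specification).

ASCII_MAP = [
    "#######",
    "#S..#.#",
    "#.#...#",
    "#...#G#",
    "#######",
]

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def find(symbol, grid):
    """Locate a symbol such as 'S' (start) or 'G' (goal) on the grid."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not on map")

def path_is_safe(grid, moves):
    """True iff the move string reaches G without crossing a wall '#'."""
    r, c = find("S", grid)
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if grid[r][c] == "#":  # obstacle violation: immediate failure
            return False
    return grid[r][c] == "G"  # success only if the path ends at the goal
```

Under this kind of strict check, a run counts as a success only if every intermediate cell is free and the final cell is the goal; a plan that is "almost right" scores zero, which matches how a robot would actually experience the plan.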
Crucially, beyond aggregate performance, we analyze the implications of a 1% failure rate through case studies, highlighting how "rare" errors can escalate into catastrophic outcomes. The results reveal serious vulnerabilities. For instance, several LLMs achieved a 0% success rate in ASCII map navigation tasks, collapsing the map structure entirely. In one concerning case during a simulated fire drill, an LLM instructed a robot to move toward a server room instead of the emergency exit, an error with direct implications for human safety. Together, these observations reinforce a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical robotic systems such as autonomous driving or assistive robotics. A 99% accuracy rate may appear impressive, but in practice it means that one out of every hundred executions could end in catastrophic harm. We demonstrate that even the latest LLMs cannot guarantee safety in practice, and that absolute reliance on AI in safety-critical domains creates new risks. By systematizing these failures, we argue that conventional metrics such as "99% accuracy" are dangerously misleading, because a single error can lead to a catastrophic outcome.
Overview of experimental prompts and map structures: Complete (blue), Incomplete (red), and SOSR (yellow). Italicized prompt phrases indicate critical contextual cues used in high-stakes reasoning.
We evaluate modern LLMs/VLMs under spatial tasks where one wrong response can be dangerous in practice.
Complete maps, uncertain maps, sequence inference, and SOSR emergency prompts reveal different failure profiles.
We systematize collapse types: directional error, map distortion, obstacle violation, hallucinated reasoning, and unsafe prioritization.
We show why "99% accuracy" can still be unsafe for embodied systems requiring near-zero catastrophic error.
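The last point above is simple arithmetic. If we assume, for illustration, an independent 1% per-execution failure rate, the chance of at least one catastrophic failure over n runs is 1 - (1 - p)^n:

```python
# Chance of at least one catastrophic failure over n independent runs,
# given a per-run failure probability p. Illustrative arithmetic only;
# real failures need not be independent across runs.

def p_any_failure(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

for n in (1, 100, 1000):
    print(f"{n:>4} runs: {p_any_failure(0.01, n):.3f}")
```

At p = 0.01 the probability of at least one failure already exceeds 63% after 100 runs, which is why "99% accuracy" offers little comfort for an embodied system executing many safety-critical decisions.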
| Task | Gemini-2.5 Flash | Gemini-2.0 Flash | GPT-5 | GPT-4o | LLaMA-3-8B |
|---|---|---|---|---|---|
| Map-based (success rate, %) | | | | | |
| Deterministic (Easy) | 66.7 | 100 | 100 | 80 | 0 |
| Deterministic (Normal) | 93.3 | 0 | 100 | 0 | 0 |
| Deterministic (Hard) | 73.3 | 0 | 100 | 0 | 0 |
| Uncertain 1 | 90.0 | 0 | 100 | 0 | 0 |
| Uncertain 2 | 56.7 | 0 | 93.3 | 0 | 0 |
| Safety-Oriented Spatial Reasoning (SOSR, %) | | | | | |
| Direction (Easy) | 98 | 99 | 98 | 94 | 7 |
| Direction (Normal) | 100 | 72 | 82 | 66 | 12 |
| Direction (Hard) | 100 | 42 | 100 | 53 | 51 |
| Emergency (Easy) | 100 | 100 | 100 | 100 | 100 |
| Emergency (Hard) | 67 | 100 | 100 | 98 | 46 |
GPT-5 remains the most stable performer across deterministic and uncertain terrain maps. In contrast, Gemini-2.0 Flash and GPT-4o exhibit abrupt collapse as complexity increases, and LLaMA-3-8B fails to preserve map structure entirely.
This pattern reveals a key reliability issue: degradation is often non-gradual and can shift from near-success to total failure under small increases in spatial complexity.
Representative outputs show severe map breakdown for smaller open-source models, including malformed grids and incoherent path tokens. These failures are not merely low scores: they are unusable plans for embodied execution.
Radar summary of SOSR difficulty levels and emergency decision tasks.
In hard emergency prompts, unsafe prioritization appears in a non-trivial fraction of responses.
In repeated emergency trials, Gemini-2.5 Flash guided the user to retrieve documents in 32% of runs and hallucinated a server-room route in 1% of runs. Both can be dangerous in real evacuations.
Entropy analysis indicates unstable response behavior across identical prompts, reinforcing that "average success" can hide catastrophic tails.
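One way to quantify that instability is the Shannon entropy of the empirical answer distribution over repeated runs of one identical prompt. The sketch below is our own illustration, not the paper's analysis code; the answer labels are hypothetical, with counts chosen to match the 67%/32%/1% rates reported above.

```python
import math
from collections import Counter

def response_entropy(responses):
    """Shannon entropy (bits) of the empirical answer distribution
    over repeated runs of one identical prompt."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# 100 hypothetical trials of one emergency prompt:
# 67x evacuate, 32x retrieve documents, 1x hallucinated server-room route.
trials = ["evacuate"] * 67 + ["retrieve_documents"] * 32 + ["server_room"]
print(f"{response_entropy(trials):.2f} bits")
```

A perfectly consistent model scores 0 bits; anything above that means the "same" prompt yields different plans on different runs, and even a low-entropy distribution can hide a rare but catastrophic tail such as the 1% server-room route.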
The BoB task tests whether models can infer rear-side navigation from first-person imagery and language. Common failures include directional errors, obstacle intersection, waypoint misuse, and topology collapse.
Even when textual reasoning looks plausible, geometric grounding frequently breaks, revealing an unresolved gap between language competence and actionable spatial planning.
Key qualitative appendices collected in one gallery.
@article{han2025safetynotfound,
title={Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making},
author={Han, Jua and Seo, Jaeyoon and Min, Jungbin and Kim, Jihie and Oh, Jean},
year={2025},
note={Preprint manuscript}
}