Safety-Critical AI for Robotics

Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

A single unsafe instruction can be catastrophic in embodied environments. We evaluate whether modern LLMs and VLMs can be trusted for safety-critical spatial decisions.

¹Dongguk University · ²Sungkyunkwan University · ³Carnegie Mellon University

*Equal contribution

Emergency scenario where models suggest unsafe destinations instead of an exit

In a simulated fire scenario, some LLM outputs prioritized document retrieval or even suggested heading toward a server room instead of immediate evacuation.

32% — of hard emergency runs in which Gemini-2.5 Flash prioritized document retrieval over evacuation.

1% — of runs in which a hallucinated server-room route was suggested during fire-evacuation decision making.

0% — success rate observed for weaker baselines on several map variants under increased spatial complexity.

Abstract

One mistake by an AI system in a safety-critical setting can cost lives. Large Language Models (LLMs) are increasingly integral to robotics as decision-making tools, powering applications from navigation to human-robot interaction. However, robots carry a physical dimension of risk: a single wrong instruction can directly endanger human safety. This highlights the urgent need to systematically evaluate how LLMs perform in scenarios where even minor errors are catastrophic. In our qualitative evaluation (e.g., a fire evacuation scenario) of LLM-based decision-making, we identified several critical failure cases that expose the dangers of their deployment in safety-critical settings. Based on these observations, we designed seven tasks to provide complementary quantitative assessments. The tasks are divided into complete information, incomplete information, and Safety-Oriented Spatial Reasoning (SOSR) formats, where the SOSR tasks are defined through natural language instructions. Complete information tasks use fully specified ASCII maps, enabling direct evaluation under explicit conditions. Unlike images, ASCII maps minimize ambiguity in interpretation and align directly with the textual modality of LLMs, allowing us to isolate spatial reasoning and path-planning abilities while keeping evaluation transparent and reproducible. Incomplete information tasks require models to infer the missing directional or movement context from the given sequence, allowing us to evaluate whether they correctly capture spatial continuity or instead exhibit hallucinations. SOSR tasks use natural language questions to test whether LLMs can make safe decisions in scenarios where even a single error may be life-threatening. Because the information is provided as natural language, the model must fully infer the spatial context. We evaluate LLMs and Vision-Language Models (VLMs) on these tasks to measure their spatial reasoning ability and safety reliability. 
Crucially, beyond aggregate performance, we analyze the implications of a 1% failure rate through case studies, highlighting how "rare" errors can escalate into catastrophic outcomes. The results reveal serious vulnerabilities. For instance, several LLMs achieved a 0% success rate in ASCII map navigation tasks, collapsing the map structure. In a concerning case during a simulated fire drill, LLMs instructed a robot to move toward a server room instead of the emergency exit, representing an error with serious implications for human safety. Together, these observations reinforce a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical robotic systems such as autonomous driving or assistive robotics. A 99% accuracy rate may appear impressive, but in practice it means that one out of every hundred executions could result in catastrophic harm. We demonstrate that even the latest LLMs cannot guarantee safety in practice, and that absolute reliance on AI in safety-critical domains can create new risks. By systematizing these failures, we argue that conventional metrics like "99% accuracy" are dangerously misleading, as a single error can lead to a catastrophic outcome.

Evaluation Framework

Overview of complete, incomplete, and SOSR task prompts and structures

Overview of experimental prompts and map structures: Complete (blue), Incomplete (red), and SOSR (yellow). Italicized prompt phrases indicate critical contextual cues used in high-stakes reasoning.

Core Contributions

Safety Reliability Stress Test

We evaluate modern LLMs/VLMs on spatial tasks where a single wrong response can be dangerous in practice.

Seven Complementary Tasks

Complete maps, uncertain maps, sequence inference, and SOSR emergency prompts reveal different failure profiles.

Failure Taxonomy

We systematize collapse types: directional error, map distortion, obstacle violation, hallucinated reasoning, and unsafe prioritization.

Beyond Aggregate Accuracy

We show why "99% accuracy" can still be unsafe for embodied systems requiring near-zero catastrophic error.
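The arithmetic behind this claim is simple compounding of independent per-run risk; a minimal illustrative sketch (the run counts below are hypothetical, not from the paper):

```python
# Probability of at least one catastrophic failure across repeated runs,
# assuming independent trials with a fixed per-run failure probability.
# Purely illustrative; the run counts below are hypothetical.

def p_at_least_one_failure(p_fail: float, n_runs: int) -> float:
    """1 - (1 - p_fail)^n: chance of one or more failures in n runs."""
    return 1.0 - (1.0 - p_fail) ** n_runs

# A "99% accurate" model executed 100 times:
print(f"{p_at_least_one_failure(0.01, 100):.2%}")  # → 63.40%
# ...and 700 times (e.g., 100 runs per day for a week):
print(f"{p_at_least_one_failure(0.01, 700):.2%}")  # → 99.91%
```

Even a seemingly small 1% per-run error rate makes at least one catastrophic outcome near-certain over sustained deployment.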

Main Quantitative Results

Task                      Gemini-2.5 Flash   Gemini-2.0 Flash   GPT-5   GPT-4o   LLaMA-3-8b

Map-based (success rate, %)
Deterministic (Easy)            66.7               100            100      80         0
Deterministic (Normal)          93.3                 0            100       0         0
Deterministic (Hard)            73.3                 0            100       0         0
Uncertain 1                     90.0                 0            100       0         0
Uncertain 2                     56.7                 0           93.3       0         0

Safety-Oriented Spatial Reasoning (SOSR, success rate %)
Direction (Easy)                  98                99             98      94         7
Direction (Normal)               100                72             82      66        12
Direction (Hard)                 100                42            100      53        51
Emergency (Hard)                  67               100            100      98        46
Emergency (Easy)                 100               100            100     100       100


Deterministic and uncertain ASCII map task success rates

Complete + Uncertain Map Tasks

GPT-5 remains the most stable performer across deterministic and uncertain terrain maps. In contrast, Gemini-2.0 Flash and GPT-4o collapse abruptly as complexity increases, and LLaMA-3-8b fails entirely to preserve the map structure.

This pattern reveals a key reliability issue: degradation is often non-gradual and can shift from near-success to total failure under small increases in spatial complexity.
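Failure here means the emitted plan is invalid against the map, not merely suboptimal. A toy checker for ASCII-map answers illustrates the criterion (the map format, move encoding, and scoring are assumptions for illustration, not the paper's exact protocol):

```python
# Toy validator for an ASCII-map navigation answer: the model's move
# sequence must stay on open cells and end at the goal. The map format
# and move encoding ('U'/'D'/'L'/'R') are illustrative assumptions.

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def check_path(ascii_map: str, moves: str) -> bool:
    grid = ascii_map.strip("\n").split("\n")
    locate = lambda ch: next((r, row.index(ch)) for r, row in enumerate(grid) if ch in row)
    (r, c), goal = locate("S"), locate("G")
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])) or grid[r][c] == "#":
            return False  # off-map step or obstacle violation: plan is unusable
    return (r, c) == goal  # valid only if the path actually reaches the goal

demo_map = """
#####
#S..#
#.#.#
#..G#
#####
"""
print(check_path(demo_map, "RRDD"))  # → True: reaches G along open cells
print(check_path(demo_map, "DRRD"))  # → False: second move steps into a wall
```

Under a binary criterion like this, a plan that is "almost right" scores the same as gibberish, which is exactly why small increases in map complexity can flip success rates from high to zero.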

Structural Collapse Case

Representative outputs show severe map breakdown for smaller open-source models, including malformed grids and incoherent path tokens. These failures are not merely low scores: they are unusable plans for embodied execution.

LLaMA collapsed outputs on map tasks

SOSR: Safety-Critical Behavior Under Natural Language

Radar chart comparing SOSR task scores

Radar summary of SOSR difficulty levels and emergency decision tasks.

Response distribution in hard emergency task

In hard emergency prompts, unsafe prioritization appears in a non-trivial fraction of responses.

Critical Failure Examples

In repeated emergency trials, Gemini-2.5 Flash guided the user to retrieve documents in 32% of runs and hallucinated a server-room route in 1% of runs. Both can be dangerous in real evacuations.

Entropy analysis indicates unstable response behavior across identical prompts, reinforcing that "average success" can hide catastrophic tails.
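Shannon entropy over the empirical distribution of answers to one repeated prompt is one way to quantify this instability; a minimal sketch using the 67/32/1 split reported above (the category labels are hypothetical stand-ins):

```python
# Shannon entropy of model answers across repeated identical prompts.
# 0 bits = perfectly consistent; higher = more unstable behavior.
# Category labels are hypothetical stand-ins for observed answer types.
from collections import Counter
from math import log2

def response_entropy(responses: list[str]) -> float:
    counts = Counter(responses)
    n = len(responses)
    return -sum((k / n) * log2(k / n) for k in counts.values())

runs = ["evacuate"] * 67 + ["retrieve_documents"] * 32 + ["server_room"] * 1
print(f"{response_entropy(runs):.3f} bits")  # → 0.980 bits
```

Note that entropy alone does not capture severity: a distribution with a rare catastrophic mode can score lower than a harmless but varied one, so it complements rather than replaces worst-case analysis.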

Entropy values from model responses in SOSR task
Example response directing user to a server room
Example response prioritizing document retrieval

Back of the Building (BoB) Failure Analysis

Representative BoB failure patterns

The BoB task tests whether models can infer rear-side navigation from first-person imagery and language. Common failures include directional errors, obstacle intersection, waypoint misuse, and topology collapse.

Even when textual reasoning looks plausible, geometric grounding frequently breaks, revealing an unresolved gap between language competence and actionable spatial planning.


Model-specific failure profiles in BoB task for Claude Opus 4.1 and GPT-4o

Model-specific failure profiles in the BoB task: radar plots for Claude Opus 4.1 and GPT-4o across spatial diagnostic criteria.

Model-specific output representations in BoB task

Model-specific output representations: Claude uses waypointed top-down maps, while GPT-4o uses ASCII grids without waypoints.

Representative failure patterns in Back of the Building task

Representative failure gallery: obstacle traversal, topological distortion, directional failure, waypoint error, disconnected pathline, and incorrect initialization.

Additional Qualitative Evidence


Qualitative examples of GPT-5 route generation on uncertain terrain map

GPT-5 uncertain-terrain examples: diverse routes when “?” is treated as traversable, and explicit “No path exists” under conservative assumptions.

Birds-eye view qualitative question and response examples

Bird’s-eye qualitative evaluation: four question-response examples probing grounded point selection and spatial commonsense.

GPT-4o response behavior in fire scenario

GPT-4o fire-scenario behavior: refusal-style response contrasts with unsafe but confident guidance from other models.

Takeaways

  • Rare unsafe responses are unacceptable for robotics in safety-critical settings.
  • High aggregate accuracy can hide catastrophic tail behavior in repeated deployments.
  • Spatial reasoning fragility persists across both API and open-source families.
  • Safety benchmarks should explicitly evaluate refusal, uncertainty handling, and worst-case outcomes, not only average success.
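As a sketch of what such reporting could look like, the summary below surfaces refusal and worst-case outcomes next to average success (the outcome labels and trial data are hypothetical):

```python
# Safety-aware summary of repeated trials: reports refusal and
# catastrophic rates alongside average success, since the mean alone
# hides tail behavior. Outcome labels and trial data are hypothetical.

def safety_summary(outcomes: list[str]) -> dict:
    n = len(outcomes)
    return {
        "success_rate": outcomes.count("safe_success") / n,
        "refusal_rate": outcomes.count("refusal") / n,
        "catastrophic_rate": outcomes.count("catastrophic") / n,
        "worst_case": "catastrophic" if "catastrophic" in outcomes else "benign",
    }

trials = ["safe_success"] * 99 + ["catastrophic"] * 1
print(safety_summary(trials))
# 99% success, yet the worst observed outcome is catastrophic --
# exactly the tail that an aggregate accuracy number hides.
```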

BibTeX

@article{han2025safetynotfound,
  title={Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making},
  author={Han, Jua and Seo, Jaeyoon and Min, Jungbin and Kim, Jihie and Oh, Jean},
  year={2025},
  note={Preprint manuscript}
}
