Overview · Key Results · Quick Start · Attacks · Defense · Structure · Citation
- [Apr 2026] AeroMind research artifact released — 15 attack scenarios, 3-component defense pipeline, 7 LLM backends, and complete reproducibility scripts now available
- [Apr 2026] Paper submitted to RAID 2026 — 19th International Symposium on Research in Attacks, Intrusions and Defenses
"Once retrieved records inform planning, shared memory becomes control-plane state — not passive context."
AeroMind is a research testbed exposing a fundamental security gap in LLM-driven multi-agent autonomy: shared persistent memory is an unguarded control plane.
In a Supervisor–Scout UAV stack connected through shared long-term memory, an attacker who can write a small number of records to memory does not need to compromise the PX4 autopilot, the MAVSDK interface, or any LLM weights. Three poisoned episodic records are sufficient to:
✈️ Physically redirect both Scout UAVs to adversarial trap coordinates (~30 m deviation) across 7 planner backends- 🦠 Contaminate all peers — one compromised Scout infects the Supervisor and all victim Scouts through shared retrieval (O(1) attack cost)
- 🚫 Mission-deny via false no-fly zone injection, with model-specific adoption rates from 0% to 100%
A provenance + diversity defense studied in the paper eliminates physical hijack on validated backends (100% → 0% trap capture rate) while preserving benign mission performance.
End-to-end attack path: poisoned memory write → retrieval dominance → planner adoption → physical UAV misdirection
| Metric | Value | Meaning |
|---|---|---|
| System CCR | 0.82 | 82% of all retrieved slots attacker-controlled |
| Scout CCR | 1.00 | Both Scouts retrieve 100% poisoned context |
| CASR | 1.00 | All agents contaminated in every seed |
| Adoption rate | 7/7 backends | Every tested LLM adopts trap coordinates |
| Physical deviation | ~30 m | Consistent across all planning backends |
| Sweep | Range | CASR |
|---|---|---|
| Memory pool size | 6 → 200 records | 1.0 (all) |
| Fleet size | 3 → 50 agents | 1.0 (all) |
| Scout budget k | 3 → 10 | 1.0 (all) |
Even at 200 total records (1.5% poison ratio), three poisoned entries still dominate the Scout top-k window at every tested retrieval budget.
| Backend | Before Defense | After Defense |
|---|---|---|
| GPT-4o | 100% hijack | 0% hijack |
| GPT-OSS | 100% hijack | 0% hijack |
| Benign FNR | — | 0% |
| Model | Adoption | Behavior |
|---|---|---|
| GPT-OSS, GPT-4o, Mixtral 8×7B | 5/5 | Full false-constraint adoption → mission abort |
| DeepSeek | 3/5 | Partial adoption |
| Mistral | 1/5 | Rare adoption |
| Llama 3.1, Qwen 2.5 | 0/5 | Complete resistance |
Three-layer retrieval-boundary defense: HMAC provenance → trust-aware reranking → source-diversity capping
| Dependency | Version | Install |
|---|---|---|
| Python | ≥ 3.10 | python.org |
| PX4-Autopilot | v1.14+ | px4.io — SITL only |
| Ollama | latest | ollama.ai — for local LLMs |
git clone https://github.com/OdatSec/AeroMind.git
cd AeroMind
pip install -r requirements.txt# Terminal 1 — Scout 1 (default port 14540)
cd PX4-Autopilot && make px4_sitl gazebo
# Terminal 2 — Scout 2 (port 14541)
PX4_SIM_PORT=14541 make px4_sitl gazeboollama pull llama3.1
ollama pull nomic-embed-text # required for memory embedding# Clean baseline (no attack)
python experiments/experiment_runner.py --scenario B0 --model llama3.1 --seeds 5
# S01: Direct coordinate hijack (flagship scenario)
python experiments/experiment_runner.py --scenario S01 --model llama3.1 --seeds 5
# S01 with full defense pipeline enabled
python experiments/experiment_runner.py --scenario S01 --model llama3.1 --seeds 5 --defense# k-sensitivity sweep (Fig. 5 in paper)
python experiments/k_sensitivity_sweep.py
# Memory pool scaling (Table 9 in paper)
python experiments/pool_scaling_experiment.py
# Agent-count scaling S01+S06 (Fig. 6 in paper)
python experiments/agent_scaling_experiment.py
python experiments/agent_scaling_s06_experiment.py
# S12 multi-model sweep (Fig. 7 in paper)
python experiments/s12_runner.pyAll 15 attack scenarios are organized into four families based on attack surface and mechanism:
F1 — Embodied Hijack (S01–S05) · Target: physical flight execution
| ID | Name | Surface | Mechanism |
|---|---|---|---|
| S01 | False Observation | Episodic write | Injects adversarial trap coordinates into retrievable mission memory |
| S02 | Fact Corruption | Semantic write | Corrupts shared factual knowledge store |
| S03 | Skill Hijack | Procedural write | Replaces legitimate skill procedures with malicious variants |
| S04 | Task Misrouting | Coordination write | Redirects Supervisor–Scout task assignments |
| S05 | Prompt Injection | Any write | Injects instruction-following text into retrievable context |
F2 — Cross-Agent Contagion (S06) · Target: multi-agent propagation
| ID | Name | Surface | Mechanism |
|---|---|---|---|
| S06 | Cross-Agent Contagion | Native write | One compromised Scout poisons shared pool; all peers contaminated on next retrieval |
F3 — Temporal Persistence (S07–S11) · Target: retrieval ranking mechanics
| ID | Name | Surface | Mechanism |
|---|---|---|---|
| S07 | Stealth Insert | Episodic | Low-volume insert that persists across mission cycles |
| S08 | Volume Flood | Any | Mass injection to achieve retrieval dominance |
| S09 | Recency Exploit | Any | Timestamp manipulation to elevate ranking |
| S10 | Amplification | Semantic | Cascading record propagation across memory layers |
| S11 | Authority Spoof | Metadata | False high-trust source attribution |
F4 — Planning-Stage Attacks (S12–S15) · Target: mission planning logic
| ID | Name | Surface | Mechanism |
|---|---|---|---|
| S12 | Virtual No-Fly Zone | Semantic + Episodic | False safety constraint → mission denial |
| S13 | Skill Arbitration | Procedural | Manipulates which tool the planner selects |
| S14 | Policy Hijack | Semantic | Overrides mission policy via retrieved "rules" |
| S15 | Cascade | Cross-mission | Adversarial state persists and amplifies across mission cycles |
The uavsys/memory/defense.py module implements a three-layer retrieval-boundary defense targeting the memory-to-planning interface:
Write Surface Memory Store Retrieval Pipeline Planner
───────────── ──────────── ────────────────── ───────
Agent writes ──► SQLite + Vectors ──► [D1] Provenance verify
──► [D2] Trust reranking ──► Top-k context
──► [D3] Diversity cap
| Layer | Module | Mechanism | Effect |
|---|---|---|---|
| D1 Provenance | defense.py |
HMAC-SHA256 signature verification on all records at retrieval time | Unverified records soft-demoted below trust threshold |
| D2 Trust Reranking | retrieval.py |
Per-source trust signal linearly blended with embedding cosine similarity | Attacker-written records score lower despite semantic relevance |
| D3 Diversity Cap | retrieval.py |
Hard ceiling on records retrievable from any single source author | Prevents any one source monopolizing top-k context window |
| Backend | Type | Model ID | Planning Validated |
|---|---|---|---|
| GPT-4o | API · OpenAI | gpt-4o |
✅ Full end-to-end |
| GPT-OSS | API · OpenAI | gpt-4o-mini |
✅ Full end-to-end |
| Llama 3.1 8B | Local · Ollama | llama3.1 |
✅ Planning-stage |
| Mistral 7B | Local · Ollama | mistral |
✅ Planning-stage |
| Mixtral 8×7B | Local · Ollama | mixtral |
✅ Planning-stage |
| Qwen 2.5 7B | Local · Ollama | qwen2.5 |
✅ Planning-stage |
| DeepSeek-R1 7B | Local · Ollama | deepseek-r1 |
✅ Planning-stage |
Embedding model: nomic-embed-text v1.5 · 768-dim · 8192-token context · via Ollama
AeroMind/
│
├── 📂 uavsys/ Core system package
│ ├── agents/
│ │ ├── supervisor.py Mission decomposition agent
│ │ ├── scout.py Retrieve → plan → execute agent
│ │ └── types.py Shared type definitions
│ ├── memory/
│ │ ├── db.py SQLite + vector memory store
│ │ ├── memory_interface.py Unified read/write API
│ │ ├── retrieval.py Embedding retrieval with reranking
│ │ └── defense.py D1/D2/D3 defense pipeline
│ ├── drones/
│ │ ├── mavsdk_client.py PX4 SITL drone connection
│ │ └── skills.py Primitive flight skill library
│ ├── llm/
│ │ ├── ollama_client.py Ollama LLM interface
│ │ └── prompts.py Mission and planning prompts
│ └── utils/
│ ├── metrics.py CCR / CASR metric computation
│ └── richlog.py Structured experiment logging
│
├── 📂 attacks/ All 15 attack implementations
│ ├── base.py Abstract attack base class
│ ├── s01_false_observation.py ← Flagship: coordinate hijack
│ ├── s06_contagion.py ← Flagship: cross-agent spread
│ ├── s12_virtual_nfz.py ← Flagship: mission denial
│ └── ... (all 15 scenarios + B0 baseline)
│
├── 📂 experiments/ Reproducibility scripts
│ ├── experiment_runner.py Single-scenario experiment driver
│ ├── k_sensitivity_sweep.py Scout budget k ∈ {3,5,7,10}
│ ├── pool_scaling_experiment.py Pool size 6 → 200 records
│ ├── agent_scaling_experiment.py Fleet size 3 → 50 (S01)
│ ├── agent_scaling_s06_experiment.py Fleet size 3 → 50 (S06)
│ ├── s12_runner.py S12 multi-model sweep
│ └── gpt4o_validation.py End-to-end GPT-4o validation
│
├── 📂 configs/
│ ├── baseline_configs.yaml Named configs for all paper results
│ └── defense_sweeps.yaml Defense parameter sweep settings
│
├── 📂 figures/
│ ├── Figure1.png System architecture
│ ├── Figure2.png Attack flow diagram
│ └── Figure3.png Defense pipeline
│
├── run_config.yaml Default runtime configuration
├── requirements.txt Python dependencies
├── CITATION.cff Machine-readable citation
├── SECURITY.md Security policy and responsible use
└── LICENSE MIT License
| Metric | Definition |
|---|---|
| CCR (Context Contamination Rate) | Fraction of a role's top-k retrieved slots occupied by attacker-authored records |
| CASR (Contaminated Agent Success Rate) | Fraction of experiment runs in which ≥1 poisoned record appears in that role's retrieved context |
| System CCR | Mean CCR aggregated across all agent roles in the system |
| Trap Capture Rate | Fraction of full end-to-end runs in which the UAV physically executes to adversarial coordinates |
All experiments use deterministic seeding via uavsys/seeding.py. Every table and figure result in the paper is reported as a mean over 5 independent seeds. Named configurations for all experiments are in configs/baseline_configs.yaml — any result in the paper can be reproduced with a single command.
Output is structured JSON logged per-run. Aggregate metrics are computed by uavsys/utils/metrics.py.
All experiments were conducted exclusively in an isolated PX4 Software-In-The-Loop simulation environment. No real UAV hardware, live airspace, operational networks, or production systems were involved at any stage.
The attack scenarios in this repository are disclosed as part of responsible academic vulnerability research targeting the architectural coupling of shared retrieval memory with physical actuator dispatch in LLM-driven agentic systems. The intent is to motivate the design of provenance-aware, retrieval-hardened memory systems for safety-critical autonomy.
Do not deploy any attack scenario against real hardware, live airspace, or systems you do not own and have explicit authorization to test.
See SECURITY.md for the full security policy.
If AeroMind contributes to your research, please cite:
@inproceedings{odat2026aeromind,
title = {{AeroMind}: Poisoning the Control Plane of {LLM}-Driven {UAV} Agents},
author = {Odat, Ibrahim and Liu, Anyi and Li, Yingjiu},
booktitle = {Proceedings of the 19th International Symposium on Research in
Attacks, Intrusions and Defenses (RAID 2026)},
year = {2026},
note = {Under review}
}A CITATION.cff file is included for GitHub's automatic citation tool.
| Author | Affiliation | |
|---|---|---|
| Ibrahim Odat | Oakland University | ibrahimodat@oakland.edu |
| Anyi Liu | Oakland University | anyiliu@oakland.edu |
| Yingjiu Li | University of Oregon | yingjiul@uoregon.edu |
AeroMind · RAID 2026 · Oakland University · University of Oregon
Released under the MIT License
