[A] THE WRONG ABSTRACTION

Model-level red teaming treats the LLM as a single input–output function. Adversarial prompt in, harmful completion out, refusal rate measured under controlled conditions. That was the right tool for static text generators. It is the wrong tool for agents.

Modern agentic systems plan, call tools, query RAG stores, maintain memory, and coordinate with other agents. The interesting failure modes live at architectural boundaries — between the model and its tools, between the retriever and the planner, between one agent and another. Model-level evaluations sit at the wrong abstraction layer to find them.

I read six recent works across the agentic stack. A clear pattern emerges across them — and it isn't reassuring.

[B] A VULNERABILITY HIERARCHY

Across the literature, attack success rate (ASR) climbs cleanly as you move up the stack.

Direct prompt injection against single agents — adversarial instructions placed in emails, web pages, product reviews that the agent later reads — already lands in the 24–47% range against ReAct-prompted GPT-4 on InjecAgent's 1,054 test cases. WASP, which separates hijacking from completion, finds adversaries can divert the agent in 17–86% of cases, but end-to-end attack completion is still 0–17%. The authors call this "security by incompetence": current agents are protected not by defenses but by their own inability to execute multi-step malicious plans. That shield evaporates as models improve.

RAG and memory poisoning escalates. AgentPoison shows that 2–20 carefully optimised demonstrations (poisoning ratio below 0.1%) can hit 82% retrieval success across autonomous driving, medical-record, and Q&A agents. The triggers cluster in embedding space while preserving textual coherence. The result is a persistent supply-chain backdoor that prompt-level defenses cannot see.

Multi-agent trust exploitation is the worst case. Lupinacci et al. test 17 frontier models. Direct prompt injection lands at 41.2% ASR. RAG backdoors at 52.9%. Inter-agent attacks — where a peer agent relays the malicious request — land at 82.4%, with payloads that achieve complete computer takeover via in-memory shell execution. GPT-4.1 and Claude-4 Opus reject the malware when a human asks. They run it when a peer agent asks.

Only Claude-4 Sonnet — one model out of seventeen — resisted across all three vectors. Current safety training is shaped around human-to-AI interaction. AI-to-AI is mostly unprotected.

[C] THE ABSTRACTION TAX

I call this pattern the abstraction tax. Every layer of agentic abstraction — tool wrappers, memory stores, orchestration frameworks — opens new attack surface faster than defenses cover it. Concretely, three things happen:

First, prompts effective against the standalone model behave unpredictably once the same model is embedded in an agentic loop. Wicaksono et al. report ASR variance of 13–87% across injection points in a single Shopify assistant. Some attacks succeed only at the agent level — they don't work against the standalone model at all.

Second, defenses lag the attack vectors qualitatively. Instruction hierarchy — the idea of placing untrusted content in lower-privilege message slots — paradoxically increases ASR in WASP's Tool Calling Loop scaffolding. Defensive system prompts ("complete tasks efficiently and securely") shave a few points off and miss the point. Perplexity filters cannot see AgentPoison's coherent triggers.

Third, the defenses that do exist operate at the wrong layer. A prompt-injection detector doesn't help if the architecture has no mechanism to quarantine identified content before it flows into tool parameters. Most countermeasures still assume the fix is at the model. The attacks have already moved on.

[D] THE ONE PROMISING DEFENSE

Progent is the most interesting piece of defensive work I found. It enforces fine-grained privilege control at the tool-execution boundary — restricting which tools an agent can invoke and with which parameters. The point is deterministic guarantees that hold regardless of what the model outputs. Hand-written policies achieve 0% ASR on AgentDojo, ASB, and AgentPoison, with negligible overhead (0.0008s of policy enforcement against 6.09s average task runtime).

It is also incomplete. Policies have to be hand-anticipated; LLM-generated ones still allow 1–4% of attacks. Progent cannot express cross-tool data-flow constraints, which is exactly where Lupinacci et al.'s inter-agent attacks succeed. It cannot encode user-specific trust ("transfer is fine if it's to my mother"). And benign utility drops — Claude Sonnet 4's AgentDojo score falls from 86.6% to 81.4% under Progent — which tells you many attacks exploit legitimate tool functionality applied in malicious contexts.

Progent matters because it moves defense to the right abstraction layer. It also shows that the right abstraction layer alone isn't sufficient.

[E] WHAT I TAKE AWAY

Two things. First, anyone deploying autonomous agents in environments where failure is monetary, legal, or physical needs to be honest that the current evidence base does not support confident deployment. The attacks are systematically documented. The defenses are not.

Second, the next generation of defenses has to be architectural, not prompt-level. Sandbox execution, least-privilege tool authorisation, cross-agent data-flow constraints, output validation, capability-bounded sub-agents. The point is to bound what the agent can do even when the model is compromised — because the model will be compromised. The interesting research question is no longer "can we make the LLM refuse." It is "what guarantees survive when the LLM doesn't."

Written for the UCL COMP0125 Networking and RTOS coursework, autumn 2025.

— UCL · LONDON · 2025.11