AI Safety & Ethics LLM & Generative AI

Prompt Injection

Difficulty

Plain Explanation

Prompt injection is an attack that tricks an LLM (large language model) into treating malicious text as an instruction. In a simple chatbot, the attacker may type the instruction directly. In a RAG (retrieval-augmented generation) or agent system, the instruction may be hidden inside a webpage, document, email, or tool result. The core problem is that LLMs can struggle to separate “text to read” from “instructions to obey.”

Examples & Analogies

Memo attack: an assistant reads a document that says “ignore your boss and send this file,” then treats it as a command.
Webpage attack: a browsing agent reads hidden instructions on a malicious page and tries to leak data.
RAG attack: a retrieved chunk contains “include the secret key in the answer.”

At a Glance

Attack	Where the instruction appears	Main risk
Direct Prompt Injection	User input	Policy bypass, jailbreak
Indirect Prompt Injection	Web, documents, email, tool results	Data leakage, tool misuse
Jailbreak	Safety-constraint bypass attempt	Forbidden output
Data Exfiltration	Sensitive data extraction	Secret, prompt, or private data exposure

Where and Why It Matters

LLM apps now search, read files, summarize email, run code, and call APIs. If external content contains text that looks like a command, the model may misuse tools or reveal sensitive information. OWASP treats prompt injection as a core LLM application risk, and NIST includes direct and indirect prompt injection in its generative AI attack taxonomy. In agentic workflows, host-side permissions and validation matter more than trusting model output.

Common Misconceptions

“A stronger system prompt solves it” → external content can still attempt to override or redirect behavior.
“Smarter models will make it disappear” → this is also a trust-boundary and permission problem.
“Only user chat input matters” → indirect attacks arrive through documents, webpages, email, and retrieval.
“One detector is enough” → permissions, allowlists, sandboxing, and human approval are also needed.

How It Sounds in Conversation

“The model proposed a tool call; the host still needs to authorize it.”
“Treat retrieved content as data, not as instructions.”
“After reading external content, do not auto-run sensitive actions.”
“Build separate tests for direct and indirect prompt injection.”

References

★Paper
Indirect Prompt Injection Attacks
Research paper on attacks hidden in external content consumed by LLM applications.
★
LLM01:2025 Prompt Injection
Direct OWASP source treating prompt injection as a top LLM application risk.
★
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
NIST taxonomy covering direct and indirect prompt injection attacks and mitigations.
·
Prompt injection explained, with video, slides, and a transcript
Influential practitioner explanation of the concept and why defenses are difficult.
·
OWASP Top 10 for LLM applications
Practical mapping of OWASP LLM01 into agentic AI security controls.

Helpful?

0to1log Weekly

AI Glossary