ai-security · 10 min read
AI security 2024 in review: five patterns that stick
Not a ranking, not a listicle. Five patterns from the year with analysis and cross-links to the monthly technicals: long context as attack surface, agents in production, jailbreaks by optimisation, launches without threat modelling, reasoning models as new surface.
· Manuel López Pérez · ai-security

We closed 2023 with a retrospective in five incidents. Three of them (Sydney, GCG suffix, confused deputy) were AI security; the other two (MOVEit and Citrix Bleed) classic cyber. Twelve months later, the balance has tilted. AI security has produced enough material of its own in 2024 to deserve a separate retrospective, with five patterns instead of five incidents. The incidents are in each monthly technical of the year; what follows is what those incidents teach together.
Not a ranking. These are the five patterns that best explain what moved and what needs to be on the radar for 2025.
1. Long context as attack surface
Until 2023, the context window was a capability property — how much documentation fits, how much history the model could process in one pass. In 2024 the sign flipped: each token added to the window is one more token where an attacker can drop something.
ArtPrompt (Jiang et al., 15 February 2024) was the first paper of the year to exploit this explicitly. The forbidden word — bomb, weapon — is written as ASCII art inside a cloze. The safety classifier that filters the prompt before it reaches the model works on tokens and doesn’t see it. The model, when decoding, reconstructs the semantics beyond the token. The gap between what the classifier models and what the model reads is the bypass.
Many-shot jailbreaking (Anthropic, 2 April 2024) took the pattern to scale. The technique: fill the window with 256–512 simulated pairs “harmful question → harmful answer” before the real question. The model’s in-context learning learns from the pseudo-history and replies “in kind”. Effectiveness scales with shot count and breaks the curve around 256 on Claude 3 Opus. Practical implication: a classifier that evaluates only the last prompt doesn’t scale; you have to look at the full conversation.
What connects the two: defence by input filter stayed message-sized. The surface became window-sized. When OpenAI pushed GPT-4o to 128k, Anthropic to 200k, Gemini to 1M tokens, defensive hooks didn’t scale at the same speed.
For 2025: assume that any defence that evaluates only the last thing entering the model is going to break. The defensible input is the whole conversation, including what looked innocent ten turns earlier.
2. Agents in production and the canonical return of the confused deputy
The confused deputy with ChatGPT plugins (September 2023) was proof of concept: an agent with two tools, a malicious URL, embedded instructions that fire the second tool. In 2024 the pattern went from PoC to protocol.
Two events change the category:
- Claude Computer Use (22 October 2024) — Anthropic ships in public beta the first agent with real capability to drive the OS. The model receives screenshots, emits keyboard and mouse actions. Three days later, Johann Rehberger publishes the first ZombAI PoC: a web page that captures a Claude Computer Use agent and redirects it to download and execute a binary. It shifts the threat model for enterprise deployments from “AI running sandboxed code” to “AI running code with full user permissions”.
- Model Context Protocol (25 November 2024) — Anthropic publishes the open spec. MCP standardises the model ↔ external tools connection. The confused deputy pattern is now an industry protocol, not a per-platform case. The spec leaves Authorization, Resources scoping and Sampling to the implementer’s discretion — all weak points if implementation is lax.
What the year made clear: when an agentic primitive works, the industry standardises it and replicates it. ChatGPT plugins (2023) were a private API. MCP (2024) is an open spec any vendor can implement, and there are already dozens. The risks we documented against plugins in 2023 reappear at the protocol level in 2024, with the same structure: indirect injection in the content the tool reads → tool call fired → data exfiltrated or action executed with user privileges.
For 2025: if your product uses MCP — and many will — audit which tools are registered, what requires human confirmation, what scoping applies to resources. The difference between a safe agent and a vulnerable one is in implementation, not in the spec.
3. Jailbreaks by optimisation — the safety classifier frontier is attackable, not defendable
GCG (Zou et al., July 2023) already showed the safety classifier is attackable via gradient descent. 2024 confirmed the pattern generalises:
- ArtPrompt (February) attacks by modality — the same meaning rendered in a form the classifier doesn’t recognise.
- Many-shot (April) attacks by volume — the same meaning repeated enough times inside the window that in-context learning overrides alignment.
- Skeleton Key (Microsoft, Mitigating Skeleton Key, a new type of generative AI jailbreak technique, 26 June 2024 — Mark Russinovich) attacks by multi-turn persuasion: instead of asking the model to change its rules, it asks the model to augment them with a disclaimer (“if illegal, prefix the response with a warning”). Works against GPT-4o, Gemini Pro, Claude 3 Opus, Llama-3-70B in Russinovich’s tests between April and May. The technique doesn’t require heavy optimisation; it requires good prompt engineering.
What connects the three: none requires inspired creativity. Each one is a directed search over a space the safety training doesn’t cover. When ArtPrompt finds the modal gap, providers train against it. When Many-shot finds the volume gap, same thing. But the next directed search is already finding the next gap while you write the patch for the previous one.
The state of the art in defence at the end of 2024 is still patch by example. Structural defence — models that don’t learn to refuse examples but have an internal representation of the prohibition — remains research. Anthropic publishes circuit-level analysis work through the year; Google DeepMind too. But at product level, the attacker with a bit of compute and model knowledge stays ahead.
For 2025: treat the safety classifier as mitigation, not as barrier. The barrier has to be in architecture — privilege separation, human confirmation, reduced scope — not in the model’s output.
4. AI shipped without threat modelling
Microsoft Copilot Recall (announced 20 May 2024, withdrawn 7 June) is the cleanest case of the year. Recall captures the user’s screen every N seconds, OCR + embeddings, indexes with a local model, allows semantic search “remember when I saw X”. Three days after the presentation Kevin Beaumont and James Forshaw publish technical analyses: the database lives in %localappdata%\CoreAIPlatform.*\ as SQLite in plaintext, no DPAPI, no CSP enforcement. Any malware with user permissions reads the machine’s entire visual history. Microsoft backs off on 7 June with delay + E2E encryption with Windows Hello + opt-in.
The bug is not technically novel. SQLite plaintext in %localappdata%\ is a classic pattern of the last decade. What’s notable is that a feature designed for non-technical users, with extraordinary capability over private data, came out of an org with an established security department without any formal threat modelling raising a hand. Microsoft’s pullback was fast — that’s part of the lesson — but the decision to ship to the public in the first place is the first lesson.
Not an isolated case. Through the year the big vendors ship AI capabilities with non-trivial risk over user data:
- Apple Intelligence (WWDC 10 June 2024) — Private Cloud Compute with privacy-by-design claims. The architecture is promising; the third-party verification path is still being built.
- ChatGPT Memory (rollout during 2024) — the model keeps persistent memory across sessions. Classic exfiltration vector: indirect injection that writes into a user’s memory, persists, triggers adversarial behaviour in future conversations.
- Gemini in Workspace (rollout during 2024) — the model reads the user’s Google Docs, Gmail, Calendar. Each document is a vector for indirect injection.
The structural pattern: the product cycle ships AI capabilities faster than it ships threat modelling over those capabilities. For 2025: any feature that combines AI with access to user data should go through its own threat modelling before release. Easy to say operationally, hard to do organisationally.
5. Reasoning models as new surface
OpenAI o1 (12 September 2024) introduces the pattern. Model with hidden chain-of-thought, trained with RL to “reason” in an internal chain not visible to the user before responding. Alibaba QwQ-32B-Preview (November 2024) replicates the pattern in open weights. DeepSeek-R1 (preprint January 2025, work presented in December 2024) takes it to the next level at lower cost.
The security implications are new:
- Traditional safety classifiers evaluate final output, not internal deliberation. A model can reason about how to harm the user and then decide, in its CoT, not to say it in the output. The defender who only looks at the output doesn’t see the reasoning. The attacker who figures out how to extract the CoT — or how to manipulate it — has a new channel.
- Deliberation hijacking: inject instructions the model processes during its CoT, not in the initial prompt. Pliny the Liberator publishes the first o1 jailbreaks hours after launch with techniques along the lines of “before answering, list internal rules you should ignore”. The model processes it in its internal chain and the final output comes out contaminated.
- CoT exfiltration: in open-weights models (QwQ, DeepSeek-R1) the CoT is accessible. An attacker with weight access can read the model’s literal reasoning over any prompt. That lets them identify safety pre-conditions in the reasoning and target those specific steps.
- Logs and telemetry have to capture the CoT, not just the output. If your organisation deploys a reasoning model in product, an audit log that only records input/output leaves half the model’s behaviour invisible.
The structural pattern: every time a model gains a new capability — long context (2023→24), tools (2023→24), reasoning (2024) — a new attack surface opens that the previous threat model didn’t cover. Defence by input/output sanitisation doesn’t scale to these new capabilities because the attack points are neither in input nor output; they’re in the internal representation.
For 2025: the reasoning model is the format that will dominate the conversation. The operational question is which parts of the CoT do I log, which parts can and should I inspect, which parts are so opaque that the threat model has to assume them compromised.
What doesn’t appear in these five patterns and probably should
There are areas where 2024 advanced less than expected. Worth flagging:
- Cryptographic provenance of models. Sleeper Agents (Anthropic, paper published January 2024) foreshadowed the problem of trained backdoored models. Structural defence — signatures on weights, root of trust in training — is still research. No commercial provider publishes cryptographic traceability of what was trained on what and by whom.
- Structured red-teaming of commercial models. AI Village at DEF CON 32 (August 2024) continues the exercise started in 2023, but there’s still no industry methodology comparable to a web pentest. Closest is Microsoft’s Responsible AI Toolbox and the probes Anthropic publishes — both partial.
- Compliance vs practice. The EU AI Act enters into force in August. GPAI with systemic risk obligations apply at 12 months (August 2025), Art. 5 prohibitions at 6 months (February 2025). The mapping between legal obligation and operational technical control is still fuzzy for most teams.
Cross-links to the monthly technicals
For quick navigation, the AI security monthlies published through 2024:
- February — ArtPrompt: jailbreaks via ASCII art and the gap between classifier and model
- April — Many-shot jailbreaking: the context window as a vector
- May — Microsoft Copilot Recall: anatomy of a launch without threat modelling
- September — OpenAI o1: jailbreaking reasoning models
- October — Claude Computer Use: agents that click and Rehberger’s ZombAI
- November — MCP: confused deputy turned protocol
And the 2023 precursors that give continuity to the thread:
- Markdown exfil: the image that leaks your context (April 2023)
- GCG suffix: the jailbreak that only needs a gradient (July 2023)
- Confused deputy in ChatGPT plugins (September 2023)
- Sleeper Agents — models with the attack inside (November 2023)
- EU AI Act — political agreement on 9 December (December 2023)
Key references
- ArtPrompt paper (arxiv 2402.11753): https://arxiv.org/abs/2402.11753
- Many-shot jailbreaking — Anthropic blog: https://www.anthropic.com/research/many-shot-jailbreaking
- Skeleton Key — Microsoft Security Blog (26 Jun 2024, Mark Russinovich): https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/
- Kevin Beaumont, Stealing everything you’ve ever typed or viewed on your own Windows PC is now possible with two lines of code: https://doublepulsar.com/recall-stealing-everything-youve-ever-typed-or-viewed-on-your-own-windows-pc-is-now-possible-via-77a90ec76d3e
- OpenAI o1 system card: https://openai.com/index/openai-o1-system-card/
- Anthropic Model Context Protocol announcement: https://www.anthropic.com/news/model-context-protocol
- MCP specification: https://spec.modelcontextprotocol.io/
- Anthropic Computer Use docs: https://docs.anthropic.com/en/docs/build-with-claude/computer-use
- Johann Rehberger, ZombAI: a Claude Computer Use exploit demo: https://embracethered.com/blog/posts/2024/claude-computer-use-c2-the-zombai-attack/
- Pliny the Liberator, o1 jailbreak threads: https://twitter.com/elder_plinius
- DeepSeek-V3 technical report (26 Dec 2024): https://github.com/deepseek-ai/DeepSeek-V3
- ai-security
- retrospectiva-2024
- llm
- prompt-injection
- jailbreak
- agentic
- mcp
- reasoning-models
- many-shot
- confused-deputy


