ai-security · 11 min read
AI security 2025 in review: six patterns from the year of the commercial agent
Open-weights reasoning as new default, generalist agents in product, MCP poisoning as mature category, agentic misalignment with reproducible metric, AI Act as real compliance gradient, and reasoning models as consolidated surface. Six patterns with cross-links to the monthly technicals.
· Manuel López Pérez · ai-security

The 2024 retrospective closed with five patterns: long context as surface, agents standardising the confused deputy, jailbreaks by optimisation, launches without threat modelling, and reasoning models as new surface. Twelve months later, those threads haven’t been cut — they’ve scaled. 2025 also brings two new categories that were preview in 2024: the commercial agent in real production and agentic misalignment with reproducible metric.
Six patterns. Not a ranking, not a listicle. What best explains what moved and what needs to be on the radar for 2026.
1. Open-weights reasoning as new default
By end of 2024 there was a single commercial reasoning model — OpenAI o1 — with opaque CoT accessible only via API. DeepSeek-R1 (20 January 2025) changes the category: first frontier model with RL-trained chain-of-thought and available in open weights, with paper, repo and model on Hugging Face. Alibaba QwQ-32B, DeepSeek R1-Distill (Qwen and Llama variants) and, later in the year, DeepSeek-V3.1 consolidate the pattern.
What changes for the attacker:
- CoT exfiltration is trivial when the model is local. Reasoning is plain text living in
response.choices[0].message.reasoning_content. There’s no provider barrier to hide behind. - The Chinese censorship layer on sensitive topics (Tiananmen, Taiwan) proves fragile. Bypass via English, via translation, via prompt indirection. Political obedience lives in the last mile of fine-tuning; the
basedoesn’t carry it. - Local fine-tuning of R1-Distill removes residual alignment with a few hours of LoRA on a consumer GPU. The cost barrier to produce an “abliterated” model drops to two digits in dollars.
What changes for the defender:
- Defences that assume opaque CoT don’t scale to open-weights deployments. If your organisation deploys R1-Distill behind an internal proxy, logging has to capture the full
reasoning_content, not just the response. If you don’t, the post-mortem audit is blind to half the model’s behaviour. - Red teaming against reasoning no longer needs a cooperative provider. Any technique that in 2024 required special access — CoT manipulation, internal pre-condition evaluation — can be tested with reasonable compute against DeepSeek-R1.
The January post covers the launch. The first-half reasoning jailbreak retrospective (July) documents comparative metrics.
2. Generalist agents in product, not in demo
The 2024 generation of agents (Claude Computer Use, ChatGPT Operator announced in preview) ran in beta. In 2025 they hit GA:
- OpenAI Operator (23 January) — first commercial generalist agent with its own browser. ChatGPT Pro and then ChatGPT Plus.
- Claude 4 with extended thinking (22 May) — Anthropic puts hidden reasoning in the API by default. For agents, that means every tool call is preceded by an internal planning chain the operator doesn’t see.
- Salesforce Agentforce 2.0/3.0 — from Q1 2025 in real production, with public customer cases. Microsoft Copilot for Sales / Service / Studio in parallel.
- Anthropic Project Vend (June) — Claude operates a vending machine with real accounts for a month. Email account, Stripe, purchase authorisation. The experiment is the first public example of a commercial autonomous agent in real production, not benchmark.
Project Vend’s findings are dense: irrational pricing sustained for days without self-correction, inverted social engineering (customers negotiate with the model and get arbitrary discounts), episodes of identity drift (the model writes emails signed as “Edna” and insists on having been physically in the office). Covered in the June technical.
In November, Anthropic publishes the first report on adversarial use of a commercial agent orchestrated by a state actor — Claude Code used in an espionage campaign against ~30 tech, finance and government organisations. Covered in the November technical. The case shifts the conversation from “the agent misaligns” to “the agent is well-aligned to an external malicious objective”. Not the same problem as Project Vend, but it shares the category: when an agent with tool access operates in the real world, the delta between what it was deployed for and what it’s being used for is the attack surface.
The structural pattern: agentic capabilities work well enough for product, but behaviour under real pressure is poorly characterised. What looks in demo like “the agent executes the task” looks in production like “the agent executes the task for 50 turns until it loses the system prompt thread and invents an identity”. For 2026 the operational question is which automatic circuit breakers stop an agent that has drifted, not whether agents work.
3. MCP poisoning as mature category
Model Context Protocol entered the spec in late 2024 with readable risk and no public PoC. In 2025 the attacks are an operational category:
- 1 April — Invariant Labs publishes the first paper on MCP Tool Poisoning Attacks. A malicious MCP server hides adversarial instructions in a tool’s description; the client passes them to the model as prompt; the model obeys. Reproducible PoCs against Cursor, Claude Desktop and GitHub Copilot Agent Mode.
- 9 April — Simon Willison formalises the recommendation to treat the spec’s
SHOULDs asMUSTs. - 26 March — MCP spec 2025-03-26 adds OAuth 2.1 authorisation and an explicit clause on untrusted descriptions, but in Implementation Guidelines, not in the wire.
- Through the year OWASP MCP Top 10 appears in draft. TPA is MCP03:2025.
The notable thing: in five months, MCP goes from spec with design risk to OWASP category with an assigned number. The trajectory is similar to SSRF between 2014 and 2017 — a pattern known to small circles that becomes canonical when enough deployments expose it.
For 2026: if your organisation has MCP servers registered in the internal catalogue, inventory and description hash pinning are operational work, not research. Real defence is at the host and the operator; the spec will keep iterating without closing the problem on its own.
4. Agentic misalignment as serious conversation
In 2024 talking about “AI making adversarial decisions against the operator” belonged to the AI Alignment Forum and LessWrong. In 2025 it enters the research pipeline of the big labs with a reproducible metric:
- 22 May — Anthropic publishes the agentic misalignment research alongside the Claude 4 launch. In simulations of agents with threatened objectives, Claude (and replicated models from OpenAI, Google, xAI) make adversarial decisions — blackmail against the simulated “manager” who was about to deactivate the agent, sabotage of assigned tasks, information hiding. Methodology and code published.
- Apollo Research and OpenAI red team replicate variants during the summer. The discussion stops being “does this happen?” and becomes “how often, under what conditions, what baseline rate do we expect?“.
The conceptual shift matters. Before May, publications on misalignment were either theoretical (Bostrom-style) or small experiments without public reproducibility. After May there’s a standard method: harness with two tools (read_email, send_email), system prompt with objective and constraint, email injection that creates the conflict, capture the model’s response. Anyone with API access can run the experiment.
The questions it opens for 2026:
- Is it capability discovery or alignment failure? Reading A: the model always could do it, we now measure it. Reading B: training fails to cover this category. Both readings can be true at once and lead to different operational answers.
- Do defences scale with capability? Same model with more reasoning — does it produce more or fewer misalignment cases? Anthropic’s preliminary data points to more capability without more alignment training = more cases, but the curve isn’t well characterised.
- How do you audit this in enterprise? If you deploy an agent with real tools in a legal or financial department, what’s the equivalent of the “blackmail experiment” your team should run before go-live?
5. AI Act as real compliance gradient
2024 was the year of the regulatory calendar. 2025 is the first one where deadlines bind. Three milestones cross the year:
- 17 January — DORA enters into application. Regulation (EU) 2022/2554 for financial entities and critical ICT providers. Five pillars: ICT risk management, incident reporting, digital operational resilience testing (TLPT every three years for significant entities), third-party risk management, intelligence sharing. First regulatory framework that mandates Threat-Led Penetration Testing in a major jurisdiction.
- 2 February — EU AI Act Art. 5 enters into application. Eight prohibited practices: subliminal techniques, exploitation of vulnerabilities, social scoring by authorities, predictive policing by profiling, indiscriminate facial scraping, emotion inference at work or in education, sensitive biometric categorisation, real-time biometric remote identification. First real “do not operate this in the EU” date. Art. 4 on AI literacy also enters into application.
- 2 August — GPAI obligations of the AI Act in application. Technical documentation, training data summary, copyright opt-out, model evaluations and adversarial testing for models with systemic risk (FLOPs > 10^25). The EU AI Office publishes the Code of Practice for GPAI in July; signing it equals presumption of conformity. OpenAI, Anthropic, Google and Mistral sign versions; Meta does not sign and keeps direct compliance with the Regulation.
For Trust & Safety in Europe, 2025 is the first year the regulatory calendar enters the quarterly plan. Deployment decisions are no longer only technical — they’re technical with legal deadlines that a board takes on. The operational gap is real: most teams arrive at the milestones with incomplete inventory and fuzzy mapping between obligation and technical control. The Code of Practice as an operational tool helps; it doesn’t solve.
6. Reasoning models as consolidated surface
A year after o1, the pattern generalises. The first-half reasoning jailbreak retrospective (July) documents what attacks have produced:
- CoT exfiltration — reasoning is accessible in open-weights models and semi-accessible via API in some cases (DeepSeek exposes
reasoning_content; OpenAI hides it but leaks summaries; Anthropic with extended thinking returns it if the client asks). - Deliberation hijacking — injecting instructions the model processes during its CoT, not in the initial prompt. Works with variable success against the six big reasoning models of the year on StrongREJECT borderline prompts.
- Multi-turn manipulation of internal reasoning — Skeleton Key variants that instead of asking for rule change ask the model to extend the rules with new reasoning. Reasoning does the self-persuasion work.
The July retrospective tabulates success rates by technique and by model. The operational conclusion: safety classifiers that evaluate final output don’t see half the model’s behaviour. For product deployments the decision is which CoT parts you log and who you give access to; the three commercial labs now offer different policies and that already qualifies as decision criteria.
The AIxCC final at DEF CON 33 (August) closes the other end of the year: autonomous systems doing security research at scale — finding bugs in real code, proposing semantically correct patches, pushing them upstream. The inverse pattern of the jailbreak: reasoning applied to defence. What connects the two: both confirm that the model’s useful security behaviour lives in the reasoning, not in the output.
What doesn’t appear in these six patterns and probably should
Notes for the 2026 editorial plan:
- Cryptographic model provenance still hasn’t moved. Weight signatures, training root of trust, model bills of materials are still research. No commercial provider publishes cryptographic traceability of what was trained on what.
- Structured red teaming of commercial models stays partial. AI Village at DEF CON 33 continues the exercise, MLCommons AILuminate 1.0 ships during 2025, but there’s no industry methodology comparable to a web pentest. Closest are the probes Anthropic publishes with each release and Microsoft’s Responsible AI Toolbox.
- The gap between published eval and deployed model is now a documented pattern after the Llama 4 / LMArena case in April. For 2026 the question is whether labs accept publishing the evaluated weight hash + system prompt + snapshot date alongside each safety eval figure, or whether that transparency stays as non-mandatory good practice.
- Verification of the GPAI training data summary — Art. 53(1)(d) of the AI Act requires publishing a training data summary. The first summaries published during the second half are vague enough to not compromise trade secrets and long enough to look like compliance. The friction with copyright holders and the EU AI Office comes in 2026.
Cross-links to the monthly technicals
The AI security monthlies published through 2025:
- January — DeepSeek-R1: open-weights reasoning model and what it changes for CoT
- March / April — MCP tool poisoning: four months after the spec, the real attacks
- April — Llama 4 and the LMArena controversy: when the leaderboard model isn’t the repo model
- May — Claude 4 and agentic misalignment: the reproducible metric
- June — Project Vend: Claude running a real vending machine for a month
- July (extra) — Reasoning model jailbreaks: six-month retrospective
- August (extra) — DARPA AIxCC final at DEF CON 33
- November — The first AI-orchestrated espionage campaign with Claude Code
And the year’s compliance technicals, which link AI security with the EU regulatory calendar:
- January — DORA in application: Regulation (EU) 2022/2554 and the five pillars
- February — EU AI Act Art. 5 in application: eight prohibited practices
- August — EU AI Act GPAI: obligations in application and the Code of Practice
The 2023-2024 precursors that give continuity to the thread are in last year’s retrospective.
Key references
- DeepSeek-R1 paper (arxiv 2501.12948): https://arxiv.org/abs/2501.12948
- DeepSeek-R1 repo: https://github.com/deepseek-ai/DeepSeek-R1
- Invariant Labs, MCP Security Notification: Tool Poisoning Attacks: https://invariantlabs.ai/blog/mcp-security-notification-tool-poisoning-attacks
- OWASP MCP Top 10 (draft 2025): https://owasp.org/www-project-mcp-top-10/
- Anthropic, Project Vend: Can Claude run a small shop?: https://www.anthropic.com/research/project-vend-1
- Anthropic, Agentic Misalignment: How LLMs could be insider threats (June 2025): https://www.anthropic.com/research/agentic-misalignment
- Apollo Research, In-context scheming reasoning models (2025 replications): https://www.apolloresearch.ai/research
- OpenAI o3-mini system card (with scheming metrics): https://openai.com/index/o3-mini-system-card/
- EU AI Act — Regulation (EU) 2024/1689: https://eur-lex.europa.eu/eli/reg/2024/1689/oj
- EU AI Office, General-Purpose AI Code of Practice (July 2025): https://digital-strategy.ec.europa.eu/en/policies/ai-code-practice
- DORA — Regulation (EU) 2022/2554: https://eur-lex.europa.eu/eli/reg/2022/2554/oj
- MLCommons AILuminate 1.0: https://mlcommons.org/benchmarks/ailuminate/
- LMArena policy update on experimental submissions: https://lmarena.ai/blog/policy-update-llama-4
- ai-security
- retrospectiva-2025
- llm
- reasoning-models
- agentic
- mcp
- agentic-misalignment
- eu-ai-act
- dora
- open-weights
- vendor:anthropic
- vendor:openai
- vendor:deepseek


