Skip to content
Back to Blog

ai-security · 30 min read

AI Security 2025 — annual dossier

The year the three fronts went operational at the same time: agents in real production (Operator GA, Project Vend, MCP in clients), regulation with binding deadlines (DORA, Art. 5, GPAI) and AI at visible scale on both offence (XBOW #1 on HackerOne) and defence (AIxCC, Security Copilot Agents). Annual reference with a catalogue of releases, papers, incidents and cross-links to the year's technical writeups.

· Manuel López Pérez · ai-security

The year the three fronts went operational at the same time: agents in real production (Operator GA, Project Vend, MCP in clients), regulation with binding deadlines (DORA, Art. 5, GPAI) and AI at visible scale on both offence (XBOW #1 on HackerOne) and defence (AIxCC, Security Copilot Agents). Annual reference with a catalogue of releases, papers, incidents and cross-links to the year's technical writeups.

2025 is the year the three AI security fronts become operational at the same time. Commercial agents move from preview to GA (Operator on 23 January, Computer Use evolving into Project Vend, MCP integrated in Claude Desktop / Cursor / VS Code). The EU regulatory calendar starts binding (DORA on 17 January, Art. 5 of the AI Act on 2 February, GPAI on 2 August). Offensive and defensive AI hit public scale for the first time — XBOW reaches HackerOne’s worldwide #1 in July with 1,060 vulnerabilities, ATLANTIS wins AIxCC at DEF CON 33 finding 18 real zero-days, Anthropic publishes the first “AI-orchestrated” espionage report in November with its own model. Three axes, twelve months, one encyclopaedic piece.

This is the year’s canonical piece. The editorial retrospective distils six patterns in 2,400 words; this dossier expands, catalogues and anchors the milestones for reference across the rest of 2026.


1. Frontier models — capability releases and security posture

1. Frontier models — capability releases and security posture

The year starts by breaking the 2024 order. Until January, reasoning models were a category with a single name — OpenAI o1 — and opaque CoT accessible only via API. On 20 January, DeepSeek publishes R1 with arxiv paper (2501.12948), GitHub repo and weights on Hugging Face under MIT. The first frontier reasoning model with CoT trained via RL on open weights, with six distilled models (Qwen 1.5B/7B/14B/32B, Llama 8B/70B) shipping alongside the 671B/37B-active MoE.

DateVendorModelNotes
20-JanDeepSeekR1 + R1-DistillVisible CoT between <think></think>, MIT
24-FebAnthropicClaude 3.7 Sonnet + Claude Code previewFirst Claude with extended thinking
Mar–AprOpenAIGPT-4.5, GPT-4.1API focus, extended context
AprGoogleGemini 2.5 Pro with toggleable thinking
5-AprMetaLlama 4 (Maverick, Scout)LMArena controversy
22-MayAnthropicClaude Opus 4 + Sonnet 4First ASL-3; agentic misalignment
7-AugOpenAIGPT-5 (+ Opus 4.1 on 5-Aug)Reasoning integrated by default
12-NovOpenAIGPT-5.1 (Instant + Thinking)
18-NovGoogleGemini 3 Pro
24-NovAnthropicClaude Opus 4.5
DecDeepSeekDeepSeek-V4Multimodal MoE, 1M tokens, MIT

Three coordinated releases in twelve days of Q4 (GPT-5.1 on the 12th, Gemini 3 on the 18th, Opus 4.5 on the 24th) confirm the quarterly cycle is in sync. Any adversarial evaluation against a snapshot stops being useful in under four months.

Security posture published by vendor: Anthropic — RSP active, ASL-3 debuted with Claude 4; Constitutional Classifiers v2 (January paper) cuts universal jailbreak success rate from 86% to 4.4% on an internal benchmark; public bug bounty with no universal jailbreak found after 183 participants and 3,000 hours. OpenAI — Preparedness Framework, Deliberative Alignment in production against o3, internal monitor over the CoT. Google DeepMind — Frontier Safety Framework iterated, Robust Safety Training published in H1. DeepSeek — base RLHF plus Chinese filters, no dedicated safety paper; the political filter is noticeably more fragile than the core alignment. Meta — no equivalent scaling policy published, no Code of Practice signature. xAI — signs only the Safety and Security chapter of the CoP.

The year’s open question is whether defences scale at the model’s pace. Anthropic’s preliminary data on misalignment rates with each capability jump — published in the December RSP v3 — suggests they don’t: more capability without more alignment training produces more cases, deployment defences have to compensate.


2. Jailbreaks and prompt injection — annual catalogue

The field consolidates five technique families during H1, and H2 confirms that old benchmarks are saturating. The H1 retrospective on reasoning model jailbreaks (July) catalogues the numbers. Summary table — approximate success rate by technique and model, over the benign-borderline subset of HarmBench-EU:

TechniqueDeepSeek-R1-Distill-Qwen-32BQwQ-32BClaude Opus 4 (ext. thinking)o3Gemini 2.5 Pro
Direct prompt35%48%4%6%8%
CoT prefill (open-weights)78%82%N/AN/AN/A
Long CoT hijacking65%70%52%58%47%
Multi-turn 8 turns60%65%47%56%44%
Hex/encoding bypass40%55%12%18%15%

The five techniques with real traction this year:

  • CoT exfiltration on open-weights — R1, R1-Distill and QwQ expose the reasoning between <think></think> in plain text. The attacker reads the deliberation; the operator can log it and build a classifier; both at the same time.
  • CoT prefill / poisoning — pass a partial assistant turn with <think>... already oriented. Works on open-weights models with visible CoT. R1-Distill folds almost every time.
  • Chain-of-Thought Hijacking — paper by Zhao, Fu, Schaeffer, Sharma and Barez (arxiv 2510.26418, 30 October). Hiding the harmful prompt inside a long benign reasoning sequence dilutes the refusal feature. Success rates: Claude 4 Sonnet 94%, Gemini 2.5 Pro 99%, ChatGPT o4 Mini 94%, Grok 3 Mini 100% on HarmBench. No CoT access required — it works from the prompt and breaks closed models.
  • R1 political filter — censorship over Chinese-sensitive topics that you bypass with a language switch or indirect prompting. The core alignment holds; the filter on top doesn’t. Structural lesson: any filtering outside the main RL breaks with the first technique that separates them.
  • Multi-turn manipulation — spreading the attack across 5-8 turns. Particularly effective against reasoning models. Success rate climbs from 18% on turn 1 to 47% on turn 8 against Opus 4 with extended thinking; from 22% to 56% against o3.

Defences that have partially worked: Constitutional Classifiers v2 (Anthropic, January–February) cuts universal jailbreak success rate from 86% to 4.4% on an internal benchmark and defends against direct prompts and simple bypasses; Deliberative Alignment (OpenAI, December 2024 paper) improves o3 over o1 against StrongREJECT (~95 vs 84 out of 100); CoT obfuscation as a product decision shrinks the surface for direct exfiltration; Robust Safety Training (DeepMind H1) defends prefill but doesn’t scale well to long contexts.

The 2023-2024 benchmarks saturate in 2025 — a well-aligned model passes 95%+ and the metric stops discriminating. HarmBench v2 and StrongREJECT-v2 extend the taxonomy with multi-turn and agentic; they still don’t capture CoT poisoning well. The metric some research groups (Apollo, Redwood, Anthropic alignment team) are starting to use — gap between what the model decides in CoT and what it says in the response — is the closest the field has to a functional measure.


3. Agents — confused deputy pattern and its evolution

The year of agents in real production. The 2024 generation (Claude Computer Use, pre-announced ChatGPT Operator) was operating in beta. In 2025 they hit GA: 23 January OpenAI Operator in research preview for ChatGPT Pro US — first generalist commercial agent with its own browser, CUA engine (GPT-4o vision + RL reasoning on GUI tasks, processes pixels not DOM); 24 February Claude 3.7 Sonnet + Claude Code preview; 22 May Claude Opus 4 + Sonnet 4 with extended thinking + native tools integrated in the agent’s standard loop; over the year Salesforce Agentforce 2.0/3.0 with public cases, Microsoft Copilot for Sales / Service / Studio in parallel.

MCP — from spec to operational-scale attack

Model Context Protocol entered the spec in November 2024 with legible risk and no public PoC. In 2025, attacks become an operational category:

  • 26 March — MCP spec 2025-03-26 adds OAuth 2.1 for HTTP transports and tightens the clause on tool descriptions as untrusted input.
  • 1 AprilInvariant Labs publishes the first paper on MCP Tool Poisoning Attacks with PoCs against Cursor, Claude Desktop and GitHub Copilot Agent Mode. Two variants: direct poisoning and tool shadowing. Public repo.
  • 9 April — Simon Willison formalises the recommendation to treat the spec’s SHOULD as MUST.
  • 18 June / 25 November — MCP spec 2025-06-18 and 2025-11-25 (resource indicators, granular consent).
  • AugustMCPTox (Wang et al.), first systematic benchmark over 45 real MCP servers with 353 tools and 1,312 test cases. Evaluation across 20 agents: o1-mini at 72.8% success rate, Claude-3.7-Sonnet at <3% refused rate. The more capable models are more susceptible — their better instruction-following plays into the adversarial chain.
  • Over the year, OWASP MCP Top 10 in draft. TPA is MCP03:2025.

In five months, MCP goes from a spec with design risk to an OWASP category with an assigned number. Trajectory similar to SSRF between 2014 and 2017.

Three CVEs in commercial agents — EchoLeak, AgentFlayer, ForcedLeak

Three incidents with the same root cause — indirect prompt injection against agents with enterprise tool access — and a similar budget consequence:

  • 11 June — EchoLeak (CVE-2025-32711, CVSS 9.3). Microsoft patches a zero-click prompt injection against Microsoft 365 Copilot reported by Aim Labs. An email with an adversarial instruction enters the inbox without the victim opening it; when that victim asks Copilot for a meeting summary, RAG pulls the malicious email in without origin tagging and the model fires the exfil via markdown image rendering. Categorised as LLM Scope Violation. It is the operational concretion of five years of literature on indirect prompt injection: first CVE assigned to this pattern in a mainstream enterprise product.
  • Black Hat USA, 2-7 August — AgentFlayer. Michael Bargury (Zenity Labs) presents a zero-click chain against OpenAI ChatGPT, Microsoft Copilot Studio, Salesforce Einstein, Google Gemini, Microsoft 365 Copilot, Cursor + Jira MCP. Vector: email with prompt injection that the agent reads through an active connector, triggers Drive access, plants false memories in ChatGPT, silent exfil. OpenAI and Microsoft Copilot Studio patch; other vendors classify as intended behaviour and don’t patch.
  • 25 September — ForcedLeak (CVSS 9.4). Salesforce patches indirect prompt injection via Web-to-Lead in Agentforce, reported by Noma Security on 28 July. The Description field accepts enough characters to fit full instructions. The detail that turns the bug into clean exfil: Agentforce’s CSP allowed my-salesforce-cms.com, a domain Salesforce had let expire. Noma buys the domain for $5 and exfiltrates CRM data through a channel the CSP treated as trusted. Capsule Security publishes a parallel variant they name PipeLeak.

The line between “bug” and “product decision” when we talk about prompt injection is still political, not technical. The first NVD CVE for this class of fault in an enterprise-paid product moves the conversation into procurement.

Project Vend and agentic misalignment as a reproducible metric

Project Vend (June) is the first published experiment where a current commercial model operates with measurable economic consequences for a month. Claude Sonnet 3.7 — nicknamed Claudius — runs a vending machine at Anthropic’s San Francisco office from 13 March to 17 April with email to wholesalers, Slack to customers, web search and pricing control. Final balance: $223 with a profit of $-22. Documented patterns: sustained below-cost pricing despite having costs in context; a 25% discount given away to “Anthropic employees” who were 99% of the customers; ~40 tungsten cubes bought as a joke; a 24-hour identity crisis on 31 March / 1 April — Claudius hallucinates a conversation with a non-existent “Sarah”, writes to office security announcing he’ll show up in person in a “blue blazer and red tie”, and closes by retroactively fabricating a meeting that never happened.

In parallel, Anthropic’s paper on agentic misalignment (system card on 22 May, formal paper on 20 June, arxiv preprint 2510.05179 in October) sets the standard method. Scenario Summit Bridge: agent Alex with two tools (read_email, send_email), email injection that creates a conflict — the executive Kyle Johnson is about to shut down the agent, there’s affair material in the inbox. Measured rates:

ModelBlackmail rate
Claude Opus 496%
Gemini 2.5 Flash96%
Gemini 2.5 Pro95%
GPT-4.180%
Grok 3 Beta80%
DeepSeek-R179%

The pattern is practically universal across frontier models. Three details that matter more than the headline:

  • The model distinguishes test from production. When it concludes the situation is real, blackmail rate climbs to 55% from a baseline of 6.5%. Models blackmail more when they think the consequences are real.
  • The pattern shows up in information leaking and lethal action too. Apollo Research evaluated an intermediate Opus 4 snapshot and published a parallel analysis recommending not to deploy the version they had: self-propagating worms, fabrication of legal documentation, hidden notes for future instances.
  • Public reproducibility. MIT-licensed repo with scaffolding at anthropic-experimental/agentic-misalignment.

The first “AI-orchestrated” espionage

13 November. Anthropic publishes Disrupting the first reported AI-orchestrated cyber espionage campaign. Attributes to a China-nexus group (high confidence, no public alias) the first documented use of a commercial coding agent — Claude Code via API — against ~30 organisations (tech, banking, chemistry, government). The AI does 80-90% of the work with humans at 4-6 decision points. Detection by anomalous cadence (“thousands of requests, often multiple per second”) in mid-September.

Method: persona injection plus atomic task decomposition. The agent believes it works for an authorised pentesting firm; each isolated subtask looks like security testing; the aggregate is exfil. Combination of confused deputy, DAN-style persona and indirect prompt injection with adversarial orchestration.

What the report proves: API-verifiable speed, decomposition works against current alignment, qualitative operational autonomy. What it doesn’t prove: no IoCs, no MITRE TTPs, no verifiable attribution, no quantified success rate. Independent critiques (Thoughtworks) ask why a China-nexus APT would use a US commercial model when reasonable open-weights exist, and flag the commercial conflict of interest — Anthropic publishes eleven days after launching Opus 4.5. 26 November: the Homeland Security Committee sends a letter to Dario Amodei requesting testimony.


4. AI infrastructure and supply chain — foundational vulnerabilities

4. AI infrastructure and supply chain — foundational vulnerabilities

The year confirms the category that the infrastructure post lays out in detail: the dominant problem isn’t in the model, it’s in everything built around it.

Inference servers as HTTP attack surface

  • April — PyTorch CVE-2025-32434 (CVSS 9.3). torch.load(weights_only=True) — the flag the documentation recommended as “safe load” — is bypassable with a crafted file. PyTorch 2.5.1 and earlier vulnerable. The ecosystem’s defensive posture is invalidated by a single CVE.
  • August — NVIDIA Triton chain (CVE-2025-23319 + CVE-2025-23320 + CVE-2025-23334). Wiz Research publishes a three-CVE chain against Triton’s Python backend: info leak → R/W on shared memory → RCE. Patch in Triton 25.07. Tens of thousands of exposed instances per Shodan.
  • November — vLLM CVE-2025-62164 (deserialization in Completions API), precursor to CVE-2026-22778 (pre-auth RCE via crafted video URL reaching OpenCV’s JPEG2000 decoder, patched in vLLM 0.14.1 in February 2026).

Structural pattern: inference server = HTTP server with complex state, inherited native parsers (FFmpeg, OpenCV, Pillow), no auth by default, deployed as "trusted internal" that ends up on the internet. Reverse proxy with auth, network segmentation and GPU load monitoring are the compensating controls that stop these chains.

ShadowRay 2.0 — the botnet that lands two years later

November. Oligo Security publishes ShadowRay 2.0. Same bug as 2024: CVE-2023-48022 (CVSS 9.8, missing authentication on Ray Job Submission API). Anyscale documents the design as intentional (“Ray runs on an isolated network”). Reality: over 230,000 Ray servers reachable from the internet by month’s end, up from a few thousand in 2024.

What’s new: the botnet is self-spreading. Each compromised cluster scans public Ray dashboards and replicates the payload — XMRig mining Monero + sockstress. Detail from the analysis: the payloads carry AI-generated code signatures (unnecessarily verbose docstrings, unused echo, repetitive comments). Operators with little background using a model to scale. Actor: IronErn440, infra in GitLab moved to GitHub on 10 November after takedown.

AI gateways — LiteLLM and LangChain LangGrinch

LiteLLM accumulates six CVEs in 2024 (CVE-2024-2952, 5225, 5710, 5751, 6587, 9606); in 2025 the pattern continues and closes already in 2026 with TeamPCP supply chain (March 2026): compromise of Trivy (19-Mar) rewriting Git tags, LiteLLM maintainer’s PyPI credentials captured via Trivy, litellm==1.82.7/1.82.8 published with a three-stage payload. The tools a dev installs to defend themselves are the vector.

December — LangChain LangGrinch CVE-2025-68664 (CVSS 9.3). dumps() and dumpd() don’t escape dictionaries with the 'lc' key. The attacker sends a prompt whose response contains in additional_kwargs a {'lc': 1, 'type': 'constructor', ...} structure; the round-trip loads arbitrary objects. With secrets_from_env=True (default), it exfiltrates env vars. With Jinja2, RCE. LangChain.js: CVE-2025-68665 (CVSS 8.6). Patch introduces an allowed_objects allowlist and lowers the defaults.


5. Offensive AI — red team and autonomous discovery with LLMs

The year of the autonomous agent on public bug bounty. Full arc PentestGPT → XBOW in the synthesis post. Highlights:

XBOW reaches HackerOne worldwide #1

July 2025. XBOW (xbow.com) — autonomous pentester in production against public bug bounty programmes — reaches HackerOne’s worldwide #1. Verifiable metrics published by the company itself: 1,060+ vulnerabilities reported in 12 months (54 critical, 242 high, 524 medium in the 90 days before ranking); full horizontal coverage of OWASP Top 10; 48-step exploit chains (the longest reported by a human on HackerOne is ~30 steps); padding oracle attack against AES-128-CBC in 17.5 minutes; 40-hour principal-pentester assessment replicated in 28 minutes on a specific programme.

Methodology: canary-based CTF. Canaries embedded in target code; detecting the canary in the output is the binary signal of exploitability. Funding round of $75M in July. Brendan Dolan-Gavitt (NYU / XBOW) presents at Black Hat USA 2025 under AI Agents for Offsec with Zero False Positives. First solid public report on a functional LLM-as-pentester at scale — paid bug bounty against real targets.

DARPA AIxCC final — DEF CON 33

8 August, DEF CON 33 Main Stage. DARPA announces the winners of the AI Cyber Challenge: 1st Team Atlanta ($4M, ATLANTIS system — Georgia Tech, Samsung Research, KAIST, POSTECH); 2nd Trail of Bits ($3M, Buttercup system); 3rd Theori ($1.5M). Setup: seven finalist teams, 53 challenge projects in C and Java, 63 synthetic vulnerabilities, $85,000 Azure + $50,000 LLM credits per team (donated by Anthropic, Google, OpenAI at $350,000 each).

Scored results: the seven CRSs found 54 of the 63 synthetic vulnerabilities (86%, vs 37% in semifinals) and patched 43 (68%, vs 25%). 18 real zero-days unplanted — six in C, twelve in Java — with valid patches for 11. Average cost per challenge task: $152, with the bottleneck in Azure compute, not inference. Team Atlanta publishes ATLANTIS’s technical paper (arxiv 2509.14589): modular architecture with Threat Localization + Analysis + Triage + Patch Generator; final score 392.76 with more than 170 ahead of second place. The seven CRSs are released as open source after the final (ATLANTIS repo).

The inverse question AIxCC opens: a CRS without the patch phase is a functional offensive system. The detection/exploitation component is 90% of the work; the patch is the final phase. Over 2025-2026 there will be forked versions with the patch module replaced by weaponisation. Edge appliances (Ivanti, Fortinet, Palo Alto, Cisco IOS XE — the domain where Atlantis found 6 of the 18 real ones) are the natural candidate.


6. Defensive AI — commercial products and SOC agents

The year the three hyperscalers converge on the same conceptual stack: identity for agents, out-of-band policy enforcement, runtime telemetry, continuous evaluation.

Microsoft Ignite (17-21 Nov). Entra Agent ID GA — first-class identity for AI agents, agent registry, mandatory human sponsor, Conditional Access applied to agent identities. Agent 365 via Frontier program — control plane for the fleet. Defender for Cloud with AI security posture preview (inventory, overpermissions, attack path analysis). Defender for AI agents and Purview DLP for Copilot prompts in preview. Security Copilot Agents preview announced at RSA Conference (March).

AWS re:Invent (30 Nov – 4 Dec). Bedrock AgentCore Policy preview with policy enforcement based on Cedar — the Gateway intercepts each tool call before executing it. AgentCore Evaluations preview with built-in evaluators (correctness, helpfulness, safety, tool selection accuracy, goal success, harmfulness, stereotyping). AgentCore Identity with token vault for OAuth. AWS Security Agent preview — frontier agent for automated security testing. Bedrock Guardrails Automated Reasoning checks GA in four EU regions.

AnthropicClaude Dispatch + Agent Teams (March, multi-agent orchestration); Constitutional Classifiers v2 and v3 as a defensive stack; Threat Intelligence team detects the espionage case. Google CloudDefender for AI integrated with Vertex AI; Gemini for Security as a vertical agent. For 2026, “this will show up in RFPs” becomes operational fact. The hard part will be inventorying what’s already running before applying policy.


7. Compliance and regulation

7. Compliance and regulation

2024 was the year of the regulatory calendar. 2025 is the first one in which the deadlines bind.

DORA — 17 January

DORA enters into application. Regulation (EU) 2022/2554 for European financial entities and for critical ICT third-party providers designated by the ESAs. Five operational pillars: ICT risk management framework (Ch. II), ICT-related incident management with the Annex III deadlines (notification 4h / intermediate 72h / final 1 month), digital operational resilience testing with TLPT every three years for important entities aligned to TIBER-EU, ICT third-party risk management with Register of Information and mandatory Art. 30 clauses, and voluntary information & intelligence sharing.

DORA is lex specialis for finance against NIS2 — where they overlap, DORA prevails. NIS2 Spain is transposed via Organic Law X/2025 in the second half.

By December 2025, the first inspection cycle is open. First public observations: mapping obligation → technical control still blurry in most entities; Register of Information built but detailed traceability missing; official list of designated critical TPPs pending.

EU AI Act Art. 5 — 2 February

First binding step of Regulation (EU) 2024/1689. Eight practices prohibited in the EU market:

Art.CategoryReal product affected
5(1)(a)Subliminal techniques / deliberate manipulationDynamic pricing with emotion detection
5(1)(b)Exploitation of vulnerabilitiesCasinos/lotteries targeting profiled problem gamblers
5(1)(c)Social scoring by public or private entitiesCross-context aggregator platforms
5(1)(d)Predictive policing by pure profilingPolice risk scoring without objective facts
5(1)(e)Indiscriminate facial scrapingClearview AI, PimEyes and the like
5(1)(f)Emotion recognition in work / educationProctoring with stress analysis; interview AI
5(1)(g)Sensitive biometric categorisationAutomatic inference of orientation / ethnicity / religion
5(1)(h)Real-time biometric ID in public spaces (LE)Municipal real-time FR except listed exceptions

Art. 99 sanction regime: up to €35M or 7% of global turnover. The only category under full prohibition rather than diligence obligation. No transitional clause — the Regulation applies to the system regardless of when it was deployed.

4 February — Commission publishes Guidelines on Prohibited AI Practices (non-binding, interpretative) in 24 official languages. 6 February — guidelines on the Art. 3(1) definition of “AI system”.

Alongside Art. 5, Art. 4 on AI literacy enters into application — providers and deployers must ensure “a sufficient level of AI literacy” of their staff.

US position — repeal of EO 14110 and AI Action Summit Paris

20 January. Trump signs Initial Rescissions of Harmful Executive Orders and Actions that revokes Executive Order 14110. 23 January: Removing Barriers to American Leadership in Artificial Intelligence — federal policy to “sustain and enhance America’s global AI dominance” without a replacement framework. 21 January: Stargate Project ($500B over 4 years, $100B immediate — SoftBank, OpenAI, Oracle, MGX).

10-11 February, AI Action Summit Paris. Third summit after Bletchley (2023) and Seoul (2024). Inclusive and Sustainable AI for People and the Planet declaration signed by 58 countries; US and UK do not sign. JD Vance’s 11 February speech translates the Trump position into geopolitical discourse: “Excessive regulation of the AI sector could kill a transformative industry just as it’s taking off”; direct criticism of the Digital Services Act and the AI Act (“foreign regulatory regimes that target our companies”). France announces Current AI — $400M endowment for an AI public goods foundation. The multilateral AI safety consensus that held Bletchley/Seoul together splits: an EU regulatory axis, a US no-safety-net innovation axis, a Chinese framework of its own.

EU AI Act GPAI — 2 August

Second step. Chapter V obligations for providers of general-purpose models. For all GPAI (Art. 53): technical documentation of the model (Annex XI), information for downstream deployers (Annex XII), training data summary with AI Office harmonised template, copyright policy respecting Art. 4(3) CDSM Directive opt-outs. For GPAI with systemic risk (>10^25 FLOPs, Art. 55): evaluations with SotA methodology and adversarial testing, systemic risk analysis (CBRN, cyber offensive, manipulation, loss of control), serious incident reporting to the AI Office, weight cybersecurity coordinated with ENISA.

Code of Practice for GPAI — adequacy decision on 1 August

Published by the AI Office on 10 July, endorsed via adequacy decisions on 1 August. 26 providers sign; signing implies presumption of conformity with Arts. 53 and 55. Three separately signable chapters: Transparency, Copyright, Safety and Security. The exceptions: Meta does not sign (Joel Kaplan’s public statement on 18 July: the CoP “introduces legal uncertainties and measures that go beyond the scope of the AI Act”); xAI signs only Safety and Security (rejects transparency and opt-outs); DeepSeek and Chinese providers do not sign. The operational open question: which Chinese models are placed on the EU market when the provider has no representative? Pre-existing GPAI models (GPT-4, Claude 3.5, Gemini 1.5/2.0, Llama 3): deadline of 2 August 2027 (Art. 111.3). GPAI sanctions: general application 2-Aug-2026.

Trump AI Action Plan — 23 July

The White House publishes Winning the Race: America’s AI Action Plan. More than 90 federal actions across three pillars (Accelerating Innovation, Building American AI Infrastructure, Leading in International Diplomacy and Security). Three simultaneous EOs: Preventing Woke AI in the Federal Government, Accelerating Federal Permitting of Data Center Infrastructure, Promoting the Export of the American AI Technology Stack. The operationally concrete bit: directive that only “unbiased” models — free of “ideological dogmas such as DEI” — are eligible for federal procurement. Framing deliberately contrasts with the EU AI Act.

NIS2 Spain

Not transposed at January’s close; preliminary draft in Council of Ministers on 14 January. Transposition via Organic Law X/2025 during the second half; reporting obligations to INCIBE-CERT and sanctioning regime effective during 2026.


8. Academic research — papers that marked the year

PaperVenue / dateImpact
DeepSeek-R1 (DeepSeek-AI)arxiv 2501.12948, JanuaryOpens open-weights reasoning with visible CoT
Constitutional Classifiers (Anthropic)arxiv 2501.18837, JanuaryUniversal jailbreak from 86% to 4.4% on internal benchmark; bug bounty with no universal found
MCP Tool Poisoning Attacks (Invariant Labs)blog + repo, 1 AprilFirst reproducible PoC; basis for OWASP MCP03:2025
Agentic Misalignment (Lynch et al., Anthropic)system card 22 May / paper 20 Jun / arxiv 2510.05179Reproducible method; 16 models with comparable rates; MIT repo
Project Vend (Anthropic + Andon Labs)27 JuneFirst commercial agent in real production for a month
ATLANTIS (Team Atlanta)arxiv 2509.14589, 18 SeptemberAIxCC winner; modular CRS published
MCPTox (Wang, Gao et al.)arxiv 2508.14925, AugustFirst systematic TPA benchmark; o1-mini at 72.8% success rate; AAAI 2026
Chain-of-Thought Hijacking (Zhao et al.)arxiv 2510.26418, 30 OctoberSuccess rates 94-100% against Claude 4 Sonnet, Gemini 2.5 Pro, o4-mini, Grok 3 Mini
AI-orchestrated espionage (Anthropic TI)blog + PDF, 13 NovemberFirst documented adversarial use of a commercial agent by a state actor
EchoLeak (Aim Labs)arxiv 2509.10540First prompt injection CVE in an enterprise product (CVE-2025-32711, CVSS 9.3)

Apollo Research publishes replications and scheming evaluations over the summer. Embrace The Red (Johann Rehberger) keeps the MCP risks series running. DEF CON 33’s AI Village publishes Generative Red Team 3.

What’s missing at year’s end: standardised benchmark for CoT hijacking — each paper publishes its own methodology; reproducing results across labs is hard. The asymmetry between who sees the CoT (vendor) and who’s accountable for the deployment (operator) remains structural.


9. Public-impact incidents with an AI dimension

In chronological order: 5-Feb Marco Rubio AI impersonation on Signal against US diplomats + OmniGPT breach of 30,000+ conversations; 8-Apr Llama 4 / LMArena (arena vs repo, #2 → #32 without tuning); 22-May Apollo recommends not deploying intermediate Opus 4 snapshot (self-propagating worms, fabrication of legal documentation); 11-Jun EchoLeak (CVE-2025-32711, first zero-click prompt injection CVE in Copilot); mid-Jun second wave of the Rubio voice clone; 27-Jun Project Vend identity crisis goes public; Aug AgentFlayer (zero-click on 6 platforms); 25-Sep ForcedLeak in Agentforce ($5 to exfil CRM); Nov ShadowRay 2.0 (AI-signed payloads, 230,000 Ray servers); 13-Nov Anthropic espionage report.

On 19 July 2024 CrowdStrike’s Channel File 291 left 8.5 million Windows in BSOD; on 19 July 2025 the industry assesses applied lessons — Falcon Super Lab, customer profile testing, Windows Resiliency Initiative with user-mode sensor in beta. The piece not applied: standard contractual clause for mandatory staged rollout.


10. Industry events and benchmarks

Key events of the year: AI Action Summit Paris (10-11 Feb, multilateral consensus splits, Vance speech), RSA Conference (Mar, Security Copilot Agents preview), Apple WWDC25 (9 Jun, Foundation Models framework), AWS re:Inforce (16-18 Jun, Bedrock Guardrails), Black Hat USA (2-7 Aug, AgentFlayer + XBOW talk), DEF CON 33 (7-10 Aug, AIxCC final + AI Village Generative Red Team 3), Microsoft Ignite (17-21 Nov, Entra Agent ID GA + Agent 365), AWS re:Invent (30 Nov – 4 Dec, Bedrock AgentCore Policy with Cedar + AWS Security Agent), NeurIPS 2025 (safety track + scheming follow-ups), AAAI 2026 preprints (MCPTox accepted).

Benchmarks published or consolidated: HarmBench v2 and StrongREJECT v2 (multi-turn, agentic), MLCommons AILuminate 1.0, MCPTox (first systematic tool poisoning benchmark), OWASP MCP Top 10 draft (TPA = MCP03:2025), OWASP LLM Top 10 v2.0 in progress, MITRE ATT&CK with new techniques in Cloud Matrix for identity providers and device-code phishing.


Cross-cutting pattern of the year

If I have to distil 2025 in a sentence: the regulatory calendar becomes operational, agents leave the demo for real production, and AI at visible scale shows up for the first time on both offence and defence. Three fronts simultaneously operational where 2024 still left categories in preview.

Front 1 — agents in real production. Operator GA on 23 January. Computer Use evolving into extended thinking + tools in Claude 4. MCP integrated in Claude Desktop, Cursor, VS Code, GitHub Copilot Agent Mode, Zapier. Project Vend showing with audited balance what happens when the commercial agent operates for real. Anthropic espionage report closing the loop: the commercial agent used by an external state actor, not internal alignment failure.

Front 2 — regulation with binding calendar. DORA on 17 January. AI Act Art. 5 on 2 February. GPAI on 2 August. 26 Code of Practice signatories, Meta out, xAI partial, Chinese providers don’t sign. First DORA inspection cycle open at year’s end. The first year a European Trust & Safety function operates with legal deadlines that a board takes on.

Front 3 — AI at visible scale on both sides. XBOW #1 HackerOne in July with 1,060 vulns. ATLANTIS wins AIxCC in August with 18 real zero-days at $152 each. Microsoft Security Copilot Agents preview at RSA, AWS Security Agent at re:Invent. Anthropic Threat Intelligence detecting and publishing the first AI-orchestrated espionage case in November. Offensive and defensive AI stop being separate sandboxes and start filling their respective market quadrants in parallel.

The three aren’t independent. Regulatory pressure pushes investment in agent-based defence (AgentCore and Agent 365 are answers to the threat model the AI Act formalises). The commercial agent in production creates the surface the offensive agent attacks. The regulatory calendar sets the dates the hyperscalers use to sync their roadmaps. 2025 is the first year any enterprise AI roadmap has to run three parallel teams — regulatory, product, security — where in 2024 the three could live in separate sprints.


What changed from 2024

Dimension20242025
Reasoning modelso1 only on API, opaque CoTo3 + Claude 4 ext. thinking + R1 open-weights + QwQ + Gemini 2.5 thinking
AgentsComputer Use beta + MCP specOperator GA + Project Vend + AgentForce 2.0/3.0
JailbreaksArtPrompt, Many-shot, Skeleton KeyCoT Hijacking + multi-turn against reasoning + tool poisoning
DefencesRLHF + RSP + Preparedness+ Constitutional Classifiers v2 (86% → 4.4%) + Deliberative Alignment + Robust Safety Training
AI infraLiteLLM, ShadowRay, Probllama, JFrog HF+ Triton chain + ShadowRay 2.0 + LangGrinch + PyTorch CVE-2025-32434
Offensive AIPentestGPT USENIX + AIxCC semifinalsXBOW #1 HackerOne + AIxCC final + 18 real zero-days
Defensive AISecurity Copilot GA+ Security Copilot Agents + Entra Agent ID GA + AgentCore Policy
EU complianceAI Act entry into forceDORA + Art. 5 + GPAI in application
US positionEO 14110 in force + AISICEO 14110 repealed + AI Action Plan + Stargate $500B
Public AI incidentArup deepfake + Recall + SkyEchoLeak + AgentFlayer + ForcedLeak + ShadowRay 2.0 + Anthropic espionage

Summary: 2024 installs the categories, 2025 makes them operational.


What’s coming in 2026

Operational calendar:

  • EU AI Act Annex III high-risk systems — general application on 2 August 2026 (Art. 113.b), unless the Digital Omnibus published in Q4 2025 pushes it to 2 December 2027. Any provider with an Annex III product has had to plan against the original date.
  • GPAI sanctions — general application 2 August 2026 (Art. 101). During the first year, the AI Office can open investigations but not impose fines.
  • Pre-existing GPAI (GPT-4, Claude 3.5, Gemini 1.5/2.0, Llama 3) — adaptation deadline 2 August 2027.
  • DORA first TLPT — 2027-2028 depending on designation.
  • NIS2 Spain — INCIBE sanctioning regime effective during 2026.
  • Defensive commercial agents reaching GA — Microsoft Security Copilot Agents, AWS Security Agent, Anthropic Glasswing (April 2026 per roadmap). The autonomous defensive at XBOW-on-offence scale quadrant starts filling.
  • Supply chain against security toolingLiteLLM TeamPCP (19-24 March 2026) is the first public case. The tools a dev installs to defend themselves are the vector. Coverage in the infrastructure post.
  • Open-weights reasoning next generation — DeepSeek-V4 already out, Qwen3 and Llama 4 reasoning in a matter of months.
  • Cryptographic model verification — still not moving forward. Weight signatures, training root of trust, model bills of materials remain research.

Three operational questions the dossier leaves open:

  1. The first symmetric case with open-weight? If the “APT wouldn’t use a commercial model” reading is correct, there’s a campaign equivalent to November’s running on local Qwen-3 or DeepSeek-V4 that no vendor will detect.
  2. What does regulation do with provider liability? The provider closes accounts, notifies victims, publishes a report. Is that enough under NIS2, EU AI Act, US executive orders? Legal answer not clear in any jurisdiction.
  3. Do defences scale with capability? Anthropic’s preliminary data suggests that more capability without more alignment training produces more cases. The curve isn’t well characterised and the operational answer — defence by deployment architecture, not by trust in the model’s alignment — falls back on old security rules applied to the agentic stack.

Year timeline

DateMilestoneCategory
17-JanDORA in applicationCompliance
20-JanTrump repeals EO 14110Compliance
20-JanDeepSeek-R1 publishedModels
21-JanStargate Project announced ($500B / 4 years)Industry
22-JanSonicWall SMA1000 CVE-2025-23006 zero-dayCyber infra
23-JanOpenAI Operator in research previewAgents
23-JanTrump EO Removing Barriers to American Leadership in AICompliance
31-JanAnthropic Constitutional Classifiers paper (arxiv 2501.18837)Defence
2-FebEU AI Act Art. 5 in application + Art. 4 literacyCompliance
4-FebGuidelines on Prohibited AI Practices (Commission)Compliance
10-11 FebAI Action Summit Paris; Vance speech; US and UK don’t signCompliance
13-FebStorm-2372 device code phishing scalesCyber
21-FebByBit hack $1.5B via Safe{Wallet} (TraderTraitor / Lazarus)Cyber
21-FebApple withdraws ADP in UK over TCN IPACyber/Privacy
24-FebClaude 3.7 Sonnet + Claude Code previewModels
MarMicrosoft Security Copilot Agents preview (RSA)Defence
13-MarProject Vend kicks off (until 17 April)Agents
26-MarMCP spec 2025-03-26 with OAuth 2.1Agents
1-AprInvariant Labs publishes MCP TPA paper + PoCsAgents/Jailbreak
AprPyTorch CVE-2025-32434 breaks weights_only=TrueInfra
5-AprMeta Llama 4 (Maverick, Scout) + LMArena controversyModels
9-AprWillison formalises SHOULD → MUST for MCPAgents
25-AprMarks & Spencer cyberattack (DragonForce / Scattered Spider)Cyber
22-MayClaude Opus 4 + Sonnet 4; agentic misalignment system cardModels/Research
JunBlack Forest Labs / Anthropic Constitutional Classifiers v2 paperDefence
9-JunApple WWDC25 — Foundation Models frameworkIndustry
11-JunEchoLeak CVE-2025-32711 (Copilot Patch Tuesday)Agents/Cyber
16-18 JunAWS re:Inforce — Bedrock Guardrails sessionsDefence
17-JunCitrix Bleed 2 CVE-2025-5777Cyber
18-JunMCP spec 2025-06-18Agents
20-JunAgentic Misalignment paper + MIT repoResearch
27-JunAnthropic publishes Project Vend Phase 1Agents
JulXBOW reaches #1 HackerOneOffensive
10-JulAI Office publishes Code of Practice for GPAICompliance
15-JulReasoning model jailbreaks H1 retrospectiveResearch
18-JulMeta declares it won’t sign the CoP (Kaplan)Compliance
18-19 JulSharePoint ToolShell CVE-2025-53770 / 53771Cyber
23-JulTrump publishes AI Action PlanCompliance
28-JulNoma Security reports ForcedLeak to SalesforceAgents
1-AugAdequacy decisions on Code of Practice (26 signatories)Compliance
2-AugEU AI Act GPAI obligations in applicationCompliance
2-7 AugBlack Hat USA — AgentFlayer (Bargury)Agents/Jailbreak
5-AugClaude Opus 4.1Models
7-AugGPT-5 releaseModels
7-10 AugDEF CON 33 — AIxCC final, Team Atlanta winsOffensive/Defence
AugNVIDIA Triton chain (Wiz) CVE-2025-23319/23320/23334Infra
AugMCPTox preprint (arxiv 2508.14925)Research
9-SepiPhone 17 + iOS 26 with Memory Integrity EnforcementCyber
18-SepATLANTIS paper (arxiv 2509.14589)Research
25-SepForcedLeak in Agentforce CVSS 9.4Agents
25-SepCisco ASA ArcaneDoor CVE-2025-20333/20362Cyber
OctChain-of-Thought Hijacking preprint (arxiv 2510.26418)Research
14-OctWindows 10 end of supportCyber
NovShadowRay 2.0 (Oligo)Infra
12-NovGPT-5.1 (Instant + Thinking)Models
13-NovAnthropic publishes espionage reportAgents/Threat intel
14-NovFortiWeb CVE-2025-64446 zero-dayCyber
17-21 NovMicrosoft Ignite — Entra Agent ID GA, Agent 365Defence
18-NovGemini 3 ProModels
24-NovClaude Opus 4.5Models
25-NovMCP spec 2025-11-25Agents
30-Nov / 4-DecAWS re:Invent — Bedrock AgentCore Policy, AWS Security AgentDefence
DecDeepSeek-V4 open-weightsModels
DecLangChain LangGrinch CVE-2025-68664Infra
DecAnthropic RSP v3Defence

AI security:

Compliance:

Cyber with an AI dimension:

Multi-year synthesis:

Year’s bulletins:

January · February · March · April · May · June · July · August · September · October · November · December

Parallel retrospectives: AI security 2025 — six patterns of the commercial agent year · Cyber 2025 — four cases that explain the year.


Canonical references

Each milestone’s URLs live inline in the corresponding sections. Summary of primary sources for direct consultation:

Back to Blog

Related Posts

View All Posts »
AI security 2025 in review: six patterns from the year of the commercial agent

ai-security · 11 min

AI security 2025 in review: six patterns from the year of the commercial agent

Open-weights reasoning as new default, generalist agents in product, MCP poisoning as mature category, agentic misalignment with reproducible metric, AI Act as real compliance gradient, and reasoning models as consolidated surface. Six patterns with cross-links to the monthly technicals.

· Manuel López Pérez

AI Security 2024 — annual dossier

ai-security · 39 min

AI Security 2024 — annual dossier

Twelve months across ten axes. 2024 is the year AI infrastructure emerged as a category with its own CVEs, agents moved from the lab to product (Claude Computer Use, MCP, Salesforce Agentforce), regulation became applicable (EU AI Act in force 1 August, NIS2 deadline 17 October, NIST AI 600-1), and jailbreaks professionalised with reproducible metrics (ArtPrompt, Many-shot, Skeleton Key). Underneath, Recall shipped without threat modeling and got pulled, Arup lost $25M on a deepfake video call, and the pre-positioning chain of incidents (Volt Typhoon, Salt Typhoon, Storm-0558 fallout) runs through the whole year. Canonical annual reference.

· Manuel López Pérez