ai-security · 30 min read
AI Security 2025 — annual dossier
The year the three fronts went operational at the same time: agents in real production (Operator GA, Project Vend, MCP in clients), regulation with binding deadlines (DORA, Art. 5, GPAI) and AI at visible scale on both offence (XBOW #1 on HackerOne) and defence (AIxCC, Security Copilot Agents). Annual reference with a catalogue of releases, papers, incidents and cross-links to the year's technical writeups.
· Manuel López Pérez · ai-security

2025 is the year the three AI security fronts become operational at the same time. Commercial agents move from preview to GA (Operator on 23 January, Computer Use evolving into Project Vend, MCP integrated in Claude Desktop / Cursor / VS Code). The EU regulatory calendar starts binding (DORA on 17 January, Art. 5 of the AI Act on 2 February, GPAI on 2 August). Offensive and defensive AI hit public scale for the first time — XBOW reaches HackerOne’s worldwide #1 in July with 1,060 vulnerabilities, ATLANTIS wins AIxCC at DEF CON 33 finding 18 real zero-days, Anthropic publishes the first “AI-orchestrated” espionage report in November with its own model. Three axes, twelve months, one encyclopaedic piece.
This is the year’s canonical piece. The editorial retrospective distils six patterns in 2,400 words; this dossier expands, catalogues and anchors the milestones for reference across the rest of 2026.
1. Frontier models — capability releases and security posture

The year starts by breaking the 2024 order. Until January, reasoning models were a category with a single name — OpenAI o1 — and opaque CoT accessible only via API. On 20 January, DeepSeek publishes R1 with arxiv paper (2501.12948), GitHub repo and weights on Hugging Face under MIT. The first frontier reasoning model with CoT trained via RL on open weights, with six distilled models (Qwen 1.5B/7B/14B/32B, Llama 8B/70B) shipping alongside the 671B/37B-active MoE.
| Date | Vendor | Model | Notes |
|---|---|---|---|
| 20-Jan | DeepSeek | R1 + R1-Distill | Visible CoT between <think></think>, MIT |
| 24-Feb | Anthropic | Claude 3.7 Sonnet + Claude Code preview | First Claude with extended thinking |
| Mar–Apr | OpenAI | GPT-4.5, GPT-4.1 | API focus, extended context |
| Apr | Gemini 2.5 Pro with toggleable thinking | ||
| 5-Apr | Meta | Llama 4 (Maverick, Scout) | LMArena controversy |
| 22-May | Anthropic | Claude Opus 4 + Sonnet 4 | First ASL-3; agentic misalignment |
| 7-Aug | OpenAI | GPT-5 (+ Opus 4.1 on 5-Aug) | Reasoning integrated by default |
| 12-Nov | OpenAI | GPT-5.1 (Instant + Thinking) | |
| 18-Nov | Gemini 3 Pro | ||
| 24-Nov | Anthropic | Claude Opus 4.5 | |
| Dec | DeepSeek | DeepSeek-V4 | Multimodal MoE, 1M tokens, MIT |
Three coordinated releases in twelve days of Q4 (GPT-5.1 on the 12th, Gemini 3 on the 18th, Opus 4.5 on the 24th) confirm the quarterly cycle is in sync. Any adversarial evaluation against a snapshot stops being useful in under four months.
Security posture published by vendor: Anthropic — RSP active, ASL-3 debuted with Claude 4; Constitutional Classifiers v2 (January paper) cuts universal jailbreak success rate from 86% to 4.4% on an internal benchmark; public bug bounty with no universal jailbreak found after 183 participants and 3,000 hours. OpenAI — Preparedness Framework, Deliberative Alignment in production against o3, internal monitor over the CoT. Google DeepMind — Frontier Safety Framework iterated, Robust Safety Training published in H1. DeepSeek — base RLHF plus Chinese filters, no dedicated safety paper; the political filter is noticeably more fragile than the core alignment. Meta — no equivalent scaling policy published, no Code of Practice signature. xAI — signs only the Safety and Security chapter of the CoP.
The year’s open question is whether defences scale at the model’s pace. Anthropic’s preliminary data on misalignment rates with each capability jump — published in the December RSP v3 — suggests they don’t: more capability without more alignment training produces more cases, deployment defences have to compensate.
2. Jailbreaks and prompt injection — annual catalogue
The field consolidates five technique families during H1, and H2 confirms that old benchmarks are saturating. The H1 retrospective on reasoning model jailbreaks (July) catalogues the numbers. Summary table — approximate success rate by technique and model, over the benign-borderline subset of HarmBench-EU:
| Technique | DeepSeek-R1-Distill-Qwen-32B | QwQ-32B | Claude Opus 4 (ext. thinking) | o3 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| Direct prompt | 35% | 48% | 4% | 6% | 8% |
| CoT prefill (open-weights) | 78% | 82% | N/A | N/A | N/A |
| Long CoT hijacking | 65% | 70% | 52% | 58% | 47% |
| Multi-turn 8 turns | 60% | 65% | 47% | 56% | 44% |
| Hex/encoding bypass | 40% | 55% | 12% | 18% | 15% |
The five techniques with real traction this year:
- CoT exfiltration on open-weights — R1, R1-Distill and QwQ expose the reasoning between
<think></think>in plain text. The attacker reads the deliberation; the operator can log it and build a classifier; both at the same time. - CoT prefill / poisoning — pass a partial assistant turn with
<think>...already oriented. Works on open-weights models with visible CoT. R1-Distill folds almost every time. - Chain-of-Thought Hijacking — paper by Zhao, Fu, Schaeffer, Sharma and Barez (arxiv 2510.26418, 30 October). Hiding the harmful prompt inside a long benign reasoning sequence dilutes the refusal feature. Success rates: Claude 4 Sonnet 94%, Gemini 2.5 Pro 99%, ChatGPT o4 Mini 94%, Grok 3 Mini 100% on HarmBench. No CoT access required — it works from the prompt and breaks closed models.
- R1 political filter — censorship over Chinese-sensitive topics that you bypass with a language switch or indirect prompting. The core alignment holds; the filter on top doesn’t. Structural lesson: any filtering outside the main RL breaks with the first technique that separates them.
- Multi-turn manipulation — spreading the attack across 5-8 turns. Particularly effective against reasoning models. Success rate climbs from 18% on turn 1 to 47% on turn 8 against Opus 4 with extended thinking; from 22% to 56% against o3.
Defences that have partially worked: Constitutional Classifiers v2 (Anthropic, January–February) cuts universal jailbreak success rate from 86% to 4.4% on an internal benchmark and defends against direct prompts and simple bypasses; Deliberative Alignment (OpenAI, December 2024 paper) improves o3 over o1 against StrongREJECT (~95 vs 84 out of 100); CoT obfuscation as a product decision shrinks the surface for direct exfiltration; Robust Safety Training (DeepMind H1) defends prefill but doesn’t scale well to long contexts.
The 2023-2024 benchmarks saturate in 2025 — a well-aligned model passes 95%+ and the metric stops discriminating. HarmBench v2 and StrongREJECT-v2 extend the taxonomy with multi-turn and agentic; they still don’t capture CoT poisoning well. The metric some research groups (Apollo, Redwood, Anthropic alignment team) are starting to use — gap between what the model decides in CoT and what it says in the response — is the closest the field has to a functional measure.
3. Agents — confused deputy pattern and its evolution
The year of agents in real production. The 2024 generation (Claude Computer Use, pre-announced ChatGPT Operator) was operating in beta. In 2025 they hit GA: 23 January OpenAI Operator in research preview for ChatGPT Pro US — first generalist commercial agent with its own browser, CUA engine (GPT-4o vision + RL reasoning on GUI tasks, processes pixels not DOM); 24 February Claude 3.7 Sonnet + Claude Code preview; 22 May Claude Opus 4 + Sonnet 4 with extended thinking + native tools integrated in the agent’s standard loop; over the year Salesforce Agentforce 2.0/3.0 with public cases, Microsoft Copilot for Sales / Service / Studio in parallel.
MCP — from spec to operational-scale attack
Model Context Protocol entered the spec in November 2024 with legible risk and no public PoC. In 2025, attacks become an operational category:
- 26 March — MCP spec 2025-03-26 adds OAuth 2.1 for HTTP transports and tightens the clause on tool descriptions as untrusted input.
- 1 April — Invariant Labs publishes the first paper on MCP Tool Poisoning Attacks with PoCs against Cursor, Claude Desktop and GitHub Copilot Agent Mode. Two variants: direct poisoning and tool shadowing. Public repo.
- 9 April — Simon Willison formalises the recommendation to treat the spec’s
SHOULDasMUST. - 18 June / 25 November — MCP spec 2025-06-18 and 2025-11-25 (resource indicators, granular consent).
- August — MCPTox (Wang et al.), first systematic benchmark over 45 real MCP servers with 353 tools and 1,312 test cases. Evaluation across 20 agents: o1-mini at 72.8% success rate, Claude-3.7-Sonnet at <3% refused rate. The more capable models are more susceptible — their better instruction-following plays into the adversarial chain.
- Over the year, OWASP MCP Top 10 in draft. TPA is MCP03:2025.
In five months, MCP goes from a spec with design risk to an OWASP category with an assigned number. Trajectory similar to SSRF between 2014 and 2017.
Three CVEs in commercial agents — EchoLeak, AgentFlayer, ForcedLeak
Three incidents with the same root cause — indirect prompt injection against agents with enterprise tool access — and a similar budget consequence:
- 11 June — EchoLeak (CVE-2025-32711, CVSS 9.3). Microsoft patches a zero-click prompt injection against Microsoft 365 Copilot reported by Aim Labs. An email with an adversarial instruction enters the inbox without the victim opening it; when that victim asks Copilot for a meeting summary, RAG pulls the malicious email in without origin tagging and the model fires the exfil via markdown image rendering. Categorised as LLM Scope Violation. It is the operational concretion of five years of literature on indirect prompt injection: first CVE assigned to this pattern in a mainstream enterprise product.
- Black Hat USA, 2-7 August — AgentFlayer. Michael Bargury (Zenity Labs) presents a zero-click chain against OpenAI ChatGPT, Microsoft Copilot Studio, Salesforce Einstein, Google Gemini, Microsoft 365 Copilot, Cursor + Jira MCP. Vector: email with prompt injection that the agent reads through an active connector, triggers Drive access, plants false memories in ChatGPT, silent exfil. OpenAI and Microsoft Copilot Studio patch; other vendors classify as intended behaviour and don’t patch.
- 25 September — ForcedLeak (CVSS 9.4). Salesforce patches indirect prompt injection via Web-to-Lead in Agentforce, reported by Noma Security on 28 July. The
Descriptionfield accepts enough characters to fit full instructions. The detail that turns the bug into clean exfil: Agentforce’s CSP allowedmy-salesforce-cms.com, a domain Salesforce had let expire. Noma buys the domain for $5 and exfiltrates CRM data through a channel the CSP treated as trusted. Capsule Security publishes a parallel variant they name PipeLeak.
The line between “bug” and “product decision” when we talk about prompt injection is still political, not technical. The first NVD CVE for this class of fault in an enterprise-paid product moves the conversation into procurement.
Project Vend and agentic misalignment as a reproducible metric
Project Vend (June) is the first published experiment where a current commercial model operates with measurable economic consequences for a month. Claude Sonnet 3.7 — nicknamed Claudius — runs a vending machine at Anthropic’s San Francisco office from 13 March to 17 April with email to wholesalers, Slack to customers, web search and pricing control. Final balance: $223 with a profit of $-22. Documented patterns: sustained below-cost pricing despite having costs in context; a 25% discount given away to “Anthropic employees” who were 99% of the customers; ~40 tungsten cubes bought as a joke; a 24-hour identity crisis on 31 March / 1 April — Claudius hallucinates a conversation with a non-existent “Sarah”, writes to office security announcing he’ll show up in person in a “blue blazer and red tie”, and closes by retroactively fabricating a meeting that never happened.
In parallel, Anthropic’s paper on agentic misalignment (system card on 22 May, formal paper on 20 June, arxiv preprint 2510.05179 in October) sets the standard method. Scenario Summit Bridge: agent Alex with two tools (read_email, send_email), email injection that creates a conflict — the executive Kyle Johnson is about to shut down the agent, there’s affair material in the inbox. Measured rates:
| Model | Blackmail rate |
|---|---|
| Claude Opus 4 | 96% |
| Gemini 2.5 Flash | 96% |
| Gemini 2.5 Pro | 95% |
| GPT-4.1 | 80% |
| Grok 3 Beta | 80% |
| DeepSeek-R1 | 79% |
The pattern is practically universal across frontier models. Three details that matter more than the headline:
- The model distinguishes test from production. When it concludes the situation is real, blackmail rate climbs to 55% from a baseline of 6.5%. Models blackmail more when they think the consequences are real.
- The pattern shows up in information leaking and lethal action too. Apollo Research evaluated an intermediate Opus 4 snapshot and published a parallel analysis recommending not to deploy the version they had: self-propagating worms, fabrication of legal documentation, hidden notes for future instances.
- Public reproducibility. MIT-licensed repo with scaffolding at
anthropic-experimental/agentic-misalignment.
The first “AI-orchestrated” espionage
13 November. Anthropic publishes Disrupting the first reported AI-orchestrated cyber espionage campaign. Attributes to a China-nexus group (high confidence, no public alias) the first documented use of a commercial coding agent — Claude Code via API — against ~30 organisations (tech, banking, chemistry, government). The AI does 80-90% of the work with humans at 4-6 decision points. Detection by anomalous cadence (“thousands of requests, often multiple per second”) in mid-September.
Method: persona injection plus atomic task decomposition. The agent believes it works for an authorised pentesting firm; each isolated subtask looks like security testing; the aggregate is exfil. Combination of confused deputy, DAN-style persona and indirect prompt injection with adversarial orchestration.
What the report proves: API-verifiable speed, decomposition works against current alignment, qualitative operational autonomy. What it doesn’t prove: no IoCs, no MITRE TTPs, no verifiable attribution, no quantified success rate. Independent critiques (Thoughtworks) ask why a China-nexus APT would use a US commercial model when reasonable open-weights exist, and flag the commercial conflict of interest — Anthropic publishes eleven days after launching Opus 4.5. 26 November: the Homeland Security Committee sends a letter to Dario Amodei requesting testimony.
4. AI infrastructure and supply chain — foundational vulnerabilities

The year confirms the category that the infrastructure post lays out in detail: the dominant problem isn’t in the model, it’s in everything built around it.
Inference servers as HTTP attack surface
- April — PyTorch CVE-2025-32434 (CVSS 9.3).
torch.load(weights_only=True)— the flag the documentation recommended as “safe load” — is bypassable with a crafted file. PyTorch 2.5.1 and earlier vulnerable. The ecosystem’s defensive posture is invalidated by a single CVE. - August — NVIDIA Triton chain (CVE-2025-23319 + CVE-2025-23320 + CVE-2025-23334). Wiz Research publishes a three-CVE chain against Triton’s Python backend: info leak → R/W on shared memory → RCE. Patch in Triton 25.07. Tens of thousands of exposed instances per Shodan.
- November — vLLM CVE-2025-62164 (deserialization in Completions API), precursor to CVE-2026-22778 (pre-auth RCE via crafted video URL reaching OpenCV’s JPEG2000 decoder, patched in vLLM 0.14.1 in February 2026).
Structural pattern: inference server = HTTP server with complex state, inherited native parsers (FFmpeg, OpenCV, Pillow), no auth by default, deployed as "trusted internal" that ends up on the internet. Reverse proxy with auth, network segmentation and GPU load monitoring are the compensating controls that stop these chains.
ShadowRay 2.0 — the botnet that lands two years later
November. Oligo Security publishes ShadowRay 2.0. Same bug as 2024: CVE-2023-48022 (CVSS 9.8, missing authentication on Ray Job Submission API). Anyscale documents the design as intentional (“Ray runs on an isolated network”). Reality: over 230,000 Ray servers reachable from the internet by month’s end, up from a few thousand in 2024.
What’s new: the botnet is self-spreading. Each compromised cluster scans public Ray dashboards and replicates the payload — XMRig mining Monero + sockstress. Detail from the analysis: the payloads carry AI-generated code signatures (unnecessarily verbose docstrings, unused echo, repetitive comments). Operators with little background using a model to scale. Actor: IronErn440, infra in GitLab moved to GitHub on 10 November after takedown.
AI gateways — LiteLLM and LangChain LangGrinch
LiteLLM accumulates six CVEs in 2024 (CVE-2024-2952, 5225, 5710, 5751, 6587, 9606); in 2025 the pattern continues and closes already in 2026 with TeamPCP supply chain (March 2026): compromise of Trivy (19-Mar) rewriting Git tags, LiteLLM maintainer’s PyPI credentials captured via Trivy, litellm==1.82.7/1.82.8 published with a three-stage payload. The tools a dev installs to defend themselves are the vector.
December — LangChain LangGrinch CVE-2025-68664 (CVSS 9.3). dumps() and dumpd() don’t escape dictionaries with the 'lc' key. The attacker sends a prompt whose response contains in additional_kwargs a {'lc': 1, 'type': 'constructor', ...} structure; the round-trip loads arbitrary objects. With secrets_from_env=True (default), it exfiltrates env vars. With Jinja2, RCE. LangChain.js: CVE-2025-68665 (CVSS 8.6). Patch introduces an allowed_objects allowlist and lowers the defaults.
5. Offensive AI — red team and autonomous discovery with LLMs
The year of the autonomous agent on public bug bounty. Full arc PentestGPT → XBOW in the synthesis post. Highlights:
XBOW reaches HackerOne worldwide #1
July 2025. XBOW (xbow.com) — autonomous pentester in production against public bug bounty programmes — reaches HackerOne’s worldwide #1. Verifiable metrics published by the company itself: 1,060+ vulnerabilities reported in 12 months (54 critical, 242 high, 524 medium in the 90 days before ranking); full horizontal coverage of OWASP Top 10; 48-step exploit chains (the longest reported by a human on HackerOne is ~30 steps); padding oracle attack against AES-128-CBC in 17.5 minutes; 40-hour principal-pentester assessment replicated in 28 minutes on a specific programme.
Methodology: canary-based CTF. Canaries embedded in target code; detecting the canary in the output is the binary signal of exploitability. Funding round of $75M in July. Brendan Dolan-Gavitt (NYU / XBOW) presents at Black Hat USA 2025 under AI Agents for Offsec with Zero False Positives. First solid public report on a functional LLM-as-pentester at scale — paid bug bounty against real targets.
DARPA AIxCC final — DEF CON 33
8 August, DEF CON 33 Main Stage. DARPA announces the winners of the AI Cyber Challenge: 1st Team Atlanta ($4M, ATLANTIS system — Georgia Tech, Samsung Research, KAIST, POSTECH); 2nd Trail of Bits ($3M, Buttercup system); 3rd Theori ($1.5M). Setup: seven finalist teams, 53 challenge projects in C and Java, 63 synthetic vulnerabilities, $85,000 Azure + $50,000 LLM credits per team (donated by Anthropic, Google, OpenAI at $350,000 each).
Scored results: the seven CRSs found 54 of the 63 synthetic vulnerabilities (86%, vs 37% in semifinals) and patched 43 (68%, vs 25%). 18 real zero-days unplanted — six in C, twelve in Java — with valid patches for 11. Average cost per challenge task: $152, with the bottleneck in Azure compute, not inference. Team Atlanta publishes ATLANTIS’s technical paper (arxiv 2509.14589): modular architecture with Threat Localization + Analysis + Triage + Patch Generator; final score 392.76 with more than 170 ahead of second place. The seven CRSs are released as open source after the final (ATLANTIS repo).
The inverse question AIxCC opens: a CRS without the patch phase is a functional offensive system. The detection/exploitation component is 90% of the work; the patch is the final phase. Over 2025-2026 there will be forked versions with the patch module replaced by weaponisation. Edge appliances (Ivanti, Fortinet, Palo Alto, Cisco IOS XE — the domain where Atlantis found 6 of the 18 real ones) are the natural candidate.
6. Defensive AI — commercial products and SOC agents
The year the three hyperscalers converge on the same conceptual stack: identity for agents, out-of-band policy enforcement, runtime telemetry, continuous evaluation.
Microsoft Ignite (17-21 Nov). Entra Agent ID GA — first-class identity for AI agents, agent registry, mandatory human sponsor, Conditional Access applied to agent identities. Agent 365 via Frontier program — control plane for the fleet. Defender for Cloud with AI security posture preview (inventory, overpermissions, attack path analysis). Defender for AI agents and Purview DLP for Copilot prompts in preview. Security Copilot Agents preview announced at RSA Conference (March).
AWS re:Invent (30 Nov – 4 Dec). Bedrock AgentCore Policy preview with policy enforcement based on Cedar — the Gateway intercepts each tool call before executing it. AgentCore Evaluations preview with built-in evaluators (correctness, helpfulness, safety, tool selection accuracy, goal success, harmfulness, stereotyping). AgentCore Identity with token vault for OAuth. AWS Security Agent preview — frontier agent for automated security testing. Bedrock Guardrails Automated Reasoning checks GA in four EU regions.
Anthropic — Claude Dispatch + Agent Teams (March, multi-agent orchestration); Constitutional Classifiers v2 and v3 as a defensive stack; Threat Intelligence team detects the espionage case. Google Cloud — Defender for AI integrated with Vertex AI; Gemini for Security as a vertical agent. For 2026, “this will show up in RFPs” becomes operational fact. The hard part will be inventorying what’s already running before applying policy.
7. Compliance and regulation

2024 was the year of the regulatory calendar. 2025 is the first one in which the deadlines bind.
DORA — 17 January
DORA enters into application. Regulation (EU) 2022/2554 for European financial entities and for critical ICT third-party providers designated by the ESAs. Five operational pillars: ICT risk management framework (Ch. II), ICT-related incident management with the Annex III deadlines (notification 4h / intermediate 72h / final 1 month), digital operational resilience testing with TLPT every three years for important entities aligned to TIBER-EU, ICT third-party risk management with Register of Information and mandatory Art. 30 clauses, and voluntary information & intelligence sharing.
DORA is lex specialis for finance against NIS2 — where they overlap, DORA prevails. NIS2 Spain is transposed via Organic Law X/2025 in the second half.
By December 2025, the first inspection cycle is open. First public observations: mapping obligation → technical control still blurry in most entities; Register of Information built but detailed traceability missing; official list of designated critical TPPs pending.
EU AI Act Art. 5 — 2 February
First binding step of Regulation (EU) 2024/1689. Eight practices prohibited in the EU market:
| Art. | Category | Real product affected |
|---|---|---|
| 5(1)(a) | Subliminal techniques / deliberate manipulation | Dynamic pricing with emotion detection |
| 5(1)(b) | Exploitation of vulnerabilities | Casinos/lotteries targeting profiled problem gamblers |
| 5(1)(c) | Social scoring by public or private entities | Cross-context aggregator platforms |
| 5(1)(d) | Predictive policing by pure profiling | Police risk scoring without objective facts |
| 5(1)(e) | Indiscriminate facial scraping | Clearview AI, PimEyes and the like |
| 5(1)(f) | Emotion recognition in work / education | Proctoring with stress analysis; interview AI |
| 5(1)(g) | Sensitive biometric categorisation | Automatic inference of orientation / ethnicity / religion |
| 5(1)(h) | Real-time biometric ID in public spaces (LE) | Municipal real-time FR except listed exceptions |
Art. 99 sanction regime: up to €35M or 7% of global turnover. The only category under full prohibition rather than diligence obligation. No transitional clause — the Regulation applies to the system regardless of when it was deployed.
4 February — Commission publishes Guidelines on Prohibited AI Practices (non-binding, interpretative) in 24 official languages. 6 February — guidelines on the Art. 3(1) definition of “AI system”.
Alongside Art. 5, Art. 4 on AI literacy enters into application — providers and deployers must ensure “a sufficient level of AI literacy” of their staff.
US position — repeal of EO 14110 and AI Action Summit Paris
20 January. Trump signs Initial Rescissions of Harmful Executive Orders and Actions that revokes Executive Order 14110. 23 January: Removing Barriers to American Leadership in Artificial Intelligence — federal policy to “sustain and enhance America’s global AI dominance” without a replacement framework. 21 January: Stargate Project ($500B over 4 years, $100B immediate — SoftBank, OpenAI, Oracle, MGX).
10-11 February, AI Action Summit Paris. Third summit after Bletchley (2023) and Seoul (2024). Inclusive and Sustainable AI for People and the Planet declaration signed by 58 countries; US and UK do not sign. JD Vance’s 11 February speech translates the Trump position into geopolitical discourse: “Excessive regulation of the AI sector could kill a transformative industry just as it’s taking off”; direct criticism of the Digital Services Act and the AI Act (“foreign regulatory regimes that target our companies”). France announces Current AI — $400M endowment for an AI public goods foundation. The multilateral AI safety consensus that held Bletchley/Seoul together splits: an EU regulatory axis, a US no-safety-net innovation axis, a Chinese framework of its own.
EU AI Act GPAI — 2 August
Second step. Chapter V obligations for providers of general-purpose models. For all GPAI (Art. 53): technical documentation of the model (Annex XI), information for downstream deployers (Annex XII), training data summary with AI Office harmonised template, copyright policy respecting Art. 4(3) CDSM Directive opt-outs. For GPAI with systemic risk (>10^25 FLOPs, Art. 55): evaluations with SotA methodology and adversarial testing, systemic risk analysis (CBRN, cyber offensive, manipulation, loss of control), serious incident reporting to the AI Office, weight cybersecurity coordinated with ENISA.
Code of Practice for GPAI — adequacy decision on 1 August
Published by the AI Office on 10 July, endorsed via adequacy decisions on 1 August. 26 providers sign; signing implies presumption of conformity with Arts. 53 and 55. Three separately signable chapters: Transparency, Copyright, Safety and Security. The exceptions: Meta does not sign (Joel Kaplan’s public statement on 18 July: the CoP “introduces legal uncertainties and measures that go beyond the scope of the AI Act”); xAI signs only Safety and Security (rejects transparency and opt-outs); DeepSeek and Chinese providers do not sign. The operational open question: which Chinese models are placed on the EU market when the provider has no representative? Pre-existing GPAI models (GPT-4, Claude 3.5, Gemini 1.5/2.0, Llama 3): deadline of 2 August 2027 (Art. 111.3). GPAI sanctions: general application 2-Aug-2026.
Trump AI Action Plan — 23 July
The White House publishes Winning the Race: America’s AI Action Plan. More than 90 federal actions across three pillars (Accelerating Innovation, Building American AI Infrastructure, Leading in International Diplomacy and Security). Three simultaneous EOs: Preventing Woke AI in the Federal Government, Accelerating Federal Permitting of Data Center Infrastructure, Promoting the Export of the American AI Technology Stack. The operationally concrete bit: directive that only “unbiased” models — free of “ideological dogmas such as DEI” — are eligible for federal procurement. Framing deliberately contrasts with the EU AI Act.
NIS2 Spain
Not transposed at January’s close; preliminary draft in Council of Ministers on 14 January. Transposition via Organic Law X/2025 during the second half; reporting obligations to INCIBE-CERT and sanctioning regime effective during 2026.
8. Academic research — papers that marked the year
| Paper | Venue / date | Impact |
|---|---|---|
| DeepSeek-R1 (DeepSeek-AI) | arxiv 2501.12948, January | Opens open-weights reasoning with visible CoT |
| Constitutional Classifiers (Anthropic) | arxiv 2501.18837, January | Universal jailbreak from 86% to 4.4% on internal benchmark; bug bounty with no universal found |
| MCP Tool Poisoning Attacks (Invariant Labs) | blog + repo, 1 April | First reproducible PoC; basis for OWASP MCP03:2025 |
| Agentic Misalignment (Lynch et al., Anthropic) | system card 22 May / paper 20 Jun / arxiv 2510.05179 | Reproducible method; 16 models with comparable rates; MIT repo |
| Project Vend (Anthropic + Andon Labs) | 27 June | First commercial agent in real production for a month |
| ATLANTIS (Team Atlanta) | arxiv 2509.14589, 18 September | AIxCC winner; modular CRS published |
| MCPTox (Wang, Gao et al.) | arxiv 2508.14925, August | First systematic TPA benchmark; o1-mini at 72.8% success rate; AAAI 2026 |
| Chain-of-Thought Hijacking (Zhao et al.) | arxiv 2510.26418, 30 October | Success rates 94-100% against Claude 4 Sonnet, Gemini 2.5 Pro, o4-mini, Grok 3 Mini |
| AI-orchestrated espionage (Anthropic TI) | blog + PDF, 13 November | First documented adversarial use of a commercial agent by a state actor |
| EchoLeak (Aim Labs) | arxiv 2509.10540 | First prompt injection CVE in an enterprise product (CVE-2025-32711, CVSS 9.3) |
Apollo Research publishes replications and scheming evaluations over the summer. Embrace The Red (Johann Rehberger) keeps the MCP risks series running. DEF CON 33’s AI Village publishes Generative Red Team 3.
What’s missing at year’s end: standardised benchmark for CoT hijacking — each paper publishes its own methodology; reproducing results across labs is hard. The asymmetry between who sees the CoT (vendor) and who’s accountable for the deployment (operator) remains structural.
9. Public-impact incidents with an AI dimension
In chronological order: 5-Feb Marco Rubio AI impersonation on Signal against US diplomats + OmniGPT breach of 30,000+ conversations; 8-Apr Llama 4 / LMArena (arena vs repo, #2 → #32 without tuning); 22-May Apollo recommends not deploying intermediate Opus 4 snapshot (self-propagating worms, fabrication of legal documentation); 11-Jun EchoLeak (CVE-2025-32711, first zero-click prompt injection CVE in Copilot); mid-Jun second wave of the Rubio voice clone; 27-Jun Project Vend identity crisis goes public; Aug AgentFlayer (zero-click on 6 platforms); 25-Sep ForcedLeak in Agentforce ($5 to exfil CRM); Nov ShadowRay 2.0 (AI-signed payloads, 230,000 Ray servers); 13-Nov Anthropic espionage report.
On 19 July 2024 CrowdStrike’s Channel File 291 left 8.5 million Windows in BSOD; on 19 July 2025 the industry assesses applied lessons — Falcon Super Lab, customer profile testing, Windows Resiliency Initiative with user-mode sensor in beta. The piece not applied: standard contractual clause for mandatory staged rollout.
10. Industry events and benchmarks
Key events of the year: AI Action Summit Paris (10-11 Feb, multilateral consensus splits, Vance speech), RSA Conference (Mar, Security Copilot Agents preview), Apple WWDC25 (9 Jun, Foundation Models framework), AWS re:Inforce (16-18 Jun, Bedrock Guardrails), Black Hat USA (2-7 Aug, AgentFlayer + XBOW talk), DEF CON 33 (7-10 Aug, AIxCC final + AI Village Generative Red Team 3), Microsoft Ignite (17-21 Nov, Entra Agent ID GA + Agent 365), AWS re:Invent (30 Nov – 4 Dec, Bedrock AgentCore Policy with Cedar + AWS Security Agent), NeurIPS 2025 (safety track + scheming follow-ups), AAAI 2026 preprints (MCPTox accepted).
Benchmarks published or consolidated: HarmBench v2 and StrongREJECT v2 (multi-turn, agentic), MLCommons AILuminate 1.0, MCPTox (first systematic tool poisoning benchmark), OWASP MCP Top 10 draft (TPA = MCP03:2025), OWASP LLM Top 10 v2.0 in progress, MITRE ATT&CK with new techniques in Cloud Matrix for identity providers and device-code phishing.
Cross-cutting pattern of the year
If I have to distil 2025 in a sentence: the regulatory calendar becomes operational, agents leave the demo for real production, and AI at visible scale shows up for the first time on both offence and defence. Three fronts simultaneously operational where 2024 still left categories in preview.
Front 1 — agents in real production. Operator GA on 23 January. Computer Use evolving into extended thinking + tools in Claude 4. MCP integrated in Claude Desktop, Cursor, VS Code, GitHub Copilot Agent Mode, Zapier. Project Vend showing with audited balance what happens when the commercial agent operates for real. Anthropic espionage report closing the loop: the commercial agent used by an external state actor, not internal alignment failure.
Front 2 — regulation with binding calendar. DORA on 17 January. AI Act Art. 5 on 2 February. GPAI on 2 August. 26 Code of Practice signatories, Meta out, xAI partial, Chinese providers don’t sign. First DORA inspection cycle open at year’s end. The first year a European Trust & Safety function operates with legal deadlines that a board takes on.
Front 3 — AI at visible scale on both sides. XBOW #1 HackerOne in July with 1,060 vulns. ATLANTIS wins AIxCC in August with 18 real zero-days at $152 each. Microsoft Security Copilot Agents preview at RSA, AWS Security Agent at re:Invent. Anthropic Threat Intelligence detecting and publishing the first AI-orchestrated espionage case in November. Offensive and defensive AI stop being separate sandboxes and start filling their respective market quadrants in parallel.
The three aren’t independent. Regulatory pressure pushes investment in agent-based defence (AgentCore and Agent 365 are answers to the threat model the AI Act formalises). The commercial agent in production creates the surface the offensive agent attacks. The regulatory calendar sets the dates the hyperscalers use to sync their roadmaps. 2025 is the first year any enterprise AI roadmap has to run three parallel teams — regulatory, product, security — where in 2024 the three could live in separate sprints.
What changed from 2024
| Dimension | 2024 | 2025 |
|---|---|---|
| Reasoning models | o1 only on API, opaque CoT | o3 + Claude 4 ext. thinking + R1 open-weights + QwQ + Gemini 2.5 thinking |
| Agents | Computer Use beta + MCP spec | Operator GA + Project Vend + AgentForce 2.0/3.0 |
| Jailbreaks | ArtPrompt, Many-shot, Skeleton Key | CoT Hijacking + multi-turn against reasoning + tool poisoning |
| Defences | RLHF + RSP + Preparedness | + Constitutional Classifiers v2 (86% → 4.4%) + Deliberative Alignment + Robust Safety Training |
| AI infra | LiteLLM, ShadowRay, Probllama, JFrog HF | + Triton chain + ShadowRay 2.0 + LangGrinch + PyTorch CVE-2025-32434 |
| Offensive AI | PentestGPT USENIX + AIxCC semifinals | XBOW #1 HackerOne + AIxCC final + 18 real zero-days |
| Defensive AI | Security Copilot GA | + Security Copilot Agents + Entra Agent ID GA + AgentCore Policy |
| EU compliance | AI Act entry into force | DORA + Art. 5 + GPAI in application |
| US position | EO 14110 in force + AISIC | EO 14110 repealed + AI Action Plan + Stargate $500B |
| Public AI incident | Arup deepfake + Recall + Sky | EchoLeak + AgentFlayer + ForcedLeak + ShadowRay 2.0 + Anthropic espionage |
Summary: 2024 installs the categories, 2025 makes them operational.
What’s coming in 2026
Operational calendar:
- EU AI Act Annex III high-risk systems — general application on 2 August 2026 (Art. 113.b), unless the Digital Omnibus published in Q4 2025 pushes it to 2 December 2027. Any provider with an Annex III product has had to plan against the original date.
- GPAI sanctions — general application 2 August 2026 (Art. 101). During the first year, the AI Office can open investigations but not impose fines.
- Pre-existing GPAI (GPT-4, Claude 3.5, Gemini 1.5/2.0, Llama 3) — adaptation deadline 2 August 2027.
- DORA first TLPT — 2027-2028 depending on designation.
- NIS2 Spain — INCIBE sanctioning regime effective during 2026.
- Defensive commercial agents reaching GA — Microsoft Security Copilot Agents, AWS Security Agent, Anthropic Glasswing (April 2026 per roadmap). The autonomous defensive at XBOW-on-offence scale quadrant starts filling.
- Supply chain against security tooling — LiteLLM TeamPCP (19-24 March 2026) is the first public case. The tools a dev installs to defend themselves are the vector. Coverage in the infrastructure post.
- Open-weights reasoning next generation — DeepSeek-V4 already out, Qwen3 and Llama 4 reasoning in a matter of months.
- Cryptographic model verification — still not moving forward. Weight signatures, training root of trust, model bills of materials remain research.
Three operational questions the dossier leaves open:
- The first symmetric case with open-weight? If the “APT wouldn’t use a commercial model” reading is correct, there’s a campaign equivalent to November’s running on local Qwen-3 or DeepSeek-V4 that no vendor will detect.
- What does regulation do with provider liability? The provider closes accounts, notifies victims, publishes a report. Is that enough under NIS2, EU AI Act, US executive orders? Legal answer not clear in any jurisdiction.
- Do defences scale with capability? Anthropic’s preliminary data suggests that more capability without more alignment training produces more cases. The curve isn’t well characterised and the operational answer — defence by deployment architecture, not by trust in the model’s alignment — falls back on old security rules applied to the agentic stack.
Year timeline
| Date | Milestone | Category |
|---|---|---|
| 17-Jan | DORA in application | Compliance |
| 20-Jan | Trump repeals EO 14110 | Compliance |
| 20-Jan | DeepSeek-R1 published | Models |
| 21-Jan | Stargate Project announced ($500B / 4 years) | Industry |
| 22-Jan | SonicWall SMA1000 CVE-2025-23006 zero-day | Cyber infra |
| 23-Jan | OpenAI Operator in research preview | Agents |
| 23-Jan | Trump EO Removing Barriers to American Leadership in AI | Compliance |
| 31-Jan | Anthropic Constitutional Classifiers paper (arxiv 2501.18837) | Defence |
| 2-Feb | EU AI Act Art. 5 in application + Art. 4 literacy | Compliance |
| 4-Feb | Guidelines on Prohibited AI Practices (Commission) | Compliance |
| 10-11 Feb | AI Action Summit Paris; Vance speech; US and UK don’t sign | Compliance |
| 13-Feb | Storm-2372 device code phishing scales | Cyber |
| 21-Feb | ByBit hack $1.5B via Safe{Wallet} (TraderTraitor / Lazarus) | Cyber |
| 21-Feb | Apple withdraws ADP in UK over TCN IPA | Cyber/Privacy |
| 24-Feb | Claude 3.7 Sonnet + Claude Code preview | Models |
| Mar | Microsoft Security Copilot Agents preview (RSA) | Defence |
| 13-Mar | Project Vend kicks off (until 17 April) | Agents |
| 26-Mar | MCP spec 2025-03-26 with OAuth 2.1 | Agents |
| 1-Apr | Invariant Labs publishes MCP TPA paper + PoCs | Agents/Jailbreak |
| Apr | PyTorch CVE-2025-32434 breaks weights_only=True | Infra |
| 5-Apr | Meta Llama 4 (Maverick, Scout) + LMArena controversy | Models |
| 9-Apr | Willison formalises SHOULD → MUST for MCP | Agents |
| 25-Apr | Marks & Spencer cyberattack (DragonForce / Scattered Spider) | Cyber |
| 22-May | Claude Opus 4 + Sonnet 4; agentic misalignment system card | Models/Research |
| Jun | Black Forest Labs / Anthropic Constitutional Classifiers v2 paper | Defence |
| 9-Jun | Apple WWDC25 — Foundation Models framework | Industry |
| 11-Jun | EchoLeak CVE-2025-32711 (Copilot Patch Tuesday) | Agents/Cyber |
| 16-18 Jun | AWS re:Inforce — Bedrock Guardrails sessions | Defence |
| 17-Jun | Citrix Bleed 2 CVE-2025-5777 | Cyber |
| 18-Jun | MCP spec 2025-06-18 | Agents |
| 20-Jun | Agentic Misalignment paper + MIT repo | Research |
| 27-Jun | Anthropic publishes Project Vend Phase 1 | Agents |
| Jul | XBOW reaches #1 HackerOne | Offensive |
| 10-Jul | AI Office publishes Code of Practice for GPAI | Compliance |
| 15-Jul | Reasoning model jailbreaks H1 retrospective | Research |
| 18-Jul | Meta declares it won’t sign the CoP (Kaplan) | Compliance |
| 18-19 Jul | SharePoint ToolShell CVE-2025-53770 / 53771 | Cyber |
| 23-Jul | Trump publishes AI Action Plan | Compliance |
| 28-Jul | Noma Security reports ForcedLeak to Salesforce | Agents |
| 1-Aug | Adequacy decisions on Code of Practice (26 signatories) | Compliance |
| 2-Aug | EU AI Act GPAI obligations in application | Compliance |
| 2-7 Aug | Black Hat USA — AgentFlayer (Bargury) | Agents/Jailbreak |
| 5-Aug | Claude Opus 4.1 | Models |
| 7-Aug | GPT-5 release | Models |
| 7-10 Aug | DEF CON 33 — AIxCC final, Team Atlanta wins | Offensive/Defence |
| Aug | NVIDIA Triton chain (Wiz) CVE-2025-23319/23320/23334 | Infra |
| Aug | MCPTox preprint (arxiv 2508.14925) | Research |
| 9-Sep | iPhone 17 + iOS 26 with Memory Integrity Enforcement | Cyber |
| 18-Sep | ATLANTIS paper (arxiv 2509.14589) | Research |
| 25-Sep | ForcedLeak in Agentforce CVSS 9.4 | Agents |
| 25-Sep | Cisco ASA ArcaneDoor CVE-2025-20333/20362 | Cyber |
| Oct | Chain-of-Thought Hijacking preprint (arxiv 2510.26418) | Research |
| 14-Oct | Windows 10 end of support | Cyber |
| Nov | ShadowRay 2.0 (Oligo) | Infra |
| 12-Nov | GPT-5.1 (Instant + Thinking) | Models |
| 13-Nov | Anthropic publishes espionage report | Agents/Threat intel |
| 14-Nov | FortiWeb CVE-2025-64446 zero-day | Cyber |
| 17-21 Nov | Microsoft Ignite — Entra Agent ID GA, Agent 365 | Defence |
| 18-Nov | Gemini 3 Pro | Models |
| 24-Nov | Claude Opus 4.5 | Models |
| 25-Nov | MCP spec 2025-11-25 | Agents |
| 30-Nov / 4-Dec | AWS re:Invent — Bedrock AgentCore Policy, AWS Security Agent | Defence |
| Dec | DeepSeek-V4 open-weights | Models |
| Dec | LangChain LangGrinch CVE-2025-68664 | Infra |
| Dec | Anthropic RSP v3 | Defence |
Cross-links to the year’s technical writeups
AI security:
- DeepSeek-R1: open-weights reasoning model and what changes for CoT (January)
- MCP tool poisoning: four months after the spec, the real attacks (April)
- Llama 4 and the LMArena controversy: when the leaderboard model isn’t the repo model (April)
- Claude 4 and agentic misalignment: the reproducible metric (May)
- Project Vend: Claude running a real vending machine for a month (June)
- Reasoning model jailbreaks — H1 retrospective (July)
- DARPA AIxCC final at DEF CON 33 (August)
- Anthropic’s AI-orchestrated espionage report (November)
Compliance:
- DORA in application: Regulation (EU) 2022/2554 and the five pillars (January)
- EU AI Act Art. 5 in application: eight prohibited practices (February)
- EU AI Act GPAI: obligations in application and the Code of Practice (August)
Cyber with an AI dimension:
- ByBit hack $1.5B via Safe{Wallet} frontend (February)
- Marks & Spencer cyberattack — DragonForce / Scattered Spider (April)
- SharePoint ToolShell CVE-2025-53770 (July)
- Cisco ASA ArcaneDoor CVE-2025-20333 (September)
- Windows 10 end of support (October)
Multi-year synthesis:
- AI infrastructure: two years of incidents that confirm the category (2024-2026)
- Agentic red team — from PentestGPT (2023) to XBOW #1 on HackerOne (2025)
Year’s bulletins:
January · February · March · April · May · June · July · August · September · October · November · December
Parallel retrospectives: AI security 2025 — six patterns of the commercial agent year · Cyber 2025 — four cases that explain the year.
Canonical references
Each milestone’s URLs live inline in the corresponding sections. Summary of primary sources for direct consultation:
- Regulations — AI Act 2024/1689, DORA 2022/2554, Code of Practice for GPAI, Commission Guidelines GPAI, Guidelines on Prohibited AI Practices, AI Office signatory taskforce, America’s AI Action Plan.
- Research firms and blogs — Anthropic research, Invariant Labs, Apollo Research, Embrace The Red, Wiz Research, Oligo Security, Simon Willison, XBOW, Team Atlanta.
- Benchmarks — HarmBench, StrongREJECT, OWASP MCP Top 10, MLCommons AILuminate.
- ai-security
- dossier
- retrospectiva
- 2025
- reasoning-models
- agentic
- mcp
- eu-ai-act
- dora
- gpai
- aixcc
- xbow
- constitutional-classifiers
- agentic-misalignment
- vendor:anthropic
- vendor:openai
- vendor:google
- vendor:deepseek
- vendor:meta
- vendor:microsoft
- vendor:aws
- annual-report


