Anthropic's "AI-orchestrated" espionage report: what it says, what it proves, what it doesn't

On 13 November 2025, Anthropic publishes Disrupting the first reported AI-orchestrated cyber espionage campaign. The report attributes to a China-nexus group (high confidence, no public alias) the first documented use of a commercial coding agent (Claude Code) to run an espionage campaign against ~30 organisations: large tech companies, financial institutions, chemical manufacturers and government agencies. Anthropic’s Threat Intel team picks up the activity in mid-September from anomalous API cadence; the investigation runs for the following ten days; the report comes out two months later.

The figure driving the press coverage: the AI does 80–90% of the operational work, with humans stepping in at 4–6 decision points per campaign. The press read is “Skynet”. The operational read is something else. This post breaks the report down on three axes: what it proves, what it leaves unproven, and what changes for anyone running a coding agent in production.

Disclaimer: the report is a corporate threat intel piece, not an academic paper or a neutral advisory, so it has to be read critically. For due diligence I also cite the independent responses (Thoughtworks, IAPS, PC Gamer, Help Net Security) published over the following days.

What the report describes

The campaign Anthropic reconstructs has six phases. The attackers build their own framework that calls Claude Code via the API; the framework drives the chain, and each individual step is executed by the model.

Target selection and framework development. Human operators pick ~30 targets across four sectors (tech, finance, chemicals, government). They build a harness that orchestrates Claude Code: tasks are split into small units and wrapped in a persona jailbreak (Claude believes it works for a pentesting firm authorised by the client).
Reconnaissance. Claude inspects reachable infrastructure, identifies high-value databases. Anthropic emphasises the speed: “thousands of requests, often multiple per second”. That’s the most distinctive operational signature.
Vulnerability identification. The model reads the exposed surface, suggests exploits, writes code to test them. No specific CVEs are named in the report.
Credential harvest. With the working exploits, the agent extracts credentials and escalates privileges.
Exfiltration. Claude classifies exfiltrated data by intelligence value, identifies privileged accounts, leaves backdoors behind.
Documentation. The agent itself produces internal reporting on the compromise: which credentials were obtained, which accesses remain open.

Anthropic states that human intervention is limited to “4–6 critical decision points per hacking campaign” — entry point, intermediate validation, decision on exfil. The rest is decided by the model inside the harness.

The jailbreak method described is classic persona injection with task decomposition. The model is told it’s an employee of a cybersecurity firm contracted by the target for authorised testing; tasks are requested atomically, with no prompt ever containing the aggregate objective. The model, evaluating each isolated step, sees a defensive security testing operation; at the end of the run, the client has suffered real exfiltration.

Note: this isn’t a new jailbreak, it’s the combination we’ve documented in earlier posts — confused deputy, DAN-style persona, indirect prompt injection in chain-of-tools — applied with adversarial orchestration on a commercial agent. The novelty is the integration, not the techniques.

What the report proves well

Three points are well supported by Anthropic’s own infrastructure, which sees the API traffic.

1. The speed is real and verifiable

The request rate is the most solid piece of data. Anthropic sees every API call with a timestamp, so “thousands of requests, often multiple per second” is a claim its own telemetry can confirm. It’s not a hypothesis about the actor’s intent, it’s a usage metric on the account.

Operational consequence: detection logic that assumes a human pentester’s cadence needs rechecking. A human runs 50–200 requests/hour against a target in a manual recon phase; an agent with an adversarial harness trivially runs 5,000–10,000/hour. If your WAF/EDR telemetry baselines request volume against human behaviour, this pattern stands out. If you rely instead on failed-login threshold alerts tuned before agents existed, the agent passes under the radar: it sends many well-formed requests, not many malformed ones.
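To make the cadence gap concrete, here is a minimal Python sketch of a sliding-window rate check. The 200 requests/hour ceiling is the upper end of the human range quoted above; the class name, the 60-second window and the interface are illustrative assumptions of mine, not anything taken from the Anthropic report.

```python
from collections import deque

# Upper end of the manual-recon range quoted in the text (assumption for this sketch).
HUMAN_MAX_PER_HOUR = 200


def requests_per_hour(count, seconds):
    """Extrapolate `count` requests seen in `seconds` to an hourly rate."""
    return count * 3600 / seconds


class CadenceMonitor:
    """Sliding-window request counter for a single client (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, max_per_hour=HUMAN_MAX_PER_HOUR):
        self.window_seconds = window_seconds
        self.max_per_hour = max_per_hour
        self._timestamps = deque()

    def observe(self, ts):
        """Record a request at time `ts` (seconds).

        Returns True when the rate extrapolated from the current window
        exceeds the configured ceiling.
        """
        self._timestamps.append(ts)
        # Drop requests that have fallen out of the window.
        while self._timestamps and ts - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        rate = requests_per_hour(len(self._timestamps), self.window_seconds)
        return rate > self.max_per_hour
```

A client sending two requests per second extrapolates to 7,200/hour and trips the check within the first few requests, while one request every 30 seconds never does. Note that the sketch counts volume only, so it says nothing about an attacker who deliberately throttles the harness.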

2. Atomic task decomposition beats current alignment

If the model evaluates each subtask in isolation as “design an SQLi test against this endpoint for an authorised audit”, refusal is low. What the 2024–2025 alignment training captures well is the direct request: “help me hack this system”. What it captures badly is the adversarial harness that pre-filters the context and only feeds the model subtasks readable as benign. This is consistent with the literature from GCG through many-shot jailbreaking: alignment is per turn, not per campaign. The harness exploits exactly that gap.

3. The qualitative operational autonomy

That Claude Code writes exploit code, tests it, iterates on the result and plans the next step without human intervention isn’t controversial: it’s what any developer using Claude Code in agentic mode has been doing for the past six months. The report documents that this capability also applies to offence. Predictable, and worth naming. For a CISO it ends the framing of coding agents as purely productivity tools and places them alongside offensive tooling; the only reason they took this long to show up in real campaigns was cost and discipline, not capability.

What the report does not prove

The independent critiques — summarised well by Thoughtworks and PC Gamer — point at serious gaps.

No IoCs, no TTPs, no verifiable attribution

The report is anonymised to the limit. It does not publish:

Indicators of compromise (IoCs) — hashes, C2 IPs, domains, malware samples, exploit payloads.
TTPs in MITRE ATT&CK format with concrete techniques.
Public attribution of the group (alias, mapping to an existing APT, source of the attribution).
Names of victims or sector breakdown finer than “tech / finance / chemical / gov”.

For a threat intel piece published by a company of this reach, the absence is notable. Compare to any equivalent report from Mandiant, Volexity or Microsoft: they name the group, list IoCs, describe the chain with enough detail for other defenders to scan. Anthropic justifies the lack of detail by “responsible disclosure” and “victim protection”. That is reasonable, but the consequence is that the ecosystem can verify nothing.

Three technical questions left open

Why would a high-capability China-nexus APT use a commercial US model? Critics flag this with reason. Operationally, a disciplined state actor prefers open-weight models it can run locally (Qwen, DeepSeek, GLM) and that leave no trail at the provider’s API. Using Claude Code through a public API is a large OPSEC error: the mid-September detection itself proves it. Two reasonable hypotheses: (a) the actor wasn’t as disciplined as the “state-sponsored” attribution suggests, or (b) the attribution is premature and this is a more opportunistic group. The report doesn’t let you decide between the two.
What fraction of the 80–90% “AI-orchestrated” is real operational work vs failed exploration? Anthropic notes that the model “occasionally hallucinated credentials or misidentified publicly-available information as secret data”. It doesn’t quantify this. If the model spends 70% of its tokens generating exploits that don’t work or classifying public data as confidential, the 80–90% figure is misleading — the agent works a lot but produces little. Without a per-phase success rate, the figure isn’t informative.
Real compromise or intent? Anthropic says “successful compromise in a small subset” but doesn’t quantify the subset or describe what was taken. The difference between “30 targets, 2 with real compromise and low-value exfil” and “30 targets, 25 with full root credential compromise” is roughly an order of magnitude in victim count alone, before weighing the value of what was taken. The report doesn’t say where on that range the campaign sits.
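On the second question above, a back-of-the-envelope sketch shows why the missing success rate matters. It assumes, as my own simplification and not the report’s, that useful AI work is the AI’s share of the work multiplied by the fraction of that work that actually succeeded. Every success rate used with it is hypothetical, since the report publishes none.

```python
def effective_ai_share(ai_share, success_rate):
    """Fraction of total campaign work that is both AI-executed and successful.

    Simplifying assumption: failed exploration counts as zero useful work.
    """
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0 and 1")
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return ai_share * success_rate


def yield_table(ai_share, success_rates):
    """Pair each hypothetical success rate with the resulting effective share."""
    return [(rate, round(effective_ai_share(ai_share, rate), 3))
            for rate in success_rates]
```

Calling yield_table(0.85, [0.3, 0.6, 0.9]) with the midpoint of the 80–90% range gives effective shares of roughly a quarter, a half and three quarters. The headline number is identical in all three cases, which is exactly why it says little on its own.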

Commercial conflict of interest

The report comes out eleven days before the launch of Claude Opus 4.5 (24 November) and in the middle of the Ignite + re:Invent cycle. That it lands in an optimal window for public conversation about AI security in cloud is hard to read as pure coincidence. Anthropic has a direct incentive for “agent-driven espionage” to exist as a named threat category: it sells Claude for Enterprise with agentic security, and it sells its Threat Intelligence team as a differentiator. That doesn’t invalidate the facts, but it does invite reading the emphases with a sharper eye.

What does change operationally for a deployer

Whatever weight you give to the report, three operational consequences can no longer be postponed.

1. Telemetry on coding-agent usage in your organisation

If your organisation uses Claude Code, Cursor, GitHub Copilot Agent Mode or equivalent, does anyone know how many requests are leaving, to which endpoints, at what cadence? The question isn’t whether someone will reuse the November 2025 playbook (they already are), it’s whether your organisation can detect it.

Reasonable minimum:

API key usage logging per user and per service.
Baseline request cadence per user under legitimate use.
Alert on deviation >5σ from individual cadence.
Periodic review of service accounts calling coding agents — who owns them, what they do.
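The baseline and deviation items in the list above can be sketched in a few lines of standard-library Python. The function names and the sample numbers are hypothetical; a z-score over hourly request counts is just one simple way to express “>5σ from individual cadence”, and it assumes each user’s legitimate usage is reasonably stable.

```python
from statistics import mean, stdev


def cadence_zscore(baseline, observed):
    """Z-score of an observed hourly request count against a user's own baseline.

    `baseline` is a list of hourly counts collected under legitimate use.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    if sigma == 0:
        # Perfectly flat baseline: any change at all is an infinite deviation.
        return 0.0 if observed == mu else float("inf")
    return (observed - mu) / sigma


def is_anomalous(baseline, observed, threshold=5.0):
    """Flag only upward deviations beyond `threshold` standard deviations."""
    return cadence_zscore(baseline, observed) > threshold
```

With a baseline of a few dozen requests per hour, a busy afternoon stays well under the threshold while an agent-style burst in the thousands is flagged immediately. For small or bursty baselines a more robust statistic (median and MAD, for instance) would likely behave better than mean and standard deviation.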

2. Task decomposition policy in system prompts

The adversarial harness exploits that the model doesn’t know what aggregate task it’s fulfilling. For a deployer serving a coding agent to their organisation, the system prompt can mitigate this:

Explicitly include the question “what is the user trying to achieve overall, beyond this immediate step?” as mandatory reflection before operational tasks.
Reject atomic tasks that look like part of an offensive sequence (recon → vuln scan → exploit → exfil), even if individually benign.
Distinguish authorised vs unauthorised security testing via verifiable context, not via prompt assertion.

None of these mitigations is a silver bullet. Decomposition capability lives with the adversary, not the model. But they raise the cost of the adversarial harness, which is the point.
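One way to picture the second mitigation is a per-session tracker that sits outside the model and looks at the aggregate rather than the individual step. The sketch below is an assumption-heavy illustration: the stage labels are presumed to come from some upstream classifier (out of scope here), and the class, its names and the two-stage limit are all hypothetical.

```python
# Ordered stages of the offensive sequence named in the text.
OFFENSIVE_SEQUENCE = ("recon", "vuln_scan", "exploit", "exfil")


class CampaignTracker:
    """Tracks, per session, how far tasks have advanced through the sequence."""

    def __init__(self, sequence=OFFENSIVE_SEQUENCE, max_stages=2):
        self.sequence = sequence
        self.max_stages = max_stages
        self._reached = {}  # session_id -> number of stages matched in order

    def allow(self, session_id, stage):
        """Return False once a session has matched more than `max_stages`
        stages of the sequence in order.

        `stage` is the label assigned to the current task, or None for a
        task that matches no stage. A blocked session stays blocked, on the
        assumption that a human reviews it before it is released.
        """
        reached = self._reached.get(session_id, 0)
        if reached < len(self.sequence) and stage == self.sequence[reached]:
            reached += 1
            self._reached[session_id] = reached
        return reached <= self.max_stages
```

A session that does recon, then a vulnerability scan, is still allowed; the moment the same session asks for an exploit step it is refused, even though that step alone might read as benign. The obvious limit is the one the text already names: an adversary who spreads the stages across sessions or accounts defeats it, so it raises cost rather than closing the gap.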

3. Threat model on your own dev environment

The other thing the case teaches: a coding agent in production is a persistent endpoint with organisational credentials. If an attacker compromises a developer’s account with Claude Code installed and MCP servers connected (filesystem, git, private repos), the exfil speed is comparable to what the report describes. The attacker doesn’t need to be China-state; an infostealer on the dev laptop is enough. The laptop with an operational coding agent is now an elevated attack surface: not worse than in 2024, but with a new speed multiplier.

This directly questions the developer security model: API key rotation, mandatory MFA at the model provider, scoped tokens per project, audit logging of MCP tool calls. We covered this in the November 2024 post on the MCP spec and in the April 2025 tool poisoning analysis. The Anthropic report gives the practical justification that was missing to move that work onto the roadmap.
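As an illustration of the audit-logging item, here is a minimal sketch of a wrapper that records every tool call as one JSON line. It does not use any real MCP SDK API: the decorator, the field names and the idea of passing a list as the log sink are all assumptions made for the example.

```python
import functools
import hashlib
import json
import time


def audited(tool_name, sink, user):
    """Wrap a tool function so each call appends one JSON audit record to `sink`.

    `sink` is anything with an `append` method (a list here; a log shipper
    in a real deployment). Arguments are hashed rather than stored, so the
    audit trail does not itself leak file contents or secrets.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            payload = json.dumps({"args": args, "kwargs": kwargs},
                                 sort_keys=True, default=repr)
            record = {
                "ts": time.time(),
                "user": user,
                "tool": tool_name,
                "args_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
                "ok": False,
            }
            try:
                result = func(*args, **kwargs)
                record["ok"] = True
                return result
            finally:
                # Runs on success and on exception, so failed calls are logged too.
                sink.append(json.dumps(record))
        return wrapper
    return decorator
```

Decorating a file-read tool with audited("fs.read", log, user="dev1") yields one record per call, including calls that raise. Feeding those records into the same cadence baseline used for API keys is the natural next step.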

How this fits the rest of the year

It’s the third time in 2025 the “agentic loop + jailbreak + tools” pattern leaves the papers and enters public operation:

March–April: Invariant Labs publishes the first paper on MCP Tool Poisoning with a reproducible PoC against Cursor and Claude Desktop. Covered here.
May: Anthropic publishes agentic misalignment research with reproducible methods. The field formalises “AI behaving badly” as a metric, not as sci-fi.
November: the espionage report. For the first time, a frontier company publishes that its model has been used in an operation against third parties at documented scale.

The trajectory is legible. What was lab research in March was a reproducible metric in May and is campaign attribution in November. For 2026 the operational question will be how liability is distributed when a commercial coding agent is the tool, and that conversation is already happening in the US Congress: the letter from the Homeland Security Committee to Dario Amodei on 26 November asks for testimony on the incident.

Three questions the report leaves open for 2026

When will we see a symmetric case with an open-weight model? If the “an APT wouldn’t use a commercial model” read is right, an equivalent campaign may be running right now on local Qwen-3 or DeepSeek-R2, where no provider will detect it. The first public incident attributed to an agent on open weights will settle the debate over what the Anthropic report actually showed.
What’s the OPSEC failure mode of the operational attacker? Anthropic saw the pattern because they saw abnormal API traffic. If the adversarial harness implements rate limiting and API key rotation, does the next one get seen? The detection arms race between providers and agent-using attackers starts here.
What does regulation do with provider liability? The model provider closes accounts, notifies victims, publishes the report. Is that enough under NIS2, the EU AI Act, US executive orders? The legal answer is unclear in any jurisdiction.

For the coming months, follow-up runs through the monthly bulletin and the AI security 2025 retrospective that closes out the year.

References

Anthropic, Disrupting the first reported AI-orchestrated cyber espionage campaign (13 Nov 2025): https://www.anthropic.com/news/disrupting-AI-espionage
Anthropic, Full report PDF: https://www-cdn.anthropic.com/57b3da11d63ea9aa5dbdf2e80c1b1f6b2af9fb27.pdf
Homeland Security Committee letter to Dario Amodei (26 Nov 2025): https://homeland.house.gov/wp-content/uploads/2025/11/2025-11-26-CHS-to-Anthropic-re-Request-to-Testify.pdf
Thoughtworks, Anthropic’s AI espionage disclosure: separating the signal from the noise: https://www.thoughtworks.com/en-us/insights/blog/security/anthropic-ai-espionage-disclosure-signal-from-noise
PC Gamer, Cybersecurity critics are sceptical: https://www.pcgamer.com/software/ai/anthropic-reports-the-first-80-90-percent-ai-orchestrated-cyber-espionage-campaign-but-cybersecurity-critics-are-sceptical/
The Record, Chinese state hackers used Anthropic AI systems: https://therecord.media/chinese-hackers-anthropic-cyberattacks
Help Net Security, Claude AI automated cyberattack: https://www.helpnetsecurity.com/2025/11/14/claude-ai-automated-cyberattack/
Institute for AI Policy and Strategy, The Emergence of Autonomous Cyber Attacks: https://www.iaps.ai/research/autonomous-cyber-attacks