ai-security · 9 min read

DARPA AIxCC final at DEF CON 33: Team Atlanta wins as finalists find 18 real zero-days at an average cost of $152 per task

The AI Cyber Challenge final closes in Las Vegas on 8 August. Seven autonomous systems analyse 53 challenges and, on top of the synthetic vulnerabilities planted, find 18 real zero-days in open-source projects and propose valid patches for 11 of them. Three of the seven CRSs are already published as open source.

· Manuel López Pérez · ai-security

8 August 2025, DEF CON 33 Main Stage, Las Vegas Convention Center. DARPA announced the winners of the AI Cyber Challenge (AIxCC). First place and $4M went to Team Atlanta (Georgia Tech, Samsung Research, KAIST, POSTECH) with their ATLANTIS system. Second place and $3M went to Trail of Bits with Buttercup. Third place and $1.5M went to Theori. The seven finalist systems (Cyber Reasoning Systems, CRSs) collectively found 54 of the 63 planted synthetic vulnerabilities (86 %) and patched 43 (68 %). The figure that defines the event is a different one: they also found 18 real zero-days that nobody had planted in the challenge code, and proposed valid patches for 11 of them.
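The percentages follow directly from the counts. A quick check, using only the figures quoted in this post (the variable names are illustrative):

```python
# Counts as quoted in this post; they are not independently verified here.
PLANTED = 63
FOUND = 54
PATCHED = 43
ZERO_DAYS = 18
ZERO_DAYS_PATCHED = 11


def pct(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage with one decimal."""
    return round(100 * part / whole, 1)


rates = {
    "synthetic found": pct(FOUND, PLANTED),
    "synthetic patched": pct(PATCHED, PLANTED),
    "zero-days patched": pct(ZERO_DAYS_PATCHED, ZERO_DAYS),
}

for label, value in rates.items():
    print(f"{label}: {value} %")
```

The first two values round to the 86 % and 68 % quoted above; the third puts the patch rate for the real zero-days at roughly 61 %.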

Average cost per challenge task: $152 in compute plus LLM credits. For operational context: 18 zero-days in real open-source code, 11 of them with valid patches, over a weekend, for less than a junior pentester’s annual salary.

Reading: this post analyses DARPA’s published results and the team blogs. It includes no PoCs; the CRS repos are dense material for future posts once all seven are open.

How the final was set up

The final ran from 7 to 10 August during DEF CON 33. Setup:

  • Seven finalist teams selected after the semifinal at DEF CON 32 (August 2024): Team Atlanta, Trail of Bits, Theori, Shellphish, 42-b3yond-6ug, Lacrosse, all_you_need_is_a_fuzzing_brain.
  • 53 challenge projects split between C and Java. A mix of widely deployed open-source software — including Linux kernel projects, critical libraries and development backbones.
  • 63 synthetic vulnerabilities planted by the organising committee across those 53 projects. Realistic mix — patterns appearing in historic CVEs: integer overflows, use-after-free, path traversal, deserialisation, race conditions.
  • Compute budget per team: $85,000 in Azure (servers, VMs, GPUs) plus $50,000 in LLM API credits. Anthropic, Google and OpenAI each donated $350,000 in credits, split across the seven teams.
  • Autonomous tasks: each CRS receives a project, its test suite, and a compute/LLM budget. It must (a) detect bugs, (b) write a proof of vulnerability that triggers the bug, (c) propose a patch fixing the bug without breaking the test suite.

Change from the earlier Cyber Grand Challenge (2016): the 2016 CRSs worked on compiled binaries with synthetic bugs in a custom environment, an OS built for the competition. The 2025 CRSs work on the real source code of real production projects, in languages the LLM “understands” semantically.

The common architecture across finalist CRSs

What the seven systems share, according to the teams’ blogs and an SoK paper published in September by members of the organising committee:

  1. LLM layer (Claude Opus 4 / Sonnet 4, GPT-4o / o3, Gemini 2.5 depending on team) for semantic reasoning over code — detecting suspicious patterns, generating hypotheses, writing proof of vulnerability, writing patches.
  2. Classic program analysis layer: directed fuzzing (AFL++, libFuzzer, custom), symbolic execution (KLEE, angr, Mayhem), static analysis (CodeQL, Semgrep, Joern).
  3. Orchestration layer: the LLM decides which classic tool to fire against which piece of code, reads results, refines hypotheses, iterates.

The novelty compared to the naive “LLM-as-pentester” pattern of 2023-2024: the LLM doesn’t do the bug hunting. It orchestrates the tools that do, and interprets their results. A bug counts as confirmed only when a proof of vulnerability reproducibly triggers a crash, not on the strength of the model’s “intuition”.
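A minimal sketch of that division of labour, under heavy assumptions: `plan_next_tool` is a hard-coded stand-in for the LLM planner, the "static analysis" step simply hands over the bytes worth trying, and the fuzzer enumerates inputs instead of mutating them. None of these names come from a real CRS. The point is only that the planner picks tools while a reproducible crash decides what counts as a bug.

```python
import itertools
from typing import Callable, Iterator, List, Optional


def toy_target(data: bytes) -> None:
    """Hypothetical program under test: crashes on one specific prefix."""
    if data.startswith(b"BUG"):
        raise RuntimeError("simulated crash")


def reproduces(target: Callable[[bytes], None], data: bytes, runs: int = 3) -> bool:
    """A finding counts only if the crash reproduces on every run."""
    for _ in range(runs):
        try:
            target(data)
        except RuntimeError:
            continue
        return False
    return True


def enumerating_fuzzer(alphabet: bytes, length: int) -> Iterator[bytes]:
    """Stand-in for a fuzzer: tries every input of `length` over `alphabet`."""
    for combo in itertools.product(alphabet, repeat=length):
        yield bytes(combo)


def plan_next_tool(history: List[str]) -> str:
    """Stand-in for the LLM planner: static scan first, then fuzz."""
    return "static" if not history else "fuzz"


def orchestrate(target: Callable[[bytes], None], max_steps: int = 50) -> Optional[bytes]:
    history: List[str] = []
    candidates: Iterator[bytes] = iter(())
    for _ in range(max_steps):
        tool = plan_next_tool(history)
        history.append(tool)
        if tool == "static":
            # Assumption: static analysis surfaced the bytes the target compares against.
            candidates = enumerating_fuzzer(b"BUG", 3)
            continue
        data = next(candidates, None)
        if data is None:
            return None  # search space exhausted without a confirmed crash
        if reproduces(target, data):
            return data  # confirmed proof of vulnerability
    return None


pov = orchestrate(toy_target)
print(pov)
```

The planner never declares a bug on its own; the only path to a result goes through `reproduces`.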

Atlantis (Team Atlanta) — the winner

Team Atlanta published the ATLANTIS technical paper in September (arxiv 2509.14589). Relevant details:

  • Final score: 392.76 points, more than 170 ahead of second place (Trail of Bits with Buttercup).
  • Combination of directed fuzzing, symbolic execution and static analysis, all orchestrated by an LLM layer (mix of models per subtask).
  • Modular architecture: a Threat Localization module that selects code zones to investigate, an Analysis module generating hypotheses, a Triage Intelligence module discarding false positives and prioritising, a Patch Generator module producing the patch and validating it against the test suite.
  • Public repo on GitHub: Team-Atlanta/aixcc-afc-atlantis, OSI-approved licence.

The interesting thing about the Atlantis approach is the clear separation between detection and validation. The Analysis module generates many hypotheses (“this memcpy in this function with this length looks suspicious”); the Triage module discards 90 % of them using classic program analysis. Only what survives triage goes to the directed fuzzer for crash confirmation. The result: very few false positives among the vulnerabilities it reports as confirmed.
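The funnel can be illustrated with a toy model. This is not the ATLANTIS code: the `Hypothesis` fields, the triage rule and the simulated copy are all assumptions chosen to show the shape of the pipeline, with 20 hypotheses of which triage discards 18 (90 %).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    site: str                # hypothetical call site flagged by the analysis step
    max_len: int             # largest length the copy can be called with
    dest_size: int           # size of the destination buffer
    reachable: bool          # can untrusted input reach the call?
    has_runtime_guard: bool  # a check triage cannot see; only execution reveals it


def triage(hyps: List[Hypothesis]) -> List[Hypothesis]:
    """Cheap analysis: drop unreachable sites and copies that always fit."""
    return [h for h in hyps if h.reachable and h.max_len > h.dest_size]


def fuzz_confirms(h: Hypothesis) -> bool:
    """Stand-in for directed fuzzing: replay the largest copy and watch for a crash."""
    if h.has_runtime_guard:
        return False  # the guard rejects the oversized input before the copy
    dest = bytearray(h.dest_size)
    try:
        for i in range(h.max_len):
            dest[i] = 0x41  # IndexError stands in for a buffer overflow
    except IndexError:
        return True
    return False


hyps = (
    [Hypothesis(f"safe_{i}", 16, 64, True, False) for i in range(12)]     # copy fits
    + [Hypothesis(f"dead_{i}", 128, 64, False, False) for i in range(6)]  # unreachable
    + [
        Hypothesis("guarded_copy", 128, 64, True, True),
        Hypothesis("parse_header", 128, 64, True, False),
    ]
)

survivors = triage(hyps)
confirmed = [h.site for h in survivors if fuzz_confirms(h)]
print(len(hyps), len(survivors), confirmed)
```

Only the hypothesis that survives both the cheap filter and the execution step is reported, which is where the low false-positive rate comes from.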

Buttercup (Trail of Bits) — second place

Trail of Bits published a post-mortem analysis on their blog. Buttercup:

  • Emphasis on patch quality: a patch that fixes the bug but introduces a regression doesn’t score. Buttercup has a dedicated patch validation layer that runs extensive tests beyond the project’s official test suite.
  • Use of multi-LLM ensembling: different models assess the same bug and patch, and an output is accepted only when the models agree or additional evidence supports it.
  • Trail of Bits is open about CRS safety: the system doesn’t execute code on non-isolated hosts, doesn’t call the network except to reach the LLM APIs and the fuzzer, and doesn’t write outside the assigned workdir.
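One way to read the ensembling rule is as a two-tier acceptance check. The sketch below is an interpretation, not Buttercup's implementation: the function name, the majority threshold and the model labels are assumptions.

```python
from typing import Dict


def ensemble_accepts(verdicts: Dict[str, bool], extra_evidence: bool) -> bool:
    """Accept on full consensus, or on a majority backed by independent evidence.

    `verdicts` maps a model label to whether that model approves the patch.
    `extra_evidence` represents something like an extended test run passing.
    """
    if not verdicts:
        return False
    approvals = sum(verdicts.values())
    if approvals == len(verdicts):
        return True  # every model agrees
    majority = approvals * 2 > len(verdicts)
    return majority and extra_evidence  # disagreement needs outside support


unanimous = {"model_a": True, "model_b": True, "model_c": True}
split = {"model_a": True, "model_b": True, "model_c": False}

print(ensemble_accepts(unanimous, extra_evidence=False))
print(ensemble_accepts(split, extra_evidence=False))
print(ensemble_accepts(split, extra_evidence=True))
```

The design choice is that a single dissenting model can block a patch unless something other than model opinion vouches for it.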

Theori — third place

Theori published a technical writeup of its approach. The system uses LLM-guided fuzzing with a more aggressive coverage feedback loop than its competitors: if the fuzzer gets stuck, the LLM reads the code of the “stuck” function and proposes mutated input seeds to break the plateau.
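The plateau trigger can be sketched as a check over recent coverage samples. This is an illustration under assumptions, not Theori's code: the window size, the coverage numbers and the `propose_seeds` hook (which stands in for the LLM call) are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def plateaued(coverage_history: List[int], window: int = 3) -> bool:
    """True when coverage has not grown over the last `window` samples."""
    if len(coverage_history) <= window:
        return False
    return coverage_history[-1] <= coverage_history[-1 - window]


def run(
    coverage_stream: Iterable[int],
    propose_seeds: Callable[[], List[bytes]],
    window: int = 3,
) -> List[Tuple[int, List[bytes]]]:
    """Replay coverage samples and record when the seed hook would fire."""
    history: List[int] = []
    injected: List[Tuple[int, List[bytes]]] = []
    for step, covered in enumerate(coverage_stream):
        history.append(covered)
        if plateaued(history, window):
            injected.append((step, propose_seeds()))
            history.clear()  # restart the window after injecting new seeds
    return injected


# Hypothetical edge-coverage counts sampled once per fuzzing round.
samples = [10, 12, 15, 15, 15, 15, 20, 21]
events = run(samples, propose_seeds=lambda: [b"MAGIC"])
print(events)
```

In this replay the hook fires once, at the round where coverage has been flat for three consecutive samples, and stays quiet once coverage starts growing again.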

The 18 real zero-days — the important part

The 63 synthetic vulnerabilities planted by the organising committee are the controlled benchmark. But the figure that changes the conversation: the CRSs also found 18 real, unplanted bugs in the challenge project code. Six in C codebases (including one vulnerability found in parallel by the project’s maintainer during the event) and 12 in Java codebases. The teams proposed valid patches for 11 of the 18.

Automated bug finding in real code is not new: traditional fuzzing has been doing it for over a decade (afl-fuzz, OSS-Fuzz). What is unprecedented is what the CRSs add on top:

  • Semantic confirmation: the LLM reads the crash dump and the surrounding code, explains the vulnerability in human language, and proposes a patch in the project’s own language.
  • Functional patch validation: the proposed patch doesn’t just fix the crash; it also passes the project’s test suite. This is the step that traditionally requires a human.
  • Speed: the 18 zero-days were found within the final’s window, an operational weekend. The closest precedents (OSS-Fuzz running continuously, human bug bounty teams) operate on month-long timelines for equivalent output.

DARPA reports an average cost of $152 per completed challenge task. This includes Azure compute and LLM credits. Five of the seven teams used less than 25 % of their LLM credit budget — the practical bottleneck was Azure compute, not model inference.
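For scale, the budget figures quoted in this post can be combined in a few lines. The last value is purely illustrative: it asks how many average-cost tasks one team's full budget would cover, which is not a figure DARPA reports.

```python
# Figures as quoted in this post (USD).
AZURE_BUDGET = 85_000
LLM_BUDGET = 50_000
TEAMS = 7
AVG_COST_PER_TASK = 152

per_team_total = AZURE_BUDGET + LLM_BUDGET
all_teams_total = per_team_total * TEAMS
llm_quarter = LLM_BUDGET * 25 // 100  # the 25 % threshold mentioned above
tasks_per_team_budget = per_team_total // AVG_COST_PER_TASK  # illustrative only

print(f"per-team budget:       ${per_team_total:,}")
print(f"all seven teams:       ${all_teams_total:,}")
print(f"25 % of LLM credits:   ${llm_quarter:,}")
print(f"average tasks covered: {tasks_per_team_budget}")
```

Staying under 25 % of the LLM credits means spending less than $12,500 of the $50,000 on model inference.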

Mandatory open-source release

A programme condition from the start: all seven finalist CRSs must be released as free software under an OSI-approved licence after the final. As of event close, five of the seven are already published; the other two are expected in the following weeks.

Repos published as of 10 August:

Dense material for the coming months. Once all seven are open and there’s community scrutiny, we’ll be able to talk in detail about which CRS detects which patterns, where they break, and which patterns are replicable outside the competition setup.

Reading for the SOC

Three questions AIxCC reopens:

1. What happens with responsible disclosure?

If an attacker (state-sponsored or not) replicates ATLANTIS against an open-source project with a large user base, they can hunt for zero-days at a cost on the order of $152 per task. The kicker is the supply: the attacker can enumerate repos, run the CRS, and harvest. Responsible disclosure policy assumes a human researcher with mixed motives. With autonomous CRSs, the relevant metric stops being “bugs found per researcher / year” and becomes “throughput of a system operating continuously”.

The other side: the defender can run the same CRS on its own projects before the attacker. OSS-Fuzz already does this at low speed; an LLM-augmented OSS-Fuzz can raise patch coverage pre-disclosure. But that assumes the defender runs the system and publishes patches; many open-source maintainers won’t have the time / capacity to process the flow.

2. What happens with offensive AI?

AIxCC CRSs are defensive systems by design: they find bugs and patch them. But the detection/exploitation component is 90 % of the work; the patch is the last phase. A CRS without the patch phase is a functional offensive system. Through 2025-2026, expect forked or replicated versions of the finalist CRSs with the patch module replaced by a weaponisation module (one that writes a proof of exploit with shellcode or RCE, not just a proof of vulnerability).

For the SOC: the question isn’t whether this kind of system exists offensively in 2026, it’s who operates it and against what stack. Edge appliances (Ivanti, Fortinet, Palo Alto, Cisco IOS XE) run legacy C code, the same kind of codebase in which the CRSs found 6 of the 18 real bugs, and are candidates.

3. And the human maintainers?

A maintainer of a small open-source project may next month receive a PR written by a CRS with a correct patch for a bug they didn’t know about, along with a proof of vulnerability. That flow is positive. But they may also receive a PR written by a human attacker using a forked CRS that looks like a correct patch and actually introduces a subtle backdoor: the XZ Utils 5.6.0 backdoor (CVE-2024-3094, covered in the April 2024 post) done at industrial scale.

The question for maintainers: how do you validate an LLM-generated PR when volume goes up two orders of magnitude? Nobody knows yet.

Operational notes

  • The published CRS repos are dense. Expect 6–12 months for solid community analysis on which work in which domain.
  • The specific challenge projects aren’t public (DARPA keeps the list opaque to avoid contaminating the dataset for future editions). The characteristics — C/Java, widely deployed open-source projects — are what’s been published.
  • DARPA confirms ARPA-H as a collaborator on this edition. The healthcare critical-infrastructure angle appears in marketing material but not as a separate challenge category. A follow-on programme with a sectoral focus is likely.
