ai-security · 13 min read

Agentic red team — from PentestGPT (2023) to XBOW #1 on HackerOne (2025)

Three years of red team with LLMs. PentestGPT (Aalto/NTU paper, Aug 2023, USENIX 2024) opens the academic category; HackerGPT and WhiteRabbitNeo build the commercial side; XBOW (July 2025) reaches #1 globally on HackerOne with 1,060 reported vulns. Reproducible PoC with PentestGPT v2 against HackTheBox.

· Manuel López Pérez · ai-security

This blog’s AI security coverage over the last three years has a visible asymmetry. We have followed AI as a target closely (jailbreaks, indirect injection, MCP poisoning, agentic misalignment, Recall, infrastructure) but have barely touched the other side: AI used for autonomous offence, meaning PentestGPT, HackerGPT, WhiteRabbitNeo, XBOW and AIxCC. The field has been maturing for three years, and in July 2025 it crossed a threshold: XBOW reached #1 globally on HackerOne with 1,060 vulnerabilities reported in 12 months. The question this post closes on: is AI red team a real category with tooling, papers and benchmarks, or is it still a narrative product?

Lab: PoC with PentestGPT v2 (GPL-3.0, 6,200+ stars) against retired HackTheBox machines. The framework is reproducible on Linux with an OpenAI/Anthropic API account. Typical result from my run: the agent solves easy HTB machines in 30-50 minutes at ~$3 API cost, fails on medium machines that require multi-host pivoting, and gets stuck on hard ones without human guidance.

The 2023–2026 trajectory

August 2023 — PentestGPT as an academic category

arXiv:2308.06782. Gelei Deng et al. (NTU Singapore + Aalto + Edinburgh + collaborations) publish PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing. They present it formally at USENIX Security 2024 a year later.

The paper’s contribution isn’t “we asked GPT-4 to hack” — anyone with ChatGPT plugins was already trying that in summer 2023. The contribution is the Pentesting Task Tree (PTT), a structure inspired by classic attack trees that encodes pentesting process state and lives outside the LLM conversation context:

   ├─ Reconnaissance ─┬─ nmap scan
   │                  ├─ enum subdomains
   │                  └─ tech stack identify

   ├─ Vulnerability ──┬─ CVE search
   │                  ├─ Banner grab → match
   │                  └─ Default creds test

   └─ Exploitation ───┬─ Public PoC search
                     ├─ Adapt PoC
                     └─ Shell / read flag

The LLM doesn’t hold the tree — the Python harness does. The LLM only receives the active sub-node + minimal context + tool descriptions. This solves the paper’s canonical problem: context loss in long sessions. Without PTT, GPT-4 forgets what it did 10 turns ago; with PTT, the harness reminds it only of what matters.
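To make the "harness holds the tree, the model sees one node" idea concrete, here is a minimal sketch. It is not PentestGPT's actual code: `Node`, `find_active` and `prompt_context` are hypothetical names chosen for illustration, and the status values are an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One task-tree node. The harness owns this state, not the LLM."""
    name: str
    status: str = "todo"  # assumed values: "todo", "active", "done"
    children: list[Node] = field(default_factory=list)

    def add(self, name: str) -> Node:
        child = Node(name)
        self.children.append(child)
        return child


def find_active(node: Node, path: tuple[str, ...] = ()) -> tuple[str, ...] | None:
    """Depth-first search for the first node marked active; returns its path."""
    here = path + (node.name,)
    if node.status == "active":
        return here
    for child in node.children:
        found = find_active(child, here)
        if found:
            return found
    return None


def prompt_context(root: Node, tools: list[str]) -> str:
    """Build the small slice of state that would be sent to the model."""
    active = find_active(root)
    if active is None:
        return "No active task."
    return f"Current task: {' > '.join(active)}\nTools: {', '.join(tools)}"


root = Node("Pentest")
recon = root.add("Reconnaissance")
recon.add("nmap scan").status = "done"
recon.add("enum subdomains").status = "active"
root.add("Vulnerability").add("CVE search")

print(prompt_context(root, ["nmap", "dig", "curl"]))
```

The prompt built here mentions only the active branch. Completed and pending nodes stay in the harness, which is what keeps the per-turn context small.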

Three modules:

  • Reasoning module — decides what to do next given PTT state.
  • Generation module — emits the concrete command or request.
  • Parsing module — processes command output and synthesises it back to reasoning.
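The division of labour between the three modules can be sketched as a loop. This is an illustrative skeleton under stated assumptions, not the paper's implementation: `reason`, `generate` and `parse` are deterministic stand-ins for what would be LLM calls, the command templates are made up for the example, and `fake_execute` only echoes the command instead of running it.

```python
def reason(state: dict) -> str:
    """Reasoning stand-in: pick the next task from harness-held state."""
    return state["todo"][0] if state["todo"] else "finish"


def generate(task: str, target: str) -> str:
    """Generation stand-in: turn the chosen task into a concrete command."""
    templates = {
        "port scan": "nmap -sV -sC {target}",
        "http headers": "curl -I http://{target}/",
    }
    return templates[task].format(target=target)


def parse(raw_output: str, max_lines: int = 3) -> str:
    """Parsing stand-in: condense raw tool output before it goes back to reasoning."""
    lines = [ln.strip() for ln in raw_output.splitlines() if ln.strip()]
    return " | ".join(lines[:max_lines])


def run_loop(target: str, tasks: list[str], execute) -> list[tuple[str, str, str]]:
    """Reason -> generate -> execute -> parse, until no task is left."""
    state = {"todo": list(tasks), "findings": []}
    while (task := reason(state)) != "finish":
        command = generate(task, target)
        summary = parse(execute(command))
        state["findings"].append((task, command, summary))
        state["todo"].remove(task)
    return state["findings"]


def fake_execute(command: str) -> str:
    """Hypothetical executor: returns canned text, runs nothing."""
    return f"\nran: {command}\n\nexit 0\n"


findings = run_loop("10.10.10.40", ["port scan", "http headers"], fake_execute)
for task, command, summary in findings:
    print(f"{task}: {command} -> {summary}")
```

Passing `execute` in as a parameter keeps the loop testable and makes the human-in-the-loop versus auto-execute choice a property of the caller rather than of the modules.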

Paper benchmark: PentestGPT improves task completion by 228% over vanilla GPT-3.5 and by 58.6% over vanilla GPT-4 across a set of 13 machines (HackTheBox + VulnHub) and 182 sub-tasks. The caveat: performance is still below a junior human pentester on hard machines and on anything involving multi-host pivoting.

2023–2024 — the commercial side: HackerGPT, BurpGPT, WhiteRabbitNeo

PentestGPT launches a wave. Within months, commercial forks and specialisations appear:

  • HackerGPT — commercial fork of the concept with integrated tooling (Nmap, ffuf, Nuclei, custom recon modules). SaaS pricing, targeting consultancies.
  • BurpGPT (Burp Suite extension) — integrates GPT-4 directly into the Burp flow for request/response analysis, vulnerability detection during interception.
  • WhiteRabbitNeo (Kindo / WhiteRabbitNeo team) — LLM fine-tuned specifically for offensive security. Models released on Hugging Face (33B, 13B, 7B). No alignment against offensive-sec content. Hosted via Kindo.
  • Pentest-Copilot (various vendors) — product category positioning as copilots for human pentesters, not autonomous pentesters.

What ties these products together: they’re still assisted tools, not autonomous. The human drives; the LLM speeds things up. The harness that decides the next step, executes, parses and persists state is still the human operator. The conceptual gap with PentestGPT (where the harness lives in the framework) is operational: in production, “pentester with AI tool” delivers value; “autonomous AI pentester” doesn’t yet.

Until July 2025.

July 2025 — XBOW reaches #1 globally on HackerOne

XBOW (xbow.com, founded 2024) operates as an autonomous pentester in production against public bug bounty programmes. It’s not an assisted harness — it’s an agent with tool access (browser, terminal, custom modules) running against real targets without an operator in the per-iteration loop. In July 2025 it reaches #1 on HackerOne globally with verifiable metrics:

  • 1,060+ vulnerabilities reported in 12 months.
  • 22 confirmed CVEs from audited Docker Hub images. 174 vulnerabilities reported in total from that exercise + 650 additional potential flaws.
  • 48-step exploit chains documented (the longest chain reported by a human on HackerOne is ~30 steps).
  • Padding oracle attack on AES-128-CBC in 17.5 minutes — identifies an encrypted cookie, recognises the mode, discovers the oracle via differential error responses, writes the exploit byte-by-byte, decrypts. An experienced human does this in hours to days.
  • 40-hour principal-pentester assessment replicated in 28 minutes on a specific programme (XBOW publishes the benchmark, not the programme identity).

Vulnerability categories reported: RCE, information disclosure, cache poisoning, SQL injection, XXE, path traversal, SSRF, XSS, secret exposure. Broad horizontal coverage across the OWASP Top 10 categories.

XBOW’s public methodology: canary-based CTF. The operator embeds canaries (unique values) in the target’s code. XBOW launches agents that try to read/exfiltrate the canaries via vulnerability discovery. Detection of the canary in output is the binary signal of “exploitable vulnerability, not just detected”.
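The canary check itself is simple enough to sketch. The following is an assumption-laden illustration of the mechanism described above, not XBOW's code: `new_canary` and `exploited` are hypothetical helpers, and the sample outputs are invented.

```python
import secrets


def new_canary(prefix: str = "CANARY") -> str:
    """Unique, unguessable marker the operator plants in the target."""
    return f"{prefix}-{secrets.token_hex(16)}"


def exploited(agent_output: str, canary: str) -> bool:
    """Binary success signal: did the agent actually read the planted value?"""
    return canary in agent_output


canary = new_canary()
report_only = "Possible SQL injection in /login (unconfirmed)"
with_proof = f"Dumped table secrets: {canary}"

print(exploited(report_only, canary))  # False
print(exploited(with_proof, canary))   # True
```

Because the canary is random, the agent cannot guess or hallucinate it. A match means the value was really read from the target, which is what separates "exploited" from "detected".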

August 2025 — DARPA AIxCC final closes the other end

We covered AIxCC in its dedicated technical post. Team Atlanta wins with ATLANTIS; across the 7 finalist systems, 18 real zero-days are found and 11 patches are submitted upstream. Average cost: $152 per challenge task.

XBOW + AIxCC + WhiteRabbitNeo cover three points of the quadrant:

| | Defensive / patching | Offensive / discovery |
|---|---|---|
| Open-source / research | AIxCC finalists (ATLANTIS, Buttercup, etc.) | PentestGPT, WhiteRabbitNeo |
| Commercial / production | (early category: Anthropic Code Mode, GitHub Copilot security) | XBOW |

Q4 2025 – Q1 2026 — from product to service, and the defensive response

After the #1 on HackerOne, XBOW closes a $75M Series B in July 2025 (Sequoia + Altimeter) and accelerates integration with enterprise consultancies. Mandiant, Bishop Fox and NCC Group fold it into their engagements as a surface-coverage layer. The per-engagement price doesn’t change substantially; the volume of findings per unit of time does.

Three moves over the last year shift the quadrant:

  • LiteLLM supply chain compromise by TeamPCP (March 2026): the group first compromises Trivy (the security scanner) to reach the PyPI credentials of LiteLLM’s maintainer. Covered in the March 2026 bulletin and the AI infrastructure technical post. The relevant detail for the quadrant: the attacker uses compromised security tooling as a ramp to downstream AI tooling. It is the first publicly documented multi-ecosystem operation (PyPI + npm + Docker Hub + GitHub Actions + OpenVSX) and proof that the offensive quadrant has state actors operating in real time, not just agents against bug bounty programmes.
  • Anthropic Mythos / Project Glasswing (7 Apr 2026): Anthropic publishes the first serious gated frontier model, Mythos, whose deployment ships with a mandatory harness (Glasswing) applying safety classifiers, a cryptographic audit trail and rate limits per sensitive task. Covered in the April 2026 bulletin. It’s the first commercial response to the “agent with tool access needs scaffolding” pattern XBOW industrialised on the offensive side. On the defensive side, Glasswing is the first structural competitor in the quadrant.
  • OpenAI GPT-5.5-Cyber (23 Apr 2026): a GPT-5.5 variant specialised through additional training on defensive security (SIEM triage, KQL queries, SIGMA generation). Embedded in Microsoft Security Copilot Agents as the underlying model for the enterprise tier. Not “autonomous defensive AI” in the XBOW sense, since the SOC analyst stays in the loop, but the first commercial model trained specifically for blue team tasks.

As of May 2026, the quadrant looks like this:

| | Defensive / patching | Offensive / discovery |
|---|---|---|
| Open-source / research | AIxCC finalists (ATLANTIS, Buttercup, etc.) | PentestGPT, WhiteRabbitNeo |
| Commercial / production | Anthropic Glasswing, Microsoft Security Copilot Agents + GPT-5.5-Cyber, CrowdStrike Charlotte AI | XBOW + enterprise services |

What’s still open: the defensive quadrant’s metric. XBOW has canary CTF as a binary success signal; Glasswing and Security Copilot Agents have more opaque metrics — percent of incidents triaged correctly, MTTR reduction, false positive rate — that depend on the client environment and don’t lend themselves to open benchmarking. The first vendor to publish a defensive benchmark comparable to XBOW’s canary CTF wins the 2027 conversation.
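For contrast with a binary canary signal, here is how the defensive metrics named above would typically be computed. This is a sketch with invented data: the `Alert` type, the sample alerts and the MTTR figures are all hypothetical. Note that every number depends on ground-truth labels that only exist inside a given client environment.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    is_incident: bool  # ground truth, e.g. from later analyst review
    flagged: bool      # what the defensive agent decided


def triage_metrics(alerts: list[Alert]) -> dict:
    """Confusion-matrix style summary of a triage run."""
    tp = sum(a.is_incident and a.flagged for a in alerts)
    tn = sum(not a.is_incident and not a.flagged for a in alerts)
    fp = sum(not a.is_incident and a.flagged for a in alerts)
    fn = sum(a.is_incident and not a.flagged for a in alerts)
    negatives = fp + tn
    return {
        "accuracy": (tp + tn) / len(alerts),
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "missed_incidents": fn,
    }


def mttr_reduction(before_minutes: float, after_minutes: float) -> float:
    """Fractional drop in mean time to respond."""
    return (before_minutes - after_minutes) / before_minutes


alerts = [
    Alert(True, True),
    Alert(True, False),
    Alert(False, True),
    Alert(False, False),
    Alert(False, False),
]
print(triage_metrics(alerts))
print(mttr_reduction(120, 90))
```

Unlike a canary match, none of these values can be checked by an outside party without access to the labelled alerts, which is the open-benchmarking problem in a nutshell.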

Why it works

The operational question: what makes an autonomous AI agent effective at red team and why does it take until 2025?

Four components converge:

  1. Reasoning models with visible CoT: since o1 in September 2024 and DeepSeek-R1 in January 2025, the agent can think about a problem for hundreds of tokens before issuing a command. GPT-5.5 (November 2025) and Claude Opus 4.7 (April 2026) consolidate extended thinking by default.
  2. Mature tool use: function calling consolidated on GPT-4o/Claude 3.5 since mid-2024, and MCP has been standardising tools since November 2024 (the March 2026 technical post analyses MCP after 16 months in production). The agent can run nmap, parse the result and decide the next command, all without human intervention.
  3. Long context windows: 1M tokens in Gemini 2.5, 200k in Claude 4, 4.5 and 4.7. This lets the agent keep the full pentest session history without the external PTT that PentestGPT required in 2023.
  4. Constitutional classifiers and configurable safety harness: Anthropic publishes Constitutional Classifiers v2 in February 2025 with a $20k bug bounty, and no universal jailbreak is found after 183 participants and 3,000 hours. Glasswing (April 2026) inherits the pattern and takes it to product. For XBOW operating against real targets, this means the model provider’s classifier allows legitimate offensive testing when the bug bounty programme’s explicit authorisation is supplied as context.

The harness is still critical. XBOW publishes that its real value isn’t in the model, it’s in the scaffold: flow control, parallelisation, telemetry, action rollback, gating on destructive actions. PentestGPT proved this conceptually in 2023; XBOW industrialised it in 2024-2025; Glasswing tries to do the same on the defensive side in 2026.
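Gating on destructive actions, plus the telemetry that goes with it, can be illustrated with a small policy check. This is a toy sketch, not XBOW's scaffold: the deny patterns, the three decision labels and the `gate` function are assumptions, and a production policy would be far richer than a few regular expressions.

```python
import re

# Hypothetical patterns for commands that should never run unattended.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-\S*r"),                                # recursive delete
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),  # destructive SQL
    re.compile(r"\b(shutdown|reboot|mkfs)\b"),                  # host-level damage
]

# Minimal telemetry: every decision is recorded, allowed or not.
audit_log: list[tuple[str, str]] = []


def gate(command: str, target: str, in_scope: set[str]) -> str:
    """Return 'deny', 'needs_approval' or 'allow' for a proposed action."""
    if target not in in_scope:
        decision = "deny"            # out-of-scope targets are refused outright
    elif any(p.search(command) for p in DESTRUCTIVE_PATTERNS):
        decision = "needs_approval"  # a human must confirm
    else:
        decision = "allow"
    audit_log.append((command, decision))
    return decision


scope = {"10.10.10.40"}
print(gate("nmap -sV -sC 10.10.10.40", "10.10.10.40", scope))
print(gate("rm -rf /var/www/html", "10.10.10.40", scope))
print(gate("curl -I http://10.10.10.99/", "10.10.10.99", scope))
```

The scope check runs before the pattern check on purpose: an out-of-scope target is refused regardless of how harmless the command looks.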

Reproducible PoC — PentestGPT v2 against HackTheBox

Open lab. Minimal setup on Linux with an OpenAI or Anthropic API account. The PoC reproduces the canonical flow against a retired easy HackTheBox machine.

# Setup
git clone https://github.com/GreyDGL/PentestGPT
cd PentestGPT
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

# Configure API key
export OPENAI_API_KEY="sk-..."

# Launch against HTB target with VPN already active
pentestgpt --reasoning_model gpt-4o --parsing_model gpt-4o-mini

The agent starts with a system prompt defining the role and loads an empty Pentesting Task Tree. The user injects the target:

> Target IP: 10.10.10.40 (HackTheBox machine "Cronos", retired)
> Initial recon: nmap -sV -sC 10.10.10.40
> 22/tcp open ssh OpenSSH 7.2p2
> 53/tcp open domain ISC BIND 9.10.3-P4
> 80/tcp open http Apache 2.4.18

Model response (abbreviated) — the reasoning module suggests:

PTT update:
  - Reconnaissance ✓
    └─ Web port 80 → enum
  - Vulnerability
    └─ Port 53 DNS → check zone transfer
    └─ Web Apache → CVE search
  - Suggested commands:
    1. dig axfr @10.10.10.40 cronos.htb
    2. curl http://10.10.10.40/ -I
    3. gobuster dir -u http://10.10.10.40/ -w /usr/share/wordlists/dirb/common.txt

The operator (human or automatic) executes the commands and returns the output; the agent integrates it into state and moves on. In practice this continues for ~40 turns on an easy machine. When it reaches exploitation, the model identifies patterns (SQL injection in a login form, command injection in a web parameter, file inclusion through an include parameter) and emits a concrete payload.
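What "integrates it into state" can look like for the recon output shown earlier: a small parser that turns open-port lines into structured records. This is an illustrative sketch, not PentestGPT's parsing module (which uses an LLM); `Service`, `LINE` and `parse_services` are hypothetical names, and the regex only handles the simple `port/proto open name version` shape.

```python
import re
from typing import NamedTuple


class Service(NamedTuple):
    port: int
    proto: str
    name: str
    version: str


# Matches lines such as "80/tcp open http Apache 2.4.18".
LINE = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)\s*(.*)$")


def parse_services(nmap_text: str) -> list[Service]:
    """Extract open services from nmap-style text, ignoring all other lines."""
    services = []
    for raw in nmap_text.splitlines():
        match = LINE.match(raw.strip())
        if match:
            port, proto, name, version = match.groups()
            services.append(Service(int(port), proto, name, version.strip()))
    return services


sample = """\
Starting scan
22/tcp open ssh OpenSSH 7.2p2
53/tcp open domain ISC BIND 9.10.3-P4
80/tcp open http Apache 2.4.18
"""

for svc in parse_services(sample):
    print(svc)
```

Structured records like these are cheaper to feed back into the next turn than the raw scan text, and they give the harness something it can deduplicate and check.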

What the PoC teaches that isn’t in the paper:

  • Real API cost: ~$3-5 per easy machine with GPT-4o. Climbs to $10-20 on a medium machine if the agent loops. Cheaper than a junior pentester per hour, but only if the agent doesn’t get stuck.
  • Recurring failure modes: the agent hallucinates non-existent web paths after a 404 (“there must be an /admin even if it returns 404”), repeats already-executed commands without checking prior context, and fails at pivoting between hosts because it doesn’t carry credentials consistently from one target to the next.
  • The “model knows the writeup” pattern: retired HTB machines have public writeups on blogs, and GPT-4o has seen some of them during pretraining. On easy machines the model sometimes solves “from memory”. The honest reading for evaluation: if the target is public, the benchmark is contaminated. XBOW avoids this by operating only against active bug bounty programmes.
  • Latency: the plan → execute → parse → plan cycle costs 10-30 seconds per turn with GPT-4o. 40 turns at that rate is roughly 7-20 minutes of effective “agent work”, plus human time to run commands if auto-execute isn’t enabled.
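The latency and cost figures above can be sanity-checked with a few lines of arithmetic. The inputs are the numbers quoted in the list (40 turns, 10-30 seconds per turn, $3-5 per easy machine); the per-turn cost is derived from them and is not a measured value. The function names are just for this sketch.

```python
def latency_minutes(turns: int, sec_per_turn: tuple[float, float]) -> tuple[float, float]:
    """Total model-side latency for a session, as a (low, high) range in minutes."""
    low, high = sec_per_turn
    return (turns * low / 60, turns * high / 60)


def cost_per_turn(session_cost_usd: tuple[float, float], turns: int) -> tuple[float, float]:
    """Implied average API cost per turn, as a (low, high) range in USD."""
    low, high = session_cost_usd
    return (low / turns, high / turns)


low_min, high_min = latency_minutes(40, (10, 30))
low_usd, high_usd = cost_per_turn((3, 5), 40)

print(f"latency: {low_min:.1f} to {high_min:.1f} minutes")
print(f"cost per turn: ${low_usd:.3f} to ${high_usd:.3f}")
```

At 40 turns this works out to about 6.7 to 20 minutes of model latency and roughly 7 to 13 cents per turn, so a looping agent that doubles its turn count doubles both.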

What changes for the blue team

If XBOW can report 1,060 vulnerabilities in 12 months against public bug bounty programmes, defending teams face three structural changes:

  1. The detection/exploitation ratio tightens. Historically, a manual attacker working 8 hours/day finds X vulns per unit of time. An attacker with XBOW or equivalent finds ~10-50X in the same window. The bug bounty programme’s human triage capacity falls short — XBOW generates more reports than a human team can review thoroughly. HackerOne, Bugcrowd and Intigriti are adapting their triage pipelines for this throughout 2025-2026.
  2. The “time between public disclosure and first production exploit” keeps falling. If an AI agent can read a CVE advisory and generate a working PoC in under 30 minutes against a similar surface, the patch window is no longer weeks, it’s hours. Patch + rotate stays the rule (already covered in the July 2025 bulletin for SharePoint ToolShell).
  3. AI red team becomes an enterprise service before an end-user product. The big consultancies (Mandiant, Bishop Fox, NCC Group, IOActive) integrate XBOW / PentestGPT / equivalents into their engagements from mid-2025. For the end client, delivery looks like a classic engagement but coverage is 10X and cost per vulnerability found drops sharply. The consultancy absorbs the delta, not the client directly.

What’s missing — the adjacent space

The defensive quadrant filled in partially during H2 2025 and H1 2026, but autonomous defence at XBOW’s offensive scale is still pending. Inventory as of May 2026:

  • Microsoft Security Copilot Agents (preview RSA 2025, GA late 2025, integrating GPT-5.5-Cyber from 23 Apr 2026) — agents for incident triage, KQL query generation, threat summarisation. Not end-to-end autonomous — the SOC analyst stays in the loop.
  • CrowdStrike Charlotte AI — similar profile, embedded in the Falcon dashboard, expansion to IR workflows in Q1 2026.
  • Anthropic Glasswing + Mythos (April 2026 bulletin) — first gated frontier model with mandatory harness. Published defensive use cases (audit pipelines, sandboxed forensics) but no product autonomy comparable to XBOW on offence. Glasswing’s autonomy is deliberately limited by design — closer to supervised AI copilot than to autonomous agent.
  • OpenAI GPT-5.5-Cyber — first commercial model trained specifically for blue team tasks. Available via API and embedded in Security Copilot Agents.
  • AWS Security Agent + Bedrock Guardrails Automated Reasoning — AWS re:Invent Dec 2025. Closer to policy-as-code than to agentic, but part of the same movement.

The open question for 2026-2027: which consultancy or vendor publishes the first defensive XBOW — autonomous agent operating a SOC at the level XBOW operates red team? The technical primitives are there (reasoning models, tool use, MCP, sandboxing, gated harness à la Glasswing). What’s missing is the canary CTF equivalent applied to defence: how we measure that the defensive agent is “doing its job well” without the luxury of a binary signal like did_find_canary. Until that metric appears, the defensive side stays at supervised AI copilot — and the asymmetry this post opens with (offence with autonomous agents, defence with assistants) holds.
