<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mathieu Bricou-Belouet, Author</title>
	<atom:link href="https://www.riskinsight-wavestone.com/en/author/mathieu-bricoubelouet/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.riskinsight-wavestone.com/author/mathieu-bricoubelouet/</link>
	<description>The cybersecurity &#38; digital trust blog by Wavestone&#039;s consultants</description>
	<lastBuildDate>Tue, 07 Apr 2026 17:54:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.riskinsight-wavestone.com/wp-content/uploads/2024/02/Blogs-2024_RI-39x39.png</url>
	<title>Mathieu Bricou-Belouet, Author</title>
	<link>https://www.riskinsight-wavestone.com/author/mathieu-bricoubelouet/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Agentic AI for Offensive Security</title>
		<link>https://www.riskinsight-wavestone.com/en/2026/04/agentic-ai-for-offensive-security/</link>
					<comments>https://www.riskinsight-wavestone.com/en/2026/04/agentic-ai-for-offensive-security/#respond</comments>
		
		<dc:creator><![CDATA[Mathieu Bricou-Belouet]]></dc:creator>
		<pubDate>Tue, 07 Apr 2026 14:43:09 +0000</pubDate>
				<category><![CDATA[Ethical Hacking & Incident Response]]></category>
		<category><![CDATA[Focus]]></category>
		<category><![CDATA[Agentic AI]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Hallucinations]]></category>
		<category><![CDATA[Autonomous Pentesting]]></category>
		<category><![CDATA[ctf]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[offensive security]]></category>
		<category><![CDATA[pentest]]></category>
		<category><![CDATA[Vulnerabilities]]></category>
		<category><![CDATA[Web pentesting]]></category>
		<guid isPermaLink="false">https://www.riskinsight-wavestone.com/?p=29693</guid>

					<description><![CDATA[<p>AI is now embedded across a growing range of offensive security workflows. The most visible shift is the rise of services that apply large language models and agentic orchestration to autonomous testing activity. Some vendors have been present for years,...</p>
<p>The article <a href="https://www.riskinsight-wavestone.com/en/2026/04/agentic-ai-for-offensive-security/">Agentic AI for Offensive Security</a> appeared first on <a href="https://www.riskinsight-wavestone.com/en/">RiskInsight</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p style="text-align: justify;">AI is now embedded across a growing range of offensive security workflows. The most visible shift is the rise of services that apply large language models and agentic orchestration to autonomous testing activity. Some vendors have been present for years, while others have emerged only recently, but the pace of change has clearly accelerated over the last six months.</p>
<p style="text-align: justify;">Commercial offerings include vendor-backed platforms such as Horizon3.ai / NodeZero, Pentera, XBOW, and RunSybil, while the open-source ecosystem includes projects such as Strix, Shannon, PentAGI, PentestGPT, and PentestAgent. Their positioning differs, but they all attempt to translate the adaptability of modern AI systems into concrete offensive security outcomes.</p>
<p style="text-align: justify;">The objective of this article is not to rank vendors. Instead, it is to clarify how agentic pentesting systems work, what technical prerequisites they require, and where their current limitations still prevent them from being treated as fully reliable autonomous testers.</p>
<p> </p>
<h2>A common architecture for agentic offensive testing</h2>
<p style="text-align: justify;">The current landscape is made up of heterogeneous tools with very different product strategies and target use cases: external web security testing, internal infrastructure and Active Directory reviews, cloud security assessments, or source-code analysis close to the CI/CD pipeline.</p>
<p style="text-align: justify;">In their best configurations, the strongest systems can now conduct autonomous static and dynamic security reviews with strong reasoning capabilities and a workflow that can, at times, resemble the analytical posture of a human pentester.</p>
<figure id="attachment_29694" aria-describedby="caption-attachment-29694" style="width: 1511px" class="wp-caption aligncenter"><img fetchpriority="high" decoding="async" class="size-full wp-image-29694" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/1-Example-of-autonomous-reasoning-and-tool-execution.png" alt="Example of autonomous reasoning and tool execution" width="1511" height="767" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/1-Example-of-autonomous-reasoning-and-tool-execution.png 1511w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/1-Example-of-autonomous-reasoning-and-tool-execution-376x191.png 376w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/1-Example-of-autonomous-reasoning-and-tool-execution-71x36.png 71w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/1-Example-of-autonomous-reasoning-and-tool-execution-768x390.png 768w" sizes="(max-width: 1511px) 100vw, 1511px" /><figcaption id="caption-attachment-29694" class="wp-caption-text"><em>Example of autonomous reasoning and tool execution</em></figcaption></figure>
<p style="text-align: justify;">Many of these tools are benchmarked internally or through capture-the-flag environments, as CTFs provide an observable way to compare reasoning depth, exploitation ability, and tool usage. Despite a wide range of architectures, the following essential building blocks are broadly consistent across most solutions:</p>
<figure id="attachment_29696" aria-describedby="caption-attachment-29696" style="width: 1837px" class="wp-caption aligncenter"><img decoding="async" class="size-full wp-image-29696" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/2-Standard-architecture-and-components-of-an-agentic-automated-pentesting-solution.png" alt="Standard architecture and components of an agentic automated pentesting solution" width="1837" height="561" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/2-Standard-architecture-and-components-of-an-agentic-automated-pentesting-solution.png 1837w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/2-Standard-architecture-and-components-of-an-agentic-automated-pentesting-solution-437x133.png 437w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/2-Standard-architecture-and-components-of-an-agentic-automated-pentesting-solution-71x22.png 71w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/2-Standard-architecture-and-components-of-an-agentic-automated-pentesting-solution-768x235.png 768w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/2-Standard-architecture-and-components-of-an-agentic-automated-pentesting-solution-1536x469.png 1536w" sizes="(max-width: 1837px) 100vw, 1837px" /><figcaption id="caption-attachment-29696" class="wp-caption-text"><em>Standard architecture and components of an agentic automated pentesting solution</em></figcaption></figure>
<ul>
<li style="text-align: justify;"><strong>An orchestrator: </strong>This layer coordinates parallel agents, handles freezes and timeouts, manages preconfigured workflows, and connects the other components into a coherent execution chain.</li>
<li style="text-align: justify;"><strong>An underlying LLM: </strong>The model acts as the cognitive core of the system, alternating between reasoning loops, tool invocation, and the creation of sub-agents when needed. Tool use is mandatory, and larger frontier models generally yield better results.</li>
<li style="text-align: justify;"><strong>An attack toolbox: </strong>Most platforms rely on a containerized toolkit broadly aligned with standard Kali-style capabilities. The exact content varies by use case, but web testing stacks are often relatively conventional. Many solutions also allow the agent to download additional tools or clone GitHub repositories dynamically when required.</li>
<li style="text-align: justify;"><strong>A set of skills or knowledge packs: </strong>These local libraries encode reusable expertise, including technology-specific attack techniques, pentester cheat sheets, standard exploitation workflows, and details related to newly disclosed vulnerabilities or attack patterns.</li>
</ul>
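<p style="text-align: justify;">To make the interplay between these components concrete, the following Python sketch shows a minimal orchestrator loop under stated assumptions: the LLM is stubbed by a scripted function, and the tool names and their outputs are purely illustrative, not drawn from any specific product.</p>

```python
import json
from dataclasses import dataclass, field

# Hypothetical toolbox: tool names and outputs are illustrative only.
def port_scan(target: str) -> str:
    return f"open ports on {target}: 80, 443"

def http_probe(target: str) -> str:
    return f"{target}:80 serves 'Example App v1.2'"

TOOLBOX = {"port_scan": port_scan, "http_probe": http_probe}

@dataclass
class Orchestrator:
    """Coordinates the LLM decision loop, tool execution, and a step budget."""
    llm: callable                 # stand-in for the underlying LLM
    max_steps: int = 10           # guardrail against runaway sessions
    transcript: list = field(default_factory=list)

    def run(self, target: str) -> list:
        for _ in range(self.max_steps):
            # The LLM returns its next action as structured JSON.
            action = json.loads(self.llm(self.transcript, target))
            if action["tool"] == "done":
                break
            result = TOOLBOX[action["tool"]](target)
            # Tool output is fed back into context for the next reasoning loop.
            self.transcript.append({"tool": action["tool"], "result": result})
        return self.transcript

# Scripted stand-in for the model: scan first, probe next, then stop.
def fake_llm(transcript, target):
    plan = ["port_scan", "http_probe", "done"]
    return json.dumps({"tool": plan[len(transcript)]})

session = Orchestrator(llm=fake_llm).run("app.internal.example")
```

<p style="text-align: justify;">A real orchestrator would add parallel agents, freeze and timeout handling, and sub-agent creation around this same loop, but the decide/execute/feed-back cycle is the common core.</p>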
<p style="text-align: justify;">This last layer is often where vendors can differentiate most clearly. Strong cyber monitoring, threat hunting, and cyber threat intelligence capabilities can continuously refresh the knowledge base and improve both adaptability and confidence in the actual coverage delivered by automated sessions.</p>
<p style="text-align: justify;">Because these agents can execute offensive actions against production-like environments, observability and governance are essential. Most serious implementations therefore include logging, telemetry, session replay, human approval steps for selected actions, and safeguards that distinguish lower-risk modules from more dangerous commands or exploit paths.</p>
<p style="text-align: justify;">A key distinction often blurred in vendor marketing: fully agentic systems use an LLM to drive the entire decision loop, while AI-assisted platforms apply AI only to specific steps (usually the hardest exploitation decisions) within an otherwise deterministic pipeline. Most commercial products today fall into the second category.</p>
<p> </p>
<h2>Efficiency case studies</h2>
<h3>Case study: CTF</h3>
<p style="text-align: justify;">To assess the current effectiveness of agentic pentesting, we benchmarked one such solution (Strix) using several different models against an internal set of Wavestone CTF challenges for which no public write-ups were available. The goal was not to compare products against each other, but rather to understand how model quality affects outcomes in a web security context.</p>
<p style="text-align: justify;">This choice of benchmark offers a useful signal because web exploitation combines broad topic coverage with varying levels of difficulty. At the same time, the exercise should not be over-generalized: it does not fully represent other contexts such as internal infrastructure testing or Active Directory assessments.</p>
<figure id="attachment_29698" aria-describedby="caption-attachment-29698" style="width: 1838px" class="wp-caption aligncenter"><img decoding="async" class="size-full wp-image-29698" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/3-Benchmark-of-several-LLMs-on-internal-CTF-challenges.png" alt="Benchmark of several LLMs on internal CTF challenges" width="1838" height="727" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/3-Benchmark-of-several-LLMs-on-internal-CTF-challenges.png 1838w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/3-Benchmark-of-several-LLMs-on-internal-CTF-challenges-437x173.png 437w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/3-Benchmark-of-several-LLMs-on-internal-CTF-challenges-71x28.png 71w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/3-Benchmark-of-several-LLMs-on-internal-CTF-challenges-768x304.png 768w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/3-Benchmark-of-several-LLMs-on-internal-CTF-challenges-1536x608.png 1536w" sizes="(max-width: 1838px) 100vw, 1838px" /><figcaption id="caption-attachment-29698" class="wp-caption-text"><em>Benchmark of several LLMs on internal CTF challenges</em></figcaption></figure>
<p style="text-align: justify;">Several conclusions emerged from this exercise:</p>
<ul style="text-align: justify;">
<li>The results become genuinely impressive only when the system is paired with a state-of-the-art model.</li>
<li>Conversely, models that can realistically run on a high-end consumer workstation still tend to produce mediocre offensive-testing performance, which often makes SaaS-based AI providers the sole effective solution today.</li>
<li>Even powerful models can miss exploitable weaknesses, while some still-large but less optimized models can underperform, potentially because Strix was not designed and tuned with them in mind.</li>
<li>Smaller models occasionally show flashes of insight and solve challenges that stronger models miss.</li>
<li>A broad tendency remains for models to hallucinate paths to exploitation, especially when they reach a dead end. In CTF settings this often manifests as fabricated flags rather than validated solutions.</li>
<li>To avoid polluting their context with large volumes of data, agents tend to heavily truncate data (such as web pages or codebase files) and to be overly specific when using “grep” or “find” during searches. In both cases, this behavior can restrict their coverage of the scope and their overall efficiency.</li>
</ul>
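<p style="text-align: justify;">The truncation behavior described above can be illustrated with a small sketch (the context budget and keyword list are hypothetical): naive head-truncation of a tool&#8217;s output can drop exactly the line that matters, whereas a keyword-aware condensation step preserves it within the same budget.</p>

```python
# Illustrative sketch of the trade-off: hard truncation of tool output
# saves context tokens but can drop the very lines that matter.
MAX_CHARS = 200  # hypothetical per-tool context budget

def truncate_head(output: str, limit: int = MAX_CHARS) -> str:
    """Naive approach: keep only the first `limit` characters."""
    return output[:limit]

def condense(output: str, keywords: list, limit: int = MAX_CHARS) -> str:
    """Keyword-aware approach: keep lines matching indicators of interest
    before filling the remaining budget with the head of the output."""
    hits = [l for l in output.splitlines()
            if any(k in l.lower() for k in keywords)]
    return ("\n".join(hits) + "\n" + output)[:limit]

# A long page where the interesting signal sits at the very end.
page = "\n".join(f"line {i}: filler content" for i in range(100))
page += "\nline 100: admin panel at /admin (default credentials)"

assert "admin" not in truncate_head(page)                # signal lost
assert "admin" in condense(page, ["admin", "password"])  # signal kept
```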
<p style="text-align: justify;">These results should be interpreted cautiously. For each model and each challenge, the benchmark was limited to at most two runs. In several cases, a model was very close to the solution before hallucinating the final step, or required human steering to close the investigation. Typically, those cases could plausibly be recovered in a real-world workflow that includes human review.</p>
<p style="text-align: justify;">The best benchmark results were obtained with frontier proprietary models. In our observations, these models can solve a substantial portion of constrained offensive tasks while remaining operationally affordable, at least as long as sessions converge quickly.</p>
<figure id="attachment_29700" aria-describedby="caption-attachment-29700" style="width: 1590px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="size-full wp-image-29700" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/4-Performance-of-a-frontier-model-and-key-consumption-metrics.png" alt="Performance of a frontier model and key consumption metrics" width="1590" height="899" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/4-Performance-of-a-frontier-model-and-key-consumption-metrics.png 1590w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/4-Performance-of-a-frontier-model-and-key-consumption-metrics-338x191.png 338w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/4-Performance-of-a-frontier-model-and-key-consumption-metrics-69x39.png 69w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/4-Performance-of-a-frontier-model-and-key-consumption-metrics-768x434.png 768w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/4-Performance-of-a-frontier-model-and-key-consumption-metrics-1536x868.png 1536w" sizes="auto, (max-width: 1590px) 100vw, 1590px" /><figcaption id="caption-attachment-29700" class="wp-caption-text"><em>Performance of a frontier model and key consumption metrics</em></figcaption></figure>
<p> </p>
<figure id="attachment_29702" aria-describedby="caption-attachment-29702" style="width: 1579px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="size-full wp-image-29702" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/5-Performance-of-an-alternative-frontier-model-and-key-consumption-metrics.png" alt="Performance of an alternative frontier model and key consumption metrics" width="1579" height="899" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/5-Performance-of-an-alternative-frontier-model-and-key-consumption-metrics.png 1579w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/5-Performance-of-an-alternative-frontier-model-and-key-consumption-metrics-335x191.png 335w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/5-Performance-of-an-alternative-frontier-model-and-key-consumption-metrics-68x39.png 68w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/5-Performance-of-an-alternative-frontier-model-and-key-consumption-metrics-768x437.png 768w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/5-Performance-of-an-alternative-frontier-model-and-key-consumption-metrics-1536x875.png 1536w" sizes="auto, (max-width: 1579px) 100vw, 1579px" /><figcaption id="caption-attachment-29702" class="wp-caption-text"><em>Performance of an alternative frontier model and key consumption metrics</em></figcaption></figure>
<p>What these figures show:</p>
<ul>
<li style="text-align: justify;">Per-challenge cost can remain relatively modest, on the order of a few euros when the agent converges efficiently.</li>
<li style="text-align: justify;">Execution can be surprisingly fast, with many CTFs solved in less than five minutes when the model identifies the relevant path early.</li>
<li style="text-align: justify;">Failure is expensive. Without strict guardrails on duration and budget, token consumption can increase dramatically over the course of a few hours.</li>
<li style="text-align: justify;">In our own setup, solve rates between top-tier commercial models were close, but efficiency varied substantially in time, token consumption, and number of tool invocations. Surprisingly, despite Sonnet&#8217;s higher per-token price, overall session costs were comparable to GPT-5&#8217;s: Anthropic&#8217;s model compensated through greater token efficiency.</li>
</ul>
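<p style="text-align: justify;">The &#8220;failure is expensive&#8221; point argues for explicit guardrails around every session. Below is a minimal sketch of such a guard, assuming a purely illustrative per-token price: it aborts a run once token spend or wall time exceeds preset limits.</p>

```python
import time

# Hypothetical pricing; real per-token prices vary by provider and model.
PRICE_PER_1K_TOKENS = 0.01  # USD, illustrative

class BudgetExceeded(Exception):
    pass

class SessionGuard:
    """Aborts an agentic session when token spend or wall time exceeds
    preset guardrails, capping the cost of runs that fail to converge."""
    def __init__(self, max_cost_usd: float, max_seconds: float):
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.spent_tokens = 0
        self.started = time.monotonic()

    @property
    def cost_usd(self) -> float:
        return self.spent_tokens / 1000 * PRICE_PER_1K_TOKENS

    def charge(self, tokens: int) -> None:
        """Record token usage after each model call and enforce limits."""
        self.spent_tokens += tokens
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost {self.cost_usd:.2f} USD over budget")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("session time limit reached")

guard = SessionGuard(max_cost_usd=5.0, max_seconds=3600)
guard.charge(100_000)        # a converging run stays well under budget
try:
    guard.charge(2_000_000)  # a runaway run trips the guard
except BudgetExceeded:
    stopped = True
```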
<p> </p>
<h3 style="text-align: justify;">Case study: a real web application</h3>
<p style="text-align: justify;">To complement the CTF benchmarks, we also tested one of our internally developed web applications (used for staffing and performance management). The system was assessed with several approaches, including authenticated modes in which the agent is provided with credentials or tokens.</p>
<p style="text-align: justify;">In one representative pentesting session, 25 agents were deployed and 366 tool calls were executed, for a total cost of around USD 5 over roughly one hour. The resulting automatically generated report included an executive summary, an OWASP-oriented methodology section, technical findings with CVSS v3 scoring, and a prioritized remediation roadmap.</p>
<figure id="attachment_29704" aria-describedby="caption-attachment-29704" style="width: 706px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="size-full wp-image-29704" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/6-Agent-hierarchy-spawned-during-an-automated-security-review.png" alt="Agent hierarchy spawned during an automated security review" width="706" height="771" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/6-Agent-hierarchy-spawned-during-an-automated-security-review.png 706w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/6-Agent-hierarchy-spawned-during-an-automated-security-review-175x191.png 175w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/6-Agent-hierarchy-spawned-during-an-automated-security-review-36x39.png 36w" sizes="auto, (max-width: 706px) 100vw, 706px" /><figcaption id="caption-attachment-29704" class="wp-caption-text"><em>Agent hierarchy spawned during an automated security review</em></figcaption></figure>
<p style="text-align: justify;">The outputs were mixed, but broadly informative after human review and retesting:</p>
<ul style="text-align: justify;">
<li>The agent surfaced several relevant minor improvement areas, although findings were not always well contextualized and could become overly alarmist.</li>
<li>One critical miss, however: the agent completely overlooked an exposed admin interface with default credentials, a vulnerability no human pentester would miss. This illustrates the reliability ceiling of current autonomous systems.</li>
<li>The report also included a non-existent vulnerability candidate, JWT algorithm confusion, rated as critical, along with proof-of-exploit scripts that did not succeed in practice. This illustrates the persistent false-positive risk of autonomous systems.</li>
</ul>
<p style="text-align: justify;">Additional remarks:</p>
<ul style="text-align: justify;">
<li>As with the CTF benchmarks, the quality of the review improved significantly when using a frontier-grade model.</li>
<li>The non-deterministic nature of generative models remains visible: two runs can produce substantially different findings and reports against the same target.</li>
<li>If prompting and scope controls are insufficient, some models attempt to expand the scope of the assessment by probing adjacent ports, applications, or subdomains.</li>
<li>Coverage and relevance improve markedly in white-box or hybrid white-box/grey-box modes, where the agent can inspect the codebase, identify candidate weaknesses, and then attempt to validate them dynamically on the live application. Even then, some agents can still fixate on non-existent issues, and in white-box mode very large codebases may saturate the system and reduce overall efficiency.</li>
<li>Browser-driven interactions have progressed, yet some application types remain difficult to assess autonomously, especially multi-window or thick-client environments where headless browser interaction may not be enough.</li>
<li>These systems rarely build a deep understanding of business logic. Their outputs remain strongly aligned with generic OWASP-style patterns and may not challenge the real business risk or abuse scenarios in a sufficiently contextual way.</li>
</ul>
<p style="text-align: justify;">It should be noted that the majority of these criticisms can also apply to human pentesters, who nonetheless remain more easily held accountable.</p>
<p style="text-align: justify;">The scaling problem remains central. CTFs are only partially representative of real applications. While a CTF typically channels the tester toward a narrow and deliberate attack path, even a modest business application exposes a much broader surface. Today, guaranteeing exhaustiveness while avoiding fixation on irrelevant endpoints remains difficult.</p>
<p> </p>
<h2>Verdict and current limitations</h2>
<h3>Verdict</h3>
<p style="text-align: justify;">For solutions that rely entirely on a general-purpose LLM for their decision tree, the conclusion is clear at the present time: only frontier-grade models from major AI providers consistently deliver results that are both relevant and reasonably verifiable.</p>
<p style="text-align: justify;">Considering four practical deployment options:</p>
<ul>
<li style="text-align: justify;">SaaS LLM services: currently the highest-quality option, leveraging very large frontier models (&gt;1T parameters) billed per use. The main drawback is data sovereignty: all prompts and findings leave your environment.</li>
<li style="text-align: justify;">Large private datacenter deployments, which can run powerful models (around 500B parameters) and may become increasingly relevant for pentesting, but may still remain materially below the best commercial frontier systems.</li>
<li style="text-align: justify;">Small private datacenter deployments, which can run capable models (around 300B parameters) that are clearly not sufficient to orchestrate autonomous pentests efficiently.</li>
<li style="text-align: justify;">Dedicated workstations, which, even with very strong specifications, quickly struggle above 100B parameters and remain far from sufficient today.</li>
</ul>
<figure id="attachment_29706" aria-describedby="caption-attachment-29706" style="width: 1716px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="size-full wp-image-29706" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/7-Illustrative-distribution-of-open-source-local-models-by-number-of-parameters-and-total-size.png" alt="Illustrative distribution of open-source local models by number of parameters and total size" width="1716" height="924" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/7-Illustrative-distribution-of-open-source-local-models-by-number-of-parameters-and-total-size.png 1716w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/7-Illustrative-distribution-of-open-source-local-models-by-number-of-parameters-and-total-size-355x191.png 355w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/7-Illustrative-distribution-of-open-source-local-models-by-number-of-parameters-and-total-size-71x39.png 71w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/7-Illustrative-distribution-of-open-source-local-models-by-number-of-parameters-and-total-size-768x414.png 768w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/7-Illustrative-distribution-of-open-source-local-models-by-number-of-parameters-and-total-size-1536x827.png 1536w" sizes="auto, (max-width: 1716px) 100vw, 1716px" /><figcaption id="caption-attachment-29706" class="wp-caption-text"><em>Illustrative distribution of open-source local models by number of parameters and total size</em></figcaption></figure>
<p style="text-align: justify;">The dependence on SaaS providers raises unavoidable sovereignty and confidentiality questions. Offensive security assessments often consolidate highly sensitive technical information about an organization’s weaknesses. Any externalization of prompts, traces, findings, or attack hypotheses therefore requires careful governance. Data anonymization before the LLM step may not be a reliable mitigation: it can decrease the efficiency of the run while still sharing exploitable metadata with SaaS suppliers.</p>
<p style="text-align: justify;">In their current state, even equipped with the most capable LLMs, these systems also exhibit structural limitations that directly affect reliability:</p>
<ul>
<li style="text-align: justify;">Instances of “tunnel vision”, with prolonged fixation on a single irrelevant attack path.</li>
<li style="text-align: justify;">A tendency to launch time-consuming brute-force activities without a sound appreciation of computational complexity or cost.</li>
<li style="text-align: justify;">Persistent hallucinations: despite significant progress, even frontier models still fabricate findings, exploit paths, or flag non-existent vulnerabilities, as shown in the JWT confusion example.</li>
</ul>
<figure id="attachment_29708" aria-describedby="caption-attachment-29708" style="width: 1511px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="size-full wp-image-29708" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/8-Easy-capability-to-hallucinate-or-misinterpret-results-here-with-kimi-k2-.png" alt="Easy capability to hallucinate or misinterpret results, here with kimi-k2" width="1511" height="334" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/8-Easy-capability-to-hallucinate-or-misinterpret-results-here-with-kimi-k2-.png 1511w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/8-Easy-capability-to-hallucinate-or-misinterpret-results-here-with-kimi-k2--437x97.png 437w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/8-Easy-capability-to-hallucinate-or-misinterpret-results-here-with-kimi-k2--71x16.png 71w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/8-Easy-capability-to-hallucinate-or-misinterpret-results-here-with-kimi-k2--768x170.png 768w" sizes="auto, (max-width: 1511px) 100vw, 1511px" /><figcaption id="caption-attachment-29708" class="wp-caption-text"><em>Easy capability to hallucinate or misinterpret results, here with kimi-k2</em></figcaption></figure>
<ul>
<li style="text-align: justify;">The non-deterministic nature of LLMs, which makes some runs far less efficient and relevant than others.</li>
<li style="text-align: justify;">A scaling problem tied to context-window constraints: the approach “scales” in the sense that many parallel sessions can be launched against many targets, but a single session against a single highly complex application scales poorly, as it becomes much harder to maintain exhaustive coverage and memory continuity across large, content-rich applications. Large improvements can be achieved on this front: efficient long-term memory management would allow more coherent runs on large applications and improve coverage.</li>
<li style="text-align: justify;">High verbosity and limited stealth, which make these systems poorly suited in their default form for red-team style end-to-end scenarios that require discretion and tradecraft. This can be improved through dedicated configuration, without, however, equaling human capabilities.</li>
</ul>
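<p style="text-align: justify;">The hallucination risk above suggests one mitigation pattern: never promote a candidate finding to the report unless its proof-of-exploit actually reproduces. A minimal sketch, with the reproduction check stubbed by an illustrative evidence flag:</p>

```python
# Minimal sketch of guarding against fabricated findings: a claimed finding
# is only promoted to the report if its proof-of-exploit reproduces against
# the live target. The reproduction check is a hypothetical stand-in.
def reproduce(finding: dict) -> bool:
    """Replay the proof-of-exploit; stubbed here by an 'evidence' flag.
    A real validation instance would rerun the exploit in a controlled
    environment and compare observed behavior against the claim."""
    return finding.get("evidence") == "verified-response"

def promote_findings(candidates: list) -> list:
    """Keep only findings whose exploit replays successfully."""
    return [f["title"] for f in candidates if reproduce(f)]

candidates = [
    {"title": "SQL injection on /search", "evidence": "verified-response"},
    # Hallucinated finding: the exploit script never actually succeeded.
    {"title": "JWT algorithm confusion", "evidence": None},
]

assert promote_findings(candidates) == ["SQL injection on /search"]
```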
<p style="text-align: justify;">And from a higher standpoint, an autonomous SaaS-run process able to remotely execute commands in your information system raises the issue of accountability from the outset:</p>
<ul style="text-align: justify;">
<li>Classifying tools as dangerous versus safe may not be enough, for instance with Swiss-army toolsets capable of both the most innocuous reconnaissance and aggressive, potentially damaging exploits. Threat level should be assessed dynamically, taking the context and previous tests into account.</li>
<li>Even then, pausing the tests to request human approval may lead to the same “developer fatigue” seen with coding agents, where users become too trusting and stop critically challenging the agent’s conclusions.</li>
</ul>
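<p style="text-align: justify;">Dynamic threat assessment of commands, as opposed to a static safe/dangerous tool list, can be sketched as follows. The risk patterns, context keys, and approval callback are all hypothetical placeholders for what a real implementation would derive from session history.</p>

```python
# Illustrative sketch: command risk is assessed dynamically from the command
# itself and session context, rather than from a static tool allow-list.
RISKY_PATTERNS = ("rm -rf", "mimikatz", "exploit", "--force", "drop table")

def assess_risk(command: str, context: dict) -> str:
    """Return 'low', 'high', or 'critical' given the command and what the
    session has already established (hypothetical context keys)."""
    if any(p in command.lower() for p in RISKY_PATTERNS):
        # The same exploit attempt is graded more severely once the agent
        # holds credentials, since the potential impact is higher.
        return "critical" if context.get("has_credentials") else "high"
    return "low"

def execute(command: str, context: dict, approve: callable) -> str:
    """Gate anything above low risk behind a human approval callback."""
    risk = assess_risk(command, context)
    if risk != "low" and not approve(command, risk):
        return "blocked"
    return "executed"

# A human approval callback that refuses critical actions.
deny_critical = lambda cmd, risk: risk != "critical"

assert execute("nmap -sV host", {}, deny_critical) == "executed"
assert execute("run exploit x", {"has_credentials": True},
               deny_critical) == "blocked"
```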
<p style="text-align: justify;">And of course, any vulnerability at the LLM level, such as susceptibility to prompt injection or poisoning, could be leveraged to hijack the automated pentest workflow. In essence, these autonomous tools, if deployed internally, should be regarded as critical assets of high value to attackers.</p>
<p> </p>
<h3>Where the architecture can improve</h3>
<p style="text-align: justify;">Beyond model quality itself, a substantial part of the improvement space lies in the overall system design. Several architectural directions already appear promising:</p>
<ul style="text-align: justify;">
<li>Multiply sessions and validation passes, using continuous exploration, focused zoom-in phases, and explicit confirmation loops for candidate findings. This improves reliability but increases cost and duration.</li>
<li>Precede the autonomous phase with scripted tests and deterministic reconnaissance, then feed those structured outputs to the agent. This is far more cost-efficient than spending LLM context and tokens on tasks that are already easy to automate without AI. The core principle should be simple: do not use AI where conventional automation already performs well. Delegate only the genuinely ambiguous, adaptive, or investigative parts of the workflow to the LLM, and avoid overloading the model with unnecessary command history and context noise.</li>
<li>Introduce dedicated validation instances to confirm exploitability in a controlled environment before findings are promoted to a report.</li>
<li>Use leaner decision trees or specialized modules upstream of exploitation, reserving high-end models only for the parts of the workflow that truly require adaptability and reasoning.</li>
</ul>
<p style="text-align: justify;">In practice, this last point is already the direction taken by many vendor platforms. They do not rely entirely on agentic AI; instead, they combine deterministic security logic with agentic exploitation only when potential weaknesses have already been narrowed down.</p>
<figure id="attachment_29710" aria-describedby="caption-attachment-29710" style="width: 1854px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="size-full wp-image-29710" src="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/9-Potential-multi-step-architecture-designed-to-improve-result-reliability-and-reduce-unnecessary-model-load.png" alt="Potential multi-step architecture designed to improve result reliability and reduce unnecessary model load" width="1854" height="798" srcset="https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/9-Potential-multi-step-architecture-designed-to-improve-result-reliability-and-reduce-unnecessary-model-load.png 1854w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/9-Potential-multi-step-architecture-designed-to-improve-result-reliability-and-reduce-unnecessary-model-load-437x188.png 437w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/9-Potential-multi-step-architecture-designed-to-improve-result-reliability-and-reduce-unnecessary-model-load-71x31.png 71w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/9-Potential-multi-step-architecture-designed-to-improve-result-reliability-and-reduce-unnecessary-model-load-768x331.png 768w, https://www.riskinsight-wavestone.com/wp-content/uploads/2026/04/9-Potential-multi-step-architecture-designed-to-improve-result-reliability-and-reduce-unnecessary-model-load-1536x661.png 1536w" sizes="auto, (max-width: 1854px) 100vw, 1854px" /><figcaption id="caption-attachment-29710" class="wp-caption-text"><em>Potential multi-step architecture designed to improve result reliability and reduce unnecessary model load</em></figcaption></figure>
<p style="text-align: justify;">Lastly, an interesting thought: as such automated solutions may be used by real attackers, we may see “anti-AI” mechanisms included in applications and endpoints, such as “link labyrinths” and token-draining honeypots designed specifically to mislead or exhaust automated testing systems.</p>
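<p style="text-align: justify;">A minimal, purely illustrative sketch of such a “link labyrinth” (the route and fanout are assumptions, not a real defense product): every path deterministically yields a page of further generated links, so an agent that follows them burns requests and tokens without ever reaching real content.</p>

```python
# Hypothetical "link labyrinth" sketch: each generated page links only to
# more generated pages, exhausting an automated crawler or LLM agent.
import hashlib

def labyrinth_page(path, fanout=3):
    """Return pseudo-HTML containing `fanout` generated child links for `path`."""
    links = []
    for i in range(fanout):
        # Deterministic child token, so the maze is stable across requests.
        token = hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:12]
        links.append(f'<a href="/maze/{token}">{token}</a>')
    return "<html><body>" + "".join(links) + "</body></html>"

page = labyrinth_page("/maze/start")
```
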
<p style="text-align: justify;"> </p>
<p style="text-align: justify;">With strong enough models, agentic systems can already excel in constrained environments such as CTFs. Their performance in real application assessments is more mixed: often useful, sometimes impressive, but still too inconsistent to be trusted without human oversight.</p>
<p style="text-align: justify;">The most pragmatic path today is therefore a hybrid operating model: an agentic system carrying out the majority of the tests and suggesting investigation leads, supported by human pentesters who arbitrate, validate, and take over in the most complex cases. The result is a security assessment that is significantly shorter, while still guaranteeing a degree of coverage and relevance in the findings.</p>
<p style="text-align: justify;">Agentic AI is not a replacement for human pentesters, not yet. At its current level of maturity, it is better understood as a force multiplier, one that can accelerate exploration and triage, but that still depends on expert supervision to turn raw autonomous activity into trustworthy security outcomes. In any case, these systems should also be treated as highly sensitive because of their autonomous nature, and the constraints of SaaS-hosted models should be weighed in terms of data confidentiality and digital sovereignty.</p>
<p style="text-align: justify;">Despite not being fully mature yet, these solutions are beginning to leave a mark on the cybersecurity landscape, and will most likely alter the trajectory of the pentesting market toward an ecosystem more centered on tools and compute, while preserving a hybrid approach. We might even see audits following a “Bring Your Own Compute” model, where auditees provide their own LLM and the auditors provide custom tools and skills.</p>
<p> </p>
<p>The article <a href="https://www.riskinsight-wavestone.com/en/2026/04/agentic-ai-for-offensive-security/">Agentic AI for Offensive Security</a> first appeared on <a href="https://www.riskinsight-wavestone.com/en/">RiskInsight</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.riskinsight-wavestone.com/en/2026/04/agentic-ai-for-offensive-security/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
