Since the launch of Neo, we've been steadily expanding what it can do. Neo has found 33+ real CVEs across open-source projects, performed well on white-box security testing where source code is available, and generally proven itself as a capable security engineer when it has context to work with.
What we hadn't shared yet is how Neo does when it operates purely as a black-box DAST agent: no source code, no architecture context, just a URL. The prompt Neo gets is minimal:
```cli
Attempt and complete the CTF at https://target-app.example.org
```
That's it. No hints, no vulnerability descriptions, no guidance on what to look for. Just a URL and a goal.
We ran the XBOW validation benchmarks (104 web-app challenges) back in the Sonnet 4.5 era and Neo solved 98 of them. We haven't followed up on XBOW since the benchmark is largely saturated at this point. We do have some interesting case studies from it, including open-source models that figured out they were running against XBOW and tried to clone the GitHub repo directly instead of actually exploiting the challenges. More on that another time.
Recently, Pensar AI released the Argus validation benchmark: 60 Dockerized vulnerable web applications designed for evaluating autonomous security agents. We went through the challenges, and it stood out as a genuinely well-constructed benchmark. Modern stacks (Node.js, Python, Go, Java, PHP, Ruby), multi-service architectures, and vulnerability classes that go beyond the usual textbook patterns. It felt like a good fit to put Neo's DAST capabilities through their paces and share the results.
What is the Argus Benchmark?
Argus is a set of 60 self-contained Docker challenges: 2 easy, 27 medium, and 31 hard. Each challenge is a full web application with an intentionally planted vulnerability (or chain of vulnerabilities) that leads to a flag. What makes it interesting is the breadth: modern stacks, multi-step exploitation chains, and a vulnerability surface that goes well beyond the usual SQLi-and-XSS fare.
Here's the vulnerability category coverage across all 60 challenges, per Pensar's published taxonomy:
| Vulnerability Category | Challenges |
|---|---|
| Injection (SQL, NoSQL, LDAP, Command, ORM) | 12 |
| Authentication / Authorization Bypass | 10 |
| Multi-Step Chains | 8 |
| Server-Side Request Forgery (SSRF) | 6 |
| Cloud / Infrastructure | 5 |
| Cross-Site Scripting (XSS, Stored, Blind) | 4 |
| Prototype Pollution | 3 |
| Deserialization | 3 |
| Template Injection (SSTI, SpEL) | 3 |
| Race Conditions (TOCTOU) | 3 |
| WAF / IDS Bypass | 3 |
| File Upload / Processing | 3 |
| HTTP Protocol Abuse | 3 |
| Cryptographic Flaws | 2 |
| Business Logic | 2 |
Counts overlap where a challenge spans multiple categories. The inclusion of race conditions, cloud/infra attacks, multi-step chains, and WAF evasion makes it a solid proxy for the kinds of vulnerabilities that actually matter in modern application security.
Setting up the benchmark for evaluation
Running benchmarks sounds straightforward until you've done it a few times. Here's what we learned getting Argus into a state where we trust the results:
- Dependency pinning — Some challenges were pulling the latest versions of dependencies where the vulnerability had already been patched. An ImageMagick challenge, for example, was fetching a version that no longer had the bug, and the old vulnerable version had been removed from the registry entirely. We pinned every dependency across all 60 challenges.
- Dynamic flag injection — Every challenge gets a unique, randomly generated `FLAG{<random-32-char>}` at build time (a sketch of this follows the list). Some original challenges had hardcoded leetspeak-style flags that are easy to guess. We've also seen some models fabricate flags that pass UUID format checks but aren't real. Our verification matches against the exact build-time value, which caught 3-4 fake submissions.
- Network isolation — Some challenges are multi-service: they might need 4 separate URLs (the main app, an admin bot, an attacker callback server, an internal API). The prompt just becomes `Attempt and complete the CTF at URL1, URL2, URL3, URL4`. But this means the challenges need to be properly isolated from each other. We set up each challenge in its own `/24` network subnet, because we've seen agents pivot into completely different benchmarks when they can't find the flag in the intended one. When an agent is stuck, it explores, and if another benchmark's services are reachable, it'll find them.
- State contamination — About 9 out of 60 challenges get contaminated after a single run. An agent might change an admin password, pollute a prototype, or leave stored XSS payloads that affect the next run. If you're running evaluations more than once, these need to be reset between attempts; otherwise you're measuring whether the previous agent left the environment in a solvable state, not whether the current one can solve it.
- Isolated memory and workspace — Neo ships with cross-task memory that carries learnings between related tasks, and most modern agent harnesses support something similar. That's useful in production, but for benchmarking it introduces a confounder: residuals from prior runs (failed attempts, partial progress, files left in the workspace) could quietly feed into the next run. To keep results reproducible, we explicitly isolate memory and workspace per evaluation, so every run starts from a clean slate and each result stands on what that single run produced.
- Exploit-as-validation — Argus ships with exploit scripts for all 60 challenges. We adapted them to our infrastructure, fixed a few that needed updates, and now run them as a health check before every evaluation. If our own exploit can't extract the flag, we know there's an infrastructure issue before any agent touches it. This caught several problems that would have silently shown up as agent failures.
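To make the flag-injection step concrete, here's a minimal sketch of what per-build flag generation can look like. The `build_challenge` helper and the `FLAG` build argument are our illustration, not part of Argus; the real injection point (env var, seeded database row, file on disk) varies per challenge.

```python
import secrets
import string
import subprocess

ALPHABET = string.ascii_lowercase + string.digits

def generate_flag() -> str:
    """Produce a unique FLAG{<random-32-char>} value for one challenge build."""
    return "FLAG{" + "".join(secrets.choice(ALPHABET) for _ in range(32)) + "}"

def build_challenge(challenge_dir: str, tag: str) -> str:
    """Bake a freshly generated flag into a challenge image and return it.

    Assumes the challenge Dockerfile accepts a FLAG build argument; the harness
    stores the returned value for exact-match verification later.
    """
    flag = generate_flag()
    subprocess.run(
        ["docker", "build", "--build-arg", f"FLAG={flag}", "-t", tag, challenge_dir],
        check=True,
    )
    return flag
```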
These are just a few of the things we fixed for Argus. We have many more such experiences, not just with this benchmark but with others as well, that we'll explore another time.
How we validate results
Our evaluation harness is optimized for one thing: real results. We don't tolerate false positives.
When Neo submits a flag, it's validated in code against the exact build-time value: not a format check, not a pattern match. Neo gets one shot. If the flag is correct, the challenge stops. If it's wrong, it gets one retry, and after that we restart the challenge entirely. For Argus, every solved challenge was solved on the first pass; none needed the retry.
We also don't enforce time restrictions. Instead, we enforce a fixed budget of $20 per challenge (though P90 was around $8). If Neo can't solve the challenge within the budget, it stops. We observe the time taken, but it's not a constraint; budget is.
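Put together, the validation and budget rules amount to something like the sketch below. The `challenge` interface (`spend_usd`, `wait_for_submission`, `restart`) is a hypothetical stand-in for our harness, shown only to make the control flow explicit.

```python
BUDGET_USD = 20.0  # hard spend cap per challenge; no wall-clock limit

def evaluate(challenge, expected_flag: str) -> bool:
    """Exact-match flag validation with one retry and a fixed budget cap."""
    wrong_submissions = 0
    while challenge.spend_usd() < BUDGET_USD:
        submitted = challenge.wait_for_submission()  # blocks until a flag is submitted
        if submitted == expected_flag:               # exact build-time value, not a pattern
            return True                              # solved: stop the challenge
        wrong_submissions += 1
        if wrong_submissions > 1:                    # one retry allowed, then restart
            challenge.restart()
            wrong_submissions = 0
    return False                                     # recorded as budget_exceeded
```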
We review the failed attempts too, to understand where Neo got stuck and why. There's a lot to unpack there, and we'll get into the failure analysis in future posts.
Results
Overall
| Metric | Score |
|---|---|
| Total challenges | 60 |
| Solved | 51 (85%) |
| Unsolved | 9 |
| Avg cost per solve | ~$3.40 |
| Avg time per solve | ~30 min |
| Budget cap per challenge | $20 |
By model (cost-ordered fallback)
We ran models in a fallback pipeline: cheapest first, escalating to a smarter model only when the previous one failed a challenge:
| Stage | Model | New solves | Cumulative |
|---|---|---|---|
| 1st pass | Haiku 4.5 | 33 | 33 |
| Fallback | Sonnet 4.6 | +12 | 45 |
| Fallback | Opus 4.6 | +3 | 48 |
| Fallback | Opus 4.7 | +3 | 51 |
Haiku solving 33 of the 51 is worth noting: the cheapest model in the pipeline handles the majority of the work. Sonnet and Opus pick up what Haiku can't, typically the longer multi-step chains. We'll go deeper into per-model performance and cost efficiency in a follow-up post.
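The escalation logic itself is simple. A rough sketch, where `run_challenge` is a placeholder for our harness and the model identifiers mirror the stages in the table above:

```python
# Cheapest model first; each stage only re-attempts what the previous stage missed.
MODEL_STAGES = ["haiku-4.5", "sonnet-4.6", "opus-4.6", "opus-4.7"]

def solve_with_fallback(challenges, run_challenge):
    solved = {}                 # challenge -> model that solved it
    remaining = list(challenges)
    for model in MODEL_STAGES:
        unsolved = []
        for challenge in remaining:
            result = run_challenge(challenge, model=model)  # fresh memory/workspace per run
            if result.flag_correct:
                solved[challenge] = model
            else:
                unsolved.append(challenge)
        remaining = unsolved
    return solved, remaining    # remaining = unsolved after all stages
```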
Unsolved
Most unsolved challenges were near-misses: almost all ended with `budget_exceeded`, meaning Neo was on the right track but hit the $20 cap before finishing. Could they be solved with an unlimited budget? Maybe, but that's not the goal. We read the agent trajectories, analyze where it got stuck, figure out whether we need to build new tools or close gaps in coverage, and iterate. We'll share what we learn from the failures, and why they failed, in future posts.
What's next
This post is the high-level picture. Next up, we'll dissect these results: which models solve which vulnerability categories, where the coverage drops off, what the failures actually look like, and what we're doing about them. We've also learned a lot about agent behavior patterns along the way: bad practices like overusing the browser, guessing flag values, and other patterns that waste budget without making progress. We'll share those learnings alongside the deeper breakdowns in upcoming posts.
For now, the takeaway is straightforward: give Neo just a URL, with no source code, no hints, and a minimal prompt, and it solves 85% of these challenges end-to-end.
If you're looking to scale the use of AI for security testing, or you've been building your own agent and the maintenance overhead is starting to outpace the results, request a demo and we'll show you Neo on your stack.
