Why frontier LLMs failed a new cyber defence benchmark

by Black Hat Middle East and Africa

on 26 May 2026

LLMs are very good at sounding like security analysts. Ask one to explain credential dumping or map LSASS activity to MITRE ATT&CK, and it’ll give you a polished answer.

But SecOps platform Simbian recently developed a new cyber defence test with the aim of finding out if LLMs can actually hunt – and the answer, for now, is no.

The test and its results

Simbian tested 11 frontier models across more than 880 runs and 105 attack procedures. Not one passed, and the passing bar wasn’t perfection: a model needed more than 50% recall on every tested MITRE ATT&CK tactic.
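To make that pass criterion concrete, here is a minimal sketch of how a per-tactic recall check could be computed. The flag IDs, tactic labels and function names are hypothetical and invented for illustration; this is not Simbian's scoring code.

```python
from collections import defaultdict

def recall_by_tactic(ground_truth, found):
    """Return recall per tactic.

    ground_truth: dict mapping a flag id to its tactic label (hypothetical data).
    found: set of flag ids the agent correctly reported.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for flag_id, tactic in ground_truth.items():
        total[tactic] += 1
        if flag_id in found:
            hits[tactic] += 1
    return {tactic: hits[tactic] / total[tactic] for tactic in total}

def passes(recalls, threshold=0.5):
    # The bar is *more than* the threshold on *every* tactic,
    # so a single weak tactic fails the whole run.
    return all(r > threshold for r in recalls.values())

# Invented example: seven malicious events across three tactics.
ground_truth = {
    "f1": "credential-access", "f2": "credential-access",
    "f3": "lateral-movement", "f4": "lateral-movement",
    "f5": "discovery", "f6": "discovery", "f7": "discovery",
}
found = {"f1", "f2", "f3", "f5", "f6"}

recalls = recall_by_tactic(ground_truth, found)
print(recalls)
print("pass" if passes(recalls) else "fail")
```

In this invented run the agent finds five of seven flags overall, yet still fails, because its recall on one tactic is exactly 50% rather than above it. An every-tactic bar punishes blind spots that an overall average would hide.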

The benchmark is much closer to real SOC work than the tidy tests that AI tends to perform well on.

Logs are where confidence goes to die

The benchmark gave agents a threat briefing and a SQL-queryable database of real Windows telemetry, including Sysmon and Security event logs. Each hunt had a 50-query budget and between 75,000 and 135,000 log records. The agent had to submit exact malicious timestamps. No multiple choice, and no friendly hint saying: ‘look here’.
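A rough sketch of what hunting under a query budget looks like is below. It uses Python's built-in sqlite3 module with an in-memory database. The `events` table, its columns, the sample rows and the `BudgetedLogStore` class are all assumptions made for illustration; the benchmark's real schema and tooling are not described in this article.

```python
import sqlite3

class QueryBudgetExceeded(Exception):
    """Raised when the agent tries to query after its budget is spent."""

class BudgetedLogStore:
    """Hypothetical wrapper that counts every query against a fixed budget."""

    def __init__(self, conn, budget=50):
        self.conn = conn
        self.budget = budget
        self.used = 0

    @property
    def remaining(self):
        return self.budget - self.used

    def query(self, sql, params=()):
        if self.used >= self.budget:
            raise QueryBudgetExceeded(f"all {self.budget} queries used")
        self.used += 1
        return self.conn.execute(sql, params).fetchall()

# Invented sample telemetry (three rows instead of ~100,000).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, image TEXT, command_line TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2026-01-01T10:00:00", 1, "svchost.exe", "svchost.exe -k netsvcs"),
        ("2026-01-01T10:05:12", 1, "powershell.exe", "powershell.exe -enc SQBFAFgA"),
        ("2026-01-01T10:07:30", 10, "rundll32.exe", "rundll32.exe"),
    ],
)

store = BudgetedLogStore(conn, budget=2)

# One hypothesis, one query: look for encoded PowerShell command lines.
rows = store.query(
    "SELECT ts FROM events WHERE command_line LIKE ? ORDER BY ts",
    ("%-enc%",),
)
suspect_timestamps = [row[0] for row in rows]
print(suspect_timestamps, "queries left:", store.remaining)
```

The point of the wrapper is that every query is a spent resource. The agent has to decide which hypothesis deserves the next query, and the final answer is a list of timestamps rather than a description of the attack.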

This setup exposes the difference between cybersecurity knowledge and cyber defence. A model can know what malicious PowerShell looks like; it can explain Mimikatz; it can describe lateral movement. But that’s just recall – and threat hunting, in contrast, is investigation under uncertainty.

The defender doesn’t know how many malicious events exist. Normal admin activity can resemble reconnaissance, and one useful clue may only become useful when it’s linked to another clue 40,000 events later. On top of that, there’s no clean sign that lets you know you’ve won. Attackers get feedback when an exploit lands, but the defence side just gets more noise.

Even the leading LLM missed too much

The strongest result in Simbian’s testing came from Opus 4.6, which achieved a 0.46 coverage score and found 4.49% of flags at an average cost of US$17.98 per run. GPT 5 scored 0.17 coverage and found 2.24% of flags at $1.07 per run. Gemini 3 Flash cost just $0.19 per run, but found 1.44% of flags.

So there’s a cost-quality conundrum here. Spending much more improved results, but not enough to make raw LLMs dependable defenders. Simbian’s own page notes that some models stopped making progress before they’d exhausted their query budget because they believed the task was complete.
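The trade-off can be made concrete with a little arithmetic on the three figures quoted above. The sketch below divides cost per run by the percentage of flags found. This "dollars per percentage point" ratio is our own illustrative metric, not one Simbian reports, and it ignores the coverage score entirely.

```python
# Figures as quoted in this article: average cost per run (USD)
# and percentage of flags found.
results = {
    "Opus 4.6": {"cost_usd": 17.98, "flags_pct": 4.49},
    "GPT 5": {"cost_usd": 1.07, "flags_pct": 2.24},
    "Gemini 3 Flash": {"cost_usd": 0.19, "flags_pct": 1.44},
}

def cost_per_point(result):
    """Dollars spent per percentage point of flags found (illustrative metric)."""
    return result["cost_usd"] / result["flags_pct"]

for model, result in results.items():
    print(f"{model}: ${cost_per_point(result):.2f} per percentage point of flags")

# How much more the most expensive model costs, and finds,
# relative to the cheapest one.
cost_ratio = results["Opus 4.6"]["cost_usd"] / results["Gemini 3 Flash"]["cost_usd"]
flags_ratio = results["Opus 4.6"]["flags_pct"] / results["Gemini 3 Flash"]["flags_pct"]
print(f"cost ratio: {cost_ratio:.1f}x, flags ratio: {flags_ratio:.1f}x")
```

On these figures, Opus 4.6 costs roughly 95 times as much per run as Gemini 3 Flash while finding roughly three times as many flags.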

In incident response, that behaviour is lethal. A tired human analyst can at least be challenged, but a confident AI agent that closes the hunt prematurely may simply leave the attacker free to move.

The benchmark is also a warning about AI buying

What we’ve taken from this is that an LLM alone isn’t a SOC product.

Simbian argues that the missing layer is the harness: context about assets and users, deterministic retrieval, structured investigation loops, calibrated tool access, cost controls, and assessment mechanisms that keep the agent testing hypotheses.

That’s important from a commercial perspective. CISOs are now being sold AI copilots, agents and autonomous SOC promises all the time – and this benchmark suggests the procurement question should change.

Instead of just asking which model drives the product, you need to ask what forces it to keep investigating. Ask whether it can run on your telemetry, and how it’s scored. Ask what happens when it thinks it is finished.

If you’re a CISO, you should:

Test AI security tools on your own logs, not a vendor’s polished demo.
Treat knowledge benchmarks as weak evidence for detection performance.
Look for harnesses, workflows and verification (not just frontier-model branding).

Simbian’s benchmark is useful because it punctures the cinematic version of AI defence. The model is not yet Sherlock Holmes in the SOC. On its own, it is more like a bright intern with a short attention span: it’s useful and fast, and occasionally does something impressive – but it’s absolutely not someone you leave alone with the case.
