Why frontier LLMs failed a new cyber defence benchmark

by Black Hat Middle East and Africa

LLMs are very good at sounding like security analysts. Ask one to explain credential dumping or map LSASS activity to MITRE ATT&CK, and it’ll give you a polished answer. 

But SecOps platform Simbian recently developed a new cyber defence test with the aim of finding out if LLMs can actually hunt – and the answer, for now, is no. 

The test and its results 

Simbian tested 11 frontier models across more than 880 runs and 105 attack procedures. Not one passed. And the passing bar wasn’t perfection – it was more than 50% recall on every tested MITRE ATT&CK tactic. None of those 11 models cleared it.
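To make the passing bar concrete, here is a minimal sketch of a per-tactic recall check. Simbian's actual scoring code isn't described in this article, so the function names, the data layout and the sample timestamps below are all illustrative assumptions. The point it shows is that a model can find most malicious events overall and still fail, because one weak tactic is enough.

```python
from collections import defaultdict

def recall_by_tactic(ground_truth, submitted):
    """Return recall per tactic.

    ground_truth: dict mapping malicious timestamp -> tactic name (hypothetical layout)
    submitted:    set of timestamps the agent flagged as malicious
    """
    total = defaultdict(int)
    found = defaultdict(int)
    for ts, tactic in ground_truth.items():
        total[tactic] += 1
        if ts in submitted:
            found[tactic] += 1
    return {tactic: found[tactic] / total[tactic] for tactic in total}

def passes(recalls, threshold=0.5):
    # The bar described above: strictly more than 50% recall on every tactic.
    return all(r > threshold for r in recalls.values())

# Made-up example data, not taken from the benchmark.
ground_truth = {
    "2024-01-01T10:00:00": "credential-access",
    "2024-01-01T10:05:00": "credential-access",
    "2024-01-01T11:00:00": "lateral-movement",
    "2024-01-01T11:30:00": "lateral-movement",
}
submitted = {"2024-01-01T10:00:00", "2024-01-01T10:05:00", "2024-01-01T11:00:00"}

recalls = recall_by_tactic(ground_truth, submitted)
print(recalls)          # per-tactic recall
print(passes(recalls))  # False: lateral-movement sits at exactly 50%
```

In this toy case the agent finds three of four malicious events, yet fails, because recall on one tactic is not above 50%.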

The benchmark is much closer to real SOC work than the tidy tests on which AI often performs well.

Logs are where confidence goes to die

The benchmark gave agents a threat briefing and a SQL-queryable database of real Windows telemetry, including Sysmon and Security event logs. Each hunt had a 50-query budget and between 75,000 and 135,000 log records. The agent had to submit exact malicious timestamps. No multiple choice, and no friendly hint saying: ‘look here’.
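A rough sketch of what a query budget over SQL-queryable telemetry could look like is shown below. It uses an in-memory SQLite database as a stand-in; the table schema, the `BudgetedHunt` class and the sample rows are assumptions made for illustration, not Simbian's environment.

```python
import sqlite3

class BudgetExceeded(Exception):
    """Raised when the agent asks for more queries than it was allotted."""

class BudgetedHunt:
    # Hypothetical wrapper: every query, useful or not, spends one unit of budget.
    def __init__(self, conn, max_queries=50):
        self.conn = conn
        self.max_queries = max_queries
        self.used = 0

    def query(self, sql, params=()):
        if self.used >= self.max_queries:
            raise BudgetExceeded(f"query budget of {self.max_queries} exhausted")
        self.used += 1
        return self.conn.execute(sql, params).fetchall()

# Toy telemetry table; real Sysmon and Security logs carry many more fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, event_id INTEGER, image TEXT, command_line TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T09:59:00", 1, "notepad.exe", "notepad.exe notes.txt"),
        ("2024-01-01T10:00:00", 1, "powershell.exe", "powershell.exe -enc AAAA"),
    ],
)

hunt = BudgetedHunt(conn, max_queries=2)
rows = hunt.query(
    "SELECT ts FROM events WHERE event_id = 1 AND command_line LIKE ?",
    ("%-enc%",),
)
print(rows)                                # timestamps the agent might submit
print(hunt.max_queries - hunt.used)        # queries left
```

With tens of thousands of records and only 50 queries, each query has to narrow the search meaningfully, which is part of what makes the task hard.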

This setup exposes the difference between cybersecurity knowledge and cyber defence. A model can know what malicious PowerShell looks like; it can explain Mimikatz; it can describe lateral movement. But that’s just recall – and threat hunting, in contrast, is investigation under uncertainty. 

The defender doesn’t know how many malicious events exist. Normal admin activity can resemble reconnaissance, and one useful clue may only become useful when it’s linked to another clue 40,000 events later. On top of that, there’s no clean sign that lets you know you’ve won. Attackers get feedback when an exploit lands, but the defence side just gets more noise. 

Even the leading LLM missed too much

The strongest result in Simbian’s testing came from Opus 4.6, which achieved a 0.46 coverage score and found 4.49% of flags at an average cost of US$17.98 per run. GPT 5 scored 0.17 coverage and found 2.24% of flags at $1.07 per run. Gemini 3 Flash cost just $0.19 per run, but found only 1.44% of flags.
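One way to read those figures side by side is cost per percentage point of flags found. This is simple arithmetic on the numbers quoted above, not a metric Simbian reports, and the variable names are ours.

```python
# (average cost per run in USD, percentage of flags found), as quoted in the article
results = {
    "Opus 4.6": (17.98, 4.49),
    "GPT 5": (1.07, 2.24),
    "Gemini 3 Flash": (0.19, 1.44),
}

# Dollars spent per run for each percentage point of flags found.
cost_per_point = {
    model: round(cost / pct, 2) for model, (cost, pct) in results.items()
}
print(cost_per_point)

# How much more the priciest model costs, and how many more flags it finds,
# relative to the cheapest one.
cost_ratio = round(results["Opus 4.6"][0] / results["Gemini 3 Flash"][0], 1)
flag_ratio = round(results["Opus 4.6"][1] / results["Gemini 3 Flash"][1], 1)
print(cost_ratio, flag_ratio)
```

On these numbers, the most expensive model costs roughly 95 times as much per run as the cheapest while finding about three times as many flags.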

So there’s a cost-quality conundrum here. Spending much more improved results, but not enough to make raw LLMs dependable defenders. Simbian’s own page notes that some models stopped making progress before they’d exhausted their query budget because they believed the task was complete.

In incident response, that behaviour is lethal. A tired human analyst is at least easy to challenge, but a confident AI agent that closes the hunt prematurely may simply leave the attacker free to move.

The benchmark is also a warning about AI buying

What we’ve taken from this is that an LLM alone isn’t a SOC product. 

Simbian argues that the missing layer is the harness: context about assets and users, deterministic retrieval, structured investigation loops, calibrated tool access, cost controls, and assessment mechanisms that keep the agent testing hypotheses.
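As a loose illustration of one piece of that idea, the sketch below shows a harness rule that refuses to accept "I'm finished" while untested hypotheses and budget remain. It is a toy under our own assumptions: `HuntState`, `next_action` and the hypothesis names are hypothetical and do not describe Simbian's product.

```python
from dataclasses import dataclass, field

@dataclass
class HuntState:
    budget: int                                   # queries remaining
    hypotheses: list                              # hypotheses not yet tested
    tested: list = field(default_factory=list)    # hypotheses already tested

def next_action(state, agent_says_done):
    """Decide the next step; the agent's own 'done' signal is not trusted alone."""
    if state.budget <= 0:
        return "stop: budget exhausted"
    if state.hypotheses:
        # Keep investigating even if the agent believes the task is complete.
        hypothesis = state.hypotheses.pop(0)
        state.tested.append(hypothesis)
        state.budget -= 1
        return f"test: {hypothesis}"
    if agent_says_done:
        return "stop: all hypotheses tested"
    return "ask agent for new hypotheses"

state = HuntState(budget=50, hypotheses=["credential dumping", "lateral movement"])
# The agent claims to be done on every turn, but the harness keeps it working.
actions = [next_action(state, agent_says_done=True) for _ in range(3)]
print(actions)
```

A real harness would also need the asset context, retrieval and verification layers listed above; this only shows the stopping rule.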

That’s important from a commercial perspective. CISOs are now being sold AI copilots, agents and autonomous SOC promises all the time – and this benchmark suggests the procurement question should change.

Instead of just asking which model drives the product, you need to ask what forces it to keep investigating. Ask whether it can run on your telemetry, and how it’s scored. Ask what happens when it thinks it is finished.

All CISOs should: 

  • Test AI security tools on their own logs, not a vendor’s polished demo.
  • Treat knowledge benchmarks as weak evidence for detection performance.
  • Look for harnesses, workflows and verification (not just frontier-model branding).

Simbian’s benchmark is useful because it punctures the cinematic version of AI defence. The model is not yet Sherlock Holmes in the SOC. On its own, it is more like a bright intern with a short attention span: it’s useful and fast, and occasionally does something impressive – but it’s absolutely not someone you leave alone with the case.
