CISOs: would you pay the ransom?
More than half of CISOs would consider paying ransomware demands as recovery times remain painfully slow. Here’s why resilience (not prevention alone) is becoming the real test of strength.
LLMs are very good at sounding like security analysts. Ask one to explain credential dumping or map LSASS activity to MITRE ATT&CK, and it’ll give you a polished answer.
But SecOps platform Simbian recently developed a new cyber defence test to find out whether LLMs can actually hunt – and the answer, for now, is no.
Simbian tested 11 frontier models across more than 880 runs and 105 attack procedures. Not one passed. And the passing bar wasn’t perfection – it was more than 50% recall on every tested MITRE ATT&CK tactic.
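To make that passing bar concrete, here is a minimal Python sketch of a per-tactic recall check. It is not Simbian's scoring code: the function names, the tactic labels and the toy timestamps are all assumptions made for illustration. The only thing taken from the benchmark is the rule itself, which is recall above 50% on every tactic.

```python
from collections import defaultdict

def recall_per_tactic(ground_truth, submitted):
    """ground_truth: list of (tactic, timestamp); submitted: set of timestamps."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tactic, ts in ground_truth:
        totals[tactic] += 1
        if ts in submitted:
            hits[tactic] += 1
    return {tactic: hits[tactic] / totals[tactic] for tactic in totals}

def passes(recalls, bar=0.5):
    # The bar is strict: recall must be MORE than 50% on EVERY tactic.
    return all(r > bar for r in recalls.values())

# Toy, invented data (hypothetical tactics and timestamps).
ground_truth = [
    ("credential-access", "10:05:00"),
    ("credential-access", "10:06:30"),
    ("lateral-movement", "11:15:00"),
    ("lateral-movement", "11:42:10"),
]
submitted = {"10:05:00", "10:06:30", "11:15:00"}

recalls = recall_per_tactic(ground_truth, submitted)
result = passes(recalls)
```

In this toy case the agent finds three of the four malicious events, yet still fails: recall on one tactic is exactly 50%, which does not clear a bar of more than 50%. A strong showing on one tactic cannot offset a weak one on another.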
The benchmark is much closer to real SOC work than the tidy tests AI often performs well on.
The benchmark gave agents a threat briefing and a SQL-queryable database of real Windows telemetry, including Sysmon and Security event logs. Each hunt had a 50-query budget and between 75,000 and 135,000 log records. The agent had to submit exact malicious timestamps. No multiple choice, and no friendly hint saying: ‘look here’.
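The shape of that setup can be sketched in a few lines of Python using the standard sqlite3 module. Everything specific here is an assumption for illustration: the `events` table, its columns, the `BudgetedDB` wrapper and the three toy rows are invented, and the real benchmark's schema and tooling may look quite different. The sketch only shows the idea of a database the agent can query a limited number of times.

```python
import sqlite3

class QueryBudgetExceeded(RuntimeError):
    """Raised when the agent tries to query after spending its budget."""

class BudgetedDB:
    def __init__(self, conn, budget=50):
        self.conn = conn
        self.remaining = budget

    def query(self, sql, params=()):
        if self.remaining <= 0:
            raise QueryBudgetExceeded("query budget exhausted")
        self.remaining -= 1  # every query costs one unit, useful or not
        return self.conn.execute(sql, params).fetchall()

# Hypothetical schema and toy rows, not the benchmark's real telemetry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, event_id INTEGER, image TEXT, command_line TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", 1, "svchost.exe", "svchost.exe -k netsvcs"),
        ("2024-01-01T10:05:00", 1, "rundll32.exe",
         "rundll32.exe comsvcs.dll, MiniDump 624 lsass.dmp full"),
        ("2024-01-01T10:07:00", 1, "whoami.exe", "whoami /all"),
    ],
)

db = BudgetedDB(conn, budget=2)  # tiny budget so the limit is easy to see
rows = db.query("SELECT ts FROM events WHERE command_line LIKE ?", ("%lsass%",))
```

With three rows, one query is enough. With 75,000 or more records and only 50 queries, each query has to be chosen carefully, and the answer must still be an exact timestamp.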
This setup exposes the difference between cybersecurity knowledge and cyber defence. A model can know what malicious PowerShell looks like; it can explain Mimikatz; it can describe lateral movement. But that’s just recall – and threat hunting, in contrast, is investigation under uncertainty.
The defender doesn’t know how many malicious events exist. Normal admin activity can resemble reconnaissance, and one useful clue may only become useful when it’s linked to another clue 40,000 events later. On top of that, there’s no clean sign that lets you know you’ve won. Attackers get feedback when an exploit lands, but the defence side just gets more noise.
The strongest result in Simbian’s testing came from Opus 4.6, which achieved a 0.46 coverage score and found 4.49% of flags at an average cost of US$17.98 per run. GPT 5 scored 0.17 coverage and found 2.24% of flags at $1.07 per run. Gemini 3 Flash cost just $0.19 per run, but found only 1.44% of flags.
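One rough way to compare those figures is dollars spent per percentage point of flags found. The Python sketch below uses only the numbers quoted above. The metric itself is our own back-of-envelope assumption, not one Simbian reports, and it ignores the coverage score entirely.

```python
# (cost per run in USD, % of flags found), as quoted in the article.
results = {
    "Opus 4.6": (17.98, 4.49),
    "GPT 5": (1.07, 2.24),
    "Gemini 3 Flash": (0.19, 1.44),
}

# Hypothetical comparison metric: dollars per percentage point of flags found.
cost_per_point = {name: cost / flags for name, (cost, flags) in results.items()}

# How much more the top model costs and finds than the cheapest one.
cost_ratio = results["Opus 4.6"][0] / results["Gemini 3 Flash"][0]
flag_ratio = results["Opus 4.6"][1] / results["Gemini 3 Flash"][1]

for name, value in cost_per_point.items():
    print(f"{name}: ${value:.2f} per percentage point of flags found")
print(f"Opus 4.6 vs Gemini 3 Flash: {cost_ratio:.0f}x the cost, {flag_ratio:.1f}x the flags")
```

On these numbers, the top model costs roughly 95 times as much per run as the cheapest one and finds about three times as many flags.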
So there’s a cost-quality conundrum here. Spending much more improved results, but not enough to make raw LLMs dependable defenders. Simbian’s own page notes that some models stopped making progress before they’d exhausted their query budget because they believed the task was complete.
In incident response, that behaviour is dangerous. A tired human analyst can at least be challenged, but a confident AI agent that closes the hunt prematurely may simply leave the attacker free to move.
What we’ve taken from this is that an LLM alone isn’t a SOC product.
Simbian argues that the missing layer is the harness: context about assets and users, deterministic retrieval, structured investigation loops, calibrated tool access, cost controls, and assessment mechanisms that keep the agent testing hypotheses.
That’s important from a commercial perspective. CISOs are now being sold AI copilots, agents and autonomous SOC promises all the time – and this benchmark suggests the procurement question should change.
Instead of just asking which model drives the product, you need to ask what forces it to keep investigating. Ask whether it can run on your telemetry, and how it’s scored. Ask what happens when it thinks it is finished.
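To picture what "forcing it to keep investigating" could mean, here is a deliberately simple Python sketch of a harness loop that refuses an early "done" while untested hypotheses and query budget remain. It is an illustration under our own assumptions, not Simbian's harness or any vendor's design: `run_hunt`, `overconfident_agent` and the hypothesis list are all hypothetical.

```python
def run_hunt(agent_step, hypotheses, budget=50):
    """Run until every hypothesis is tested or the budget is spent.

    agent_step(open_hypotheses) returns ("test", hypothesis) or ("done", None).
    """
    open_h = list(hypotheses)
    tested, overrides = [], 0
    while budget > 0 and open_h:
        action, choice = agent_step(open_h)
        if action == "done":
            # The agent thinks it is finished; the harness disagrees.
            overrides += 1
            choice = open_h[0]
        elif choice not in open_h:
            choice = open_h[0]  # guard against an invalid pick
        open_h.remove(choice)
        tested.append(choice)
        budget -= 1  # each tested hypothesis is assumed to cost one query
    return tested, overrides

def overconfident_agent(open_h):
    # Toy agent: tests one hypothesis, then declares the hunt complete.
    if len(open_h) < 3:
        return ("done", None)
    return ("test", open_h[-1])

hypotheses = ["credential dumping", "lateral movement", "persistence"]
tested, overrides = run_hunt(overconfident_agent, hypotheses)
```

Here the toy agent tries to stop after a single check, and the loop overrides it twice so that all three hypotheses get tested. A real product would need far more than this, but the procurement question is the same: when the agent says it is finished, what checks that claim?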
All CISOs should:
Simbian’s benchmark is useful because it punctures the cinematic version of AI defence. The model is not yet Sherlock Holmes in the SOC. On its own, it is more like a bright intern with a short attention span: it’s useful and fast, and occasionally does something impressive – but it’s absolutely not someone you leave alone with the case.