Why AI isn’t ready to replace cybersecurity jobs
AI is affecting cybersecurity jobs, but new workforce data and SOC benchmark results suggest automation will change cyber roles faster than it replaces them.
LLMs are very good at sounding like security analysts. Ask one to explain credential dumping or map LSASS activity to MITRE ATT&CK, and it’ll give you a polished answer.
But SecOps platform Simbian recently developed a new cyber defence test with the aim of finding out if LLMs can actually hunt – and the answer, for now, is no.
Simbian tested 11 frontier models across more than 880 runs and 105 attack procedures. Not one passed, and the passing bar wasn’t perfection: it was more than 50% recall on every tested MITRE ATT&CK tactic.
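To make that passing bar concrete, here is a minimal Python sketch of a per-tactic recall check. It is not Simbian’s scoring code: the function names, the tactic labels and the event IDs are all hypothetical, and the only detail taken from the benchmark description is the rule itself (more than 50% recall on every tested tactic).

```python
from collections import defaultdict

def recall_by_tactic(ground_truth, found):
    """ground_truth: iterable of (tactic, event_id) pairs (hypothetical labels).
    found: set of event_ids the agent flagged as malicious."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tactic, event_id in ground_truth:
        totals[tactic] += 1
        if event_id in found:
            hits[tactic] += 1
    return {tactic: hits[tactic] / totals[tactic] for tactic in totals}

def passes(recalls, threshold=0.5):
    # The bar is strictly more than the threshold on every tactic, not on average.
    return all(r > threshold for r in recalls.values())

# Illustrative data only: five malicious events across two tactics.
ground_truth = [
    ("credential-access", 1),
    ("credential-access", 2),
    ("lateral-movement", 3),
    ("lateral-movement", 4),
    ("lateral-movement", 5),
]
found = {1, 2, 3}

recalls = recall_by_tactic(ground_truth, found)
verdict = passes(recalls)
```

In this made-up case the agent finds three of five events overall, but it still fails because it catches only one of the three lateral-movement events. A per-tactic bar punishes exactly that kind of blind spot.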
The benchmark is much closer to real SOC work than the tidy tests on which AI often performs well.
The benchmark gave agents a threat briefing and a SQL-queryable database of real Windows telemetry, including Sysmon and Security event logs. Each hunt had a 50-query budget and between 75,000 and 135,000 log records. The agent had to submit exact malicious timestamps. No multiple choice, and no friendly hint saying: ‘look here’.
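The shape of that setup can be sketched in a few lines of Python using the standard `sqlite3` module. This is an illustration under stated assumptions, not the benchmark’s actual harness: the `events` table, its columns, the sample rows and the `BudgetedTelemetry` class are all invented here. Only the 50-query budget and the requirement to submit exact timestamps come from the description above.

```python
import sqlite3

class QueryBudgetExceeded(Exception):
    pass

class BudgetedTelemetry:
    """Wraps a database connection and refuses queries once the budget is spent."""

    def __init__(self, conn, budget=50):
        self.conn = conn
        self.remaining = budget

    def query(self, sql, params=()):
        if self.remaining <= 0:
            raise QueryBudgetExceeded("query budget exhausted")
        self.remaining -= 1
        return self.conn.execute(sql, params).fetchall()

def score_submission(submitted, malicious):
    # Fraction of truly malicious timestamps that the agent submitted exactly.
    submitted, malicious = set(submitted), set(malicious)
    return len(submitted & malicious) / len(malicious)

# Hypothetical schema and rows; real hunts had 75,000 to 135,000 records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, source TEXT, event_id INTEGER, detail TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", "Sysmon", 1, "routine process start"),
        ("2024-01-01T10:05:00", "Sysmon", 10, "process access to lsass.exe"),
        ("2024-01-01T10:06:00", "Security", 4624, "logon"),
    ],
)

db = BudgetedTelemetry(conn, budget=50)
rows = db.query(
    "SELECT ts FROM events WHERE source = ? AND event_id = ?", ("Sysmon", 10)
)
submitted = [row[0] for row in rows]
score = score_submission(submitted, ["2024-01-01T10:05:00", "2024-01-01T10:06:00"])
```

Every call to `query` spends budget whether or not it returns anything useful, which is why a poorly aimed early query is expensive in this kind of test.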
This setup exposes the difference between cybersecurity knowledge and cyber defence. A model can know what malicious PowerShell looks like; it can explain Mimikatz; it can describe lateral movement. But that’s just recall – and threat hunting, in contrast, is investigation under uncertainty.
The defender doesn’t know how many malicious events exist. Normal admin activity can resemble reconnaissance, and a clue may only become useful when it’s linked to another one 40,000 events later. On top of that, there’s no clean signal that tells you you’ve won. Attackers get feedback when an exploit lands, but the defence side just gets more noise.
The strongest result in Simbian’s testing came from Opus 4.6, which achieved a 0.46 coverage score and found 4.49% of flags at an average cost of US$17.98 per run. GPT 5 scored 0.17 coverage and found 2.24% of flags at $1.07 per run. Gemini 3 Flash cost just $0.19 per run, but found only 1.44% of flags.
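One rough way to compare those figures is cost per percentage point of flags found. The short Python calculation below uses only the three results quoted above. The metric is our own back-of-envelope framing, not one Simbian reports.

```python
# (cost per run in USD, percentage of flags found), as quoted in the article
results = {
    "Opus 4.6": (17.98, 4.49),
    "GPT 5": (1.07, 2.24),
    "Gemini 3 Flash": (0.19, 1.44),
}

# Dollars spent per percentage point of flags found, per run.
cost_per_point = {
    model: cost / pct_found for model, (cost, pct_found) in results.items()
}

# How much more the strongest model costs, and finds, than the cheapest one.
cost_ratio = results["Opus 4.6"][0] / results["Gemini 3 Flash"][0]
flag_ratio = results["Opus 4.6"][1] / results["Gemini 3 Flash"][1]

for model, value in cost_per_point.items():
    print(f"{model}: ${value:.2f} per percentage point of flags found")
```

On these numbers, Opus 4.6 works out to about $4.00 per percentage point of flags found, against about $0.48 for GPT 5 and about $0.13 for Gemini 3 Flash. Put another way, the strongest model cost roughly 95 times as much per run as the cheapest and found roughly three times as many flags.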
So there’s a cost-quality conundrum here. Spending much more improved results, but not enough to make raw LLMs dependable defenders. Simbian’s own page notes that some models stopped making progress before they’d exhausted their query budget because they believed the task was complete.
In incident response, that behaviour is lethal. A tired human analyst can at least be challenged, but a confident AI agent that closes the hunt prematurely may simply leave the attacker free to move.
What we’ve taken from this is that an LLM alone isn’t a SOC product.
Simbian argues that the missing layer is the harness: context about assets and users, deterministic retrieval, structured investigation loops, calibrated tool access, cost controls, and assessment mechanisms that keep the agent testing hypotheses.
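As an illustration of one piece of that idea, the Python sketch below shows a loop that does not accept “I’m finished” while untested hypotheses and query budget both remain. It is a toy under our own assumptions, not Simbian’s harness: `HuntState`, `may_stop`, `run_hunt` and the example hypotheses are all hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class HuntState:
    budget: int = 50                               # queries left
    untested: list = field(default_factory=list)   # hypotheses not yet checked
    findings: list = field(default_factory=list)   # hypotheses that held up

def may_stop(state):
    """The hunt may end only when the budget is spent or nothing is left to test."""
    return state.budget <= 0 or not state.untested

def run_hunt(state, test_hypothesis):
    # test_hypothesis is a caller-supplied function: hypothesis -> bool.
    while not may_stop(state):
        hypothesis = state.untested.pop(0)
        state.budget -= 1
        if test_hypothesis(hypothesis):
            state.findings.append(hypothesis)
    return state

# Illustrative run with a stub in place of a real model or query engine.
state = HuntState(
    budget=50,
    untested=["credential dumping", "lateral movement", "persistence"],
)
final = run_hunt(state, lambda h: h != "persistence")
```

The point of the design is that the stopping rule lives outside the model. The agent can be as confident as it likes, but the hunt only ends when the harness agrees there is nothing left to check or nothing left to spend.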
That’s important from a commercial perspective. CISOs are now being sold AI copilots, agents and autonomous SOC promises all the time – and this benchmark suggests the procurement question should change.
Instead of just asking which model drives the product, you need to ask what forces it to keep investigating. Ask whether it can run on your telemetry, and how it’s scored. Ask what happens when it thinks it is finished.
All CISOs should:
Simbian’s benchmark is useful because it punctures the cinematic version of AI defence. The model is not yet Sherlock Holmes in the SOC. On its own, it is more like a bright intern with a short attention span: it’s useful and fast, and occasionally does something impressive – but it’s absolutely not someone you leave alone with the case.