Last month, we wrote about the rise of AI-generated code – and why it’s critical that human developers remain responsible for outputs, even when they leverage AI tools. Now, a new report from CodeRabbit (which, it’s worth noting, is itself an AI-powered code review tool – useful context for its research) lays the risks out in numbers.
The researchers analysed 470 real GitHub open-source pull requests (PRs), comparing 320 labelled as AI co-authored with 150 treated as human-only, and normalised the findings as issue rates per 100 PRs. The average AI PR produced 10.83 findings vs 6.45 for human PRs – roughly 1.7× as many issues.
The takeaway: any organisation using AI to write code needs to make sure its review capacity keeps pace with the output.
The hidden cost is the long tail
Stepping away from the averages for a moment, the report shows AI PRs have a much heavier tail. At the 90th percentile*, AI PRs hit 26 issues vs 12.3 for humans; at the 95th, it’s 39.2 vs 22.65.
*(Here, the ‘90th percentile’ is the issue count that 90% of PRs fall at or below – in other words, the worst 10% of PRs carry at least this many issues.)
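As a rough illustration of what that percentile means in practice, here’s a minimal sketch over an invented list of per-PR issue counts (the numbers are made up, not the report’s data):

```python
import numpy as np

# Invented per-PR issue counts for illustration -- not the report's dataset.
issue_counts = np.array([0, 1, 2, 2, 3, 4, 5, 7, 12, 30])

# The 90th percentile is the issue count that 90% of PRs fall at or below;
# the heaviest ~10% of PRs sit above this threshold.
p90 = np.percentile(issue_counts, 90)
print(p90)  # 13.8 for this toy sample
```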
In practice, that means more PRs that stall pipelines, burn reviewer attention, and increase the chance something serious slips through simply because everyone’s skimming.
Along with volume, the severity of issues rises
More findings would be manageable if they were mostly cosmetic. But unfortunately, they aren’t.
When normalised per 100 PRs, critical issues rise from 240 (human) to 341 (AI) – roughly 1.4× as many. Major issues jump from 257 to 447 – roughly 1.7× as many. This suggests AI isn’t just adding noise; it’s increasing the count of defects with real production blast radius.
Where the gap is widest: logic, correctness, and the boring stuff
Top-level category comparisons put logic and correctness at the centre of the gap: 570 findings per 100 AI PRs vs 326 for humans (1.75×).
If we dig deeper, the pattern becomes more actionable. Algorithm and business-logic mistakes show up 194.28 times per 100 AI PRs vs 86 for humans (2.25× as often). Error and exception-handling gaps nearly double (70.37 vs 36; 1.97×). Concurrency control and null-pointer risks also rise sharply.
These are exactly the defects that tend to evade superficial review: the code ‘looks right’, compiles, and even passes happy-path tests – until it hits the edge case that was never modelled.
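To make that concrete, here’s an invented example (ours, not the report’s) of the kind of logic gap that survives a superficial review – it compiles, reads cleanly, and behaves on the happy path:

```python
def apply_discount(price: float, discount_pct: float) -> float:
    """Return the price after applying a percentage discount."""
    # Happy path works: apply_discount(100, 20) == 80.0
    return price * (1 - discount_pct / 100)

# The unmodelled edge cases: a discount over 100% yields a negative price,
# and a negative discount silently becomes a surcharge. Nothing fails to
# compile or raises an error -- the bug only surfaces with real inputs.
print(apply_discount(100, 20))   # 80.0  -- looks right
print(apply_discount(100, 150))  # -50.0 -- a refund the business never intended
```

Nothing here fails loudly; it’s only caught by a reviewer who asks what a 150% discount is supposed to mean.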
Security: fewer findings than correctness, but riskier
Security findings are also higher: 94 vs 60 per 100 PRs (1.57×). The standout is improper password handling (hardcoded credentials, unsafe hashing, ad-hoc auth logic): 65.99 vs 35 (1.88×).
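For illustration (this is our sketch, not code from the report), the contrast between that anti-pattern and a safer standard-library baseline looks something like this:

```python
import hashlib
import os

# Anti-pattern in the 'improper password handling' bucket: a hardcoded
# credential hashed with a fast, unsalted algorithm.
ADMIN_PASSWORD_HASH = hashlib.md5(b"hunter2").hexdigest()  # don't do this

# A safer baseline using only the standard library: a random per-user salt
# and a slow key-derivation function (iteration count here is illustrative).
def hash_password(password: str, salt: bytes | None = None) -> tuple[bytes, bytes]:
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest
```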
And the way the report frames this is worth taking seriously: these aren’t exotic AI-only flaws. They’re foundational mistakes that are appearing more often.
A nuance you should remember
Not everything gets worse. Humans showed more spelling errors and slightly more testability issues in this dataset, which we think is a helpful reminder that ‘human-only’ isn’t a quality guarantee – it’s just a different risk profile.
If you’re adopting AI coding tools, the report suggests your review discipline needs to evolve: focus hard on domain logic, error paths, dependency ordering, concurrency, and credential-handling defaults – and assume you’ll see more spiky PRs that require deeper scrutiny.