Last month, we wrote about the rise of AI-generated code – and why it’s critical that human developers remain responsible for outputs, even when they leverage AI tools. Now, a new report from CodeRabbit (which, it’s worth noting, is itself an AI-powered code review tool – useful context for its research) lays the risks out in numbers.
The researchers analysed 470 real GitHub open-source pull requests (PRs), comparing 320 labelled as AI co-authored with 150 treated as human-only, and normalised the findings as issue rates per 100 PRs. The average AI PR produced 10.83 findings vs 6.45 for human PRs – roughly 1.7× as many issues.
The takeaway: any organisation using AI to write code needs to make sure its review capacity keeps pace with the output.
The hidden cost is the long tail
Stepping away from the averages for a moment, the report shows AI PRs have a much heavier tail. At the 90th percentile*, AI PRs hit 26 issues vs 12.3 for humans; at the 95th, it’s 39.2 vs 22.65.
*(Here, the ‘90th percentile’ is the issue count that 90% of PRs fall at or below – in other words, the worst 10% of PRs carry at least this many issues.)
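As a rough illustration of what that percentile means in practice, here’s a minimal sketch over an invented list of per-PR issue counts (the numbers are made up, not the report’s data):

```python
import numpy as np

# Invented per-PR issue counts for illustration -- not the report's dataset.
issue_counts = np.array([0, 1, 2, 2, 3, 4, 5, 7, 12, 30])

# The 90th percentile is the issue count that 90% of PRs fall at or below;
# the heaviest ~10% of PRs sit above this threshold.
p90 = np.percentile(issue_counts, 90)
print(p90)  # 13.8 for this toy sample
```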
In practice, that means more PRs that stall pipelines, burn reviewer attention, and increase the chance something serious slips through simply because everyone’s skimming.
Along with volume, the severity of issues rises
More findings would be manageable if they were mostly cosmetic. But unfortunately, they aren’t.
When normalised per 100 PRs, critical issues rise from 240 (human) to 341 (AI) – roughly 1.4× as many. Major issues jump from 257 to 447 – roughly 1.7× as many. This suggests AI isn’t just adding noise; it’s increasing the count of defects with real production blast radius.
Where the gap is widest: logic, correctness, and the boring stuff
Top-level category comparisons put logic and correctness at the centre of the gap: 570 findings per 100 AI PRs vs 326 for humans (1.75×).
If we dig deeper, the pattern becomes more actionable. Algorithm and business-logic mistakes show up 194.28 times per 100 AI PRs vs 86 for humans (2.25× as often). Error and exception-handling gaps nearly double (70.37 vs 36; 1.97×). Concurrency control and null-pointer risks also rise sharply.
These are exactly the defects that tend to evade superficial review: the code ‘looks right’, compiles, and even passes happy-path tests – until it hits the edge case that was never modelled.
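To make that concrete, here’s an invented example (ours, not the report’s) of the kind of logic gap that survives a superficial review – it compiles, reads cleanly, and behaves on the happy path:

```python
def apply_discount(price: float, discount_pct: float) -> float:
    """Return the price after applying a percentage discount."""
    # Happy path works: apply_discount(100, 20) == 80.0
    return price * (1 - discount_pct / 100)

# The unmodelled edge cases: a discount over 100% yields a negative price,
# and a negative discount silently becomes a surcharge. Nothing fails to
# compile or raises an error -- the bug only surfaces with real inputs.
print(apply_discount(100, 20))   # 80.0  -- looks right
print(apply_discount(100, 150))  # -50.0 -- a refund the business never intended
```

Nothing here fails loudly; it’s only caught by a reviewer who asks what a 150% discount is supposed to mean.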
Security: fewer findings than correctness, but riskier
Security findings are also higher: 94 vs 60 per 100 PRs (1.57×). The standout is improper password handling (hardcoded credentials, unsafe hashing, ad-hoc auth logic): 65.99 vs 35 (1.88×).
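For illustration (this is our sketch, not code from the report), the contrast between that anti-pattern and a safer standard-library baseline looks something like this:

```python
import hashlib
import os

# Anti-pattern in the 'improper password handling' bucket: a hardcoded
# credential hashed with a fast, unsalted algorithm.
ADMIN_PASSWORD_HASH = hashlib.md5(b"hunter2").hexdigest()  # don't do this

# A safer baseline using only the standard library: a random per-user salt
# and a slow key-derivation function (iteration count here is illustrative).
def hash_password(password: str, salt: bytes | None = None) -> tuple[bytes, bytes]:
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest
```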
And the way the report frames this is worth taking seriously: these aren’t exotic AI-only flaws. They’re foundational mistakes that are appearing more often.
A nuance you should remember
Not everything gets worse. Humans showed more spelling errors and slightly more testability issues in this dataset, which we think is a helpful reminder that ‘human-only’ isn’t a quality guarantee – it’s just a different risk profile.
If you’re adopting AI coding tools, the report suggests your review discipline needs to evolve: focus hard on domain logic, error paths, dependency ordering, concurrency, and credential-handling defaults – and assume you’ll see more spiky PRs that require deeper scrutiny.