Across nearly 37,000 software dependency upgrade recommendations, newer AI models produced fewer hallucinations than their predecessors. But they also introduced a different problem: excessive caution.
Sonatype’s latest research tested frontier models including Claude Sonnet 3.7 and 4.5, Claude Opus 4.6, Gemini 2.5 Pro and 3 Pro, GPT-5 and GPT-5.2, along with smaller models. Hallucination rates have fallen, but hallucinations still appear in roughly 1 in 16 recommendations, enough to force development teams to validate every suggested fix and clean up unreliable guidance.
More striking was the finding that newer models increasingly recommend “no change” to a software component rather than suggesting an upgrade path. This restraint reduces hallucinations, but it leaves vulnerabilities in place: the most cautious models left approximately 800 to 900 critical and high-severity vulnerabilities unaddressed across the test set.
The standout result: a smaller model paired with real-time software intelligence produced 19 fewer critical and 38 fewer high-severity vulnerabilities than Opus 4.6, a model whose per-token inference cost is 71 times higher. Sonatype’s conclusion is that live context about actual package versions, known vulnerabilities and compatibility matters more than raw model scale.
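The validation step this implies can be sketched in a few lines. The example below is a hypothetical illustration, not Sonatype’s tooling: the `PUBLISHED` and `VULNERABLE` tables are inline stand-ins for what would, in practice, be live lookups against a package registry and a vulnerability feed.

```python
# Hypothetical sketch of validating an AI-suggested dependency upgrade
# against real-world package intelligence before applying it.

# Stand-in for a registry lookup: versions actually published per package.
PUBLISHED = {
    "left-pad": ["1.2.0", "1.3.0"],
    "log4j-core": ["2.14.1", "2.17.2"],
}

# Stand-in for a vulnerability feed: (package, version) pairs with known CVEs.
VULNERABLE = {
    ("log4j-core", "2.14.1"): ["CVE-2021-44228"],
}

def validate_upgrade(package: str, suggested_version: str) -> str:
    """Classify an AI-suggested upgrade before a team applies it."""
    if suggested_version not in PUBLISHED.get(package, []):
        return "hallucinated"      # version does not exist in the registry
    if (package, suggested_version) in VULNERABLE:
        return "still-vulnerable"  # real version, but carries known CVEs
    return "ok"

print(validate_upgrade("left-pad", "9.9.9"))        # hallucinated
print(validate_upgrade("log4j-core", "2.14.1"))     # still-vulnerable
print(validate_upgrade("log4j-core", "2.17.2"))     # ok
```

The point of the sketch is that neither check depends on the model at all: both are lookups against current external data, which is why a small model with this context can outperform a far more expensive one without it.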