AI Counted Carbs 27,000 Times and Never Gave the Same Answer Twice
Everyone assumes AI consistency problems are solved with temperature settings near zero. Wrong.
A developer at Diabettech just shattered this myth with the most comprehensive real-world AI reliability test I've seen. They fed 13 identical food photographs to four leading models—OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro—across 26,904 total queries. Same images. Same prompts. Lowest randomness settings.
Not one model gave consistent carbohydrate estimates.
> "An AI that gives you the same wrong answer 500 times in a row is no more trustworthy than one that varies—consistency is necessary but not sufficient."
The variance was staggering. Gemini models showed 10-20% spreads on identical photos. That's not a rounding error—that's the difference between a safe meal and a hypoglycemic emergency for diabetics. Claude Sonnet 4.6 performed best with spreads mostly under 5%, but even that's problematic when you're calculating insulin doses.
Here's what really terrifies me: every model returned 100% confidence scores for all food identifications. Perfect confidence. Zero self-doubt. These systems have no idea when they're wrong.
The Math Doesn't Lie
Each photo was queried 500+ times per model. The results, visualized as violin plots, look like chaos theory in action. Same input, wildly different outputs, despite temperature settings designed to minimize randomness.
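The per-photo spread can be summarized with a short helper. A minimal sketch — the sample estimates below are hypothetical, not the study's data:

```python
from statistics import mean, stdev

def spread_stats(estimates):
    """Summarize repeated carb estimates (grams) for one photo.

    Spread is reported as (max - min) / mean, the kind of
    percentage range described above.
    """
    m = mean(estimates)
    return {
        "mean_g": round(m, 1),
        "stdev_g": round(stdev(estimates), 1),
        "spread_pct": round(100 * (max(estimates) - min(estimates)) / m, 1),
    }

# Hypothetical repeated estimates for the same plate of pasta:
samples = [62, 58, 71, 65, 60, 69, 64, 66, 59, 70]
print(spread_stats(samples))
```

On these made-up numbers the spread lands around 20% — the Gemini-like failure mode, not a rounding error.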
All models systematically overestimated carbohydrates. This isn't just inconsistency—it's biased inconsistency. Some foods triggered specific blind spots where models would consistently fail in the same direction.
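Bias and variance are separate failure modes, and the signed error is a few lines to check. In this sketch, `truth_g` (a dietitian-labeled ground truth) and the sample estimates are hypothetical:

```python
from statistics import mean

def signed_bias(estimates, truth_g):
    """Mean signed error in grams: positive = systematic overestimation.

    A model can show tight variance and still be biased,
    failing in the same direction on every query.
    """
    return mean(e - truth_g for e in estimates)

# Hypothetical repeated estimates for a plate labeled 45 g by a dietitian:
bias = signed_bias([52, 50, 55, 51, 53], truth_g=45)
print(f"bias: {bias:+.1f} g")
```

A tightly clustered but positively biased model like this one would pass a naive consistency check and still overdose insulin every time.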
The Elephant in the Room
Every diabetes app promising "AI-powered carb counting" from photos is essentially running the same gamble. None of these apps have magic—they're built on the same underlying vision-language models that just failed this test spectacularly.
The Hacker News thread (293 comments, 230 points) hammered this point home: no proprietary health app can reliably outperform public models like Claude or GPT without novel breakthroughs. They're all playing with the same flawed foundation.
Think about the business implications:
- Health apps marketing photo-based carb counting face massive liability exposure
- Premium features built on unreliable AI create false value propositions
- Regulatory scrutiny is inevitable when variance could cause medical emergencies
- User trust evaporates once people realize the inconsistency
What Developers Need to Know
This test reveals fundamental issues that can't be prompt-engineered away:
1. Non-determinism persists even at minimal temperature settings
2. Confidence scores are meaningless for filtering bad estimates
3. Systematic biases require domain-specific datasets and fine-tuning
4. High-repetition benchmarks (500+ queries) should be standard for health apps
The solution isn't abandoning AI—it's honest AI deployment. Use ensemble methods. Provide ranges instead of false precision. Integrate human verification loops. Stop pretending these models are more reliable than they actually are.
Claude Sonnet 4.6's tighter variance suggests consistency gaps between models are solvable. But until then, any diabetes app claiming reliable carb counting from photos is selling dangerous fiction.
The 27,000 queries don't lie. The question is whether the industry will listen.

