AI Hallucination Rates by Use Case
What is the actual hallucination rate of GPT, Claude, and Gemini in real-world use – not just benchmark scores?
Data synthesized from 4 research reports, 6 major benchmarks, 5 public audits, and 12+ expert sources | 2024 – 2026 | Links to all source reports are included at the bottom of this page.
The hallucination rate of any AI model is not a single number.
Online reports, forums, and blogs are filled with users citing radically different figures on AI hallucination rates – some claim under 3%, others report over 50%. The result is widespread confusion, because nobody has explained that these figures measure entirely different things on entirely different task types. This report exists to fix that: we reconcile the conflicting numbers, map them to the specific benchmarks and use cases they actually describe, and give you a single coherent picture of where hallucination stands today.
A model’s hallucination rate is a family of numbers that spans from 0.7% to 79% depending on what you measure. The same model that hallucinates 0.7% of the time summarizing a document can hallucinate 51% of the time recalling facts about a person. Citing either figure alone is accurate but incomplete.
This report synthesizes findings from Vectara’s HHEM leaderboard, OpenAI’s PersonQA and SimpleQA benchmarks, the BBC/EBU international news audit of 3,000+ responses, Stanford’s preregistered legal RAG evaluation, EPFL’s HalluHard multi-turn benchmark, Google DeepMind’s FACTS benchmark, the AA-Omniscience index, and four domain-specific research reports spanning medical, legal, coding, and financial use cases.
Three patterns repeat across every source. First, task type is the primary driver of hallucination rate – not the model. Second, grounding outputs in source documents reduces hallucination by 40 – 96% but never eliminates it. Third, reasoning models hallucinate more on factual recall despite outperforming predecessors on complex reasoning tasks.
The most counterintuitive finding: medical-specialized AI models hallucinate more than general-purpose models on clinical tasks (hallucination-free in only 51.3% of responses, versus 76.6% for general models), and developers who use AI most heavily experience 3x more frequent hallucinations than casual users.
The Numbers That Define the Problem
Six headline figures that frame the entire hallucination landscape in 2025 – 2026. Each tells a different part of the story.
Why Every Hallucination Statistic You Have Seen Is Technically Correct
The confusion around AI hallucination rates stems from a single, widely ignored problem: every major benchmark measures something different, and the differences are enormous. Consider these numbers, all drawn from credible sources, all published between 2025 and 2026. Every one is accurate. None is complete.
The practical conclusion: “hallucination rate” is a family of metrics that varies along six axes. Asking “Which model hallucinates least?” is only coherent after specifying a use case and a measurement regime; a short sketch after the list below shows how to record these axes alongside any reported figure.
1. Grounding condition – closed-book (model relies on training weights) vs grounded (model must stick to provided documents).
2. Unit of scoring – per answer, per claim, per citation, or per “significant issue.”
3. Task shape – summarization, short factual QA, multi-turn dialogue, code generation, domain reasoning.
4. Model behavior policy – willingness to answer vs abstain; lower refusal can inflate measured error.
5. Context length and turn count – longer inputs and later turns amplify error via self-conditioning.
6. Output length and claim volume – models that make more claims increase both correct and incorrect assertions.
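To make the point concrete, here is a minimal sketch – a hypothetical record type with illustrative field names, not something drawn from any of the cited benchmarks – of how a team might tag every hallucination figure with these six axes before comparing anything:

```python
from dataclasses import dataclass

# Hypothetical record type: tag every reported hallucination figure with the
# six axes above, so two numbers are only compared within one regime.
@dataclass(frozen=True)
class HallucinationMeasurement:
    rate: float              # the headline number, e.g. 0.015 or 0.33
    grounding: str           # "closed-book" | "grounded"
    scoring_unit: str        # "answer" | "claim" | "citation" | "issue"
    task_shape: str          # "summarization" | "short-qa" | "multi-turn" | ...
    abstention_policy: str   # "refusals-allowed" | "forced-answer"
    context_length: int      # input size at evaluation time (0 = none provided)
    turn_count: int          # 1 for single-shot benchmarks

# Two real figures from this report, recorded along the six axes:
vectara_gpt4o = HallucinationMeasurement(0.015, "grounded", "answer",
                                         "summarization", "forced-answer", 1000, 1)
personqa_o3 = HallucinationMeasurement(0.33, "closed-book", "answer",
                                       "short-qa", "refusals-allowed", 0, 1)
# Comparing vectara_gpt4o.rate to personqa_o3.rate directly is a category
# error: the two measurements differ on four of the six axes.
```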
The Six Benchmark Regimes (and Two to Discount)
Understanding the benchmark landscape is not optional. It is the prerequisite for interpreting any hallucination data responsibly. Each regime below measures a fundamentally different phenomenon.
Vectara HHEM – “Can It Stick to What Is Written?”
Vectara’s Hughes Hallucination Evaluation Model gives a model a document and asks it to summarize using only the facts in that document. It then checks whether the summary added anything not present in the source. This is a grounded summarization test – a direct proxy for RAG systems, enterprise search, and document analysis pipelines.
Two versions exist. The original dataset (~1,000 short documents, April 2025) produces the lowest hallucination rates in the industry: 0.7% – 15%. The new dataset (7,700 enterprise-length articles, Feb 2026) spans law, medicine, finance, and technology. Rates jumped 3 – 10x across all models.
Top family (2026): Gemini (Google) on original dataset; mixed on enterprise dataset. Rate range: 0.7% – 20.2%
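For teams who want to run this kind of check themselves, here is a minimal sketch, assuming the open HHEM checkpoint on Hugging Face still exposes the predict() helper shown on its model card – treat the exact API as an assumption, not a guarantee:

```python
# Minimal sketch of an HHEM-style grounded-summarization check, assuming the
# public checkpoint exposes predict() as its model card describes.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

source = "The plant was commissioned in 1998 and produces 40 MW of power."
summary = "The 40 MW plant was commissioned in 1998."

# Each pair is (source document, generated text); the score estimates the
# probability that the generated text is fully consistent with the source.
score = model.predict([(source, summary)])[0].item()
print(f"consistency score: {score:.2f}")  # low score => likely hallucination
```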
PersonQA and SimpleQA – “Can It Recall Facts?”
OpenAI’s PersonQA tests accuracy on questions about real people. SimpleQA tests short-answer factual queries across diverse topics. Both are ungrounded – no source document is provided; the model relies entirely on training knowledge.
These benchmarks revealed the “reasoning model paradox”: newer reasoning models recall facts less reliably than their predecessors (detailed later in this report). They are appropriate for evaluating AI assistants used without internet access on questions about public figures, historical events, or specific data points.
Top family (2026): o-series (OpenAI). Rate range: 14.8% – 79%
AA-Omniscience – “Does It Know What It Does Not Know?”
Released by Artificial Analysis in November 2025, this benchmark covers 6,000 questions across 42 topics in six domains. Its critical innovation: the Omniscience Index penalizes wrong answers and does not penalize refusals. A model that says “I don’t know” scores better than one that guesses incorrectly.
This inverts the standard incentive structure and produces radically different rankings. Only 4 of 40 tested models achieved a positive Omniscience Index score.
Top family (2026): Claude (Anthropic). Rate range: 0% – 88% hallucination when wrong
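Artificial Analysis’s exact formula is not reproduced here, but a scoring rule with the same key property – wrong answers penalized, refusals not – can be sketched in a few lines:

```python
# An illustrative scoring rule with the Omniscience Index's key property:
# wrong answers are penalized, refusals are not. This is a stand-in, not
# Artificial Analysis's published formula.
def omniscience_style_score(correct: int, incorrect: int, refused: int) -> float:
    total = correct + incorrect + refused
    return 100 * (correct - incorrect) / total  # range -100 .. +100

# A cautious model can beat a more "knowledgeable" confident guesser:
print(omniscience_style_score(correct=50, incorrect=10, refused=40))  # 40.0
print(omniscience_style_score(correct=55, incorrect=45, refused=0))   # 10.0
```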
FACTS Grounding – Multi-Dimensional Factuality
Google DeepMind’s FACTS benchmark, introduced in December 2025, breaks factuality into four dimensions: Grounding (faithfulness to provided documents), Multimodal (accuracy on visual + text), Parametric (stored training knowledge), and Search (accuracy with web retrieval tools).
No model scored above 70% on this multi-dimensional test. Even Gemini 3 Pro – the best performer at 68.8 – is wrong more than 30% of the time across all four slices combined. The Search slice consistently produces the best scores, confirming that retrieval access dramatically improves factual accuracy.
Top family (2026): Gemini (Google). Rate range: 36.0 – 68.8 overall score (higher = better)
HalluHard – “What Happens in Real Conversations?”
Introduced by researchers in Switzerland and Germany, HalluHard tests models in realistic multi-turn settings. Models must ground factual claims in cited sources; a web-search-based judge retrieves and verifies whether citations support the generated content.
This is the most pessimistic benchmark in the landscape – and likely the most representative of real-world production deployments. Even the best-performing model (Claude Opus 4.5 with web search) hallucinated 30% of the time. Without web search, rates exceeded 60% for most models.
Top family (2026): Claude (Anthropic). Rate range: 30% – 60%+
BBC/EBU News Integrity Audit
The European Broadcasting Union and BBC evaluated 3,000+ assistant responses to news questions across 18 countries and 14 languages. They found 45% of answers had at least one significant issue, 31% had serious sourcing problems, and 20% had major accuracy issues including hallucinated details.
This is not a pure “factual hallucination rate” in the narrow summarization sense. It is a “news answer integrity” metric that includes sourcing and context failures – highly relevant to real-world use, but not comparable to PersonQA or Vectara benchmarks.
Worst performer: Gemini (76% issue rate). Rate range: 30% – 76% significant issues
Two Benchmarks to Discount
TruthfulQA: models have been trained on its patterns, and a simple decision tree can score 79.6% without even reading the question. Citing TruthfulQA scores for 2025 – 2026 models is unreliable.
HaluEval: a classifier that flags answers longer than 27 characters achieves 93.3% accuracy, meaning the benchmark measures answer length more than truthfulness.
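The HaluEval artifact is easy to restate in code; the “detector” below never reads the content at all, which is precisely why the benchmark is discounted here:

```python
# The reported HaluEval artifact as code: a "detector" that ignores content
# entirely and only checks answer length, per the figure cited above.
def naive_hallucination_flag(answer: str) -> bool:
    return len(answer) > 27  # ~93.3% "accuracy" on HaluEval, per the source

print(naive_hallucination_flag("Paris"))                           # False
print(naive_hallucination_flag("The capital of France is Lyon."))  # True
```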
Model-by-Model Comparison
The reference data readers will screenshot and share. Two Vectara HHEM views – the original short-document dataset and the new enterprise dataset – each telling a different story about the same models.
Vectara HHEM – Original Dataset (short documents; grounded summarization, lower is better)

| Model | Provider | Hallucination Rate |
|---|---|---|
| Gemini-2.0-Flash-001 | Google | 0.7% |
| Gemini-2.0-Pro-Exp | Google | 0.8% |
| o3-mini-high | OpenAI | 0.8% |
| GPT-4.5-Preview | OpenAI | 1.2% |
| GPT-5 | OpenAI | 1.4% |
| GPT-4o | OpenAI | 1.5% |
| Grok-2 | xAI | 1.9% |
| GPT-4.1 | OpenAI | 2.0% |
| Grok-3-Beta | xAI | 2.1% |
| Claude-3.7-Sonnet | Anthropic | 4.4% |
| Llama-4-Maverick | Meta | 4.6% |
| Grok-4 | xAI | 4.8% |
| Claude-3-Opus | Anthropic | 10.1% |
| DeepSeek-R1 | DeepSeek | 14.3% |
Vectara HHEM – Enterprise Dataset (enterprise-length documents; lower is better)

| Model | Provider | Hallucination Rate |
|---|---|---|
| Gemini-2.5-Flash-Lite | Google | 3.3% |
| GPT-4.1 | OpenAI | 5.6% |
| Grok-3 | xAI | 5.8% |
| DeepSeek-V3 | DeepSeek | 6.1% |
| Gemini-2.5-Pro | Google | 7.0% |
| GPT-5 | OpenAI | >10% |
| Claude Sonnet 4.6 | Anthropic | 10.6% |
| GPT-5.2-high | OpenAI | 10.8% |
| DeepSeek-R1 | DeepSeek | 11.3% |
| Claude Opus 4.6 | Anthropic | 12.2% |
| Gemini-3-Pro | Google | 13.6% |
| Grok-4-fast-reasoning | xAI | 20.2% |
Use Case Breakdown – Where Hallucination Rates Actually Matter
The practical heart of the report. The “actual” hallucination rate in real use is a use-case envelope, not a scalar. Six domain views follow, each with risk indicators and before/after evidence.
Medical and Healthcare
Medical hallucination is not just a reliability problem – it is a patient safety problem. A global clinician survey found that 91.8% of respondents had personally encountered medical hallucinations in AI output. Physician audits in 2026 confirmed that 64 – 72% of residual hallucinations stem from causal or temporal reasoning failures – such as misidentifying the temporal order of drug-drug interactions – rather than simple knowledge deficits.
The counterintuitive finding: medical-specialized models hallucinate more than general-purpose models (hallucination-free in only 51.3% of responses vs 76.6%). Narrow domain fine-tuning does not replace broad reasoning ability.
Legal and Juridical
Legal hallucination has a defining failure mode: AI invents plausible-sounding case citations. These are not random errors – they are confidently presented, structurally valid-looking legal references that simply do not exist. MIT researchers found AI models are 34% more likely to use definitive phrases like “definitely” or “without doubt” when hallucinating.
Over 700 court cases involving AI-generated hallucinated content have been documented as of 2026. Stanford’s preregistered evaluation finds that leading legal research tools marketed as “hallucination-free” still produce incorrect information between 17% and 33% of the time.
Coding and Software Engineering
The Purdue University study remains the most-cited coding hallucination research: 52% of ChatGPT answers to programming questions were incorrect, often appearing plausible and well-formatted. Stack Overflow’s 2025 developer survey found that while 81% of developers use AI coding tools, 46% do not trust the accuracy. Trust dropped from 40% in 2024 to 29% in 2025-2026.
Domain data from AllAboutAI shows top coding models at 5.2% hallucination, while the all-model average is 17.8%. The gap has narrowed from the Purdue era but remains substantial. The “almost right” problem is the primary pain point: 45% of developers report that debugging AI code is more time-consuming than manual coding.
News Retrieval and AI Search
The BBC/EBU study evaluated 3,000+ assistant responses to news questions across 18 countries and 14 languages. 45% of responses contained at least one significant issue. 81% contained at least one mistake of any kind. Gemini was the worst performer at 76%, driven by catastrophic sourcing failures.
The Columbia Journalism Review / Tow Center test reports that AI search products often return incorrect answers and struggle with citation behavior. Incorrectness exceeded 60% in their experiment. A critical data point: refusal rates were only 0.5% despite the high error levels – these systems almost never say “I don’t know.”
Grounded Summarization (Best-Case Scenario)
When a model has source material and its only job is to accurately summarize it, hallucination rates drop to industry-best values. Top models achieve 0.7 – 1.5%, and even mid-tier models stay under 5% on the original Vectara dataset. This explains why AI performs reliably in enterprise document management, PDF summarization, and internal knowledge-base Q&A.
The lesson: grounding eliminates the vast majority of hallucination risk. When you give AI a document and ask it to stay within that document, it mostly does. The catch: enterprise-length documents push even the best models to 3.3% – 13.6%.
Financial Data and Analysis
Top models achieve ~2.1% hallucination rates on financial data, while the all-model average is 13.8%. Without safeguards, hallucination rates on financial tasks run 15 – 25%. Firms report 2.3 significant AI-driven errors per quarter, with individual incident costs ranging from $50,000 to $2.1 million.
RAG-enhanced models improved verifiability scores from 4.11/5.0 to 4.82/5.0 and relevance to 4.81/5.0 on financial report analysis – RAG improves verifiability and relevance more than raw accuracy.
The Reasoning Models Paradox – Why “Smarter” AI Hallucinates More
The most counterintuitive finding in 2025 – 2026 hallucination research: the models designed to be better at reasoning consistently score worse on factual recall benchmarks. Each new generation of reasoning model hallucinates more on knowledge recall, despite being demonstrably better at reasoning tasks.
If a task requires 100 reasoning steps and the model is 99% accurate at each step, the probability of a flawless result is only 0.99¹⁰⁰ ≈ 36.6%. As models engage in longer reasoning chains, the probability of a flawless end-to-end run decays exponentially. This is not a bug in reasoning models – it is a mathematical property of sequential inference.
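The compounding arithmetic is worth seeing at several chain lengths:

```python
# Per-step accuracy p over n sequential steps leaves p**n odds of a
# flawless end-to-end run.
p = 0.99
for n in (10, 50, 100, 200):
    print(f"{n:>3} steps: flawless-run probability = {p ** n:.1%}")
# prints 90.4%, 60.5%, 36.6%, 13.4%
```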
The mechanism: more reasoning produces longer, more assertive outputs containing more claims. If the evaluation punishes every unsupported claim, the measured “hallucination rate” rises even as capability improves. OpenAI explicitly notes that o3 “tends to make more claims overall,” producing more accurate and more hallucinated claims simultaneously.
An arXiv study (October 2025) confirmed causality: “Reasoning RL increases tool hallucination even when trained on non-tool tasks. This effect transcends overfitting. The effect is method-agnostic, appearing when reasoning is instilled via supervised fine-tuning and when it is merely elicited at inference.”
Correlation Map: Seven Factors That Drive Hallucination
A directional map of what pushes hallucination risk up or down, based on convergent evidence across multiple cited sources. Strong relationships are labeled when multiple studies align.
| Factor | Effect | Strength | Primary Evidence |
|---|---|---|---|
| More turns / longer dialogue history | ↑ Higher | Strong + | HalluHard: errors worsen in later turns via error propagation |
| Longer context windows / more input text | ↑ Higher | Strong + | Reuters experiment: error rates rise as input grows (32k to 128k words) |
| Output length / number of claims | ↑ Higher | Strong + | OpenAI system card: more claims = more correct AND more hallucinated |
| External retrieval / web search added | ↓ Lower | Strong – | HalluHard: web search reduces but does not eliminate hallucination |
| Constraining to a provided source | ↓ Lower | Strong – | Vectara: faithfulness to provided document yields single-digit rates |
| Domain complexity (legal/medical/news) | ↑ Higher | Strong + | News audits: 45 – 76%; legal: 17 – 88%; medical: up to 65.9% |
| Low refusal tendency (always answers) | ↑ Higher | Moderate + | EBU: 0.5% refusal rate with 45% issue rate; models rarely say “I don’t know” |
Expert Knowledge Guide: Five Principles From the Frontier
Synthesized from expert discussions including Lex Fridman episodes with Jensen Huang, Aravind Srinivas, Marc Andreessen, Yann LeCun, Nathan Lambert, and Sebastian Raschka. These insights rarely appear in written content.
“Truth needs an interface, not just a bigger model.”
Aravind Srinivas, CEO of Perplexity: For products where hallucination is a “bug,” architect for citation-backed retrieval and ranking quality rather than “prompting your way out.” He explicitly downplays prompt engineering as a long-term solution and highlights robust retrieval methods beyond embeddings alone.
Error probability compounds with length and long-tail prompts.
Yann LeCun, Chief AI Scientist at Meta: Each generated token carries some probability of drifting out of the “set of reasonable answers.” Errors accumulate with length. He stresses the “long tail” of prompts that training cannot cover. This maps directly onto empirical findings where longer contexts yield higher error rates.
Creativity vs correctness is not a vibe – it is an evaluation mismatch.
Marc Andreessen: Creative tasks tolerate invention; correctness-critical tasks do not. He points to hallucinated legal citations as the canonical example. “Hallucination rate” is a function of the domain’s tolerance for invention and its verifiability, not of model identity alone.
Benchmark literacy is now a product skill.
Nathan Lambert and Sebastian Raschka: Benchmark scores can be misleading due to contamination and even small format changes. Evaluate on fresh benchmarks that post-date the model’s training cutoff, otherwise you may be measuring dataset familiarity rather than reliability.
“Ground truth access” is the practical definition of reliability.
Jensen Huang, CEO of NVIDIA (March 2026): A “digital worker” must access ground truth – files, databases – and do research. This is a systems view consistent with the empirical trend that retrieval and grounding reduce (but never eliminate) hallucination.
Mitigation Playbook: What the Data Shows Works
Ranked by evidence strength. Each intervention includes the measured reduction from published studies.
Tier 1 – Strongest Evidence
1. Ground outputs in checkable sources (RAG)
Reduction: 40 – 96%. Cancer chatbots: 40% to 0 – 6%. Legal RAG: 69 – 88% to 17 – 33%. HalluHard with web search: ~60% to ~30%. The single most impactful intervention across all domains.
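A minimal sketch of the grounding pattern behind these numbers – `retrieve` and `llm` are hypothetical stand-ins for a vector store and a model client, not any specific API; the point is the shape of the prompt:

```python
# Grounding pattern: retrieve curated passages, then constrain the model
# to them, with citations and an explicit abstention path.
def answer_grounded(question: str, retrieve, llm) -> str:
    passages = retrieve(question, k=4)  # curated, domain-specific corpus
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the numbered passages below. "
        "Cite a passage number for every claim. "
        "If the passages do not contain the answer, reply exactly: I don't know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```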
2. Structured prompting and mitigation prompts
Reduction: 32 – 56%. Medical: 65.9% to 44.2%. GPT-4o: 53% to 23% (p < 0.001). Program-of-Thought decomposition and bullet-point justifications tied to citations outperform generic chain-of-thought.
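One plausible rendering of such a template – the cited studies’ exact prompts are not reproduced here, so treat this as illustrative only:

```python
# An illustrative structured-mitigation template: decomposition first, then
# per-claim justification tied to a citation, then explicit confidence.
MITIGATION_TEMPLATE = """\
Task: {task}

Rules:
1. Break the problem into numbered steps before answering.
2. After the answer, list every factual claim as a bullet:
   - <claim> (source: <citation, or 'none'>)
3. If a claim's source is 'none', remove it or flag it as uncertain.
4. End with one line: CONFIDENCE: high | medium | low
"""

print(MITIGATION_TEMPLATE.format(task="Summarize the attached discharge note."))
```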
3. Model selection by task type
Impact: 3 – 10x difference. Medical: top model 4.3% vs average 15.6%. Legal: best 6.4% vs average 18.7%. Coding: best 5.2% vs average 17.8%. The choice of model matters more than any single prompting technique.
Tier 2 – Strong Evidence
4. Reduce context length and turn depth
Both Reuters and HalluHard indicate longer contexts and later turns increase error rates. Summarize or re-ground between turns. Best model: 1.2% at 32k words, rising to 3.2% at 128k.
5. Require inline citations and references
HalluHard’s design principle: requiring models to cite sources enables automated verification. Span-level verification matches each claim against evidence and flags unsupported assertions.
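In code, span-level verification reduces to a loop over extracted claims; `extract_claims` and `entails` below are hypothetical helpers (an NLI model or an LLM judge), not a real library API:

```python
# Span-level verification: check each extracted claim against the evidence
# set and surface anything unsupported instead of shipping it.
def verify_spans(output: str, evidence: list[str], extract_claims, entails):
    report = []
    for claim in extract_claims(output):
        supported = any(entails(e, claim) for e in evidence)
        report.append((claim, "supported" if supported else "UNSUPPORTED"))
    return report  # route UNSUPPORTED spans to human review
```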
6. Track refusal rate alongside accuracy
Lower refusal rates can mask higher error volume. EBU: 0.5% refusal despite 45% issue rate. AA-Omniscience rewards “I don’t know” – only 4 of 40 models achieved a positive score.
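The trio is cheap to compute; a sketch, with illustrative counts that roughly mirror the EBU failure signature:

```python
# A falling refusal rate alongside a flat accuracy rate usually means error
# volume is growing, not quality.
def metric_trio(correct: int, wrong: int, refused: int) -> dict:
    total = correct + wrong + refused
    answered = correct + wrong
    return {
        "answer_rate": answered / total,
        "refusal_rate": refused / total,
        "error_rate_when_answering": wrong / answered if answered else 0.0,
    }

# Illustrative counts: near-zero refusal (0.5%) with a ~45% issue rate
# among answered queries, as in the EBU audit.
print(metric_trio(correct=550, wrong=445, refused=5))
```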
Tier 3 – Emerging Evidence
7. Multi-agent consensus pipelines
Multiple agents verify each other’s outputs. VaaS protocols reach 98%+ reliability in niche science domains. Still early-stage for general deployment.
8. Knowledge graphs + multimodal RAG
Graph-augmented RAG and hybrid RAG architectures becoming common in finance, healthcare, and legal workflows for improved retrieval precision.
9. Fine-tuning on faithful outputs
A NAACL 2025 study showed synthetic examples of faithful translations dropped hallucination rates by 90 – 96% without hurting quality. Domain-specific but promising.
Eight Common Mistakes in AI Deployment
The most frequent implementation errors, their measured impact, and the evidence-based fix.
Using one hallucination number for all use cases
Leads to wildly miscalibrated risk. The same model ranges from 0.7% to 79%.
Trusting reasoning models for factual recall
o3 hallucinates 33% on PersonQA; o4-mini hits 79% on SimpleQA.
Ignoring refusal rate metrics
EBU: 0.5% refusal + 45% errors = hidden risk. Low refusal inflates measured helpfulness.
Deploying without domain-specific evaluation
Benchmark performance does not equal your workflow performance.
Assuming RAG eliminates hallucination
Legal RAG still 17 – 33%. HalluHard with search: still 30%+.
Temperature tuning as primary mitigation
Temperature = 0 yielded only a ~2-point improvement (65.9% to ~64% in the medical study).
Relying on saturated benchmarks
TruthfulQA: a decision tree scores 79.6%. HaluEval: answer length predicts 93.3%.
Not testing at deployment context lengths
Error rises from 1.2% at 32k words to 3.2% at 128k; some models break at 200k.
Tool and Resource Comparison
A comparative view of the leading tools for measuring and reducing hallucinations as of early 2026. No single tool covers every use case.
| Tool | Best For | Key Features | Pricing | Limitations |
|---|---|---|---|---|
| LangSmith | Production tracing + eval for agents/RAG | Tracing, online/offline evals, annotation queues | Free + $39/seat/mo | Vendor platform; costs scale with traces |
| Arize Phoenix | Open-source tracing (self-host) | OTEL-based, eval templates, Alyx AI Copilot | Free; $50/mo (AX) | Requires DevOps for self-hosting |
| TruLens | Hallucination triad evaluation | Open-source, feedback functions, tracing | Free; enterprise via TruEra | LLM-as-judge can be brittle |
| Ragas | RAG evaluation metrics | Test set generation + metric suite | Free / open source | Metrics can be gamed |
| DeepEval | CI/CD testing and development | 50+ metrics, G-Eval custom metrics | Free; $19.99/user/mo | Resource-heavy for large suites |
| Weights & Biases | Agentic workflows and experiment tracking | Real-time guardrails, MCP auto-logging | $60/mo | ML-first interface; complex for non-devs |
| Maxim AI | Non-technical domain expert review | End-to-end simulation, no-code UI | Custom Enterprise | High barrier for small teams |
| Vectara HHEM | Grounded summarization benchmarking | Regularly updated public table; includes answer rate | Free / public | Summarization-only; not a proxy for multi-turn |
| HalluHard | Stress-testing multi-turn behavior | Multi-turn design; judge reads full PDFs | Free / public | Tail-risk benchmark, not average use |
| FACTS Grounding | Multi-dimensional factuality | Four-slice evaluation (grounding, parametric, search, multimodal) | Free / public | Risk of benchmark overfitting |
Success Checklist: Synthesized Best Practices
A phase-gated implementation checklist drawn from every source in this report. Each item starts with a verb. Each is specific enough that you know what “done” looks like.
- Map each AI workflow to its closest benchmark type (grounded summarization, factual recall, multi-turn, domain-specific)
- Identify your domain’s hallucination baseline using the domain heatmap from this report
- Select 2 – 3 evaluation tools from the comparison chart that match your stack
- Build a representative evaluation set from your actual production queries
- Choose models based on task-specific performance, not general benchmark rankings
- Implement RAG with curated, domain-specific knowledge bases (not generic web search)
- Set up structured prompting templates with explicit grounding instructions
- Configure refusal and abstention thresholds – prefer “I don’t know” over confident fabrication
- Test at your actual deployment context lengths, not just short-form benchmarks
- Track hallucination rate, answer rate, and refusal rate as a trio
- Implement span-level verification for high-stakes outputs
- Set up human-in-the-loop review for outputs above your risk threshold
- Re-evaluate on fresh benchmarks that post-date model training cutoff
- Monitor for context-length degradation (error rises from ~1.2% at 32k to ~3.2% at 128k)
- Review multi-turn error propagation in production conversations
- Update model selection quarterly as new benchmark data emerges
Six Hidden Patterns From Peer-Reviewed Research
Practitioner content tends to cite the same handful of numbers. The peer-reviewed literature contains insights that reframe the hallucination problem in ways the industry conversation has not yet absorbed.
The Specialization Paradox
Medical-specialized models hallucinate more than general-purpose models (hallucination-free in only 51.3% of responses vs 76.6%). Narrow domain fine-tuning does not replace broad reasoning ability. The clinical AI safety property emerges from sophisticated reasoning and broad knowledge integration, not narrow optimization. (arXiv, March 2025)
The Prompt Complexity Multiplier
Within the same model and domain, hallucination rates rise by roughly half between simple and contextual prompts (19% vs 28%). Complexity itself – independent of user type – drives errors. Simple questions can be grounded in single database entries; complex clinical scenarios requiring reasoning across multiple conditions strain model capabilities.
The Detection Gap
Semantic entropy methods achieve 0.790 AUROC on detection benchmarks, yet hallucination detectors reach only ~50% accuracy on FaithBench’s challenging samples. On hard cases, detection tools perform approximately as well as a coin flip. The gap between detection benchmarks and real-world detection is as wide as the hallucination gap itself.
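For reference, the semantic-entropy recipe itself is short: sample several answers, cluster them by bidirectional entailment, and take the entropy over meaning-clusters. A sketch, with `entails` as a hypothetical NLI helper rather than a real API:

```python
from math import log

# Semantic entropy in miniature: many distinct meanings across samples
# => high entropy => likely confabulation.
def semantic_entropy(samples: list[str], entails) -> float:
    clusters: list[list[str]] = []
    for s in samples:
        for c in clusters:
            if entails(c[0], s) and entails(s, c[0]):  # same meaning as cluster
                c.append(s)
                break
        else:  # no cluster matched: this sample introduces a new meaning
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * log(len(c) / n) for c in clusters)
```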
The Metric Mismatch
Synthetic rubrics inflate scores by 17.9 points on average compared to expert-validated performance. Benchmark optimization does not transfer to real-world reliability. The CHECK framework’s 0.3% rate on structured clinical trial questions represents an upper bound under ideal conditions – real clinical use falls between this and the 23% meta-analytic average.
The Citation vs Semantic Split
Systematic review generation (39.6% for GPT-3.5) measures citation accuracy – whether title, author, and year match. This is a fundamentally different construct than semantic hallucinations in patient Q&A. Research using these numbers interchangeably is comparing incomparable phenomena.
The Cancer Chatbot Proof of Concept
RAG with curated databases dropped cancer information hallucination from 40% to 0% (GPT-4) and 6% (GPT-3.5). The tradeoff: models declined to answer 19 – 64% of the time. Admitting uncertainty is the critical success metric for high-stakes AI – not helpfulness.
Why AI Is More Confident When It Is Wrong
One of the most dangerous properties of current AI systems is that confidence and accuracy are inversely correlated in failure modes. The model that sounds most certain is often the least reliable.
MIT researchers found that AI models are 34% more likely to use definitive phrases like “definitely” and “without doubt” when they are hallucinating facts. This “dangerously self-confident” behavior means that the linguistic signals humans use to assess reliability – hedging language, qualifiers, expressed uncertainty – are inverted in AI systems.
The Gemini Paradox illustrates this directly. Gemini 3 Pro achieved the highest raw accuracy on AA-Omniscience (55.9%) but also an 88% hallucination rate – meaning when it does not know an answer, it fabricates one 88% of the time rather than refusing. The Gemini 3.1 Pro update addressed this by increasing refusal, cutting hallucination from 88% to 50% with only 0.6% accuracy loss.
The business implication is clear: enterprises should prefer models that say “I don’t know” over models that confidently fabricate. AA-Omniscience and similar calibration-rewarding benchmarks are more predictive of production reliability than raw accuracy scores.
How Each Provider Approaches the Hallucination Tradeoff
Each major AI provider has made different architectural and policy choices that produce distinct hallucination profiles. Understanding these choices helps explain why models perform differently across benchmarks.
OpenAI: Scale + Reasoning, Accepting Higher Hallucination as a Capability Tradeoff
Strengths: Best at grounded summarization (GPT-4o: 1.5%, GPT-5: 1.4%). Strong on FACTS Search slice (77.7%). Multiple model tiers for different price/performance points.
Weaknesses: Reasoning models (o3, o4-mini) hallucinate 33 – 79% on factual recall. The o-series paradox means their “smartest” models are least reliable for simple fact-checking.
Philosophy: Maximize capability, accept that longer reasoning chains increase error surface, rely on users and downstream systems to verify.
Anthropic: Safety + Abstention Over Helpfulness
Strengths: Best knowledge calibration (Claude 4.1 Opus: 0% hallucination on AA-Omniscience). Best multi-turn reliability (HalluHard leader at ~30%). Strong on knowledge-boundary tasks.
Weaknesses: Higher hallucination on grounded summarization (4.4 – 12.2% on Vectara) vs Google and OpenAI top models. Higher refusal rates can frustrate users expecting answers.
Philosophy: It is better to refuse than to fabricate. Calibrate confidence to actual knowledge. Safety as an emergent property of honest uncertainty.
Google: Breadth + Knowledge, With Recent Calibration Improvements
Strengths: Best raw accuracy on AA-Omniscience (55.9%). Best overall FACTS score (68.8). Lowest grounded summarization rate on original Vectara (0.7%). Best Search-augmented performance.
Weaknesses: Highest “confident when wrong” rate (88% on AA-Omniscience). Worst performer on BBC/EBU news audit (76% error rate). The calibration problem is being addressed (3.1 Pro: 88% to 50%).
Philosophy: Maximize knowledge breadth and retrieval capability. Prioritize answering over refusing. Recent course correction toward calibration.
Open-Weight Models (Meta, DeepSeek): Accessibility + Transparency, With Higher Error Baselines
Strengths: Full model access enables custom fine-tuning and RAG integration. DeepSeek-V3 competitive on enterprise summarization (6.1%). Llama ecosystem enables on-premise deployment for regulated industries.
Weaknesses: Highest hallucination rates on most benchmarks. DeepSeek-R1: 14.3% on easy summarization, 83% on AA-Omniscience. Llama-4-Maverick: 87.6% on AA-Omniscience.
Philosophy: Open access and customizability offset higher base error rates. Enterprises can fine-tune to their specific domains.
References and Citations
All data in this report is drawn from publicly available, independently verifiable sources. Where sources conflict, the conflict is noted in the text.
- Vectara HHEM Hallucination Leaderboard – GitHub repository, updated through March 2026. github.com/vectara/hallucination-leaderboard
- OpenAI o3 and o4-mini System Card – PersonQA, SimpleQA benchmark results. cdn.openai.com
- EBU/BBC News Integrity in AI Assistants Report – 3,000+ responses, 18 countries, 14 languages. Published 2025. ebu.ch
- Reuters – “Does AI Business Model Have a Fatal Flaw?” Context-length stress test coverage, April 2026. reuters.com
- Stanford Digital Humanities Observatory – “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.” Preregistered evaluation, 2025. dho.stanford.edu
- Nature Communications Medicine – Medical hallucination mitigation study (s43856-025-01021-3). nature.com
- HalluHard Benchmark – Multi-turn citation-grounded hallucination evaluation. EPFL/German researchers. arxiv.org
- AllAboutAI – AI Hallucination Report 2026: domain-specific rates and model comparisons. allaboutai.com
- Google DeepMind FACTS Grounding Benchmark – Multi-dimensional factuality evaluation, December 2025.
- Artificial Analysis AA-Omniscience Benchmark – Knowledge reliability and calibration index, November 2025 – February 2026.
- Columbia Journalism Review / Tow Center – AI search citation accuracy audit.
- Rev.com AI Results Survey – 1,038 US adult AI users, July 2025. rev.com
- Purdue University – ChatGPT vs Stack Overflow study: 517 questions, 52% incorrect rate.
- Stack Overflow Developer Survey 2025/2026 – 84% adoption, 29% trust in accuracy.
- arXiv – “Reasoning RL increases tool hallucination” – October 2025 study on reasoning and hallucination causality.
- Lex Fridman Podcast transcripts – Episodes with Jensen Huang (March 2026), Aravind Srinivas (June 2024), Marc Andreessen (January 2025), Yann LeCun (March 2024), Nathan Lambert & Sebastian Raschka (January 2026).
- Suprmind – AI Hallucination Rates & Benchmarks research compilation, 2026. suprmind.ai
- Scientific Reports (2025) – App-store review analysis of ChatGPT hallucination indicators.
- arXiv (March 2025) – Medical hallucination survey: general-purpose vs specialized models.
- IJCAI 2024 – Benchmarking Fact-Conflicting Hallucination Detection. ijcai.org
AI Hallucination Rates by Use Case – Information-Gain Report | April 2026
Published by chatgptguide.ai | Data current as of April 3, 2026
This report synthesizes publicly available data for informational purposes. It does not constitute professional advice.
