# AI Response Fact-Checking
The task isn't to "improve model quality"; it is to ensure that no factual answer from the system reaches users unverified. This is an engineering problem with a concrete architecture, not a model fine-tuning exercise.
## Why Model Confidence Doesn't Equal Accuracy
GPT-4, Claude 3.5, Gemini: all modern LLMs generate answers that read as highly confident even when they contain factual errors. A log-probability close to 0 (i.e., a token probability close to 1) on a hallucinated statement is routine. RLHF fine-tuning makes this worse: models are trained to produce complete, coherent answers, not to say "I don't know".

Model confidence is therefore unusable as a filtering signal; external verification is required.
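To make the point concrete, here is a minimal sketch of why raw log-probabilities mislead: converting a span of logprobs (as returned by most LLM APIs) to probabilities shows near-certainty even if the span is hallucinated. The values are invented for illustration.

```python
import math

# Token log-probabilities as an LLM API might return them for a
# hallucinated span (values are illustrative, not real API output).
# A logprob near 0 means token probability near 1: the model "sounds" certain.
token_logprobs = [-0.02, -0.05, -0.01, -0.08]

probs = [math.exp(lp) for lp in token_logprobs]
avg_confidence = sum(probs) / len(probs)

print(f"average token probability: {avg_confidence:.3f}")  # ≈ 0.961
```

An average token probability above 0.95 tells you nothing about factual accuracy, which is exactly why the pipeline below verifies claims externally instead.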
## Production Fact-Checking Architecture

### Decomposition into Atomic Claims
Before verification, the answer is broken down into minimal verifiable statements (claims). "The company was founded in 1998 and holds 40% market share" is two claims: a founding date and a market-share figure. Use an LLM call with structured output (JSON Schema) or an NLP pipeline built on spaCy with coreference resolution.

Without decomposition the verifier operates at document level: it loses precision and cannot localize the specific error.
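A minimal sketch of what structured-output claim extraction can look like. The schema and field names (`claims`, `text`, `type`) are illustrative assumptions, not a fixed API; the model response here is simulated rather than fetched from a real LLM call.

```python
import json

# Hypothetical JSON Schema handed to the LLM as the structured-output format.
CLAIMS_SCHEMA = {
    "type": "object",
    "properties": {
        "claims": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},  # one atomic statement
                    "type": {
                        "type": "string",        # what kind of fact it asserts
                        "enum": ["date", "number", "entity", "other"],
                    },
                },
                "required": ["text", "type"],
            },
        }
    },
    "required": ["claims"],
}

def parse_claims(raw: str) -> list[dict]:
    """Parse the model's JSON response into a list of atomic claims."""
    return json.loads(raw)["claims"]

# Simulated model response for the example sentence from the text:
raw = json.dumps({"claims": [
    {"text": "The company was founded in 1998", "type": "date"},
    {"text": "The company holds 40% market share", "type": "number"},
]})
print(len(parse_claims(raw)))  # 2
```

Each extracted claim then flows into the verifier independently, which is what lets the pipeline point at the exact erroneous statement rather than the whole answer.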
### NLI Verification Against Source
If the source is known (a RAG knowledge base, an uploaded document), each claim is verified via NLI (Natural Language Inference). Apply `cross-encoder/nli-deberta-v3-base`: the input is a (claim, source context) pair; the output is entailment / neutral / contradiction with class probabilities.

Accept a claim at entailment > 0.75. Contradiction > 0.5 triggers an immediate flag. Neutral claims are marked "not confirmed by source".
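The threshold logic above can be sketched as a small pure function. It assumes the three class probabilities have already been obtained from the cross-encoder (e.g. via softmax over its logits); the verdict labels are illustrative.

```python
from typing import Literal

Verdict = Literal["accept", "flag", "unconfirmed"]

def nli_verdict(entailment: float, neutral: float, contradiction: float) -> Verdict:
    """Map NLI class probabilities to a claim verdict using the thresholds above."""
    if contradiction > 0.5:
        return "flag"          # likely contradicts the source: surface immediately
    if entailment > 0.75:
        return "accept"        # confirmed by the source
    return "unconfirmed"       # neutral or weak evidence

# Example scores as they might come from cross-encoder/nli-deberta-v3-base:
print(nli_verdict(0.91, 0.07, 0.02))  # accept
print(nli_verdict(0.10, 0.25, 0.65))  # flag
print(nli_verdict(0.40, 0.55, 0.05))  # unconfirmed
```

Checking contradiction first matters: a claim can simultaneously have moderate entailment and high contradiction across different source passages, and the conservative choice is to flag it.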
### External Verification via Search
For claims without a known source, verify via external search APIs: Tavily Search, Bing Web Search API, or specialized databases (PubMed for medicine, SEC EDGAR for finance, Wikidata SPARQL for general facts).

The scheme: extract named entities (NER) → form a verification query → fetch the top-3 results → run NLI between the claim and each result → aggregate the verdicts.
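The aggregation step at the end of that chain can be sketched as follows. The policy here is one reasonable assumption, not the only option: any strong contradiction flags the claim, otherwise the best entailment across retrieved documents decides.

```python
def aggregate_search_nli(results: list[dict]) -> str:
    """Aggregate per-result NLI scores for one claim.

    `results` holds one dict per retrieved document (e.g. top-3 search hits),
    with "entailment" and "contradiction" probabilities from the NLI model.
    Illustrative policy: contradiction anywhere wins; otherwise take the
    max entailment against the acceptance threshold.
    """
    if any(r["contradiction"] > 0.5 for r in results):
        return "flag"
    if max(r["entailment"] for r in results) > 0.75:
        return "accept"
    return "unconfirmed"

scores = [
    {"entailment": 0.82, "contradiction": 0.05},
    {"entailment": 0.30, "contradiction": 0.10},
    {"entailment": 0.55, "contradiction": 0.20},
]
print(aggregate_search_nli(scores))  # accept
```

Taking the max entailment (rather than the mean) reflects that a single authoritative hit confirming the claim is enough; averaging would punish claims that only one of the three results happens to cover.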
## Practical Case
The client is a news aggregator running an automatic article-summarization system on GPT-4o. After launch they discovered that 12% of summaries contained dates, numbers, and names absent from the original article (measured on a sample of 500 summaries).

The pipeline implemented: claim extraction via OpenAI structured output → an NLI check of each claim against the original text (`deberta-v3-large-mnli`) → claims with entailment < 0.70 highlighted yellow in the UI with a link to the source.

Result: the share of unverified statements in final summaries dropped from 12% to 1.8%, at the cost of 180-220 ms added latency per summary (batched NLI on a T4 GPU).
## Verification Methods Comparison
| Method | When to Apply | Accuracy | Latency |
|---|---|---|---|
| NLI against source | RAG, document QA | High | 50-150ms |
| Self-consistency (N=5) | No source | Medium | ×N LLM cost |
| External search + NLI | General facts | Medium–high | 500-1500ms |
| Specialized API | Medicine, law | High in domain | API-dependent |
## What's Needed to Start
Minimum: access to production logs (500+ requests for a baseline), a description of the domain and of error criticality, and information about the existing pipeline (RAG or not).

Optimal: a ground-truth dataset of 100-300 expert-verified question-answer pairs. Without it, metrics can only be measured indirectly.
Stages: audit current answers and classify error types → select a verification method for the domain → develop claim extraction → integrate the verifier into the pipeline → A/B test on 10% of traffic → monitor verifier precision/recall.
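For the monitoring stage, verifier precision/recall against expert labels reduces to a small calculation. This is a generic sketch; the variable names are assumptions.

```python
def verifier_metrics(flags: list[bool], ground_truth: list[bool]) -> tuple[float, float]:
    """Precision/recall of the verifier's error flags against expert labels.

    flags[i]        = True if the verifier flagged answer i as erroneous.
    ground_truth[i] = True if experts confirmed a factual error in answer i.
    """
    tp = sum(f and g for f, g in zip(flags, ground_truth))
    fp = sum(f and not g for f, g in zip(flags, ground_truth))
    fn = sum(not f and g for f, g in zip(flags, ground_truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 answers, one false positive, no misses.
p, r = verifier_metrics([True, True, False, True], [True, False, False, True])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=1.00
```

In this setting recall usually matters more than precision: a missed factual error reaches the user, while a false flag only costs a redundant highlight.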
Timeline: 2-4 weeks for integration into an existing pipeline; complex domains requiring external APIs, up to 6 weeks.