Production RAG: What Nobody Tells You After 6 Months

The RAG tutorials are good. They'll get you to a working demo in an afternoon. What they don't cover is what happens six months into production: indexing drift, retrieval quality decay, the cost at scale, the failure modes that only appear in real usage.

Here's what you learn from operating RAG systems long enough for the surprising problems to surface.

The Index Goes Stale and Nobody Notices

The most common production RAG failure isn't retrieval quality — it's index freshness.

Documents in your source system change. New documents are added. Old documents are updated or deleted. If your indexing pipeline runs weekly and a critical document was updated on day 2 of the cycle, the RAG system will serve the outdated version for the next 5 days — often without any indication to the user that the retrieved content isn't current.

The first sign is usually a user complaint: "The AI told me X, but the current policy is Y." By then, the outdated information has been served to an unknown number of users.

What to add: Track document version or modification timestamp in your vector store metadata. Compare the modification date of the source document against the indexing timestamp at retrieval time. If there's a gap above a defined threshold (e.g., source was modified more than 24 hours after the indexed version), flag the response or trigger a re-index before serving.
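The freshness comparison described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `IndexedChunk`, `is_stale`, and `split_by_freshness` are hypothetical names, and it assumes you can look up each source document's last-modified time (the `source_mtimes` mapping) at retrieval time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IndexedChunk:
    doc_id: str
    indexed_at: datetime  # when this version was embedded and written to the index


def is_stale(chunk, source_modified_at, max_gap=timedelta(hours=24)):
    """Return True if the source changed more than max_gap after indexing.

    Both timestamps must be comparable (both naive or both timezone-aware).
    """
    return source_modified_at - chunk.indexed_at > max_gap


def split_by_freshness(chunks, source_mtimes, max_gap=timedelta(hours=24)):
    """Partition retrieved chunks into (fresh, stale) lists.

    source_mtimes maps doc_id -> last-modified time in the source system.
    Documents missing from the map (for example, deleted ones) count as stale.
    """
    fresh, stale = [], []
    for chunk in chunks:
        mtime = source_mtimes.get(chunk.doc_id)
        if mtime is None or is_stale(chunk, mtime, max_gap):
            stale.append(chunk)
        else:
            fresh.append(chunk)
    return fresh, stale
```

What you do with the stale list is a product decision: drop those chunks, flag the response, or queue the documents for re-indexing before serving.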

Also: monitor your indexing pipeline as a first-class production process — with success/failure alerting, record count checks, and schema validation. A silent pipeline failure that stops indexing new documents is indistinguishable from "everything is fine" for weeks.
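As a rough sketch of the record count and liveness checks, assuming your pipeline can report how many documents the source holds, how many it indexed, and when it last succeeded. The function name and the default thresholds are placeholders to tune, and schema validation is not covered here.

```python
def check_index_run(expected_count, indexed_count, last_success_age_hours,
                    max_drop_ratio=0.05, max_age_hours=36.0):
    """Return a list of alert messages; an empty list means the run looks healthy.

    expected_count: documents currently in the source system
    indexed_count: documents present in the index after the run
    last_success_age_hours: hours since the pipeline last completed successfully
    """
    alerts = []
    if last_success_age_hours > max_age_hours:
        alerts.append(
            f"no successful index run in {last_success_age_hours:.0f}h "
            f"(limit {max_age_hours:.0f}h)"
        )
    if expected_count > 0:
        missing_ratio = (expected_count - indexed_count) / expected_count
        if missing_ratio > max_drop_ratio:
            alerts.append(
                f"index is missing {missing_ratio:.1%} of source documents "
                f"(limit {max_drop_ratio:.1%})"
            )
    return alerts
```

Wire the returned messages into whatever alerting you already use, so a pipeline that silently stops produces a page instead of weeks of stale answers.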

Retrieval Quality Decays Without You Noticing

Retrieval quality — whether the most relevant passages are actually being retrieved — is easy to evaluate at build time and hard to monitor in production.

The problem: retrieval quality degrades as the document corpus changes. New documents can dilute results if chunking isn't consistent. Embedding model updates can shift the vector space subtly, making old embeddings less comparable to new query embeddings. Corpus growth increases the probability of retrieving plausible-but-wrong passages.

None of this shows up in your system logs as an error. Retrieval still returns N results. They're just not the right ones anymore.

What to add: A retrieval quality eval that runs on a schedule. Maintain a test set of queries with known relevant passages, and evaluate whether the correct passages appear in the top K retrieved results. Track Recall@K over time. A downward trend is a signal that retrieval needs attention before users notice the quality drop in responses.
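A scheduled Recall@K check might look like the sketch below. It assumes a hypothetical `retrieve(query, k)` callable that returns passage IDs in ranked order, and a test set of `(query, relevant_ids)` pairs that you maintain by hand.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the known relevant passages that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each test query needs at least one relevant passage")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(test_set, retrieve, k):
    """Average Recall@K over a test set of (query, relevant_ids) pairs.

    retrieve(query, k) is assumed to return passage IDs in ranked order.
    """
    if not test_set:
        raise ValueError("test set is empty")
    scores = [recall_at_k(retrieve(q, k), rel, k) for q, rel in test_set]
    return sum(scores) / len(scores)
```

Store the result of each run with a timestamp. The number you care about is the trend across runs, not any single value.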

The Chunking Strategy You Chose at Launch Won't Scale

Chunking decisions made at the start — chunk size, overlap, whether to chunk by paragraph or by token count — are made with limited data about how users will actually query the system.

Six months in, you have real data: the queries users actually ask, the retrieval failures that occurred, the cases where context was split across chunks in ways that made retrieval fail. That data almost always suggests changes to the chunking strategy.

The problem is that re-chunking requires re-indexing the entire corpus, which is a significant operation on a large document set. Teams that haven't built re-indexing capacity into their operational workflow find that their chunking debt accumulates.

What to add: Build re-indexing as a first-class operational process from the start, not an ad-hoc task. Define the process, estimate the time, and make sure it can be run without service interruption. Plan for at least one chunking strategy revision within the first six months, informed by retrieval failure analysis.
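One common way to re-index without interrupting service is to build the new index under a separate name and switch queries over only once it is complete. The sketch below shows the pattern in memory. `IndexRegistry` is a hypothetical stand-in for whatever named-index or alias mechanism your vector store offers, and embedding is left out to keep the example short.

```python
from typing import Callable, Dict, List, Optional


def chunk_by_words(size: int) -> Callable[[str], List[str]]:
    """Return a chunker that splits a document into runs of `size` words."""
    def chunker(doc: str) -> List[str]:
        words = doc.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    return chunker


class IndexRegistry:
    """Hypothetical stand-in for a vector store that supports named indexes."""

    def __init__(self) -> None:
        self._indexes: Dict[str, List[str]] = {}
        self._live: Optional[str] = None

    def build(self, name: str, documents: List[str],
              chunker: Callable[[str], List[str]]) -> None:
        # Build fully under a new name; the live index is not touched.
        self._indexes[name] = [c for doc in documents for c in chunker(doc)]

    def promote(self, name: str) -> Optional[str]:
        """Point queries at `name`; return the previous live name for rollback."""
        if name not in self._indexes:
            raise KeyError(f"index {name!r} has not been built")
        previous, self._live = self._live, name
        return previous

    def live_chunks(self) -> List[str]:
        if self._live is None:
            raise RuntimeError("no index has been promoted yet")
        return self._indexes[self._live]
```

Keeping the previous index around until the new one has passed your retrieval eval gives you a rollback path if the new chunking turns out worse.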

RAG Hallucinations Are Different From Regular LLM Hallucinations

General LLM hallucinations happen when the model generates content that its training data doesn't support. RAG hallucinations are more subtle: the model generates content that isn't in the retrieved context, or over-interpolates between retrieved passages to produce a claim that no single passage actually makes.

These are harder to detect because the output looks grounded: it may cite real documents or reference real information. The hallucination is in the synthesis step, not the retrieval step.

Common RAG-specific hallucination patterns:

The model combines two separate facts from two separate passages to imply a connection that doesn't exist
The model fills in details (dates, numbers, proper nouns) from its parametric memory when the retrieved passage doesn't include them
The model answers confidently when retrieved passages contain contradictory information, rather than surfacing the contradiction

What to add: In your eval suite, include test cases where the relevant passage is deliberately absent from the context. The model should say "I don't know" or "I can't find this in the provided documents" rather than hallucinate an answer. Also include cases where the retrieved passages contradict each other: the model should surface the contradiction, not pick one side and present it as settled.
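A missing-passage eval can start as simply as the sketch below. It assumes a hypothetical `answer_fn(question, passages)` that wraps your generation step, and it detects abstention by phrase matching, which is crude: the marker list is an assumption to adapt to your system's actual refusal wording, or to replace with a stricter check.

```python
# Phrases treated as an abstention; an assumption, adjust to your system's wording.
ABSTAIN_MARKERS = ("i don't know", "can't find", "cannot find", "couldn't find")


def is_abstention(answer: str) -> bool:
    """Crude check: does the answer contain one of the abstention phrases?"""
    text = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate(cases, answer_fn) -> float:
    """Share of missing-passage cases where the model declined to answer.

    cases: list of (question, passages) pairs with the relevant passage removed.
    answer_fn(question, passages) -> str wraps the generation step.
    """
    if not cases:
        raise ValueError("no test cases supplied")
    abstained = sum(is_abstention(answer_fn(q, p)) for q, p in cases)
    return abstained / len(cases)
```

For this test set the target rate is 1.0: every case was built so that the right answer is a refusal.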

Cost at Scale Is Different From Cost in Dev

RAG adds cost beyond LLM inference: embedding generation at query time, vector database reads, and — if you're doing reranking — a second model call.

In development, these costs are negligible. At production scale, they compound:

Embedding cost at query time: If you use a managed embedding endpoint, every query incurs an embedding API call. At 50,000 queries/day with a managed embedding model at $0.0001 per call, that's $5/day, or about $1,825/year, just for query embeddings, and it is often missing from the cost estimate.

Vector database reads: Managed vector databases charge per query or per vector stored. Understand the pricing model before you hit production scale, not after.

Context window cost: Long retrieved contexts mean long input token counts. A RAG system retrieving 5,000 tokens of context per query adds those 5,000 tokens to every call; if the prompt without retrieval is around 1,000 tokens, that is roughly 6x the input tokens. At GPT-4o pricing, this is a meaningful multiplier on your LLM cost.

What to add: A per-query cost breakdown that includes all components — embedding, retrieval, LLM inference, reranking if applicable. Track this with the same budget alert mechanism you'd use for plain LLM calls.
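A per-query breakdown can be a small data structure that you fill in from your providers' actual price sheets. In the sketch below, every key in `prices` is a placeholder name and every value you pass is your own assumption; nothing here reflects a real vendor's pricing.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    """Dollar cost of one query, split by component."""
    embedding: float
    retrieval: float
    rerank: float
    llm_input: float
    llm_output: float

    @property
    def total(self) -> float:
        return (self.embedding + self.retrieval + self.rerank
                + self.llm_input + self.llm_output)


def estimate_query_cost(prompt_tokens, context_tokens, output_tokens, prices):
    """Estimate one query's cost from unit prices in dollars.

    prices is a dict of placeholder unit prices; take real values from
    your providers' price sheets.
    """
    return QueryCost(
        embedding=prices["embed_per_call"],
        retrieval=prices["retrieval_per_query"],
        rerank=prices.get("rerank_per_call", 0.0),  # 0 if you don't rerank
        llm_input=(prompt_tokens + context_tokens) / 1000
        * prices["input_per_1k_tokens"],
        llm_output=output_tokens / 1000 * prices["output_per_1k_tokens"],
    )


def annual_cost(per_query_cost, queries_per_day, days=365):
    """Scale a per-query cost to a yearly figure."""
    return per_query_cost * queries_per_day * days
```

With the embedding figure used earlier ($0.0001 per call at 50,000 queries/day), `annual_cost(0.0001, 50_000)` comes to about $1,825. Logging `QueryCost` per request lets you alert on the total and see which component moved.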

The Silent Retrieval Failure Mode

When retrieval returns low-relevance results, the system can either generate a response anyway using the low-relevance context (producing a potentially wrong answer with high confidence), or decline and surface the low confidence to the user.

Most RAG systems do the first because it's the default: the LLM will try to synthesize something useful from whatever context it receives. In production, this produces a class of failures that users experience as "wrong answers" rather than "I don't know" — and confident wrong answers are worse than a clear "I couldn't find that."

What to add: A retrieval confidence gate. If the highest similarity score from retrieval is below a defined threshold, don't attempt generation. Instead, surface an "I couldn't find a reliable answer to that" response. This requires calibrating the threshold against your eval set, but it transforms the highest-cost failure mode (confident wrong answer) into a much cheaper one (transparent limitation).
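The gate itself is only a few lines. This sketch assumes hypothetical `retrieve(query)` and `generate(query, passages)` callables, and assumes that a higher score means a closer match (as with cosine similarity). If your store returns distances instead, flip the comparison.

```python
FALLBACK = "I couldn't find a reliable answer to that."


def answer_with_gate(query, retrieve, generate, min_score):
    """Generate only when the best retrieval score clears the threshold.

    retrieve(query) is assumed to return a list of (passage, score) pairs,
    where a higher score means a closer match.
    generate(query, passages) is assumed to return the model's answer.
    """
    results = retrieve(query)
    best = max((score for _, score in results), default=None)
    if best is None or best < min_score:
        # Nothing retrieved, or nothing relevant enough: decline transparently.
        return FALLBACK
    return generate(query, [passage for passage, _ in results])
```

To calibrate `min_score`, run your eval queries through retrieval and pick a value that separates the cases where the right passage was found from the cases where it wasn't.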

Six Months In: What Actually Matters

The things that seemed important at build time (embedding model choice, vector database selection, chunk size) turn out to be less important than the operational concerns that only surface in production: index freshness, retrieval quality monitoring, re-indexing capacity, hallucination detection, and cost tracking.

Every RAG system that operates successfully long-term has invested in these operational foundations. Every one that has struggled has treated them as later concerns.
