Handling Concept Drift Without Ground Truth
Decision summary
RAG pipelines degrade silently when user intents shift. Statistical distribution monitoring on semantic embeddings catches the shift before explicitly labeled failure data is available.
Invariant: Drift must be detected before degraded outputs become the only signal.
Decision ID
DEC-003
Status
Active
Date
Oct 2024
Primary implementation
DriftScope (Drift Monitoring System)
Where this is applied
This shows up in systems like DriftScope and TraceAI, where correctness depends on traceable outcomes.
Problem context
While deploying production LLM applications using custom RAG pipelines at Tech Bharat AI, I encountered a silent failure mode: user query semantics drifted over time.
For the dementia awareness assistant I worked on during the internship, the types of questions users asked evolved as they interacted more with the system. Without immediate human-in-the-loop ground-truth labels, it was hard to tell whether the retrieval pipeline was failing to find relevant context until users explicitly complained.
The tempting option
The classic approach is to log all user queries and responses, manually label a statistically significant sample each week to serve as ground truth, and compute retrieval metrics such as MRR or NDCG.
Another option was to use an LLM-as-a-judge to evaluate every retrieval output, providing a rapid proxy for ground truth.
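To make the first rejected option concrete, MRR over a manually labeled sample takes only a few lines. This is a generic sketch, not code from the project; the data layout (pairs of ranked document IDs and relevant document IDs) is an assumption for illustration.

```python
def mean_reciprocal_rank(labeled_samples):
    """MRR over manually labeled retrievals.

    Each sample is (ranked_doc_ids, relevant_doc_ids). A query scores
    1/rank of the first relevant document, or 0 if none was retrieved.
    """
    scores = []
    for ranked, relevant in labeled_samples:
        reciprocal_rank = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                reciprocal_rank = 1.0 / rank
                break
        scores.append(reciprocal_rank)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a query whose first relevant document sits at rank 2 contributes 0.5, and one with a relevant document at rank 1 contributes 1.0, giving an MRR of 0.75 over those two queries. The weekly labeling cadence is exactly what makes this loop too slow, as described below.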
Why it failed
Relying on manual ground truth is far too slow. By the time a shift in user behavior was labeled, the system had already been delivering degraded answers for a week.
Using an LLM-as-a-judge on every query introduced unacceptable latency and immense cost at scale. Furthermore, it didn't prevent drift; it merely logged failures after they occurred. The goal was to detect that the input distribution had shifted before degraded outputs were ever generated.
The chosen approach
I implemented embedding distribution tracking using statistical proxies: the Population Stability Index (PSI) and the Kolmogorov-Smirnov (KS) test.
Instead of scoring the correctness of the answer, incoming query embeddings are compared against a historical baseline cluster. If the incoming queries fall significantly outside the known embedding clusters, the system flags a "Concept Drift" alert.
This enables anomaly alerts and fallback to conservative answer generation parameters immediately, without waiting for explicit human labeling.
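The comparison described above can be sketched with NumPy alone. Everything here is illustrative rather than the actual DriftScope implementation: the function names are hypothetical, PSI > 0.2 is the conventional "significant shift" cutoff, and the KS cutoff approximates the two-sample critical value at alpha = 0.01. Per-dimension checks are one simple choice; a real system might instead test distances to cluster centroids.

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two 1-D samples (higher = more drift)."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the baseline range
    b_counts, _ = np.histogram(baseline, bins=edges)
    c_counts, _ = np.histogram(current, bins=edges)
    b_pct = np.clip(b_counts / b_counts.sum(), 1e-6, None)  # avoid log(0)
    c_pct = np.clip(c_counts / c_counts.sum(), 1e-6, None)
    return float(np.sum((c_pct - b_pct) * np.log(c_pct / b_pct)))

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum ECDF gap."""
    all_vals = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), all_vals, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), all_vals, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def flag_drift(baseline_emb, current_emb, psi_threshold=0.2, ks_threshold=None):
    """Return the embedding dimensions whose distribution has shifted.

    baseline_emb / current_emb: (n_queries, n_dims) embedding matrices.
    """
    n, m = len(baseline_emb), len(current_emb)
    if ks_threshold is None:
        # Approximate two-sample KS critical value at alpha ~= 0.01.
        ks_threshold = 1.63 * np.sqrt((n + m) / (n * m))
    flagged = []
    for d in range(baseline_emb.shape[1]):
        col_psi = psi(baseline_emb[:, d], current_emb[:, d])
        col_ks = ks_statistic(baseline_emb[:, d], current_emb[:, d])
        if col_psi > psi_threshold or col_ks > ks_threshold:
            flagged.append(d)
    return flagged
```

A non-empty result from flag_drift is what would raise the "Concept Drift" alert and switch generation to conservative parameters.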
Failure Modes
This approach primarily identifies when users ask novel questions, not necessarily whether the system answered them incorrectly.
- High false-positive rate if the user base grows and naturally expands the semantic space.
- Requires maintaining an active index of historical embedding distributions.
- Does not monitor generation quality, only the retrieval input boundaries.
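On the second point, the historical baseline has to be bounded in size and periodically refreshed, or organic growth of the user base keeps firing alerts forever. A minimal sketch of a windowed baseline store (a hypothetical class, not the project's actual index):

```python
from collections import deque
import numpy as np

class RollingBaseline:
    """Fixed-size window of historical query embeddings.

    Once a drift alert has been reviewed and judged benign (e.g. organic
    growth), continuing to add new traffic lets the window absorb the new
    semantic region instead of flagging it indefinitely.
    """
    def __init__(self, maxlen=10_000):
        self._buf = deque(maxlen=maxlen)  # oldest embeddings drop off automatically

    def add(self, embedding):
        self._buf.append(np.asarray(embedding, dtype=float))

    def snapshot(self):
        """Matrix of stored embeddings to compare incoming traffic against."""
        return np.stack(list(self._buf)) if self._buf else np.empty((0, 0))
```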
When to Revisit
If the environment shifts to one with deterministic feedback loops (e.g., users consistently hitting thumbs up/down on every response), direct human feedback becomes viable.
Until explicit, high-volume ground truth is available in real-time, statistical embedding proxies remain the first line of defense.