DriftScope
Persistence-aware drift scoring without ML magic
DriftScope is a deterministic monitoring system for ML services: it compares live windows to baselines using PSI, KS, and related tests, tracks persistence (how long a shift lasts), and routes severities to playbooks. There is no "black-box drift score"—only declared statistics, thresholds, and recorded evidence.
The system acknowledges delayed labels: it optimizes for early, explainable signals (distribution shift, confidence entropy) while joining ground truth when it finally arrives.
Why it matters: teams need monitoring that engineers trust when they wake up at 3am.
From resume to something you can read
Below are side-by-side windows for one feature, using the same statistic names called out on the resume (PSI, JS divergence, entropy). The numbers are illustrative; they show how the engine reads a comparison, not production data.
Reference window (A) vs live window (B) — model score
| Statistic | Ref (A) | Live (B) | Notes |
|---|---|---|---|
| Mean | 0.412 | 0.389 | shifted lower |
| PSI | — | 0.24 | > 0.2 threshold |
| JS divergence | — | 0.081 | non-trivial move |
| Entropy (nats) | 0.93 | 0.88 | confidence mass tightening |
| KS p-value | — | 0.004 | reject same-dist hypothesis |
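The table's statistics can be computed from binned window histograms. A minimal pure-Python sketch, where the bin probabilities and the `eps` smoothing constant are illustrative assumptions, not DriftScope's actual implementation:

```python
import math

def psi(ref_probs, live_probs, eps=1e-6):
    # Population Stability Index over shared bins:
    # sum over bins of (live - ref) * ln(live / ref)
    return sum(
        (l - r) * math.log((l + eps) / (r + eps))
        for r, l in zip(ref_probs, live_probs)
    )

def js_divergence(p, q):
    # Jensen-Shannon divergence in nats: mean KL divergence to the midpoint
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda a, b: sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entropy(probs):
    # Shannon entropy in nats, matching the table's units
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative binned score distributions (4 bins, probabilities sum to 1)
ref = [0.25, 0.35, 0.25, 0.15]
live = [0.32, 0.38, 0.20, 0.10]
print(f"PSI={psi(ref, live):.3f}  JS={js_divergence(ref, live):.3f}  H={entropy(live):.3f}")
```

A KS test would run on the raw samples rather than the bins; in practice that is `scipy.stats.ks_2samp`.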
How DriftScope would read this
PSI and KS agree, so the distribution shift is real rather than noise. The persistence tracker sees the condition hold for more than 45 minutes, so severity escalates to "review" instead of paging on a one-bar spike.
Resume-aligned: persistence-aware severity and deterministic tests come first; the ML anomaly detector runs only as an optional shadow.
Models fail silently: latency looks fine while predictions quietly rot. Most teams add dashboards; fewer build decision-ready signals with clear thresholds and actions.
The challenge is to detect degradation without requiring instant labels—while avoiding alert fatigue.
- Windowing: Fixed and staggered windows for robust comparison against training/reference slices.
- Test battery: Per-feature PSI/KS with Holm-Bonferroni style control across large feature sets.
- Persistence scoring: Short spikes downgrade severity; sustained shifts escalate.
- Playbooks: Map severities to concrete actions (notify, shadow, rollback candidate) with audit logs.
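The test-battery item above calls for Holm-Bonferroni style control across many per-feature tests. A minimal sketch of Holm's step-down procedure over a vector of p-values (function name and the default `alpha` are illustrative):

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Return the set of indices rejected under Holm's step-down procedure.

    Sort p-values ascending; the k-th smallest is compared against
    alpha / (m - k), and the procedure stops at the first non-rejection.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = set()
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            rejected.add(i)
        else:
            break  # step-down: everything larger also fails
    return rejected
```

With, say, per-feature KS p-values `[0.001, 0.02, 0.8]`, only the first two features would be flagged, which keeps a wide feature set from alerting on chance alone.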
Challenges
1. Seasonality without seasonal baselines causes false positives; seasonal templates are required.
2. Delayed labels create blind spots; confidence/entropy proxies are used with explicit caveats.
3. High-cardinality features explode test counts; feature grouping and hierarchical alerts keep them manageable.
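One way to realize the seasonal-template fix from the first challenge is to key reference windows by hour-of-week instead of comparing against a flat trailing baseline. A sketch under that assumption (the key scheme and the `templates` mapping are hypothetical, not DriftScope internals):

```python
from datetime import datetime

def seasonal_key(ts: datetime) -> tuple:
    # Hour-of-week key: a Monday 09:00 live window compares against prior
    # Monday 09:00 baselines, so a routine weekend dip is not flagged as drift.
    return (ts.weekday(), ts.hour)

def pick_baseline(live_start: datetime, templates: dict):
    # templates: (weekday, hour) -> binned reference histogram for that slot
    return templates.get(seasonal_key(live_start))
```

A missing slot (`None`) would fall back to a coarser template or suppress the test rather than compare against the wrong season.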
Trade-offs
1. Tuned for high recall on drift at the cost of more triage work; better than missing silent rot.
2. Fixed windows sacrifice agility for interpretability.
3. Statistical tests assume representative samples; biased sampling breaks the conclusions.
drift_monitor.py: PSI + persistence gate
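A condensed sketch of what a file like this could contain: a PSI check feeding a persistence gate that only escalates after the breach has held for the configured window. The thresholds, severity names, and the `PersistenceGate` class are illustrative, not the actual source:

```python
import math

PSI_THRESHOLD = 0.2      # matches the illustrative threshold in the table above
HOLD_SECONDS = 45 * 60   # breach must persist ~45 min before escalating

def psi(ref_probs, live_probs, eps=1e-6):
    # Population Stability Index over shared bins
    return sum((l - r) * math.log((l + eps) / (r + eps))
               for r, l in zip(ref_probs, live_probs))

class PersistenceGate:
    """Escalates only when a drift condition holds continuously."""

    def __init__(self, hold_seconds=HOLD_SECONDS):
        self.hold_seconds = hold_seconds
        self.breach_started = None  # wall-clock time the current breach began

    def observe(self, drifting: bool, now: float) -> str:
        if not drifting:
            self.breach_started = None  # spike ended: reset, don't page
            return "ok"
        if self.breach_started is None:
            self.breach_started = now
        held = now - self.breach_started
        return "review" if held >= self.hold_seconds else "watch"
```

On each evaluation tick the monitor would call `gate.observe(psi(ref, live) > PSI_THRESHOLD, time.time())` and route the returned severity to a playbook.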
1. Designed: defined severity ladders, persistence rules, and playbook mapping for model families.
2. Implemented: built the statistical test suite, Prometheus exporters, and Grafana dashboards.
3. Scrapped: an opaque ML-based anomaly detector as the primary signal; kept only as an optional shadow.