Monitoring Systems · 2024 · Research
    Deterministic Monitoring

    DriftScope

    Persistence-aware drift scoring without ML magic

    Python · Pandas · Evidently AI · Prometheus · Grafana · FastAPI
    0.92
    Drift sensitivity (recall)
    0.07
    False alarm rate
    5 min
    Time to first alert
    14
    Models monitored

    DriftScope is a deterministic monitoring system for ML services: it compares live windows to baselines using PSI, KS, and related tests, tracks persistence (how long a shift lasts), and routes severities to playbooks. There is no "black-box drift score"—only declared statistics, thresholds, and recorded evidence.
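
As a concrete illustration of the "declared statistics, not black-box scores" stance, here is a minimal PSI sketch. The helper name `psi` and the quantile-binning choice are illustrative assumptions, not the project's actual code:

```python
import numpy as np

def psi(ref, live, bins=10):
    """Population Stability Index between a reference and a live sample.

    Bin edges come from the reference distribution's quantiles so each
    reference bin carries roughly equal mass.
    """
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    ref_counts, _ = np.histogram(ref, bins=edges)
    # Clip live values into the reference range so nothing falls outside.
    live_counts, _ = np.histogram(np.clip(live, edges[0], edges[-1]), bins=edges)
    eps = 1e-6  # avoid log(0) on empty bins
    p = np.clip(ref_counts / ref_counts.sum(), eps, None)
    q = np.clip(live_counts / live_counts.sum(), eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.41, 0.1, 10_000)
shifted = rng.normal(0.39, 0.1, 10_000)
print(psi(baseline, baseline[:5000]))  # sampling noise only: near 0
print(psi(baseline, shifted))          # small mean shift: clearly larger
```

Because every threshold is a named constant on a named statistic, an engineer paged at 3am can recompute the same number by hand.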

    The system acknowledges delayed labels: it optimizes for early, explainable signals (distribution shift, confidence entropy) while joining ground truth when it finally arrives.
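
One label-free proxy the text names is confidence entropy. A minimal sketch, assuming per-row class-probability vectors; `mean_prediction_entropy` is an illustrative name:

```python
import numpy as np

def mean_prediction_entropy(probs):
    """Mean Shannon entropy (nats) of per-row class-probability vectors.

    A sustained change versus the reference window is a label-free
    early-warning signal, not proof of degradation on its own.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

# Illustrative windows: confident predictions carry less entropy.
uncertain = [[0.5, 0.5], [0.6, 0.4]]
confident = [[0.95, 0.05], [0.99, 0.01]]
print(mean_prediction_entropy(uncertain))  # close to ln(2) ≈ 0.693
print(mean_prediction_entropy(confident))  # much lower
```

When ground truth arrives, the recorded entropy series can be joined against realized error to validate (or caveat) the proxy.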

    Why it matters: teams need monitoring that engineers trust when they wake up at 3am.

    Inspectable proof

    From resume to something you can read

    Side-by-side windows for one feature, using the same statistic names called out on the resume (PSI, JS divergence, entropy). Numbers are illustrative of how the engine reads.

    Reference window (A) vs live window (B) — model score

    Statistic        Ref (A)   Live (B)   Notes
    Mean             0.412     0.389      shifted lower
    Entropy (nats)   0.93      0.88       confidence mass tightening

    Comparison (A vs B)   Value    Notes
    PSI                   0.24     > 0.2 threshold
    JS divergence         0.081    non-trivial move
    KS p-value            0.004    reject same-dist hypothesis

    How DriftScope would read this

    PSI + KS agree → distribution shift is real, not noise.
    Persistence tracker: condition held > 45 minutes → escalate to "review" instead of paging on a one-bar spike.

    Resume-aligned: persistence-aware severity and deterministic tests first; the ML anomaly detector runs as an optional shadow only.

    The challenge

    Models fail silently: latency looks fine while predictions quietly rot. Most teams add dashboards; fewer build decision-ready signals with clear thresholds and actions.

    The challenge is to detect degradation without requiring instant labels—while avoiding alert fatigue.

    Approach
    1. Windowing: Fixed and staggered windows for robust comparison against training/reference slices.

    2. Test battery: Per-feature PSI/KS with Holm-Bonferroni style control across large feature sets.

    3. Persistence scoring: Short spikes downgrade severity; sustained shifts escalate.

    4. Playbooks: Map severities to concrete actions (notify, shadow, rollback candidate) with audit logs.
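
Step 2's Holm-Bonferroni-style control across many per-feature tests can be sketched as the classic step-down procedure; `holm_reject` is an illustrative helper name, not claimed from the codebase:

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down: which hypotheses to reject at family-wise level alpha.

    Sort p-values ascending; compare the i-th smallest against
    alpha / (m - i); stop at the first failure.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Holm stops at the first non-rejection
    return reject

# Per-feature KS p-values for four monitored features.
print(holm_reject([0.004, 0.03, 0.20, 0.01]))  # → [True, False, False, True]
```

Plain Bonferroni at alpha/m would be more conservative; the step-down variant keeps family-wise error control while rejecting more true shifts, which matters when every model contributes dozens of feature tests.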

    System architecture
    Live predictions
    Windowing
    Stat tests
    Persistence
    Playbooks
    Alerts / hooks
    Input
    Process
    Model
    Storage
    Output
    Failure modes
    1. Seasonality without seasonal baselines causes false positives; seasonal baseline templates are required.

    2. Delayed labels create blind spots; confidence/entropy proxies are used with explicit caveats.

    3. High-cardinality features explode test counts; feature grouping and hierarchical alerts keep them tractable.

    Trade-offs
    1. Tuned for high recall on drift at the cost of more triage work; better than missing silent rot.

    2. Fixed windows sacrifice agility for interpretability.

    3. Statistical tests assume representative samples; biased sampling breaks conclusions.

    Implementation details

    drift_monitor.py

    PSI + persistence gate

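
A minimal sketch of the PSI + persistence gate, using the thresholds stated in the write-up (PSI > 0.2, escalate after 45 minutes, 5-minute windows); `PersistenceGate` and its fields are illustrative names, not the actual drift_monitor.py:

```python
from dataclasses import dataclass, field

@dataclass
class PersistenceGate:
    """Escalate only when a drift condition has held long enough.

    Short spikes stay at "watch"; a condition sustained past
    hold_minutes escalates to "review" and routes to a playbook.
    """
    psi_threshold: float = 0.2
    hold_minutes: int = 45
    window_minutes: int = 5        # evaluation cadence
    _held: int = field(default=0)  # consecutive minutes above threshold

    def update(self, psi_value: float) -> str:
        if psi_value > self.psi_threshold:
            self._held += self.window_minutes
        else:
            self._held = 0         # spike ended; reset persistence
        if self._held == 0:
            return "ok"
        if self._held > self.hold_minutes:
            return "review"        # sustained shift: escalate
        return "watch"             # one-bar spike: log, don't page

gate = PersistenceGate()
readings = [0.24] * 10 + [0.15]    # ten hot windows, then recovery
states = [gate.update(r) for r in readings]
print(states)  # nine "watch", then "review" once held > 45 min, then "ok"
```

Because severity is a pure function of declared thresholds and elapsed time, every escalation in the audit log can be replayed and explained.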
    Ownership

    Designed

    Defined severity ladders, persistence rules, and playbook mapping for model families.

    Implemented

    Built the statistical test suite, Prometheus exporters, and Grafana dashboards.

    Scrapped

    Opaque ML-based anomaly detector as primary signal—kept only as optional shadow.