Monitoring Systems · 2024 · Research
    Deterministic Monitoring

    DriftScope

    Persistence-aware drift scoring without ML magic

    Python · Pandas · Evidently AI · Prometheus · Grafana · FastAPI
    0.92
    Drift sensitivity (recall)
    0.07
    False alarm rate
    5 min
    Time to first alert
    14
    Models monitored

    DriftScope is a deterministic monitoring system for ML services: it compares live windows to baselines using PSI, KS, and related tests, tracks persistence (how long a shift lasts), and routes severities to playbooks. There is no "black-box drift score"—only declared statistics, thresholds, and recorded evidence.
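
As a concrete illustration of the "declared statistics, not black-box scores" stance, here is a minimal PSI sketch. The helper name `psi` and the quantile-binning choice are illustrative assumptions, not the project's actual code:

```python
import numpy as np

def psi(ref, live, bins=10):
    """Population Stability Index between a reference and a live sample.

    Bin edges come from the reference distribution's quantiles so each
    reference bin carries roughly equal mass.
    """
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    ref_counts, _ = np.histogram(ref, bins=edges)
    # Clip live values into the reference range so nothing falls outside.
    live_counts, _ = np.histogram(np.clip(live, edges[0], edges[-1]), bins=edges)
    eps = 1e-6  # avoid log(0) on empty bins
    p = np.clip(ref_counts / ref_counts.sum(), eps, None)
    q = np.clip(live_counts / live_counts.sum(), eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.41, 0.1, 10_000)
shifted = rng.normal(0.39, 0.1, 10_000)
print(psi(baseline, baseline[:5000]))  # sampling noise only: near 0
print(psi(baseline, shifted))          # small mean shift: clearly larger
```

Because every threshold is a named constant on a named statistic, an engineer paged at 3am can recompute the same number by hand.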

    The system acknowledges delayed labels: it optimizes for early, explainable signals (distribution shift, confidence entropy) while joining ground truth when it finally arrives.
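
One label-free proxy the text names is confidence entropy. A minimal sketch, assuming per-row class-probability vectors; `mean_prediction_entropy` is an illustrative name:

```python
import numpy as np

def mean_prediction_entropy(probs):
    """Mean Shannon entropy (nats) of per-row class-probability vectors.

    A sustained change versus the reference window is a label-free
    early-warning signal, not proof of degradation on its own.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

# Illustrative windows: confident predictions carry less entropy.
uncertain = [[0.5, 0.5], [0.6, 0.4]]
confident = [[0.95, 0.05], [0.99, 0.01]]
print(mean_prediction_entropy(uncertain))  # close to ln(2) ≈ 0.693
print(mean_prediction_entropy(confident))  # much lower
```

When ground truth arrives, the recorded entropy series can be joined against realized error to validate (or caveat) the proxy.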

    Why it matters: teams need monitoring that engineers trust when they wake up at 3am.

    Inspectable proof

    From resume to something you can read

    Side-by-side windows for one feature, using the same statistic names called out on the resume (PSI, JS divergence, entropy). Numbers are illustrative of how the engine reads.

    Reference window (A) vs live window (B) — model score

    Statistic        Ref (A)   Live (B)   Notes
    Mean             0.412     0.389      shifted lower
    Entropy (nats)   0.93      0.88       confidence mass tightening

    Comparison (A vs B)   Value    Notes
    PSI                   0.24     > 0.2 threshold
    JS divergence         0.081    non-trivial move
    KS p-value            0.004    reject same-dist hypothesis

    How DriftScope would read this

    PSI + KS agree → distribution shift is real, not noise.
    Persistence tracker: condition held > 45 minutes → escalate to "review" instead of paging on a one-bar spike.

    Resume-aligned: persistence-aware severity and deterministic tests first; the ML anomaly detector runs as an optional shadow only.

    The challenge

    Models fail silently: latency looks fine while predictions quietly rot. Most teams add dashboards; fewer build decision-ready signals with clear thresholds and actions.

    The challenge is to detect degradation without requiring instant labels—while avoiding alert fatigue.

    Approach
    1. Windowing: Fixed and staggered windows for robust comparison against training/reference slices.

    2. Test battery: Per-feature PSI/KS with Holm-Bonferroni style control across large feature sets.

    3. Persistence scoring: Short spikes downgrade severity; sustained shifts escalate.

    4. Playbooks: Map severities to concrete actions (notify, shadow, rollback candidate) with audit logs.
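
Step 2's Holm-Bonferroni-style control across many per-feature tests can be sketched as the classic step-down procedure; `holm_reject` is an illustrative helper name, not claimed from the codebase:

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down: which hypotheses to reject at family-wise level alpha.

    Sort p-values ascending; compare the i-th smallest against
    alpha / (m - i); stop at the first failure.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Holm stops at the first non-rejection
    return reject

# Per-feature KS p-values for four monitored features.
print(holm_reject([0.004, 0.03, 0.20, 0.01]))  # → [True, False, False, True]
```

Plain Bonferroni at alpha/m would be more conservative; the step-down variant keeps family-wise error control while rejecting more true shifts, which matters when every model contributes dozens of feature tests.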

    System architecture
    Live predictions
    Windowing
    Stat tests
    Persistence
    Playbooks
    Alerts / hooks
    Input
    Process
    Model
    Storage
    Output
    Failure modes
    1. Seasonality without seasonal baselines causes false positives; seasonal baseline templates are required.

    2. Delayed labels create blind spots; confidence/entropy proxies are used with explicit caveats.

    3. High-cardinality features explode test counts; feature grouping and hierarchical alerts keep them tractable.

    Trade-offs
    1. Tuned for high recall on drift at the cost of more triage work; better than missing silent rot.

    2. Fixed windows sacrifice agility for interpretability.

    3. Statistical tests assume representative samples; biased sampling breaks conclusions.

    Implementation details

    drift_monitor.py

    PSI + persistence gate

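
A minimal sketch of the PSI + persistence gate, using the thresholds stated in the write-up (PSI > 0.2, escalate after 45 minutes, 5-minute windows); `PersistenceGate` and its fields are illustrative names, not the actual drift_monitor.py:

```python
from dataclasses import dataclass, field

@dataclass
class PersistenceGate:
    """Escalate only when a drift condition has held long enough.

    Short spikes stay at "watch"; a condition sustained past
    hold_minutes escalates to "review" and routes to a playbook.
    """
    psi_threshold: float = 0.2
    hold_minutes: int = 45
    window_minutes: int = 5        # evaluation cadence
    _held: int = field(default=0)  # consecutive minutes above threshold

    def update(self, psi_value: float) -> str:
        if psi_value > self.psi_threshold:
            self._held += self.window_minutes
        else:
            self._held = 0         # spike ended; reset persistence
        if self._held == 0:
            return "ok"
        if self._held > self.hold_minutes:
            return "review"        # sustained shift: escalate
        return "watch"             # one-bar spike: log, don't page

gate = PersistenceGate()
readings = [0.24] * 10 + [0.15]    # ten hot windows, then recovery
states = [gate.update(r) for r in readings]
print(states)  # nine "watch", then "review" once held > 45 min, then "ok"
```

Because severity is a pure function of declared thresholds and elapsed time, every escalation in the audit log can be replayed and explained.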
    Ownership

    Designed

    Defined severity ladders, persistence rules, and playbook mapping for model families.

    Implemented

    Built the statistical test suite, Prometheus exporters, and Grafana dashboards.

    Scrapped

    Opaque ML-based anomaly detector as primary signal—kept only as optional shadow.