Causal Systems · 2026 · Built
    Causal AI System

    TraceAI

    Walks pipelines backward to find what actually caused a failure

    Python · FastAPI · PostgreSQL · Graph modeling · TypeScript SDK · OpenTelemetry-compatible hooks

    O(V+E) · Worst-case analysis
    0.82 · Root-cause confidence (demo)
    <200 hops · Typical graph depth
    Stable · Ranked output across 12 DAG replays

    TraceAI treats every pipeline as a directed execution graph: sources, transforms, joins, and sinks are nodes; data dependencies are edges. When an outcome is wrong or a metric shifts, the engine does not "search logs"—it scores propagation paths and attributes blame along deterministic structure.

    The system is built for teams where black-box models sit inside larger systems: the goal is explainable attribution with bounded complexity (graph traversal + calibrated scoring), not another chat interface for debugging.

    Core ideas:

    • Execution snapshots capture lineage IDs and schema hashes at each step so drift can be localized to a subgraph.
    • Causal scoring ranks candidate root causes using structural evidence (upstream deltas, fan-in/out, retry boundaries) rather than keyword frequency.
    • Human-readable narratives are generated from the graph walk, not from free-form LLM speculation—language follows structure.
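    The snapshot idea in the first bullet can be made concrete. A minimal sketch, assuming a schema is just a column-name-to-type mapping; the helper name and hash truncation here are hypothetical:

```python
import hashlib
import json

def schema_hash(schema: dict[str, str]) -> str:
    """Order-insensitive hash of a column-name -> type mapping."""
    canonical = json.dumps(sorted(schema.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Identical schemas hash identically regardless of key order...
a = schema_hash({"user_id": "int", "score": "float"})
b = schema_hash({"score": "float", "user_id": "int"})

# ...while any drift (added, renamed, or retyped column) moves the hash,
# so drift localizes to the first step whose recorded hash changed.
c = schema_hash({"user_id": "int", "score": "str"})
```

    Comparing stored hashes step by step along the graph narrows drift to a subgraph without inspecting row data.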
    Inspectable proof

    From resume to something you can read

    Graph, ranked hypotheses, a before/after on the ranker, and a postmortem-style timeline. Numeric scores are fixtures; the regression class is real—undamped merge aggregation let a genuine secondary delta (schema hash on join) outrank the true driver (enrich_features) until edge typing and fan-in damping shipped.

    Example execution graph (sink = bad outcome)

    sources ──► ingest ──┬──► enrich_features ──┬──► score_model ──► publish_api
                         └──► join_reference ───┘
    
    Edges carry: step_id, schema_hash, volume_ema, error_rate.
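    The edge payload listed above could be modeled as a small record type. Field names follow the text; the dataclass shape and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """One directed dependency in the execution graph.

    volume_ema is an exponential moving average of row volume;
    error_rate is a 0..1 failure ratio observed on this hop.
    """
    src: str
    dst: str
    step_id: str
    schema_hash: str
    volume_ema: float
    error_rate: float

e = Edge("ingest", "enrich_features", "step-07", "a1b2c3d4e5f6", 10_500.0, 0.002)
```

    A frozen dataclass is hashable, so edges can key sets and dictionaries during traversal.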

    Same snapshot + same graph → same ranked causes (deterministic scorer over structure + deltas).

    Ranked root-cause hypotheses (current scorer)

    1. enrich_features     0.84   null spike + fan-in into merge (weighted)
    2. join_reference       0.44   schema hash delta (down-ranked after merge fix)
    3. score_model          0.22   local deltas healthy

    Matches resume line: typed-edge inference with confidence-style ranking—not log grep or LLM prose.

    Before vs after — ranking bug during engine bring-up

    BEFORE (v1: naive sum at merge, uniform edge weights)
      1. join_reference       0.79   ← wrong #1: hash delta overstated
      2. enrich_features      0.61
      3. score_model          0.18
    
    AFTER (v2: typed edge weights + fan-in damping at merge)
      1. enrich_features      0.83   ← ground truth in fixture replay
      2. join_reference       0.44
      3. score_model          0.22
    
    CHANGE
      • edge_weight(u→v) from edge kind (ingest vs join vs model)
      • merge: damped sum so double-counted paths do not dominate
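    The merge change can be illustrated numerically. A minimal sketch, assuming suspicion scores in [0, 1]; the 0.5 damping coefficient is hypothetical, chosen only to show the effect:

```python
def merge_naive(parent_scores: list[float]) -> float:
    # v1 behavior: plain sum, so correlated parents double-count shared ancestry
    return sum(parent_scores)

def merge_damped(parent_scores: list[float]) -> float:
    # v2 idea: the strongest parent counts fully, the rest are damped,
    # so a crowd of secondary deltas cannot outrank one true driver.
    if not parent_scores:
        return 0.0
    s = sorted(parent_scores, reverse=True)
    return s[0] + 0.5 * sum(s[1:])

# Three correlated mid-strength parents vs one strong 0.9 driver:
crowd = [0.5, 0.5, 0.4]
naive_crowd = merge_naive(crowd)    # overtakes the single driver
damped_crowd = merge_damped(crowd)  # stays comparable to it
```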

    Shows iteration depth: misattribution was structural (merge math), not bad data—the kind of bug staff engineers expect you to surface.

    Failure explanation the engine attaches (after fix)

    Primary path: publish_api regression likely introduced at enrich_features.
    Evidence: null_rate jumped vs baseline; suspicion propagated
    ingest → enrich_features → score_model with damping at the merge.

    Narrative stays template-bound to the subgraph so reviewers can argue with it.

    Postmortem timeline (bring-up style)

    T+0   Golden fixture replay: ranker pointed at join_reference; operator intuition said enrich.
    T+4h  Diffed propagation: merge at score_model was summing undamped parent scores.
    T+1d  Shipped v2: typed coefficients + fan-in damping + regression tests on 12 DAGs.
    T+2d  Replay stable: same fixture → same top hypothesis; narrative templates updated.
    
    What hurt: the wrong ranking looked "correct" because join had a real schema hash delta—the graph math lied until merge behavior was explicit.

    Compressed timeline of how this class of bug gets debugged: wrong rank → structural cause → fix → replay proof—not a customer war story, but how the ranker was hardened.

    The challenge

    Traditional observability optimizes for availability and latency, not correctness of inference. When a KPI breaks, teams grep logs, stare at dashboards, or ask an LLM to "explain"—none of which guarantees a structural explanation tied to how data actually moved.

    The challenge is to build attribution that is as deterministic as the system itself: same inputs, same graph walk, same ranked hypotheses—so engineering can act with confidence.

    Approach
    1. Graph construction: Instrument pipelines to emit stable node IDs, versioned transforms, and edge contracts (schema + semantics). Persist a queryable graph store—not just unstructured traces.

    2. Delta detection: Compare expected vs observed signals per node (volume, null rate, distribution shift proxies) and propagate "suspicion scores" along edges with monotonic damping at merge points.

    3. Root-cause ranking: Combine structural centrality (bottlenecks, single points of failure) with empirical deltas to produce a short ranked list—never a single opaque score.

    4. Narrative synthesis: Map the winning subgraph to a fixed template (path, likely failure class, suggested validation query) so explanations remain auditable.
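    Step 4 amounts to a fill-in-the-blanks template over the winning path. A minimal sketch; the template wording and function name are hypothetical, the point is that no free-form generation happens:

```python
def render_narrative(path: list[str], failure_class: str, check_sql: str) -> str:
    """Render a fixed, auditable template from the winning subgraph."""
    return (
        f"Primary path: regression likely introduced at {path[0]}.\n"
        f"Propagation: {' -> '.join(path)}.\n"
        f"Likely failure class: {failure_class}.\n"
        f"Suggested validation: {check_sql}"
    )

msg = render_narrative(
    ["enrich_features", "score_model", "publish_api"],
    "null-rate spike",
    "SELECT count(*) FROM enriched WHERE feature IS NULL;",
)
```

    Because every sentence maps to a slot, a reviewer can dispute the graph walk itself rather than argue with generated prose.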

    System architecture

    Pipeline stages (diagram): Sources → Ingest & validate → Transforms → Model / rules → Aggregate → Sink / API, layered as Input, Process, Model, Storage, Output.
    Failure modes

    • Incomplete instrumentation makes the graph sparse; attribution degrades to coarse buckets until edges exist.
    • Highly dynamic DAGs (per-request topology) require snapshotting; stale graphs mis-rank causes.
    • Correlated upstream failures can dominate a single path—mitigated with fan-in penalties and multi-path reporting.

    Trade-offs

    • Chose explainable graph math over deep learning attribution—better auditability, less expressive on unstructured text.
    • Requires upfront engineering discipline (IDs, contracts); the payoff is shorter downstream incident time.
    • Narratives are template-bound by design—less 'magical,' more reliable.

    Implementation details

    causal_walk.py

    Rank candidate root nodes from a delta vector and adjacency

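    The body of this snippet did not survive export. A minimal sketch of what the description promises (ranking candidate roots from a per-node delta vector and a child-adjacency map); all names and the blast-radius coefficient are hypothetical:

```python
def rank_roots(
    deltas: dict[str, float],
    children: dict[str, list[str]],
) -> list[tuple[str, float]]:
    """Rank candidate root causes: own delta, weighted by downstream reach."""

    def reach(node: str) -> set[str]:
        # distinct downstream nodes: the blast radius of a fault here
        seen: set[str] = set()
        stack = list(children.get(node, []))
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(children.get(n, []))
        return seen

    scores = {n: d * (1.0 + 0.25 * len(reach(n))) for n, d in deltas.items()}
    # sort by score descending, then name, so equal inputs give equal output
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

children = {
    "ingest": ["enrich_features", "join_reference"],
    "enrich_features": ["score_model"],
    "join_reference": ["score_model"],
    "score_model": ["publish_api"],
}
deltas = {"enrich_features": 0.6, "join_reference": 0.3, "score_model": 0.1}
ranked = rank_roots(deltas, children)
```

    Same deltas plus same adjacency always yield the same ranked list, which is the determinism claim the replay fixtures exercise.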

    sdk.ts

    Client: attach stable lineage to outbound spans

    export function withLineage<T>(
      ctx: TraceCtx,
      stepId: string,
      fn: () => Promise<T>,
    ): Promise<T> {
      const next = ctx.fork({
        stepId,
        schemaHash: hashSchema(ctx.outputSchema),
      });
      return traceActive.run(next, fn);
    }
    Ownership

    Designed

    Designed how the graph gets built, how blame gets attributed across nodes, and how instrumentation hooks attach to pipeline steps.

    Implemented

    Built the ranking engine, persistence layer, and TypeScript helpers for attaching stable step IDs to pipeline runs.

    Scrapped

    An LLM-first 'explain the logs' path—discarded because it could not produce the same answer twice given the same inputs.