Causal Systems · 2026 · Built
    Causal AI System

    TraceAI

    Walks pipelines backward to find what actually caused a failure

    Python · FastAPI · PostgreSQL · Graph modeling · TypeScript SDK · OpenTelemetry-compatible hooks

    O(V+E) · Worst-case analysis
    0.82 · Root-cause confidence (demo)
    <200 hops · Typical graph depth
    Stable · Ranked output across 12 DAG replays

    TraceAI treats every pipeline as a directed execution graph: sources, transforms, joins, and sinks are nodes; data dependencies are edges. When an outcome is wrong or a metric shifts, the engine does not "search logs"—it scores propagation paths and attributes blame along deterministic structure.

    The system is built for teams where black-box models sit inside larger systems: the goal is explainable attribution with bounded complexity (graph traversal + calibrated scoring), not another chat interface for debugging.

    Core ideas:

    • Execution snapshots capture lineage IDs and schema hashes at each step so drift can be localized to a subgraph.
    • Causal scoring ranks candidate root causes using structural evidence (upstream deltas, fan-in/out, retry boundaries) rather than keyword frequency.
    • Human-readable narratives are generated from the graph walk, not from free-form LLM speculation—language follows structure.
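    The snapshot idea in the first bullet can be made concrete. A minimal sketch, assuming a schema is just a column-name-to-type mapping; the helper name and hash truncation here are hypothetical:

```python
import hashlib
import json

def schema_hash(schema: dict[str, str]) -> str:
    """Order-insensitive hash of a column-name -> type mapping."""
    canonical = json.dumps(sorted(schema.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Identical schemas hash identically regardless of key order...
a = schema_hash({"user_id": "int", "score": "float"})
b = schema_hash({"score": "float", "user_id": "int"})

# ...while any drift (added, renamed, or retyped column) moves the hash,
# so drift localizes to the first step whose recorded hash changed.
c = schema_hash({"user_id": "int", "score": "str"})
```

    Comparing stored hashes step by step along the graph narrows drift to a subgraph without inspecting row data.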
    Inspectable proof

    From resume to something you can read

    Graph, ranked hypotheses, a before/after on the ranker, and a postmortem-style timeline. Numeric scores are fixtures; the regression class is real—undamped merge aggregation let a genuine secondary delta (schema hash on join) outrank the true driver (enrich_features) until edge typing and fan-in damping shipped.

    Example execution graph (sink = bad outcome)

    sources ──► ingest ──┬──► enrich_features ──┬──► score_model ──► publish_api
                         └──► join_reference ───┘
    
    Edges carry: step_id, schema_hash, volume_ema, error_rate.
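    The edge payload listed above could be modeled as a small record type. Field names follow the text; the dataclass shape and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """One directed dependency in the execution graph.

    volume_ema is an exponential moving average of row volume;
    error_rate is a 0..1 failure ratio observed on this hop.
    """
    src: str
    dst: str
    step_id: str
    schema_hash: str
    volume_ema: float
    error_rate: float

e = Edge("ingest", "enrich_features", "step-07", "a1b2c3d4e5f6", 10_500.0, 0.002)
```

    A frozen dataclass is hashable, so edges can key sets and dictionaries during traversal.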

    Same snapshot + same graph → same ranked causes (deterministic scorer over structure + deltas).

    Ranked root-cause hypotheses (current scorer)

    1. enrich_features     0.84   null spike + fan-in into merge (weighted)
    2. join_reference       0.44   schema hash delta (down-ranked after merge fix)
    3. score_model          0.22   local deltas healthy

    Matches resume line: typed-edge inference with confidence-style ranking—not log grep or LLM prose.

    Before vs after — ranking bug during engine bring-up

    BEFORE (v1: naive sum at merge, uniform edge weights)
      1. join_reference       0.79   ← wrong #1: hash delta overstated
      2. enrich_features      0.61
      3. score_model          0.18
    
    AFTER (v2: typed edge weights + fan-in damping at merge)
      1. enrich_features      0.83   ← ground truth in fixture replay
      2. join_reference       0.44
      3. score_model          0.22
    
    CHANGE
      • edge_weight(u→v) from edge kind (ingest vs join vs model)
      • merge: damped sum so double-counted paths do not dominate
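    The merge change can be illustrated numerically. A minimal sketch, assuming suspicion scores in [0, 1]; the 0.5 damping coefficient is hypothetical, chosen only to show the effect:

```python
def merge_naive(parent_scores: list[float]) -> float:
    # v1 behavior: plain sum, so correlated parents double-count shared ancestry
    return sum(parent_scores)

def merge_damped(parent_scores: list[float]) -> float:
    # v2 idea: the strongest parent counts fully, the rest are damped,
    # so a crowd of secondary deltas cannot outrank one true driver.
    if not parent_scores:
        return 0.0
    s = sorted(parent_scores, reverse=True)
    return s[0] + 0.5 * sum(s[1:])

# Three correlated mid-strength parents vs one strong 0.9 driver:
crowd = [0.5, 0.5, 0.4]
naive_crowd = merge_naive(crowd)    # overtakes the single driver
damped_crowd = merge_damped(crowd)  # stays comparable to it
```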

    Shows iteration depth: misattribution was structural (merge math), not bad data—the kind of bug staff engineers expect you to surface.

    Failure explanation the engine attaches (after fix)

    Primary path: publish_api regression likely introduced at enrich_features.
    Evidence: null_rate jumped vs baseline; suspicion propagated
    ingest → enrich_features → score_model with damping at the merge.

    Narrative stays template-bound to the subgraph so reviewers can argue with it.

    Postmortem timeline (bring-up style)

    T+0   Golden fixture replay: ranker pointed at join_reference; operator intuition said enrich.
    T+4h  Diffed propagation: merge at score_model was summing undamped parent scores.
    T+1d  Shipped v2: typed coefficients + fan-in damping + regression tests on 12 DAGs.
    T+2d  Replay stable: same fixture → same top hypothesis; narrative templates updated.
    
    What hurt: the wrong ranking looked "correct" because join had a real schema hash delta—the graph math lied until merge behavior was explicit.

    Compressed timeline of how this class of bug gets debugged: wrong rank → structural cause → fix → replay proof—not a customer war story, but how the ranker was hardened.

    The challenge

    Traditional observability optimizes for availability and latency, not correctness of inference. When a KPI breaks, teams grep logs, stare at dashboards, or ask an LLM to "explain"—none of which guarantees a structural explanation tied to how data actually moved.

    The challenge is to build attribution that is as deterministic as the system itself: same inputs, same graph walk, same ranked hypotheses—so engineering can act with confidence.

    Approach
    1. Graph construction: Instrument pipelines to emit stable node IDs, versioned transforms, and edge contracts (schema + semantics). Persist a queryable graph store—not just unstructured traces.

    2. Delta detection: Compare expected vs observed signals per node (volume, null rate, distribution shift proxies) and propagate "suspicion scores" along edges with monotonic damping at merge points.

    3. Root-cause ranking: Combine structural centrality (bottlenecks, single points of failure) with empirical deltas to produce a short ranked list—never a single opaque score.

    4. Narrative synthesis: Map the winning subgraph to a fixed template (path, likely failure class, suggested validation query) so explanations remain auditable.
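    Step 4 amounts to a fill-in-the-blanks template over the winning path. A minimal sketch; the template wording and function name are hypothetical, the point is that no free-form generation happens:

```python
def render_narrative(path: list[str], failure_class: str, check_sql: str) -> str:
    """Render a fixed, auditable template from the winning subgraph."""
    return (
        f"Primary path: regression likely introduced at {path[0]}.\n"
        f"Propagation: {' -> '.join(path)}.\n"
        f"Likely failure class: {failure_class}.\n"
        f"Suggested validation: {check_sql}"
    )

msg = render_narrative(
    ["enrich_features", "score_model", "publish_api"],
    "null-rate spike",
    "SELECT count(*) FROM enriched WHERE feature IS NULL;",
)
```

    Because every sentence maps to a slot, a reviewer can dispute the graph walk itself rather than argue with generated prose.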

    System architecture

    Pipeline stages (diagram): Sources → Ingest & validate → Transforms → Model / rules → Aggregate → Sink / API, layered as Input, Process, Model, Storage, Output.
    Failure modes

    • Incomplete instrumentation makes the graph sparse; attribution degrades to coarse buckets until edges exist.
    • Highly dynamic DAGs (per-request topology) require snapshotting; stale graphs mis-rank causes.
    • Correlated upstream failures can dominate a single path—mitigated with fan-in penalties and multi-path reporting.

    Trade-offs

    • Chose explainable graph math over deep learning attribution—better auditability, less expressive on unstructured text.
    • Requires upfront engineering discipline (IDs, contracts); the payoff is shorter downstream incident time.
    • Narratives are template-bound by design—less 'magical,' more reliable.

    Implementation details

    causal_walk.py

    Rank candidate root nodes from a delta vector and adjacency

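    The body of this snippet did not survive export. A minimal sketch of what the description promises (ranking candidate roots from a per-node delta vector and a child-adjacency map); all names and the blast-radius coefficient are hypothetical:

```python
def rank_roots(
    deltas: dict[str, float],
    children: dict[str, list[str]],
) -> list[tuple[str, float]]:
    """Rank candidate root causes: own delta, weighted by downstream reach."""

    def reach(node: str) -> set[str]:
        # distinct downstream nodes: the blast radius of a fault here
        seen: set[str] = set()
        stack = list(children.get(node, []))
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(children.get(n, []))
        return seen

    scores = {n: d * (1.0 + 0.25 * len(reach(n))) for n, d in deltas.items()}
    # sort by score descending, then name, so equal inputs give equal output
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

children = {
    "ingest": ["enrich_features", "join_reference"],
    "enrich_features": ["score_model"],
    "join_reference": ["score_model"],
    "score_model": ["publish_api"],
}
deltas = {"enrich_features": 0.6, "join_reference": 0.3, "score_model": 0.1}
ranked = rank_roots(deltas, children)
```

    Same deltas plus same adjacency always yield the same ranked list, which is the determinism claim the replay fixtures exercise.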

    sdk.ts

    Client: attach stable lineage to outbound spans

    export function withLineage<T>(
      ctx: TraceCtx,
      stepId: string,
      fn: () => Promise<T>,
    ): Promise<T> {
      const next = ctx.fork({
        stepId,
        schemaHash: hashSchema(ctx.outputSchema),
      });
      return traceActive.run(next, fn);
    }
    Ownership

    Designed

    Designed how the graph gets built, how blame gets attributed across nodes, and how instrumentation hooks attach to pipeline steps.

    Implemented

    Built the ranking engine, persistence layer, and TypeScript helpers for attaching stable step IDs to pipeline runs.

    Scrapped

    An LLM-first 'explain the logs' path—discarded because it could not produce the same answer twice given the same inputs.