Fullstory | Summer 2024

Python | SQL | Kafka | Airflow | dbt | Docker

Analytics observability systems for pipelines too large to debug by hand.

The core problem was catching bad analytics data before it reached dashboards or downstream consumers, while keeping alerting precise enough for teams to trust. The work centered on validation close to the warehouse and scheduling layers, plus incident context that made failures faster to debug.

Key Outcomes

Headline numbers from the observability system.

  • Precision: 97% alert precision after the validation redesign.
  • Detection: 6 min median time-to-detection after workflow improvements.
  • Scale: 6.4B+ weekly product events monitored across analytics pipelines.

Project Breakdown

Problem, method, system, validation, results, reliability, and research value.

Problem

Silent data regressions were reaching dashboards long before they were detected.

  • Analytics breakage was expensive because, by the time it surfaced, downstream teams were already working from bad data.
  • The system had to detect regressions early without overwhelming people with noise.

Method

Monitoring was framed as statistical inference over a moving data-generating process.

  • Separated benign sources of variation (seasonality, instrumentation changes, product launches, upstream lag, bot traffic) from genuine pipeline failures.
  • Modeled validation around schema evolution, event freshness, transformation replay, metric discontinuities, cardinality explosions, join drift, null spikes, and deploy-correlated anomalies.
  • Used adaptive thresholds and ownership-aware routing so coverage stayed broad without the alerting becoming noisy (a threshold sketch follows this list).
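
A minimal sketch of the adaptive-threshold idea: a robust z-score measured against a same-slot seasonal baseline. Window sizes, the cutoff, and function names are illustrative assumptions, not the production implementation.

```python
import numpy as np

def robust_zscore(history: np.ndarray, value: float) -> float:
    """Robust z-score: median/MAD instead of mean/std, so a few past
    outliers do not inflate the threshold."""
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    if mad == 0:
        return 0.0
    return (value - median) / (1.4826 * mad)  # 1.4826 makes MAD comparable to a std dev

def is_anomalous(series: list[float], value: float,
                 season: int = 7, threshold: float = 4.0) -> bool:
    """Compare the newest value against the same seasonal slot in past
    periods (the same weekday, for daily data), so ordinary weekly
    seasonality is not flagged as a regression."""
    history = np.asarray(series)
    baseline = history[len(history) % season :: season]  # same slot as `value`
    return abs(robust_zscore(baseline, value)) > threshold
```

Median/MAD matters here: a single past incident left in the history would widen a mean/std threshold and mask the next incident.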

System / Stack

Validation focused on the failure modes analytics pipelines actually hit.

  • Used Python, SQL, Kafka, Airflow, dbt, PostgreSQL, Docker, GitHub Actions, Grafana, and OpenTelemetry.
  • Built validators for schema drift, freshness lag, broken joins, replay mismatches, null spikes, cardinality explosions, and deploy-correlated anomaly bursts.
  • Attached affected tables, lag windows, replay deltas, schema changes, query rewrites, upstream delays, ingestion partitions, and ownership metadata to alert context (see the payload sketch after this list).
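
A sketch of what "alert context" means in practice, with hypothetical field names rather than the production schema: the evidence rides along with the alert, and routing is derived from ownership metadata instead of a static on-call list.

```python
from dataclasses import dataclass, field

@dataclass
class AlertContext:
    """Evidence attached to every alert so triage starts from facts,
    not from queries. Field names are illustrative."""
    affected_tables: list[str]
    lag_window_minutes: int | None = None
    replay_delta_rows: int | None = None
    schema_changes: list[str] = field(default_factory=list)
    recent_deploys: list[str] = field(default_factory=list)
    owning_team: str | None = None

def route(alert: AlertContext) -> str:
    """Ownership-aware routing: page the owning team's channel, and fall
    back to a shared triage channel instead of paging everyone."""
    return f"#{alert.owning_team}-alerts" if alert.owning_team else "#data-triage"
```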

Validation Methodology

Layered statistical checks handled drift, integrity, and replay.

  • Combined robust z-scores, seasonal baselines, Kolmogorov-Smirnov drift tests, population-stability indexes, column-level entropy checks, foreign-key integrity tests, and lineage-neighbor correlation checks (KS and PSI are sketched after this list).
  • Re-executed historical transformations under pinned inputs and compared outputs against current logic to catch non-idempotent code paths, hidden dependencies, and semantic changes; a schematic replay harness also follows below.
  • Made incident context part of the alert itself so triage started with usable evidence.
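
Two of the drift checks above are compact enough to sketch directly: a two-sample Kolmogorov-Smirnov test (via scipy) and a population-stability index over shared bins. The bin count and alpha are illustrative defaults, not the tuned production values.

```python
import numpy as np
from scipy import stats

def ks_drift(reference: np.ndarray, current: np.ndarray,
             alpha: float = 0.01) -> bool:
    """Two-sample KS test: flags a distribution shift in a numeric
    column between a reference window and the current window."""
    _, p_value = stats.ks_2samp(reference, current)
    return p_value < alpha

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population-stability index over bins fitted on the reference
    window; values above ~0.2 are a common rule of thumb for drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) on empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```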
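
The replay check reduces to a small harness: pin the historical inputs, re-run the transformation, and diff the result against the output recorded at the time. The callable, the `event_id` key column, and the pandas-based diff are schematic assumptions, not the actual harness.

```python
import pandas as pd

def replay_check(transform, pinned_inputs: dict[str, pd.DataFrame],
                 recorded_output: pd.DataFrame) -> pd.DataFrame:
    """Re-run `transform` on pinned inputs and diff against the recorded
    output. A non-empty diff means a non-idempotent code path, a hidden
    dependency, or a semantic change in the logic."""
    replayed = transform(**pinned_inputs)
    # Align on the primary key so ordering differences are not reported
    # as deltas ("event_id" is a placeholder key column).
    replayed = replayed.sort_values("event_id").reset_index(drop=True)
    recorded = recorded_output.sort_values("event_id").reset_index(drop=True)
    return replayed.compare(recorded)  # empty frame == clean replay
```

Note that pandas' `compare` raises on mismatched shapes, which is itself a replay failure worth surfacing rather than swallowing.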

Results

Precision improved because the system was structured around real debugging work.

  • Alert precision improved from 61% to 97%.
  • False-positive pages fell by 82%.
  • Median time-to-detection dropped from 3.7 hours to 6 minutes.
  • Median debugging time dropped from 2.8 hours to 19 minutes.

Failure Modes / Reliability Checks

The checks targeted the ways analytics systems quietly lie.

  • Covered schema evolution, event freshness, transformation replay, metric discontinuities, cardinality explosions, join drift, null spikes, deploy-correlated anomalies, hidden dependencies, and accidental semantic changes (the null-spike check is sketched below as a representative example).
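
One of the simpler checks in this list, the null-spike validator, illustrates the general shape: compare a freshly computed rate against a trailing baseline. The DuckDB/Postgres-style connection, the table and column parameters, and the `loaded_at` column are assumptions for the sketch.

```python
def null_spike(conn, table: str, column: str, baseline_rate: float,
               tolerance: float = 0.05) -> bool:
    """Flag a jump in a column's null rate over its trailing baseline,
    a common signature of an upstream or instrumentation change.
    Identifiers are interpolated for brevity; real code should validate
    them against an allowlist."""
    query = f"""
        SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END)
        FROM {table}
        WHERE loaded_at >= NOW() - INTERVAL '1 day'
    """
    current_rate = conn.execute(query).fetchone()[0] or 0.0
    return current_rate - baseline_rate > tolerance
```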

Why It Matters for Research

Reliable empirical systems need observability over changing assumptions.

  • The project treats production data validation as online statistical inference, which mirrors the reliability work needed for reproducible computational research.

Confidentiality Boundary

Architecture and outcomes are shared here.

Internal source code, company data, and proprietary implementation details are not.