Arthur AI | Summer 2025

Python | PyTorch | JAX | Ray | FastAPI | PostgreSQL | Redis Streams | Docker | Kubernetes | AWS

LLM evaluation systems built to keep fast-moving experiments reproducible, inspectable, and decision-safe.

The core problem was keeping evaluation reliable as prompts, datasets, tokenizers, judges, and scoring logic changed at product speed. This work focused on lineage, replay, async execution, and regression analysis that kept results trustworthy.

Key Outcomes

Measured outcomes from the evaluation platform.

Reproducibility

99.98%

exact rerun reproducibility after deterministic lineage redesign

Scale

52,000+

controlled evaluation jobs/month across reasoning, retrieval, tool use, safety, robustness, code, and long-context suites

Throughput

54 min

median turnaround after async scheduling, down from 9.6 hours

Project Breakdown

Problem, method, system, validation, results, reliability, and research value.

Problem

Shared benchmark workflows could not guarantee exact reruns or trustworthy comparisons.

  • Evaluation was no longer a one-off spreadsheet exercise; it was a shared product surface.
  • Teams needed to compare runs confidently even as core inputs changed across the stack.

Method

Each evaluation became a typed experimental object.

  • Represented each run as a typed record over its dataset snapshot, prompt graph, model artifact, tokenizer revision, sampling policy, inference backend, judge rubric, scorer, postprocessor, aggregation rule, and environment hash (a manifest sketch follows this list).
  • Made lineage explicit so model comparisons could be replayed instead of reconstructed from informal context.
  • Separated aggregate score movement from per-slice uncertainty and scenario-family regressions.
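
As a concrete illustration, here is a minimal sketch of what such a typed run object could look like, assuming only the lineage fields named above; the RunManifest class, its field names, and the lineage_key helper are illustrative rather than the platform's internal schema.

```python
# Illustrative typed run manifest; field names are assumptions, not the internal schema.
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunManifest:
    dataset_snapshot: str    # content hash of the frozen dataset
    prompt_graph: str        # hash of the resolved prompt-template graph
    model_artifact: str      # model weights or endpoint revision
    tokenizer_revision: str
    sampling_policy: str     # serialized temperature / top_p / seed settings
    inference_backend: str
    judge_rubric: str
    scorer: str
    postprocessor: str
    aggregation_rule: str
    environment_hash: str    # lockfile plus container image digest

    def lineage_key(self) -> str:
        """Deterministic identity: identical inputs always hash to the same run."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()
```

Because the key is derived from a canonical serialization of every input, a rerun can be verified as exactly the same experiment rather than merely a similar one, and two manifests can be diffed field by field to explain a score movement.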

System / Stack

The platform treated lineage, execution, and review as one system.

  • Used Python, PyTorch, JAX, Ray, FastAPI, PostgreSQL, Redis Streams, Docker, Kubernetes, AWS, OpenTelemetry, NumPy, pandas, and scikit-learn.
  • Built adaptive batching, token-budget admission control, retry semantics, circuit breakers, partial-failure isolation, and cost-aware queue priorities (an admission-control sketch follows this list).
  • Structured artifacts and manifests so runs could be replayed, diffed, and reviewed cleanly.
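
A minimal sketch of how the token-budget admission control and retry pieces could compose in an async worker. The TokenBudget class, the job dictionary shape, the budget size, and the transient-error set are assumptions for illustration, not the production implementation.

```python
# Illustrative token-budget admission control with jittered retry for an async
# evaluation worker; budget size, job shape, and error set are assumptions.
import asyncio
import random

class TokenBudget:
    """Admit jobs only while their estimated tokens fit under a global in-flight cap."""

    def __init__(self, max_tokens_in_flight: int) -> None:
        self._capacity = max_tokens_in_flight
        self._in_flight = 0
        self._cond = asyncio.Condition()

    async def acquire(self, estimated_tokens: int) -> None:
        async with self._cond:
            await self._cond.wait_for(
                lambda: self._in_flight + estimated_tokens <= self._capacity
            )
            self._in_flight += estimated_tokens

    async def release(self, estimated_tokens: int) -> None:
        async with self._cond:
            self._in_flight -= estimated_tokens
            self._cond.notify_all()

async def run_job(budget: TokenBudget, job: dict, max_retries: int = 3):
    tokens = job["estimated_tokens"]
    await budget.acquire(tokens)              # blocks until the budget has room
    try:
        for attempt in range(max_retries):
            try:
                return await job["call"]()    # hypothetical inference/judge call
            except (asyncio.TimeoutError, ConnectionError):
                # jittered exponential backoff before retrying the same job
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError("job exhausted retries")  # isolated as a partial failure
    finally:
        await budget.release(tokens)
```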

Validation Methodology

Every model comparison carried uncertainty estimates and multiple-comparison controls.

  • Implemented paired bootstrap intervals, stratified randomization tests, multiple-comparison correction, per-slice uncertainty bands, effect-size reporting, regression severity tiers, and run-diff reports (a paired-bootstrap sketch follows this list).
  • Added judge reliability diagnostics with gold preference sets, rubric-level variance decomposition, inter-judge agreement, prompt sensitivity, disagreement clustering, and calibration curves.
  • Preserved manifests and artifacts so every important result could be replayed and inspected.
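
A minimal sketch of the paired-bootstrap piece, using NumPy from the stack above: resample example indices with replacement and recompute the score difference between the two models on each resample, so the interval respects the pairing on identical items. The function name, resample count, and alpha are illustrative.

```python
# Paired bootstrap interval for the mean score difference between two models
# scored on the same examples; resample count and alpha are illustrative.
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)      # resample example indices with replacement
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return scores_a.mean() - scores_b.mean(), (lo, hi)
```

In this framing, per-slice uncertainty bands come from the same computation restricted to each slice, and the resulting intervals feed multiple-comparison correction before a regression is declared.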

Results

The platform became faster while telling the truth about its own experiments.

  • Exact rerun reproducibility increased from 71.4% to 99.98%.
  • Benchmark execution scaled to 52,000+ controlled evaluation jobs/month.
  • Median experiment turnaround dropped from 9.6 hours to 54 minutes.
  • Judge agreement on audited tasks improved from kappa = 0.43 to 0.76.
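
For context on the kappa numbers: Cohen's kappa is observed judge agreement corrected for the agreement expected by chance. A tiny illustration with scikit-learn, which is already in the stack; the label arrays below are made-up examples, not audit data.

```python
# Chance-corrected agreement between two judges on the same audited items.
# scikit-learn is in the stack; these label arrays are made-up examples.
from sklearn.metrics import cohen_kappa_score

judge_a = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

# kappa = (p_o - p_e) / (1 - p_e): observed agreement minus chance agreement,
# rescaled so 1.0 is perfect agreement and 0.0 is no better than chance.
print(cohen_kappa_score(judge_a, judge_b))
```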

Failure Modes / Reliability Checks

Benchmark contamination and judge variance were first-class risks.

  • Detected prompt-template leakage and near-duplicate examples that inflated a reasoning suite by 5.8 absolute points.
  • Added semantic deduplication checks, leakage alarms, red-team fixtures, disagreement clustering, and per-slice uncertainty bands.
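
A hedged sketch of the simplest lexical form such a near-duplicate screen can take: character n-gram Jaccard overlap between evaluation items and candidate contamination sources. The threshold, n-gram size, and function names are illustrative; the semantic (embedding-based) checks mentioned above would complement this pass.

```python
# Lexical near-duplicate screen: character 5-gram Jaccard overlap between
# evaluation items and candidate contamination sources. Threshold and names
# are illustrative; embedding-based semantic checks would run alongside this.
def char_ngrams(text: str, n: int = 5) -> set:
    text = " ".join(text.lower().split())     # normalize case and whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def flag_near_duplicates(eval_items, corpus_items, threshold: float = 0.8):
    """Return (eval_index, corpus_index, similarity) for pairs above the threshold."""
    corpus_grams = [char_ngrams(c) for c in corpus_items]
    flags = []
    for i, item in enumerate(eval_items):
        grams = char_ngrams(item)
        for j, cg in enumerate(corpus_grams):
            sim = jaccard(grams, cg)
            if sim >= threshold:
                flags.append((i, j, sim))
    return flags
```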

Why It Matters for Research

Foundation-model evaluation needs the discipline of controlled experimentation.

  • The system makes LLM evaluation more falsifiable by tying claims to lineage, uncertainty, reproducibility, and explicit data-generating assumptions.

Confidentiality Boundary

Public-safe architecture, workflow, and outcomes are documented here.

Internal models, prompts, and proprietary platform details remain private.