Arthur AI | Summer 2025
The core problem was keeping evaluation results reliable while prompts, datasets, tokenizers, judges, and scoring logic changed at product speed. The work focused on lineage tracking, replay, async execution, and regression analysis so that results stayed reproducible and comparable across those changes.
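To make the idea of deterministic lineage concrete, here is a minimal sketch of one common approach: fingerprint every input that can change a result (prompt, dataset, tokenizer, judge, scorer) and treat two runs as comparable only when the fingerprints match. The function names, field names, and version strings below are hypothetical illustrations, not the internal platform's design, which remains private.

```python
import hashlib
import json


def lineage_fingerprint(components: dict) -> str:
    """Return a stable SHA-256 hex digest for a run's inputs.

    Hypothetical helper: keys are sorted and whitespace is fixed so that
    the same logical inputs always serialize to the same bytes.
    """
    canonical = json.dumps(
        components,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_exact_rerun(original: dict, rerun: dict) -> bool:
    """Two runs count as the same lineage only if every input matches."""
    return lineage_fingerprint(original) == lineage_fingerprint(rerun)


# Illustrative values only; none of these come from the source.
run_a = {
    "prompt_version": "p-001",
    "dataset_hash": "d-abc",
    "tokenizer": "tok-1",
    "judge": "judge-1",
    "scorer": "score-1",
    "seed": 0,
}
# Same inputs in a different key order: still the same lineage.
run_b = dict(reversed(list(run_a.items())))
# One input changed: a different lineage, so results are not comparable.
run_c = {**run_a, "tokenizer": "tok-2"}

print(is_exact_rerun(run_a, run_b))
print(is_exact_rerun(run_a, run_c))
```

Hashing a canonical serialization, rather than comparing fields one by one, yields a single key that can be stored with each result and used to detect when any upstream component has changed.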
Key Outcomes
Reproducibility
99.98%
exact rerun reproducibility after deterministic lineage redesign
Scale
52,000+
controlled evaluation jobs/month across reasoning, retrieval, tool use, safety, robustness, code, and long-context suites
Throughput
54 min
median turnaround after async scheduling, down from 9.6 hours
Project Breakdown
Problem
Method
System / Stack
Validation Methodology
Results
Failure Modes / Reliability Checks
Why It Matters for Research
Confidentiality Boundary
Internal models, prompts, and proprietary platform details remain private.