RE-AMP: Generative Audio Robustness Evaluation
A public benchmarking system for controlled robustness evaluation of generated audio.
- Problem
- Generative audio robustness is difficult to compare when prompts, seeds, model checkpoints, transforms, metric configs, and execution environments drift between runs.
- Method
- Ran controlled perturbation studies across compression artifacts, prompt variation, acoustic transformations, stochastic decoding differences, and model-version changes; a seeded perturbation operator is sketched below.
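One way such perturbations stay controlled is for every operator to take an explicit seed alongside its parameters. A minimal sketch in Python/torchaudio, assuming a flat `(op, params)` interface; `perturb` and the operator names are illustrative assumptions, not RE-AMP's actual API:

```python
# A minimal seeded-perturbation sketch; `perturb` and the operator names
# are illustrative assumptions, not RE-AMP's actual API.
import torch
import torchaudio.functional as F

def perturb(waveform: torch.Tensor, sample_rate: int,
            op: str, seed: int, **params) -> torch.Tensor:
    """Apply one controlled perturbation; the explicit seed keeps reruns identical."""
    g = torch.Generator().manual_seed(seed)
    if op == "gain_db":        # loudness shift in decibels
        return waveform * 10 ** (params["db"] / 20)
    if op == "clip":           # hard clipping artifact
        return waveform.clamp(-params["limit"], params["limit"])
    if op == "white_noise":    # additive noise at a target SNR
        noise = torch.randn(waveform.shape, generator=g)
        sig_rms = waveform.pow(2).mean().sqrt()
        return waveform + noise * (sig_rms / 10 ** (params["snr_db"] / 20)) / noise.std()
    if op == "pitch_shift":    # resynthesis-based pitch change
        return F.pitch_shift(waveform, sample_rate, n_steps=params["semitones"])
    raise ValueError(f"unknown perturbation operator: {op}")
```

For example, `perturb(wav, 16_000, "white_noise", seed=7, snr_db=20)` degrades a clip to a fixed 20 dB SNR that any later run can reproduce bit-for-bit on the same stack.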
- System / stack
- Python, PyTorch, torchaudio, FastAPI, React, TypeScript, Redis, PostgreSQL, Docker, async workers, statistical dashboards, and deterministic experiment manifests.
- Validation methodology
- Deterministic manifests covered prompts, seeds, checkpoints, transforms, metric configs, and environment hashes (a manifest sketch follows below); results were compared across signal-level, perceptual, and model-based metric families with bootstrap uncertainty and failure-slice analysis.
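A minimal sketch of what a deterministic manifest might look like, assuming runs are addressed by a content hash over every input; all field names here are illustrative assumptions about what a RE-AMP-style manifest would record:

```python
# A minimal deterministic-manifest sketch; field names are illustrative
# assumptions, not the platform's actual schema.
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class RunManifest:
    prompt: str
    seed: int
    checkpoint: str                                   # e.g. a weights digest
    transforms: tuple                                 # ordered (op, params) pairs
    metric_config: dict = field(default_factory=dict)
    environment: dict = field(default_factory=dict)   # package/driver versions

    def run_id(self) -> str:
        # Canonical JSON (sorted keys) makes the hash independent of field
        # order, so identical inputs always map to the same run id.
        blob = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(blob.encode()).hexdigest()[:16]

manifest = RunManifest(
    prompt="warm analog pad, 120 bpm",
    seed=7,
    checkpoint="sha256:a1b2c3",
    transforms=(("gain_db", {"db": -6}), ("clip", {"limit": 0.9})),
)
print(manifest.run_id())  # stable across machines given identical inputs
```

Hashing a canonical serialization means two runs share a `run_id` exactly when every prompt, seed, checkpoint, transform, metric config, and environment entry agrees, which is what makes drift between runs detectable rather than silent.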
- Results
- Supported 18,000+ controlled robustness runs across 82 perturbation operators and 9 evaluator families while reducing experiment setup time by 91%.
- Failure modes / reliability checks
- Tracked failure modes such as ranking instability, loudness drift, clipping artifacts, and compression sensitivity, alongside reliability metrics including spectral convergence, embedding similarity, pitch/chroma stability, and FAD-style distribution distance (two of these checks are sketched below).
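Two of these checks, plus the bootstrap interval mentioned under validation, are simple enough to sketch directly; thresholds and function names are illustrative assumptions, not the platform's API:

```python
# Reliability-check sketches; limits and names are illustrative assumptions.
import torch

def loudness_drift_db(ref: torch.Tensor, out: torch.Tensor, eps: float = 1e-8) -> float:
    """RMS level change of the generated output relative to the reference, in dB."""
    rms = lambda x: x.pow(2).mean().sqrt()
    return float(20 * torch.log10((rms(out) + eps) / (rms(ref) + eps)))

def clipping_rate(x: torch.Tensor, limit: float = 0.999) -> float:
    """Fraction of samples at or beyond full scale (a clipping-artifact proxy)."""
    return float((x.abs() >= limit).float().mean())

def bootstrap_ci(scores: torch.Tensor, n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of a float tensor of per-run scores."""
    g = torch.Generator().manual_seed(seed)
    idx = torch.randint(len(scores), (n_boot, len(scores)), generator=g)
    means = scores[idx].mean(dim=1)
    return float(means.quantile(alpha / 2)), float(means.quantile(1 - alpha / 2))
```

Because `bootstrap_ci` resamples per-run scores with replacement, the reported interval reflects run-to-run variance rather than a single point estimate, which is what makes ranking instability visible.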
- Why it matters for research
- The platform turns subjective generative-audio comparison into reproducible empirical measurement with explicit assumptions and inspectable artifacts.