
Independent / Research Engineering Project

Python | PyTorch | torchaudio | FastAPI | React | TypeScript | Redis | PostgreSQL | Docker

RE-AMP: controlled robustness evaluation for generative audio models.

RE-AMP is a full-stack benchmark platform for testing how generative audio models hold up under controlled perturbations: distribution shifts, compression artifacts, prompt variation, acoustic transformations, stochastic decoding differences, and model-version changes.

Key Outcomes

Robustness evaluation made reproducible.

Runs: 18,000+ controlled robustness runs

Perturbations: 82 operators across audio and prompt transformations

Setup: 91% reduction in experiment setup time

Project Breakdown

Problem, method, system, validation, results, reliability, and research value.

Problem

Aggregate audio scores can hide model-specific robustness failures.

  • Prompt changes, decoding randomness, transforms, compression, and model versions can change rankings in ways that are hard to reproduce.
  • A useful benchmark needed explicit manifests, reusable adapters, and failure-slice analysis.

Method

Controlled perturbation runs became the unit of comparison.

  • Benchmarks varied prompts, seeds, checkpoints, perturbations, transforms, and metric configurations under pinned environments (see the sweep sketch after this list).
  • Signal-level, perceptual, and model-based metric families were compared, with uncertainty estimates and failure-slice analysis attached to each comparison.
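To make "controlled runs as the unit of comparison" concrete, here is a minimal sketch of how such a sweep could be enumerated. `RunSpec` and the factor lists are illustrative assumptions, not RE-AMP's actual API.

```python
# Hypothetical sketch of how controlled runs could be enumerated; RunSpec and
# the factor lists are illustrative assumptions, not RE-AMP's actual API.
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class RunSpec:
    prompt: str
    seed: int
    checkpoint: str
    perturbation: str

PROMPTS = ["a solo piano melody", "rain on a tin roof"]
SEEDS = [0, 1, 2]
CHECKPOINTS = ["model-v1.2", "model-v1.3"]
PERTURBATIONS = ["none", "mp3_64kbps", "pitch_shift_+2st"]

def enumerate_runs():
    # The full cross-product makes every varied factor explicit, so any two
    # runs differ only in the fields recorded on their specs.
    for prompt, seed, ckpt, pert in itertools.product(
        PROMPTS, SEEDS, CHECKPOINTS, PERTURBATIONS
    ):
        yield RunSpec(prompt, seed, ckpt, pert)

if __name__ == "__main__":
    runs = list(enumerate_runs())
    print(f"{len(runs)} controlled runs")  # 2 * 3 * 2 * 3 = 36
```

Enumerating the full cross-product keeps every varied factor explicit, so any two runs can be compared knowing exactly which factors differ.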

System / Stack

The platform handled orchestration, storage, and comparative dashboards.

  • Built with Python, PyTorch, torchaudio, FastAPI, React, TypeScript, Redis, PostgreSQL, and Docker, with async workers and statistical dashboards on top.
  • Used declarative benchmark specifications, reusable dataset/model adapters, typed evaluator configs, cached feature extraction, and structured result storage (a typed-config sketch follows this list).
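As a rough illustration of the declarative, typed configuration style, here is a sketch using Pydantic (already a FastAPI dependency). The field names and schema are assumptions, not the platform's real spec format.

```python
# A minimal sketch of a typed, declarative benchmark spec, assuming Pydantic v2;
# field names are illustrative, not the platform's real schema.
from typing import Literal
from pydantic import BaseModel, Field

class EvaluatorConfig(BaseModel):
    family: Literal["signal", "perceptual", "model_based"]
    metric: str                                  # e.g. "spectral_convergence"
    sample_rate: int = Field(default=24000, ge=8000, le=96000)
    n_bootstrap: int = Field(default=200, ge=1)  # resamples for uncertainty

class BenchmarkSpec(BaseModel):
    dataset_adapter: str                         # registry key for a reusable adapter
    model_adapter: str
    evaluators: list[EvaluatorConfig]
    env_hash: str                                # pins the runtime environment

spec = BenchmarkSpec(
    dataset_adapter="demo-dataset",
    model_adapter="demo-model",
    evaluators=[EvaluatorConfig(family="signal", metric="spectral_convergence")],
    env_hash="sha256:placeholder",
)
print(spec.model_dump())  # validated, serializable benchmark description
```

Validation at load time means a malformed spec fails before any GPU time is spent, which is what makes reusable configs cheap to share.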

Validation Methodology

Every run carried its experimental assumptions.

  • Deterministic manifests recorded prompts, seeds, model checkpoints, audio transforms, metric configurations, and environment hashes (sketched after this list).
  • Metric panels included FAD-style distribution distance, embedding similarity, spectral convergence, loudness drift, pitch/chroma stability, clipping artifacts, compression sensitivity, and ranking instability.
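A minimal sketch of what such a deterministic manifest could look like, assuming a canonical-JSON hashing scheme; the keys are illustrative, not RE-AMP's actual manifest format.

```python
# Hypothetical run manifest with a content hash; keys and hashing scheme are
# assumptions for illustration, not RE-AMP's actual format.
import hashlib
import json
import platform

def manifest_for_run(prompt: str, seed: int, checkpoint: str,
                     transforms: list[str], metrics: list[str]) -> dict:
    manifest = {
        "prompt": prompt,
        "seed": seed,
        "checkpoint": checkpoint,
        "audio_transforms": transforms,
        "metric_configs": metrics,
        "environment": {"python": platform.python_version()},
    }
    # Hash a canonical serialization so identical setups yield identical IDs.
    canonical = json.dumps(manifest, sort_keys=True).encode()
    manifest["manifest_hash"] = hashlib.sha256(canonical).hexdigest()
    return manifest

m = manifest_for_run("a solo piano melody", seed=0, checkpoint="model-v1.2",
                     transforms=["mp3_64kbps"], metrics=["fad", "loudness_drift"])
print(m["manifest_hash"][:12])  # stable identifier for reruns and audits
```

Hashing a canonical serialization gives each run a stable identifier, so a rerun under identical assumptions can be detected, deduplicated, and audited.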

Results

RE-AMP made large robustness sweeps easier to run and audit.

  • Supported 18,000+ controlled robustness runs across 82 perturbation operators and 9 evaluator families.
  • Reduced experiment setup time by 91% through reusable configs and adapters.

Failure Modes / Reliability Checks

The benchmark looked for hidden sensitivity, not only average quality.

  • Tracked ranking instability, compression sensitivity, clipping artifacts, loudness drift, pitch/chroma instability, and failure slices hidden by aggregate summaries (a ranking-instability sketch follows).
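As one concrete example of these checks, ranking instability can be quantified as the fraction of model pairs whose order flips between two conditions (a Kendall-tau-style distance). The rankings below are fabricated placeholders for illustration.

```python
# Sketch of a ranking-instability check; model names and rankings are
# fabricated placeholders, not RE-AMP results.
from itertools import combinations

def kendall_tau_distance(rank_a: list[str], rank_b: list[str]) -> float:
    # Fraction of model pairs whose relative order flips between rankings.
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    flips = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0 for x, y in pairs
    )
    return flips / len(pairs)

clean = ["model_a", "model_b", "model_c", "model_d"]      # ranking on clean audio
perturbed = ["model_b", "model_a", "model_d", "model_c"]  # ranking after compression
print(f"instability: {kendall_tau_distance(clean, perturbed):.2f}")  # 0.33
```

A score of 0 means the perturbation leaves the leaderboard untouched; values near 1 mean the ranking is essentially an artifact of the evaluation condition.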

Why It Matters for Research

Robust generative-audio evaluation needs reproducible empirical instrumentation.

  • The project turns informal listening comparisons into controlled, falsifiable experiments with traceable artifacts.