
Independent / Research Engineering Project

Python | PyTorch | torchaudio | FastAPI | React | TypeScript | Redis | PostgreSQL | Docker

RE-AMP: controlled robustness evaluation for generative audio models.

RE-AMP is a full-stack benchmark platform for testing how generative audio models hold up under controlled perturbations: distribution shifts, compression artifacts, prompt variation, acoustic transformations, stochastic decoding differences, and model-version changes.

Key Outcomes

Robustness evaluation made reproducible.

Runs: 18,000+ controlled robustness runs

Perturbations: 82 operators across audio and prompt transformations

Setup: 91% reduction in experiment setup time

Project Breakdown

Problem, method, system, validation, results, reliability, and research value.

Problem

Aggregate audio scores can hide model-specific robustness failures.

  • Prompt changes, decoding randomness, transforms, compression, and model versions can change rankings in ways that are hard to reproduce.
  • A useful benchmark needed explicit manifests, reusable adapters, and failure-slice analysis.

Method

Controlled perturbation runs became the unit of comparison.

  • Benchmarks varied prompts, seeds, checkpoints, perturbations, transforms, and metric configurations under pinned environments (see the sweep sketch after this list).
  • Signal-level, perceptual, and model-based metric families were compared, with uncertainty estimates and failure-slice analysis attached to each comparison.
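To make "controlled runs as the unit of comparison" concrete, here is a minimal sketch of how such a sweep could be enumerated. `RunSpec` and the factor lists are illustrative assumptions, not RE-AMP's actual API.

```python
# Hypothetical sketch of how controlled runs could be enumerated; RunSpec and
# the factor lists are illustrative assumptions, not RE-AMP's actual API.
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class RunSpec:
    prompt: str
    seed: int
    checkpoint: str
    perturbation: str

PROMPTS = ["a solo piano melody", "rain on a tin roof"]
SEEDS = [0, 1, 2]
CHECKPOINTS = ["model-v1.2", "model-v1.3"]
PERTURBATIONS = ["none", "mp3_64kbps", "pitch_shift_+2st"]

def enumerate_runs():
    # The full cross-product makes every varied factor explicit, so any two
    # runs differ only in the fields recorded on their specs.
    for prompt, seed, ckpt, pert in itertools.product(
        PROMPTS, SEEDS, CHECKPOINTS, PERTURBATIONS
    ):
        yield RunSpec(prompt, seed, ckpt, pert)

if __name__ == "__main__":
    runs = list(enumerate_runs())
    print(f"{len(runs)} controlled runs")  # 2 * 3 * 2 * 3 = 36
```

Enumerating the full cross-product keeps every varied factor explicit, so any two runs can be compared knowing exactly which factors differ.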

System / Stack

The platform handled orchestration, storage, and comparative dashboards.

  • Built with Python, PyTorch, torchaudio, FastAPI, React, TypeScript, Redis, PostgreSQL, and Docker, with async workers and statistical dashboards on top.
  • Used declarative benchmark specifications, reusable dataset/model adapters, typed evaluator configs, cached feature extraction, and structured result storage (a typed-config sketch follows this list).
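As a rough illustration of the declarative, typed configuration style, here is a sketch using Pydantic (already a FastAPI dependency). The field names and schema are assumptions, not the platform's real spec format.

```python
# A minimal sketch of a typed, declarative benchmark spec, assuming Pydantic v2;
# field names are illustrative, not the platform's real schema.
from typing import Literal
from pydantic import BaseModel, Field

class EvaluatorConfig(BaseModel):
    family: Literal["signal", "perceptual", "model_based"]
    metric: str                                  # e.g. "spectral_convergence"
    sample_rate: int = Field(default=24000, ge=8000, le=96000)
    n_bootstrap: int = Field(default=200, ge=1)  # resamples for uncertainty

class BenchmarkSpec(BaseModel):
    dataset_adapter: str                         # registry key for a reusable adapter
    model_adapter: str
    evaluators: list[EvaluatorConfig]
    env_hash: str                                # pins the runtime environment

spec = BenchmarkSpec(
    dataset_adapter="demo-dataset",
    model_adapter="demo-model",
    evaluators=[EvaluatorConfig(family="signal", metric="spectral_convergence")],
    env_hash="sha256:placeholder",
)
print(spec.model_dump())  # validated, serializable benchmark description
```

Validation at load time means a malformed spec fails before any GPU time is spent, which is what makes reusable configs cheap to share.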

Validation Methodology

Every run carried its experimental assumptions.

  • Deterministic manifests recorded prompts, seeds, model checkpoints, audio transforms, metric configurations, and environment hashes (sketched after this list).
  • Metric panels included FAD-style distribution distance, embedding similarity, spectral convergence, loudness drift, pitch/chroma stability, clipping artifacts, compression sensitivity, and ranking instability.
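A minimal sketch of what such a deterministic manifest could look like, assuming a canonical-JSON hashing scheme; the keys are illustrative, not RE-AMP's actual manifest format.

```python
# Hypothetical run manifest with a content hash; keys and hashing scheme are
# assumptions for illustration, not RE-AMP's actual format.
import hashlib
import json
import platform

def manifest_for_run(prompt: str, seed: int, checkpoint: str,
                     transforms: list[str], metrics: list[str]) -> dict:
    manifest = {
        "prompt": prompt,
        "seed": seed,
        "checkpoint": checkpoint,
        "audio_transforms": transforms,
        "metric_configs": metrics,
        "environment": {"python": platform.python_version()},
    }
    # Hash a canonical serialization so identical setups yield identical IDs.
    canonical = json.dumps(manifest, sort_keys=True).encode()
    manifest["manifest_hash"] = hashlib.sha256(canonical).hexdigest()
    return manifest

m = manifest_for_run("a solo piano melody", seed=0, checkpoint="model-v1.2",
                     transforms=["mp3_64kbps"], metrics=["fad", "loudness_drift"])
print(m["manifest_hash"][:12])  # stable identifier for reruns and audits
```

Hashing a canonical serialization gives each run a stable identifier, so a rerun under identical assumptions can be detected, deduplicated, and audited.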

Results

RE-AMP made large robustness sweeps easier to run and audit.

  • Supported 18,000+ controlled robustness runs across 82 perturbation operators and 9 evaluator families.
  • Reduced experiment setup time by 91% through reusable configs and adapters.

Failure Modes / Reliability Checks

The benchmark looked for hidden sensitivity, not only average quality.

  • Tracked ranking instability, compression sensitivity, clipping artifacts, loudness drift, pitch/chroma instability, and failure slices hidden by aggregate summaries (a ranking-instability sketch follows).
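As one concrete example of these checks, ranking instability can be quantified as the fraction of model pairs whose order flips between two conditions (a Kendall-tau-style distance). The rankings below are fabricated placeholders for illustration.

```python
# Sketch of a ranking-instability check; model names and rankings are
# fabricated placeholders, not RE-AMP results.
from itertools import combinations

def kendall_tau_distance(rank_a: list[str], rank_b: list[str]) -> float:
    # Fraction of model pairs whose relative order flips between rankings.
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    flips = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0 for x, y in pairs
    )
    return flips / len(pairs)

clean = ["model_a", "model_b", "model_c", "model_d"]      # ranking on clean audio
perturbed = ["model_b", "model_a", "model_d", "model_c"]  # ranking after compression
print(f"instability: {kendall_tau_distance(clean, perturbed):.2f}")  # 0.33
```

A score of 0 means the perturbation leaves the leaderboard untouched; values near 1 mean the ranking is essentially an artifact of the evaluation condition.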

Why It Matters for Research

Robust generative-audio evaluation needs reproducible empirical instrumentation.

  • The project turns informal listening comparisons into controlled, falsifiable experiments with traceable artifacts.