
Georgia Tech / Emory University | Sept. 2022 to Jan. 2024

Python | PyTorch | scikit-learn | SQL | Streamlit

Biomedical ML research systems built for rigor, reproducibility, and clean evaluation.

The core challenge was experimental rigor: multimodal pipelines, leakage risk, and evaluation methodology could all quietly distort results. The work centered on repeatable workflows, cohort fixes, and clearer analysis surfaces for collaborators.

Key Outcomes

Measured outcomes from the research pipeline.

Held-out AUROC

0.86

after cohort and methodology fixes, up from 0.69

Experimentation

1,200+

controlled runs across preprocessing, fusion, and evaluation settings

Calibration

0.045

expected calibration error after fixes, down from 0.18

Project Breakdown

Problem, method, system, validation, results, reliability, and research value.

Problem

Methodology issues could quietly make multimodal results look better than they were.

  • The hardest problems were cohort construction, evaluation design, and experimental hygiene.
  • The system needed to make rigorous comparison easier for collaborators running many variants.

Method

Cohort construction was rebuilt around clinical validity.

  • Supported cohort definitions, preprocessing variants, imaging features, clinical covariates, fusion strategies, calibration procedures, subgroup analyses, and decision thresholds.
  • Rebuilt splits around patient-level independence, temporal separation, site-aware validation, feature availability, and label-proxy audits (a grouped-split sketch follows this list).
  • Collaborators needed repeatable ways to compare runs without rebuilding the pipeline each time.
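
As a rough illustration of the splitting discipline above, the sketch below uses scikit-learn's StratifiedGroupKFold to keep every patient on one side of each split, plus a cheap overlap assertion. The column names (patient_id, label) are hypothetical, not the project's schema.

```python
# Minimal sketch: patient-level, stratified splits plus a leakage audit.
# Column names (patient_id, label) are illustrative, not the project's schema.
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

def patient_level_folds(df: pd.DataFrame, n_splits: int = 5):
    """Yield (train_idx, test_idx) pairs where no patient spans both sides."""
    splitter = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Grouping by patient_id prevents records from the same patient
    # leaking across the train/test boundary.
    yield from splitter.split(df, y=df["label"], groups=df["patient_id"])

def assert_no_patient_overlap(df: pd.DataFrame, train_idx, test_idx) -> None:
    """Cheap audit that fails loudly if any patient appears on both sides."""
    train_patients = set(df.iloc[train_idx]["patient_id"])
    test_patients = set(df.iloc[test_idx]["patient_id"])
    assert train_patients.isdisjoint(test_patients), "patient-level leakage detected"
```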

System / Stack

Leakage prevention and reproducibility became part of the workflow itself.

  • Used Python, PyTorch, scikit-learn, NumPy, pandas, SQL, Streamlit, and MATLAB, together with medical-imaging preprocessing, calibration analysis, and nested cross-validation.
  • Tracked preprocessing variants, feature sets, fusion strategies, calibration, thresholds, and evaluation outputs (a run-manifest sketch follows this list).
  • Built interfaces that made metrics and artifacts easier to inspect without re-running everything manually.
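
One concrete way to realize that kind of tracking is a content-addressed run manifest, sketched below; every field name here is an assumption for illustration, not the project's actual schema.

```python
# Illustrative run manifest: record every configurable choice up front so a
# run can be reproduced from its manifest alone. Field names are hypothetical.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConfig:
    cohort: str            # named cohort definition
    preprocessing: str     # preprocessing variant id
    feature_set: str       # e.g. "imaging", "clinical", "fused"
    fusion: str            # e.g. "early", "late", "none"
    calibration: str       # e.g. "platt", "isotonic", "none"
    threshold_policy: str  # how the decision threshold is chosen
    seed: int = 0

    def run_id(self) -> str:
        """Content-addressed id: identical configs map to the same run."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

cfg = RunConfig("cohort_v3", "prep_b", "fused", "late", "isotonic", "youden")
print(cfg.run_id())  # stable across machines and reruns
```

A manifest like this turns "compare runs without rebuilding the pipeline" into a lookup rather than a rerun.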

Validation Methodology

Evaluation reports included uncertainty, calibration, and subgroup structure.

  • Implemented bootstrap confidence intervals, DeLong-style AUROC comparisons, AUPRC, sensitivity/specificity at policy thresholds, decision-curve analysis, subgroup calibration, missingness analysis, and an error taxonomy (a bootstrap sketch follows this list).
  • Treated subgroup evaluation, calibration, and threshold analysis as standard outputs.
  • Turned leakage prevention into infrastructure instead of a one-time cleanup exercise.
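
A minimal sketch of one such standard output, a patient-level bootstrap confidence interval for AUROC, follows; resampling patients rather than rows is an assumption chosen to match the patient-level splits described above.

```python
# Sketch: patient-level bootstrap CI for AUROC. Resampling by patient keeps
# repeated records from the same patient together within each replicate.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, patient_ids, n_boot=2000, alpha=0.05, seed=0):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    patient_ids = np.asarray(patient_ids)
    patients = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        # Draw patients with replacement, then gather all of their records.
        sampled = rng.choice(patients, size=len(patients), replace=True)
        idx = np.concatenate([np.flatnonzero(patient_ids == p) for p in sampled])
        if len(np.unique(y_true[idx])) < 2:
            continue  # replicate drew a single class; AUROC is undefined
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(stats)), (float(lo), float(hi))
```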

Results

Better science produced the largest gains.

  • Held-out AUROC improved from 0.69 to 0.86 after fixing cohort construction and evaluation methodology.
  • Expected calibration error improved from 0.18 to 0.045 (an ECE sketch follows this list).
  • The pipeline supported more than 1,200 controlled runs across preprocessing, fusion, calibration, and threshold settings.
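
For context on the calibration numbers, the sketch below shows one standard way to compute expected calibration error with equal-width bins; the ten-bin choice is illustrative, not necessarily the binning used here.

```python
# Sketch: expected calibration error (ECE) with equal-width bins, i.e. the
# gap between mean predicted probability and observed event rate, weighted
# by how many predictions land in each bin. Bin count is illustrative.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each prediction to an equal-width confidence bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # |mean predicted probability - observed event rate|, weighted by bin mass
        ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return float(ece)
```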

Failure Modes / Reliability Checks

Ablations distinguished signal from shortcuts.

  • Designed ablations to separate genuine predictive signal from scanner/site artifacts, missingness shortcuts, preprocessing artifacts, distributional confounding, and label-derived proxies (a shortcut-ceiling sketch follows below).
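
One representative check of this kind is a shortcut ceiling: fit a trivially simple model on site identity alone and compare its AUROC to the full model's. The sketch below is a hypothetical illustration of that idea, not the project's exact ablation code.

```python
# Sketch: AUROC achievable from site identity alone. If this "shortcut
# ceiling" approaches the full model's AUROC, the apparent signal is suspect.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

def site_shortcut_auroc(sites_train, y_train, sites_test, y_test):
    enc = OneHotEncoder(handle_unknown="ignore")
    X_train = enc.fit_transform(np.asarray(sites_train).reshape(-1, 1))
    X_test = enc.transform(np.asarray(sites_test).reshape(-1, 1))
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```

The same recipe applies to missingness shortcuts: swap the site column for binary missingness indicators.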

Why It Matters for Research

Biomedical ML needs methodology as much as model capacity.

  • The work shows how reproducible pipelines, leakage audits, and calibrated uncertainty can turn retrospective modeling into a more credible scientific instrument.

Confidentiality Boundary

Scientific workflow and measured outcomes are documented here.

Private datasets, collaborator data, and internal artifacts remain private.