Music source separation aims to extract individual sound sources (e.g., vocals, drums, guitar) from a mixed music recording.
However, evaluating the quality of separated audio remains challenging, as commonly used metrics like the source-to-distortion
ratio (SDR) do not always align with human perception. In this study, we conducted a large-scale listener evaluation on the
MUSDB18 test set, collecting approximately 30 ratings per track from seven distinct listener groups. We compared several objective
metrics: legacy energy-ratio measures (BSSEval v4, SI-SDR variants) and embedding-based alternatives (Fréchet Audio
Distance using CLAP-LAION-music, EnCodec, VGGish, Wav2Vec2, and HuBERT). While SDR remains the best-performing metric for
vocal estimates, our results show that the scale-invariant signal-to-artifacts ratio (SI-SAR) better predicts listener ratings
for drums and bass stems. Fréchet Audio Distance (FAD) computed with the CLAP-LAION-music embedding also performs competitively, achieving
Kendall's τ values of 0.25 for drums and 0.19 for bass, which matches or surpasses the energy-based metrics for those stems. However,
none of the embedding-based metrics, including CLAP, correlate positively with human perception for vocal estimates. These
findings highlight the need for stem-specific evaluation strategies and suggest that no single metric reliably reflects perceptual
quality across all source types. We release our raw listener ratings to support reproducibility and further research.
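
As an illustration of the rank-correlation analysis summarized above, the following is a minimal sketch of Kendall's τ between an objective metric and mean listener ratings. It assumes the tau-b variant, which corrects for ties; the abstract does not state which variant was used. The function name `kendall_tau_b` and the example numbers are hypothetical and are not taken from the released ratings.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both variables: excluded from both factors
        elif dx == 0:
            ties_x += 1         # tied only in x
        elif dy == 0:
            ties_y += 1         # tied only in y
        elif dx * dy > 0:
            concordant += 1     # pair ordered the same way by x and y
        else:
            discordant += 1     # pair ordered in opposite ways
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical per-track values for one stem (illustration only).
metric_scores = [2.1, 4.5, 6.0, 8.3]   # e.g. an energy-ratio metric in dB
mean_ratings = [1.5, 3.0, 2.5, 4.0]    # e.g. mean listener rating per track

print(round(kendall_tau_b(metric_scores, mean_ratings), 3))  # 0.667
```

For a distance-like metric such as FAD, where lower values indicate better quality, the scores would presumably be negated before computing τ so that a positive value still means agreement with listeners; this sign convention is an assumption here.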