TuneJury: An Open Metric for Improving Music Generation Preference Alignment

TL;DR

TuneJury is a pairwise preference reward model for text-to-music, trained on 17,500 human preference pairs. It reaches 0.7086 pairwise accuracy on the CMI-RewardBench test set, ahead of the no-pseudo-label ablation (0.541) and the partially pseudo-augmented model (0.691).

cs.SD 🔴 Advanced 2026-06-16
Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Junghyun Koo, Koichi Saito, Yuki Mitsufuji, Chris Donahue
music generation, preference modeling, reinforcement learning, reward models, deep learning

Key Findings

Methodology

TuneJury uses a RankNet-based pairwise preference learning framework. Pre-trained audio and text encoders (LAION-CLAP and MERT-v1-330M) extract features, and a small MLP head (~2.8 million parameters) maps the concatenated features to a scalar preference score. Training maximizes the likelihood of the human preferences expressed as pairwise comparisons in four open datasets, totaling 17,500 pairs. Because the model scores each clip individually, its scores can be used to filter data with a threshold and can be calibrated post hoc. Anchor calibration, based on Bradley-Terry models, aligns scores across different music generation systems and reduces the need for costly retraining.
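
The pairwise objective described above can be illustrated with a minimal sketch. It is not the released training code: the linear scorer stands in for the MLP head, the two-dimensional feature vectors stand in for the concatenated encoder embeddings, and the names `score`, `ranknet_loss`, `train`, and `toy_pairs` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def score(w: Sequence[float], x: Sequence[float]) -> float:
    # Assumption: a linear map stands in for the MLP head (features -> scalar score).
    return sum(wi * xi for wi, xi in zip(w, x))

def ranknet_loss(s_win: float, s_lose: float) -> float:
    # Negative log-likelihood of the observed preference:
    # -log sigmoid(s_win - s_lose), written as a stable softplus of the negated margin.
    margin = s_win - s_lose
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

def train(pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
          dim: int, lr: float = 0.1, epochs: int = 200) -> List[float]:
    # Plain stochastic gradient descent on the pairwise logistic loss.
    w = [0.0] * dim
    for _ in range(epochs):
        for x_win, x_lose in pairs:
            margin = score(w, x_win) - score(w, x_lose)
            # d/dw of -log sigmoid(margin) is -(1 - sigmoid(margin)) * (x_win - x_lose).
            g = -(1.0 - sigmoid(margin))
            for i in range(dim):
                w[i] -= lr * g * (x_win[i] - x_lose[i])
    return w

# Toy data: (features of preferred clip, features of rejected clip).
toy_pairs = [([1.0, 0.2], [0.0, 0.5]),
             ([0.8, 0.9], [0.1, 0.1]),
             ([0.9, 0.0], [0.3, 0.7])]
w = train(toy_pairs, dim=2)
```

After training, `score(w, x)` can be applied to a single clip, which is what makes threshold-based filtering possible with a model trained only on pairs.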

Key Results

  • On the CMI-RewardBench test set, TuneJury achieves a pairwise accuracy of 0.7086, surpassing the no-pseudo-label ablation (0.541) and partially pseudo-augmented models (0.691). It maintains competitive performance on out-of-distribution benchmarks, with SRCC reaching up to 0.7680 on MusicEval, demonstrating strong generalization.
  • The same reward model supports three downstream uses: inference-time best-of-N selection (reward increases monotonically with N up to 32), DITTO-style latent optimization that improves both distributional and alignment metrics, and expert-iteration post-training that balances reward gains against distribution fidelity.
  • Anchor calibration, a per-system Bradley-Terry calibration, aligns scores for new music generation systems using about 25× less data than retraining, so the model can be adapted to a new system without being retrained.
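
Best-of-N selection, mentioned in the results above, amounts to scoring N candidates and keeping the highest-scoring one. The sketch below assumes a generic `reward` callable in place of the actual model, and the toy reward values are invented for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Sequence[T], reward: Callable[[T], float]) -> T:
    """Return the candidate with the highest reward score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)

def best_reward_curve(rewards: Sequence[float], ns: Sequence[int]) -> List[float]:
    """Best reward among the first n candidates, for each n in ns."""
    return [max(rewards[:n]) for n in ns]

# Toy rewards standing in for reward-model scores of 8 generated clips.
toy_rewards = [0.1, -0.3, 0.7, 0.2, 0.9, 0.4, -0.1, 0.6]
curve = best_reward_curve(toy_rewards, [1, 2, 4, 8])
```

When the candidate sets are nested, as in `best_reward_curve`, the selected reward cannot decrease as N grows. The more informative question is whether other metrics improve alongside it.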

Significance

This work addresses the challenge of subjective and inconsistent music evaluation by providing an open, data-efficient, and scalable preference model. Unlike traditional metrics such as FAD, TuneJury directly models human preferences, capturing nuanced aesthetic judgments at the individual clip level. Its ability to generalize across unseen data and adapt swiftly to new systems makes it a valuable tool for advancing preference-aligned music AI. The open-source release fosters transparency and community-driven improvements, promising to accelerate research and industrial deployment in personalized music synthesis, recommendation, and creative AI applications.

Technical Contribution

The primary technical innovations include: 1) a lightweight (2.8M parameters) pairwise preference model based on RankNet, trained solely on publicly available human preference data; 2) elimination of reliance on pseudo-label augmentation, simplifying training and improving transparency; 3) a novel anchor calibration technique leveraging Bradley-Terry models for efficient cross-system score alignment, reducing calibration data requirements by approximately 25 times. These contributions collectively enhance the scalability, robustness, and practicality of preference models in music generation contexts.

Novelty

This study is the first to systematically implement a RankNet-based pairwise preference learning framework in the music domain using only publicly available human preference data. Unlike prior models such as CMI-RM, which incorporate pseudo-labels and multi-axis outputs, TuneJury is a lean, single-score model that supports rapid cross-system calibration. Its integration of anchor calibration for system-specific score alignment is a novel contribution, enabling effective adaptation without retraining. These innovations mark a significant step forward in scalable, preference-driven music AI.

Limitations

  • The model depends heavily on the quality and diversity of available preference data, which may limit its performance on unseen styles or highly subjective tastes. Data collection remains costly and potentially biased.
  • While anchor calibration reduces the need for retraining, its effectiveness diminishes if the new system's characteristics diverge significantly from the calibration data, possibly leading to calibration drift.
  • Despite its lightweight design, the model still requires pre-extracted features and inference computations, which may pose challenges for real-time deployment in resource-constrained environments.

Future Work

Future directions include expanding the preference dataset to cover more diverse styles and user groups, integrating multi-modal cues such as visual or emotional context, and developing active learning strategies to continuously improve calibration and generalization. Additionally, exploring end-to-end training pipelines and more efficient architectures could further enhance deployment efficiency. Ultimately, the goal is to create adaptive, personalized music AI systems capable of real-time preference learning and dynamic adjustment, fostering more engaging and satisfying user experiences.

AI Executive Summary

Music is a deeply subjective art form, with individual preferences varying widely across listeners. Traditional metrics for evaluating AI-generated music, such as Fréchet Audio Distance (FAD), focus on distributional similarity in embedding space and often fail to capture human aesthetic judgments directly. This disconnect hampers progress in developing music generation systems that truly resonate with users. Recognizing this challenge, Kim et al. introduce TuneJury, an open-source, preference-based reward model designed to align AI music generation with human tastes.

TuneJury leverages a pairwise learning framework inspired by RankNet, training on 17,500 human preference comparisons collected from publicly available datasets. These datasets include Music Arena, MusicPrefs, AIME, and SongEval, each providing different perspectives on musical quality and alignment. The core idea is to predict the probability that one music clip is preferred over another, enabling the model to produce a scalar preference score for any given clip. This approach circumvents the issues associated with absolute ratings, such as scale drift and individual bias, by focusing on relative preferences.

The model is lightweight, with only 2.8 million trainable parameters. Embeddings from pre-trained audio and text encoders (LAION-CLAP and MERT-v1-330M) are concatenated and fed into the MLP head. Training uses a pairwise logistic loss, so the model learns to give the preferred clip the higher score. The authors also introduce a post-hoc anchor calibration method based on Bradley-Terry models, which adapts scores to different music generation systems with little additional data and avoids costly retraining.
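
The exact form of anchor calibration is not given in this summary. As one plausible reading, the sketch below assumes a single additive offset for a new system, fitted by Bradley-Terry maximum likelihood on a few human-labeled pairs against an already-calibrated anchor system while the reward model stays frozen. The function `fit_system_offset` and the data layout are hypothetical.

```python
import math
from typing import Sequence, Tuple

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fit_system_offset(pairs: Sequence[Tuple[float, float, bool]],
                      lr: float = 0.5, steps: int = 500) -> float:
    """Fit one offset b for a new system (assumed parameterization).

    Each pair is (r_new, r_anchor, new_won): the frozen reward of a clip from
    the new system, the frozen reward of a clip from the anchor system, and
    whether listeners preferred the new system's clip.
    Model: P(new preferred) = sigmoid((r_new + b) - r_anchor).
    """
    b = 0.0
    for _ in range(steps):
        # Gradient of the mean log-likelihood with respect to b.
        grad = sum((1.0 if won else 0.0) - sigmoid(r_new + b - r_anchor)
                   for r_new, r_anchor, won in pairs) / len(pairs)
        b += lr * grad  # gradient ascent; the objective is concave in b
    return b

# Toy case: the frozen reward underrates the new system by a constant,
# yet listeners prefer its clips in 3 of 4 comparisons.
toy_pairs = [(0.0, 1.0, True)] * 3 + [(0.0, 1.0, False)]
offset = fit_system_offset(toy_pairs)
```

Only one scalar is estimated per system, which is why such a calibration can work with far fewer labels than retraining. If the new system wins (or loses) every calibration pair, the maximum-likelihood offset is unbounded, so a practical version would need a regularizer or a prior.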

Experimental results show that TuneJury achieves a pairwise accuracy of 0.7086 on a held-out test set, ahead of the no-pseudo-label ablation (0.541) and the partially pseudo-augmented model (0.691). It remains competitive on out-of-distribution benchmarks such as MusicEval and PAM, with SRCC reaching up to 0.7680 on MusicEval. The same reward model serves several downstream applications: selecting the best sample among N candidates, optimizing latent representations in a DITTO-style manner, and iterative fine-tuning guided by the reward signal. Across these tasks, it yields consistent gains in reward.
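
The two metrics quoted above can be computed as in the following sketch. It uses only the standard library, treats tied scores as a prediction for the second clip (an assumption), and runs on invented toy inputs rather than benchmark data.

```python
import math
from typing import List, Sequence, Tuple

def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]],
                      a_preferred: Sequence[bool]) -> float:
    """Fraction of pairs where the higher-scored clip matches the human choice."""
    # A tie (s_a == s_b) counts as predicting clip b.
    correct = sum((s_a > s_b) == label
                  for (s_a, s_b), label in zip(score_pairs, a_preferred))
    return correct / len(a_preferred)

def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

acc = pairwise_accuracy([(1.0, 0.0), (0.0, 1.0), (2.0, 3.0)], [True, True, False])
rho = srcc([0.2, 0.5, 0.9, 1.4], [1.0, 2.0, 3.5, 4.0])
```

Pairwise accuracy applies to benchmarks with A-versus-B labels, while SRCC applies to benchmarks that provide per-clip human ratings to rank against.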

The significance of this work lies in its ability to provide a scalable, transparent, and adaptable preference metric that directly models human judgments. Its open-source nature encourages community participation and further development. By enabling rapid system-specific calibration, TuneJury reduces the barriers to deploying preference-aware music AI in real-world scenarios, such as personalized recommendation engines, creative tools, and interactive composition systems.

Looking ahead, future research will focus on expanding preference datasets, integrating multi-modal cues, and developing active learning strategies to enhance calibration and generalization. The ultimate goal is to create intelligent music systems capable of real-time preference adaptation, fostering more engaging and personalized musical experiences. TuneJury thus represents a significant step toward human-aligned AI in the creative domain, promising to reshape how machines understand and generate music that resonates with individual tastes.

Deep Dive

Abstract

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
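
One way to check the calibration claim in the abstract is a reliability table: group pairs by the predicted preference probability and compare it with the observed win rate. The sketch below assumes that the predicted probability is the logistic function of the score margin, consistent with a RankNet-style model. The binning scheme and the function names are illustrative choices, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reliability_bins(margins: Sequence[float], a_won: Sequence[bool],
                     n_bins: int = 5) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed win rate, count) per non-empty bin."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for margin, won in zip(margins, a_won):
        p = sigmoid(margin)  # assumed P(clip a preferred) given margin s_a - s_b
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, 1.0 if won else 0.0))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            win_rate = sum(w for _, w in b) / len(b)
            rows.append((mean_p, win_rate, len(b)))
    return rows

def expected_calibration_error(rows: Sequence[Tuple[float, float, int]]) -> float:
    """Count-weighted mean gap between predicted probability and observed win rate."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(p - w) for p, w, n in rows) / total

# Toy check: zero margins predict 0.5, and clip a wins half of the time.
rows = reliability_bins([0.0, 0.0, 0.0, 0.0], [True, False, True, False])
ece = expected_calibration_error(rows)
```

A small gap in every bin means the margin can be read as a confidence, which is what makes a fixed score threshold a reasonable filter.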

cs.SD cs.AI cs.LG cs.MM eess.AS