Conformal Bayes under Label Shift: Post-Hoc Calibration vs. In-Training Adaptation

TL;DR

This paper compares two calibration strategies for conformal Bayes prediction under label shift, post-hoc calibration and in-training adaptation, and validates both through synthetic experiments.

stat.ML 🔴 Advanced 2026-06-10

Seungjin Choi


Bayesian methods, conformal prediction, label shift, calibration strategies, statistical guarantees

Key Findings

Methodology

The study builds on Bayesian linear regression under a label-shift assumption and proposes two calibration approaches. Post-hoc calibration adjusts the predictive threshold via importance-weighted quantiles without changing the parameter posterior. In-training adaptation tilts the parameter posterior itself so that it aligns with the target domain. Both strategies use importance-weighted quantiles of negative log predictive density (NLPD) scores to construct prediction sets with coverage guarantees. The algorithms compute the source posterior, derive the tilted predictive distributions, and apply importance weights from the exponential tilting model to calibrate the conformal prediction sets. Theoretical analysis confirms coverage validity, and synthetic experiments demonstrate the effectiveness of both strategies across different bias and shift intensities.
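
The importance-weighted quantile step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the unnormalized weight form w(y) = exp(βy), and the convention of placing the test point's weight as a point mass at +∞ (the usual weighted conformal construction) are assumptions, since the summary does not spell them out.

```python
import numpy as np


def exp_tilt_weights(y, beta):
    """Unnormalized importance weights w(y) = exp(beta * y).

    Assumed form of the exponential tilt; the paper's exact
    parameterization may differ.
    """
    return np.exp(beta * np.asarray(y, dtype=float))


def weighted_conformal_quantile(scores, weights, test_weight, alpha=0.1):
    """Smallest calibration score whose normalized cumulative weight
    reaches 1 - alpha.

    The test point contributes `test_weight` as a point mass at +inf,
    so the result is np.inf when the calibration mass alone cannot
    reach the 1 - alpha level.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    sorted_scores = scores[order]
    # Normalize by total mass, including the test point's weight.
    cum_mass = np.cumsum(weights[order]) / (weights.sum() + test_weight)
    # First index where the cumulative mass is >= 1 - alpha.
    idx = np.searchsorted(cum_mass, 1.0 - alpha, side="left")
    return sorted_scores[idx] if idx < len(sorted_scores) else np.inf
```

With equal weights this reduces to the ordinary split-conformal quantile. Because the weight depends on the label under label shift, `test_weight` would be evaluated at each candidate label y when forming the prediction set.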

Key Results

In synthetic Gaussian linear models with dimension d=5, both strategies maintained at least 90% coverage even at a high label-shift intensity of β=0.6, where uncalibrated models dropped below 84%. In biased training scenarios, in-training adaptation (posterior tilting) reduced parameter bias by approximately 40% and narrowed prediction intervals by about 16% at the same coverage level, outperforming post-hoc calibration. The experiments confirmed that importance-weighted quantiles are crucial for coverage guarantees, and that the geometric differences between the prediction sets reflect the two distinct correction mechanisms. Results also showed that the tilt parameter β, estimated via a two-step approach, effectively guides calibration in practice.
The experiments validated the theoretical guarantees: the importance-weighted quantile approach maintains coverage under label shift, while the two strategies differ in how they affect the shape and efficiency of the prediction sets. In-training adaptation was particularly effective at correcting parameter bias and improving predictive accuracy in biased training regimes. The results also highlight the importance of accurately estimating β and the potential for extending these methods to more complex models.
Overall, importance-weighted conformal calibration combined with predictive or parameter tilting can address label shift while keeping uncertainty quantification reliable and prediction sets efficient.

Significance

This research advances the theoretical understanding and practical application of Bayesian conformal prediction under label shift conditions. It provides a rigorous framework for maintaining statistical coverage guarantees while controlling the geometric shape of prediction sets, addressing a key challenge in distributional robustness. The dual strategies—post-hoc and in-training—offer flexible solutions adaptable to different scenarios, such as unbiased training or biased, lead-optimization settings. The ability to correct for label shift without retraining the entire model or requiring extensive labeled target data makes these methods highly relevant for real-world applications, including drug discovery, chemical property prediction, and financial risk management. By integrating importance-weighted quantiles with Bayesian inference, the work bridges a gap between statistical validity and geometric efficiency, paving the way for more reliable and interpretable predictive models in dynamic environments.

Technical Contribution

The paper's main technical contribution lies in formalizing two complementary calibration strategies within the conformal Bayes framework under label shift. The first, post-hoc calibration, adjusts the predictive distribution at the score level through importance-weighted quantiles of NLPD scores, leaving the parameter posterior unchanged. The second, in-training adaptation, directly tilts the parameter posterior using a likelihood ratio derived from the exponential tilting model, resulting in a corrected predictive distribution. Both methods exploit the conjugacy of Gaussian linear models to derive closed-form expressions for the tilted posterior and predictive distributions, enabling efficient computation and theoretical guarantees. The importance-weighted quantile approach provides finite-sample coverage guarantees, even under model misspecification. The framework unifies Bayesian inference, conformal prediction, and importance weighting, providing a methodology for distribution shift adaptation that is both statistically valid and geometrically interpretable.
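
To illustrate why Gaussian conjugacy gives closed forms, consider the predictive-level tilt for a single Gaussian predictive. This is a worked sketch under stated assumptions rather than the paper's derivation: the tilt form w(y) ∝ exp(βy) and the symbols μ and σ² for the source predictive mean and variance are assumptions, since the summary does not give the paper's parameterization.

```latex
\begin{align*}
p_S(y \mid x) &= \mathcal{N}(y;\, \mu, \sigma^2), \qquad w(y) \propto e^{\beta y}, \\
p_T(y \mid x) &\propto w(y)\, p_S(y \mid x)
  \propto \exp\!\Big(\beta y - \tfrac{(y-\mu)^2}{2\sigma^2}\Big)
  \propto \exp\!\Big(-\tfrac{\big(y-(\mu+\beta\sigma^2)\big)^2}{2\sigma^2}\Big), \\
p_T(y \mid x) &= \mathcal{N}\big(y;\, \mu + \beta\sigma^2,\, \sigma^2\big), \qquad
\mathbb{E}_{p_S}\big[e^{\beta Y}\big] = \exp\!\big(\beta\mu + \tfrac{1}{2}\beta^2\sigma^2\big).
\end{align*}
```

Under this assumption the tilt shifts the predictive mean by βσ² and leaves the variance unchanged. The first proportionality uses the label-shift condition that p(x|y) is the same in both domains.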

Novelty

This work is pioneering in systematically integrating conformal Bayes with importance-weighted calibration under label shift, explicitly distinguishing between predictive and parameter-level corrections. Unlike prior methods that rely solely on density ratio estimation or black-box calibration, this approach offers a transparent, theory-backed mechanism for coverage guarantees and geometric control. The derivation of closed-form tilted posteriors in Gaussian models and the explicit comparison of the two strategies' geometric effects represent significant innovations. The framework's flexibility allows adaptation to different bias regimes and model complexities, marking a substantial step forward in distributionally robust uncertainty quantification.

Limitations

The methods rely on the exponential tilting assumption for label shift, which may not hold in more complex or multimodal shift scenarios. Extending the framework to general shift types remains a challenge.
The estimation of the tilt parameter β is crucial; inaccuracies in β estimation can impair calibration quality, especially in real-world, unlabeled target domains.
The experiments are conducted on synthetic Gaussian models, and scalability to high-dimensional models such as deep neural networks has not yet been demonstrated. Computational costs and approximation errors may limit practical deployment.
The framework assumes the conditional distribution p(x|y) remains invariant, which may not be valid in covariate or concept shift scenarios, restricting its applicability.

Future Work

Future research will focus on extending the calibration framework to deep neural networks, incorporating scalable Bayesian inference techniques such as variational methods or Monte Carlo sampling. Developing robust, automatic methods for estimating the tilt parameter β in unlabeled target environments is also a priority. Additionally, exploring joint calibration strategies under multiple types of distributional shift, such as covariate, concept, and label shift, will enhance the robustness of predictive models. Applications to real-world datasets, including molecular property prediction and financial modeling, are planned to validate and refine the proposed methods. In the long term, integrating these calibration strategies into adaptive, online learning systems could enable models to self-correct continuously in dynamic environments.

AI Executive Summary

In the rapidly evolving landscape of machine learning, ensuring the reliability and robustness of predictive models remains a fundamental challenge, especially under distributional shifts such as label shift. Traditional Bayesian methods, while providing principled uncertainty quantification, often falter when the training and deployment environments differ, leading to inaccurate coverage and unreliable decision-making. This paper addresses this critical gap by proposing two innovative calibration strategies—post-hoc calibration and in-training adaptation—that leverage the strengths of Bayesian inference and conformal prediction to maintain statistical guarantees under label shift.

The core idea is to correct the predictive distribution in a way that preserves the coverage guarantee, regardless of how much the label distribution shifts. The first strategy, post-hoc calibration, involves adjusting the prediction threshold after training by importance-weighted quantiles of the negative log predictive density (NLPD). This approach keeps the parameter posterior intact but tilts the predictive distribution to match the target domain. The second strategy, in-training adaptation, directly modifies the parameter posterior using a likelihood ratio derived from the exponential tilting model, resulting in a corrected predictive distribution that better reflects the target environment.

Both methods are grounded in a Bayesian linear regression framework, which allows for explicit derivation of closed-form tilted posteriors and predictive distributions. The importance-weighted quantile method ensures finite-sample coverage guarantees, a critical property for high-stakes applications. Synthetic experiments demonstrate that, under unbiased training, both strategies achieve the desired 90% coverage, with the in-training approach effectively reducing parameter bias and prediction interval width in biased, lead-optimization scenarios.
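
The Bayesian linear regression ingredients mentioned above can be sketched in a few lines. This is a hedged illustration, not the paper's code: it assumes a known noise variance and an isotropic zero-mean Gaussian prior, and the function names (`blr_posterior`, `predictive`, `nlpd`, `nlpd_interval`) are hypothetical.

```python
import numpy as np


def blr_posterior(X, y, noise_var=1.0, prior_var=1.0):
    """Posterior N(mean, cov) over weights for y = X @ w + eps,
    with eps ~ N(0, noise_var) and prior w ~ N(0, prior_var * I)."""
    d = X.shape[1]
    precision = np.eye(d) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov


def predictive(X_new, mean, cov, noise_var=1.0):
    """Gaussian posterior predictive mean and variance for each row of X_new."""
    mu = X_new @ mean
    var = noise_var + np.einsum("ij,jk,ik->i", X_new, cov, X_new)
    return mu, var


def nlpd(y, mu, var):
    """Negative log predictive density of y under N(mu, var)."""
    return 0.5 * np.log(2 * np.pi * var) + 0.5 * (y - mu) ** 2 / var


def nlpd_interval(mu, var, threshold):
    """Interval {y : nlpd(y, mu, var) <= threshold}; NaN bounds if empty."""
    slack = 2.0 * var * (threshold - 0.5 * np.log(2 * np.pi * var))
    radius = np.sqrt(np.where(slack >= 0, slack, np.nan))
    return mu - radius, mu + radius
```

In a conformal Bayes pipeline, `nlpd` would score held-out calibration points, a (weighted) quantile of those scores would give the threshold, and `nlpd_interval` would turn that threshold into a prediction interval for a new input.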

The significance of this work lies in its ability to unify Bayesian inference, conformal prediction, and importance weighting into a coherent framework that addresses a long-standing challenge: how to adapt models reliably when the data distribution shifts. The proposed strategies are flexible, theoretically sound, and practically effective, offering a pathway toward more robust AI systems capable of maintaining performance in dynamic, real-world environments.

Looking ahead, future research will aim to extend these methods to complex deep learning models, develop scalable Bayesian inference techniques, and explore multi-shift scenarios. The ultimate goal is to create adaptive, distribution-robust predictive systems that can self-correct in real time, ensuring reliability and trustworthiness across diverse applications such as drug discovery, chemical property prediction, and financial risk management. Despite current limitations, this work marks a significant step forward in distributionally robust uncertainty quantification, promising a more dependable future for AI-driven decision-making.

Deep Dive

Abstract

Conformal Bayes combines Bayesian posterior predictives with conformal calibration to produce prediction sets that are both statistically valid and geometrically efficient. We study conformal Bayes under label shift from a unified perspective, identifying two complementary approaches that restore nominal target-domain coverage through importance-weighted conformal calibration but operate through independent mechanisms. Post-hoc calibration tilts the posterior predictive toward the target domain and corrects the conformal threshold via an importance-weighted quantile, leaving the parameter posterior unchanged. In-training adaptation tilts the parameter posterior itself to the target domain, producing a corrected predictive whose highest predictive density region serves as the highest predictive density (HPD) based prediction set under the fitted target predictive; efficiency is model-dependent and does not imply finite-sample conditional optimality. Two controlled experiments show that in an unbiased training regime both strategies achieve valid coverage equally, while in a lead-optimization regime in-training adaptation acts as a debiasing operator, reducing interval width at unchanged coverage.

stat.ML cs.LG
