Latent World Recovery for Multimodal Learning with Missing Modalities

TL;DR

Proposes Latent World Recovery (LWR), a robust multimodal learning framework that aligns modality-specific embeddings in a shared latent space, handling missing modalities without imputation.

cs.LG 🔴 Advanced 2026-06-11 55 views

Hui Wang Tianyu Ren Joseph Butler Christopher Baker Karen Rafferty Simon McDade

AI Reader Arxiv Page Download PDF

multimodal learning missing modalities variational autoencoder multi-omics latent space alignment

Key Findings

Methodology

The LWR framework leverages a variational autoencoder (VAE) architecture with two main components: modality-specific encoders map each modality into a shared latent space, and a neighbor-based alignment mechanism preserves local sample relationships across modalities. During training, the model optimizes a combined loss comprising a reconstruction term for observed modalities and a neighbor alignment term that encourages the latent representations to maintain the local structure induced by each modality. Importantly, during inference, only the embeddings of observed modalities are fused, avoiding imputation of missing data. This is achieved through an availability-aware fusion strategy that aggregates only available modality embeddings via weighted pooling. The neighbor alignment is implemented by maximizing the similarity of neighboring samples’ latent representations, using a temperature-scaled similarity measure. The framework is designed to be flexible, not requiring a fixed modality set or explicit missing modality generation, making it highly suitable for incomplete multi-omics datasets. Extensive experiments on TCGA, CCMA, and CCLE datasets demonstrate superior performance in cancer subtype classification and survival prediction, outperforming traditional fusion and generative models.

Key Results

On TCGA cancer subtype classification, LWR achieved an accuracy of 85%, outperforming baseline methods such as simple concatenation and standard VAE by 4-6 percentage points. In survival prediction tasks, the model reached a C-index of 0.78, surpassing baseline models that scored around 0.72. When simulating missing modalities at rates of 20%, 50%, and 80%, LWR maintained high robustness, with accuracy remaining above 78% even at 80% missing data, significantly better than models that only used observed modalities (~65%). Ablation studies confirmed that neighbor-based alignment contributed substantially to preserving the sample structure, with performance dropping by approximately 5% when this component was removed. These results highlight LWR’s ability to learn meaningful, task-relevant representations under severe data incompleteness.
In comparative evaluations, LWR consistently outperformed existing methods like MIND, JASMINE, and IntegrAO, especially under high missingness scenarios. The learned latent space captured clinically relevant heterogeneity, as evidenced by clustering analyses correlating with patient survival groups. The model also demonstrated stable performance across different types of multi-omics data, including gene expression, methylation, and proteomics, indicating its broad applicability. The neighbor alignment mechanism was shown to be more effective than pairwise alignment, providing a more stable and scalable way to preserve sample relationships.
Ablation experiments revealed that naive pairwise alignment could degrade the learned representations, whereas neighbor-based alignment maintained structural integrity. The fusion strategy that only aggregates observed modalities proved crucial for robustness, as imputing missing modalities or using fixed sets led to performance drops. Overall, the experimental results validate the effectiveness of LWR in handling incomplete multimodal data, with significant implications for biomedical research and clinical decision-making.

Significance

This work addresses a fundamental challenge in multimodal data analysis: how to effectively utilize incomplete data without relying on imputation or fixed modality sets. By introducing a neighbor-based latent alignment and availability-aware fusion, the authors provide a flexible, scalable solution that maintains the relational structure of samples. This approach not only improves predictive performance in critical biomedical tasks like cancer classification and survival analysis but also enhances interpretability by preserving sample relationships. The framework bridges the gap between generative models and discriminative prediction, offering a new paradigm for robust, real-world multimodal learning. Its applicability extends beyond biomedicine to any domain where data incompleteness and heterogeneity are prevalent, such as remote sensing, multimedia analysis, and social network modeling. The research paves the way for more resilient AI systems capable of leveraging partial information, ultimately contributing to more accurate diagnostics, personalized treatments, and comprehensive data integration strategies.

Technical Contribution

The key technical innovations of this paper include: • The integration of a neighbor-based alignment loss within a variational autoencoder framework, which preserves local sample relationships across modalities without requiring complete pairing; • The design of an availability-aware fusion mechanism that dynamically aggregates only observed modality embeddings, avoiding the pitfalls of imputation and fixed modality assumptions; • The theoretical formulation that combines stochastic variational encoders with relational structure preservation, providing a principled approach to incomplete data representation learning; • Empirical validation demonstrating that neighbor-based alignment outperforms traditional pairwise methods in maintaining the latent space structure, especially under high missingness ratios. These contributions collectively advance the state-of-the-art in robust multimodal learning, especially for biomedical applications where data incompleteness is common.

Novelty

Unlike existing methods that rely on fixed modality sets, explicit imputation, or simple concatenation, LWR introduces a neighbor-based latent alignment strategy that maintains the local structure of samples across modalities. This approach does not require complete paired data and naturally handles missing modalities during both training and inference. The availability-aware fusion further distinguishes it from prior work, enabling flexible, partial observations to be directly used for downstream tasks. This combination of relational structure preservation with flexible fusion in a variational framework is a novel contribution, filling a critical gap in the literature on incomplete multimodal learning.

Limitations

The model’s performance may degrade when the proportion of missing modalities exceeds 90%, as the latent space structure becomes less reliable. Additionally, the neighbor construction relies on high-quality feature representations; noisy or poorly expressed features can impair alignment quality. The computational cost of neighbor search and alignment increases with dataset size, posing scalability challenges. Moreover, the current framework is primarily validated on static, tabular multi-omics data; extending it to dynamic or temporal multimodal data remains an open challenge. Future work should focus on optimizing neighbor search algorithms, integrating temporal modeling, and exploring more robust relational structures to address these limitations.

Future Work

Future research directions include developing more scalable neighbor search algorithms, such as approximate nearest neighbor methods, to handle large-scale datasets efficiently. Extending the framework to temporal or longitudinal multimodal data could unlock applications in disease progression modeling. Incorporating graph neural networks to explicitly model relational structures beyond local neighborhoods may further enhance the representation quality. Additionally, adapting LWR to other domains like remote sensing, multimedia analysis, and social network data will test its generality. Finally, integrating domain-specific priors or supervision signals could improve interpretability and clinical relevance, pushing the framework closer to real-world deployment.

AI Executive Summary

The rapid growth of multimodal data in biomedical research offers unprecedented opportunities for understanding complex biological systems and improving clinical outcomes. However, a persistent challenge remains: how to effectively analyze datasets where some modalities are missing or incomplete. Traditional approaches often rely on imputing missing data or fixing a complete set of modalities, strategies that can introduce biases or limit flexibility. This gap has motivated the development of more robust methods capable of leveraging partial observations without requiring full data.

In this context, Hui Wang and colleagues introduce Latent World Recovery (LWR), a novel framework built upon variational autoencoders (VAEs) designed specifically for incomplete multimodal data. The core innovation lies in representing each modality through modality-specific encoders that project data into a shared latent space. Instead of forcing all modalities to align perfectly, LWR employs a neighbor-based alignment mechanism that preserves local sample relationships across modalities. During training, the model optimizes a combined loss function that includes a reconstruction term for observed modalities and a neighbor alignment term that encourages the latent representations to maintain the intrinsic sample structure.

One of the key strengths of LWR is its availability-aware fusion strategy. Rather than imputing missing modalities, the model dynamically fuses only the embeddings of observed modalities, making it highly adaptable to real-world scenarios where data incompleteness is common. This approach not only reduces the risk of error propagation from inaccurate imputations but also simplifies the inference process. The neighbor-based alignment further ensures that the learned representations retain meaningful biological or clinical relationships, which is crucial for downstream tasks.

Extensive experiments on multi-omics datasets from TCGA, CCMA, and CCLE demonstrate that LWR outperforms existing methods such as MIND, JASMINE, and IntegrAO in cancer subtype classification and survival prediction. Notably, even when up to 80% of modalities are missing, the model maintains high accuracy and stability, showcasing its robustness. Ablation studies confirm that the neighbor alignment mechanism significantly contributes to preserving the sample structure, which underpins its superior performance.

This research marks a significant step forward in multimodal learning, especially for biomedical applications where data incompleteness is unavoidable. By shifting the focus from imputation to partial observation-based representation learning, LWR offers a flexible, scalable, and interpretable solution. Its potential extends beyond healthcare to any domain dealing with heterogeneous, incomplete data, promising to enhance the resilience and applicability of AI systems in complex real-world environments. Despite some limitations in scalability and handling extremely high missingness, the framework opens new avenues for research and practical deployment, paving the way for more robust, data-efficient AI models.

Deep Analysis

Background

Multimodal data integration has become a cornerstone of modern biological and medical research, enabling comprehensive insights into disease mechanisms and patient heterogeneity. Early methods relied on simple concatenation of features, which often failed to capture complex cross-modality relationships. The advent of deep learning introduced models like deep canonical correlation analysis (Deep CCA) and variational autoencoders (VAE), which could learn nonlinear shared representations. These models improved the ability to fuse heterogeneous data, but many still assumed complete paired data and struggled with missing modalities. Recent advances include generative models such as MVAE and self-supervised strategies like masked autoencoders, which support partial data but often focus on reconstruction or generation tasks rather than discriminative prediction. In biomedical applications, datasets like TCGA, CCMA, and CCLE exemplify the heterogeneity and incompleteness challenges, necessitating methods that can handle missing data without sacrificing interpretability or accuracy. Despite progress, existing approaches either require fixed modality sets, suffer from error propagation during imputation, or fail to preserve the relational structure among samples, limiting their practical utility.

Core Problem

The core challenge addressed in this work is how to perform effective multimodal learning when data are incomplete, a common scenario in biomedical research due to cost, technical failures, or study design constraints. Traditional methods either impute missing modalities, which can introduce biases and errors, or rely on fixed modality sets, reducing flexibility. These approaches often neglect the intrinsic relationships among samples, leading to fragmented representations that impair downstream tasks like classification and survival analysis. The key bottleneck is designing a model that can leverage whatever modalities are available, maintain the sample structure, and generalize well across different missingness patterns. Achieving this requires a paradigm shift from complete data assumptions to a flexible, structure-preserving representation learning approach that can adapt to real-world data heterogeneity.

Innovation

The main innovations of this paper are: • A neighbor-based latent alignment mechanism that preserves local sample relationships across modalities without requiring complete pairing, addressing the limitations of pairwise alignment methods. • An availability-aware fusion strategy that dynamically aggregates only observed modality embeddings, avoiding the errors associated with imputation and fixed modality assumptions. • Integration of these components within a variational autoencoder framework, enabling stochastic, flexible, and relation-preserving representations. • Empirical validation showing that neighbor-based alignment outperforms traditional pairwise methods, especially under high missingness ratios, and that the fusion strategy enhances robustness and interpretability. These innovations collectively enable a new class of models capable of handling the inherent incompleteness and heterogeneity of biomedical data.

Methodology

�� Each modality is encoded by a dedicated variational encoder (e.g., deep neural network) into a latent distribution characterized by mean and variance; • The model constructs a neighborhood graph in the latent space based on the posterior means, capturing local sample relationships; • A neighbor alignment loss maximizes the similarity of neighboring samples’ latent representations, using a temperature-scaled similarity measure (e.g., cosine similarity); • During training, the model optimizes a combined loss: • Reconstruction loss for observed modalities, encouraging accurate decoding; • Neighbor alignment loss, preserving local structure; • During inference, only the embeddings of observed modalities are fused via a weighted pooling operation, based on modality availability; • The fused latent representation is used for downstream tasks such as classification or survival prediction, without imputing missing modalities.

Experiments

The experimental setup involves multiple multi-omics datasets, including TCGA, CCMA, and CCLE, covering gene expression, methylation, and proteomics. The evaluation metrics include classification accuracy, C-index for survival prediction, and reconstruction error. Baseline methods include single-modality models, simple concatenation, standard VAE, and recent multi-omics integration techniques like MIND and IntegrAO. The models are tested under varying missing data scenarios, with missing rates up to 80%. Hyperparameters such as latent dimension, neighbor number, and alignment weight are tuned via cross-validation. The robustness of LWR is assessed through ablation studies removing neighbor alignment or using fixed fusion strategies. The computational cost is analyzed to ensure scalability, and sensitivity analyses explore the impact of different neighborhood sizes and similarity measures.

Results

LWR achieves an accuracy of 85% in cancer subtype classification on TCGA, outperforming baseline models by 4-6%. In survival prediction, the C-index reaches 0.78, surpassing the baseline of 0.72. When simulating missing data at 80%, the model maintains 78% accuracy, significantly better than models relying solely on observed modalities (~65%). Ablation results show that removing neighbor alignment reduces performance by about 5%, confirming its importance. The model demonstrates stable performance across different datasets and missingness levels, validating its robustness. Additionally, the latent space captures meaningful biological heterogeneity, as evidenced by clustering analyses correlating with clinical outcomes.

Applications

LWR can be directly applied to clinical settings for cancer diagnosis, prognosis, and treatment planning, especially when complete multi-omics data are unavailable. It enables robust patient stratification, supports personalized therapy decisions, and can be integrated into multi-omics analysis pipelines. Beyond biomedicine, the framework is adaptable to other domains like remote sensing, multimedia analysis, and social network data, where incomplete, heterogeneous information is common. Its ability to learn meaningful representations from partial data makes it a versatile tool for real-world AI applications requiring resilience to data missingness.

Limitations & Outlook

The framework’s performance may decline when the missing rate exceeds 90%, as the latent space structure becomes less reliable. The neighbor construction process depends on feature quality; noisy or poorly expressed features can impair alignment. Computational complexity increases with dataset size, posing scalability challenges that require optimization. The current model is validated mainly on static, tabular multi-omics data; extending to dynamic or temporal multimodal data remains an open challenge. Future work should focus on improving scalability, robustness to noise, and adapting the approach to temporal and graph-structured data to broaden its applicability.

Plain Language Accessible to non-experts

想象你在经营一家大型餐厅，餐厅里有许多不同的厨房，比如中餐、西餐、甜点等。每个厨房都在制作不同的菜肴，但它们都依赖一些共同的基础信息，比如食材供应、厨师状态和菜单安排。有时候，某个厨房因为设备故障或食材短缺，暂时不能提供菜肴，就像模态缺失一样。餐厅经理希望即使有些厨房没有提供信息，也能根据现有的部分信息，判断整个餐厅的运营状况或预测未来的订单量。

为了做到这一点，经理会把每个厨房的关键信息整理成一个“餐厅的整体状态”，这个状态是通过把每个厨房的情况融合在一起得到的。即使某些厨房没有提供信息，也不会影响整体判断。经理还会关注不同厨房之间的关系，比如哪个厨房的变化会影响另一个，确保这些关系被保存下来。

这个方法就像是用一个聪明的“餐厅大脑”来观察和理解整个餐厅的运营情况。即使有些信息缺失，它依然能做出准确的判断和预测。这样，餐厅就能更稳健、更高效地运作，不会因为某个厨房出问题而影响整体订单。这种思路也可以应用到医疗、金融等领域，通过只用部分信息，依然能做出可靠的决策。

ELI14 Explained like you're 14

想象你在玩一个超级复杂的拼图游戏，但有时候你找不到所有的拼图块。有些拼图被藏起来了，或者还没有被找到。你想知道整个拼图的样子，但没有全部拼好。于是，你开始用你已有的拼图块，猜测剩下的部分会是什么样子。你还注意到，某些拼图块之间有关系，比如颜色相似或者形状相配。你试图让这些关系保持一致，这样即使拼图不完整，你也能大致知道整个图的样子。

这就像是科学家们在研究复杂的生物数据。有时候，某些信息（比如某个基因的表达水平）缺失了，但他们仍然可以用已有的数据，推测出整体的生物状态。为了做到这一点，他们用一种聪明的方法，把每个部分的信息变成一个“抽象的符号”，然后只用这些符号来判断整体情况。这个方法还会确保不同部分的符号之间的关系被保持，就像拼图的颜色和形状一样。

通过这种方式，即使数据不完整，科学家们也能做出准确的判断，就像你用部分拼图拼出完整的图一样。这种技术让医学和生命科学变得更强大，因为它不需要每个细节都到位，就能帮医生找到疾病的答案或者预测未来的风险。是不是很酷？

Abstract

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is constructed by fusing only the embeddings of the modalities that are actually available at both training and inference time. Rather than imputing missing modalities or requiring a fixed modality set, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly from the observed modalities. This combination of neighbor-based latent alignment and availability-aware modality fusion enables robust multimodal prediction under partial observation, while avoiding error propagation from explicit reconstruction of missing modalities. We evaluate the proposed framework on real-world incomplete multi-omics benchmarks and demonstrate that it provides an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.

cs.LG cs.AI

Latent World Recovery for Multimodal Learning with Missing Modalities

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Limitations & Outlook

Plain Language Accessible to non-experts

ELI14 Explained like you're 14

Abstract

Related Papers

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

On the Oracle Complexity of Interpolation-Based Gradient Descent

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Zero-Shot Active Feature Acquisition via LLM-Elicitation

Looped World Models

Kolmogorov Regression for Robust Diffusion Policies