TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TL;DR

TEVI leverages sparse autoencoders with text conditioning to refine image embeddings, significantly improving vision-language alignment and retrieval accuracy.

cs.CV Advanced 2026-06-06

Sweta Mahajan Sukrut Rao Jiahao Xie Alexander Koller Bernt Schiele


multimodal learning vision-language alignment sparse autoencoders image retrieval model fine-tuning

Key Findings

Methodology

This paper introduces TEVI, a framework that combines sparse autoencoders (SAEs) with a text-conditioned masking mechanism to bring image embeddings into closer alignment with textual descriptions. First, the authors use an SAE to decompose CLIP image embeddings into disentangled, interpretable latent concepts. A small MLP is then trained to map a text embedding to a mask over these latent concepts, so that specific features are retained or suppressed depending on the input caption. The mask is trained by minimizing the InfoNCE loss, which pulls the conditioned image embedding toward its target caption and pushes it away from the other captions.

Controlled experiments on the synthetic MAD dataset show that TEVI can precisely manipulate targeted attributes such as 'swelling' or 'fracture.' TEVI is then applied to natural-image datasets (MS COCO, Flickr, IIW, and DOCCI), where it improves the cross-modal retrieval metrics R@1, R@5, and R@10. The approach also improves robustness to textual perturbations.
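The encode, mask, and decode pipeline described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the ReLU encoder, the two-layer MLP, the sigmoid mask, and all names and toy weights (`sae`, `mlp`, `edit_embedding`) are assumptions made for the example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def sae_encode(x, sae):
    # Assumed ReLU encoder: z = ReLU(W_enc x + b_enc)
    return relu(add(matvec(sae["W_enc"], x), sae["b_enc"]))

def sae_decode(z, sae):
    # Linear decoder: x_hat = W_dec z + b_dec
    return add(matvec(sae["W_dec"], z), sae["b_dec"])

def text_mask(t, mlp):
    # Assumed two-layer MLP with a sigmoid output: one value in (0, 1) per latent
    h = relu(add(matvec(mlp["W1"], t), mlp["b1"]))
    return sigmoid(add(matvec(mlp["W2"], h), mlp["b2"]))

def edit_embedding(x, t, sae, mlp):
    """Encode the image embedding, gate its latents with the text mask, decode."""
    z = sae_encode(x, sae)
    m = text_mask(t, mlp)
    return sae_decode([mi * zi for mi, zi in zip(m, z)], sae)

# Toy weights: 2-d embeddings and 3 latents (all values are illustrative).
sae = {
    "W_enc": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "b_enc": [0.0, 0.0, 0.0],
    "W_dec": [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]],
    "b_dec": [0.0, 0.0],
}
mlp = {
    "W1": [[1.0, 0.0], [0.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[4.0, -4.0], [-4.0, 4.0], [0.0, 0.0]],
    "b2": [0.0, 0.0, 0.0],
}

image_embedding = [0.5, 0.4]
caption_embedding = [1.0, 0.0]
edited = edit_embedding(image_embedding, caption_embedding, sae, mlp)
print(edited)
```

With these toy weights the caption keeps most of the first latent, suppresses most of the second, and halves the third, so the edited embedding is a gated version of the plain SAE reconstruction.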

Key Results

On the synthetic MAD dataset, TEVI accurately identified and manipulated specific attributes. For example, the attribute 'swelling' was disentangled with an AUC close to 1, and masking out the corresponding latent reduced the attribute detection accuracy from 85% to near chance levels (~10%).
In natural image retrieval tasks, TEVI improved MS COCO image-to-text R@1 from 32.98% to 35.66% (+2.68 points), Flickr R@1 from 42.46% to 44.75% (+2.29 points), and R@1 on the long-caption benchmark DOCCI from 20.38% to 24.20% (+3.82 points). Similar improvements were observed across other datasets, with the largest gains on benchmarks with richer captions.
Training with negative caption conditioning (Eq. 11) further increased the stability and robustness of cross-modal retrieval, as evidenced by higher pairwise cosine similarities after conditioning, demonstrating better semantic alignment.
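For readers unfamiliar with the R@K metrics quoted above, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. The function name `recall_at_k` and the toy similarity values are illustrative assumptions; the paper's exact evaluation protocol is not described in this summary.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with at least one correct candidate in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    gt[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(j in gt[i] for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)

# Toy example: 3 image queries, 3 candidate captions, one correct caption each.
sim = [
    [0.9, 0.1, 0.3],  # correct caption ranked 1st
    [0.2, 0.4, 0.8],  # correct caption ranked 2nd
    [0.5, 0.6, 0.1],  # correct caption ranked 3rd
]
gt = [{0}, {1}, {2}]

for k in (1, 2, 3):
    print(k, recall_at_k(sim, gt, k))
```

Using a set for `gt[i]` lets the same function handle datasets where an image has several valid captions.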

Significance

This work addresses a fundamental challenge in multimodal AI: the modality gap caused by information imbalance between images and text. By introducing a post-hoc, text-guided content editing mechanism, TEVI offers a novel solution that enhances the interpretability, controllability, and robustness of vision-language models. It provides a pathway to more precise content retrieval, content editing, and understanding, which are critical for applications like personalized content filtering, semantic editing, and assistive AI systems. The approach also complements existing alignment techniques, opening new avenues for research in multimodal representation learning.

Technical Contribution

The key technical innovation lies in integrating sparse autoencoders with a text-conditioned masking module. Unlike end-to-end training, TEVI operates on pre-trained CLIP embeddings, decomposing them into disentangled concepts via SAEs. The mask, learned through a small MLP, selectively filters these concepts based on textual input, enabling targeted content editing. The training employs the InfoNCE loss with both positive and negative caption conditioning, enhancing semantic alignment and robustness. This modular, post-hoc approach allows flexible content manipulation without retraining the entire model, offering a practical and scalable solution for improving multimodal alignment.
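The InfoNCE objective mentioned above can be sketched as follows. This is the generic image-to-text form over a batch of matched pairs, not the paper's exact formulation (the negative-caption variant it refers to as Eq. 11 is not reproduced in this summary). The temperature of 0.07 and the names `info_nce`, `normalize`, and `dot` are assumptions made for the example.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(image_embs, text_embs, temperature=0.07):
    """Mean image-to-text InfoNCE loss; image_embs[i] is paired with text_embs[i]."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, t) / temperature for t in txts]
        # Numerically stable log-sum-exp over all captions in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the matching caption.
        total += log_denominator - logits[i]
    return total / len(imgs)

# Toy batch: correctly paired embeddings give a much lower loss than swapped ones.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is lowest when each conditioned image embedding is most similar to its own caption, which is why it is minimized during training.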

Novelty

TEVI's novelty is in its post-hoc, content-editing paradigm that leverages the disentangled representations from sparse autoencoders, conditioned on text embeddings. Unlike prior methods such as SmartCLIP or FLAIR, which require training from scratch or end-to-end fine-tuning, TEVI applies a lightweight masking module on frozen embeddings, enabling dynamic, interpretable content control. This approach introduces a new level of explainability and flexibility in multimodal content editing, setting it apart from existing alignment or fine-tuning techniques.

Limitations

The effectiveness of TEVI depends heavily on the quality of the disentangled concepts produced by the SAE; in complex, real-world scenarios, the latent representations may not perfectly align with human-interpretable concepts, limiting editing precision.
The current framework assumes a fixed set of concepts or relies on predefined text embeddings, which may not capture the full diversity of real-world content, especially for open vocabulary tasks.
While the method improves robustness against textual perturbations, extreme or ambiguous descriptions still pose challenges, and the computational overhead of autoencoder decomposition and masking may limit real-time applications.

Future Work

Future research could focus on jointly training the autoencoder and the masking module end to end to improve concept disentanglement. Expanding the latent concept space to cover more complex, nuanced content could improve editing granularity. Integrating TEVI with models beyond CLIP, such as large multimodal models (LMMs), could broaden its applicability. Additionally, exploring unsupervised or weakly supervised approaches to learning concept representations without predefined labels could make the framework more adaptable to diverse datasets. Finally, real-time content editing and interactive applications are promising directions for practical deployment.

AI Executive Summary

The rapid evolution of multimodal AI has led to powerful vision-language models like CLIP, which embed images and text into a shared semantic space. These models have revolutionized tasks such as zero-shot classification and cross-modal retrieval, enabling machines to understand and connect visual content with natural language descriptions. However, despite their success, a persistent challenge remains: the modality gap caused by the inherent information imbalance between images and their textual descriptions. Images often contain richer, more detailed information than captions, which leads to poor alignment and limits the models' performance in downstream tasks.

Addressing this fundamental issue, Sweta Mahajan and colleagues introduce TEVI, a novel framework that employs sparse autoencoders (SAEs) combined with a text-conditioned masking mechanism to refine image embeddings. Unlike traditional approaches that rely on retraining entire models or fine-tuning on large datasets, TEVI operates post-hoc, building on pre-trained CLIP models. The core idea is to decompose image embeddings into interpretable latent concepts using SAEs, which can be selectively filtered based on textual input. This enables the model to retain only the content relevant to the caption, effectively editing the visual representation to improve alignment.

The methodology involves training a sparse autoencoder to disentangle image features into concept-specific latents. A small multi-layer perceptron (MLP) then learns to map text embeddings to a mask over these latents, controlling which concepts are preserved. During inference, applying this mask results in a text-conditioned, content-edited image embedding. The authors validate their approach through controlled experiments on synthetic datasets, demonstrating precise manipulation of attributes like 'swelling' or 'fracture.' These experiments confirm that TEVI can effectively identify and isolate specific concepts, with high attribute-specific disentanglement scores.
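A toy calculation shows why discarding content the caption does not mention can raise image-text similarity. Everything here is hypothetical: the three concept axes, the latent values, the hard binary mask, and the assumption that the decoder is the identity (so latents can be compared with the caption embedding directly) are chosen only for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical concept axes: 0 = shared content, 1 = 'swelling', 2 = 'fracture'.
image_latents = [1.0, 0.6, 0.8]      # the image shows both attributes
caption_embedding = [1.0, 1.0, 0.0]  # the caption describes only 'swelling'
mask = [1.0, 1.0, 0.0]               # keep only caption-described concepts

edited = [m * z for m, z in zip(mask, image_latents)]

before = cosine(image_latents, caption_embedding)
after = cosine(edited, caption_embedding)
print(round(before, 3), round(after, 3))
```

In this toy setup, removing the 'fracture' latent raises the cosine similarity from 0.80 to about 0.97, because the image embedding no longer carries information the caption leaves out.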

Extending beyond synthetic data, TEVI was applied to real-world datasets such as MS COCO, Flickr, IIW, and DOCCI. The results show consistent improvements in retrieval metrics: for example, MS COCO image-to-text R@1 increased from 32.98% to 35.66%, and long-caption benchmarks like DOCCI saw R@1 rise from 20.38% to 24.20%. Notably, the gains were more pronounced with richer, longer descriptions, indicating that detailed captions provide stronger signals for content editing. The framework also enhanced robustness against textual perturbations, making it more reliable in practical scenarios.
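The observation that gains are larger on longer captions can be checked with simple arithmetic on the R@1 values quoted above. The short script below computes absolute gains (percentage points) and relative gains; the helper name `gains` is illustrative.

```python
# R@1 values in percent as quoted in the text: (baseline, with TEVI).
r_at_1 = {
    "MS COCO": (32.98, 35.66),
    "DOCCI": (20.38, 24.20),
}

def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 2)

summary = {name: gains(*pair) for name, pair in r_at_1.items()}
for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

This gives +2.68 points (about 8.1% relative) on MS COCO and +3.82 points (about 18.7% relative) on DOCCI, so the long-caption benchmark improves more in both absolute and relative terms.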

Furthermore, the authors incorporated negative caption conditioning during training, which further stabilized the content editing process and improved the semantic consistency of the edited embeddings. This comprehensive evaluation underscores TEVI’s potential to significantly advance multimodal understanding, content retrieval, and content editing. Its modular, interpretable design offers a scalable and flexible pathway to improve existing vision-language models, bridging the gap between visual richness and textual abstraction.

Looking ahead, future work aims to expand the concept space, enable end-to-end training, and extend TEVI’s capabilities to video and 3D content. The ultimate goal is to develop more controllable, explainable, and robust multimodal AI systems that can seamlessly integrate visual and linguistic information, fostering new applications in content creation, personalized AI assistants, and beyond. TEVI’s innovative approach marks a meaningful step toward more intelligent, adaptable, and human-aligned multimodal models, promising a future where machines can not only understand but also precisely manipulate visual content guided by natural language.

Deep Dive

Abstract

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

cs.CV cs.AI cs.CL cs.LG

References (20)

SmartCLIP: Modular Vision-language Alignment with Identification Guarantees. Shaoan Xie, Lingjing Kong, Yujia Zheng et al. 2025.

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes et al. 2015.

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP. Sedigheh Eslami, Gerard de Melo. 2024.

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models. Simon Schrodi, David T. Hoffmann, Max Argus et al. 2024.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov et al. 2020.

Sigmoid Loss for Language Image Pre-Training. Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov et al. 2023.

Microsoft COCO: Common Objects in Context. Tsung-Yi Lin, M. Maire, Serge J. Belongie et al. 2014.

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. Soravit Changpinyo, P. Sharma, Nan Ding et al. 2021.

Decoupled Weight Decay Regularization. I. Loshchilov, F. Hutter. 2017.

DOCCI: Descriptions of Connected and Contrasting Images. Yasumasa Onoe, Sunayana Rane, Zachary Berger et al. 2024.

Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy et al. 2021.

RoCOCO: Robustness Benchmark of MS-COCO to Stress-Test Image-Text Matching Models. Seulki Park, Daeho Um, Hajung Yoon et al. 2023.

Interpreting CLIP with Hierarchical Sparse Autoencoders. Vladimir Zaigrajew, Hubert Baniecki, P. Biecek. 2025.

Applying sparse autoencoders to unlearn knowledge in language models. Eoin Farrell, Yeu-Tong Lau, Arthur Conmy. 2024.

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders. Bartosz Cywiński, Kamil Deja. 2025.

Representation Learning with Contrastive Predictive Coding. Aäron van den Oord, Yazhe Li, O. Vinyals. 2018.

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. Piyush Sharma, Nan Ding, Sebastian Goodman et al. 2018.

Improving Dictionary Learning with Gated Sparse Autoencoders. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith et al. 2024.

FG-CLIP: Fine-Grained Visual and Textual Alignment. Chunyu Xie, Bin Wang, Fanjing Kong et al. 2025.

SLIP: Self-supervision meets Language-Image Pre-training. Norman Mu, Alexander Kirillov, David A. Wagner et al. 2021.
