TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment
TEVI leverages sparse autoencoders with text conditioning to refine image embeddings, significantly improving vision-language alignment and retrieval accuracy.
Key Findings
Methodology
This paper introduces TEVI, a framework that combines sparse autoencoders (SAEs) with a text-conditioned masking mechanism to optimize image embeddings for better alignment with textual descriptions. The authors first use SAEs to decompose CLIP image embeddings into disentangled, interpretable latent concepts. A small MLP is then trained to map text embeddings to a mask over these latent concepts, so that specific features are retained or suppressed depending on the input caption. The mask is trained by minimizing the InfoNCE loss, which pulls the conditioned image embedding closer to the target caption while pushing it away from the other captions. Controlled experiments on the synthetic MAD dataset demonstrate that TEVI can precisely manipulate targeted attributes such as 'swelling' or 'fracture.' TEVI is then applied to real-world datasets, including MS COCO, Flickr, IIW, and DOCCI, where it improves cross-modal retrieval metrics such as R@1, R@5, and R@10. The approach also improves robustness to textual perturbations, indicating its practical viability for real-world applications.
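The forward pass described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the ReLU encoder, the sigmoid soft mask, and all parameter names (W_enc, W_dec, W1, W2) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only; they are not taken from the paper.
d_embed, d_latent, d_hidden = 8, 32, 16

# Hypothetical SAE parameters (random here; a real SAE would be trained first).
W_enc = rng.normal(scale=0.1, size=(d_embed, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_embed))
b_dec = np.zeros(d_embed)

# Hypothetical mask MLP parameters: text embedding -> one gate per latent.
W1 = rng.normal(scale=0.1, size=(d_embed, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=(d_hidden, d_latent))
b2 = np.zeros(d_latent)


def sae_encode(image_emb):
    """Map an image embedding to non-negative latent activations (ReLU encoder)."""
    return np.maximum(image_emb @ W_enc + b_enc, 0.0)


def sae_decode(latents):
    """Reconstruct an embedding from latent activations."""
    return latents @ W_dec + b_dec


def text_mask(text_emb):
    """Map a text embedding to a soft mask in (0, 1) over the SAE latents."""
    hidden = np.maximum(text_emb @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))


def edit_image_embedding(image_emb, text_emb):
    """Gate the SAE latents with the caption-derived mask, then decode."""
    latents = sae_encode(image_emb)
    return sae_decode(text_mask(text_emb) * latents)


image_emb = rng.normal(size=d_embed)  # stand-in for a CLIP image embedding
text_emb = rng.normal(size=d_embed)   # stand-in for a CLIP text embedding
edited = edit_image_embedding(image_emb, text_emb)
```

In this sketch a mask value near 1 keeps a latent concept and a value near 0 suppresses it; whether the paper uses a soft or a binary mask is not stated in this summary.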
Key Results
- On the synthetic MAD dataset, TEVI accurately identified and manipulated specific attributes. For example, the attribute 'swelling' was disentangled with an AUC close to 1, and masking out the corresponding latent reduced the attribute detection accuracy from 85% to near chance levels (~10%).
- In natural image retrieval tasks, TEVI improved MS COCO image-to-text R@1 from 32.98% to 35.66%, Flickr R@1 from 42.46% to 44.75%, and long-caption benchmarks like DOCCI from 20.38% to 24.20%. Similar improvements were observed across other datasets, especially with richer captions, confirming the method's effectiveness.
- Training with negative caption conditioning (Eq. 11) further increased the stability and robustness of cross-modal retrieval, as evidenced by higher pairwise cosine similarities after conditioning, demonstrating better semantic alignment.
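For readers unfamiliar with the R@K metrics quoted above, the following is a minimal sketch of how Recall@K can be computed from a query-by-candidate similarity matrix. It assumes one correct candidate per query, located at the same index; benchmarks such as MS COCO pair each image with several captions, so the actual evaluation protocol may differ. The function name recall_at_k and the toy matrix are illustrative.

```python
import numpy as np


def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate ranks within the top k.

    similarity[i, j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    n = similarity.shape[0]
    correct = similarity[np.arange(n), np.arange(n)]
    # Zero-based rank = number of candidates scoring strictly higher.
    ranks = (similarity > correct[:, None]).sum(axis=1)
    return float((ranks < k).mean())


# Toy similarity matrix: 3 queries x 3 candidates.
sim = np.array([
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.8, 0.5, 0.1],  # correct candidate ranked 2nd
    [0.7, 0.6, 0.3],  # correct candidate ranked 3rd
])

r1 = recall_at_k(sim, 1)  # one of three queries succeeds
r2 = recall_at_k(sim, 2)  # two of three queries succeed
r3 = recall_at_k(sim, 3)  # all queries succeed
```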
Significance
This work addresses a fundamental challenge in multimodal AI: the modality gap caused by information imbalance between images and text. By introducing a post-hoc, text-guided content editing mechanism, TEVI offers a novel solution that enhances the interpretability, controllability, and robustness of vision-language models. It provides a pathway to more precise content retrieval, content editing, and understanding, which are critical for applications like personalized content filtering, semantic editing, and assistive AI systems. The approach also complements existing alignment techniques, opening new avenues for research in multimodal representation learning.
Technical Contribution
The key technical innovation lies in integrating sparse autoencoders with a text-conditioned masking module. Unlike end-to-end training, TEVI operates on pre-trained CLIP embeddings, decomposing them into disentangled concepts via SAEs. The mask, learned through a small MLP, selectively filters these concepts based on textual input, enabling targeted content editing. The training employs the InfoNCE loss with both positive and negative caption conditioning, enhancing semantic alignment and robustness. This modular, post-hoc approach allows flexible content manipulation without retraining the entire model, offering a practical and scalable solution for improving multimodal alignment.
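As a rough illustration of the training objective, the sketch below computes a standard InfoNCE loss from an image-by-caption similarity matrix whose diagonal holds the matching pairs. How the paper builds this matrix from caption-conditioned image embeddings, and how its negative caption conditioning term enters the loss, is not specified in this summary, so the function info_nce, the temperature of 0.07, and the toy inputs are all assumptions.

```python
import numpy as np


def info_nce(similarity, temperature=0.07):
    """Mean cross-entropy of selecting the matching caption for each image.

    similarity[i, j] is the similarity between (conditioned) image i and
    caption j; the matching caption for image i is assumed to be caption i.
    The temperature value is an assumption, not taken from the paper.
    """
    logits = similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


aligned = np.eye(4)                         # each image matches its own caption
misaligned = np.roll(np.eye(4), 1, axis=1)  # each image matches a wrong caption

loss_aligned = info_nce(aligned)
loss_misaligned = info_nce(misaligned)
```

Minimizing this loss raises the diagonal similarities relative to the off-diagonal ones, which is the "pull the target caption closer, push the others away" behavior described above; the aligned toy matrix therefore yields a much lower loss than the misaligned one.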
Novelty
TEVI's novelty is in its post-hoc, content-editing paradigm that leverages the disentangled representations from sparse autoencoders, conditioned on text embeddings. Unlike prior methods such as SmartCLIP or FLAIR, which require training from scratch or end-to-end fine-tuning, TEVI applies a lightweight masking module on frozen embeddings, enabling dynamic, interpretable content control. This approach introduces a new level of explainability and flexibility in multimodal content editing, setting it apart from existing alignment or fine-tuning techniques.
Limitations
- The effectiveness of TEVI depends heavily on the quality of the disentangled concepts produced by the SAE; in complex, real-world scenarios, the latent representations may not perfectly align with human-interpretable concepts, limiting editing precision.
- The current framework assumes a fixed set of concepts or relies on predefined text embeddings, which may not capture the full diversity of real-world content, especially for open vocabulary tasks.
- While the method improves robustness against textual perturbations, extreme or ambiguous descriptions still pose challenges, and the computational overhead of autoencoder decomposition and masking may limit real-time applications.
Future Work
Future research could focus on jointly training the autoencoder and the masking module in an end-to-end manner to enhance concept disentanglement. Expanding the latent concept space to cover more complex, nuanced content will improve editing granularity. Integrating TEVI with large-scale multimodal models beyond CLIP, such as LMMs, could broaden its applicability. Additionally, exploring unsupervised or weakly supervised approaches to learn concept representations without predefined labels will make the framework more adaptable to diverse datasets. Finally, real-time content editing and interactive applications represent promising directions for practical deployment.
AI Executive Summary
The rapid evolution of multimodal AI has led to powerful vision-language models like CLIP, which embed images and text into a shared semantic space. These models have revolutionized tasks such as zero-shot classification and cross-modal retrieval, enabling machines to understand and connect visual content with natural language descriptions. However, despite their success, a persistent challenge remains: the modality gap caused by the inherent information imbalance between images and their textual descriptions. Images often contain richer, more detailed information than captions, which leads to poor alignment and limits the models' performance in downstream tasks.
Addressing this fundamental issue, Sweta Mahajan and colleagues introduce TEVI, a novel framework that employs sparse autoencoders (SAEs) combined with a text-conditioned masking mechanism to refine image embeddings. Unlike traditional approaches that rely on retraining entire models or fine-tuning on large datasets, TEVI operates post-hoc, building on pre-trained CLIP models. The core idea is to decompose image embeddings into interpretable latent concepts using SAEs, which can be selectively filtered based on textual input. This enables the model to retain only the content relevant to the caption, effectively editing the visual representation to improve alignment.
The methodology involves training a sparse autoencoder to disentangle image features into concept-specific latents. A small multi-layer perceptron (MLP) then learns to map text embeddings to a mask over these latents, controlling which concepts are preserved. During inference, applying this mask results in a text-conditioned, content-edited image embedding. The authors validate their approach through controlled experiments on synthetic datasets, demonstrating precise manipulation of attributes like 'swelling' or 'fracture.' These experiments confirm that TEVI can effectively identify and isolate specific concepts, with high attribute-specific disentanglement scores.
Extending beyond synthetic data, TEVI was applied to real-world datasets such as MS COCO, Flickr, IIW, and DOCCI. The results show consistent improvements in retrieval metrics: for example, MS COCO image-to-text R@1 increased from 32.98% to 35.66%, and long-caption benchmarks like DOCCI saw R@1 rise from 20.38% to 24.20%. Notably, the gains were more pronounced with richer, longer descriptions, indicating that detailed captions provide stronger signals for content editing. The framework also enhanced robustness against textual perturbations, making it more reliable in practical scenarios.
Furthermore, the authors incorporated negative caption conditioning during training, which further stabilized the content editing process and improved the semantic consistency of the edited embeddings. This comprehensive evaluation underscores TEVI's potential to significantly advance multimodal understanding, content retrieval, and content editing. Its modular, interpretable design offers a scalable and flexible pathway to improve existing vision-language models, bridging the gap between visual richness and textual abstraction.
Looking ahead, future work aims to expand the concept space, enable end-to-end training, and extend TEVI's capabilities to video and 3D content. The ultimate goal is to develop more controllable, explainable, and robust multimodal AI systems that can seamlessly integrate visual and linguistic information, fostering new applications in content creation, personalized AI assistants, and beyond. TEVI's innovative approach marks a meaningful step toward more intelligent, adaptable, and human-aligned multimodal models, promising a future where machines can not only understand but also precisely manipulate visual content guided by natural language.
Deep Dive
Abstract
Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.
References (20)
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Shaoan Xie, Lingjing Kong, Yujia Zheng et al.
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes et al.
Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Sedigheh Eslami, Gerard de Melo
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
Simon Schrodi, David T. Hoffmann, Max Argus et al.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov et al.
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov et al.
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge J. Belongie et al.
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding et al.
Decoupled Weight Decay Regularization
Ilya Loshchilov, Frank Hutter
DOCCI: Descriptions of Connected and Contrasting Images
Yasumasa Onoe, Sunayana Rane, Zachary Berger et al.
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy et al.
RoCOCO: Robustness Benchmark of MS-COCO to Stress-Test Image-Text Matching Models
Seulki Park, Daeho Um, Hajung Yoon et al.
Interpreting CLIP with Hierarchical Sparse Autoencoders
Vladimir Zaigrajew, Hubert Baniecki, Przemysław Biecek
Applying sparse autoencoders to unlearn knowledge in language models
Eoin Farrell, Yeu-Tong Lau, Arthur Conmy
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Bartosz Cywiński, Kamil Deja
Representation Learning with Contrastive Predictive Coding
Aäron van den Oord, Yazhe Li, Oriol Vinyals
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
Piyush Sharma, Nan Ding, Sebastian Goodman et al.
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith et al.
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie, Bin Wang, Fanjing Kong et al.
SLIP: Self-supervision meets Language-Image Pre-training
Norman Mu, Alexander Kirillov, David A. Wagner et al.