Modular Representation Compression: Adapting LLMs for Efficient and Effective Recommendations
The MARC method improves recommendation efficiency through modular representation compression, achieving a 2.82% eCPM lift in online tests.
Key Findings
Methodology
The paper introduces a novel Modular Representation Compression (MARC) method, which explicitly controls the modularity of large language models (LLMs) through modular adjustment and task decoupling. Specifically, Modular Adjustment introduces compression and task adaptation modules, allowing the LLM to function solely as a representation-learning module. Modular Task Decoupling then employs information constraints and distinct network structures to ensure each module focuses on its specific task.
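Below is a minimal sketch of this modular layout. Everything here is an illustrative assumption rather than the paper's exact implementation: the module names, layer sizes, and the choice of a simple MLP compressor and concatenation-based scorer are placeholders.

```python
import torch
import torch.nn as nn

class CompressionModule(nn.Module):
    """Compresses a high-dimensional LLM representation (illustrative sizes)."""
    def __init__(self, llm_dim: int = 4096, compressed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, 512),
            nn.ReLU(),
            nn.Linear(512, compressed_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

class TaskAdaptationModule(nn.Module):
    """Scores a user-item pair from compressed vectors, absorbing the
    optimization pressure of the recommendation objective."""
    def __init__(self, compressed_dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * compressed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, z_user: torch.Tensor, z_item: torch.Tensor) -> torch.Tensor:
        return self.scorer(torch.cat([z_user, z_item], dim=-1)).squeeze(-1)

# The LLM itself stays a pure representation-learning module: its outputs
# (typically cached offline) feed the compression module, and gradients from
# the recommendation task are absorbed by the two modules above.
```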
Key Results
- MARC achieved a 2.82% eCPM lift in an online A/B test within a large-scale commercial search advertising scenario, demonstrating its effectiveness in real-world applications.
- Experiments on the MovieLens-1M dataset show that MARC effectively addresses the Mid-layer Representation Advantage (MRA) issue, where middle-layer representations outperform final-layer representations in recommendation tasks.
- In comparative experiments, MARC consistently outperforms traditional final-layer compression methods across multiple datasets.
Significance
The MARC method is significant in the field of recommender systems, especially in industrial scenarios that handle large volumes of users and items. By effectively compressing LLM representations, MARC not only reduces storage and computational costs but also enhances the performance of recommendation systems. This method addresses the limitations of existing compression methods that focus on final-layer representations, providing new insights for efficient deployment of recommendation systems.
Technical Contribution
MARC's technical contribution lies in using modular adjustment and task decoupling to overcome the limitations of existing compression methods, which focus on final-layer representations. By introducing information constraints and different network structures, MARC achieves efficient compression without sacrificing representation quality. Additionally, MARC offers a new framework that separates representation learning from task adaptation, preserving the representational capabilities of LLMs.
Novelty
MARC is the first to explicitly control the modularity of LLMs in recommender systems, addressing the Mid-layer Representation Advantage issue. Unlike traditional methods, MARC ensures each module focuses on its specific task through modular adjustment and task decoupling, enhancing the efficiency and effectiveness of representation compression.
Limitations
- MARC may require additional task adaptation module design in certain scenarios to ensure its generalizability across different tasks.
- The computational overhead of MARC needs further optimization when handling extremely large-scale datasets.
- MARC's performance may depend on specific LLM architectures and training datasets, requiring validation across different scenarios.
Future Work
Future research directions include further optimizing MARC's computational efficiency, exploring its generalizability across more tasks and datasets, and developing more lightweight task adaptation modules. Additionally, investigating how MARC can be applied to other types of deep learning models is a promising avenue.
AI Executive Summary
In recent years, large language models (LLMs) have made significant advancements in the field of recommender systems. However, the high-dimensional representations generated by LLMs introduce substantial storage and computational costs, limiting their online deployment in industrial recommender systems. Existing methods typically generate and cache augmented representations offline, but existing compression techniques operate on final-layer representations, which the paper shows to be suboptimal.
This paper proposes a novel Modular Representation Compression (MARC) method that enhances the efficiency and effectiveness of recommender systems by explicitly controlling the modularity of LLMs. MARC introduces compression and task adaptation modules through modular adjustment, allowing the LLM to function solely as a representation-learning module. Subsequently, Modular Task Decoupling employs information constraints and different network structures to ensure each module focuses on its specific task.
In experiments, MARC demonstrates superior performance across multiple datasets, particularly on the MovieLens-1M dataset, where it effectively addresses the Mid-layer Representation Advantage issue. Additionally, MARC achieved a 2.82% eCPM lift in an online A/B test within a large-scale commercial search advertising scenario, demonstrating its effectiveness in real-world applications.
MARC's technical contribution lies in using modular adjustment and task decoupling to overcome the limitations of existing compression methods, which focus on final-layer representations. By introducing information constraints and different network structures, MARC achieves efficient compression without sacrificing representation quality. Additionally, MARC offers a new framework that separates representation learning from task adaptation, preserving the representational capabilities of LLMs.
Despite MARC's impressive performance in recommender systems, its computational overhead needs further optimization when handling extremely large-scale datasets. Additionally, MARC's performance may depend on specific LLM architectures and training datasets, requiring validation across different scenarios. Future research directions include further optimizing MARC's computational efficiency, exploring its generalizability across more tasks and datasets, and developing more lightweight task adaptation modules.
Deep Analysis
Background
Large language models (LLMs) have recently achieved significant advancements in the field of natural language processing, and their application in recommender systems has gained widespread attention. Traditional recommender systems typically rely on static features of users and items, whereas LLMs can inject rich semantic information by generating high-dimensional representations, significantly enhancing recommendation performance. However, the high-dimensional representations of LLMs introduce substantial storage and computational costs, limiting their online deployment in industrial recommender systems. Existing methods typically generate and cache augmented representations offline to avoid high latency in online inference, but existing compression techniques operate on final-layer representations, which the paper shows to be suboptimal.
Core Problem
Effectively compressing the high-dimensional representations of large language models (LLMs) in recommender systems is a critical issue. Existing methods typically compress at the final layer, but experiments show that middle-layer representations often outperform final-layer representations in recommendation tasks. This phenomenon, known as the Mid-layer Representation Advantage (MRA), results in suboptimal performance of existing compression methods that focus on final-layer representations. Addressing this issue to improve the efficiency and effectiveness of recommender systems is the core problem investigated in this paper.
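The layer-wise comparison behind MRA can be probed with a few lines of code. The sketch below uses the Hugging Face transformers API with `output_hidden_states=True`; the checkpoint (a small BERT stands in for a production LLM) and the mean-pooling choice are assumptions for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in checkpoint; any model exposing hidden states works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "A user who enjoys sci-fi classics and space operas"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: (embedding layer, layer 1, ..., layer N).
reps = [h.mean(dim=1) for h in outputs.hidden_states]  # mean-pool over tokens

# MRA probe: feed reps[k] for each layer k into the same downstream
# recommendation head and compare metrics; under MRA, some middle layer
# outperforms the final layer reps[-1].
for k, r in enumerate(reps):
    print(f"layer {k}: pooled representation shape {tuple(r.shape)}")
```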
Innovation
This paper proposes a novel Modular Representation Compression (MARC) method that enhances the efficiency and effectiveness of recommender systems by explicitly controlling the modularity of LLMs. The core innovations of MARC include:
1. Modular Adjustment: Introducing compression and task adaptation modules, allowing the LLM to function solely as a representation-learning module.
2. Modular Task Decoupling: Employing information constraints and different network structures to ensure each module focuses on its specific task.
3. Information Constraint: Maximizing the mutual information between original and compressed representations to maintain the information density of the compressed representations.
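The information constraint in item 3 can be made concrete. Writing $h$ for the original LLM representation and $z$ for its compressed counterpart, one standard tractable surrogate (an assumption here; the paper only states that mutual information is maximized) is the InfoNCE lower bound:

```latex
% Goal: keep I(h; z) high while reducing dimensionality.
% With N in-batch pairs (h_i, z_i) and a learned critic f_\theta,
% InfoNCE bounds the mutual information from below:
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\!\left[
      \log \frac{\exp f_\theta(h_i, z_i)}
                {\sum_{j=1}^{N} \exp f_\theta(h_i, z_j)}
    \right],
\qquad
I(h; z) \;\geq\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}.
```

Minimizing $\mathcal{L}_{\mathrm{InfoNCE}}$ raises the lower bound on $I(h; z)$, which is one way to keep the compressed representation information-dense.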
Methodology
The specific steps of the MARC method are as follows:
1. Modular Adjustment: Introducing compression and task adaptation modules, allowing the LLM to function solely as a representation-learning module.
2. Modular Task Decoupling: Employing information constraints and different network structures to ensure each module focuses on its specific task.
3. Information Constraint: Maximizing the mutual information between original and compressed representations to maintain the information density of the compressed representations.
4. User-Item Matching Network: Serving as the dedicated task adaptation module, absorbing the optimization pressure from the training objective (see the sketch after this list).
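The sketch below shows how these steps could combine in a single hypothetical training step. The `projector` critic, the in-batch InfoNCE estimator, and the weighting `lam` are all assumptions for illustration, not the paper's exact losses; the point is the decoupling, where the matching loss lands on the task adaptation module while the information constraint grounds the compression module.

```python
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, h_proj: torch.Tensor, temperature: float = 0.1):
    """In-batch InfoNCE loss; minimizing it maximizes a lower bound on the
    mutual information between compressed z and the original representation."""
    z = F.normalize(z, dim=-1)
    h_proj = F.normalize(h_proj, dim=-1)
    logits = z @ h_proj.T / temperature  # (N, N); diagonal entries are positives
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)

def training_step(compressor, projector, matcher,
                  h_user, h_item, clicks, lam: float = 1.0):
    """One hypothetical decoupled step over a batch of cached LLM representations."""
    z_user, z_item = compressor(h_user), compressor(h_item)

    # Task adaptation: the user-item matching network absorbs the
    # optimization pressure of the click objective.
    scores = matcher(z_user, z_item)
    match_loss = F.binary_cross_entropy_with_logits(scores, clicks.float())

    # Information constraint: keep z_user informative about the original
    # (cached, frozen) representation via a learned projection critic.
    mi_loss = info_nce(z_user, projector(h_user.detach()))

    return match_loss + lam * mi_loss
```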
Experiments
Experiments were conducted on the MovieLens-1M, Yelp, and MovieLens-25M datasets against baselines that include traditional final-layer compression methods and existing projection-head methods. Experimental metrics included click-through rate (CTR) and eCPM. The experimental design included comparative experiments and ablation studies to verify the effectiveness and robustness of the MARC method.
Results
Experimental results show that MARC consistently outperforms traditional final-layer compression methods across multiple datasets. On the MovieLens-1M dataset, MARC effectively addresses the Mid-layer Representation Advantage issue. Additionally, MARC achieved a 2.82% eCPM lift in an online A/B test within a large-scale commercial search advertising scenario, demonstrating its effectiveness in real-world applications.
Applications
The MARC method has broad application prospects in industrial recommender systems that handle large volumes of users and items. By effectively compressing LLM representations, MARC not only reduces storage and computational costs but also enhances the performance of recommendation systems. This method is particularly suitable for scenarios requiring efficient deployment, such as online advertising recommendations and personalized content recommendations.
Limitations & Outlook
Despite MARC's impressive performance in recommender systems, its computational overhead needs further optimization when handling extremely large-scale datasets. Additionally, MARC's performance may depend on specific LLM architectures and training datasets, requiring validation across different scenarios. Future research directions include further optimizing MARC's computational efficiency, exploring its generalizability across more tasks and datasets, and developing more lightweight task adaptation modules.
Plain Language (accessible to non-experts)
Imagine you have a huge library with all sorts of books, each containing a wealth of information. Now, you need to pick out the most useful information to recommend to readers. Large language models (LLMs) are like this library; they can generate a lot of information, but storing and processing it is costly. To improve efficiency, we need to compress this information, much like condensing a thick book into a summary.
The MARC method is like a smart librarian who can identify the most valuable information and extract it. By introducing modular adjustment and task decoupling, MARC ensures that each module focuses on its specific task, much like different librarians handling different categories of books.
Moreover, MARC uses information constraints to ensure that the compressed information retains the essence of the original. This is akin to ensuring that every important chapter and paragraph is preserved when compressing a book. Ultimately, MARC can provide high-quality recommendations at a lower cost, just like offering readers a better reading experience with fewer books.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super complex game with lots of levels, and each level has different challenges. Large language models (LLMs) are like this game; they can generate tons of cool content, but sometimes it's just too much to handle.
So, we need a smart assistant to help us pick out the most useful content, and that's what the MARC method does. MARC is like a super helper that can help you extract the important info from the game, so you can level up faster!
MARC uses modular adjustment and task decoupling to ensure each module focuses on its task, just like every game character has its special skills. Plus, MARC uses information constraints to ensure the compressed info still keeps the essence of the original. This way, you can get a better gaming experience in less time. Isn't that cool?
So next time you're gaming, think about how MARC helps you boost efficiency!
Glossary
Large Language Model (LLM)
A large language model is a deep learning-based natural language processing model with a vast number of parameters, capable of generating high-quality text representations.
In this paper, LLMs are used to generate high-dimensional representations for recommender systems.
Recommender System
A recommender system is a system that uses feature information of users and items to provide personalized recommendations to users.
The paper explores integrating LLMs into recommender systems to enhance performance.
Modular Representation Compression (MARC)
MARC is a method that compresses LLM representations through modular adjustment and task decoupling, enhancing the efficiency and effectiveness of recommender systems.
MARC is the core method proposed in the paper to address the Mid-layer Representation Advantage issue.
Mid-layer Representation Advantage (MRA)
MRA refers to the phenomenon where middle-layer representations of LLMs often outperform final-layer representations in recommendation tasks.
The paper addresses the MRA issue using the MARC method.
Information Constraint
An information constraint is a method that maximizes the mutual information between original and compressed representations to maintain information density.
In MARC, information constraints ensure the quality of compressed representations.
Task Decoupling
Task decoupling is a method that uses different network structures and information constraints to ensure each module focuses on its specific task.
MARC improves the efficiency of representation compression through task decoupling.
User-Item Matching Network
The User-Item Matching Network is a module in MARC that absorbs the optimization pressure from the training objective.
In MARC, the User-Item Matching Network serves as the dedicated task adaptation module.
Click-Through Rate (CTR)
CTR is an important metric for measuring the performance of recommender systems, representing the probability of users clicking on recommended items.
In experiments, CTR is used to evaluate the effectiveness of the MARC method.
eCPM
eCPM is the effective cost per thousand impressions, used to measure the effectiveness and revenue of advertisements.
In online A/B tests, MARC achieved a 2.82% eCPM lift.
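For concreteness, both of the metrics defined above reduce to simple ratios. A toy computation, with all numbers invented for illustration:

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: fraction of impressions that were clicked."""
    return clicks / impressions

def ecpm(revenue: float, impressions: int) -> float:
    """Effective cost per mille: revenue per 1,000 impressions."""
    return revenue / impressions * 1000

# Toy numbers: a 2.82% lift means new_ecpm / old_ecpm - 1 == 0.0282.
old_ecpm = ecpm(revenue=500.0, impressions=100_000)   # 5.00
new_ecpm = old_ecpm * 1.0282                          # after the reported lift
print(f"CTR: {ctr(320, 100_000):.4f}, eCPM lift: {new_ecpm / old_ecpm - 1:.2%}")
```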
Projection Head Method
The projection head method compresses representations by adding a projection layer to the final layer of an LLM.
The paper compares the effectiveness of MARC with traditional projection head methods.
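A minimal sketch of this baseline concept, with sizes assumed: a single learned layer compresses the final-layer representation directly, so the recommendation objective's pressure lands on the representation itself rather than on a dedicated task adaptation module as in MARC.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Baseline: compress the final-layer LLM representation with one layer.
    Unlike MARC, the same layer must both compress and adapt to the task."""
    def __init__(self, llm_dim: int = 4096, out_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(llm_dim, out_dim)

    def forward(self, final_layer_rep: torch.Tensor) -> torch.Tensor:
        return self.proj(final_layer_rep)
```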
Open Questions (unanswered questions from this research)
1. How can MARC's computational efficiency be further optimized on extremely large-scale datasets? Existing methods have high computational overhead when handling large-scale data, requiring the development of more efficient algorithms.
2. What is the generalizability of MARC across different types of recommendation tasks? Current research mainly focuses on specific datasets and tasks, requiring validation of its effectiveness in other scenarios.
3. How can more lightweight task adaptation modules be designed? Existing task adaptation modules may be overly complex in certain scenarios, necessitating simplified designs.
4. How does MARC perform across different LLM architectures? Current research mainly relies on specific LLM architectures, requiring exploration of its adaptability to other architectures.
5. How can MARC be applied to other types of deep learning models? Current research mainly focuses on LLMs, requiring exploration of its potential applications in other models.
Applications
Immediate Applications
Online Advertising Recommendation
MARC can be used in online advertising recommendation to reduce storage and computational costs and improve the efficiency and effectiveness of ad recommendations by compressing LLM representations.
Personalized Content Recommendation
In personalized content recommendation, MARC can improve the performance of recommendation systems by efficiently compressing representations, providing users with more accurate recommendations.
Social Media Recommendation
MARC can be applied to social media platforms to improve the response speed and recommendation quality of recommendation systems by compressing user and content representations.
Long-term Vision
Intelligent Assistants
MARC can be used to develop smarter assistants by efficiently processing large amounts of information, providing more accurate suggestions and services.
Autonomous Driving
In autonomous driving, MARC can be used to compress and process sensor data, improving the system's real-time response capabilities and decision accuracy.
Abstract
Recently, large language models (LLMs) have advanced recommendation systems (RSs), and recent works have begun to explore how to integrate LLMs into industrial RSs. While most approaches deploy LLMs offline to generate and pre-cache augmented representations for RSs, high-dimensional representations from LLMs introduce substantial storage and computational costs. Thus, it is crucial to compress LLM representations effectively. However, we identify a counterintuitive phenomenon during representation compression: Mid-layer Representation Advantage (MRA), where representations from middle layers of LLMs outperform those from final layers in recommendation tasks. This degraded final layer renders existing compression methods, which typically compress on the final layer, suboptimal. We interpret this based on modularity theory that LLMs develop spontaneous internal functional modularity and force the final layer to specialize in the proxy training task. Thus, we propose Modular Representation Compression (MARC) to explicitly control the modularity of LLMs. First, Modular Adjustment explicitly introduces compression and task adaptation modules, enabling the LLM to operate strictly as a representation-learning module. Next, to ground each module to its specific task, Modular Task Decoupling uses information constraints and different network structures to decouple tasks. Extensive experiments validate that MARC addresses MRA and produces efficient representations. Notably, MARC achieved a 2.82% eCPM lift in an online A/B test within a large-scale commercial search advertising scenario.