Quantized Inference for OneRec-V2
OneRec-V2 achieves a 49% latency reduction and a 92% throughput increase via FP8 quantized inference.
Key Findings
Methodology
The study proposes an FP8 post-training quantization framework for OneRec-V2, integrated into an optimized inference infrastructure. Analysis of the statistical distribution of weights and activations shows that OneRec-V2's numerical behavior is closer to that of large language models than to traditional recommendation models, with a more compute-intensive inference pattern and higher hardware utilization. The framework applies low-precision computation safely by controlling the numerical range of weights and activations.
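The distribution analysis can be sketched numerically. The statistics below (max magnitude, standard deviation, outlier-to-range ratio) are illustrative assumptions: the paper reports a distribution analysis but does not list these exact metrics, and the two synthetic tensors merely mimic "LLM-like" versus heavy-tailed "traditional recommender" behavior.

```python
import numpy as np

def tensor_stats(t):
    """Statistics commonly used to judge how quantization-friendly a
    tensor is: dynamic range (max |x|), dispersion (std), and how far
    the bulk of the distribution sits below the extreme outliers."""
    t = np.asarray(t, dtype=np.float64).ravel()
    a = np.abs(t)
    return {
        "amax": float(a.max()),                            # dynamic range
        "std": float(t.std()),                             # dispersion
        "p99_to_amax": float(np.quantile(a, 0.99) / a.max()),  # outlier dominance
    }

rng = np.random.default_rng(0)
llm_like = rng.standard_normal(10_000)                     # well-controlled values
rec_like = 10 * rng.standard_normal(10_000) + rng.standard_cauchy(10_000)  # heavy-tailed
```

A small `p99_to_amax` means a handful of outliers dominate the numeric range, so a single FP8 scale wastes most of the format's code points on values that almost never occur; well-controlled, LLM-like tensors keep this ratio high, which is what makes them quantization-friendly.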
Key Results
- The FP8 post-training quantization framework achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. These improvements are primarily due to the combination of infrastructure upgrades, low-precision computation, and operator-level optimizations.
- Extensive online A/B testing shows that FP8 inference introduces no degradation in core metrics, proving the feasibility of low-precision inference in real production environments.
- Comparing traditional recommendation models and large language models, OneRec-V2's weight and activation distributions are closer to those of the latter, making low-precision quantization far more viable.
Significance
This study demonstrates that as recommender systems evolve towards the paradigms of large language models, low-precision computation techniques can be effectively adapted to large-scale recommendation workloads. This is significant not only for academia but also provides new optimization directions for the industry, particularly in terms of hardware utilization and inference efficiency. By transferring low-precision techniques from the domain of large language models to recommender systems, the study addresses pain points related to numerical behavior and hardware utilization in traditional recommendation models.
Technical Contribution
Technical contributions include the development of an FP8 post-training quantization framework combined with an optimized inference infrastructure, achieving significant latency reduction and throughput improvement. Compared to existing recommender systems, OneRec-V2 relies more on dense computation paths and unified execution patterns, improving hardware utilization. Additionally, the study demonstrates how low-precision techniques from large language models can be effectively transferred to recommender systems.
Novelty
This study is the first to successfully apply FP8 quantized inference to recommender systems and demonstrate its effectiveness in OneRec-V2. Compared to traditional recommendation models, OneRec-V2's improvements in numerical behavior and hardware utilization make low-precision computation feasible. This innovation provides new insights for optimizing recommender systems.
Limitations
- The current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4, thus not revealing the full accuracy-efficiency frontier of generative recommendation models.
- The solution relies on substantial infrastructure support and system-level customization, which may limit its reproducibility and portability in production environments lacking advanced inference stacks or sufficient engineering resources.
- The study only experiments on OneRec-V2 and does not cover a broader set of generative recommendation architectures, leaving it unclear to what extent the observed quantization properties and deployment benefits generalize across different model designs.
Future Work
Future work can explore more aggressive low-precision quantization settings to further improve efficiency. Additionally, the study can be extended to other generative recommendation models to verify the generalizability of quantization techniques. More generic infrastructure can also be developed to reduce dependency on specific hardware and systems, enhancing the portability of the solution.
AI Executive Summary
Applying low-precision quantized inference in recommender systems has been challenging due to significant differences in numerical behavior and hardware utilization compared to traditional models, limiting the practical benefits of low-precision computation. OneRec-V2 narrows the gap with large language models through more compute-intensive paths and unified execution patterns, making low-precision quantization feasible.
The study proposes an FP8 post-training quantization framework integrated into an optimized inference infrastructure. Analysis of the statistical distribution of weights and activations shows that OneRec-V2's numerical behavior is closer to that of large language models than to traditional recommendation models, with a more compute-intensive inference pattern and higher hardware utilization. This finding provides a solid foundation for applying low-precision computation.
In experiments, FP8 quantized inference achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. The combination of infrastructure upgrades, low-precision computation, and operator-level optimizations is key to these improvements. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics.
These results indicate that as recommender systems evolve towards the paradigms of large language models, low-precision computation techniques can be effectively adapted to large-scale recommendation workloads. This is significant not only for academia but also provides new optimization directions for the industry, particularly in terms of hardware utilization and inference efficiency.
However, the current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4. Future work can explore more aggressive low-precision quantization settings to further improve efficiency. Additionally, the study can be extended to other generative recommendation models to verify the generalizability of quantization techniques.
Deep Analysis
Background
Quantized inference is an essential technique for improving the efficiency of large-scale neural networks, particularly in large language models where low-precision formats have demonstrated substantial system-level benefits while preserving model quality. However, reliably applying low-precision quantization in recommender systems has long been challenging in industrial practice. Traditional recommendation models are typically optimized for fine-grained ranking tasks and differ significantly from large language models in both training paradigms and architectural structures. Empirically, their weights and activations often exhibit high magnitudes and large variances, making these models more sensitive to quantization-induced perturbations. From a systems perspective, classical recommender inference workloads are frequently memory or control bound and exhibit relatively low hardware utilization. As a result, even when hardware platforms support low-precision computation, the practical end-to-end gains may be limited. Recent advances in generative recommendation models have begun to narrow this gap. OneRec introduces a unified generative framework that integrates retrieval and ranking, and subsequent extensions such as OneRec-V2 further refine this paradigm through architectural scaling and training improvements.
Core Problem
Applying low-precision quantized inference in recommender systems has been challenging due to significant differences in numerical behavior and hardware utilization compared to traditional models, limiting the practical benefits of low-precision computation. Traditional recommendation models' weights and activations often exhibit high magnitudes and large variances, making them more sensitive to quantization-induced perturbations. Additionally, recommendation workloads frequently suffer from limited hardware utilization, restricting the practical gains of low-precision computation. These numerical and system factors have historically hindered the effective deployment of low-precision inference in traditional recommendation pipelines.
Innovation
The core innovation of this study lies in successfully applying FP8 quantized inference to recommender systems and demonstrating its effectiveness in OneRec-V2. Compared to traditional recommendation models, OneRec-V2's improvements in numerical behavior and hardware utilization make low-precision computation feasible. This innovation provides new insights for optimizing recommender systems. Specifically, the study develops an FP8 post-training quantization framework combined with an optimized inference infrastructure, achieving significant latency reduction and throughput improvement. By analyzing the statistical distribution of weights and activations, it was found that OneRec-V2's numerical behavior is closer to that of large language models, with a more compute-intensive inference pattern and higher hardware utilization.
Methodology
- Developed an FP8 post-training quantization framework combined with an optimized inference infrastructure, achieving significant latency reduction and throughput improvement.
- Analyzed the statistical distribution of weights and activations, finding that OneRec-V2's numerical behavior is closer to that of large language models than to traditional recommendation models, with a more compute-intensive inference pattern and higher hardware utilization.
- Adopted a post-training quantization (PTQ) approach to introduce low-precision computation into the inference stage of OneRec-V2 without modifying the model architecture or training procedure. Quantization is applied only to the most computation-intensive operators, namely the Linear layers (including the qkvo projection layers in Attention and the linear transformations in the Dense FFN) and the grouped GEMM operations in Sparse MoE. Other numerically sensitive or less compute-dominant components remain in their original precision to control potential numerical risks.
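The PTQ scheme for a Linear layer can be sketched as follows. This is an illustrative simulation, not the paper's implementation: it assumes per-tensor scaling into the E4M3 range (the paper does not state its scaling granularity) and emulates 3-mantissa-bit rounding in NumPy rather than using real FP8 hardware types.

```python
import numpy as np

def quantize_fp8_e4m3(x, amax=None):
    """Simulate FP8 E4M3 post-training quantization of a tensor:
    per-tensor scaling into the E4M3 dynamic range (max normal value
    448), then rounding to 3 mantissa bits."""
    FP8_MAX = 448.0                          # largest finite E4M3 value
    if amax is None:
        amax = float(np.abs(x).max())
    scale = amax / FP8_MAX if amax > 0 else 1.0
    y = np.clip(x / scale, -FP8_MAX, FP8_MAX)
    # Round to 3 mantissa bits: quantum = 2**(exponent - 3).
    mag = np.abs(y)
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    quantum = np.exp2(exp - 3)
    return np.round(y / quantum) * quantum, scale

def fp8_linear(x, w):
    """Linear layer with both operands in simulated FP8; the product is
    accumulated in full precision and rescaled afterwards, mirroring
    FP8 TensorCore GEMMs that accumulate in FP16/FP32."""
    xq, sx = quantize_fp8_e4m3(x)
    wq, sw = quantize_fp8_e4m3(w)
    return (xq @ wq) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)
ref = x @ w
err = np.abs(fp8_linear(x, w) - ref).max() / np.abs(ref).max()
```

On well-controlled inputs like these, the normalized error stays in the low single-digit percent range, which is why keeping only the compute-dominant GEMMs in FP8 and the numerically sensitive operators in full precision is a reasonable risk trade-off.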
Experiments
The experimental design includes both offline and online performance evaluations of the OneRec-V2 model. In offline experiments, system performance is measured in terms of end-to-end latency and throughput. The baseline system performs inference in FP16, while the optimized system applies post-training quantization to computation-dominant Linear layers. Online A/B testing is conducted in real production environments to verify the impact of low-precision inference on recommendation quality. The results show that FP8 quantized inference achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. The combination of infrastructure upgrades, low-precision computation, and operator-level optimizations is key to these improvements.
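The offline measurement protocol can be sketched with a minimal harness. This is a hypothetical illustration of how latency and throughput relate, not the paper's actual serving-infrastructure benchmark; the function names and workload are assumptions.

```python
import time
import numpy as np

def benchmark(fn, x, warmup=5, iters=50):
    """Measure an inference callable: mean end-to-end latency (ms per
    call) and throughput (calls per second)."""
    for _ in range(warmup):              # exclude one-time setup costs
        fn(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    dt = time.perf_counter() - t0
    return {"latency_ms": 1e3 * dt / iters, "throughput_qps": iters / dt}

x = np.random.default_rng(0).standard_normal((64, 256)).astype(np.float32)
w = np.random.default_rng(1).standard_normal((256, 256)).astype(np.float32)
stats = benchmark(lambda t: t @ w, x)    # stand-in for a model forward pass
```

For a single-stream workload like this, latency and throughput are reciprocal; the paper's 49% latency cut alongside a 92% throughput gain indicates batching and system-level effects beyond per-call speedup alone.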
Results
The experimental results show that FP8 quantized inference achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. The combination of infrastructure upgrades, low-precision computation, and operator-level optimizations is key to these improvements. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics. These results indicate that as recommender systems evolve towards the paradigms of large language models, low-precision computation techniques can be effectively adapted to large-scale recommendation workloads. This is significant not only for academia but also provides new optimization directions for the industry, particularly in terms of hardware utilization and inference efficiency.
Applications
The application scenarios of this study include the optimization of large-scale recommender systems, particularly in terms of hardware utilization and inference efficiency. By transferring low-precision techniques from the domain of large language models to recommender systems, the study addresses pain points related to numerical behavior and hardware utilization in traditional recommendation models. This technology can be directly applied to recommender systems that require efficient inference, such as short video recommendations and personalized advertising.
Limitations & Outlook
The current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4. Therefore, future work can explore more aggressive low-precision quantization settings to further improve efficiency. Additionally, the study can be extended to other generative recommendation models to verify the generalizability of quantization techniques. More generic infrastructure can also be developed to reduce dependency on specific hardware and systems, enhancing the portability of the solution.
Plain Language (accessible to non-experts)
Imagine a factory that needs to produce a large number of products. Traditionally, the production line uses high-precision machines to ensure each product is flawless, but this requires a lot of time and resources. Now, the factory introduces a new method, using low-precision machines to speed up production. These machines, although slightly less precise, still produce acceptable quality products because they focus on the critical steps of production rather than every detail. This is similar to applying low-precision quantized inference in recommender systems, where reducing computational precision improves efficiency while ensuring the final recommendation quality remains unaffected. In this way, the factory can produce more products in less time to meet market demand. Similarly, recommender systems can process more data in less time, providing faster recommendation services. The key to this approach is finding the balance between precision and efficiency, ensuring that while efficiency is improved, product quality still meets customer expectations.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super cool game where your task is to recommend things your friends might like. Traditionally, you might spend a lot of time making sure each recommendation is perfect, like using a high-precision telescope to look at stars. But this time, we have a new tool, like a fast microscope, that lets you find those important stars quicker! Although this microscope isn't as precise as the telescope, it helps you find more stars in less time, so you can complete your task faster. This new tool is like the low-precision quantization technique we use in recommender systems, which makes the system run faster while ensuring the recommendation quality doesn't drop. This way, you can recommend more things your friends will like in less time. Isn't that cool?
Glossary
Quantized Inference
Quantized inference is a technique that reduces computation and memory costs by representing weights and activations with lower numerical precision, often used to improve the efficiency of large-scale neural networks.
In this paper, quantized inference is used to enhance the inference efficiency of OneRec-V2.
FP8
FP8 is a low-precision numerical format that uses 8 bits to represent floating-point numbers, significantly reducing computation and storage costs while maintaining a certain level of precision.
The paper develops an FP8 post-training quantization framework for OneRec-V2 inference.
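Two FP8 variants are in common use: E4M3 (4 exponent bits, 3 mantissa bits, typically used for weights and activations) and E5M2 (5 exponent bits, 2 mantissa bits, typically used for gradients). Their finite ranges fall out of the bit layout; the sketch below derives the maximum representable values. The paper does not state which variant OneRec-V2 uses.

```python
def max_normal(exp_bits: int, man_bits: int, bias: int, ieee_top_exp: bool) -> float:
    """Largest finite value of a mini-float format.

    ieee_top_exp=True  -> IEEE-style (e.g. E5M2): the top exponent code is
                          reserved for inf/NaN, so the largest normal uses
                          the second-highest exponent, all-ones mantissa.
    ieee_top_exp=False -> E4M3-style: the top exponent encodes normals,
                          and only the all-ones mantissa there is NaN.
    """
    if ieee_top_exp:
        e = (2**exp_bits - 2) - bias
        frac = 2.0 - 2.0**(-man_bits)
    else:
        e = (2**exp_bits - 1) - bias
        frac = 2.0 - 2.0**(1 - man_bits)   # mantissa all-ones is NaN
    return frac * 2.0**e

E4M3_MAX = max_normal(4, 3, bias=7, ieee_top_exp=False)   # 448.0
E5M2_MAX = max_normal(5, 2, bias=15, ieee_top_exp=True)   # 57344.0
```

The narrow E4M3 range (±448) is exactly why controlled weight and activation magnitudes, as observed in OneRec-V2, are a precondition for FP8 inference.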
Post-Training Quantization
Post-training quantization is a technique applied after model training, reducing the numerical precision of model parameters and activations to improve inference efficiency.
The paper adopts post-training quantization to introduce low-precision computation into the inference stage of OneRec-V2.
Generative Recommendation
Generative recommendation is a paradigm that formulates recommendation tasks as conditional sequence generation, integrating retrieval and ranking for end-to-end optimization.
OneRec-V2 adopts a generative recommendation paradigm to enhance the efficiency of recommender systems.
Hardware Utilization
Hardware utilization refers to the efficiency with which computing resources are used when performing tasks; higher hardware utilization typically means higher computational efficiency.
OneRec-V2's inference pattern is more compute-intensive, resulting in higher hardware utilization.
Linear Layer
A linear layer is a basic neural network layer that applies a learned linear transformation, implemented as a matrix multiplication (optionally followed by a bias addition).
In the paper, quantization is applied only to the most computation-intensive operators, namely the Linear layers.
Grouped GEMM
Grouped GEMM batches multiple independent matrix multiplications of possibly different sizes (for example, one per expert in a Mixture-of-Experts layer) into a single kernel launch, improving GPU efficiency.
In the paper, grouped GEMM operations are a focus of quantization.
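The semantics of a grouped GEMM can be stated as a reference implementation: one independent matmul per group, where in Sparse MoE each group holds the tokens routed to one expert. The shapes and routing below are hypothetical; real kernels fuse all groups into one launch rather than looping in Python.

```python
import numpy as np

def grouped_gemm_reference(token_groups, expert_weights):
    """Reference semantics of a grouped GEMM: one independent matmul per
    group. Row counts differ per group (uneven token routing), while the
    weight shapes match across experts."""
    return [x @ w for x, w in zip(token_groups, expert_weights)]

rng = np.random.default_rng(1)
d_model, d_ff, n_experts = 16, 32, 4
weights = [rng.standard_normal((d_model, d_ff)) for _ in range(n_experts)]
# Uneven routing: 5, 0, 3, and 2 tokens land on the four experts.
groups = [rng.standard_normal((n, d_model)) for n in (5, 0, 3, 2)]
outputs = grouped_gemm_reference(groups, weights)
```

Because each group is itself a dense GEMM, quantizing the operands to FP8 applies here exactly as in the ordinary Linear layers.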
TensorCore
Tensor Cores are specialized GPU hardware units (introduced by NVIDIA) that accelerate matrix operations, significantly improving the computational efficiency of deep learning models.
The paper uses FP8 TensorCore multiplication to enhance computational efficiency.
MoE (Mixture of Experts)
MoE is a model architecture that improves computational efficiency by selectively activating parts of the expert network.
OneRec-V2 employs an MoE architecture to enhance inference efficiency.
A/B Testing
A/B testing is an experimental method used to evaluate the effect of a change by comparing the performance of two versions, widely used for product optimization.
The paper verifies the impact of FP8 inference on recommendation quality through online A/B testing.
Open Questions (unanswered questions from this research)
1. The current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4, leaving the full accuracy-efficiency frontier of generative recommendation models uncharted.
2. The solution relies on substantial infrastructure support and system-level customization, which may limit its reproducibility and portability in production environments lacking advanced inference stacks or sufficient engineering resources.
3. The study only experiments on OneRec-V2 and does not cover a broader set of generative recommendation architectures, leaving it unclear to what extent the observed quantization properties and deployment benefits generalize across different model designs.
4. Although the study demonstrates the effectiveness of low-precision inference in recommender systems, it remains unclear how applicable the approach is to other tasks, such as natural language processing or computer vision.
5. The study does not explore the impact of different quantization strategies on model performance in detail; future work could conduct more granular analyses to optimize quantization strategies.
Applications
Immediate Applications
Short Video Recommendation
By applying low-precision quantization techniques, short video recommendation systems can process more data in less time, providing faster recommendation services.
Personalized Advertising
Low-precision quantization techniques can improve the efficiency of advertising recommendation systems, enabling them to respond to user needs more quickly and increase the accuracy of ad placements.
E-commerce Recommendation
In e-commerce platforms, low-precision quantization techniques can improve the response speed of recommendation systems, helping users find interesting products more quickly.
Long-term Vision
Cross-Domain Recommender Systems
Low-precision quantization techniques can be extended to recommender systems in other domains, such as music and news, to enhance overall recommendation efficiency.
Smart Home Recommendations
In the future, low-precision quantization techniques can be applied to recommendation systems in smart home devices, improving device response speed and user experience.
Abstract
Quantized inference has demonstrated substantial system-level benefits in large language models while preserving model quality. In contrast, reliably applying low-precision quantization to recommender systems remains challenging in industrial settings. This difficulty arises from differences in training paradigms, architectural patterns, and computational characteristics, which lead to distinct numerical behaviors in weights and activations. Traditional recommender models often exhibit high-magnitude and high-variance weights and activations, making them more sensitive to quantization-induced perturbations. In addition, recommendation workloads frequently suffer from limited hardware utilization, limiting the practical gains of low-precision computation. In this work, we revisit low-precision inference in the context of generative recommendation. Through empirical distribution analysis, we show that the weight and activation statistics of OneRec-V2 are significantly more controlled and closer to those of large language models than traditional recommendation models. Moreover, OneRec-V2 exhibits a more compute-intensive inference pattern with substantially higher hardware utilization, enabling greater end-to-end throughput gains with low-precision computation. Leveraging this property, we develop an FP8 post-training quantization framework and integrate it into an optimized inference infrastructure. The proposed joint optimization achieves a 49% reduction in end-to-end inference latency and a 92% increase in throughput. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics. These results suggest that as recommender systems evolve toward the paradigms of large language models, algorithm-level and system-level optimization techniques established in the LLM domain can be effectively adapted to large-scale recommendation workloads.