Quantized Inference for OneRec-V2

TL;DR

OneRec-V2 achieves 49% latency reduction and 92% throughput increase via FP8 quantized inference.

cs.IR · Advanced · 2026-03-12
Yi Su, Xinchen Luo, Hongtao Cheng, Ziteng Shu, Yunfeng Zhao, Fangyu Zhang, Jiaqiang Liu, Xiao Liang, Yiwu Liu, Ruiming Tang
quantized inference, recommender systems, FP8, OneRec-V2, hardware utilization

Key Findings

Methodology

The study proposes an FP8 post-training quantization framework for OneRec-V2, integrated into an optimized inference infrastructure. Analysis of the statistical distribution of weights and activations shows that OneRec-V2's numerical behavior is closer to that of large language models, with a more compute-intensive inference pattern and higher hardware utilization. By keeping the numerical range of weights and activations under control, the framework applies low-precision computation without sacrificing model quality.
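A minimal sketch of this per-tensor scheme in pure Python (illustrative only; the helper names are hypothetical, and the paper's actual kernels run natively on FP8 TensorCores):

```python
import math

# Sketch of per-tensor FP8-style quantize-dequantize (illustrative, not the
# paper's implementation). E4M3_MAX = 448.0 is the largest finite value
# representable in the FP8 E4M3 format.
E4M3_MAX = 448.0

def quantize_dequantize(tensor, mantissa_bits=3):
    """Scale into the FP8 range, round to an E4M3-like grid, scale back."""
    amax = max(abs(x) for x in tensor) or 1.0
    scale = E4M3_MAX / amax                           # one scale per tensor
    out = []
    for x in tensor:
        y = max(-E4M3_MAX, min(E4M3_MAX, x * scale))  # clip to the FP8 range
        if y != 0.0:
            exp = math.floor(math.log2(abs(y)))
            step = 2.0 ** (exp - mantissa_bits)       # grid spacing in this binade
            y = round(y / step) * step                # round the mantissa
        out.append(y / scale)                         # dequantize
    return out

weights = [0.013, -0.4, 1.7, -2.9, 0.05]
deq = quantize_dequantize(weights)
```

With 3 mantissa bits the worst-case relative rounding error is about 1/16 (6.25%), which is why a well-controlled numerical range, as observed in OneRec-V2, matters so much.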

Key Results

  • The FP8 post-training quantization framework achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. These improvements are primarily due to the combination of infrastructure upgrades, low-precision computation, and operator-level optimizations.
  • Extensive online A/B testing shows that FP8 inference introduces no degradation in core metrics, proving the feasibility of low-precision inference in real production environments.
  • Compared with traditional recommendation models, OneRec-V2's weight and activation distributions are closer to those of large language models, which makes low-precision quantization substantially more tractable.

Significance

This study demonstrates that as recommender systems evolve towards the paradigms of large language models, low-precision computation techniques can be effectively adapted to large-scale recommendation workloads. This is significant not only for academia but also provides new optimization directions for the industry, particularly in terms of hardware utilization and inference efficiency. By transferring low-precision techniques from the domain of large language models to recommender systems, the study addresses pain points related to numerical behavior and hardware utilization in traditional recommendation models.

Technical Contribution

Technical contributions include the development of an FP8 post-training quantization framework combined with an optimized inference infrastructure, achieving significant latency reduction and throughput improvement. Compared to existing recommender systems, OneRec-V2 relies more on dense computation paths and unified execution patterns, improving hardware utilization. Additionally, the study demonstrates how low-precision techniques from large language models can be effectively transferred to recommender systems.

Novelty

This study is the first to successfully apply FP8 quantized inference to recommender systems and demonstrate its effectiveness in OneRec-V2. Compared to traditional recommendation models, OneRec-V2's improvements in numerical behavior and hardware utilization make low-precision computation feasible. This innovation provides new insights for optimizing recommender systems.

Limitations

  • The current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4, thus not revealing the full accuracy-efficiency frontier of generative recommendation models.
  • The solution relies on substantial infrastructure support and system-level customization, which may limit its reproducibility and portability in production environments lacking advanced inference stacks or sufficient engineering resources.
  • The study only experiments on OneRec-V2 and does not cover a broader set of generative recommendation architectures, leaving it unclear to what extent the observed quantization properties and deployment benefits generalize across different model designs.

Future Work

Future work can explore more aggressive low-precision quantization settings to further improve efficiency. Additionally, the study can be extended to other generative recommendation models to verify the generalizability of quantization techniques. More generic infrastructure can also be developed to reduce dependency on specific hardware and systems, enhancing the portability of the solution.

AI Executive Summary

Applying low-precision quantized inference in recommender systems has been challenging due to significant differences in numerical behavior and hardware utilization compared to traditional models, limiting the practical benefits of low-precision computation. OneRec-V2 narrows the gap with large language models through more compute-intensive paths and unified execution patterns, making low-precision quantization feasible.

The study proposes an FP8 post-training quantization framework integrated into an optimized inference infrastructure. By analyzing the statistical distribution of weights and activations, it was found that OneRec-V2's numerical behavior is closer to that of large language models, with a more compute-intensive inference pattern and higher hardware utilization. This discovery provides a solid foundation for applying low-precision computation.

In experiments, FP8 quantized inference achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. The combination of infrastructure upgrades, low-precision computation, and operator-level optimizations is key to these improvements. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics.

These results indicate that as recommender systems evolve towards the paradigms of large language models, low-precision computation techniques can be effectively adapted to large-scale recommendation workloads. This is significant not only for academia but also provides new optimization directions for the industry, particularly in terms of hardware utilization and inference efficiency.

However, the current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4. Future work can explore more aggressive low-precision quantization settings to further improve efficiency. Additionally, the study can be extended to other generative recommendation models to verify the generalizability of quantization techniques.

Deep Analysis

Background

Quantized inference is an essential technique for improving the efficiency of large-scale neural networks, particularly in large language models where low-precision formats have demonstrated substantial system-level benefits while preserving model quality. However, reliably applying low-precision quantization in recommender systems has long been challenging in industrial practice. Traditional recommendation models are typically optimized for fine-grained ranking tasks and differ significantly from large language models in both training paradigms and architectural structures. Empirically, their weights and activations often exhibit high magnitudes and large variances, making these models more sensitive to quantization-induced perturbations. From a systems perspective, classical recommender inference workloads are frequently memory or control bound and exhibit relatively low hardware utilization. As a result, even when hardware platforms support low-precision computation, the practical end-to-end gains may be limited. Recent advances in generative recommendation models have begun to narrow this gap. OneRec introduces a unified generative framework that integrates retrieval and ranking, and subsequent extensions such as OneRec-V2 further refine this paradigm through architectural scaling and training improvements.
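This sensitivity can be made concrete with a toy experiment (illustrative numbers, not from the paper): under per-tensor scaling, a single high-magnitude outlier stretches the quantization range and coarsens the step for every other value.

```python
# Why high-magnitude outliers hurt low-precision quantization (toy example).

def quant_error(values, levels=256):
    """Uniform symmetric quantization with `levels` steps; returns the mean
    absolute rounding error over the tensor."""
    amax = max(abs(v) for v in values)
    step = 2 * amax / (levels - 1)
    deq = [round(v / step) * step for v in values]
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

bulk = [x / 100.0 for x in range(-50, 51)]   # well-controlled activations
with_outlier = bulk + [80.0]                  # one high-variance outlier spike

err_controlled = quant_error(bulk)
err_outlier = quant_error(with_outlier)
# One outlier inflates amax from 0.5 to 80, so the step (and hence the error
# on every other value) grows by roughly 160x.
```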

Core Problem

Applying low-precision quantized inference in recommender systems has been challenging due to significant differences in numerical behavior and hardware utilization compared to traditional models, limiting the practical benefits of low-precision computation. Traditional recommendation models' weights and activations often exhibit high magnitudes and large variances, making them more sensitive to quantization-induced perturbations. Additionally, recommendation workloads frequently suffer from limited hardware utilization, restricting the practical gains of low-precision computation. These numerical and system factors have historically hindered the effective deployment of low-precision inference in traditional recommendation pipelines.

Innovation

The core innovation of this study lies in successfully applying FP8 quantized inference to recommender systems and demonstrating its effectiveness in OneRec-V2. Compared to traditional recommendation models, OneRec-V2's improvements in numerical behavior and hardware utilization make low-precision computation feasible. This innovation provides new insights for optimizing recommender systems. Specifically, the study develops an FP8 post-training quantization framework combined with an optimized inference infrastructure, achieving significant latency reduction and throughput improvement. By analyzing the statistical distribution of weights and activations, it was found that OneRec-V2's numerical behavior is closer to that of large language models, with a more compute-intensive inference pattern and higher hardware utilization.

Methodology

  • Developed an FP8 post-training quantization framework combined with an optimized inference infrastructure, achieving significant latency reduction and throughput improvement.

  • Analyzed the statistical distribution of weights and activations, finding that OneRec-V2's numerical behavior is closer to that of large language models, with a more compute-intensive inference pattern and higher hardware utilization.

  • Adopted a post-training quantization (PTQ) approach to introduce low-precision computation into the inference stage of OneRec-V2 without modifying the model architecture or training procedure. Quantization is applied only to the most computation-intensive operators, namely the Linear layers (including the qkvo projection layers in Attention and the linear transformations in Dense FFN) and the grouped GEMM operations in Sparse MoE. Other numerically sensitive or less compute-dominant components remain in their original precision to control potential numerical risks.
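The selective policy described above can be sketched as a simple dispatch table (the operator names below are hypothetical; the actual OneRec-V2 operator set is not spelled out here):

```python
# Sketch of a selective-quantization policy: only compute-dominant matmul-style
# operators are lowered to FP8; numerically sensitive ops keep their original
# precision. Operator names are illustrative placeholders.

FP8_ELIGIBLE = {
    "attn_qkv_proj", "attn_out_proj",  # qkvo projections in Attention
    "ffn_linear",                      # linear transformations in Dense FFN
    "moe_grouped_gemm",                # grouped GEMM in Sparse MoE
}

def precision_for(op_name: str) -> str:
    """Return the compute precision chosen for an operator."""
    return "fp8" if op_name in FP8_ELIGIBLE else "fp16"

plan = {op: precision_for(op) for op in
        ["embedding_lookup", "attn_qkv_proj", "layer_norm",
         "moe_grouped_gemm", "softmax", "ffn_linear"]}
```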

Experiments

The experimental design includes both offline and online performance evaluations of the OneRec-V2 model. In offline experiments, system performance is measured in terms of end-to-end latency and throughput. The baseline system performs inference in FP16, while the optimized system applies post-training quantization to computation-dominant Linear layers. Online A/B testing is conducted in real production environments to verify the impact of low-precision inference on recommendation quality. The results show that FP8 quantized inference achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. The combination of infrastructure upgrades, low-precision computation, and operator-level optimizations is key to these improvements.
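As a worked example of what these relative gains mean (the baseline numbers below are hypothetical; the paper reports only relative improvements):

```python
# Hypothetical baseline figures, used only to illustrate the reported
# relative improvements (49% lower latency, 92% higher throughput).
baseline_latency_ms = 100.0
baseline_qps = 1000.0

fp8_latency_ms = baseline_latency_ms * (1 - 0.49)  # 49% latency reduction
fp8_qps = baseline_qps * (1 + 0.92)                # 92% throughput increase
```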

Results

The experimental results show that FP8 quantized inference achieved a 49% reduction in end-to-end inference latency and a 92% increase in throughput for OneRec-V2. The combination of infrastructure upgrades, low-precision computation, and operator-level optimizations is key to these improvements. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics. These results indicate that as recommender systems evolve towards the paradigms of large language models, low-precision computation techniques can be effectively adapted to large-scale recommendation workloads. This is significant not only for academia but also provides new optimization directions for the industry, particularly in terms of hardware utilization and inference efficiency.

Applications

The application scenarios of this study include the optimization of large-scale recommender systems, particularly in terms of hardware utilization and inference efficiency. By transferring low-precision techniques from the domain of large language models to recommender systems, the study addresses pain points related to numerical behavior and hardware utilization in traditional recommendation models. This technology can be directly applied to recommender systems that require efficient inference, such as short video recommendations and personalized advertising.

Limitations & Outlook

The current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4. Therefore, future work can explore more aggressive low-precision quantization settings to further improve efficiency. Additionally, the study can be extended to other generative recommendation models to verify the generalizability of quantization techniques. More generic infrastructure can also be developed to reduce dependency on specific hardware and systems, enhancing the portability of the solution.

Plain Language (Accessible to non-experts)

Imagine a factory that needs to produce a large number of products. Traditionally, the production line uses high-precision machines to ensure each product is flawless, but this requires a lot of time and resources. Now, the factory introduces a new method, using low-precision machines to speed up production. These machines, although slightly less precise, still produce acceptable quality products because they focus on the critical steps of production rather than every detail. This is similar to applying low-precision quantized inference in recommender systems, where reducing computational precision improves efficiency while ensuring the final recommendation quality remains unaffected. In this way, the factory can produce more products in less time to meet market demand. Similarly, recommender systems can process more data in less time, providing faster recommendation services. The key to this approach is finding the balance between precision and efficiency, ensuring that while efficiency is improved, product quality still meets customer expectations.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super cool game where your task is to recommend things your friends might like. Traditionally, you might spend a lot of time making sure each recommendation is perfect, like using a high-precision telescope to look at stars. But this time, we have a new tool, like a fast microscope, that lets you find those important stars quicker! Although this microscope isn't as precise as the telescope, it helps you find more stars in less time, so you can complete your task faster. This new tool is like the low-precision quantization technique we use in recommender systems, which makes the system run faster while ensuring the recommendation quality doesn't drop. This way, you can recommend more things your friends will like in less time. Isn't that cool?

Glossary

Quantized Inference

Quantized inference is a technique that reduces computation and memory costs by representing weights and activations with lower numerical precision, often used to improve the efficiency of large-scale neural networks.

In this paper, quantized inference is used to enhance the inference efficiency of OneRec-V2.

FP8

FP8 is a low-precision numerical format that uses 8 bits to represent floating-point numbers, significantly reducing computation and storage costs while maintaining a certain level of precision.

The paper develops an FP8 post-training quantization framework for OneRec-V2 inference.

Post-Training Quantization

Post-training quantization is a technique applied after model training, reducing the numerical precision of model parameters and activations to improve inference efficiency.

The paper adopts post-training quantization to introduce low-precision computation into the inference stage of OneRec-V2.
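A minimal calibration sketch, assuming the common amax-observer workflow (not the paper's exact code): run calibration batches through the model, track the running maximum absolute activation per tensor, then freeze one scale for deployment.

```python
# Minimal PTQ calibration sketch: an observer records the running max
# absolute activation ("amax") and derives a fixed FP8 scaling factor.
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

class AmaxObserver:
    def __init__(self):
        self.amax = 0.0

    def observe(self, batch):
        """Update the running amax with one calibration batch."""
        self.amax = max(self.amax, max(abs(x) for x in batch))

    def scale(self):
        """Frozen per-tensor scale mapping amax onto the FP8 range."""
        return E4M3_MAX / self.amax if self.amax else 1.0

obs = AmaxObserver()
for batch in ([0.1, -2.0, 0.7], [3.5, -0.2], [1.1, -3.1]):
    obs.observe(batch)
# After calibration the scale is fixed; inference never re-measures amax.
```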

Generative Recommendation

Generative recommendation is a paradigm that formulates recommendation tasks as conditional sequence generation, integrating retrieval and ranking for end-to-end optimization.

OneRec-V2 adopts a generative recommendation paradigm to enhance the efficiency of recommender systems.

Hardware Utilization

Hardware utilization refers to the efficiency with which computing resources are used when performing tasks; higher hardware utilization typically means higher computational efficiency.

OneRec-V2's inference pattern is more compute-intensive, resulting in higher hardware utilization.

Linear Layer

A linear layer is a basic neural network layer typically used to perform linear transformations, such as matrix multiplication.

In the paper, quantization is applied only to the most computation-intensive operators, namely the Linear layers.

Grouped GEMM

Grouped GEMM batches many independent matrix multiplications, often of different shapes, into a single kernel launch; it is widely used for per-expert computation in sparse MoE layers.

In the paper, grouped GEMM operations are a focus of quantization.
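A pure-Python sketch of the idea (a real grouped GEMM kernel fuses these independent multiplications into one launch; here we simply loop over the groups):

```python
# Grouped GEMM sketch: each group is an independent (A_i, B_i) pair, e.g. one
# MoE expert's tokens multiplied by that expert's weight matrix.

def matmul(a, b):
    """Plain dense matrix multiply on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def grouped_gemm(groups):
    """groups: list of (A, B) pairs with per-group shapes; returns each C_i."""
    return [matmul(a, b) for a, b in groups]

out = grouped_gemm([
    ([[1, 2]], [[3], [4]]),                # (1x2) @ (2x1) -> [[11]]
    ([[1, 0], [0, 1]], [[5, 6], [7, 8]]),  # identity @ B   -> B
])
```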

TensorCore

TensorCore is a hardware unit specifically designed to accelerate matrix operations, significantly improving the computational efficiency of deep learning models.

The paper uses FP8 TensorCore multiplication to enhance computational efficiency.

MoE (Mixture of Experts)

MoE is a model architecture that improves computational efficiency by selectively activating parts of the expert network.

OneRec-V2 employs an MoE architecture to enhance inference efficiency.
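A sketch of the top-k routing at the heart of MoE (illustrative; OneRec-V2's router details are not specified here). Only the k highest-scoring experts run for each token, which is what makes MoE compute-efficient:

```python
# Top-k expert routing sketch: pick the k experts with the highest gate scores.

def route_top_k(gate_scores, k=2):
    """Return indices of the k experts with the highest gate scores."""
    return sorted(range(len(gate_scores)),
                  key=lambda i: gate_scores[i], reverse=True)[:k]

# A token's gate scores over 4 experts; experts 1 and 3 are activated.
chosen = route_top_k([0.1, 0.7, 0.05, 0.15], k=2)
```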

A/B Testing

A/B testing is an experimental method used to evaluate the effect of a change by comparing the performance of two versions, widely used for product optimization.

The paper verifies the impact of FP8 inference on recommendation quality through online A/B testing.
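A/B readouts of rate metrics are commonly checked with a two-proportion z-test; the sketch below is generic statistics, not the paper's evaluation pipeline, with made-up counts for the two arms.

```python
import math

# Two-proportion z-test: is the rate in arm B different from arm A?

def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the normal CDF expressed with erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Near-identical click rates in control and FP8 arms -> no detectable change.
z, p = two_proportion_z(5000, 100000, 5005, 100000)
```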

Open Questions (Unanswered questions from this research)

  1. The current study only explores FP8 inference and does not investigate more aggressive low-precision settings such as INT8, FP6, or FP4; future work could explore these settings to further improve efficiency.
  2. The solution relies on substantial infrastructure support and system-level customization, which may limit its reproducibility and portability in production environments lacking advanced inference stacks or sufficient engineering resources.
  3. The study only experiments on OneRec-V2 and does not cover a broader set of generative recommendation architectures, leaving it unclear to what extent the observed quantization properties and deployment benefits generalize across different model designs.
  4. Although the study demonstrates the effectiveness of low-precision inference in recommender systems, it remains unclear how applicable it is to other tasks, such as natural language processing or computer vision.
  5. The study does not explore the impact of different quantization strategies on model performance in detail; future work could conduct more granular analyses to optimize quantization strategies.

Applications

Immediate Applications

Short Video Recommendation

By applying low-precision quantization techniques, short video recommendation systems can process more data in less time, providing faster recommendation services.

Personalized Advertising

Low-precision quantization techniques can improve the efficiency of advertising recommendation systems, enabling them to respond to user needs more quickly and increase the accuracy of ad placements.

E-commerce Recommendation

In e-commerce platforms, low-precision quantization techniques can improve the response speed of recommendation systems, helping users find interesting products more quickly.

Long-term Vision

Cross-Domain Recommender Systems

Low-precision quantization techniques can be extended to recommender systems in other domains, such as music and news, to enhance overall recommendation efficiency.

Smart Home Recommendations

In the future, low-precision quantization techniques can be applied to recommendation systems in smart home devices, improving device response speed and user experience.

Abstract

Quantized inference has demonstrated substantial system-level benefits in large language models while preserving model quality. In contrast, reliably applying low-precision quantization to recommender systems remains challenging in industrial settings. This difficulty arises from differences in training paradigms, architectural patterns, and computational characteristics, which lead to distinct numerical behaviors in weights and activations. Traditional recommender models often exhibit high-magnitude and high-variance weights and activations, making them more sensitive to quantization-induced perturbations. In addition, recommendation workloads frequently suffer from limited hardware utilization, which restricts the practical gains of low-precision computation. In this work, we revisit low-precision inference in the context of generative recommendation. Through empirical distribution analysis, we show that the weight and activation statistics of OneRec-V2 are significantly more controlled and closer to those of large language models than traditional recommendation models. Moreover, OneRec-V2 exhibits a more compute-intensive inference pattern with substantially higher hardware utilization, enabling greater end-to-end throughput gains with low-precision computation. Leveraging this property, we develop an FP8 post-training quantization framework and integrate it into an optimized inference infrastructure. The proposed joint optimization achieves a 49% reduction in end-to-end inference latency and a 92% increase in throughput. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics. These results suggest that as recommender systems evolve toward the paradigms of large language models, algorithm-level and system-level optimization techniques established in the LLM domain can be effectively adapted to large-scale recommendation workloads.

