RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On-Device LLM Inference
RAMP uses reinforcement learning for adaptive mixed-precision quantization, reducing model size by 6% and improving quality by 1-3% over uniform 4-bit baselines for on-device LLM inference.
Key Findings
Methodology
RAMP employs the Soft Actor-Critic (SAC) reinforcement learning framework to learn per-layer bit-width assignments that minimize perplexity under a global bit budget. The policy is conditioned on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, RAMP introduces Scale Folding, a preconditioning technique that migrates activation outliers into weights.
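The exact feature set behind the 11-dimensional embedding is not spelled out in this summary. A minimal sketch of what such a per-layer state vector could look like is below; the specific statistics chosen here are illustrative assumptions, not the published feature list.

```python
import numpy as np

def layer_embedding(layer_idx: int, num_layers: int,
                    weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Illustrative 11-dimensional per-layer state for the quantization policy."""
    w = np.abs(weights.ravel())
    a = np.abs(activations.ravel())
    feats = [
        layer_idx / num_layers,                      # structural: relative depth
        np.log10(w.size),                            # structural: parameter count
        float(weights.std()),                        # weight spread
        float(w.max()),                              # weight outlier magnitude
        float(np.percentile(w, 99)),                 # weight tail
        float(np.mean(w > 3 * weights.std())),       # weight outlier fraction
        float(activations.std()),                    # activation spread
        float(a.max()),                              # activation outlier magnitude
        float(np.percentile(a, 99)),                 # activation tail
        float(np.mean(a > 3 * activations.std())),   # activation outlier fraction
        float(a.max() / (np.median(a) + 1e-8)),      # activation outlier ratio
    ]
    return np.asarray(feats, dtype=np.float32)       # shape (11,)
```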
Key Results
- On Llama-2-7B, RAMP achieves a perplexity of 5.54 at 3.68GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90GB) and GPTQ by 6% in size and 1-3% in quality.
- A policy trained only on Llama-2-7B generalizes zero-shot to Llama-2-13B and Mistral-7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural.
- The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
Significance
RAMP has significant implications for both academia and industry. It addresses the deployment bottleneck of large language models on resource-constrained hardware through adaptive mixed-precision quantization, improving inference efficiency while preserving quality. The method reduces model memory footprint, speeds up inference, and lowers deployment costs. Furthermore, because RAMP's policy transfers across models, it cuts the time and compute needed for per-model optimization, paving the way for practical deployment of large-scale models.
Technical Contribution
RAMP offers several key technical contributions. First, it introduces a reinforcement learning-based adaptive mixed-precision quantization strategy, moving beyond the uniform bit-width allocation of traditional methods. Second, the proposed Scale Folding technique handles activation outliers, supporting stable sub-4-bit quantization. Finally, RAMP's policy generalizes across models, substantially reducing the complexity and computational cost of model optimization.
Novelty
RAMP is the first to apply reinforcement learning to adaptive mixed-precision quantization for large language models, overcoming the uniform bit-width allocation of existing methods. Compared with prior quantization approaches, its 11-dimensional layer embedding and Scale Folding technique yield more efficient compression and better inference quality.
Limitations
- RAMP's performance under extreme low bit-widths (e.g., below 3 bits) has not been fully validated, which may lead to performance degradation in some scenarios.
- The method requires significant computational resources and time during training, which may not be suitable for all application scenarios.
- Although RAMP performs well across various models, its applicability to larger-scale models (e.g., GPT-3.5) still needs further investigation.
Future Work
Future research directions include further optimizing RAMP's strategy to support lower bit-width quantization and exploring its application to larger-scale models. Additionally, integrating RAMP with other model compression techniques (e.g., pruning and knowledge distillation) could achieve more efficient model compression and inference performance.
AI Executive Summary
The advent of large language models (LLMs) has revolutionized natural language processing, but their massive memory requirements pose a bottleneck for deployment on resource-constrained hardware. Existing quantization methods typically enforce uniform bit-widths across layers, leading to suboptimal accuracy-efficiency trade-offs.
RAMP (Reinforcement Adaptive Mixed Precision) introduces a Soft Actor-Critic (SAC) framework from reinforcement learning to achieve adaptive mixed-precision quantization. This method learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy is conditioned on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales.
A core innovation of RAMP is the Scale Folding technique, a preconditioning method that migrates activation outliers into weights via per-channel scaling and normalization layer compensation, supporting stable sub-4-bit quantization. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence.
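Beyond "asymmetric penalties and budget cliffs", the exact reward is not given in this summary. A hedged sketch of one way such a reward could be shaped follows; all coefficients and the functional form are illustrative assumptions.

```python
def reward(ppl: float, ppl_fp16: float, avg_bits: float, budget: float = 4.25,
           quality_weight: float = 10.0, cliff: float = 5.0, slack_bonus: float = 0.1) -> float:
    """Sketch of a quality-prioritized reward with asymmetric penalties and a budget cliff."""
    # Quality term: penalize any perplexity degradation relative to the FP16 model.
    degradation = max((ppl - ppl_fp16) / ppl_fp16, 0.0)
    r = -quality_weight * degradation

    if avg_bits > budget:
        # Budget cliff: a large, discontinuous penalty once the global bit budget is exceeded.
        r -= cliff + (avg_bits - budget)
    else:
        # Asymmetric: only a small bonus for coming in under budget.
        r += slack_bonus * (budget - avg_bits)
    return r
```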
On Llama-2-7B, RAMP achieves a perplexity of 5.54 at 3.68GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90GB) and GPTQ by 6% in size and 1-3% in quality. RAMP's policy can generalize across different models, reducing the time and resource consumption for model optimization.
The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance. Although RAMP's performance under extreme low bit-widths has not been fully validated, it holds promising potential for practical applications of large language models.
Deep Analysis
Background
Large language models (LLMs) such as GPT-4 and Llama-2 have achieved state-of-the-art performance in tasks like machine translation, code generation, and multi-step reasoning. However, their scale and memory requirements pose a significant challenge for deployment on resource-constrained hardware. Existing quantization methods, such as GPTQ and AWQ, typically enforce uniform bit-widths across layers, ignoring substantial variation in layer sensitivity to quantization noise and yielding suboptimal accuracy-efficiency trade-offs. These methods also require costly per-model optimization and calibration, and their results do not transfer across models. Mixed-precision quantization, while theoretically superior to uniform quantization, introduces kernel fragmentation that can reduce inference speed.
Core Problem
The deployment of large language models faces a significant bottleneck due to the growing disparity between model memory requirements and available hardware capacity, especially on edge devices and cost-sensitive cloud environments. Existing quantization methods exhibit suboptimal accuracy-efficiency trade-offs, lack transferability across models, and face kernel fragmentation issues in mixed-precision quantization. Addressing these challenges while maintaining model performance is a critical problem that needs to be solved.
Innovation
RAMP's core innovations include:
1. Introducing a Soft Actor-Critic (SAC) reinforcement learning framework for adaptive mixed-precision quantization, moving beyond the uniform bit-width allocation of traditional methods (a sketch of the action-to-bit-width mapping follows this list).
2. Proposing the Scale Folding technique, which migrates activation outliers into weights via per-channel scaling and normalization layer compensation, supporting stable sub-4-bit quantization.
3. Employing an 11-dimensional embedding strategy to enable zero-shot transfer across model families and scales, significantly reducing the complexity and computational cost of model optimization.
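As referenced in point 1, a minimal sketch of how a continuous SAC action could be mapped to a discrete per-layer bit-width is given below. The action space and discretization are assumptions for illustration (the experiments report a 3-5 bit range), not the paper's exact formulation.

```python
import numpy as np

BIT_CHOICES = (3, 4, 5)  # bit-width range reported in the experiments

def action_to_bits(action: float) -> int:
    """Map a tanh-squashed SAC action in [-1, 1] to a discrete bit-width."""
    idx = int(round((action + 1.0) / 2.0 * (len(BIT_CHOICES) - 1)))
    return BIT_CHOICES[max(0, min(idx, len(BIT_CHOICES) - 1))]

# Example: one action per transformer layer yields a per-layer allocation.
actions = np.random.uniform(-1.0, 1.0, size=32)
allocation = [action_to_bits(a) for a in actions]    # e.g. [4, 3, 5, 4, ...]
```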
Methodology
RAMP's methodology includes the following steps:
- Using the SAC framework to learn per-layer bit-width assignments that minimize perplexity.
- Conditioning the policy on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors to enable zero-shot transfer across models.
- Introducing the Scale Folding technique to migrate activation outliers into weights, supporting stable sub-4-bit quantization (see the sketch after this list).
- Employing a quality-prioritized reward with asymmetric penalties and budget cliffs to drive rapid convergence.
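As noted in the Scale Folding step above, the technique migrates activation outliers into weights via per-channel scaling with normalization-layer compensation. A minimal sketch in the spirit of per-channel smoothing (as in SmoothQuant, cited in the references) is shown below; the scale formula, the alpha hyperparameter, and the function names are assumptions, not RAMP's published implementation.

```python
import torch

def scale_fold(linear_weight: torch.Tensor,
               norm_weight: torch.Tensor,
               act_absmax: torch.Tensor,
               alpha: float = 0.5) -> tuple[torch.Tensor, torch.Tensor]:
    """Illustrative Scale Folding step: absorb per-channel activation scales into weights.

    Each input channel's activations are divided by a scale s while the matching
    weight column is multiplied by s; the inverse scale is folded into the
    preceding normalization layer's weight so the network output is unchanged.
    """
    # Per-input-channel scale: large where activations have outliers.
    w_absmax = linear_weight.abs().amax(dim=0)                                   # per input channel
    s = (act_absmax.clamp(min=1e-5) ** alpha) / (w_absmax.clamp(min=1e-5) ** (1 - alpha))

    folded_linear = linear_weight * s.unsqueeze(0)   # migrate outlier scale into weights
    folded_norm = norm_weight / s                    # compensate in the norm layer
    return folded_linear, folded_norm
```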
Experiments
The experiments cover Llama-2-7B, Llama-2-13B, and Mistral-7B, using perplexity on the WikiText-2 dataset for evaluation. Baselines include uniform 4-bit AWQ and GPTQ. Key hyperparameters include the per-layer bit-width range (3-5 bits) and the global bit budget (4.25 bits). Ablation studies analyze the contribution of individual components, such as Scale Folding, to model performance.
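"Effective bits" is not formally defined in this summary; assuming it means the parameter-weighted average bit-width across layers, a small sketch of how an allocation is checked against the 4.25-bit global budget follows.

```python
def effective_bits(allocation, param_counts):
    """Parameter-weighted average bit-width ("effective bits") of a per-layer allocation."""
    total_bits = sum(b * n for b, n in zip(allocation, param_counts))
    total_params = sum(param_counts)
    return total_bits / total_params

# Toy example: three layers with different sizes and bit-widths.
alloc = [3, 4, 5]
params = [4_000_000, 16_000_000, 4_000_000]
avg = effective_bits(alloc, params)          # (12 + 64 + 20) M / 24 M = 4.0
assert avg <= 4.25                           # within the global bit budget
```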
Results
RAMP achieves a perplexity of 5.54 at 3.68GB (3.65 effective bits) on Llama-2-7B, outperforming uniform 4-bit AWQ (5.60 at 3.90GB) and GPTQ by 6% in size and 1-3% in quality. RAMP's policy can generalize across different models, reducing the time and resource consumption for model optimization. Ablation studies show that the Scale Folding technique significantly improves the stability of low-bit-width quantization.
Applications
RAMP can be directly applied to edge devices and cost-sensitive cloud environments, significantly reducing the memory footprint of large language models and improving inference speed. Its zero-shot transfer capability makes deployment across different models more efficient, reducing the time and resource consumption for model optimization. In privacy-sensitive applications, RAMP supports efficient on-device inference, protecting user data.
Limitations & Outlook
RAMP's performance under extreme low bit-widths (e.g., below 3 bits) has not been fully validated, which may lead to performance degradation in some scenarios. Additionally, the method requires significant computational resources and time during training, which may not be suitable for all application scenarios. Although RAMP performs well across various models, its applicability to larger-scale models (e.g., GPT-3.5) still needs further investigation. Future research directions include further optimizing RAMP's strategy to support lower bit-width quantization and exploring its application to larger-scale models.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking. You have a large pot (a large language model), but your kitchen space is limited (hardware resources are constrained). You need to shrink the pot (quantization) but without affecting the taste of the food (model performance). Existing methods are like using a one-size-fits-all lid (uniform bit-width allocation), regardless of the pot's size, which might cause the food to spill over (performance degradation).
RAMP is like a smart lid that can automatically adjust based on the pot's size (adaptive mixed-precision quantization). It uses a smart algorithm called SAC (reinforcement learning) to learn how to adjust the lid size, ensuring the food doesn't spill over while maximizing kitchen space.
Moreover, RAMP has a special trick called Scale Folding, which is like cleverly placing extra ingredients (activation outliers) at the bottom of the pot (into weights), ensuring the taste remains unchanged even if the lid is small (low bit-width).
Through this method, RAMP not only keeps your kitchen tidy (reduces memory footprint) but also helps you cook faster (improves inference speed), without having to readjust the lid every time (zero-shot transfer).
ELI14 (Explained like you're 14)
Hey, imagine you're playing a super cool game. This game has a huge map (a large language model), but your game console doesn't have enough memory (hardware resources are limited), so you need to shrink the map (quantization) without affecting the game experience (model performance).
Existing methods are like using the same shrink ratio for all maps (uniform bit-width allocation), regardless of the map's details, which might cause some important details to be lost (performance degradation).
RAMP is like a super smart game assistant that can automatically adjust the shrink ratio based on the map's details (adaptive mixed-precision quantization). It uses a smart algorithm called SAC (reinforcement learning) to learn how to adjust the shrink ratio, ensuring the game experience remains unchanged while maximizing the console's memory.
Moreover, RAMP has a special trick called Scale Folding, which is like cleverly hiding extra details (activation outliers) in the map's background (into weights), ensuring the game experience remains unchanged even if the shrink ratio is large (low bit-width).
Through this method, RAMP not only makes your console run smoother (reduces memory footprint) but also helps you load maps faster (improves inference speed), without having to readjust the shrink ratio every time (zero-shot transfer).
Glossary
Reinforcement Learning
A machine learning method in which an agent learns a policy by interacting with an environment to maximize cumulative reward.
Used to learn per-layer bit-width allocation strategies.
Adaptive Mixed Precision Quantization
A quantization method that dynamically adjusts bit-width allocation based on each layer's sensitivity to minimize perplexity under a global bit budget.
Core method of RAMP, achieved through reinforcement learning.
Perplexity
A metric used to evaluate the performance of language models, with lower values indicating better models. Used in RAMP to assess the performance of quantized models.
Used to evaluate the performance of quantized models on the WikiText-2 dataset.
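For reference, the standard definition of perplexity over a token sequence x_1..x_N under a model p_theta is:

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\Big(-\tfrac{1}{N}\sum_{i=1}^{N}\log p_\theta\big(x_i \mid x_{<i}\big)\Big)
```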
Scale Folding
A preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization layer compensation.
Used to support stable sub-4-bit quantization.
SAC (Soft Actor-Critic)
A reinforcement learning algorithm that combines policy and value-function learning, offering high sample efficiency.
Used to learn per-layer bit-width allocation strategies.
Zero-shot Transfer
The ability to apply a policy trained on one model directly to other models without retraining.
RAMP's policy can generalize across different models with zero-shot transfer.
HALO Pipeline
A process for exporting quantization strategies to GGUF format, supporting kernel-free inference on various hardware.
Used to export and deploy RAMP's strategies.
Activation Outliers
Values in a layer's activation distribution whose magnitude far exceeds that of typical activations.
RAMP addresses activation outliers through the Scale Folding technique.
GGUF Format
A format for exporting quantized models, supporting kernel-free inference on various hardware.
RAMP's HALO pipeline exports strategies to GGUF format.
Budget Cliff
A reward-shaping term that applies a sharp penalty as soon as an allocation exceeds the global bit budget.
Used in RAMP's reward mechanism to drive rapid convergence.
Open Questions (Unanswered questions from this research)
1. RAMP's performance under extreme low bit-widths (e.g., below 3 bits) has not been fully validated, which may lead to performance degradation in some scenarios. Further research is needed to support lower bit-width quantization while maintaining model performance.
2. Although RAMP performs well across various models, its applicability to larger-scale models (e.g., GPT-3.5) still needs further investigation. Exploring how to apply RAMP to larger-scale models and verifying its performance is necessary.
3. RAMP requires significant computational resources and time during training, which may not be suitable for all application scenarios. Research is needed to optimize RAMP's training process to reduce resource and time consumption.
4. While RAMP's policy can generalize across different models with zero-shot transfer, some specific models may still require fine-tuning. Research is needed to further enhance RAMP's transferability to reduce the complexity of model optimization.
5. RAMP's Scale Folding technique may not fully address activation outliers in some cases. Further research is needed to improve this technique to better support low-bit-width quantization.
Applications
Immediate Applications
Edge Device Deployment
RAMP can be directly applied to edge devices, such as mobile and IoT devices, significantly reducing the memory footprint of large language models and improving inference speed.
Cost-sensitive Cloud Environments
In cloud environments, RAMP can reduce memory bandwidth and capacity requirements, lowering inference costs, making it suitable for cost-sensitive applications.
Privacy-sensitive Applications
RAMP supports efficient on-device inference, protecting user data, making it suitable for privacy-sensitive applications, such as healthcare and finance.
Long-term Vision
Proliferation of Large-scale Models
RAMP's zero-shot transfer capability makes deployment across different models more efficient, reducing the time and resource consumption for model optimization, potentially driving the proliferation of large-scale models.
Integration with Other Model Compression Techniques
Future integration of RAMP with other model compression techniques (e.g., pruning and knowledge distillation) could achieve more efficient model compression and inference performance, advancing AI technology.
Abstract
Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama-2-7B, RAMP achieves 5.54 perplexity at 3.68GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90GB) and GPTQ by 6% in size and 1-3% in quality. Critically, a policy trained only on Llama-2-7B generalizes zero-shot to Llama-2-13B and Mistral-7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
References (20)
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula et al.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao, Ji Lin, Mickael Seznec et al.
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk et al.
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni et al.
PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras et al.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, T. Hoefler et al.
AutoQ: Automated Kernel-Wise Neural Network Quantization
Qian Lou, Feng Guo, Lantao Liu et al.
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury et al.
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang et al.
Mistral 7B
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch et al.
Progressive Mixed-Precision Decoding for Efficient LLM Inference
H. Chen, Fuwen Tan, Alexandros Kouris et al.
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Wei Tao, Haocheng Lu, Xiaoyang Qu et al.
Learning Efficient Convolutional Networks through Network Slimming
Zhuang Liu, Jianguo Li, Zhiqiang Shen et al.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Tuomas Haarnoja, Aurick Zhou, P. Abbeel et al.
Neural Architecture Search with Reinforcement Learning
Barret Zoph, Quoc V. Le
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, S. Kligys, Bo Chen et al.
SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim, Coleman Hooper, A. Gholami et al.
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Zhen Zheng, Xiaonan Song, Chuanjie Liu
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang et al.
HAWQ: Hessian AWare Quantization of Neural Networks With Mixed-Precision
Zhen Dong, Z. Yao, A. Gholami et al.