GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

TL;DR

GSQ achieves high-accuracy low-bit quantization using Gumbel-Softmax sampling, narrowing the accuracy gap with vector- and trellis-quantized methods such as QTIP.

cs.CL 2026-04-21
Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Dan Alistarh
Keywords: quantization, Gumbel-Softmax, scalar quantization, LLMs, low-bit

Key Findings

Methodology

GSQ (Gumbel-Softmax Quantization) is a post-training scalar quantization method that jointly learns per-coordinate grid assignments and per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable.
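
The core idea can be sketched in a few lines. Below is a minimal NumPy illustration, not the authors' implementation: each weight gets learnable logits over a small grid, Gumbel noise is added, and a low-temperature softmax yields a near-one-hot soft assignment. The function name, the toy weights, and the nearest-level logit initialization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_quantize(w, levels, logits, tau=0.5):
    """Softly assign each weight to one of a few grid levels.

    w:      (n,) weights (toy values, for illustration)
    levels: (k,) small symmetric grid, e.g. [-1, 0, 1] for ternary
    logits: (n, k) learnable assignment scores
    tau:    temperature; lower means closer to a hard assignment
    """
    # Gumbel noise makes discrete sampling differentiable in expectation.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(y) / np.exp(y).sum(axis=1, keepdims=True)
    return p @ levels                       # soft quantized weights

w = np.array([-0.9, 0.05, 1.1, -0.02])
levels = np.array([-1.0, 0.0, 1.0])         # ternary grid
# Initialize logits to favor the nearest grid level (round-to-nearest).
logits = -np.abs(w[:, None] - levels[None, :]) * 10.0
w_q = gumbel_softmax_quantize(w, levels, logits, tau=0.1)
```

With logits initialized toward the nearest level and a low temperature, the soft assignment essentially reproduces round-to-nearest; training would then adjust the logits and per-group scales against a reconstruction or task loss.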

Key Results

  • On the Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits. At 2 bits, GSQ improves average zero-shot accuracy over the best scalar baseline (EfficientQAT) by 4.76 points on the 8B model and 4.14 points on the 70B model.
  • At 3 bits, GSQ matches or surpasses all scalar baselines and is essentially on par with QTIP on the 70B model. Notably, these results are obtained without zero-point parameters, indicating that the gains come from better optimization of the discrete assignments.
  • GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, achieving low-bit, near-lossless quantization where vector-quantized methods are difficult to apply.

Significance

The introduction of GSQ is significant for both academia and industry. It substantially improves low-bit quantization accuracy while maintaining compatibility with existing scalar inference kernels. GSQ not only narrows the accuracy gap with more complex vector quantization methods but also demonstrates scalability on large-scale Mixture-of-Experts models, impacting model compression and deployment.

Technical Contribution

GSQ's technical contribution lies in its innovative use of Gumbel-Softmax relaxation, significantly enhancing scalar quantization accuracy at low bit-widths. Compared to existing scalar quantization methods, GSQ not only achieves breakthroughs in accuracy but also maintains compatibility with existing inference kernels. Additionally, GSQ's successful application to large-scale models demonstrates its engineering feasibility and scalability.

Novelty

GSQ is the first to apply Gumbel-Softmax relaxation in scalar quantization, significantly improving quantization accuracy at low bit-widths. Compared to existing vector and scalar quantization methods, GSQ offers higher accuracy and better scalability while maintaining simplicity and compatibility.

Limitations

  • GSQ may encounter instability in certain model layers, such as the down_proj of the second layer in the Llama-3.1-8B model, where compression is less effective.
  • While GSQ performs well in most cases, its performance at extremely low bit-widths (e.g., 1 bit) may not match some specially designed vector quantization methods.
  • The training process of GSQ requires additional computational resources, especially when handling large-scale models, which may increase training time and cost.

Future Work

Future research directions include further optimizing the GSQ training process to reduce computational resource requirements and exploring its performance at even lower bit-widths. Additionally, investigating how GSQ can be applied to more types of models and tasks is a promising direction. The community can further study GSQ's performance on different hardware platforms and explore ways to improve its computational efficiency without compromising accuracy.

AI Executive Summary

In the inference process of large language models (LLMs), memory and bandwidth costs have become a major challenge. To address this issue, weight quantization has become a standard method for efficient deployment. Existing quantization methods are mainly divided into two categories: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely used but plateau in accuracy at 3-4 bits per parameter; and 'second-generation' vector or trellis quantized methods, such as QTIP, GPTVQ, and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and scale. In this paper, we introduce a new scalar quantization method, GSQ (Gumbel-Softmax Quantization), which jointly learns per-coordinate grid assignments and per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while maintaining full compatibility with existing scalar inference kernels. We further demonstrate that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.

Deep Analysis

Background

LLM inference is increasingly constrained by memory and bandwidth costs, which has made weight quantization a standard tool for efficient deployment. Existing methods fall into two camps: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely used but plateau in accuracy at 3-4 bits per parameter, and 'second-generation' vector- or trellis-quantized methods, such as QTIP, GPTVQ, and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and scale. As large language models see ever wider application, reducing computational and storage costs without sacrificing model performance has become an important research direction.

Core Problem

Existing scalar quantization methods perform poorly at low bit-widths, while vector quantization methods, although advantageous in accuracy, are difficult to implement and scale. How to improve low-bit quantization accuracy while maintaining compatibility with existing scalar inference kernels has become an urgent problem to solve. Especially in the application of large-scale models, achieving efficient quantization and deployment without affecting model performance is a challenging problem.

Innovation

The core innovation of GSQ lies in its use of Gumbel-Softmax relaxation, which significantly enhances scalar quantization accuracy at low bit-widths:

  • GSQ uses Gumbel-Softmax relaxation to jointly learn per-coordinate grid assignments and per-group scales.
  • GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while maintaining full compatibility with existing scalar inference kernels.
  • GSQ demonstrates scalability on trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.

Methodology

The detailed steps of the GSQ method are as follows:

  • Use a Gumbel-Softmax relaxation to make the discrete grid-selection process differentiable.
  • Jointly learn per-coordinate grid assignments and per-group scales.
  • Match the cardinality of the relaxation to the small number of levels available in the target bit-width regime, making the relaxation tight and the optimization tractable.
  • Verify GSQ's accuracy improvement at 2 and 3 bits on the Llama-3.1-8B/70B-Instruct models.
  • Verify GSQ's scalability on trillion-scale Mixture-of-Experts models such as Kimi-K2.5.
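
The relaxation-tightening idea can be illustrated with a small NumPy toy (hypothetical names, and a noise-free "expectation" view rather than sampled Gumbel noise): lowering the softmax temperature collapses soft grid assignments toward hard round-to-nearest choices on a ternary grid.

```python
import numpy as np

def soft_assign(w, scale, levels, tau):
    """Soft assignment of scaled weights to grid levels (expectation view, no noise)."""
    logits = -((w[:, None] / scale - levels[None, :]) ** 2)  # closer level = higher logit
    y = logits / tau
    y = y - y.max(axis=1, keepdims=True)
    p = np.exp(y) / np.exp(y).sum(axis=1, keepdims=True)
    return scale * (p @ levels)

w = np.array([0.31, -0.64, 0.02])
levels = np.array([-1.0, 0.0, 1.0])   # ternary grid
scale = 0.6
# Anneal the temperature: assignments harden as tau shrinks.
for tau in (1.0, 0.1, 0.001):
    w_q = soft_assign(w, scale, levels, tau)
```

At the final low temperature, the soft assignment matches plain round-to-nearest onto the scaled grid; with only 3-8 levels in play, the softmax over levels stays small and the relaxation tight, which is the regime the method targets.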

Experiments

The experimental design includes testing on the Llama-3.1-8B/70B-Instruct models and the Kimi-K2.5 model:

  • On the Llama models, quantize all non-embedding and non-head linear layers to verify GSQ's accuracy at 2 and 3 bits.
  • On the Kimi-K2.5 model, quantize only non-shared expert weights to verify GSQ's scalability on large-scale models.
  • Use 2-bit and 3-bit weight-only quantization configurations with a group size of 128.
  • Use a symmetric scalar quantizer in which each group shares a single scale value.
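
The symmetric group-wise setup above can be sketched as follows. This is plain round-to-nearest with one scale per group of 128 weights, a baseline-style sketch rather than GSQ's learned assignments, with illustrative function and variable names.

```python
import numpy as np

def groupwise_symmetric_quantize(w, group_size=128, bits=3):
    """Round-to-nearest symmetric quantization, one scale per group, no zero-point."""
    n_levels = 2 ** (bits - 1) - 1                # e.g. levels -3..3 at 3 bits
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / n_levels
    q = np.clip(np.round(groups / scale), -n_levels, n_levels)
    return (q * scale).reshape(-1), scale.squeeze(1)

rng = np.random.default_rng(0)
w = rng.standard_normal(512)                      # toy weight vector
w_hat, scales = groupwise_symmetric_quantize(w, group_size=128, bits=3)
err = np.mean((w - w_hat) ** 2)                   # reconstruction error
```

Because the grid is symmetric and scales are per-group, dequantization is a single multiply per group, which is what keeps this format compatible with standard scalar inference kernels.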

Results

Experimental results show that at 2 bits, GSQ improves average zero-shot accuracy over the best scalar baseline (EfficientQAT) by 4.76 points on Llama-3.1-8B-Instruct and 4.14 points on the 70B model. At 3 bits, GSQ matches or surpasses all scalar baselines and is essentially on par with QTIP on the 70B model. Notably, these results are obtained without zero-point parameters, indicating that the gains come from better optimization of the discrete assignments. GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, achieving low-bit, near-lossless quantization where vector-quantized methods are difficult to apply.

Applications

Application scenarios of GSQ include:

  • Efficient low-bit quantization for large language model inference, reducing memory and bandwidth costs.
  • Compression of large-scale Mixture-of-Experts models, providing new possibilities for model compression and deployment.
  • An efficient quantization solution for applications that must remain compatible with existing scalar inference kernels.

Limitations & Outlook

The limitations of GSQ include:

  • It may encounter instability in certain model layers, such as the down_proj of the second layer in the Llama-3.1-8B model, where compression is less effective.
  • While GSQ performs well in most cases, its performance at extremely low bit-widths (e.g., 1 bit) may not match some specially designed vector quantization methods.
  • The training process of GSQ requires additional computational resources, especially when handling large-scale models, which may increase training time and cost.

Future research directions include further optimizing the GSQ training process to reduce computational resource requirements and exploring its performance at even lower bit-widths.

Plain Language: Accessible to non-experts

Imagine you're in a kitchen cooking. You have a lot of ingredients, but your pot is small and can't fit everything at once. This is like large language models (LLMs) needing a lot of memory and bandwidth during inference. To save space, you need to chop the ingredients into smaller pieces, similar to how quantization compresses model parameters into fewer bits. Existing methods are like knives: some are very sharp (vector quantization) but hard to use, while others are not as sharp (scalar quantization) but easy to handle. GSQ is like a new cutting tool that combines sharpness with convenience, letting you chop quickly and efficiently while keeping the dish tasty (model accuracy). With GSQ, you can fit more ingredients into the pot without changing the pot (maintaining compatibility with existing inference kernels), thus improving quantization accuracy.

ELI14: Explained like you're 14

Hey there, friends! Did you know that in the computer world, we have super-smart programs called 'large language models' that help us with things like chatbots and translations? But these models are like a giant backpack filled with all sorts of stuff, needing lots of space and energy to run. To make them lighter, we need a method called 'quantization' to compress them. Imagine you have a pile of LEGO bricks but can only use a small box to store them. You need to break the bricks into smaller pieces to fit them in. GSQ is a new method that's like a magic tool, helping us compress these bricks better while keeping their shape and color. This way, we can use less space to store more bricks, making our backpack lighter! Isn't that cool?

Glossary

Gumbel-Softmax

Gumbel-Softmax is a technique used to make discrete selection processes differentiable by adding Gumbel noise, allowing discrete choices to be optimized through gradient descent.

Used in GSQ to optimize the selection process of discrete grids.
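
A quick way to see why Gumbel noise works: adding independent Gumbel samples to logits and taking the argmax draws exactly from the softmax distribution over those logits (the Gumbel-max trick). A small NumPy check, illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()   # softmax target distribution

# Gumbel samples via inverse transform: G = -log(-log(U)), U ~ Uniform(0, 1).
g = -np.log(-np.log(rng.uniform(size=(100_000, 3))))
samples = np.argmax(logits + g, axis=1)
freqs = np.bincount(samples, minlength=3) / len(samples)
# freqs should closely match probs.
```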

Scalar Quantization

Scalar quantization is a simple technique that rounds each weight independently to a small uniform grid; it is easy to implement and widely used.

Used in GSQ to achieve low-bit-width weight compression.

Vector Quantization

Vector quantization is a more complex technique that quantizes groups of weights jointly, reducing reconstruction error at low bit-widths but making implementation and scaling harder.

Compared to simple scalar methods, vector quantization has historically offered higher accuracy at low bit-widths; GSQ aims to close this gap while remaining scalar.

Llama-3.1-8B/70B-Instruct

Llama-3.1-8B/70B-Instruct are standard open models, in 8B- and 70B-parameter versions, used to evaluate the GSQ method.

Used in experiments to verify GSQ's accuracy performance at low bit-widths.

Kimi-K2.5

Kimi-K2.5 is a trillion-scale Mixture-of-Experts model used to test GSQ's scalability on large-scale models.

Used in experiments to verify GSQ's application on large-scale models.

EfficientQAT

EfficientQAT is a quantization-aware training method that combines block-wise training of model parameters with a final end-to-end optimization of quantization parameters, making QAT practical for scalar quantization.

Used as a scalar baseline for comparison in experiments.

QTIP

QTIP is a 'second-generation' trellis-quantized method that pushes the accuracy frontier at low bit-widths but is harder to implement and scale.

Used as a vector baseline for comparison in experiments.

GPTQ

GPTQ is a method that uses second-order information to minimize layer-wise quantization error based on the Optimal Brain Surgeon framework.

Used in GSQ to initialize the selection process of discrete grids.

AWQ

AWQ is a quantization method that uses activation statistics to identify and protect a small set of weights.

Used as a comparison baseline in the experiments.

Mixture-of-Experts Model

A Mixture-of-Experts model is an architecture that improves model performance by using multiple expert modules, each responsible for handling specific inputs.

Used in GSQ to verify its scalability on large-scale models.

Open Questions: Unanswered questions from this research

  1. GSQ's performance at extremely low bit-widths (e.g., 1 bit) may not match some specially designed vector quantization methods. How can it be improved in this regime without increasing computational complexity?
  2. GSQ's training requires additional computational resources, especially for large-scale models, increasing training time and cost. Can the training process be optimized to reduce this overhead?
  3. GSQ may encounter instability in certain layers, such as the down_proj of the second layer in Llama-3.1-8B, where compression is less effective. How can its stability in such layers be improved?
  4. GSQ's application to a large-scale Mixture-of-Experts model demonstrates its scalability, but its behavior on other model families and tasks remains to be studied.
  5. GSQ maintains compatibility with existing scalar inference kernels; whether its computational efficiency can be further improved without compromising accuracy remains open.

Applications

Immediate Applications

Large Language Model Inference

Achieve efficient low-bit quantization using GSQ, reducing memory and bandwidth costs, allowing large language models to run on resource-limited devices.

Mixture-of-Experts Model Compression

GSQ demonstrates scalability in large-scale Mixture-of-Experts models, providing new possibilities for model compression and deployment.

Scalar Inference Kernel Compatibility

In applications requiring compatibility with existing scalar inference kernels, GSQ provides an efficient quantization solution suitable for existing inference infrastructure.

Long-term Vision

Universal Low-Bit Quantization Framework

GSQ has the potential to become a universal low-bit quantization framework applicable to various models and tasks, advancing model compression technology.

Efficient Resource Utilization

By further optimizing the GSQ training process, reducing computational resource requirements, and improving computational efficiency, supporting the training and deployment of large-scale models.

Abstract

Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively less traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.

References (20)

  • QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. Albert Tseng, Jerry Chee, Qingyao Sun et al., 2024.
  • EfficientQAT: Efficient Quantization-Aware Training for Large Language Models. Mengzhao Chen, Wenqi Shao, Peng Xu et al., 2024.
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Elias Frantar, Saleh Ashkboos, T. Hoefler et al., 2022.
  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Tim Dettmers, M. Lewis, Younes Belkada et al., 2022.
  • MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models. Gongfan Fang, Hongxu Yin, Saurav Muralidharan et al., 2024.
  • QTIP: Quantization with Trellises and Incoherence Processing. Albert Tseng, Qingyao Sun, David Hou et al., 2024.
  • BitNet: Scaling 1-bit Transformers for Large Language Models. Hongyu Wang, Shuming Ma, Li Dong et al., 2023.
  • QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models. Elias Frantar, Dan Alistarh, 2023.
  • Optimal Brain Surgeon and general network pruning. B. Hassibi, D. Stork, G. Wolff, 1993.
  • PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models. He Xiao, Runming Yang, Qingyao Yang et al., 2025.
  • QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci et al., 2024.
  • DB-LLM: Accurate Dual-Binarization for Efficient LLMs. Hong Chen, Chengtao Lv, Liang Ding et al., 2024.
  • XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. Mohammad Rastegari, Vicente Ordonez, J. Redmon et al., 2016.
  • Let's Verify Step by Step. H. Lightman, Vineet Kosaraju, Yura Burda et al., 2023.
  • LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. Naman Jain, King Han, Alex Gu et al., 2024.
  • OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models. Changhun Lee, Jun-gyu Jin, Taesu Kim et al., 2023.
  • ARB-LLM: Alternating Refined Binarizations for Large Language Models. Zhiteng Li, Xianglong Yan, Tianao Zhang et al., 2024.
  • Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries. Kiran Vodrahalli, Santiago Ontañón, Nilesh Tripuraneni et al., 2024.
  • SpinQuant: LLM quantization with learned rotations. Zechun Liu, Changsheng Zhao, Igor Fedorov et al., 2024.
  • BinaryConnect: Training Deep Neural Networks with binary weights during propagations. Matthieu Courbariaux, Yoshua Bengio, J. David, 2015.