Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

TL;DR

Semantic Token Clustering (STC) quantifies uncertainty in large language models efficiently from a single generation, significantly reducing computational overhead.

cs.CL · Advanced · 2026-03-21
Qi Cao, Andrew Gambardella, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
large language models · uncertainty quantification · semantic token clustering · computational efficiency · natural language processing

Key Findings

Methodology

This paper introduces a novel method called Semantic Token Clustering (STC) for efficient uncertainty quantification in large language models (LLMs). The method leverages the semantic information inherently encoded in LLMs to group tokens into semantically consistent clusters and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Specifically, STC employs embedding clustering and prefix matching to achieve semantic clustering of tokens and aggregates token probabilities within each semantic cluster at each decoding step to compute an uncertainty score. This approach requires only a single generation and does not depend on auxiliary models, making it highly efficient.
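To make the offline stage concrete, here is a minimal sketch, assuming the clusters are built with scikit-learn's AgglomerativeClustering over the model's static input embedding matrix; the function name, the number of clusters, and the cosine-style normalization are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of STC's offline clustering stage (illustrative,
# not the authors' code). Assumes access to the LLM's input embedding matrix.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def precompute_semantic_clusters(token_embeddings: np.ndarray, n_clusters: int = 5000):
    """Group the vocabulary's token embeddings into semantic clusters offline.

    token_embeddings: (vocab_size, hidden_dim) array, e.g. the model's
    embedding matrix. Returns a cluster id per token and a mapping from
    cluster id to the set of member token ids.
    """
    # Normalize so that cosine-style similarity drives the merging; the
    # paper's exact distance metric and linkage are assumptions here.
    normed = token_embeddings / np.linalg.norm(token_embeddings, axis=1, keepdims=True)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(normed)

    cluster_members = {}
    for token_id, cluster_id in enumerate(labels):
        cluster_members.setdefault(cluster_id, set()).add(token_id)
    return labels, cluster_members
```

Because this step runs once per model (and agglomerative clustering over a large vocabulary is memory-heavy), its cost is amortized across all subsequent inferences, which is what keeps the per-inference overhead low.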

Key Results

  • Experimental results demonstrate that STC achieves performance comparable to state-of-the-art baselines across multiple datasets and models while significantly reducing computational overhead. Specifically, compared to the CCP method, STC achieves competitive performance while reducing inference-time overhead by an average of 98%.
  • On datasets such as TriviaQA, Natural Questions, and WebQuestions, STC attains strong AUROC scores, indicating its effectiveness in uncertainty quantification.
  • Ablation studies reveal that removing either the embedding clustering or prefix matching components leads to performance degradation, highlighting the complementary nature and importance of these components in uncertainty quantification.

Significance

This research provides an efficient and self-contained solution for uncertainty quantification in large language models through the proposed Semantic Token Clustering (STC) method. By eliminating the need for external models or multiple generations, the method significantly reduces computational overhead, making it particularly suitable for resource-constrained and low-latency scenarios. The introduction of the STC method addresses the issue of high computational overhead in existing methods while fully leveraging the semantic information encoded within LLMs, offering a new perspective for uncertainty quantification. This study holds significant implications for both academia and industry, especially in applications requiring high reliability and low computational costs.

Technical Contribution

Technically, the STC method achieves efficient uncertainty quantification by directly leveraging the internal semantic representations of LLMs, avoiding the need for external models and multiple generations. The method introduces the concept of semantic token clustering in uncertainty quantification, using embedding clustering and prefix matching to achieve semantic clustering of tokens and aggregating token probabilities within each semantic cluster at each decoding step to compute an uncertainty score. Compared to existing methods, the STC method maintains competitive performance while significantly reducing computational overhead, showcasing its potential in engineering applications.

Novelty

The innovation of the STC method lies in its efficient uncertainty quantification by directly leveraging the semantic information inherently encoded in LLMs. This method is the first to apply semantic token clustering to uncertainty quantification, achieving semantic clustering of tokens through embedding clustering and prefix matching, thus avoiding the need for external models and multiple generations. Compared to existing methods, the STC method offers significant advantages in computational efficiency and performance.

Limitations

  • The STC method requires access to token logits and token embeddings, which are typically unavailable in closed-source models, thus limiting its direct applicability to such models.
  • The method relies on static token embeddings and semantic relationships derived from the LLM's vocabulary, which may introduce noise, particularly in cases of polysemy.
  • Similar to the CCP method, the STC method does not explicitly address the calibration of uncertainty scores.

Future Work

Future research directions include integrating context-aware semantic representations (e.g., contextualized embeddings) into the STC method to reduce noise and enhance the performance and robustness of uncertainty quantification. Applying STC to closed-source models and better calibrating its uncertainty scores are also worth exploring.

AI Executive Summary

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but the truthfulness of their outputs is not always guaranteed, and they often exhibit a tendency toward overconfidence. This uncertainty limits the application of LLMs in high-stakes domains such as healthcare, law, and science. Existing methods for uncertainty quantification typically rely on repeated sampling or auxiliary models, resulting in substantial computational overhead. To address these issues, this paper proposes a novel method called Semantic Token Clustering (STC) for efficient uncertainty quantification. The method leverages the semantic information inherently encoded in LLMs to group tokens into semantically consistent clusters and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster.

The core of the STC method is its ability to perform uncertainty quantification without relying on external models or multiple generations, requiring only a single generation. Specifically, the method employs embedding clustering and prefix matching to achieve semantic clustering of tokens and aggregates token probabilities within each semantic cluster at each decoding step to compute an uncertainty score. This approach maintains competitive performance while significantly reducing computational overhead, making it particularly suitable for resource-constrained and low-latency scenarios.

Experimental results demonstrate that STC achieves performance comparable to state-of-the-art baselines across multiple datasets and models while significantly reducing computational overhead. Specifically, compared to the CCP method, STC achieves competitive performance while reducing inference-time overhead by an average of 98%. Ablation studies reveal that removing either the embedding clustering or prefix matching components leads to performance degradation, highlighting the complementary nature and importance of these components in uncertainty quantification.

The introduction of the STC method provides an efficient and self-contained solution for uncertainty quantification in large language models, addressing the issue of high computational overhead in existing methods while fully leveraging the semantic information encoded within LLMs. This study holds significant implications for both academia and industry, especially in applications requiring high reliability and low computational costs.

However, the STC method also has some limitations. Firstly, the method requires access to token logits and token embeddings, which are typically unavailable in closed-source models, thus limiting its direct applicability to such models. Secondly, the STC method relies on static token embeddings and semantic relationships derived from the LLM's vocabulary, which may introduce noise, particularly in cases of polysemy. Future research directions include exploring the integration of context-aware semantic representations into the STC method to reduce noise and enhance the performance and robustness of uncertainty quantification.

Deep Analysis

Background

In recent years, large language models (LLMs) have made significant advancements in the field of natural language processing, demonstrating exceptional capabilities across various tasks. However, despite their impressive performance in generating natural language text, the truthfulness of their outputs is not always guaranteed, particularly in high-stakes domains such as healthcare, law, and science. Existing methods for uncertainty quantification typically rely on repeated sampling or auxiliary models, which not only increase computational overhead but also fail to fully exploit the semantic information encoded within LLMs. Therefore, achieving efficient uncertainty quantification while maintaining performance has become a critical challenge in this research area.

Core Problem

Large language models often exhibit a tendency toward overconfidence, generating plausible-sounding but incorrect responses. This overconfidence limits the application of LLMs in high-stakes domains where the truthfulness of outputs is crucial. Existing methods for uncertainty quantification typically rely on repeated sampling or auxiliary models, resulting in substantial computational overhead, making them difficult to apply in resource-constrained and low-latency scenarios. Therefore, achieving efficient uncertainty quantification without relying on external models or multiple generations has become a pressing issue.

Innovation

This paper introduces a novel method called Semantic Token Clustering (STC) for efficient uncertainty quantification.

  • STC leverages the semantic information inherently encoded in LLMs to group tokens into semantically consistent clusters, avoiding the need for external models and multiple generations.
  • The method employs embedding clustering and prefix matching to achieve semantic clustering of tokens, and aggregates token probabilities within each semantic cluster at each decoding step to compute an uncertainty score.
  • STC maintains competitive performance while significantly reducing computational overhead, making it particularly suitable for resource-constrained and low-latency scenarios.

Methodology

The implementation of the STC method involves the following steps:


  • Embedding Clustering: In the pre-computation stage, an unsupervised clustering algorithm (e.g., Agglomerative Clustering) is used to group token embeddings into semantically consistent clusters. The clustering is performed offline, avoiding computational overhead during inference.

  • Prefix Matching: During inference, the semantic consistency of clusters is enhanced by checking whether candidate tokens serve as prefixes of the subsequent generation.

  • Probability Aggregation: At each decoding step, token probabilities within each semantic cluster are aggregated to compute an uncertainty score. In this way, STC achieves efficient uncertainty quantification without relying on external models or multiple generations (see the sketch after this list).
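The following sketch ties the three steps together for a single decoding step. The helper name, the top-k restriction, and the "one minus aggregated mass" scoring convention are assumptions made for illustration, based on the paper's description rather than the authors' implementation.

```python
# Illustrative sketch of the online stage of STC for one decoding step.
import torch

def stc_step_score(logits, sampled_id, cluster_of, tokenizer,
                   continuation_text, top_k=50):
    """Uncertainty for one decoding step.

    logits: (vocab_size,) next-token logits at this step.
    cluster_of: token_id -> cluster_id map from the offline clustering stage.
    continuation_text: the text the model actually generated from this step
        onward, used for prefix matching.
    """
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, k=top_k)  # plausible candidates only

    target_cluster = cluster_of[sampled_id]
    mass = 0.0
    for p, token_id in zip(top_probs.tolist(), top_ids.tolist()):
        token_text = tokenizer.decode([token_id])
        same_cluster = cluster_of[token_id] == target_cluster
        # Prefix matching: a candidate that is a prefix of the actual
        # continuation is treated as semantically consistent with it
        # (the exact matching criterion is an assumption).
        is_prefix = len(token_text) > 0 and continuation_text.startswith(token_text)
        if same_cluster or is_prefix:
            mass += p

    # One simple convention: high aggregated semantic mass = low uncertainty.
    return 1.0 - mass
```

A sequence-level uncertainty score could then combine the per-step scores, for example by averaging, though the paper's exact combination rule is not reproduced here.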

Experiments

The experimental design includes testing on multiple datasets (e.g., TriviaQA, Natural Questions, and WebQuestions) and various models (e.g., Llama-2-7B, Llama-3-8B, Mistral-7B, and Qwen2.5 models). Baseline methods used in the experiments include single-generation methods (e.g., Perplexity, tokenSAR, and CCP) and sampling-based methods (e.g., Predictive Entropy, LN-Entropy, and EigenScore). Key hyperparameters include the number of clusters and temperature sampling parameters. Ablation studies are conducted to evaluate the contributions of the embedding clustering and prefix matching components.
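For context, AUROC here measures how well the uncertainty score ranks incorrect answers above correct ones. A standard evaluation (general practice, not code from the paper) looks like this, with hypothetical scores and labels:

```python
# Standard AUROC evaluation for uncertainty scores: treat "answer is wrong"
# as the positive class and check how well the score ranks wrong answers first.
from sklearn.metrics import roc_auc_score

# Hypothetical example values: per-question uncertainty scores and
# correctness labels (1 = the model's answer was wrong).
uncertainty_scores = [0.82, 0.10, 0.55, 0.91, 0.23]
answer_is_wrong = [1, 0, 1, 1, 0]

auroc = roc_auc_score(answer_is_wrong, uncertainty_scores)
print(f"AUROC: {auroc:.3f}")  # 1.0 = perfect ranking, 0.5 = chance
```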

Results

Experimental results demonstrate that the STC method achieves performance comparable to state-of-the-art baselines across multiple datasets and models, particularly in terms of AUROC. Compared to the CCP method, STC achieves competitive performance while reducing inference-time overhead by an average of 98%. Ablation studies reveal that removing either the embedding clustering or prefix matching components leads to performance degradation, highlighting the complementary nature and importance of these components in uncertainty quantification.

Applications

The STC method is suitable for scenarios requiring efficient uncertainty quantification, such as real-time natural language processing applications, resource-constrained mobile device applications, and large-scale text generation systems requiring high reliability. The method eliminates the need for external models or multiple generations, significantly reducing computational overhead, making it particularly suitable for low-latency and resource-constrained scenarios.

Limitations & Outlook

The limitations of the STC method include:

  • The requirement for access to token logits and token embeddings, which are typically unavailable in closed-source models, limiting its direct applicability to such models.
  • The reliance on static token embeddings and semantic relationships derived from the LLM's vocabulary, which may introduce noise, particularly in cases of polysemy.

Future research directions include integrating context-aware semantic representations into the STC method to reduce noise and enhance the performance and robustness of uncertainty quantification.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a bunch of ingredients, but you're not sure which ones are fresh and which might have gone bad. To make a delicious dish, you need a way to judge the freshness of each ingredient. Now, imagine these ingredients are words or phrases generated by a large language model, and freshness is uncertainty. Semantic Token Clustering (STC) is like a smart chef who can quickly group ingredients, say all the vegetables together, all the meats together, and then judge the overall freshness of each group. This way, the chef only needs to check once to know which ingredients are reliable, without repeatedly checking each one. STC uses the semantic information within the language model to group words or phrases into semantically consistent clusters, then quantifies uncertainty based on the probability of each cluster. This is like the chef judging the overall freshness of each group of ingredients. By doing this, STC can quickly and efficiently determine which generated words or phrases are reliable, without needing external models or multiple generations. This method is especially useful in scenarios that require quick decisions and low computational costs, like real-time natural language processing applications or resource-constrained mobile device applications.

ELI14 (Explained like you're 14)

Hey there! Have you ever wondered how smart assistants like Siri or Google Assistant know if what they're saying is right? Well, sometimes they make mistakes, just like we might get a question wrong on a test. To make these assistants smarter, we need a way to figure out if what they're saying is reliable. That's where the 'Semantic Token Clustering' (STC) method comes in. Imagine you're playing a word guessing game, and you need to guess a word based on clues. Now, suppose each clue is a different version of a word, like 'television' and 'TV'. STC is like a super-smart player who can group these clues into different sets and then figure out which word is correct based on the overall clues. This way, it only needs to guess once to know which word is most likely, without guessing repeatedly. STC uses the semantic information inside the smart assistant to group words into semantically consistent sets, then figures out which word is most reliable based on the probability of each set. This is like you figuring out which word is correct in the word guessing game based on the overall clues. With this method, STC can quickly and efficiently determine if what the smart assistant is saying is reliable, without needing outside help. This method is especially great for scenarios that need quick responses and low computational costs, like real-time voice assistant applications or resource-limited mobile device applications. Isn't that cool?

Glossary

Large Language Model (LLM)

A large language model is an AI model capable of generating and understanding natural language text, typically with billions of parameters, excelling in various tasks.

Used as the foundational model for generating natural language text and performing uncertainty quantification in this paper.

Uncertainty Quantification

Uncertainty quantification is a method for assessing the reliability of model outputs by calculating the probability distribution of the outputs.

Used to identify potentially unreliable parts of outputs from large language models.

Semantic Token Clustering (STC)

Semantic Token Clustering is a method that uses the model's internal semantic information to group tokens into semantically consistent clusters for efficient uncertainty quantification.

The core method proposed in this paper for achieving efficient uncertainty quantification in large language models.

Embedding Clustering

Embedding clustering is a technique for grouping token embeddings into semantically consistent clusters, typically using unsupervised clustering algorithms.

A key step in implementing semantic token clustering.

Prefix Matching

Prefix matching is a method for enhancing the semantic consistency of clusters by checking if candidate tokens serve as prefixes of the subsequent generation.

Used during inference to enhance the semantic consistency of clusters.

Probability Aggregation

Probability aggregation is a method for calculating uncertainty scores by aggregating token probabilities within semantic clusters.

Used to obtain uncertainty scores at each decoding step.

Agglomerative Clustering

Agglomerative clustering is a hierarchical clustering algorithm that builds a hierarchy of clusters by iteratively merging the most similar clusters.

The unsupervised clustering algorithm used for embedding clustering.

AUROC

AUROC is a metric for evaluating the performance of classification models, representing the area under the receiver operating characteristic curve; higher values indicate better performance.

Used to evaluate the performance of the STC method in uncertainty quantification.

Ablation Study

An ablation study is a method for evaluating the contribution of specific components to overall performance by removing them from the model.

Used to assess the contributions of embedding clustering and prefix matching components in the STC method.

Temperature Sampling

Temperature sampling is a technique for generating diverse outputs by adjusting the temperature parameter of the sampling probability distribution.

Used to generate auxiliary responses for evaluating the performance of sampling-based baseline methods.
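As a quick illustration (a generic snippet, not from the paper), dividing the logits by a temperature T before the softmax flattens the distribution for T > 1 and sharpens it for T < 1:

```python
# Generic temperature-sampling illustration: scaling logits by 1/T
# reshapes the sampling distribution.
import torch

logits = torch.tensor([2.0, 1.0, 0.5])
for T in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: {probs.tolist()}")  # lower T -> peakier, higher T -> flatter
```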

Open Questions (Unanswered questions from this research)

  1. How can the STC method be applied to closed-source models? Since STC requires access to token logits and embeddings, which are typically unavailable in closed-source models, it cannot be directly applied to them. Future research needs to explore how to achieve efficient uncertainty quantification without access to these internal representations.
  2. How can noise in the STC method be reduced? STC relies on static token embeddings and semantic relationships derived from the LLM's vocabulary, which may introduce noise, particularly in cases of polysemy. Future research could explore integrating context-aware semantic representations to reduce noise and improve performance.
  3. How can uncertainty scores in the STC method be calibrated? Like the CCP method, STC does not explicitly address the calibration of uncertainty scores. Future research could explore how to calibrate them better to improve their reliability in practical applications.
  4. How can the STC method be applied in multilingual environments? The current method is primarily optimized for a single language; future research could explore extending it across languages to improve its applicability and performance.
  5. How can the STC method be optimized for low-resource environments? Although STC is computationally efficient, it may still face challenges in extremely low-resource environments; future research could explore further optimizations for such scenarios.

Applications

Immediate Applications

Real-time Natural Language Processing Applications

The STC method can be used in real-time natural language processing applications, such as voice assistants and chatbots, to provide more reliable responses.

Resource-constrained Mobile Device Applications

On mobile devices, where computational resources are limited, the STC method can provide efficient uncertainty quantification, reducing computational overhead.

Large-scale Text Generation Systems

In systems that require generating large volumes of text, the STC method can improve the reliability of generated text, reducing erroneous outputs.

Long-term Vision

Multilingual Natural Language Processing Systems

The STC method can be extended to multilingual environments, improving the reliability of natural language processing systems across different languages.

Intelligent Decision Support Systems

By improving the reliability of system responses to uncertainty, the STC method can be used in intelligent decision support systems to help users make more informed decisions.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.

cs.CL cs.AI cs.LG

References (18)

Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov et al. Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. 2024.

Fabian Pedregosa, G. Varoquaux, Alexandre Gramfort et al. Scikit-learn: Machine Learning in Python. 2011.

D. Lindley. On a Measure of the Information Provided by an Experiment. 1956.

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn et al. Detecting Hallucinations in Large Language Models Using Semantic Entropy. 2024.

A. Azaria, Tom M. Mitchell. The Internal State of an LLM Knows When It's Lying. 2023.

Mandar Joshi, Eunsol Choi, Daniel S. Weld et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. 2017.

Wiebke Wagner. Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit. 2010.

Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev et al. Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph. 2024.

Zhiyuan Wang, Jinhao Duan, Lu Cheng et al. ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees. 2024.

Jonathan Berant, A. Chou, Roy Frostig et al. Semantic Parsing on Freebase from Question-Answer Pairs. 2013.

A. Malinin, M. Gales. Uncertainty Estimation in Autoregressive Structured Prediction. 2021.

Nils Reimers, Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019.

M. Fomicheva, Shuo Sun, L. Yankovskaya et al. Unsupervised Quality Estimation for Neural Machine Translation. 2020.

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. 2023.

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. 2004.

T. Kwiatkowski, J. Palomaki, Olivia Redfield et al. Natural Questions: A Benchmark for Question Answering Research. 2019.

Hugo Touvron, Louis Martin, Kevin R. Stone et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.

Linyu Liu, Yu Pan, Xiaocheng Li et al. Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach. 2024.