Tokenisation via Convex Relaxations

TL;DR

ConvexTok uses convex relaxations to optimize tokenisation, achieving near-optimal compression within 1% at 128k vocabulary, improving BpB significantly.

cs.CL 🔴 Advanced 2026-05-22 232 views
Jan Tempus Philip Whittington Craig W. Schmidt Dennis Komm Tiago Pimentel
Natural Language Processing Tokenisation Convex Optimization Linear Programming Language Models

Key Findings

Methodology

This paper introduces ConvexTok, a novel tokeniser construction method based on convex relaxations. The authors first formulate the tokeniser construction as an integer program (IP) representing the combinatorial optimization of vocabulary selection and text segmentation. They then relax the discrete constraints to continuous intervals, converting the IP into a linear program (LP) solvable with efficient convex optimization tools. To discretize the LP's fractional solutions into usable vocabularies, three rounding schemes are proposed: deterministic (Det), biased (Bias), and integral-only (Int). ConvexTok not only improves intrinsic tokenisation metrics but also provides a theoretical lower bound on compression, certifying how close a tokeniser is to the global optimum. Empirical results show that at common vocabulary sizes (8k to 256k), ConvexTok's solutions are within 1% of optimal compression.

Key Results

  • On the ClimbMix400B dataset, ConvexTok achieves compression rates within 1% of the LP lower bound at 128k vocabulary size, outperforming traditional BPE which exhibits over 2% gap.
  • Intrinsic metrics such as vocabulary utilization, type-token ratio, and compression rate show Bias rounding consistently surpasses BPE, with vocabulary utilization improvements exceeding 5%.
  • In language modeling, the Det rounding scheme yields better Bits-per-byte (BpB) scores than BPE, improving by approximately 0.1 to 0.3 percentage points, and demonstrates more stable performance on CORE downstream tasks.

Significance

Tokenisation is a foundational step in NLP pipelines, directly influencing language model input representation and downstream performance. Existing tokenisers like BPE and Unigram rely on greedy, locally optimal merges, which fail to guarantee global optimality, limiting compression efficiency and model generalization. ConvexTok addresses this by formulating tokeniser construction as a convex optimization problem, enabling near-global optimal solutions and providing theoretical compression lower bounds. This bridges a critical gap in tokeniser design, offering both practical improvements and theoretical guarantees. The approach advances the field from heuristic-driven tokenisation to principled, optimization-based design, with significant implications for large-scale language model training and efficiency.

Technical Contribution

The paper's key technical contribution is the reformulation of tokeniser construction as a shortest path problem on a colored directed acyclic graph (DAG), which is then modeled as an integer program. By applying convex relaxation, the integer program is converted into a linear program solvable by modern LP solvers, enabling efficient approximation of the NP-hard problem. The authors design multiple rounding strategies to discretize the LP solutions into practical vocabularies, balancing compression and generalization. Additionally, the method provides a provable lower bound on achievable compression, allowing users to certify the optimality gap of any tokeniser. This combination of theoretical rigor and practical algorithm design marks a substantial advancement over prior greedy or heuristic methods.

Novelty

ConvexTok is the first work to systematically cast tokeniser construction as a convex optimization problem and solve it via LP relaxation, departing fundamentally from traditional greedy approaches like BPE and Unigram. Unlike prior methods that focus on local merges or probabilistic selection, ConvexTok offers a global optimization perspective with theoretical guarantees and a compression lower bound certificate. This represents a novel integration of polyhedral combinatorics and NLP tokenisation, providing both new theoretical insights and practical algorithms.

Limitations

  • ConvexTok's LP solver exhibits longer runtimes at smaller vocabulary sizes (e.g., 8k), limiting efficiency in low-resource settings.
  • The tokeniser's vocabulary stability is lower than BPE's, showing sensitivity to training data sampling, which may affect generalization.
  • Rounding schemes require further refinement; for example, the integral-only rounding underperforms in compression and downstream tasks compared to other strategies.

Future Work

Future research directions include developing more efficient and stable rounding algorithms to improve vocabulary robustness and downstream performance. Extending the convex optimization framework to multi-objective settings, incorporating semantic fidelity and multilingual adaptability, is another promising avenue. Additionally, integrating ConvexTok with end-to-end language model training for dynamic tokeniser adaptation could enhance overall model efficiency and generalization.

AI Executive Summary

Tokenisation is a fundamental step in natural language processing (NLP), converting raw text into discrete tokens that language models can process. Traditional tokenisation algorithms, such as Byte-Pair Encoding (BPE) and Unigram models, rely on greedy heuristics that iteratively merge frequent byte pairs or select high-probability subwords. While effective and widely adopted, these methods do not guarantee globally optimal vocabularies, potentially limiting compression efficiency and downstream model performance. This paper introduces ConvexTok, a novel tokeniser construction framework based on convex relaxations, which addresses these limitations by formulating tokeniser design as a global optimization problem.

ConvexTok begins by representing the tokenisation problem as a shortest path problem on a colored directed acyclic graph (DAG), where vertices correspond to byte positions and edges represent candidate tokens. This graph formulation is then encoded as an integer program (IP) capturing vocabulary selection and segmentation constraints. Recognizing the NP-hardness of the IP, the authors apply convex relaxation to transform it into a linear program (LP), which can be efficiently solved using modern convex optimization solvers. To convert the LP's fractional solutions into discrete vocabularies, three rounding schemes—deterministic, biased, and integral-only—are proposed, each balancing compression and generalization differently.

The convex relaxation approach not only yields near-optimal tokenisers but also provides a theoretical lower bound on achievable compression, enabling users to certify the optimality gap of any tokeniser. Experiments on the large-scale ClimbMix400B dataset demonstrate that ConvexTok achieves compression rates within 1% of the LP lower bound across vocabulary sizes ranging from 8k to 256k, outperforming traditional BPE by a significant margin. Intrinsic tokenisation metrics such as vocabulary utilization and type-token ratio consistently favor ConvexTok's Bias rounding strategy. Moreover, language models trained with ConvexTok tokenisers exhibit improved Bits-per-byte (BpB) scores and more stable performance on CORE downstream tasks, particularly with the deterministic rounding scheme.

These findings highlight ConvexTok's potential to transform tokeniser design from heuristic-driven to theoretically grounded optimization. By providing both practical algorithms and theoretical guarantees, ConvexTok advances the state of the art in tokenisation, with implications for improving large-scale language model training efficiency and effectiveness.

Despite these advances, ConvexTok faces challenges including increased computational cost at smaller vocabulary sizes and sensitivity to training data sampling, affecting vocabulary stability. Future work aims to refine rounding procedures, extend the framework to multi-objective optimization incorporating semantic and multilingual factors, and integrate tokeniser optimization with end-to-end language model training for dynamic adaptation. Overall, ConvexTok represents a significant step toward principled, globally optimal tokenisation in NLP.

Deep Analysis

Background

Tokenisation is a critical preprocessing step in natural language processing (NLP), responsible for segmenting raw text into discrete tokens that serve as the input units for language models. Early tokenisation approaches relied on rule-based or dictionary-driven methods, which struggled to scale and generalize to large, diverse corpora. The advent of subword tokenisation algorithms, notably Byte-Pair Encoding (BPE) introduced by Sennrich et al., and Unigram models, revolutionized this process by enabling data-driven, unsupervised vocabulary construction. These methods iteratively merge frequent byte pairs or select subwords based on probabilistic models, achieving significant improvements in compression and downstream task performance. They have become the de facto standard in training large-scale language models such as GPT and BERT.


However, these popular tokenisers employ greedy heuristics that optimize locally at each merge or selection step, without considering the global vocabulary structure. This myopic approach can lead to suboptimal vocabularies that do not minimize the overall encoding length or maximize compression. While various heuristic improvements have been proposed, none provide guarantees of global optimality or theoretical bounds on performance. Consequently, the tokenisation problem remains an open challenge, especially as language models grow in scale and complexity, demanding more efficient and principled tokenisation strategies.

Core Problem

The core problem addressed in this paper is the construction of an optimal tokeniser that minimizes the total encoding length of a training corpus under a fixed vocabulary size constraint. Formally, given a dataset of byte-strings and a vocabulary budget K, the goal is to select a set of tokens and segment the corpus such that the sum of tokenized lengths is minimized. This problem is combinatorial and has been proven NP-hard, making exact solutions computationally infeasible for large datasets and vocabularies.


Existing methods like BPE and Unigram rely on greedy or probabilistic heuristics that do not guarantee global optimality, potentially leading to suboptimal compression and downstream performance. Moreover, there is a lack of theoretical tools to quantify how close a given tokeniser is to the optimal solution. Addressing these challenges requires a framework that can efficiently approximate the global optimum and provide certificates of optimality, enabling principled tokeniser design and evaluation.

Innovation

The paper presents several key innovations:


1. Graph-based Reformulation: The tokenisation problem is reformulated as a shortest path problem on a colored directed acyclic graph (DAG), where vertices represent byte positions and edges correspond to candidate tokens colored by their identity. This representation captures all possible segmentations and vocabulary choices in a unified structure.


2. Integer Programming Model: Using the graph, the authors formulate an integer program (IP) that encodes vocabulary selection and segmentation constraints, including flow conservation and vocabulary size limits.


3. Convex Relaxation: Recognizing the NP-hardness of the IP, the problem is relaxed into a linear program (LP) by allowing variables to take continuous values between 0 and 1. This convex relaxation enables efficient solution using state-of-the-art LP solvers.


4. Rounding Schemes: To convert fractional LP solutions into discrete vocabularies, three rounding strategies are proposed—deterministic (select top-K tokens by LP value), biased (favor shorter tokens by normalizing LP values by token length), and integral-only (select tokens with LP values near 1). These provide trade-offs between compression and generalization.


5. Optimality Certification: The LP solution provides a provable lower bound on achievable compression, allowing users to quantify the optimality gap of any tokeniser, a novel contribution in tokenisation research.


Together, these innovations enable a principled, globally aware approach to tokeniser construction, overcoming the limitations of greedy heuristics.

Methodology

  • �� Dataset and Preprocessing: The ClimbMix400B corpus, a large-scale curated pretraining dataset, is used. Text is pre-tokenised using a regular expression to prevent crossing coarse linguistic boundaries.

  • �� Graph Construction: For each byte-string, vertices are created for each byte boundary, with edges representing possible tokens (byte-edges and token-edges). Token-edges are colored to represent distinct tokens.

  • �� Integer Program Formulation: Variables represent inclusion of free edges (byte-edges), priced edges (candidate tokens), and colors (tokens). Constraints enforce flow conservation, vocabulary size limits, and consistency between token usage and vocabulary selection.

  • �� Convex Relaxation: Binary constraints on variables are relaxed to continuous intervals [0,1], yielding a linear program.

  • �� LP Solving: The LP is solved using NVIDIA CuOPT, handling over 100 million variables and constraints efficiently.

  • �� Rounding: Three rounding schemes (Det, Bias, Int) are applied to the LP solution to produce discrete vocabularies.

  • �� Tokeniser Construction: Using the rounded vocabulary, shortest path algorithms segment the corpus into tokens.

  • �� Evaluation: Tokenisers are trained with vocabulary sizes from 8k to 256k. Metrics include compression rate, vocabulary utilization, Rényi entropy, Bits-per-byte (BpB), and CORE downstream task performance. Stability is assessed via Jaccard similarity over multiple training subsets.

Experiments

Experiments utilize the ClimbMix400B dataset with 593,920 documents for training tokenisers and language models. Vocabulary sizes range from 8k to 256k in powers of two. Baseline comparisons are made against BPE. The LP solver NVIDIA CuOPT is employed with default primal-dual gap stopping criteria. Language models follow the nanochat GPT architecture, with parameter counts from 135 million to 3.5 billion, trained on NVIDIA GH200 GPUs. Training times range from 17 minutes for small models to 199 minutes for large ones. Tokeniser stability is evaluated by retraining on independently sampled subsets and computing Jaccard similarity of vocabularies. Intrinsic tokenisation metrics and downstream language modeling metrics (BpB, CORE) are reported, with multiple seeds for robustness.

Results

The LP relaxation value decreases monotonically with increasing vocabulary size, and solutions become more integral, indicating reduced ambiguity. Bias rounding consistently outperforms BPE in vocabulary utilization and compression rate, improving utilization by over 5%. Det rounding yields the best Bits-per-byte (BpB) scores in language modeling, improving over BPE by 0.1 to 0.3 percentage points, and shows more stable CORE downstream task performance. Integral-only rounding, while producing smaller vocabularies, remains close to optimal compression at larger vocabulary sizes. BPE, despite its greedy nature, achieves compression within approximately 2% of the LP lower bound, underscoring its effectiveness. Stability analysis reveals BPE vocabularies are more stable across training samples than ConvexTok, particularly at smaller vocabulary sizes. Overall, ConvexTok demonstrates superior compression and modeling performance with theoretical optimality guarantees.

Applications

ConvexTok can be directly applied to design tokenisers for large-scale language model pretraining, enhancing compression and reducing input sequence lengths, thereby accelerating training and improving model performance. Its theoretical lower bound capability enables rigorous tokeniser performance evaluation and comparison, facilitating standardized tokeniser design in industry. The framework is also suitable for multilingual and multi-domain tokeniser optimization, improving cross-lingual model generalization. Future integration with dynamic tokeniser adaptation during model training could enable end-to-end optimization, further boosting efficiency and effectiveness in NLP applications.

Limitations & Outlook

ConvexTok's LP solver requires significant computational resources, especially at smaller vocabulary sizes, limiting real-time or resource-constrained deployment. The tokeniser exhibits sensitivity to training data sampling, resulting in lower vocabulary stability compared to BPE, which may affect model robustness. Rounding strategies, while effective, are heuristic and may not fully optimize the trade-off between compression and generalization. The current focus on compression neglects other objectives such as semantic preservation and multilingual adaptability. These limitations highlight the need for algorithmic improvements and broader objective integration in future work.

Plain Language Accessible to non-experts

Imagine you have a huge box of LEGO bricks (bytes) and want to build a cool castle (text). But there are so many bricks that building takes forever. Traditional methods pick the easiest bricks to connect one by one, which is fast but might not build the best castle. ConvexTok is like drawing a big map of all possible ways to connect bricks and then using math magic to find the shortest path to build the castle with the fewest bricks. This way, you build a better castle faster and use fewer bricks. Although this sounds complicated, it guarantees your building plan is nearly perfect and even tells you how close you are to the best plan. It also has clever ways to turn the math plan into real building steps. In the end, this helps computers understand and generate language more efficiently, just like building LEGO faster and better.

ELI14 Explained like you're 14

Hey! Imagine you're playing with LEGO and want to build the coolest castle ever. But there are tons of bricks, and you need to pick which ones to use so your castle looks awesome and doesn't fall apart. Old ways are like picking bricks one by one, just grabbing the easiest pieces to snap together. It works but might not be the best castle. Now, there's this super smart method called ConvexTok. It first draws a huge map showing all the ways you can build your castle, then uses math magic to find the best path that uses the fewest bricks but still looks amazing! It even tells you how close your plan is to the perfect one. Sounds tricky, but it helps computers learn languages better by picking the best 'building blocks' for words. Cool, right?

Glossary

Tokenisation

The process of splitting text into smaller units (tokens) such as subwords or byte sequences, enabling language models to process text effectively.

Central to this paper's focus on optimizing tokeniser construction.

Byte-Pair Encoding (BPE)

A greedy algorithm that iteratively merges the most frequent byte pairs to build a vocabulary, widely used in modern language models.

Used as the primary baseline for comparison.

Integer Programming (IP)

An optimization framework where variables are restricted to integer values, suitable for discrete decision problems.

The tokeniser construction problem is initially formulated as an IP.

Linear Programming (LP)

An optimization method with linear objectives and constraints over continuous variables, solvable efficiently.

The IP is relaxed to an LP for efficient approximate solving.

Convex Relaxation

The technique of relaxing discrete constraints to continuous convex sets to enable tractable optimization.

Core method enabling efficient tokeniser optimization.

Rounding

The process of converting continuous optimization solutions into discrete, usable vocabularies.

Three rounding schemes are proposed to discretize LP solutions.

Bits-per-byte (BpB)

A metric measuring compression efficiency of language models; lower values indicate better compression.

Used to evaluate tokeniser impact on model performance.

Rényi Entropy

An information-theoretic measure reflecting the diversity and uncertainty of token distributions.

Used as an intrinsic metric for tokeniser evaluation.

Directed Acyclic Graph (DAG)

A graph with directed edges and no cycles, representing all possible token segmentations.

Foundation for the graph-based tokenisation formulation.

Jaccard Similarity

A measure of similarity between two sets, defined as the size of their intersection divided by the size of their union.

Used to assess tokeniser vocabulary stability across training samples.

Open Questions Unanswered questions from this research

  • 1 Designing rounding algorithms that balance compression efficiency and vocabulary stability remains an open challenge.
  • 2 Extending tokeniser optimization beyond compression to include semantic preservation and multilingual adaptability is yet unexplored.
  • 3 Reducing computational complexity of LP solving for real-time or resource-constrained scenarios is needed.
  • 4 Improving tokeniser robustness to training data variability to enhance generalization is an unresolved issue.
  • 5 Integrating convex tokeniser optimization with end-to-end language model training for dynamic adaptation is a promising future direction.

Applications

Immediate Applications

Large-scale Language Model Pretraining

ConvexTok enables construction of highly compressed tokenisers, reducing input sequence lengths and accelerating training while improving model quality.

Multilingual Tokeniser Optimization

The framework supports building unified, near-optimal vocabularies across languages, enhancing cross-lingual model generalization.

Tokeniser Performance Evaluation and Standardization

Providing theoretical lower bounds allows rigorous comparison and benchmarking of tokenisers, facilitating industry-wide standards.

Long-term Vision

End-to-End Joint Language Model and Tokeniser Optimization

Integrating ConvexTok with model training for dynamic tokeniser adaptation could significantly boost overall NLP system efficiency and effectiveness.

Multi-objective Tokeniser Design

Extending the convex optimization approach to incorporate semantic fidelity and multilingual considerations will drive smarter, more versatile tokenisers.

Abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1\% of optimal at common vocabulary sizes.

cs.CL cs.LG

References (20)

Investigating the Effectiveness of BPE: The Power of Shorter Sequences

Matthias Gallé

2019 76 citations ⭐ Influential

Flows in Networks

E. Denardo

2011 1627 citations

Scaling neural machine translation to 200 languages

M. Costa-jussà, James Cross, Onur Çelebi et al.

2024 227 citations

Paths and cycles in colored graphs

H. Broersma, Xueliang Li, G. Woeginger et al.

2001 100 citations

Practical Guidelines for Solving Difficult Mixed Integer Linear

Ed Klotz, A. Newman

2013 150 citations

The Traveling Salesman Problem: A Computational Study

D. Applegate, R. Bixby, V. Chvátal et al.

2007 2093 citations

A Formal Perspective on Byte-Pair Encoding

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi et al.

2023 56 citations View Analysis →

An approximation algorithm for the generalized assignment problem

D. Shmoys, É. Tardos

1993 766 citations

Our Method

Imene Bensalem, Paolo Rosso, S. Chikhi

1867 61 citations

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Craig W. Schmidt, Varshini Reddy, Chris Tanner et al.

2025 23 citations View Analysis →

Theoretical Analysis of Byte-Pair Encoding

László Kozma, Johannes Voderholzer

2024 19 citations View Analysis →

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony et al.

2023 1873 citations View Analysis →

Language Models

Jordan Boyd-Graber, Philipp Koehn

2009 1092 citations

Scaling Laws for Neural Language Models

J. Kaplan, Sam McCandlish, T. Henighan et al.

2020 7974 citations View Analysis →

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Taku Kudo

2018 1350 citations View Analysis →

Approximation algorithms and hardness results for labeled connectivity problems

Refael Hassin, J. Monnot, D. Segev

2006 73 citations

Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Shizhe Diao, Yu Yang, Y. Fu et al.

2025 34 citations View Analysis →

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal

Haoran Lian, Yizhe Xiong, Jianwei Niu et al.

2024 12 citations View Analysis →

A partition cover approach to tokenization

Jia Peng Lim, Davin Choo, Hady W. Lauw

2025 5 citations View Analysis →

Random Contractions and Sampling for Hypergraph and Hedge Connectivity

M. Ghaffari, David R Karger, D. Panigrahi

2017 35 citations