Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

TL;DR

HullFT employs convex reconstruction and gradient caching for efficient test-time fine-tuning, improving speed and quality tradeoff in large language models.

cs.LG 🔴 Advanced 2026-05-29 179 views

Alaa Khamis Alaa Maalouf

AI Reader Arxiv Page Download PDF

Large Language Models Test-Time Fine-Tuning Convex Geometry Gradient Caching Optimization Algorithms

Key Findings

Methodology

This paper introduces HullFT, a convex geometry-based framework for test-time fine-tuning (TTFT). The core idea is to represent a query embedding as a sparse convex combination of a small subset of training sequences, solved efficiently via the projection-free Frank–Wolfe algorithm. This approach naturally balances relevance and diversity by exploiting the geometric properties of convex sets, avoiding explicit diversity penalties. The fractional convex weights are then converted into an exact integer multiset through a geometric integerization process, which preserves the approximation quality. The resulting multiset contains repeated samples, enabling gradient reuse during fine-tuning. The pipeline involves: 1) solving the sparse convex approximation with Frank–Wolfe; 2) geometric integerization to produce a support multiset; 3) leveraging repeated samples for gradient caching. Extensive experiments demonstrate that HullFT outperforms state-of-the-art TTFT methods in bits-per-byte and runtime efficiency, with significant speedups and quality improvements.

Key Results

Across 12 datasets, HullFT achieves an average of 3.83% lower bits-per-byte (BPB) than kNN at a 0.75s budget and 3.44% lower at 0s, with maximum improvements reaching 6.4% under tight latency constraints.
Support set selection via convex geometry is 25.8× faster than SIFT, and gradient reuse further accelerates fine-tuning by 1.48×, saving approximately 89 seconds overall, while maintaining or improving BPB metrics.
The geometric diversity of support sets emerges naturally without explicit penalties, and repeated samples from integerization enable effective gradient caching, leading to substantial efficiency gains.

Significance

This work addresses a fundamental bottleneck in deploying large language models in latency-sensitive environments by combining convex geometric optimization with gradient caching. It enables models to adapt quickly and efficiently at inference time, making real-time applications like chatbots, content moderation, and personalized assistants more feasible. The approach bridges theoretical convex optimization with practical engineering, offering a scalable solution that maintains high adaptation quality while drastically reducing computational costs. Its implications extend beyond NLP, potentially influencing other domains requiring rapid model adaptation under resource constraints, such as computer vision and multimodal AI.

Technical Contribution

The main technical innovations include: 1) formulating sample selection as a convex approximation problem solved via Frank–Wolfe, ensuring sparse, diverse, and relevant support sets; 2) developing a geometric integerization method that converts fractional weights into a support multiset with controlled approximation error; 3) implementing a gradient reuse mechanism that caches and reuses gradients across repeated samples, significantly reducing backward pass computations. These contributions collectively enable a highly efficient TTFT pipeline that balances theoretical guarantees with practical speedups, surpassing existing methods like kNN and SIFT in both quality and runtime.

Novelty

This study is the first to integrate convex geometric optimization with integerization and gradient caching for test-time fine-tuning. Unlike prior approaches relying on heuristic or greedy sample selection, HullFT leverages the inherent geometry of convex hulls to automatically ensure diversity and relevance. The geometric integerization transforms fractional convex weights into a support multiset, naturally inducing repeated samples that facilitate gradient caching. This combination of convex optimization, geometric integerization, and gradient reuse constitutes a novel paradigm that significantly advances the efficiency of TTFT, especially under strict latency constraints.

Limitations

The method depends on the quality of the candidate retrieval pool; if the pool lacks sufficient diversity or recall, the support set may not fully capture the query's semantic space, limiting effectiveness.
Support set diversity is achieved geometrically without explicit penalties, which may not always align with task-specific notions of diversity, especially in highly complex semantic spaces.
Gradient caching assumes repeated samples are beneficial; in highly diverse or novel scenarios, the effectiveness diminishes, potentially leading to suboptimal adaptation or bias.
The approach's performance on extremely large models or in multi-task settings remains to be thoroughly validated, and computational overhead in very high-dimensional spaces could pose challenges.

Future Work

Future research could focus on integrating internal model representations to dynamically adapt support sets, enhancing expressiveness beyond geometric convexity. Extending the framework to multi-modal and multi-task scenarios could broaden its applicability. Developing adaptive support set size and composition strategies, possibly via reinforcement learning, may further improve efficiency and robustness. Additionally, exploring more sophisticated integerization techniques and gradient caching schemes tailored for diverse data distributions could push the limits of real-time model adaptation.

AI Executive Summary

The rapid growth of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented performance across a wide array of tasks. However, deploying these models in real-world applications often encounters a critical bottleneck: the high computational cost and latency associated with fine-tuning during inference. Traditional fine-tuning methods require extensive backpropagation over large datasets, making real-time adaptation impractical. To address this, recent research has explored test-time fine-tuning (TTFT), where models are dynamically adapted to each input query by retrieving relevant data, updating parameters, and evaluating. Despite promising results, existing TTFT approaches face a fundamental tradeoff: faster retrieval and updates often compromise the quality of adaptation, while more sophisticated selection strategies incur prohibitive computational overhead.

This paper introduces HullFT, a novel framework that leverages convex geometry and gradient caching to significantly enhance the efficiency of TTFT. The core innovation lies in representing a query embedding as a sparse convex combination of a small subset of training sequences, obtained through an efficient, projection-free Frank–Wolfe algorithm. This approach inherently balances relevance and diversity without explicit penalties, as the convex hull naturally suppresses redundant samples pointing in similar directions. Once the convex combination is computed, the fractional weights are converted into an integer multiset via a geometric integerization process, ensuring the support set contains repeated samples. These repetitions are exploited through gradient reuse, where cached gradients are reused across multiple optimizer steps, drastically reducing the number of forward-backward passes.

Extensive experiments on 12 datasets from The Pile demonstrate that HullFT consistently outperforms state-of-the-art methods like kNN and SIFT in both bits-per-byte (BPB) compression efficiency and total runtime. The results show an average BPB improvement of 3.83% at a 0.75-second budget and 3.44% at zero additional time, with maximum gains exceeding 6%. The support set selection process is 25.8× faster than SIFT, and gradient reuse further accelerates fine-tuning by 1.48×, saving nearly 90 seconds overall. Notably, the geometric diversity of the support set emerges naturally from the convex hull, eliminating the need for explicit diversity penalties. The repeated samples, generated through integerization, enable effective gradient caching, which is crucial for real-time deployment.

Overall, HullFT offers a scalable, theoretically grounded, and practically effective solution for high-speed test-time adaptation of large language models. Its combination of convex geometric optimization, integerization, and gradient caching addresses the core bottlenecks of traditional TTFT, making it suitable for latency-critical applications such as conversational AI, content moderation, and personalized assistants. The framework's modular design also opens avenues for future enhancements, including adaptive support set generation, multi-modal extension, and integration with internal model representations. This work marks a significant step toward making large models more adaptable, efficient, and deployable in real-world scenarios, bridging the gap between research and practical AI deployment.

Deep Analysis

Background

The evolution of large-scale language models (LLMs) such as GPT-3, BERT, and T5 has transformed NLP by enabling models with billions of parameters to perform a wide range of tasks with minimal task-specific tuning. These models are trained on web-scale corpora, capturing broad linguistic and factual knowledge. However, their deployment often faces challenges related to adaptability and efficiency. Fine-tuning on task-specific data improves performance but is computationally expensive and slow, especially for large models. Test-time fine-tuning (TTFT) emerged as a promising paradigm, allowing models to adapt dynamically to each input by retrieving relevant data, updating parameters, and evaluating. Early approaches like gradient-based meta-learning (e.g., MAML) provided theoretical foundations but were impractical at scale. Recent methods such as kNN retrieval and SIFT focus on selecting relevant samples for quick adaptation, yet they suffer from redundancy, high computational cost, and limited diversity. These issues hinder real-time deployment, especially in latency-sensitive applications like chatbots, real-time translation, and content moderation. The challenge remains to develop methods that can efficiently select diverse, relevant data and perform rapid updates without sacrificing accuracy.

Core Problem

The core problem addressed in this work is how to efficiently perform test-time fine-tuning on large language models under strict latency constraints. Existing methods either rely on simple nearest-neighbor retrieval, which often results in redundant samples and limited diversity, or on complex selection heuristics that are computationally expensive. Additionally, frequent model updates via gradient descent incur high computational costs, making real-time adaptation infeasible. The key bottlenecks are: 1) selecting a relevant and diverse support set quickly; 2) performing fast fine-tuning with minimal resource consumption; 3) avoiding redundant computations during repeated samples. Overcoming these bottlenecks is critical for deploying adaptive models in real-world, latency-sensitive environments such as conversational agents, where response time directly impacts user experience. The challenge is to balance relevance, diversity, and computational efficiency simultaneously, requiring novel algorithmic solutions that leverage geometric insights and optimization techniques.

Innovation

The main innovations introduced in this paper are threefold: 1) convex geometric support set selection—by formulating the data selection as a sparse convex approximation problem solved via the Frank–Wolfe algorithm, which naturally balances relevance and diversity without explicit penalties; 2) geometric integerization—converting fractional convex weights into an integer multiset that preserves approximation quality and induces natural sample repetition, facilitating efficient gradient caching; 3) gradient reuse—exploiting repeated samples in the multiset to cache and reuse gradients across multiple optimizer steps, dramatically reducing the number of forward-backward passes. These innovations collectively enable a highly efficient TTFT pipeline that maintains high adaptation quality while significantly reducing computational costs. The approach leverages the geometry of convex hulls, avoiding heuristic or greedy sample selection, and introduces a principled way to handle sample diversity and redundancy simultaneously.

Methodology

�� Define the problem as finding a sparse convex combination of candidate samples that approximates the query embedding, minimizing the squared L2 distance.
�� Use Frank–Wolfe algorithm: starting from a single vertex, iteratively add the candidate that maximally reduces residual error, forming a sparse convex combination with at most d+1 support points.
�� Solve the convex approximation efficiently via inner product evaluations, avoiding projections, and stopping when the residual error falls below a threshold or support size cap is reached.
�� Geometric integerization: convert fractional weights into integer counts through three steps—floor allocation, greedy fill, and local swap refinement—ensuring the total sample size equals N.
�� Construct the support multiset from the integerized weights, where each sample appears multiple times according to its count.
�� During fine-tuning, process the multiset in blocks, caching gradients for repeated samples, and refresh the cache periodically to balance accuracy and speed.
�� Conduct experiments on datasets like The Pile, comparing BPB metrics, total runtime, and support set sizes, validating the speedup and quality improvements over baselines.

Experiments

�� Datasets: 12 subsets from The Pile, including domains like arXiv, DM Math, Enron, GitHub, and others, covering academic, legal, and technical texts.
�� Baselines: compare against kNN retrieval, SIFT, and existing TTFT methods.
�� Metrics: primarily bits-per-byte (BPB) to measure compression efficiency, along with total runtime, selection speed, and fine-tuning effectiveness.
�� Hyperparameters: candidate pool size, support set cap, Frank–Wolfe tolerance, gradient cache refresh interval.
�� Protocol: evaluate at multiple time budgets (e.g., T.75s, T.0s), analyze the impact of support set size, integerization, and gradient reuse through ablation studies. Measure the speedup factors and BPB improvements, and visualize the diversity of selected support sets using t-SNE plots.

Results

�� Across all datasets, HullFT consistently outperforms kNN and SIFT, with an average BPB reduction of 3.83% at T.75s and 3.44% at T.0s, with maximum improvements exceeding 6%.
�� The convex hull-based support set selection is 25.8× faster than SIFT, and gradient caching accelerates fine-tuning by 1.48×, saving nearly 90 seconds in total.
�� The support set's geometric diversity naturally emerges, avoiding explicit diversity penalties, and the repeated samples enable effective gradient reuse, leading to substantial speedups without compromising quality.

Applications

�� Real-time adaptive NLP systems: enabling chatbots, virtual assistants, and content moderation tools to quickly adapt to new inputs with minimal latency.
�� Edge computing: deploying models on resource-constrained devices, where fast support set selection and gradient caching reduce computational load.
�� Personalized AI: dynamically customizing models based on user interactions, improving relevance and user experience in applications like recommendation systems.
�� Cross-modal adaptation: extending the geometric approach to multi-modal models, facilitating rapid adaptation across different data types.

Limitations & Outlook

�� Dependence on the quality of the candidate retrieval pool; limited recall or diversity in the pool constrains the support set quality.
�� The geometric diversity achieved may not fully capture complex semantic nuances in highly intricate tasks.
�� Gradient caching assumes high sample repetition; in highly diverse or novel scenarios, the benefits diminish, potentially leading to bias.
�� The approach's scalability to extremely large models or multi-task settings remains to be validated, and high-dimensional convex optimization may incur additional computational costs.

Plain Language Accessible to non-experts

想象你在准备一份大餐，每次做菜都需要挑选合适的食材。传统的方法就像每次都从超市买一堆新鲜食材，虽然新鲜，但可能会买到重复的食材，浪费时间和资源。而现在，你有一个聪明的厨师助手，能根据你要做的菜，把最重要、最能搭配的几样食材，从存货中挑出来。这个助手会用一种特殊的数学方法，把这些食材的特点（比如味道、颜色）用一种几何的方式表示，然后找到一组既丰富又不重复的食材组合。它还会把这些食材的数量告诉你，确保你用得既充分又不过量。这样，你就可以用更少的时间，做出更美味、丰富的菜肴。这就像HullFT用几何和优化的方法，帮大模型快速找到最相关、最丰富的训练样本，从而在短时间内让模型变得更聪明、更贴合当前任务。

ELI14 Explained like you're 14

想象你在学校里参加一个特别的学习比赛，每次比赛前，你都可以偷偷准备一些资料来帮你表现得更好。可是，如果每次都准备一堆资料，既费时间又容易重复，效果也不一定好。于是，你的哥哥告诉你一个聪明的方法：用一种特殊的“数学魔法”帮你挑选最重要的资料。这种魔法会根据你要答的题目，把所有可能的资料变成一个“几何图形”，然后用一种叫做“凸组合”的技巧，从这个图形里挑出一小部分既相关又多样的资料。更酷的是，这个魔法还能把重复的资料合成一份，节省时间。每次你用这些资料练习时，哥哥还会帮你记住哪些资料用得多，就重复使用它们，省得每次都重新看一遍。这就像HullFT的方法，用几何和聪明的技巧，帮大模型在短时间内学到更多有用的知识，让它变得更聪明、更快反应。是不是很厉害呢？

Abstract

Test-time finetuning (TTFT) is a rapidly evolving paradigm that adapts a language model to each prompt by retrieving related sequences, updating the model on them, and then evaluating the prompt. However, TTFT is only practical if it is fast: selection and finetuning both happen per query, making each a direct bottleneck. Existing methods trade speed for quality: fast retrieval is often redundant, while stronger diversity-aware selection adds prohibitive per-query cost. We introduce HullFT, a geometric approach to TTFT that addresses both bottlenecks. Given a query, HullFT first represents the query embedding as a sparse convex combination of few training sequences, using efficient projection-free Frank-Wolfe optimization. This yields a support set that is inherently relevant and diverse. We then convert the fractional convex weights into an exact integer multiset for finetuning through a geometric integerization procedure. The resulting multiplicities naturally create repeated examples, which we exploit with Gradient Reuse to amortize forward-backward computation across repeated finetuning steps. Our experiments show that HullFT improves the quality-efficiency tradeoff over current state-of-the-art TTFT methods, achieving lower bits-per-byte at substantially lower total runtime.

cs.LG

References (20)

Test-Time Training on Nearest Neighbors for Large Language Models

Moritz Hardt, Yu Sun

2023 74 citations ⭐ Influential View Analysis →

Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs

Jonas Hübotter, Sascha Bongni, Ido Hakimi et al.

2024 39 citations ⭐ Influential View Analysis →

Language Models are Unsupervised Multitask Learners

Alec Radford, Jeff Wu, R. Child et al.

2019 28779 citations

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black et al.

2020 2822 citations View Analysis →

The Surprising Effectiveness of Test-Time Training for Abstract Reasoning

Ekin Akyürek, Mehul Damani, Linlu Qiu et al.

2024 52 citations

Some comments on Wolfe's ‘away step’

J. Guélat, P. Marcotte

1986 231 citations

Test-Time Training with Self-Supervision for Generalization under Distribution Shifts

Yu Sun, X. Wang, Zhuang Liu et al.

2019 1264 citations

Geometric Approximation via Coresets

P. Agarwal, Sariel Har-Peled, K. Varadarajan

2007 445 citations

Test-Time Training with Masked Autoencoders

Yossi Gandelsman, Yu Sun, Xinlei Chen et al.

2022 252 citations View Analysis →

Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization

Martin Jaggi

2013 1495 citations

LESS: Selecting Influential Data for Targeted Instruction Tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan et al.

2024 499 citations View Analysis →

Tight Bounds for Approximate Carathéodory and Beyond

V. Mirrokni, R. Leme, Adrian Vladu et al.

2015 38 citations View Analysis →

Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Amir Mallak, Alaa Maalouf

2026 1 citations View Analysis →

A unified framework for approximating and clustering data

Dan Feldman, M. Langberg

2011 491 citations View Analysis →

AutoCoreset: An Automatic Practical Coreset Construction Framework

Alaa Maalouf, M. Tukan, V. Braverman et al.

2023 4 citations View Analysis →

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao et al.

2023 78 citations View Analysis →

The Influence Curve and Its Role in Robust Estimation

F. Hampel

1974 2998 citations

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow et al.

2023 945 citations View Analysis →

Coresets for Data-efficient Training of Machine Learning Models

Baharan Mirzasoleiman, J. Bilmes, J. Leskovec

2019 513 citations

New Frameworks for Offline and Streaming Coreset Constructions

V. Braverman, Dan Feldman, Harry Lang

2016 152 citations View Analysis →

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Limitations & Outlook

Plain Language Accessible to non-experts

ELI14 Explained like you're 14

Abstract

References (20)

Related Papers

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

On the Oracle Complexity of Interpolation-Based Gradient Descent

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Zero-Shot Active Feature Acquisition via LLM-Elicitation

Looped World Models

Kolmogorov Regression for Robust Diffusion Policies