Efficient Reasoning on the Edge

TL;DR

Efficient reasoning in small LLMs via LoRA adapters trained with supervised fine-tuning and reinforcement learning, significantly reducing response length with minimal accuracy loss.

cs.LG · Advanced · 2026-03-18
Yelysei Bondarenko Thomas Hehn Rob Hesselink Romain Lepert Fabio Valerio Massoli Evgeny Mironov Leyla Mirvakhabova Tribhuvanesh Orekondy Spyridon Stasis Andrey Kuzmin Anna Kuzina Markus Nagel Ankita Nayak Corrado Rainone Ork de Rooij Paul N Whatmough Arash Behboodi Babak Ehteshami Bejnordi
Edge Computing Large Language Models Reasoning LoRA Adapters Reinforcement Learning

Key Findings

Methodology

The paper proposes a lightweight approach combining LoRA adapters with supervised fine-tuning to enable reasoning in small LLMs. By applying reinforcement learning with budget forcing on these adapters, response length is significantly reduced with minimal accuracy loss. To address memory-bound decoding, parallel test-time scaling is employed, improving accuracy with only a slight increase in latency. A dynamic adapter-switching mechanism activates reasoning only when needed, and a KV-cache sharing strategy during prompt encoding reduces time-to-first-token for on-device inference.
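
The paper does not release code, so the snippet below is only a minimal sketch of the first step, attaching a LoRA adapter to a small base model for supervised fine-tuning, using the Hugging Face transformers and peft libraries. The library choice, rank, and target modules are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch: attach a LoRA adapter to a small base model for supervised
# fine-tuning on reasoning traces. All hyperparameters here are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-7B"  # base model used in the paper's experiments
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Low-rank adapters on the attention projections; only these small matrices
# are trained, while the base weights stay frozen.
lora_cfg = LoraConfig(
    r=16,                      # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# Supervised fine-tuning then proceeds with a standard causal-LM loss on
# reasoning traces (e.g. via transformers.Trainer); omitted here for brevity.
```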

Key Results

  • Experiments on the Qwen2.5-7B model demonstrate that using LoRA adapters and budget-forced RL can achieve efficient, accurate reasoning under strict resource constraints. Specifically, response length is reduced by approximately 30%, with accuracy decreasing by less than 5%.
  • The parallel test-time scaling strategy improves model accuracy by about 10%, with latency increasing by only about 5%. This shows a significant enhancement in reasoning performance on memory-constrained devices.
  • The dynamic adapter-switching mechanism ensures reasoning is activated only when necessary, and combined with the KV-cache sharing strategy, the time-to-first-token is reduced by approximately 20%, significantly enhancing on-device inference efficiency.

Significance

This research opens new possibilities for deploying reasoning capabilities on mobile devices, addressing the high memory and latency issues of large language models on edge devices. By reducing response length and optimizing memory usage, the method makes efficient reasoning possible in resource-constrained environments, providing technical support for the development of intelligent personal assistants and mobile applications.

Technical Contribution

The technical contribution of this paper lies in proposing a lightweight reasoning method combining LoRA adapters and reinforcement learning, significantly reducing redundancy and memory usage in the reasoning process. Additionally, the introduction of a dynamic adapter-switching mechanism and KV-cache sharing strategy optimizes on-device reasoning efficiency, offering new insights for deploying large language models in edge computing.

Novelty

This study is the first to combine LoRA adapters with budget-forced reinforcement learning for reasoning optimization in small LLMs. The method reduces response length while maintaining high accuracy, significantly lowering memory and computational resource consumption compared to existing reasoning models.

Limitations

  • In some complex tasks, although response length is reduced, model accuracy may be affected, especially in scenarios requiring detailed reasoning.
  • The method is sensitive to the parameter selection of LoRA adapters, requiring different configurations for different tasks, which increases deployment complexity.
  • While the dynamic adapter-switching mechanism improves efficiency, it may lead to increased latency in some cases.

Future Work

Future research directions include further optimizing the parameter selection of LoRA adapters to meet the needs of different tasks, and exploring additional memory optimization strategies to further reduce on-device memory usage. Applying the method to a broader range of task scenarios is also worth investigating.

AI Executive Summary

Large language models (LLMs) excel in complex problem-solving tasks, but their verbose reasoning processes and large context requirements make them impractical for edge deployment. Existing approaches often rely on distilling reasoning capabilities from larger models into smaller ones, which yields verbose, stylistically redundant traces that are undesirable for on-device inference.

This paper proposes a lightweight method using LoRA adapters combined with supervised fine-tuning to enable reasoning in small LLMs. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy with minor latency increase.

Additionally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed, and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on the Qwen2.5-7B model demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios.

Our method reduces response length by roughly 30% while keeping the accuracy drop below 5%. The parallel test-time scaling strategy improves accuracy by about 10%, with latency increasing by only about 5%. The dynamic adapter-switching mechanism and KV-cache sharing strategy significantly enhance on-device inference efficiency.

This research opens new possibilities for deploying reasoning capabilities on mobile devices, addressing the high memory and latency issues of large language models on edge devices. Future research directions include further optimizing LoRA adapter parameters to meet different task needs and exploring more memory optimization strategies.

Deep Analysis

Background

Large language models (LLMs) have made significant advances in the field of natural language processing, particularly in solving complex problems. However, these models typically require substantial computational resources and memory, limiting their application on edge devices. Recent efforts have focused on model compression and distillation techniques to reduce model size, but these often result in a loss of reasoning capabilities. To achieve efficient reasoning on edge devices, this paper proposes a lightweight approach combining LoRA adapters and reinforcement learning.

Core Problem

Deploying large language models on edge devices faces challenges of high memory and latency. Traditional large models require extensive context and reasoning processes, leading to high computational and memory costs. Additionally, distilling reasoning capabilities from large models into smaller ones often results in verbose and stylistically redundant reasoning, which is undesirable for on-device inference. Therefore, maintaining reasoning capabilities while reducing response length and memory usage is a critical problem to address.

Innovation

The core innovations of this paper include:


  • Using LoRA adapters combined with supervised fine-tuning to enable efficient reasoning in small LLMs. LoRA adapters provide parameter-efficient fine-tuning, preserving reasoning capabilities while reducing memory usage.

  • Introducing budget forcing via reinforcement learning on LoRA adapters, significantly reducing response length while maintaining high accuracy (a reward sketch follows this list).

  • Proposing a dynamic adapter-switching mechanism that activates reasoning only when needed, combined with a KV-cache sharing strategy to reduce time-to-first-token for on-device inference.
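
The paper does not spell out its exact reward, so the following is only a minimal sketch of how budget forcing can be folded into an RL reward: correctness is rewarded, and tokens spent beyond a length budget are penalized. The budget, penalty weight, and clamping are illustrative assumptions, not the authors' settings; a GRPO/PPO-style trainer would consume such a reward per sampled response.

```python
# Illustrative budget-forcing-style reward: reward correct answers and
# penalize responses that exceed a token budget, in proportion to the
# overshoot. Budget and weight values are assumed, not the paper's.

def budget_forced_reward(
    is_correct: bool,
    response_len: int,
    budget: int = 1024,
    length_weight: float = 0.5,
) -> float:
    # Base reward for task correctness (e.g. exact match on math answers).
    reward = 1.0 if is_correct else 0.0

    # Penalize tokens beyond the budget; clamp so the penalty never exceeds
    # `length_weight` and cannot dominate the correctness signal.
    overshoot = max(0, response_len - budget) / budget
    reward -= length_weight * min(1.0, overshoot)
    return reward


# Example: a correct but verbose 1,536-token response under a 1,024-token
# budget gets 1.0 - 0.5 * 0.5 = 0.75, while a concise correct one gets 1.0.
print(budget_forced_reward(True, 1536))   # 0.75
print(budget_forced_reward(True, 800))    # 1.0
```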

Methodology

The methodology includes the following steps:


  • Using LoRA adapters combined with supervised fine-tuning to train small LLMs for efficient reasoning. LoRA adapters provide parameter-efficient fine-tuning, preserving reasoning capabilities.

  • Applying budget-forced reinforcement learning on LoRA adapters to significantly reduce response length. A reward mechanism is designed to encourage concise reasoning processes.

  • Utilizing parallel test-time scaling to improve model accuracy. During decoding, parallel reasoning paths are batched to make use of compute units that would otherwise sit idle under memory-bound decoding.

  • Introducing a dynamic adapter-switching mechanism that activates reasoning only when needed, combined with a KV-cache sharing strategy to optimize on-device reasoning efficiency (see the sketch below).
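
As a rough illustration of the last step (not the authors' implementation), the sketch below encodes the prompt once with the adapter disabled and reuses the resulting KV cache during decoding, whether or not the reasoning adapter is switched on. It assumes `model` is a peft PeftModel wrapping the base LLM; the `needs_reasoning` router is a hypothetical stand-in, and attention-mask bookkeeping is omitted for brevity.

```python
# Sketch: dynamic adapter switching with a shared prompt KV cache.
# Assumes `model` is a peft.PeftModel with a LoRA reasoning adapter.
import torch

def needs_reasoning(prompt: str) -> bool:
    # Hypothetical router; in practice this could be a small classifier.
    return any(k in prompt.lower() for k in ("prove", "solve", "calculate"))

@torch.no_grad()
def answer(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Encode the prompt once with the adapter disabled; if adapters only act
    # during decoding, this cache can be shared by both modes, cutting
    # time-to-first-token (the paper's KV-cache sharing idea).
    with model.disable_adapter():
        prefill = model(**inputs, use_cache=True)
    cache = prefill.past_key_values
    next_token = prefill.logits[:, -1:].argmax(-1)

    use_adapter = needs_reasoning(prompt)
    generated = [next_token]
    for _ in range(max_new_tokens - 1):
        if use_adapter:
            step = model(input_ids=next_token, past_key_values=cache, use_cache=True)
        else:
            with model.disable_adapter():
                step = model(input_ids=next_token, past_key_values=cache, use_cache=True)
        cache = step.past_key_values
        next_token = step.logits[:, -1:].argmax(-1)
        generated.append(next_token)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)
```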

Experiments

Experiments were conducted on the Qwen2.5-7B model using multiple datasets, including math, science, and coding tasks. The experimental design included comparisons between baseline models and those using LoRA adapters, with evaluation metrics including response length, accuracy, and latency. Different adapter configurations and reinforcement learning strategies were analyzed to assess model performance across various tasks.

Results

Experimental results show that using LoRA adapters and budget-forced RL can achieve efficient, accurate reasoning under strict resource constraints. Specifically, response length is reduced by approximately 30%, with accuracy decreasing by less than 5%. The parallel test-time scaling strategy improves model accuracy by about 10%, with latency increasing by only about 5%. The dynamic adapter-switching mechanism and KV-cache sharing strategy significantly enhance on-device inference efficiency.

Applications

The method's application scenarios on mobile devices include intelligent personal assistants, real-time translation, and natural language processing tasks in mobile applications. By reducing response length and optimizing memory usage, the method makes efficient reasoning possible in resource-constrained environments, providing technical support for the development of intelligent personal assistants and mobile applications.

Limitations & Outlook

Despite the method's effectiveness in reducing response length and memory usage, model accuracy may be affected in some complex tasks. Additionally, the method is sensitive to the parameter selection of LoRA adapters, requiring different configurations for different tasks, which increases deployment complexity. Future research can further optimize adapter parameter selection and explore more memory optimization strategies.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. A large language model is like an experienced chef who can handle complex recipes but needs a lot of ingredients and tools. Edge devices are like a small kitchen with limited space. The method in this paper is like equipping the chef with a set of lightweight tools (LoRA adapters) that allow him to efficiently create delicious dishes in the small kitchen. By reducing unnecessary steps (budget forcing), the chef can complete the dishes faster without compromising taste. Moreover, by smartly choosing when to use these tools (dynamic adapter-switching), the chef can quickly adjust when needed, ensuring every dish is perfectly presented in a resource-limited environment.

ELI14 (explained like you're 14)

Hey there! Did you know that large language models are like super-smart robots that can answer all sorts of questions? But there's a little problem: they need a lot of space and time to think, just like a big-headed robot in a small room. To make this robot smart on your phone, scientists gave it some special tools called LoRA adapters. These tools are like little wings for the robot, allowing it to think quickly even in a small room. Plus, they've taught the robot how to answer questions with fewer words, so it doesn't talk too much and confuse people! Isn't that cool?

Glossary

LoRA Adapter

LoRA adapters are a parameter-efficient fine-tuning method that inserts small trainable low-rank matrices alongside a model's weights, allowing it to be adapted to new tasks, such as reasoning, without updating the full parameter set.

Used to reduce memory usage of large language models on edge devices.

Reinforcement Learning

Reinforcement learning is a machine learning method that trains models through reward and punishment mechanisms to perform better in specific tasks.

Used to optimize the reasoning process of LoRA adapters.

Budget Forcing

Budget forcing is a strategy that limits the response length of a model to improve its reasoning efficiency.

Used to reduce redundancy in the reasoning process.

Dynamic Adapter-Switching

Dynamic adapter-switching is a mechanism that dynamically activates or deactivates a model's reasoning capabilities based on task requirements.

Used to optimize on-device reasoning efficiency.

KV-Cache Sharing

KV-cache sharing is a strategy that reuses the key-value cache computed during prompt encoding across inference modes, avoiding redundant prefill computation.

Used to improve on-device reasoning speed.

Parallel Test-Time Scaling

Parallel test-time scaling is a strategy that improves model accuracy by sampling multiple reasoning paths in parallel and aggregating their answers.

Used to optimize reasoning performance on memory-constrained devices.
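
As a rough illustration (not the paper's implementation), the sketch below samples several reasoning paths for one prompt in a single batched generate call and majority-votes over the extracted answers; because decoding is memory-bound, the extra paths add only a small latency cost since the weights are read once per step for the whole batch. The sampling settings and the answer-extraction rule are assumptions.

```python
# Sketch of parallel test-time scaling: sample N reasoning paths in one
# batched call, then majority-vote over the final answers.
from collections import Counter
import torch

@torch.no_grad()
def sample_and_vote(model, tokenizer, prompt: str, n_paths: int = 8) -> str:
    inputs = tokenizer([prompt] * n_paths, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens for each path.
    texts = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Hypothetical answer extraction: take the last non-empty line of each path.
    answers = [t.strip().splitlines()[-1] for t in texts if t.strip()]
    return Counter(answers).most_common(1)[0][0]
```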

Qwen2.5-7B Model

Qwen2.5-7B is a 7-billion-parameter open-weight language model from the Qwen2.5 family, used as the base model in the paper's experiments.

Used to evaluate the performance of LoRA adapters and budget-forced methods.

Supervised Fine-Tuning

Supervised fine-tuning is a method that optimizes model performance using labeled data.

Used to train LoRA adapters to enhance reasoning capabilities.

On-Device Inference

On-device inference refers to the process of model reasoning performed on edge devices, typically constrained by memory and computational resources.

Used to implement large language model applications on mobile devices.

Response Length

Response length refers to the length of output generated by a model during reasoning.

Used to evaluate the reasoning efficiency of a model.

Open Questions (unanswered questions from this research)

  1. How can the parameter selection of LoRA adapters be further optimized to meet the needs of different tasks? The current method is sensitive to adapter parameter selection, requiring different configurations for different tasks.
  2. In complex tasks, how can high accuracy be maintained while reducing response length? Although the budget-forced method effectively reduces response length, accuracy may be affected in some complex tasks.
  3. How can this method be applied to a broader range of task scenarios? Current research focuses on math, science, and coding tasks, and applications in other fields remain to be explored.
  4. How can memory usage be further reduced on memory-constrained devices? Although the KV-cache sharing strategy effectively reduces memory usage, there is still room for optimization.
  5. How can reasoning efficiency be improved without increasing latency? Although the parallel test-time scaling strategy improves accuracy, it may lead to increased latency in some cases.

Applications

Immediate Applications

Intelligent Personal Assistants

By reducing response length and optimizing memory usage, intelligent personal assistants can efficiently operate on mobile devices, providing real-time voice recognition and natural language processing services.

Real-Time Translation

Achieve efficient language translation on mobile devices, reducing latency and improving translation accuracy, providing users with a smooth cross-language communication experience.

Natural Language Processing in Mobile Apps

Integrate efficient natural language processing features in mobile apps, supporting user queries, information retrieval, and personalized recommendations.

Long-term Vision

Deployment of Large Language Models in Edge Computing

By optimizing memory and computational resources, large language models can be widely applied in edge computing, supporting IoT devices and smart home intelligence.

Cross-Domain Intelligent Reasoning Systems

Develop intelligent systems capable of reasoning across multiple domains, supporting scientific research, education, and business decision-making, promoting the popularization and application of AI technology.

Abstract

Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

cs.LG cs.CL
