Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

TL;DR

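The paper shows that routers and their experts in Sparse Mixture-of-Experts models are geometrically coupled, and demonstrates this with a parameter-free online K-Means router that achieves the lowest load imbalance among the compared balancing methods with only a modest perplexity increase.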

cs.LG Advanced 2026-05-13

Sagi Ahrac, Noya Hochwald, Mor Geva


Sparse Mixture-of-Experts Geometric Coupling Router Load Balancing Language Model

Key Findings

Methodology

This study investigates how routing decisions form in Sparse Mixture-of-Experts (SMoE) models and reveals a geometric coupling between routers and their corresponding experts. A gradient analysis shows that, for a given token, the router weights for the selected expert and the weights of the expert processing that token receive updates along the same input direction, differing only in scalar coefficients. The paper also proposes a parameter-free online K-Means router in which each expert maintains a running average of the hidden states routed to it, and tokens are assigned based on cosine similarity to those averages.
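The gradient claim above can be checked numerically on a toy layer. The following is a minimal sketch, not the paper's setup: it assumes a linear router with softmax gates, top-1 selection, a two-layer ReLU expert, and a squared-error loss, and all names (`W_r`, `W1`, `W2`, `numerical_grad`) are hypothetical. It estimates gradients by finite differences and measures how well each gradient row aligns with the input `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, hidden = 8, 4, 16

# Toy input, target, and weights (all assumptions made for illustration).
x = rng.normal(size=d)
target = rng.normal(size=d)
W_r = rng.normal(size=(n_experts, d)) / np.sqrt(d)   # router: one row per expert
W1 = rng.normal(size=(hidden, d)) / np.sqrt(d)       # first layer of the selected expert
W2 = rng.normal(size=(d, hidden)) / np.sqrt(hidden)  # second layer of the selected expert

# Top-1 selection, held fixed while the weights are perturbed.
selected = int(np.argmax(W_r @ x))

def loss(W_r, W1):
    scores = W_r @ x
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()
    expert_out = W2 @ np.maximum(W1 @ x, 0.0)
    y = gates[selected] * expert_out
    return 0.5 * np.sum((y - target) ** 2)

def numerical_grad(f, W, eps=1e-6):
    """Central finite-difference gradient of f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

grad_router = numerical_grad(lambda W: loss(W, W1), W_r)
grad_expert = numerical_grad(lambda W: loss(W_r, W), W1)

# Router row of the selected expert versus the input direction.
router_cos = cosine(grad_router[selected], x)

# Expert neurons with a non-negligible gradient (inactive ReLU units get zero).
active = np.linalg.norm(grad_expert, axis=1) > 1e-4
expert_cos = np.array([cosine(row, x) for row in grad_expert[active]])

print("router row vs x:", router_cos)
print("expert rows vs x:", expert_cos)
```

Under these assumptions every reported cosine should be close to +1 or -1: each gradient row is a scalar multiple of `x`, and only the scalar (and its sign) differs between the router row and the expert neurons.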

Key Results

In a 1B SMoE model trained from scratch, higher router scores predict stronger expert neuron activations, indicating that routing decisions are mirrored inside the selected expert.
Auxiliary load-balancing losses disrupt the router-expert geometric coupling by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other.
Compared with auxiliary-loss and loss-free balancing, the parameter-free online K-Means router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns.
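This summary does not state how load imbalance is measured. As an illustration only, one common choice is the ratio of the busiest expert's token count to the mean count per expert; the function name `load_imbalance` and this definition are assumptions, not the paper's metric.

```python
import numpy as np

def load_imbalance(assignments, n_experts):
    """Max-over-mean expert load; 1.0 means perfectly balanced (assumed metric)."""
    counts = np.bincount(assignments, minlength=n_experts)
    return counts.max() / counts.mean()

balanced = load_imbalance(np.array([0, 1, 2, 3]), n_experts=4)
skewed = load_imbalance(np.array([0, 0, 0, 1]), n_experts=4)
print(balanced, skewed)  # 1.0 3.0
```

A value of 1.0 corresponds to every expert receiving the same number of tokens; larger values mean a single expert is absorbing a disproportionate share.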

Significance

This research offers a new perspective on routing decisions in Sparse Mixture-of-Experts models by revealing the geometric coupling between routers and experts. Because matched router and expert directions accumulate the same routed-token history, the router learns an assignment geometry that supports expert specialization. The parameter-free online K-Means router, which relies on this geometry alone, achieves the lowest load imbalance among the compared methods, highlighting how much of router learning the coupling explains. The finding suggests that future routing methods may benefit from preserving the natural router-expert geometry that emerges during training.

Technical Contribution

The technical contributions are the identification of a geometric coupling between routers and experts in Sparse Mixture-of-Experts models and a parameter-free online K-Means router. This router assigns each token by cosine similarity to per-expert running averages of the hidden states previously routed to each expert. The paper also analyzes the effect of auxiliary load-balancing losses on the coupling and finds that they weaken it.

Novelty

The paper reveals the geometric coupling between routers and experts in Sparse Mixture-of-Experts models and proposes a parameter-free online K-Means router. The novelty lies in showing that effective token assignment can be obtained from this geometry alone, without learned router parameters or gradient updates to the router.

Limitations

In some cases, auxiliary load balancing losses may weaken the geometric coupling between routers and experts, leading to reduced expert specialization.
The experiments focus mainly on a 1B SMoE model; the findings may not carry over unchanged to larger models.
The parameter-free online K-Means router incurs a modest perplexity increase, which may matter in applications that are sensitive to language-modeling quality.

Future Work

Future research could explore the effects of geometric coupling in larger-scale models and how to further optimize the parameter-free online K-Means router without increasing perplexity. Additionally, other types of load balancing methods could be studied to enhance expert specialization.

AI Executive Summary

Sparse Mixture-of-Experts (SMoE) models have gained attention for their ability to scale language models efficiently, since only a few experts are active for each token. Training them remains challenging, however: routing can collapse onto a few experts, and the auxiliary load-balancing losses used to prevent this can reduce specialization. This paper investigates how routing decisions in SMoE models are formed from a geometric perspective, revealing a geometric coupling between routers and their corresponding experts.

The study finds that router weights and expert weights receive gradient updates along the same input direction, differing only in scalar coefficients. This geometric coupling is empirically validated in routing dynamics. In a 1B SMoE model trained from scratch, higher router scores predict stronger expert neuron activations, indicating that routing decisions are mirrored inside the selected expert.
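The shared input direction follows from the chain rule once the router score and the expert's first layer are both written as linear functions of the token. The derivation below is a sketch under assumed notation (none of these symbols appear in the source), with plain SGD and attention restricted to the tokens routed to expert $e$.

```latex
% Assumed notation: x is the token's hidden state, r_e the router row of the
% selected expert e, w_{e,k} the k-th row of that expert's first weight
% matrix, and L the training loss.
\begin{align*}
s_e &= r_e^\top x, & h_k &= w_{e,k}^\top x, \\
\frac{\partial L}{\partial r_e} &= \frac{\partial L}{\partial s_e}\, x, &
\frac{\partial L}{\partial w_{e,k}} &= \frac{\partial L}{\partial h_k}\, x .
\end{align*}
% With plain SGD (step size \eta) over the tokens x_t routed to expert e:
\begin{align*}
r_e^{(T)} &= r_e^{(0)} - \eta \sum_{t} \alpha_t\, x_t, &
w_{e,k}^{(T)} &= w_{e,k}^{(0)} - \eta \sum_{t} \beta_{t,k}\, x_t .
\end{align*}
% Here \alpha_t = \partial L_t / \partial s_e and \beta_{t,k} = \partial L_t / \partial h_k.
```

Both weight vectors are therefore weighted sums of the same routed tokens $x_t$, with only the scalar coefficients $\alpha_t$ and $\beta_{t,k}$ differing, which is the sense in which matched router and expert directions accumulate the same token history.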

However, common auxiliary load-balancing losses disrupt this geometric structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. This weakens the coupling between routers and experts: expert-specific directions become homogenized, eroding the specialization that the coupling produces.
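One way to quantify how similar distinct router directions are is the mean cosine similarity over all pairs of router rows. The summary does not say which similarity measure the paper uses, so the sketch below is an assumption; `mean_pairwise_cosine` and the example matrices are hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(W_r):
    """Mean cosine similarity between distinct rows of a router matrix."""
    unit = W_r / np.linalg.norm(W_r, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Drop the diagonal, which is always 1 (each row compared with itself).
    off_diag = sim[~np.eye(len(W_r), dtype=bool)]
    return float(off_diag.mean())

orthogonal_rows = mean_pairwise_cosine(np.eye(3))
identical_rows = mean_pairwise_cosine(np.ones((3, 4)))
print(orthogonal_rows, identical_rows)
```

Orthogonal router rows give a value of 0 and identical rows give 1, so a router whose directions have been pulled together by a balancing loss would score higher on this measure than one trained without it.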

To demonstrate how central geometric coupling is to effective routing, the paper proposes a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns.
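The router described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes top-1 assignment, random centroid initialization, and an exact running mean updated per batch, and the class and method names are hypothetical.

```python
import numpy as np

class OnlineKMeansRouter:
    """Parameter-free router: one running-mean centroid per expert (sketch)."""

    def __init__(self, n_experts, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed initialization; the source does not specify one.
        self.centroids = rng.normal(size=(n_experts, dim))
        self.counts = np.zeros(n_experts, dtype=np.int64)

    def route(self, hidden):
        """Assign each token to the expert whose centroid is most cosine-similar."""
        h = hidden / (np.linalg.norm(hidden, axis=1, keepdims=True) + 1e-12)
        c = self.centroids / (np.linalg.norm(self.centroids, axis=1, keepdims=True) + 1e-12)
        return np.argmax(h @ c.T, axis=1)

    def update(self, hidden, assignments):
        """Fold the newly routed hidden states into each expert's running mean."""
        for e in np.unique(assignments):
            batch = hidden[assignments == e]
            n_new = len(batch)
            total = self.counts[e] + n_new
            self.centroids[e] += (batch.sum(axis=0) - n_new * self.centroids[e]) / total
            self.counts[e] = total

router = OnlineKMeansRouter(n_experts=2, dim=2)
router.centroids = np.array([[1.0, 0.0], [0.0, 1.0]])  # fixed centroids for the demo
tokens = np.array([[2.0, 0.1], [0.1, 3.0], [5.0, 0.0]])
assignments = router.route(tokens)
router.update(tokens, assignments)
print(assignments, router.centroids)
```

No gradient reaches the centroids: they change only through the averaging in `update`, which is what makes the router parameter-free in the sense used here. Because the counts start at zero, the first batch routed to an expert replaces its initial centroid with the mean of that batch.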

Overall, the study reveals how routers form assignment geometry that supports an effective division of labor. More broadly, it suggests that future routing methods may benefit from preserving the natural router-expert geometry that emerges during training.

Deep Dive

Abstract

Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router--expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a $1$B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router--expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.

cs.LG cs.CL

