The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

TL;DR

MLP layers in Transformers perform binary routing of continuous signals. Causal validation in GPT-2 Small shows that removing the MLP at consensus breakdown increases perplexity by 43.3%, versus only 10.1% at full consensus.

cs.LG · 2026-03-12
Peter Balogh
MLP · Transformer · binary routing · GPT-2 · neural networks

Key Findings

Methodology

This study employs the GPT-2 small model (124M parameters) to analyze the binary routing characteristics of MLP layers when processing continuous signals. By examining neuron activation patterns across different layers, it was found that early layers use single gateway neurons for routing, middle layers exhibit diffuse processing, and late layers crystallize full consensus/exception architectures. Causal validation confirms the functionality of this routing structure.

Key Results

  • In the GPT-2 small model, specific neurons implement a consensus architecture: seven 'default-ON' neurons and one exception handler (N2123 in Layer 11), which are 93-98% mutually exclusive, forming a binary routing switch.
  • Causal validation shows that removing the MLP at consensus breakdown increases perplexity by 43.3%, while at full consensus, removal increases it by only 10.1%.
  • Comparing binary vs. continuous features for routing decisions reveals that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22).
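
The mutual-exclusivity statistic behind the consensus/exception switch can be sketched on synthetic data. This is a hypothetical illustration, not GPT-2 activations: two simulated neurons are binarized at zero, and mutual exclusivity is the fraction of tokens where exactly one of them fires.

```python
import numpy as np

# Hypothetical sketch: mutual exclusivity between a "default-ON" consensus
# neuron and an exception-handler neuron. Activations are binarized (ON/OFF);
# the data here is synthetic, not taken from GPT-2.
rng = np.random.default_rng(0)

n_tokens = 10_000
# Simulate an exception handler that fires on ~5% of tokens, and a consensus
# neuron that is ON except (mostly) when the handler fires.
exception_on = rng.random(n_tokens) < 0.05
consensus_on = ~exception_on
flip = rng.random(n_tokens) < 0.03          # small amount of overlap/noise
consensus_on = np.where(flip, ~consensus_on, consensus_on)

def mutual_exclusivity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of tokens where exactly one of the two neurons is active."""
    return float(np.mean(a ^ b))

print(f"mutual exclusivity: {mutual_exclusivity(consensus_on, exception_on):.1%}")
```

With 3% noise the two simulated neurons land in the 93-98% exclusivity band the paper reports for the Layer 11 switch.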

Significance

This research reveals the binary routing behavior of MLP layers in Transformer models, challenging the traditional view of these layers as smooth function approximators. The routing structure explains why smooth polynomial approximation fails in highly nonlinear layers. The findings suggest that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing.

Technical Contribution

The technical contribution of this paper lies in revealing the binary routing characteristics of MLP layers and empirically validating their functionality. This structure explains why smooth polynomial approximation fails in highly nonlinear layers. The study also shows that binarization loses almost no information in routing decisions, while continuous activations carry additional magnitude information.

Novelty

This paper is the first to reveal the binary routing behavior of MLP layers in Transformer models and to validate it causally. The discovery challenges the view of MLP layers as smooth function approximators and proposes a complementary routing-based framework for understanding deep networks.

Limitations

  • The study is based primarily on the GPT-2 small model, and the results may not generalize to larger or more complex models.
  • Experiments were conducted only on the WikiText-103 dataset, which may not generalize to other types of data.
  • The specific implementation details of binary routing still require further research.

Future Work

Future research could extend the analysis to larger and more complex Transformer models to test whether binary routing is universal. Exploring how to leverage this structure to improve model computational efficiency and performance is another promising direction.

AI Executive Summary

In the field of natural language processing, Transformer models are renowned for their powerful performance and flexibility. However, the MLP layers within Transformers are typically viewed as function approximators, responsible for mapping inputs to outputs. Peter Balogh's study reveals a new perspective: MLP layers actually perform binary routing.

Through analysis of the GPT-2 small model, the study finds that specific neurons form a consensus architecture capable of effectively deciding which tokens require nonlinear processing. This architecture consists of seven 'default-ON' neurons and one exception handler, which are 93-98% mutually exclusive.

Experimental results show that removing the MLP layer at consensus breakdown increases perplexity by 43.3%, while at full consensus, it only increases by 10.1%. This indicates that binary routing plays a crucial role in the model's computation.

Furthermore, the study compares binary vs. continuous features for routing decisions, finding that binarization loses almost no information about the routing decision itself, while continuous activations carry additional magnitude information. This helps explain why smooth polynomial approximation fails on these layers.

This research not only reveals the binary routing characteristics of MLP layers in Transformer models but also provides a new framework for understanding deep networks. Future research could explore how to leverage this structure to optimize model computational efficiency and performance.

Although the study is primarily based on the GPT-2 small model and results may not generalize to other more complex models, it provides important insights for further research.

Deep Analysis

Background

In recent years, Transformer models have made significant advancements in the field of natural language processing. One of their core components is the multilayer perceptron (MLP) layer, typically viewed as a function approximator responsible for mapping inputs to outputs. However, the traditional view that these layers are solely for smooth function approximation overlooks their potential other functionalities. Peter Balogh's study challenges this traditional view, proposing that MLP layers actually perform binary routing. This discovery provides a new perspective for understanding deep networks and could have profound implications for model optimization and performance enhancement.

Core Problem

MLP layers within Transformer models are typically viewed as function approximators responsible for mapping inputs to outputs. However, this perspective overlooks the potential other functionalities of MLP layers, particularly their binary routing characteristics when processing continuous signals. Understanding this characteristic is crucial for optimizing model computational efficiency and performance. However, existing research lacks in-depth analysis and validation of this characteristic.

Innovation

The core innovation of this paper lies in revealing the binary routing characteristics of MLP layers in Transformer models. By analyzing the GPT-2 small model, the study finds that specific neurons form a consensus architecture capable of effectively deciding which tokens require nonlinear processing. This discovery challenges traditional smooth polynomial approximation methods and provides a new framework for understanding deep networks. Additionally, the study empirically validates the functionality of this structure, offering new insights for model optimization and performance enhancement.

Methodology

  • Use the GPT-2 small model (124M parameters) to analyze the binary routing characteristics of MLP layers when processing continuous signals.
  • Examine neuron activation patterns across layers: early layers use single gateway neurons for routing, middle layers exhibit diffuse processing, and late layers crystallize full consensus/exception architectures.
  • Validate the routing structure causally by ablating the MLP under different consensus conditions.
  • Compare binary vs. continuous features for routing decisions: binarization loses almost no information, while continuous activations carry additional magnitude information.
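
The binary-vs.-continuous comparison can be illustrated with a toy probe. This is a minimal sketch on synthetic data (it does not reproduce the paper's 79.2% vs. 78.8% numbers): binarizing an activation preserves the routing decision exactly, while a magnitude-dependent target is predicted better from the continuous value.

```python
import numpy as np

# Illustrative sketch (synthetic data, not GPT-2 activations): binarizing a
# neuron's activation preserves the routing *decision* but discards magnitude.
rng = np.random.default_rng(1)
n = 20_000
h = rng.normal(size=n)                    # continuous pre-activation
route = (h > 0) ^ (rng.random(n) < 0.10)  # routing label with 10% label noise

bin_h = (h > 0).astype(float)             # binarized feature

# Routing accuracy: the same threshold rule on each feature.
acc_cont = np.mean((h > 0) == route)
acc_bin = np.mean((bin_h > 0.5) == route)

# Magnitude regression: predict a magnitude-dependent target from each feature.
target = np.abs(h) * (h > 0) + 0.5 * rng.normal(size=n)

def r2(x, y):
    """R^2 of an ordinary least-squares line fit y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return 1.0 - resid.var() / y.var()

print(f"routing accuracy  continuous={acc_cont:.3f}  binary={acc_bin:.3f}")
print(f"magnitude R^2     continuous={r2(h, target):.2f}  binary={r2(bin_h, target):.2f}")
```

In this toy the two routing accuracies coincide exactly, while the continuous feature's magnitude R^2 is clearly higher, mirroring the paper's 0.36 vs. 0.22 finding in spirit.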

Experiments

Experiments were conducted on the WikiText-103 dataset using the GPT-2 small model (124M parameters, 12 layers, 3072 MLP hidden neurons per layer). The study captures input-output pairs for each token position to analyze the input-output relationship of MLP layers. Experimental designs include polynomial probing, branch detection, and binary feature extraction. Through these experiments, the study validates the binary routing characteristics of MLP layers and analyzes their performance across different layers.
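
The capture step can be sketched with a stand-in MLP block. This is a hedged illustration with random weights in place of trained GPT-2 weights: a GPT-2-style MLP (d_model=768, d_ff=3072, GELU) whose forward pass also records the per-token binary activation mask used in the routing analysis.

```python
import numpy as np

# Hedged sketch of the capture step: a GPT-2-style MLP block with random
# weights standing in for trained ones. For each token position we record the
# MLP output and the binary activation pattern (which hidden neurons fired).
rng = np.random.default_rng(2)
d_model, d_ff, n_tokens = 768, 3072, 16

W_in = rng.normal(0, 0.02, (d_model, d_ff))
b_in = np.zeros(d_ff)
W_out = rng.normal(0, 0.02, (d_ff, d_model))
b_out = np.zeros(d_model)

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_with_capture(x):
    """Forward pass that also returns the per-token binary activation mask."""
    pre = x @ W_in + b_in
    hidden = gelu(pre)
    out = hidden @ W_out + b_out
    binary_mask = pre > 0          # which of the 3072 neurons are "ON"
    return out, binary_mask

x = rng.normal(size=(n_tokens, d_model))   # stand-in residual-stream states
out, mask = mlp_with_capture(x)
print(out.shape, mask.shape, f"mean ON fraction={mask.mean():.2f}")
```

In practice the same input-output pairs would be captured from a real model with forward hooks; the numpy stand-in only shows what is recorded at each token position.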

Results

Experimental results show that specific neurons implement a consensus architecture: seven 'default-ON' neurons and one exception handler (N2123 in Layer 11), which are 93-98% mutually exclusive, forming a binary routing switch. Causal validation shows that removing the MLP at consensus breakdown increases perplexity by 43.3%, while at full consensus, removal increases it by only 10.1%. Additionally, comparing binary vs. continuous features for routing decisions reveals that binarization loses almost no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22).
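
Why smooth polynomial probes fail on a routing-style map can be shown with a hedged toy (not the paper's actual probe): when the output is gated by a binary routing bit that the probe does not observe, cross-validated polynomial fits of any degree explain almost nothing, echoing the paper's finding that fits of degrees 2-7 never exceed R^2 = 0.06 on highly nonlinear layers.

```python
import numpy as np

# Hedged toy: output y is the observed coordinate x routed through a hidden
# binary gate. No smooth polynomial in x alone can explain it.
rng = np.random.default_rng(3)
n = 4_000
x = rng.normal(size=n)              # observed coordinate
gate = np.sign(rng.normal(size=n))  # hidden binary routing bit
y = gate * x                        # routed output: +x or -x per token

train = np.arange(n) % 2 == 0
test = ~train
for degree in range(2, 8):
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[test])
    r2 = 1 - np.mean((y[test] - pred) ** 2) / np.var(y[test])
    print(f"degree {degree}: held-out R^2 = {r2:+.3f}")
```

Every degree lands near zero held-out R^2 here, because the nonlinearity lives in the routing bit, not in any smooth function of the input.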

Applications

The application scenarios of this study include optimizing the computational efficiency and performance of Transformer models. By understanding the binary routing characteristics of MLP layers, more efficient computational paths can be introduced in model design, improving inference speed and accuracy. Additionally, this discovery can be applied to other deep learning models, helping researchers better understand and utilize the internal structure of neural networks.

Limitations & Outlook

Although this study reveals the binary routing characteristics of MLP layers, its results are primarily based on the GPT-2 small model and may not generalize to other more complex models. Additionally, experiments were conducted only on the WikiText-103 dataset, which may not generalize to other types of data. Future research needs to further validate the applicability of this characteristic in other models and datasets.

Plain Language (accessible to non-experts)

Imagine you're managing a large library. Each book has a label that tells you which shelf it should go on. Now, suppose you have a smart robot assistant that can quickly decide which shelf each book should be placed on. This assistant has two modes: a fast mode that decides the shelf based on the book's label, and a slow mode that requires careful analysis of the book's content to decide the shelf.

In this analogy, the books are the input data, the shelves are the output categories, and the robot assistant is the MLP layer. The fast mode is like binary routing, deciding the output based on simple rules, while the slow mode requires more complex computation.

The research finds that certain neurons in the MLP layer act like this assistant's fast mode, quickly deciding which data needs more complex processing. This binary routing structure allows the model to process data more efficiently.

By understanding this structure, we can optimize the model's computational efficiency, allowing the robot assistant to make quick decisions in most cases, thereby improving the overall system performance.

ELI14 (explained like you're 14)

Hey there! Did you know that computers are like super-smart robots that can help us do many things, like translating languages and recognizing pictures? To do this, they need a toolbox called 'Transformer.'

In this toolbox, there's a little tool called MLP. We used to think this tool was just for doing simple math calculations, but recently a scientist discovered it has a hidden skill!

This hidden skill is like a switch that decides when to do complex calculations and when to keep it simple. It's like when you're playing a game, sometimes you need to think hard, and sometimes you can just play casually.

This discovery gives us a new understanding of how computers work, and maybe in the future, we can use this method to make computers smarter and faster!

Glossary

MLP (Multilayer Perceptron)

A type of neural network structure typically used to map inputs to outputs. It consists of multiple hidden layers, each containing several neurons.

In Transformer models, MLP layers are used to process input data and generate output.

Transformer

A neural network architecture used for natural language processing, known for its powerful performance and flexibility.

Transformer models excel in processing sequential data and are widely used in tasks like machine translation and text generation.

GPT-2

A language model based on the Transformer architecture, developed by OpenAI, used for generating natural language text.

This paper uses the GPT-2 small model to analyze the binary routing characteristics of MLP layers.

Binary Routing

A decision-making mechanism that determines the processing path of data through simple binary conditions.

The study finds that certain neurons in MLP layers form a binary routing structure, deciding which data requires nonlinear processing.

Consensus Architecture

A neural network structure consisting of multiple 'default-ON' neurons and one exception handler, used to implement binary routing.

In the GPT-2 small model, specific neurons implement a consensus architecture, forming a binary routing switch.

Perplexity

A metric used to measure the performance of language models, with lower values indicating better performance.

Causal validation shows that removing the MLP at consensus breakdown increases perplexity by 43.3%.
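
Since perplexity is the exponential of the mean token negative log-likelihood, a relative perplexity increase maps directly to an additive loss increase. A minimal sketch with a hypothetical baseline loss of 3.0 nats:

```python
import math

# Minimal sketch: perplexity = exp(mean token NLL), so the paper's "+43.3%
# perplexity" corresponds to adding log(1.433) ~= 0.36 nats of mean loss.
def perplexity(mean_nll: float) -> float:
    return math.exp(mean_nll)

base = perplexity(3.0)                     # hypothetical baseline mean loss
ablated = perplexity(3.0 + math.log(1.433))
print(f"relative increase: {(ablated / base - 1):.1%}")   # prints 43.3%
```

The baseline loss value is illustrative only; the 43.3% ratio is the quantity reported in the paper.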

Smooth Polynomial Approximation

A mathematical method used to approximate complex functions.

The study finds that smooth polynomial approximation fails in highly nonlinear layers, while the binary routing structure can explain this phenomenon.

Activation Pattern

The activation state of neurons in a neural network, used to determine the processing path of data.

By analyzing neuron activation patterns across different layers, the study reveals the binary routing characteristics of MLP layers.

Nonlinear Processing

A complex computational process used to handle data that requires higher computational power.

The consensus architecture effectively decides which tokens require nonlinear processing.

Data Manifold

The distribution form of data in high-dimensional space, used to describe the intrinsic structure of data.

The study proposes that the piecewise-affine characterization of deep networks can be complemented by routing characteristics, implementing binary decisions along the natural data manifold.

Open Questions (unanswered questions from this research)

  1. Although this paper reveals the binary routing characteristics of MLP layers, their applicability in more complex Transformer models still needs verification. Existing results are based on the GPT-2 small model; future studies should test larger-scale models.
  2. Experiments were conducted only on the WikiText-103 dataset and may not generalize to other types of data. Future research should verify this characteristic on other datasets.
  3. The specific implementation details of binary routing require further research. While the paper establishes its existence, how to exploit it in practical applications remains open.
  4. It is unclear whether the consensus architecture generalizes to other types of neural networks. Future research could explore this architecture in other models.
  5. Although the study explains why smooth polynomial approximation fails in highly nonlinear layers, how to improve approximation methods in light of this remains open.

Applications

Immediate Applications

Transformer Model Optimization

By understanding the binary routing characteristics of MLP layers, more efficient computational paths can be introduced in model design, improving inference speed and accuracy.

Natural Language Processing Applications

This discovery can be applied to tasks like machine translation and text generation, helping researchers better understand and utilize the internal structure of neural networks.

Deep Learning Model Improvement

Revealing the binary routing characteristics of MLP layers provides new ideas and methods for optimizing other deep learning models.

Long-term Vision

Intelligent System Design

Understanding binary routing characteristics can help design more intelligent systems, making them more efficient when handling complex tasks.

Computer Science Education

This research provides new content for computer science education, helping students better understand the working principles of neural networks.

Abstract

We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well-captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture -- seven "default-ON" neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive -- creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms the routing is functional: removing the MLP at consensus breakdown costs 43.3% perplexity, while at full consensus removing it costs only 10.1% -- exceeding a 4x difference. Comparing binary vs. continuous features for the routing decision confirms that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.

