A Synthesizable RTL Implementation of Predictive Coding Networks
A synthesizable RTL architecture for predictive coding networks in which local prediction-error dynamics execute directly in hardware.
Key Findings
Methodology
The paper presents a digital architecture that implements discrete-time predictive coding updates directly in hardware. Each neural core maintains its activity, prediction error, and synaptic weights, communicating only with adjacent layers through hardwired connections. Supervised learning and inference are supported via a uniform per-neuron clamping interface that enforces boundary conditions while leaving the internal update schedule unchanged. The design is based on a sequential MAC datapath and a fixed finite-state schedule.
Key Results
- Result 1: In the teacher-student regression experiment, a three-layer network (2→4→3) rapidly reduced MSE from an initial 0.341207 to 0.004784, demonstrating the effectiveness of the incremental tick regime.
- Result 2: In the nonlinear regression experiment, a smaller network (2→2→1) reduced MSE from an initial 0.106512 to 0.004382, indicating stability under a limited tick budget.
- Result 3: In the architectural scaling experiment, networks of different sizes showed rapid initial descent followed by stable residual floors under the same tick schedule, supporting the design's scalability.
Significance
This research provides a new algorithmic substrate for physically embedded learning systems, particularly where embedded online adaptation demands local update structures. By executing predictive coding directly in hardware, the design reduces reliance on global coordination and centralized storage, benefiting energy efficiency and real-time learning. The architecture is a step toward future adaptive computing devices.
Technical Contribution
Technical contributions include: 1) a composable neural-core architecture implementing discrete-time predictive coding updates with a sequential MAC datapath; 2) a uniform per-neuron clamping interface supporting both supervised training and inference; 3) a direct correspondence between predictive coding computations and hardware FSM stages, giving verifiable consistency between the update equations and the hardware datapath.
Novelty
This study is the first to implement predictive coding learning dynamics directly in hardware; it does not propose new learning rules. Unlike existing spiking neural network hardware, it uses continuous-valued neural representations and a synchronous, deterministic RTL design, prioritizing a direct correspondence between the update equations and the hardware datapath.
Limitations
- Limitation 1: The sequential floating-point datapath increases tick latency as fan-in grows, potentially affecting real-time performance in large-scale networks.
- Limitation 2: Implementing nonlinear activations and their derivatives for synthesis requires careful numerical design to ensure precision and stability.
- Limitation 3: Convergence and stability of the discrete-time, finite-precision system have so far been characterized only empirically; theoretical guarantees remain open.
Future Work
Future work could include: 1) exploring the balance between parallelism and area/power to enhance scalability; 2) researching activation approximations suitable for synthesis; 3) conducting task-driven benchmarks to identify scenarios where local online inference is advantageous.
AI Executive Summary
In modern deep learning, backpropagation is a widely used training method, but its global error propagation and reliance on centralized storage make it challenging to implement distributed online learning in hardware. Predictive coding offers an alternative by enabling inference and learning through local prediction-error dynamics between layers.
This paper introduces a digital architecture capable of implementing discrete-time predictive coding updates directly in hardware. Each neural core maintains its activity, prediction error, and synaptic weights, communicating only with adjacent layers through hardwired connections. A uniform per-neuron clamping interface supports supervised learning and inference, enforcing boundary conditions while leaving the internal update schedule unchanged.
The design is based on a sequential MAC datapath and a fixed finite-state schedule, rather than executing task-specific instruction sequences within the learning substrate. The system evolves under fixed local update rules, with task structure imposed through connectivity, parameters, and boundary conditions.
Experimental results demonstrate the architecture's effectiveness in teacher-student regression and nonlinear regression tasks, validating its stability and efficiency under a limited tick budget. Additionally, architectural scaling experiments show the design's scalability, with networks of different sizes exhibiting rapid initial descent followed by stable residual floors under the same tick schedule.
This research provides a new algorithmic substrate for physically embedded learning systems, particularly in scenarios requiring local update structures for embedded online adaptation. By implementing predictive coding directly in hardware, it reduces reliance on global coordination and centralized storage, advancing energy efficiency and real-time learning capabilities.
However, the current design has limitations, including increased tick latency as fan-in grows and the need for careful numerical design of nonlinear activations and their derivatives. Future work will explore the balance between parallelism and area/power to enhance scalability and conduct task-driven benchmarks to identify scenarios where local online inference is advantageous.
Deep Analysis
Background
Modern machine learning systems are typically trained using backpropagation, which computes gradients by combining global loss information with a tightly coordinated forward/backward computation schedule. Although highly effective, this paradigm is challenging to realize as a fully distributed learning substrate in hardware. Backpropagation requires structured global error propagation, intermediate activation storage, and substantial data movement through memory and interconnect.
Predictive coding offers an alternative formulation where inference and learning arise from minimizing prediction errors across a hierarchy. In standard predictive coding networks (PCNs), each layer predicts the layer below; each unit updates its state and synaptic weights using only locally available quantities: its own activity, its own prediction error, presynaptic activity from the adjacent layer above, and prediction errors from the adjacent layer below. This locality makes predictive coding attractive as a candidate algorithmic substrate for physically embedded learning systems.
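For concreteness, a standard discrete-time PCN update of this form, in our notation (the paper's exact equations are not reproduced in this summary, so this is an assumed textbook formulation), is:

```latex
% W^l maps activities of layer l to a prediction of layer l-1;
% f is the activation, \gamma and \eta are state/weight step sizes.
\begin{aligned}
\varepsilon^{l-1}_t &= x^{l-1}_t - W^{l} f\!\bigl(x^{l}_t\bigr)
  && \text{prediction error}\\
x^{l}_{t+1} &= x^{l}_t + \gamma \Bigl(-\varepsilon^{l}_t
  + f'\!\bigl(x^{l}_t\bigr) \odot (W^{l})^{\top} \varepsilon^{l-1}_t \Bigr)
  && \text{state relaxation}\\
W^{l}_{t+1} &= W^{l}_t + \eta \, \varepsilon^{l-1}_t \, f\!\bigl(x^{l}_t\bigr)^{\top}
  && \text{local weight update}
\end{aligned}
```

Every term is locally available to a unit in layer l: its own activity and error, presynaptic activity from the layer above (inside ε^l), and prediction errors from the layer below (ε^{l−1}).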
This paper presents a digital micro-architecture that directly implements predictive coding equations at the level of individual neurons. Each unit executes a fixed finite-state schedule per tick. Communication is strictly between adjacent layers via hardwired connections. No shared parameter memory and no global learning-phase controller are required. The objective of this work is not to propose a new learning rule but to demonstrate a concrete mapping from predictive-coding-style local learning to a structured, synthesizable digital substrate.
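As a behavioral illustration, here is a minimal Python stand-in for one neural core stepping a fixed per-tick schedule. This is a sketch, not the paper's RTL: the stage names, the tanh activation, and the convention that a core owns its incoming synapses and receives weighted error feedback on hardwired return paths are all assumptions.

```python
import math

def f(x):   # activation; the RTL would use a synthesizable approximation
    return math.tanh(x)

def df(x):  # derivative of the activation
    return 1.0 - math.tanh(x) ** 2

class Core:
    """One neural core: local activity, local error, and its incoming synapses."""
    def __init__(self, fan_in):
        self.x = 0.0               # unit activity
        self.e = 0.0               # local prediction error
        self.w = [0.1] * fan_in    # synapses from the layer above
        self.clamped = False       # clamping interface: hold x as a boundary condition

    def tick(self, above, fb, gamma, eta):
        """One tick of the fixed schedule. `above` holds activities from the
        layer above; `fb` is the weighted error feedback from the layer below,
        delivered over hardwired connections."""
        # PREDICT: sequential MAC over the fan-in (one product per cycle in RTL).
        pred = 0.0
        for w, a in zip(self.w, above):
            pred += w * f(a)
        # ERROR: purely local subtraction.
        self.e = self.x - pred
        # STATE: relax activity; a clamped core skips this stage, which is how
        # boundary conditions are enforced without changing the schedule.
        if not self.clamped:
            self.x += gamma * (-self.e + df(self.x) * fb)
        # LEARN: local Hebbian-style update, own error times presynaptic activity;
        # eta = 0 disables learning without altering the schedule.
        self.w = [w + eta * self.e * f(a) for w, a in zip(self.w, above)]
```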
Core Problem
Backpropagation requires global coordination that is challenging to reconcile with distributed biological learning and certain classes of embedded hardware systems. First, standard gradient computation requires propagating error information backward through the entire network, creating a dependency structure that is not purely local. Second, training is typically organized into distinct phases (forward, backward, update) that demand synchronization and storage of intermediate activations. Third, backpropagation assumes differentiability of the computational graph, whereas biological systems involve discontinuous and stochastic signaling. While these issues do not prevent backpropagation from being implemented on conventional accelerators, they motivate research into alternative learning formulations that admit local update structures suitable for embedded online adaptation.
Innovation
The core innovations of this paper include:
- A composable neural-core architecture implementing discrete-time predictive coding updates with a sequential MAC datapath. Each neural core maintains its own activity, prediction error, and synaptic weights, and communicates only with adjacent layers through hardwired connections.
- A uniform per-neuron clamping interface for both supervised training and inference. Clamping enforces boundary conditions while leaving the internal update schedule unchanged, so the two modes differ only in which units are clamped.
- A direct correspondence between predictive coding computations and hardware FSM stages, giving verifiable consistency between the update equations and the hardware datapath. The design prioritizes this direct, verifiable correspondence over the energy-efficiency gains of event-driven spiking systems.
These innovations enable predictive coding to be implemented directly in hardware, reducing reliance on global coordination and centralized storage.
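Continuing the hypothetical Core sketch above, supervised training and inference reduce to the same fixed schedule under different boundary conditions; only the clamping pattern and the learning rate change. The driver below and its constants are illustrative, not the paper's testbench:

```python
def run_ticks(layers, n_ticks, gamma, eta):
    """Advance every Core (see sketch above) by n_ticks. layers[0] is the
    bottom (output) layer, layers[-1] the top (input) layer; this ordering
    is an assumed convention, not taken from the paper."""
    for _ in range(n_ticks):
        for li, layer in enumerate(layers):
            above = [c.x for c in layers[li + 1]] if li + 1 < len(layers) else []
            below = layers[li - 1] if li > 0 else []
            for j, core in enumerate(layer):
                fb = sum(k.w[j] * k.e for k in below)   # weighted error feedback
                core.tick(above, fb, gamma, eta)

def clamp(layer, values):
    """Clamping primitive: pin activities as boundary conditions."""
    for core, v in zip(layer, values):
        core.x, core.clamped = v, True

# Tiny example; layer sizes echo the paper's 2->2->1 regression network.
net = [[Core(2)],                 # output layer: 1 unit, fan-in 2
       [Core(2), Core(2)],        # hidden layer: 2 units, fan-in 2
       [Core(0), Core(0)]]        # input layer: 2 units, no fan-in
out_layer, in_layer = net[0], net[2]

# Supervised training: clamp both boundaries; the interior relaxes and learns.
clamp(in_layer, [0.5, -0.3])      # made-up input sample
clamp(out_layer, [0.2])           # made-up target
run_ticks(net, n_ticks=64, gamma=0.1, eta=0.01)

# Inference: release the output boundary and freeze the weights (eta = 0).
for core in out_layer:
    core.clamped = False
run_ticks(net, n_ticks=32, gamma=0.1, eta=0.0)
prediction = [core.x for core in out_layer]
```

The point of the sketch is that nothing inside run_ticks distinguishes training from inference; the distinction lives entirely in the boundary conditions, mirroring the paper's uniform clamping interface.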
Methodology
The proposed digital architecture implements discrete-time predictive coding updates directly in hardware, with the following methodology:
- Each neural core corresponds to one indexed unit. It maintains its own activity, prediction error, and synaptic weights as local state and parameters, and communicates only with adjacent layers via hardwired signals.
- The design is built around a sequential MAC datapath and a fixed finite-state schedule rather than executing task-specific instruction sequences within the learning substrate; the system evolves under fixed local update rules, with task structure imposed through connectivity, parameters, and boundary conditions (a toy latency sketch follows this list).
- A uniform per-neuron clamping interface supports supervised learning and inference: it enforces boundary conditions while leaving the internal update schedule unchanged, so both modes are driven purely through boundary conditions.
- Predictive coding computations map directly onto hardware FSM stages, giving verifiable consistency between the update equations and the hardware datapath; this direct, verifiable correspondence is prioritized over the energy-efficiency gains of event-driven spiking systems.
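Because the MAC is sequential, per-tick latency grows linearly with fan-in. A toy cycle-count model of this trade-off (all constants here are assumptions, not figures from the design):

```python
def tick_cycles(fan_in, fixed_stage_cycles=4):
    """Rough per-core tick latency for a sequential MAC datapath: one MAC
    cycle per fan-in element plus a fixed overhead for the remaining FSM
    stages. The linear term is the scaling limitation noted later."""
    return fan_in + fixed_stage_cycles

for n in (4, 64, 1024):
    print(f"fan-in {n:>4}: ~{tick_cycles(n)} cycles per tick")
```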
Experiments
The experimental design comprises teacher-student regression, nonlinear regression, and architectural scaling tasks, all run under a limited tick budget. Experiments are implemented as Verilator simulations of the reference RTL implementation, with learning and inference controlled entirely through clamping and learning-rate parameters; the internal neural-core schedule is never modified. A sketch of this protocol follows.
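In outline, each experiment follows the protocol sketched below. The real harness is a Verilator/C++ testbench driving the RTL; this Python sketch reuses the hypothetical run_ticks/clamp helpers from earlier, and the dataset, tick counts, and rates are placeholders:

```python
def mse(layer, target):
    """Mean squared error between settled output activities and the target."""
    return sum((c.x - t) ** 2 for c, t in zip(layer, target)) / len(target)

for step, (x_in, y_tgt) in enumerate(dataset):        # dataset: assumed iterable
    # Train: clamp input and target for a fixed tick budget with learning on.
    clamp(in_layer, x_in)
    clamp(out_layer, y_tgt)
    run_ticks(net, n_ticks=64, gamma=0.1, eta=0.01)
    # Evaluate: release the output boundary, freeze weights, let states settle.
    for core in out_layer:
        core.clamped = False
    run_ticks(net, n_ticks=32, gamma=0.1, eta=0.0)
    print(step, mse(out_layer, y_tgt))
```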
Results
In the teacher-student regression experiment, a three-layer network (2→4→3) reduced MSE from an initial 0.341207 to 0.004784, demonstrating the effectiveness of the incremental tick regime. In the nonlinear regression experiment, a smaller network (2→2→1) reduced MSE from an initial 0.106512 to 0.004382, indicating stability under the limited tick budget.
The architectural scaling experiments show that networks of different sizes exhibit rapid initial descent followed by stable residual floors under the same tick schedule, supporting the design's scalability.
Applications
The proposed architecture targets scenarios that require local update structures for embedded online adaptation. By implementing predictive coding directly in hardware, it reduces reliance on global coordination and centralized storage, benefiting energy efficiency and real-time learning.
Because networks of different sizes exhibit rapid initial descent followed by stable residual floors under the same tick schedule, the design also scales across network sizes, making it a candidate substrate for future adaptive computing devices.
Limitations & Outlook
The current design has limitations: tick latency grows with fan-in, and nonlinear activations and their derivatives require careful numerical design for synthesis. Convergence and stability of the discrete-time, finite-precision system have so far been characterized only empirically; theoretical analysis remains open.
Future work will explore the trade-off between parallelism and area/power to improve scalability, investigate activation approximations suitable for synthesis, and run task-driven benchmarks to identify scenarios where local online inference is advantageous.
Plain Language (accessible to non-experts)
Imagine a kitchen where each neural core is a chef with their own workstation, talking only to the chefs at the neighboring stations. Predictive coding is like each chef adjusting their recipe based on the dishes beside them to reduce mistakes: every chef has their own ingredients, spices, and tools, and focuses only on their own work without tracking the whole kitchen.
This removes the need for a head chef issuing central commands. Each chef completes their dish independently, quickly adjusting it based on feedback from neighboring stations, and the kitchen as a whole still runs smoothly.
A key benefit is that the routine does not change as the kitchen grows: each chef keeps focusing on their own station, so quality and consistency hold even as more stations are added.
There are limits, though. A chef with many neighbors needs more time each round to gather all their feedback (in the hardware, larger fan-in means a slower tick), and some delicate steps of the recipe (the nonlinear parts) are hard to execute precisely. Future work will look at letting chefs work more in parallel and at finding which meals this kind of kitchen cooks best.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a game where every character has their own mission but can only talk to the characters standing next to them. The goal is for each character to adjust their actions based on their neighbors' feedback so that mistakes shrink over time.
That's basically predictive coding: each character has their own skills, gear, and mission, and focuses only on their own job instead of tracking the entire game. Because everyone reacts quickly to nearby feedback, the whole game stays on track without a central commander.
The neat part is that this keeps working as the game gets bigger: each character runs the same simple routine, so the quality of play stays consistent no matter how many players join.
There are catches, though. A character with lots of neighbors needs more time each turn to hear from all of them (in the hardware, that's the fan-in latency problem), and some moves are tricky to perform exactly right. Future work is about speeding up those turns and figuring out which games this setup plays best.
Glossary
Predictive Coding
A method for inference and learning by minimizing prediction errors across a hierarchy.
In this paper, predictive coding is used to implement local learning updates in hardware.
RTL (Register Transfer Level)
An abstraction level for describing digital circuits in terms of registers and the data transfers between them, commonly used in hardware design and synthesis.
The proposed architecture implements predictive coding updates at the RTL level.
Neural Core
The basic unit in hardware implementing predictive coding updates, maintaining its activity, prediction error, and synaptic weights.
Each neural core communicates only with adjacent layers through hardwired connections.
Clamping Interface
An interface supporting supervised learning and inference by enforcing boundary conditions to control neuron states.
The clamping interface enforces boundary conditions while leaving the internal update schedule unchanged.
Finite State Machine (FSM)
A model for controlling system behavior through finite states and state transitions.
The design is based on a fixed finite-state schedule, ensuring verifiable consistency between the update equations and the hardware datapath.
Sequential MAC Datapath
A datapath that computes weighted sums with a single multiply-accumulate (MAC) unit applied sequentially, one fan-in element per cycle.
The design is based on a sequential MAC datapath rather than executing task-specific instruction sequences within the learning substrate.
Incremental Tick Regime
A regime in which states and weights are updated incrementally, tick by tick, allowing learning and inference to proceed under a limited tick budget.
Experimental results demonstrate the effectiveness of the incremental tick regime, validating the design's stability.
Verilator Simulation
An open-source tool that compiles Verilog/SystemVerilog RTL into fast C++ simulation models, widely used for verifying hardware designs.
Experiments are implemented as Verilator simulations using the reference RTL implementation.
Nonlinear Activation Function
A function that introduces nonlinearity into a neural network, increasing the model's expressive power.
Implementing nonlinear activations and their derivatives for synthesis requires careful numerical design.
Energy Efficiency
The amount of useful computation performed per unit of energy consumed.
Implementing predictive coding directly in hardware reduces reliance on global coordination and centralized storage, which improves energy efficiency and enables real-time learning.
Open Questions (unanswered questions from this research)
- Open Question 1: How can real-time performance be maintained in large-scale networks? Tick latency grows with fan-in, so the balance between parallelism and area/power must be explored to improve scalability.
- Open Question 2: How can nonlinear activations and their derivatives be implemented accurately for synthesis? They require careful numerical design to ensure precision and stability.
- Open Question 3: How can convergence and stability of the discrete-time, finite-precision system be guaranteed? So far they have been mapped only empirically; theoretical analysis remains open.
- Open Question 4: How far can the architecture's energy efficiency be pushed? Executing predictive coding directly in hardware already reduces global coordination and centralized storage, but systematic energy optimization remains to be studied.
- Open Question 5: Across which tasks is the architecture advantageous? Task-driven benchmarks are needed to identify scenarios where local online inference pays off.
Applications
Immediate Applications
Embedded Online Learning
The architecture is suitable for embedded online learning scenarios requiring local update structures, reducing reliance on global coordination and centralized storage by implementing predictive coding directly in hardware.
Real-time Signal Processing
In real-time signal processing, the architecture's efficient local updates suit applications that demand fast response and low latency.
Adaptive Control Systems
In adaptive control systems, the architecture can achieve real-time local learning and adjustment, improving system response speed and stability.
Long-term Vision
Smart IoT Devices
The architecture can be applied to smart IoT devices, enabling efficient local learning and adaptation on the device itself.
Next-generation Neuromorphic Computing
The architecture provides a new design approach for next-generation neuromorphic computing devices, improving energy efficiency and real-time learning capabilities by implementing predictive coding directly in hardware.
Abstract
Backpropagation has enabled modern deep learning but is difficult to realize as an online, fully distributed hardware learning system due to global error propagation, phase separation, and heavy reliance on centralized memory. Predictive coding offers an alternative in which inference and learning arise from local prediction-error dynamics between adjacent layers. This paper presents a digital architecture that implements a discrete-time predictive coding update directly in hardware. Each neural core maintains its own activity, prediction error, and synaptic weights, and communicates only with adjacent layers through hardwired connections. Supervised learning and inference are supported via a uniform per-neuron clamping primitive that enforces boundary conditions while leaving the internal update schedule unchanged. The design is a deterministic, synthesizable RTL substrate built around a sequential MAC datapath and a fixed finite-state schedule. Rather than executing a task-specific instruction sequence inside the learning substrate, the system evolves under fixed local update rules, with task structure imposed through connectivity, parameters, and boundary conditions. The contribution of this work is not a new learning rule, but a complete synthesizable digital substrate that executes predictive-coding learning dynamics directly in hardware.