SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling
SupraSNN employs a superscalar-inspired architecture with synapse-level parallelism, achieving 149μs latency and 0.025mJ/image on FPGA for MNIST, which is 47.6% lower latency and 5.6× better energy efficiency than prior FPGA-based SNN accelerators.
Key Findings
Methodology
This paper introduces a hardware-software co-design framework that treats synaptic events as micro-operations, inspired by superscalar processor architecture. The architecture, SupraSNN, physically decouples synaptic and neuronal computations, utilizing a multi-cast tree for efficient spike distribution and a bufferless merge tree for synchronized result accumulation. The mapping strategy considers memory constraints, employing heuristic scheduling to optimize execution order, maximizing throughput and resource utilization. FPGA prototyping validates the design on MNIST and SHD datasets, demonstrating significant improvements in latency and energy efficiency. The system's core innovation lies in enabling massive parallelism at the synapse level while maintaining deterministic neuron updates, addressing the irregular connectivity and sparsity challenges in modern SNNs.
Key Results
- On the MNIST dataset, SupraSNN achieves an inference latency of 149μs and consumes only 0.025mJ per image (0.276nJ per synapse), representing a 47.6% reduction in latency and a 5.6× increase in energy efficiency compared to previous FPGA-based SNN accelerators.
- For the Spiking Heidelberg Dataset (SHD), a recurrent SNN implementation reaches 1.41ms latency and 0.77mJ per sample, demonstrating the architecture's capability for temporal and recurrent tasks with high efficiency.
- The architecture supports unstructured sparsity and complex topologies, significantly improving hardware resource utilization and scalability, with performance maintained across different network configurations.
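The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this summary. The derived quantities (implied synapse count, implied baseline, average power) are back-of-the-envelope estimates, not reported results: they assume the rounded figures are exact and that the percentages are defined the way the comments state.

```python
# Reported figures, copied from the summary.
mnist_latency_s = 149e-6             # 149 us per image
mnist_energy_j = 0.025e-3            # 0.025 mJ per image
energy_per_synapse_j = 0.276e-9      # 0.276 nJ per synapse
shd_latency_s = 1.41e-3              # 1.41 ms per sample
shd_energy_j = 0.77e-3               # 0.77 mJ per sample

# Assumption: energy per synapse = energy per image / number of synapses.
implied_synapses = mnist_energy_j / energy_per_synapse_j

# Assumption: "47.6% lower latency" means ours = baseline * (1 - 0.476).
implied_baseline_latency_s = mnist_latency_s / (1 - 0.476)

# Assumption: "5.6x better energy efficiency" means baseline = 5.6 * ours.
implied_baseline_energy_j = 5.6 * mnist_energy_j

# Average power during one inference is energy divided by time.
mnist_avg_power_w = mnist_energy_j / mnist_latency_s
shd_avg_power_w = shd_energy_j / shd_latency_s

print(f"implied synapses:         {implied_synapses:,.0f}")
print(f"implied baseline latency: {implied_baseline_latency_s * 1e6:.1f} us")
print(f"implied baseline energy:  {implied_baseline_energy_j * 1e3:.3f} mJ")
print(f"MNIST average power:      {mnist_avg_power_w:.3f} W")
print(f"SHD average power:        {shd_avg_power_w:.3f} W")
```

Under these assumptions the figures are mutually consistent: they imply a network of roughly 9 × 10^4 synapses, a prior-work baseline near 284 μs and 0.14 mJ per image, and an average power of about 0.17 W on MNIST and 0.55 W on SHD.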
Significance
This work advances the field of neuromorphic computing by overcoming the bottleneck of limited parallelism in traditional SNN hardware. By introducing a synapse-centric, superscalar-inspired design, it unlocks the potential for large-scale, low-power, high-throughput neural processing. The proposed mapping and scheduling framework effectively handles irregular, sparse, and complex network topologies, which are critical for real-world applications such as vision, temporal processing, and robotics. The demonstrated FPGA results suggest that this architecture can serve as a foundation for future ASIC implementations, paving the way for scalable neuromorphic systems capable of supporting complex AI workloads with unprecedented energy efficiency and speed.
Technical Contribution
The paper's primary contribution is the novel integration of superscalar microarchitecture principles into neuromorphic hardware, enabling high degrees of parallelism at the synapse level. The multi-cast tree efficiently distributes spike events, while the bufferless merge tree ensures deterministic, synchronized accumulation of partial results, eliminating bottlenecks associated with queues and locks. The hardware-software co-design framework intelligently partitions workloads, balancing resource constraints and irregular connectivity. FPGA implementation validates the approach, showing significant improvements over existing accelerators in latency and energy metrics. This work establishes a new paradigm for scalable, flexible, and energy-efficient neuromorphic hardware design, opening avenues for future research in hardware-aware neural network modeling.
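The idea of accumulating without queues or locks can be illustrated with a plain binary adder tree. This is a minimal software sketch, not the paper's hardware: it assumes that in a given cycle every SPU presents exactly one partial value for the same target neuron (zero when it has nothing to contribute), and the function name `merge_tree_reduce` is hypothetical.

```python
def merge_tree_reduce(values):
    """Sum one value per SPU with a binary adder tree.

    Returns (total, stages), where stages is the number of adder levels
    traversed. The pairing order is fixed, so the result is deterministic
    and no buffering or arbitration between SPUs is needed.
    """
    level = list(values)
    if not level:
        return 0, 0
    stages = 0
    while len(level) > 1:
        # Add neighbouring pairs; an unpaired last value passes through.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages


total, stages = merge_tree_reduce([3, 0, -2, 5, 1])
print(total, stages)
```

The depth grows with the logarithm of the number of SPUs, which is why a tree of this kind can keep pace with many parallel units while every contribution arrives at the neuron update in a fixed order.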
Novelty
This research is presented as the first to incorporate superscalar microarchitecture concepts into SNN hardware, specifically targeting synapse-level parallelism. Its use of multi-cast and merge trees to handle irregular, sparse connectivity and complex topologies distinguishes it from prior works that focus on either coarse-grained parallelism or fully parallel neuron mapping. The architecture heuristically maps and schedules sparse networks while maintaining deterministic neuron updates, which positions it as a reference point for scalable neuromorphic accelerators.
Limitations
- The current implementation is FPGA-based, and while it demonstrates feasibility, ASIC adaptation is necessary for mass deployment and further energy savings. Hardware resource constraints may limit scalability for extremely large networks.
- Handling highly dynamic or evolving network topologies, such as online learning scenarios, requires additional mechanisms for adaptive mapping and scheduling, which are not addressed in this work.
- The architecture's complexity may pose challenges for real-time control and debugging, especially in highly irregular or deep recurrent networks. Further research is needed to optimize design automation and robustness.
Future Work
Future directions include developing ASIC prototypes to reduce costs and improve energy efficiency, extending the architecture to support online learning and adaptive topologies, and exploring dynamic scheduling algorithms that can respond to changing network conditions. Additionally, integrating this hardware with high-level neural network frameworks and developing automated mapping tools will facilitate broader adoption. Scaling the design to support larger, multi-task systems and multi-modal data streams is also a promising avenue, aiming to realize fully autonomous, energy-efficient neuromorphic systems for real-world AI applications.
AI Executive Summary
Neuromorphic computing has emerged as a promising paradigm for achieving brain-like efficiency in artificial intelligence systems. Spiking Neural Networks (SNNs), inspired by biological neural processes, are particularly attractive due to their event-driven, sparse computation. However, hardware implementations have struggled to fully exploit the inherent parallelism of SNNs, especially at the synapse level, due to structural limitations and complex connectivity. Traditional accelerators often couple synaptic and neuronal computations tightly, leading to bottlenecks that hinder scalability and energy efficiency.
This paper introduces SupraSNN, an architecture inspired by superscalar processor design that changes how SNN workloads are executed on hardware. By treating synaptic events as micro-operations that can be dispatched independently to multiple parallel processing units, SupraSNN achieves a high degree of synapse-level parallelism. The core components include a multi-cast tree for efficient spike distribution, a bufferless merge tree for synchronized result accumulation, and a centralized neuron unit for deterministic state updates. This decoupled design allows the system to process many synaptic events in parallel, boosting throughput while maintaining the precise, deterministic neuron dynamics essential for accurate neural computation.
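The separation between synaptic and neuronal work can be sketched as a small software model. Everything here is illustrative: the class `SPU`, the functions `neuron_unit` and `step`, the leaky integrate-and-fire update, and the leak and threshold values are assumptions, since the summary does not state the neuron model or the SPU interface.

```python
from dataclasses import dataclass, field


@dataclass
class SPU:
    """One Synapse Processing Unit holding a slice of the synapses."""
    # synapses[pre] -> list of (post, weight) stored in this unit's memory
    synapses: dict = field(default_factory=dict)

    def process(self, spikes, n_post):
        """Accumulate weights for the incoming spikes; no neuron state here."""
        partial = [0.0] * n_post
        for pre in spikes:
            for post, weight in self.synapses.get(pre, ()):
                partial[post] += weight
        return partial


def neuron_unit(v, current, leak=0.5, threshold=1.0):
    """Centralized neuron update (assumed LIF): returns (new_v, output_spikes)."""
    new_v, out = [], []
    for i, (vi, ci) in enumerate(zip(v, current)):
        vi = leak * vi + ci
        if vi >= threshold:
            out.append(i)
            vi = 0.0  # reset after firing
        new_v.append(vi)
    return new_v, out


def step(spus, spikes, v):
    """Dispatch spikes to all SPUs, merge their partial sums, update neurons."""
    partials = [spu.process(spikes, len(v)) for spu in spus]
    current = [sum(column) for column in zip(*partials)]
    return neuron_unit(v, current)


spus = [
    SPU({0: [(0, 0.75)], 1: [(1, 0.25)]}),
    SPU({0: [(1, 0.5)], 1: [(0, 0.5)]}),
]
v, out = step(spus, spikes=[0, 1], v=[0.0, 0.0])
print(v, out)
```

In this model the SPUs only add weights, so they can be replicated cheaply, while the single `neuron_unit` is the only place that holds membrane state.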
The architecture is complemented by a sophisticated mapping and scheduling framework that considers memory constraints and irregular network topologies. Heuristic algorithms optimize the assignment of synapses to processing units and determine execution order to maximize resource utilization and minimize latency. FPGA prototypes demonstrate the effectiveness of this approach, achieving 149μs inference latency and 0.025mJ energy per image on MNIST, outperforming prior FPGA accelerators by a large margin. Similar results on the Spiking Heidelberg Dataset validate the architecture's versatility for temporal and recurrent tasks.
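A memory-constrained mapping heuristic of this general kind can be sketched as follows. This is not the paper's algorithm, which the summary does not describe. It is a generic greedy "largest group first, least-loaded unit next" heuristic, and the grouping of synapses by presynaptic neuron, the uniform `capacity`, and the function name `map_synapses` are all assumptions made for illustration.

```python
import heapq


def map_synapses(fanout, n_spus, capacity):
    """Greedily assign synapse groups to SPUs.

    fanout:   dict mapping a presynaptic neuron id to its synapse count
    n_spus:   number of Synapse Processing Units
    capacity: maximum synapses one SPU memory can hold (assumed uniform)
    Returns (assignment, loads) where assignment[pre] is an SPU index.
    """
    heap = [(0, s) for s in range(n_spus)]  # (current load, SPU index)
    heapq.heapify(heap)
    assignment = {}
    loads = [0] * n_spus
    # Place the largest groups first, breaking ties by neuron id.
    for pre, size in sorted(fanout.items(), key=lambda kv: (-kv[1], kv[0])):
        load, s = heapq.heappop(heap)  # least-loaded SPU
        if load + size > capacity:
            # The least-loaded unit has the most free space, so nothing fits.
            raise ValueError(f"group {pre} ({size} synapses) does not fit")
        assignment[pre] = s
        loads[s] = load + size
        heapq.heappush(heap, (loads[s], s))
    return assignment, loads


assignment, loads = map_synapses({0: 5, 1: 3, 2: 3, 3: 2, 4: 1}, n_spus=2, capacity=8)
print(assignment, loads)
```

If every synapse costs one cycle, the largest entry in `loads` bounds the time for the slowest SPU, so balancing the loads is a direct proxy for latency in this simplified model.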
The significance of this work lies in its ability to unlock the full potential of SNNs for real-world applications. By enabling high parallelism at the synapse level, it addresses the long-standing bottleneck of irregular, sparse connectivity, paving the way for scalable, energy-efficient neuromorphic systems. The proposed design not only advances academic understanding but also offers practical pathways toward deploying large-scale, low-power neural hardware in industry. Future work will focus on ASIC implementation, adaptive scheduling, and expanding support for online learning, aiming to realize autonomous, intelligent systems capable of complex perception and decision-making with minimal energy consumption.
Deep Analysis
Background
The evolution of neuromorphic hardware reflects an ongoing quest to emulate biological neural efficiency. Large-scale neuromorphic chips such as TrueNorth and Loihi prioritized low power and scalability but faced challenges in programmability and precision. Other digital platforms such as SpiNNaker and ODIN improved flexibility and integration with conventional electronics, yet struggled to support the complex, irregular, and sparse connectivity typical of advanced SNNs. Recent research emphasizes the importance of high parallelism at the synapse level, supporting unstructured sparsity and complex topologies to enhance scalability and efficiency. Existing approaches often rely on coarse-grained parallelism, which limits throughput, or on fully parallel neuron mappings, which limit resource utilization. The need for a flexible, scalable architecture that can handle the irregular, sparse, and large-scale nature of modern SNNs remains unmet. This paper addresses this gap by proposing a synapse-centric, superscalar-inspired architecture that combines hardware innovations with intelligent mapping strategies.
Core Problem
Despite the promise of SNNs, hardware implementations face fundamental bottlenecks. Traditional accelerators couple synaptic and neuronal computations, limiting parallelism and throughput. The high frequency of simple synaptic accumulations contrasts sharply with the lower rate of complex neuron updates, leading to resource underutilization. Irregular connectivity and unstructured sparsity further complicate efficient hardware mapping, causing load imbalance and inefficient resource usage. Synchronizing multiple synaptic updates targeting the same neuron introduces bottlenecks, especially when using queues or atomic operations. These issues collectively hinder the scalability and energy efficiency of existing solutions, preventing SNNs from reaching their full potential in real-world applications.
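The cost of load imbalance can be made concrete with a simple utilization metric. The sketch below is illustrative only: the load vectors are invented, and the model assumes all processing units must wait for the slowest one before the next step begins.

```python
def utilization(loads):
    """Fraction of unit-cycles doing useful work when all units wait for the slowest."""
    if not loads or max(loads) == 0:
        return 0.0
    return sum(loads) / (len(loads) * max(loads))


# Hypothetical per-unit synapse counts for the same 18 synaptic events.
irregular = [12, 3, 2, 1]   # irregular connectivity mapped naively
balanced = [5, 5, 4, 4]     # the same work spread evenly

print(utilization(irregular), utilization(balanced))
```

With the same total work, the naive mapping keeps the units busy 37.5% of the time and takes 12 cycles, while the balanced mapping reaches 90% and finishes in 5.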
Innovation
The key innovations include: 1) adopting a superscalar-inspired design that treats each synaptic event as an independent micro-operation, enabling massive parallelism; 2) designing a multi-cast tree for efficient, low-energy spike distribution that avoids the overhead of global broadcasting; 3) creating a bufferless merge tree that synchronously consolidates partial sums, eliminating bottlenecks caused by queues and locks; 4) developing a hardware-software co-design framework that intelligently partitions sparse, irregular workloads across multiple processing units; 5) validating the architecture on FPGA with datasets like MNIST and SHD, demonstrating significant improvements in latency and energy efficiency. These innovations collectively address the core limitations of prior approaches, enabling scalable, flexible, and energy-efficient neuromorphic hardware.
Methodology
- Design a superscalar-inspired architecture where incoming spikes are treated as micro-operations dispatched to multiple parallel Synapse Processing Units (SPUs);
- Implement a multi-cast tree (MC Tree) that encodes spike distribution using an O(N) bitstream, allowing selective, energy-efficient multicasting to relevant SPUs;
- Develop a bufferless merge (ME) tree that synchronously and deterministically sums partial results from all SPUs, avoiding traditional queue and lock bottlenecks;
- Formulate a hardware-software co-design framework that partitions the network workload based on memory constraints and irregular connectivity, assigning synapses to SPUs heuristically;
- Map the network onto FPGA, optimize scheduling to maximize throughput, and evaluate performance on datasets like MNIST and SHD, adjusting parameters for different sparsity levels.
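One way an O(N) bitstream could drive selective multicast is sketched below. The summary does not specify the encoding, so this is an assumed scheme: a complete binary tree over N leaf SPUs (N a power of two), stored heap-style, with one enable bit per tree edge for a total of 2N - 2 bits. The function names `encode_multicast` and `route` are hypothetical.

```python
def encode_multicast(dests, n_spus):
    """Build the enable bits for nodes 2..2N-1 of a heap-indexed binary tree.

    Node 1 is the root, nodes N..2N-1 are the leaves (SPU 0..N-1).
    A bit is 1 when the subtree below that node contains a destination.
    """
    assert n_spus >= 2 and n_spus & (n_spus - 1) == 0, "N must be a power of two"
    active = [False] * (2 * n_spus)
    for d in dests:
        active[n_spus + d] = True
    for k in range(n_spus - 1, 0, -1):
        active[k] = active[2 * k] or active[2 * k + 1]
    return [int(active[k]) for k in range(2, 2 * n_spus)]


def route(bits, n_spus):
    """Walk from the root along enabled edges; return (reached SPUs, hops)."""
    reached, hops, stack = [], 0, [1]
    while stack:
        k = stack.pop()
        if k >= n_spus:            # leaf: deliver the spike to this SPU
            reached.append(k - n_spus)
            continue
        for child in (2 * k, 2 * k + 1):
            if bits[child - 2]:    # bit index is node index minus 2
                hops += 1
                stack.append(child)
    return sorted(reached), hops


bits = encode_multicast({1, 6, 7}, n_spus=8)
print(bits, route(bits, 8))
```

In this example only 7 of the 14 tree edges toggle, compared with all 14 for a global broadcast, which is the kind of saving selective multicast is meant to provide.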
Experiments
The experimental setup involves FPGA prototyping of the SupraSNN architecture, with datasets including MNIST for static image classification and the Spiking Heidelberg Dataset (SHD) for temporal sequence tasks. Baselines include traditional serial SNN accelerators and existing FPGA solutions. Metrics focus on inference latency, energy consumption per inference, resource utilization, and throughput. Hyperparameters such as network sparsity, synaptic weight distribution, and scheduling heuristics are tuned to assess robustness. The experiments analyze the impact of network topology, sparsity, and scheduling strategies on performance. Results show that SupraSNN achieves 149μs latency and 0.025mJ/image on MNIST, outperforming baseline methods significantly. On SHD, the architecture supports recurrent networks with 1.41ms latency and 0.77mJ/sample, demonstrating versatility across tasks.
Results
The FPGA implementation demonstrates a 47.6% reduction in inference latency and a 5.6× improvement in energy efficiency over prior FPGA SNN accelerators on MNIST. The architecture's ability to handle unstructured sparsity leads to resource savings and scalability. On the SHD dataset, the recurrent network achieves 1.41ms latency and 0.77mJ per sample, validating the architecture's effectiveness for temporal tasks. Ablation studies confirm that the multi-cast and merge trees are critical for performance gains, with scheduling heuristics further enhancing throughput. The results highlight the architecture's capacity to adapt to different network topologies and sparsity levels while maintaining high efficiency.
Applications
This architecture is suitable for deployment in edge devices requiring low-latency, energy-efficient neural processing, such as autonomous robots, wearable sensors, and vision systems. Its support for irregular, sparse, and recurrent networks makes it ideal for complex perception tasks, temporal sequence analysis, and adaptive control systems. The hardware's programmability and mapping flexibility enable integration with existing neural network frameworks, facilitating rapid deployment. Future applications could include large-scale neuromorphic chips for AI inference, brain-inspired computing platforms, and real-time data analytics in IoT environments. The architecture's scalability and energy efficiency open pathways for widespread adoption in industry and research.
Limitations & Outlook
The current FPGA-based prototype, while demonstrating feasibility, faces challenges in scaling to ASIC implementations, where resource constraints and fabrication costs must be addressed. Handling extremely large or highly dynamic networks may require more adaptive mapping and scheduling algorithms. The architecture's complexity could pose difficulties in real-time control and debugging, especially for very deep or densely recurrent networks. Additionally, online learning and plasticity mechanisms are not integrated, limiting immediate applicability to adaptive systems. Future work should focus on optimizing hardware design for ASIC, developing automated mapping tools, and extending the architecture to support on-chip learning and adaptation.
Abstract
Spiking Neural Networks (SNNs) offer a brain-inspired path toward highly efficient computation, but their practical deployment is constrained by the challenge of managing and executing their massive parallelism on physical hardware. This problem mirrors the historical challenge in processor design of moving beyond serial execution, a barrier broken by superscalar architectures that dispatch multiple instructions to parallel functional units. Drawing inspiration from this paradigm, we introduce a hardware-software co-design framework that treats synaptic events as parallelizable micro-operations. We present SupraSNN, a superscalar-inspired architecture that achieves high synapse-level parallelism by physically decoupling synaptic and neuronal computations. Within this architecture, a Multi-Cast Tree routes spike data to multiple parallel Synapse Processing Units that serve as the computational pipelines, while a Merge Tree consolidates distributed results for processing by a unified Neuron Unit, deliberately centralizing complex neuron state dynamics to mitigate hardware overhead and resource duplication. The efficacy of this architecture is enabled by a sophisticated partitioning and scheduling framework that first maps the SNN onto hardware while respecting memory constraints and then uses heuristic scheduling to determine the synaptic execution order, maximizing throughput and resource utilization. Implementing a feedforward SNN trained on MNIST (93.44% accuracy), SupraSNN achieves 149 μs inference latency and 0.025 mJ per image (0.276 nJ per synapse) on the Xilinx Zynq XC7Z020 FPGA, delivering 47.6% lower latency and 5.6× better energy efficiency than prior FPGA-based SNN accelerators. Beyond vision tasks, a recurrent SNN on the Spiking Heidelberg Dataset (71.82% accuracy) achieves 1.41 ms latency and 0.77 mJ per sample on the XC7Z030.