Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

TL;DR

Combines Quality Diversity (QD) search with a supervised discriminative model (YAMNet), using multiple frequency-specialized CPPNs, DSP graphs, and MAP-Elites to discover diverse, high-quality synthetic sounds.

cs.SD Advanced 2026-06-09

Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani, Kyrre Glette


audio synthesis, quality diversity, innovation engine, deep learning, evolutionary algorithms

Key Findings

Methodology

This paper presents an audio generation system that integrates Quality Diversity (QD) algorithms with a supervised discriminative model. Its core components are multiple Compositional Pattern Producing Networks (CPPNs), each specialized for a frequency range, and Digital Signal Processing (DSP) graphs, optimized via the MAP-Elites algorithm to explore a high-dimensional behavioral space. A pre-trained YAMNet classifier evaluates each generated sound, and its confidence scores across 521 classes serve as behavioral descriptors. NEAT evolves both the CPPN and DSP networks by adding nodes and connections, so complexity grows gradually. The study also analyzes goal switching between musical and non-musical contexts to trace evolutionary pathways, and extends the behavior space to include different sound durations, which reveals specialization within temporal niches. Results indicate that the combined system produces a wide variety of sounds, which are presented through an online explorer and as rendered audio files.
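The search loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_classifier` is a hypothetical stand-in for YAMNet, the genome is a plain list of floats rather than a CPPN/DSP graph, and `NUM_CLASSES`, `mutate`, and `map_elites` are names chosen for this example. It assumes the Innovation Engine convention that each candidate competes in every class cell, with classifier confidence as fitness.

```python
import random

NUM_CLASSES = 8  # small stand-in for YAMNet's 521 classes (assumption)


def toy_classifier(genome):
    """Hypothetical stand-in for a pre-trained classifier such as YAMNet.

    Returns one confidence in [0, 1] per class. A real system would render
    the genome to audio and score the waveform instead.
    """
    mean = sum(genome) / len(genome)
    scores = []
    for cls in range(NUM_CLASSES):
        target = (cls + 0.5) / NUM_CLASSES
        # Confidence peaks when the genome mean sits on the class target.
        scores.append(max(0.0, 1.0 - 4.0 * abs(mean - target)))
    return scores


def mutate(genome, rng, sigma=0.1):
    """Gaussian perturbation clipped to [0, 1]; NEAT would also change structure."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in genome]


def map_elites(iterations=2000, genome_len=4, seed=0):
    rng = random.Random(seed)
    archive = {}  # class index -> (confidence, genome)

    def try_insert(genome):
        # The candidate competes in every class cell, so a sound bred for
        # one class can take over another cell.
        for cls, conf in enumerate(toy_classifier(genome)):
            if conf > 0.0 and (cls not in archive or conf > archive[cls][0]):
                archive[cls] = (conf, genome)

    try_insert([rng.random() for _ in range(genome_len)])
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))  # pick a random elite
        try_insert(mutate(parent, rng))
    return archive


archive = map_elites()
print(f"filled {len(archive)} of {NUM_CLASSES} cells")
```

Extending the cell key from a class index to a (class, duration) pair would be one way to mirror the duration-extended behavior space mentioned above.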

Key Results

Across 10 independent runs, the system achieved over 85% coverage of the behavior space, with an average QD-score improvement of 20%. The generated sounds were diverse, with YAMNet scores indicating broad class coverage. The multi-frequency CPPN + DSP configuration produced simpler networks while performing comparably to single-CPPN setups. Short-duration sounds (0.5 s) exhibited richer variations and higher novelty, especially when combined with goal switching, which facilitated exploration of less accessible pathways. The system generated sounds spanning multiple categories, both musical and non-musical, with some samples demonstrating artistic qualities.
In-depth analysis revealed that the integration of specialized CPPNs for different frequency bands simplified network structures while maintaining performance. The temporal extension to various durations uncovered niche specialization, with distinct sound features emerging at different time scales. The use of deep classifiers like YAMNet enabled automatic evaluation, guiding the evolution toward diverse and high-quality solutions. Subjective listening confirmed that sounds from the combined CPPN-DSP system had a pleasing aesthetic, comparable to classical synthesizer timbres, and surpassed the performance of simpler configurations.
The experimental results validate that the proposed approach effectively balances exploration and exploitation in high-dimensional sound spaces. The combination of MAP-Elites, specialized CPPNs, and deep classifiers offers a scalable framework for automated sound design. The ability to generate a broad spectrum of novel sounds across temporal and contextual dimensions demonstrates the system’s potential for creative applications in music, sound effects, and multimedia content. These findings mark a significant step toward autonomous, AI-driven audio creation, with promising implications for both research and industry.
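For readers unfamiliar with the two metrics quoted in these results, the sketch below shows how coverage and QD-score are commonly computed from a MAP-Elites archive. It assumes the standard definitions (coverage as the fraction of filled cells, QD-score as the sum of elite fitnesses); the summary does not state the exact formulas used in the paper, and the archive contents and cell labels here are invented for illustration.

```python
def coverage(archive, total_cells):
    """Fraction of behavior-space cells that hold an elite."""
    return len(archive) / total_cells


def qd_score(archive):
    """Sum of the fitness values of all elites in the archive."""
    return sum(archive.values())


# Hypothetical archive: (class label, duration in seconds) -> classifier confidence.
example_archive = {
    ("Speech", 0.5): 0.5,
    ("Violin", 0.5): 0.25,
    ("Noise", 2.0): 0.75,
}

cov = coverage(example_archive, total_cells=4)
score = qd_score(example_archive)
print(cov, score)  # 0.75 1.5
```

Because QD-score grows both when new cells are filled and when existing elites improve, it is usually reported alongside coverage rather than instead of it.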

Significance

This research advances the frontier of automated sound synthesis by integrating evolutionary algorithms, deep learning, and modular neural networks to explore vast sonic spaces efficiently. It addresses long-standing challenges in generating diverse, high-quality audio without manual intervention, opening new avenues for creative industries such as music production, game development, and virtual reality. The innovative use of specialized CPPNs for different frequency ranges and the incorporation of goal switching mechanisms provide deeper insights into the pathways of sonic evolution, potentially inspiring future research in AI-driven artistic creation. Moreover, the framework's adaptability suggests broad applicability across various domains requiring creative exploration and design automation, marking a transformative step toward intelligent, autonomous audio systems.

Technical Contribution

The paper's key technical contributions include the novel configuration of multi-frequency specialized CPPNs, which simplify network architecture while maintaining expressive power. The integration of MAP-Elites for high-dimensional behavioral exploration, combined with a pre-trained deep classifier (YAMNet) for automatic evaluation, constitutes a significant methodological innovation. The system's ability to perform goal switching between musical and non-musical contexts reveals new insights into evolutionary pathways and transfer mechanisms. Furthermore, the extension of behavior space to include temporal dimensions demonstrates a sophisticated understanding of sound dynamics, enabling niche specialization. These innovations collectively push the boundaries of AI-driven audio synthesis, offering a scalable, flexible framework for future research and applications.
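To make the multi-frequency idea concrete, here is a deliberately simplified sketch in which each frequency band gets its own small CPPN that shapes the amplitude of one oscillator. Everything in it is an assumption made for illustration: the fixed three-weight topology, the function names `make_cppn` and `render`, the band frequencies, and the way outputs are mixed. The paper's actual CPPN-to-DSP wiring is evolved and is not described in this summary.

```python
import math


def make_cppn(weights):
    """Tiny fixed-topology CPPN: maps normalized time t in [0, 1) to (-1, 1).

    Composes a sine node and a Gaussian node, then squashes with tanh.
    """
    w_sin, w_gauss, bias = weights

    def cppn(t):
        h_sin = math.sin(w_sin * t)
        h_gauss = math.exp(-((w_gauss * t) ** 2))
        return math.tanh(h_sin + h_gauss + bias)

    return cppn


def render(band_cppns, duration=0.5, sample_rate=16000):
    """Mix one sine oscillator per band, each modulated by its own CPPN.

    band_cppns is a list of (frequency_hz, cppn) pairs.
    """
    n = int(duration * sample_rate)
    samples = []
    for i in range(n):
        t = i / n                # normalized position fed to the CPPNs
        secs = i / sample_rate   # time in seconds for the oscillators
        mix = sum(
            cppn(t) * math.sin(2.0 * math.pi * freq * secs)
            for freq, cppn in band_cppns
        )
        samples.append(mix / len(band_cppns))  # average keeps output in [-1, 1]
    return samples


bands = [
    (110.0, make_cppn((3.0, 2.0, -0.5))),   # low band
    (880.0, make_cppn((9.0, 4.0, 0.0))),    # mid band
    (3520.0, make_cppn((20.0, 1.0, 0.3))),  # high band
]
samples = render(bands)
```

The 16 kHz sample rate matches the mono input rate YAMNet expects, and 0.5 s is the shortest duration mentioned in the results. Keeping each CPPN responsible for a single band is one plausible reason the individual networks can stay small.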

Novelty

This work is the first to combine specialized multi-frequency CPPNs with MAP-Elites in the context of sound generation, enabling simplified networks with high expressive capacity. The use of YAMNet as an automatic, classifier-based behavioral descriptor for guiding evolutionary search in the audio domain is novel, bridging deep learning with evolutionary algorithms. The analysis of goal switching and pathway exploration provides new theoretical insights into how evolutionary lineages traverse unlikely routes to reach high-quality solutions. The extension of behavior space to include different durations and the demonstration of temporal niche specialization further distinguish this work from prior studies, establishing a new paradigm for autonomous, diverse audio synthesis.
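Goal-switch analysis of the kind mentioned here can be approximated by walking an elite's ancestry and counting how often the occupied class changes from parent to child. The sketch below assumes that each individual records a single parent and the cell in which it became an elite; the identifiers, class labels, and helper names are hypothetical and not taken from the paper.

```python
def lineage(elite_id, parents):
    """Return the ancestry of elite_id, ordered from oldest ancestor to the elite."""
    path = [elite_id]
    while parents.get(path[-1]) is not None:
        path.append(parents[path[-1]])
    return path[::-1]


def count_goal_switches(elite_id, parents, cell_of):
    """Count parent-to-child steps where the lineage moved to a different cell."""
    cells = [cell_of[i] for i in lineage(elite_id, parents)]
    return sum(1 for a, b in zip(cells, cells[1:]) if a != b)


# Hypothetical lineage: a -> b -> c -> d
parents = {"a": None, "b": "a", "c": "b", "d": "c"}
cell_of = {"a": "Noise", "b": "Noise", "c": "Violin", "d": "Speech"}

switches = count_goal_switches("d", parents, cell_of)
print(lineage("d", parents), switches)  # ['a', 'b', 'c', 'd'] 2
```

In this toy lineage, the current "Speech" elite descends from a "Violin" elite, which in turn descends from "Noise" elites: a musical class serves as a stepping stone to a non-musical one.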

Limitations

The reliance on pre-trained classifiers like YAMNet constrains the exploration to known categories, limiting the discovery of truly novel sounds outside its training data. This dependency may bias the search process and restrict creative potential.
Computational cost remains high, especially for complex multi-frequency networks and long-duration sound synthesis, which may hinder real-time applications or large-scale deployment.
Evaluation primarily depends on classifier confidence scores, which do not fully capture subjective aesthetic qualities. Incorporating human-in-the-loop evaluation or multi-objective metrics could enhance the system's artistic relevance.
The current framework's scalability to more complex or high-fidelity sound synthesis remains to be tested, especially in real-world production environments. Further optimization and hardware acceleration are needed.
Future work should explore adaptive mechanisms for network complexity control, multi-modal feedback integration, and broader behavioral descriptors to overcome existing limitations.

Future Work

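Future work will focus on developing adaptive discriminative models that incorporate user feedback to enable personalized sound generation. The plan is to introduce multimodal information (such as visual and tactile signals) to enrich the behavioral descriptors and improve exploration. The structural design of the multi-frequency CPPNs will also be optimized, and dynamic adjustment mechanisms will be studied to improve efficiency. Multi-objective optimization strategies will be explored to balance the aesthetics, novelty, and practicality of the sounds. In addition, hardware acceleration and distributed computing will be considered to support deployment in practical applications. The ultimate goal is to build an intelligent audio platform that learns autonomously and innovates continuously, bringing transformative change to industries such as virtual reality, games, and music creation.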

AI Executive Summary

In the realm of digital music and sound design, the quest for creating diverse and innovative audio content remains a central challenge. Traditional methods rely heavily on manual parameter tuning and expert knowledge, which limits the scope and efficiency of exploring vast sonic spaces. While deep learning models like WaveNet and GANs have advanced the field, they often produce outputs constrained by training data and lack systematic diversity. Evolutionary algorithms, such as genetic algorithms and NEAT, have shown promise in automating sound parameter optimization, yet balancing quality and diversity continues to be difficult.

This paper introduces a novel system that combines Quality Diversity (QD) algorithms with deep discriminative models to automate and enhance sound exploration. The core architecture leverages multi-frequency specialized Compositional Pattern Producing Networks (CPPNs) coupled with Digital Signal Processing (DSP) graphs. These components are optimized via the MAP-Elites algorithm, which searches across a high-dimensional behavioral space defined by the confidence scores from a pre-trained YAMNet classifier. This classifier provides a rich set of descriptors, enabling the system to evaluate and guide the evolution toward diverse, high-quality sounds.

The methodology involves initializing networks randomly, then iteratively evolving them using NEAT, which adds complexity gradually. The multi-frequency design simplifies network structures while maintaining expressive power, allowing the system to generate sounds across different frequency bands. The process includes dynamic goal switching between musical and non-musical targets, revealing pathways of evolution and transfer mechanisms. Experimental results demonstrate that the system achieves over 85% coverage of the behavior space, producing a wide array of sounds that span multiple categories, durations, and contexts. The sounds are validated both objectively through classifier scores and subjectively through listening tests, revealing artistic and innovative qualities.
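The gradual growth in complexity comes from NEAT's structural mutations. Below is a minimal sketch of the classic add-node mutation, which splits an existing connection in two. It omits innovation numbers, speciation, and crossover, and the `Connection` dataclass and `add_node` helper are illustrative names rather than the API of any particular NEAT library.

```python
import random
from dataclasses import dataclass


@dataclass
class Connection:
    src: int
    dst: int
    weight: float
    enabled: bool = True


def add_node(connections, next_node_id, rng):
    """Split a random enabled connection by inserting a new node.

    The old connection is disabled. The new incoming link gets weight 1.0 and
    the new outgoing link inherits the old weight, so the network's behavior
    changes as little as possible at the moment of mutation.
    Returns the next unused node id.
    """
    old = rng.choice([c for c in connections if c.enabled])
    old.enabled = False
    connections.append(Connection(old.src, next_node_id, 1.0))
    connections.append(Connection(next_node_id, old.dst, old.weight))
    return next_node_id + 1


# Start from the simplest genome: one input (node 0) wired to one output (node 1).
genome = [Connection(0, 1, 0.7)]
next_id = add_node(genome, next_node_id=2, rng=random.Random(0))
for conn in genome:
    print(conn)
```

Starting from minimal networks and adding structure only through such mutations is what lets the search keep genomes small unless extra nodes actually help.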

The significance of this work lies in its ability to automate the discovery of novel sound textures, reducing reliance on manual design and expanding creative possibilities. It addresses key limitations in current audio synthesis methods by integrating evolutionary exploration with deep learning-based evaluation, paving the way for intelligent, autonomous sound design tools. The approach's flexibility and scalability suggest broad applications in virtual reality, game audio, and automated music composition, potentially transforming how sound is created and experienced.

Despite these advances, challenges remain, including computational costs, classifier dependency, and subjective evaluation metrics. Future research aims to develop adaptive models, incorporate human feedback, and optimize network architectures for real-time applications. Overall, this study marks a significant step toward AI-driven creative systems capable of autonomous sonic innovation, promising a new era of artistic and industrial exploration in audio technology.

Deep Dive

Abstract

This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for sound discovery, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and practical accessibility of sounds. We describe a system for generative sound synthesis combining Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and explore different configurations and the interplay between the chosen synthesis approach and the discriminative model. We examine the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, introducing a novel approach that uses multiple specialised CPPNs for different frequency ranges; this yields simpler networks while maintaining performance comparable to single-CPPN setups. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages traverse unlikely paths to current elites. Expanding the behaviour space of a previous study to include various sound durations, we uncover specialisation within temporal niches. Results indicate that CPPN and DSP graphs coupled with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier can generate a substantial variety of synthetic sounds, diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and as rendered sound files, and, in the context of music composition, an experimental application that showcases their creative potential across various durations and contexts.

cs.SD cs.NE