Stable Audio 3 active

Name: Stable Audio 3
Availability: InStock
Author: Stability AI

Stability AI, version unknown, added 2026-05-29, verified 2026-06-19

[Use when]

You need to generate high-quality audio and music from text prompts, or edit and inpaint existing audio recordings at scale.

Open homepage at stability.ai

Engines: Standalone
License: Custom
Pricing: Freemium
Last verified: 2026-06-19
Added: 2026-05-29

about

Stable Audio 3 is a generative audio platform built on diffusion transformers and the SAME (Semantic-Acoustic Music Encoder) autoencoder. It supports three core workflows: text-to-audio generation from natural language prompts, audio-to-audio editing with prompt-guided style transfer, and inpainting or continuation of specific regions within existing recordings.

The platform offers multiple model sizes. The Small models (433M parameters) run on CPU, with no GPU required, and generate lightweight music and sound effects up to 120 seconds long. The Medium model (1.4B parameters) runs on GPU and delivers higher-quality output up to 380 seconds long. On modern hardware, generating several seconds of audio takes milliseconds. The SAME autoencoder works with stereo 44.1 kHz audio and 256-dimensional latents, balancing reconstruction fidelity with generative tractability.
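The size limits above can be turned into a simple selection rule. The following is a minimal sketch, not part of any Stability AI SDK: the `ModelTier` class, the tier names, and `pick_tier` are hypothetical, and only the parameter counts, duration limits, and CPU/GPU requirements come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str          # illustrative label, not an official model identifier
    params: str        # parameter count as quoted in the description
    max_seconds: int   # longest clip the tier is described as generating
    needs_gpu: bool


# Limits copied from the description; everything else is an assumption.
SMALL = ModelTier("small", "433M", 120, needs_gpu=False)
MEDIUM = ModelTier("medium", "1.4B", 380, needs_gpu=True)


def pick_tier(duration_s: float, has_gpu: bool) -> ModelTier:
    """Prefer the higher-quality Medium tier when a GPU is available."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if has_gpu and duration_s <= MEDIUM.max_seconds:
        return MEDIUM
    if duration_s <= SMALL.max_seconds:
        return SMALL
    raise ValueError(
        f"no tier in this sketch covers {duration_s} s "
        f"({'with' if has_gpu else 'without'} a GPU)"
    )
```

For example, `pick_tier(30, has_gpu=False)` returns the Small tier, while a 300-second request without a GPU raises an error because it exceeds the Small tier's 120-second limit.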

Stable Audio 3 includes LoRA fine-tuning support for personalization, variable-length generation that avoids wasting compute on unused latents, and broad hardware compatibility, including CUDA, TensorRT, and Apple Silicon via CoreML. Note: the open-weight models are released under the Stability AI Community License (free for research and for commercial use below a revenue threshold), while the largest model is available via API only. The released models are open-weight, not OSI-approved open source.

Neural Acoustic Fields (NAF)

Standalone

Neural Acoustic Fields is a research implementation that models acoustic propagation in physical scenes as a continuous implicit function. By treating sound propagation as a linear time-invariant system, NAF learns to map any emitter-listener location pair to a neural impulse response that can be applied to arbitrary audio sources. The system enables continuous spatial audio rendering for listeners at any position in a scene, including novel locations not seen during training.

NAF learns magnitude-only representations (using random phase similar to Image2Reverb) and demonstrates how acoustic structure emerges as a byproduct of learning spatial sound propagation. The learned representations can also improve visual learning tasks with sparse views.

This is research code from a NeurIPS 2022 paper, providing training and evaluation pipelines for learning acoustic fields from 3D scene data. It includes baseline comparisons against codec-based interpolation methods (AAC-LC, Opus) and tools for analyzing spectral accuracy, T60 error, and learned feature representations.
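Because the scene is treated as a linear time-invariant system, applying an impulse response to a source signal amounts to a convolution. The sketch below illustrates only that step in plain Python; the function name and the toy impulse response are hypothetical, and it does not reproduce NAF's network, its magnitude-only prediction, or its random-phase reconstruction.

```python
def apply_impulse_response(dry, impulse_response):
    """Convolve a dry signal with an impulse response (direct form).

    Both arguments are sequences of float samples at the same sample rate.
    The output has len(dry) + len(impulse_response) - 1 samples.
    """
    if not dry or not impulse_response:
        return []
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for n, x in enumerate(dry):
        if x == 0.0:
            continue  # silent samples contribute nothing
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


# Toy example: a direct path followed by one echo at half amplitude.
dry_signal = [1.0, 0.0, 0.5]
toy_ir = [1.0, 0.5]
wet_signal = apply_impulse_response(dry_signal, toy_ir)
```

Direct convolution costs O(N·M) operations, so for impulse responses of realistic length an FFT-based convolution would normally be used instead.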

NoiseBandNet

Standalone

NoiseBandNet is a neural network architecture for synthesizing controllable sound effects using filterbanks. It provides multiple control schemes: automatic extraction using loudness and spectral centroid, loudness-only control for loudness transfer between sounds, and user-defined control parameters drawn directly on spectrograms. The system uses a DDSP-inspired approach with learned filter banks, allowing real-time parameter manipulation and amplitude randomization for variations.

The tool includes training workflows for custom sound effect datasets and inference notebooks demonstrating loudness transfer, amplitude randomization for stereo generation, and custom control curve synthesis. Users can train models on their own sound libraries and define control parameters through an interactive labeling interface that displays waveforms and spectrograms.

Implemented in PyTorch, NoiseBandNet outputs controllable synthesis parameters that can be manipulated post-training without retraining, making it suitable for adaptive sound design and procedural audio generation in interactive contexts.
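The two automatically extracted controls, loudness and spectral centroid, can be illustrated on a single audio frame. This is a minimal sketch under stated assumptions: it uses plain RMS level in dB as a stand-in for loudness and a naive DFT for the spectrum, and the function names are hypothetical. NoiseBandNet's own feature extraction may differ (for example, in weighting, windowing, or frame size).

```python
import cmath
import math


def rms_loudness_db(frame, eps=1e-12):
    """RMS level of one frame in dB (an assumed stand-in for loudness)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms + eps)


def spectral_centroid_hz(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame, via a naive DFT."""
    n = len(frame)
    weighted_sum = 0.0
    magnitude_sum = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        bin_value = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        magnitude = abs(bin_value)
        weighted_sum += (k * sample_rate / n) * magnitude
        magnitude_sum += magnitude
    return weighted_sum / magnitude_sum if magnitude_sum > 0 else 0.0


# A 1 kHz sine sampled at 8 kHz: 64 samples hold exactly 8 cycles.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(64)]
centroid = spectral_centroid_hz(frame, sample_rate)
level_db = rms_loudness_db(frame)
```

Computed frame by frame over a recording, these two values form control curves of the kind the description mentions. The naive DFT is quadratic in frame length and is used here only to keep the sketch dependency-free.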

Raveler

Wwise

Raveler is a Wwise plugin that runs RAVE (Realtime Audio Variational autoEncoder) models for real-time timbre transfer via neural audio synthesis in game audio contexts. The plugin provides direct integration of trained RAVE models into Wwise effect chains, enabling neural processing of game audio with adjustable latent space manipulation.

The plugin exposes controls for model performance parameters including latent noise injection, prior sampling, and dry/wet mixing. It offers direct manipulation of up to 8 latent dimensions with bias and scaling controls, all of which can be bound to RTPCs for dynamic runtime control. Buffer settings allow balancing between audio quality and latency based on project requirements.

Based on the RAVE VST project, Raveler brings research-grade neural audio synthesis techniques into production game audio workflows through Wwise's standard plugin architecture. Note: the core is released under CC BY-NC 4.0 (non-commercial), which restricts use in commercial products.
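The per-dimension bias and scaling controls and the dry/wet mix can be sketched as simple arithmetic on a latent vector and on audio samples. This is an illustration only, not the plugin's code: the function names are hypothetical, and applying scale before bias is an assumption that the description does not confirm.

```python
MAX_CONTROLLED_DIMS = 8  # the description mentions up to 8 controllable dimensions


def shape_latents(latent, scales, biases):
    """Apply per-dimension scale, then bias, to the leading latent dimensions.

    Dimensions beyond len(scales) pass through unchanged.
    The scale-then-bias order is an assumption of this sketch.
    """
    if len(scales) != len(biases):
        raise ValueError("scales and biases must have the same length")
    if len(scales) > MAX_CONTROLLED_DIMS:
        raise ValueError(f"at most {MAX_CONTROLLED_DIMS} dimensions are controllable")
    shaped = list(latent)
    for i in range(min(len(scales), len(shaped))):
        shaped[i] = shaped[i] * scales[i] + biases[i]
    return shaped


def dry_wet_mix(dry, wet, mix):
    """Linear crossfade: mix=0.0 is fully dry, mix=1.0 is fully wet."""
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be between 0 and 1")
    return [(1.0 - mix) * d + mix * w for d, w in zip(dry, wet)]


shaped = shape_latents([1.0, 2.0, 3.0], scales=[2.0, 0.5], biases=[0.1, 0.0])
mixed = dry_wet_mix([1.0, 1.0], [0.0, 2.0], mix=0.5)
```

In a game, values such as `scales`, `biases`, and `mix` would be the quantities driven by RTPCs at runtime, so that gameplay state can reshape the timbre continuously.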
