TorchDrum

An audio-driven 808 drum synthesiser built with PyTorch and JUCE

Jordie Shier¹, Charalampos Saitis¹, Andrew Robertson², and Andrew McPherson³
¹Centre for Digital Music, Queen Mary University of London
²Ableton AG
³Dyson School of Design Engineering, Imperial College London
TorchDrum user interface designed by Lewis Wolstanholme and Francis Devine of Julia Set.

Audio Plugin Code | Training Code | Audio Plugin | Google Colab | Paper | Presentation

What is TorchDrum?

TorchDrum is an audio plugin that transforms an input audio signal into a synthesised 808 drum in real time. It is a synthesiser driven by audio rather than MIDI: the rhythm, loudness, and timbral variations of a percussive input signal are mapped to synthesiser controls, allowing for dynamic, audio-based control. The figure below shows the main signal flow implemented in the audio plugin, which was built in C++ using JUCE.


The main signal processing and mapping components are onset detection, onset feature extraction, a parameter mapping neural network, and the 808 synthesiser.

We refer to the process of automatically updating synthesiser parameters to reflect the timbral variations in the input signal as timbre remapping.
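As a rough illustration of how these components fit together, here is a minimal Python sketch of the per-block control flow. The real plugin implements this in C++ with JUCE, and all of the object names here are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the plugin's per-block control flow; the real
# implementation is C++/JUCE and these component names are illustrative.
def process_block(block: np.ndarray, onset_detector, feature_extractor,
                  mapping_network, synth) -> np.ndarray:
    """Analyse one block of input audio and render the 808 output."""
    if onset_detector.detect(block):
        # Extract loudness/timbre features from a short window at the onset...
        features = feature_extractor(block)
        # ...map them to a modulation of the preset, and retrigger the drum.
        synth.set_parameters(synth.preset + mapping_network(features))
        synth.trigger()
    return synth.render(num_samples=len(block))
```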


Demo

Drummer: Carson Gant (@oneupdrumvids)



Timbre Remapping

Timbre remapping involves the analysis and transfer of timbre from an audio input onto controls for a synthesiser. To achieve timbre remapping, we consider musical phrases where timbre changes from note to note. We look at the differences in timbre (and loudness) between notes and try to recreate those differences on our synthesiser. The process is similar to transposing a melody into a different key, but instead of pitch, we are dealing with melodies of timbre and loudness, and are transposing them from an input instrument onto a synthesiser.

Let’s make this a bit more concrete. Here is an example of a short snare drum phrase with plots of the timbre and loudness for each hit. The arrows show the differences between each note and the first note in the phrase.

Here’s the same snare drum phrase (without the rhythm) followed by all the differences in timbre and loudness mapped onto a synthesiser. What we end up with is a new synthesiser sound (a modulated preset) for each hit in the input phrase.
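To make the idea of a "melody of timbre and loudness" concrete in code, here is a small sketch that expresses each hit as a feature difference relative to the first hit in the phrase. The two features used here (RMS loudness and spectral centroid) are illustrative stand-ins rather than the exact feature set used in TorchDrum.

```python
import numpy as np

def rms_db(x: np.ndarray) -> float:
    """RMS loudness of a hit in decibels."""
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-8)

def spectral_centroid(x: np.ndarray, sr: float = 44100.0) -> float:
    """A simple brightness measure: the magnitude-weighted mean frequency."""
    mags = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    return float(np.sum(freqs * mags) / (np.sum(mags) + 1e-8))

def feature_differences(hits: list[np.ndarray]) -> np.ndarray:
    """Each row holds (Δ loudness, Δ centroid) of a hit relative to the first."""
    feats = np.array([[rms_db(h), spectral_centroid(h)] for h in hits])
    return feats - feats[0]
```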



Parameter Mapping Neural Network

The neural network enables real-time timbre remapping, allowing input signals to be transformed with low latency. The network we use is relatively small, but it contains non-linearities, allowing it to learn complex relationships between input features and synthesiser parameters.

The neural network is trained to generate synthesiser parameters that match the loudness and timbral changes over the full length of a drum hit (~1 second). During real-time operation we can’t wait a full second to see how the input sound evolves, so we predict parameters from short segments of audio at detected onsets.
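For reference, here is a minimal PyTorch sketch of the kind of small non-linear network described above; the layer sizes and the numbers of onset features and synthesiser parameters are illustrative, not the values used in the released models.

```python
import torch
import torch.nn as nn

class ParameterMappingNetwork(nn.Module):
    """Small MLP mapping onset features to synthesiser parameter modulations."""

    def __init__(self, num_features: int = 4, num_params: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, num_params),
        )

    def forward(self, onset_features: torch.Tensor) -> torch.Tensor:
        # The output is a modulation that gets added to a static preset.
        return self.net(onset_features)
```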

We also employ a recent method called differentiable digital signal processing (DDSP), which enables DSP algorithms to be integrated directly into the gradient-descent-based optimisation used to train neural networks. Practically speaking, this allows us to compute training error on the audio output of our synthesiser using loudness and timbral features. We don’t need to know the parameters for our timbre remapping ahead of time; we only need the input audio, which is used during self-supervised training.

This figure outlines the training process:


The main takeaway from the figure is that we’re training the neural network to match differences in audio features between two different hits. \(y = f(x_a) - f(x_b)\) is the difference between two sounds in the input, and \(\hat{y} = f(x_c) - f(x_d)\) is the difference between a synthesiser preset and a modulated version of that preset. The goal of the neural network is to create a parameter modulation \(\theta_{mod}\), which is added to a static preset \(\theta_{pre}\) so that \(\hat{y} = y\). Furthermore, the neural network makes this prediction from onset features \(f_o(\cdot)\), which allows parameters to be predicted in real time.
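As a rough sketch of how this objective might look in PyTorch, the step below treats the differentiable synthesiser, the feature extractor \(f(\cdot)\), and the onset feature extractor \(f_o(\cdot)\) as callables; all three are stand-ins for the DDSP components in the training code, and feeding the network the difference of onset features is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def training_step(model, synth, f, f_o, x_a, x_b, theta_pre, optimizer):
    """One self-supervised step matching feature differences (a sketch)."""
    # Target: feature difference between two hits in the input audio.
    y = f(x_a) - f(x_b)

    # Prediction: modulate the preset from onset features and re-measure
    # the feature difference on the synthesiser's own audio output.
    theta_mod = model(f_o(x_a) - f_o(x_b))
    y_hat = f(synth(theta_pre + theta_mod)) - f(synth(theta_pre))

    loss = F.mse_loss(y_hat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```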



Creating a Plugin

Neural network training is conducted in Python using PyTorch. After training, the neural network is exported to TorchScript, allowing it to be loaded into a JUCE plugin written in C++ using libtorch.
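Here is a minimal sketch of what that export could look like; the module, input size, and filename are placeholders for the trained mapping network.

```python
import torch
import torch.nn as nn

# Stand-in for the trained mapping network.
model = nn.Sequential(nn.Linear(4, 64), nn.LeakyReLU(), nn.Linear(64, 16))
model.eval()

# Export to TorchScript; the resulting file can be loaded from C++
# with torch::jit::load via the libtorch API.
scripted = torch.jit.script(model)
scripted.save("mapping_network.pt")
```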

The components required for real-time inference, namely the onset detection, onset features, and the synthesiser, were rewritten in C++ for the audio plugin. To make sure the Python and C++ implementations matched, we wrote unit tests that load the C++ code into Python using cppyy and compare the outputs.
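A parity test of that kind could look roughly like the following; the header name, C++ class, and Python reference function are hypothetical, and the tolerance is an arbitrary choice.

```python
import cppyy
import numpy as np

# Hypothetical header and class names standing in for the plugin's C++ code.
cppyy.include("OnsetFeatures.h")
from cppyy.gbl import OnsetFeatures

def test_onset_features_match(python_onset_features):
    """Compare C++ and Python feature extraction on the same random signal."""
    x = np.random.default_rng(0).uniform(-1.0, 1.0, 2048).astype(np.float32)
    cpp_features = np.asarray(OnsetFeatures().process(x))
    py_features = np.asarray(python_onset_features(x))
    np.testing.assert_allclose(cpp_features, py_features, atol=1e-5)
```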

Here’s a visual overview of the components required for training in Python and what was included in the real-time plugin.



Further Reading

If you’re interested in learning more about this research and its background, we first recommend checking out the conference video presentation and the paper, which was published at the 2024 New Interfaces for Musical Expression conference. There you’ll find a more detailed description of the methods and all of the background references.

If you’re more of a code person, check out this Google Colab, which provides a tutorial on training new mapping models. Repositories containing the training code and the audio plugin are also available.

Below is a brief curated list of papers and resources that were most influential in the design of this research.

Differentiable Digital Signal Processing

Timbre Space, Timbre Analogies, and Timbre Remapping