TorchDrum

An audio-driven 808 drum synthesiser built with PyTorch and JUCE

Jordie Shier¹, Charalampos Saitis¹, Andrew Robertson², and Andrew McPherson³
¹Centre for Digital Music, Queen Mary University of London
²Ableton AG
³Dyson School of Design Engineering, Imperial College London
TorchDrum user interface designed by Lewis Wolstanholme and Francis Devine of Julia Set.

Audio Plugin Code | Training Code | Audio Plugin | Google Colab | Paper | Presentation

What is TorchDrum?

TorchDrum is an audio plugin that transforms an input audio signal into a synthesised 808 drum in real time. It is a synthesiser driven by audio rather than MIDI: the rhythm, loudness, and timbral variations of a percussive input signal are mapped to synthesiser controls, allowing for dynamic, audio-based control. The figure below shows the main signal flow implemented in the audio plugin, which was built in C++ using JUCE.


The main signal processing and mapping components are onset detection, onset feature extraction, a parameter mapping neural network, and the 808 synthesiser.

We refer to the process of automatically updating synthesiser parameters to reflect the timbral variations in the input signal as timbre remapping.
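As a rough illustration of how these components fit together, here is a minimal Python sketch of the per-block control flow. The real plugin implements this in C++ with JUCE, and all of the object names here are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the plugin's per-block control flow; the real
# implementation is C++/JUCE and these component names are illustrative.
def process_block(block: np.ndarray, onset_detector, feature_extractor,
                  mapping_network, synth) -> np.ndarray:
    """Analyse one block of input audio and render the 808 output."""
    if onset_detector.detect(block):
        # Extract loudness/timbre features from a short window at the onset...
        features = feature_extractor(block)
        # ...map them to a modulation of the preset, and retrigger the drum.
        synth.set_parameters(synth.preset + mapping_network(features))
        synth.trigger()
    return synth.render(num_samples=len(block))
```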


Demo

Drummer: Carson Gant (@oneupdrumvids)



Timbre Remapping

Timbre remapping involves the analysis and transfer of timbre from an audio input onto controls for a synthesiser. To achieve timbre remapping, we consider musical phrases where timbre changes from note to note. We look at the differences in timbre (and loudness) between notes and try to recreate those differences on our synthesiser. The process is similar to transposing a melody into a different key, but instead of pitch, we are dealing with melodies of timbre and loudness, and are transposing them from an input instrument onto a synthesiser.

Let’s make this a bit more concrete. Here is an example of a short snare drum phrase with plots of the timbre and loudness for each hit. The arrows show the differences between each note and the first note in the phrase.

Here’s the same snare drum phrase (without the rhythm) followed by all the differences in timbre and loudness mapped onto a synthesiser. What we end up with is a new synthesiser sound (a modulated preset) for each hit in the input phrase.
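To make the idea of a "melody of timbre and loudness" concrete in code, here is a small sketch that expresses each hit as a feature difference relative to the first hit in the phrase. The two features used here (RMS loudness and spectral centroid) are illustrative stand-ins rather than the exact feature set used in TorchDrum.

```python
import numpy as np

def rms_db(x: np.ndarray) -> float:
    """RMS loudness of a hit in decibels."""
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-8)

def spectral_centroid(x: np.ndarray, sr: float = 44100.0) -> float:
    """A simple brightness measure: the magnitude-weighted mean frequency."""
    mags = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    return float(np.sum(freqs * mags) / (np.sum(mags) + 1e-8))

def feature_differences(hits: list[np.ndarray]) -> np.ndarray:
    """Each row holds (Δ loudness, Δ centroid) of a hit relative to the first."""
    feats = np.array([[rms_db(h), spectral_centroid(h)] for h in hits])
    return feats - feats[0]
```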



Parameter Mapping Neural Network

The neural network enables real-time timbre remapping, allowing input signals to be transformed with low latency. The network we use is relatively small, but it contains non-linearities, allowing it to learn complex relationships between input features and synthesiser parameters.

The neural network is trained to generate synthesiser parameters that match the loudness and timbral changes over the full length of a drum hit (~1 second). During real-time operation we can’t wait a full second to see how the input sound evolves, so we predict parameters from short segments of audio at detected onsets.
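For reference, here is a minimal PyTorch sketch of the kind of small non-linear network described above; the layer sizes and the numbers of onset features and synthesiser parameters are illustrative, not the values used in the released models.

```python
import torch
import torch.nn as nn

class ParameterMappingNetwork(nn.Module):
    """Small MLP mapping onset features to synthesiser parameter modulations."""

    def __init__(self, num_features: int = 4, num_params: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, num_params),
        )

    def forward(self, onset_features: torch.Tensor) -> torch.Tensor:
        # The output is a modulation that gets added to a static preset.
        return self.net(onset_features)
```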

We also employ a recent method called differentiable digital signal processing (DDSP), which enables DSP algorithms to be integrated directly into the gradient-descent-based optimisation used to train neural networks. Practically speaking, this allows us to compute training error on the audio output of our synthesiser using loudness and timbral features. We don’t need to know the parameters for our timbre remapping ahead of time; we only need the input audio, which is used during self-supervised training.

This figure outlines the training process:


The main takeaway from the figure is that we’re training the neural network to match differences in audio features between two different hits. \(y = f(x_a) - f(x_b)\) is the difference between two sounds in the input, and \(\hat{y} = f(x_c) - f(x_d)\) is the difference between a synthesiser preset and a modulated version of that preset. The goal of the neural network is to create a parameter modulation \(\theta_{mod}\), which is added to a static preset \(\theta_{pre}\) so that \(\hat{y} = y\). Furthermore, the neural network makes this prediction from onset features \(f_o(\cdot)\), which allows parameters to be predicted in real time.
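As a rough sketch of how this objective might look in PyTorch, the step below treats the differentiable synthesiser, the feature extractor \(f(\cdot)\), and the onset feature extractor \(f_o(\cdot)\) as callables; all three are stand-ins for the DDSP components in the training code, and feeding the network the difference of onset features is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def training_step(model, synth, f, f_o, x_a, x_b, theta_pre, optimizer):
    """One self-supervised step matching feature differences (a sketch)."""
    # Target: feature difference between two hits in the input audio.
    y = f(x_a) - f(x_b)

    # Prediction: modulate the preset from onset features and re-measure
    # the feature difference on the synthesiser's own audio output.
    theta_mod = model(f_o(x_a) - f_o(x_b))
    y_hat = f(synth(theta_pre + theta_mod)) - f(synth(theta_pre))

    loss = F.mse_loss(y_hat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```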



Creating a Plugin

Neural network training is conducted in Python using PyTorch. After training, the neural network is exported to TorchScript, allowing it to be loaded into a JUCE plugin written in C++ using libtorch.
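Here is a minimal sketch of what that export could look like; the module, input size, and filename are placeholders for the trained mapping network.

```python
import torch
import torch.nn as nn

# Stand-in for the trained mapping network.
model = nn.Sequential(nn.Linear(4, 64), nn.LeakyReLU(), nn.Linear(64, 16))
model.eval()

# Export to TorchScript; the resulting file can be loaded from C++
# with torch::jit::load via the libtorch API.
scripted = torch.jit.script(model)
scripted.save("mapping_network.pt")
```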

The components required for real-time inference, namely the onset detection, onset features, and the synthesiser, were rewritten in C++ for the audio plugin. To make sure the Python and C++ implementations matched, we wrote unit tests that load the C++ code into Python using cppyy and compare the outputs.
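A parity test of that kind could look roughly like the following; the header name, C++ class, and Python reference function are hypothetical, and the tolerance is an arbitrary choice.

```python
import cppyy
import numpy as np

# Hypothetical header and class names standing in for the plugin's C++ code.
cppyy.include("OnsetFeatures.h")
from cppyy.gbl import OnsetFeatures

def test_onset_features_match(python_onset_features):
    """Compare C++ and Python feature extraction on the same random signal."""
    x = np.random.default_rng(0).uniform(-1.0, 1.0, 2048).astype(np.float32)
    cpp_features = np.asarray(OnsetFeatures().process(x))
    py_features = np.asarray(python_onset_features(x))
    np.testing.assert_allclose(cpp_features, py_features, atol=1e-5)
```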

Here’s a visual overview of the components required for training in Python and what was included in the real-time plugin.



Further Reading

If you’re interested in learning more about this research and its background, we first recommend checking out the conference video presentation and the paper, which was published at the 2024 New Interfaces for Musical Expression conference. There you’ll find a more detailed description of the methods and all of the background references.

If you’re more of a code person, check out this Google Colab, which provides a tutorial on training new mapping models. Repositories containing the training code and the audio plugin are also available.

Below is a brief curated list of papers and resources that were most influential in the design of this research.

Differentiable Digital Signal Processing

Timbre Space, Timbre Analogies, and Timbre Remapping