TorchDrum

An audio-driven 808 drum synthesiser built with PyTorch and JUCE

Jordie Shier¹, Charalampos Saitis¹, Andrew Robertson², and Andrew McPherson³
¹Centre for Digital Music, Queen Mary University of London
²Ableton AG
³Dyson School of Design Engineering, Imperial College London

Training Code | Audio Plugin | Google Colab | Paper

What is TorchDrum?

TorchDrum is an open-source audio plugin that transforms input signals into a synthesised 808 drum in real time. The rhythm, loudness, and timbral variations of a percussive input signal are mapped to synthesiser controls, allowing for dynamic, audio-based control. The figure above shows the main signal flow implemented in the audio plugin, which was built in C++ using JUCE.


The main signal processing and mapping components are:

- onset detection, which triggers the synthesiser from the input signal
- onset feature extraction, which measures the loudness and timbre of each detected hit
- a parameter mapping neural network, which converts those features into synthesiser parameter modulations
- an 808 drum synthesiser, which renders the output audio

We refer to the process of automatically updating synthesiser parameters to reflect the timbral variations in the input signal as timbre remapping.
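To make the flow concrete, here is a minimal Python sketch of the per-block control logic (all names are illustrative; the real implementation lives in the C++/JUCE plugin):

```python
def process_block(block, detect_onset, extract_features, mapping_net, synth):
    """One audio block: on each detected onset, map input features to
    synthesiser parameters and retrigger the drum voice."""
    onset = detect_onset(block)                     # sample index, or None
    if onset is not None:
        features = extract_features(block, onset)   # loudness + timbre
        theta_mod = mapping_net(features)           # parameter modulation
        synth.trigger(theta_mod)                    # preset + modulation
    return synth.render(len(block))                 # synthesised 808 audio
```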


What is Timbre Remapping?

Timbre remapping involves the analysis and transfer of timbre from an audio input onto controls for a synthesiser. To achieve timbre remapping we consider musical phrases where timbre changes from note to note. We look at the differences in timbre (and loudness) between notes and try to recreate those differences on our synthesiser. The process is similar to transposing a melody into a different key, but instead of pitch, we are dealing with melodies of timbre and loudness, and are transposing from an input instrument onto a synthesiser.
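Here is a toy sketch of the core idea: measuring each note's loudness and timbre as offsets from an anchor note (the feature names below are illustrative):

```python
import numpy as np

def note_differences(features):
    """features: (num_notes, num_features) array of per-note measurements.
    Returns each note's offset from the first note in the phrase -- the
    'melody' of timbre and loudness we want to transpose."""
    return features - features[0]

# e.g. columns: [loudness, spectral centroid, spectral flatness]
phrase = np.array([
    [0.8, 0.40, 0.2],
    [0.6, 0.55, 0.3],
    [0.9, 0.35, 0.5],
])
print(note_differences(phrase))  # row 0 is all zeros (the anchor)
```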

Let’s make this a bit more concrete. Here is an example of a short snare drum phrase with plots of the timbre and loudness for each hit. The arrows show the differences between each note and the first note in the phrase.

Here’s the same snare drum phrase (without the rhythm) followed by all the differences in timbre and loudness mapped onto a synthesiser. What we end up with is a new synthesiser sound (a modulated preset) for each hit in the input phrase.



How does the Parameter Mapping Neural Network Work?

The neural network enables real-time timbre remapping, allowing input signals to be transformed with low latency. The network we use is relatively small, but it contains non-linearities, allowing it to learn complex relationships between input features and synthesiser parameters.
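A minimal sketch of such a mapping network in PyTorch (the layer sizes and activations here are assumptions, not the published architecture):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """A small MLP: onset features in, parameter modulations out."""

    def __init__(self, num_features=8, num_params=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.LeakyReLU(),                 # non-linearity
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, num_params),
            nn.Tanh(),                      # bounded parameter modulations
        )

    def forward(self, onset_features):
        return self.net(onset_features)     # theta_mod

net = MappingNetwork()
theta_mod = net(torch.randn(1, 8))          # one onset -> one modulation
```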

The neural network is trained to generate synthesiser parameters that match the loudness and timbral changes over the full length of a drum hit (~1 second). During real-time operation we can’t wait a full second to see how the input sound evolves, so we predict parameters from short segments of audio at detected onsets.
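In code, this amounts to analysing only a short frame at each onset (the window length below is an assumption; the point is that it is far shorter than the full hit):

```python
import torch

def onset_frame(audio, onset, window=256):
    """Short analysis frame starting at a detected onset, used in place
    of the full ~1 s drum hit for low-latency prediction."""
    return audio[onset:onset + window]

audio = torch.randn(16000)               # stand-in for 1 s of input
frame = onset_frame(audio, onset=4000)   # 256 samples after the onset
# onset features f_o(frame) then feed the mapping network
```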

We also employ a recent method called differentiable digital signal processing (DDSP), which enables DSP algorithms to be integrated directly into the gradient-descent-based optimisation used to train neural networks. Practically speaking, this allows us to compute training error on the audio output of our synthesiser using loudness and timbral features. We don’t need to know the target parameters for our timbre remapping ahead of time; we only need input audio, which is used during self-supervised training.
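As a toy illustration of the DDSP mechanism, here is a differentiable decaying-sine "drum" whose parameters receive gradients through a feature loss on its audio output (the real synthesiser is an 808 model, and the real loss uses a richer set of loudness and timbral features):

```python
import math
import torch

def synth(params, n=16000, sr=16000):
    """params = (frequency in Hz, amplitude decay rate); every operation
    is differentiable, so gradients flow from audio back to params."""
    freq, decay = params
    t = torch.arange(n) / sr
    return torch.exp(-decay * t) * torch.sin(2 * math.pi * freq * t)

def loudness(audio):
    return torch.sqrt(torch.mean(audio ** 2))   # crude RMS stand-in

params = torch.tensor([55.0, 8.0], requires_grad=True)
target = synth(torch.tensor([60.0, 4.0]))       # pretend "input" audio

loss = (loudness(synth(params)) - loudness(target)) ** 2
loss.backward()                                  # gradients reach params
print(params.grad)
```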

This figure outlines the training process:


The main takeaway from the figure is that we’re training the neural network to match differences in audio features between two different hits. \(y = f(x_a) - f(x_b)\) is the difference between two different sounds in the input, and \(\hat{y} = f(x_c) - f(x_d)\) is the difference between a synthesiser preset and a modulated version of that preset. The goal of the neural network is to create a parameter modulation \(\theta_{mod}\) such that \(\hat{y} = y\). Furthermore, the neural network makes this prediction from onset features \(f_o(\cdot)\), allowing for real-time parameter prediction.
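Putting the pieces together, a sketch of this training objective might look as follows (how exactly the onset features enter the network is simplified here; `f`, `f_o`, `net`, and `synth` stand for the feature extractor, onset feature extractor, mapping network, and differentiable synthesiser):

```python
import torch

def difference_matching_loss(x_a, x_b, theta, net, f, f_o, synth):
    """Train net so that feature differences between two input hits are
    reproduced between a preset and its modulated version."""
    y = f(x_a) - f(x_b)                        # input feature difference
    theta_mod = net(f_o(x_a) - f_o(x_b))       # onset features -> theta_mod
    y_hat = f(synth(theta + theta_mod)) - f(synth(theta))
    return torch.mean((y - y_hat) ** 2)        # push y_hat toward y
```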



Creating a Plugin

Neural network training is conducted using PyTorch; the trained mapping network is then used by the real-time audio plugin, which is built in C++ with JUCE.
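One common way to bridge the two (an assumption here, not necessarily how this project does it) is to export the trained PyTorch model so it can be loaded from C++:

```python
import torch
import torch.nn as nn

# Stand-in for the trained mapping network
net = nn.Sequential(nn.Linear(8, 64), nn.LeakyReLU(), nn.Linear(64, 16))

scripted = torch.jit.trace(net, torch.randn(1, 8))  # freeze the graph
scripted.save("mapping_network.pt")  # loadable from C++ via LibTorch
```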



Plugin Overview


