An audio-driven 808 drum synthesiser built with PyTorch and JUCE
TorchDrum is an open-source audio plugin that transforms input signals into a synthesised 808 drum in real time. The rhythm, loudness, and timbral variations of a percussive input signal are mapped to synthesiser controls, allowing for dynamic, audio-based control. The figure above shows the main signal flow implemented in the audio plugin, which was built in C++ using JUCE.
The main signal processing and mapping components are onset detection, onset feature extraction, a neural network that maps those features to synthesiser parameter modulations, and the 808 drum synthesiser itself.
We refer to the process of automatically updating synthesiser parameters to reflect the timbral variations in the input signal as timbre remapping.
Timbre remapping involves the analysis and transfer of timbre from an audio input onto controls for a synthesiser. To achieve this, we consider musical phrases where timbre changes from note to note. We look at the differences in timbre (and loudness) between notes and try to recreate those differences on our synthesiser. The process is similar to transposing a melody into a different key, but instead of pitch we are dealing with melodies of timbre and loudness, and are transposing them from an input instrument onto a synthesiser.
Let’s make this a bit more concrete. Here is an example of a short snare drum phrase with plots of the timbre and loudness for each hit. The arrows show the difference between each note and the first note in the phrase.
Here’s the same snare drum phrase (without the rhythm), followed by the differences in timbre and loudness mapped onto a synthesiser. What we end up with is a new synthesiser sound (a modulated preset) for each hit in the input phrase.
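To make the note-to-note differences concrete in code, here is a minimal sketch that computes a loudness and brightness value per hit and the difference of each hit relative to the first. The choice of RMS loudness and spectral centroid as features, the sample rate, and the synthetic noise "hits" are assumptions for illustration, not the exact features used in TorchDrum.

```python
# Minimal sketch: per-hit features and their differences relative to the first hit.
# RMS loudness and spectral centroid are illustrative stand-ins for the real features.
import torch


def hit_features(x: torch.Tensor, sample_rate: int = 44100) -> torch.Tensor:
    """Return [loudness_dB, spectral_centroid_Hz] for a single mono drum hit."""
    rms = torch.sqrt(torch.mean(x**2))
    loudness_db = 20.0 * torch.log10(rms + 1e-8)

    spectrum = torch.abs(torch.fft.rfft(x))
    freqs = torch.fft.rfftfreq(x.shape[-1], d=1.0 / sample_rate)
    centroid = torch.sum(freqs * spectrum) / (torch.sum(spectrum) + 1e-8)
    return torch.stack([loudness_db, centroid])


# Three fake snare hits (white noise at different levels) stand in for real recordings.
hits = [torch.randn(44100) * gain for gain in (0.9, 0.5, 0.2)]
features = torch.stack([hit_features(h) for h in hits])

# Difference of each note relative to the first note in the phrase;
# these correspond to the "arrows" in the plots above.
deltas = features - features[0]
print(deltas)
```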
The neural network enables real-time timbre remapping, allowing input signals to be transformed with low latency. The network we use is relatively small, but it contains non-linearities, which allows complex relationships between input features and synthesiser parameters to be formed.
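As a rough sketch of what such a small network might look like in PyTorch, the module below maps a vector of onset features to a bounded parameter modulation. The feature dimensionality, hidden size, number of synthesiser parameters, and choice of activations here are assumptions for illustration, not the plugin's actual architecture.

```python
# A small non-linear mapping from onset features to synthesiser parameter modulations.
# Sizes and activations are assumed for illustration.
import torch
import torch.nn as nn


class ParamPredictor(nn.Module):
    def __init__(self, n_features: int = 8, hidden: int = 64, n_params: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.LeakyReLU(),   # non-linearity lets the mapping go beyond simple scaling
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, n_params),
            nn.Tanh(),        # bounded output, interpreted as a parameter modulation
        )

    def forward(self, onset_features: torch.Tensor) -> torch.Tensor:
        # onset_features: (batch, n_features) -> parameter modulation: (batch, n_params)
        return self.net(onset_features)


model = ParamPredictor()
theta_mod = model(torch.randn(1, 8))  # one detected onset -> one modulation vector
```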
The neural network is trained to generate synthesiser parameters that match the loudness and timbral changes over the full length of a drum hit (~1 second). During real-time operation we can’t wait a full second to see how the input sound evolves, so we predict parameters from short segments of audio at detected onsets.
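The sketch below illustrates that idea under simple assumptions: a crude energy-threshold onset detector and a fixed 256-sample analysis frame stand in for the plugin's actual onset detection and onset features.

```python
# Sketch: analyse a short buffer of audio at each detected onset rather than the full hit.
# The frame length and threshold-based detector are assumptions for illustration.
import torch

FRAME = 256  # samples analysed at each onset (assumed length)


def detect_onsets(x: torch.Tensor, threshold: float = 0.1) -> list[int]:
    """Crude energy-based onset detector over non-overlapping frames."""
    frames = x[: len(x) // FRAME * FRAME].reshape(-1, FRAME)
    energy = frames.pow(2).mean(dim=1)
    rising = (energy[1:] > threshold) & (energy[:-1] <= threshold)
    return [(i + 1) * FRAME for i in torch.nonzero(rising).flatten().tolist()]


def onset_frame(x: torch.Tensor, onset: int) -> torch.Tensor:
    """Short segment of audio starting at a detected onset."""
    return x[onset : onset + FRAME]


# Example: silence followed by a noise burst triggers one onset.
signal = torch.cat([torch.zeros(4096), torch.randn(4096)])
for onset in detect_onsets(signal):
    frame = onset_frame(signal, onset)
    print(onset, frame.shape)  # f_o(frame) would be computed here and fed to the network
```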
We also employ a recent method called differentiable digital signal processing (DDSP), which enables DSP algorithms to be integrated directly into the gradient-descent-based optimization used to train neural networks. Practically speaking, this allows us to compute the training error on the audio output of our synthesiser using loudness and timbral features. We don’t need to know the parameters for our timbre remapping ahead of time; we only need input audio, which we use during self-supervised training.
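As a minimal illustration of how DDSP makes this possible, the sketch below implements a toy decaying-sine "drum" synthesiser in pure PyTorch and backpropagates an audio-feature error through it to the synthesiser parameters. The synthesiser and the single loudness feature are simplified stand-ins, not the plugin's actual synthesis or loss.

```python
# Toy DDSP example: a differentiable sine drum with pitch and amplitude decay,
# optimised against a feature computed on its audio output.
import torch


def toy_drum(params: torch.Tensor, n_samples: int = 44100, sr: int = 44100) -> torch.Tensor:
    """params = [base_freq_hz, pitch_decay, amp_decay]; returns mono audio."""
    base_freq, pitch_decay, amp_decay = params
    t = torch.arange(n_samples) / sr
    freq = base_freq * (1.0 + torch.exp(-pitch_decay * t))   # falling pitch envelope
    phase = 2.0 * torch.pi * torch.cumsum(freq / sr, dim=0)
    return torch.sin(phase) * torch.exp(-amp_decay * t)      # decaying amplitude


def loudness(x: torch.Tensor) -> torch.Tensor:
    return 20.0 * torch.log10(torch.sqrt(torch.mean(x**2)) + 1e-8)


# Target audio we would like to match (here: another parameter setting).
target = toy_drum(torch.tensor([45.0, 20.0, 6.0]))

# Because the synthesiser is differentiable, the training error computed on its
# audio output flows back to the synthesis parameters.
params = torch.tensor([60.0, 10.0, 3.0], requires_grad=True)
loss = (loudness(toy_drum(params)) - loudness(target)) ** 2
loss.backward()
print(params.grad)  # gradients reach the DSP parameters through the synthesiser
```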
This figure outlines the training process:
The main takeaway from the figure is that we’re training the neural network to match differences in audio features between two different hits. \(y = f(x_a) - f(x_b)\) is the difference between two different sounds in the input, and \(\hat{y} = f(x_c) - f(x_d)\) is the difference between a synthesiser preset and a modulated version of that preset. The goal of the neural network is to create a parameter modulation \(\theta_{mod}\) such that \(y = \hat{y}\). Furthermore, the neural network is trained to make this prediction from onset features \(f_o(\cdot)\), allowing for real-time parameter prediction.
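The sketch below spells out this objective with simplified stand-ins for the feature extractor \(f\), the onset features \(f_o(\cdot)\), the synthesiser, and the network; the toy synthesiser, feature choices, shapes, and MSE loss are assumptions used only to show the structure of one training step.

```python
# Sketch of one difference-matching training step:
#   y     = f(x_a) - f(x_b)                         (input feature difference)
#   y_hat = f(synth(theta + theta_mod)) - f(synth(theta))   (synth feature difference)
import torch
import torch.nn as nn


def f(audio: torch.Tensor) -> torch.Tensor:
    """Stand-in audio features: loudness (dB) and spectral centroid (kHz)."""
    rms = torch.sqrt(torch.mean(audio**2))
    spectrum = torch.abs(torch.fft.rfft(audio))
    freqs = torch.fft.rfftfreq(audio.shape[-1], d=1.0 / 44100)
    centroid = torch.sum(freqs * spectrum) / (torch.sum(spectrum) + 1e-8)
    return torch.stack([20.0 * torch.log10(rms + 1e-8), centroid / 1000.0])


def synth(theta: torch.Tensor, n: int = 44100) -> torch.Tensor:
    """Toy differentiable synth: decaying sine, theta = [freq, decay]."""
    t = torch.arange(n) / 44100
    return torch.sin(2 * torch.pi * theta[0] * t) * torch.exp(-theta[1] * t)


predictor = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2), nn.Tanh())
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

theta_preset = torch.tensor([55.0, 5.0])                        # fixed synthesiser preset
x_a, x_b = torch.randn(44100) * 0.8, torch.randn(44100) * 0.3   # two input hits

# Target: the feature difference between the two input hits.
y = f(x_a) - f(x_b)

# Prediction from onset features (here, features of the short attack of x_a).
f_o = f(x_a[:256])
theta_mod = predictor(f_o)

# Feature difference between the modulated preset and the original preset.
y_hat = f(synth(theta_preset + theta_mod)) - f(synth(theta_preset))

optimizer.zero_grad()
loss = torch.mean((y - y_hat) ** 2)
loss.backward()
optimizer.step()
```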
Neural network training is conducted using PyTorch.