Take advantage of the processor's parallelism to perform single instruction multiple data calculations. Optimise your audio applications without introducing concurrency.
Level: Advanced
Platforms: Windows, macOS, Linux
Classes: dsp::SIMDRegister, dsp::IIR, dsp::ProcessorDuplicator, AudioDataConverters, dsp::AudioBlock, HeapBlock
Download the demo project for this tutorial here: PIP | ZIP. Unzip the project and open the first header file in the Projucer.
If you need help with this step, see Tutorial: Projucer Part 1: Getting started with the Projucer.
The demo project plays a loaded audio file through an IIR filter so that the processed result can be auditioned. The goal of this optimisation is to see how much CPU load we can save by using SIMD instruction sets on the same IIR filter.
SIMD stands for "Single Instruction Multiple Data" and refers to the way modern CPUs can apply a single instruction to a set of data by loading several numbers into wide registers and performing the same calculation on all of them at once. In digital signal processing, this type of parallelism is favoured over other types such as MIMD (Multiple Instruction Multiple Data) because concurrency is problematic at the audio level: the audio thread must not fight other threads over its data, and in most cases the order of operations has to be preserved when processing audio.
SIMD operates on vectors of data instead of individual values, which makes it even more suitable for audio processing, as we are used to receiving blocks of data from the audio buffer. SIMD also thrives when we need to apply the same operation to many data points, which is very common in DSP algorithms.
General code is nowadays usually optimised automatically by the compiler, but the vectorisation of DSP algorithms is not always trivial. Compilers cannot always work out what an algorithm is trying to do, so they may fail to vectorise it well. This task is therefore often performed manually, and the SIMDRegister class is a handy tool for doing this in JUCE.
The SIMDRegister class is convenient because it handles different processor types for you. Depending on the CPU, the size and number of registers can vary, and it can quickly become difficult to account for all CPU vendors. The SIMDRegister class takes care of this, so all we need to do is decide which parts of our algorithms we want to vectorise.
Using the SIMDRegister class is relatively straightforward, as it essentially acts as a drop-in replacement for primitive types. Let's take a look at a simple example such as this one:
This can be easily vectorised by simply wrapping the primitive types with the SIMDRegister class:
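To make the idea of a drop-in replacement concrete, here is a minimal sketch in plain C++ that does not use JUCE. `Vec4` and `multiplyAndAdd` are hypothetical names invented for illustration. `Vec4` only imitates a four-lane register with ordinary loops and is not how dsp::SIMDRegister is implemented; a real register type would map the operators onto CPU instructions.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a SIMD register type: four float lanes with
// lane-wise operators (plain loops here, purely for illustration).
struct Vec4
{
    std::array<float, 4> lanes;

    friend Vec4 operator+ (Vec4 a, const Vec4& b)
    {
        for (std::size_t i = 0; i < 4; ++i)
            a.lanes[i] += b.lanes[i];

        return a;
    }

    friend Vec4 operator* (Vec4 a, const Vec4& b)
    {
        for (std::size_t i = 0; i < 4; ++i)
            a.lanes[i] *= b.lanes[i];

        return a;
    }
};

// Written once against a generic type: with T = float it computes one value,
// with T = Vec4 the same expression computes four values.
template <typename T>
T multiplyAndAdd (const T& a, const T& b, const T& c)
{
    return a * b + c;
}
```

Because the arithmetic is expressed through operators, the body of `multiplyAndAdd` does not change when the primitive type is swapped for the wrapper type.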
In DSP code, conditional statements can be slow, and branching should generally be avoided where possible. The following example is therefore a good candidate for SIMD optimisation:
Fortunately, the SIMDRegister class provides us with bit masks that allow us to select the correct result as follows:
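The mask-based selection idea can be sketched for a single lane in plain C++, without JUCE. The helper names below (`greaterThanMask`, `selectByMask`, `clampAbove`) are hypothetical, and the sketch assumes a 32-bit `float`. A SIMD register would apply the same bit operations to every lane at once.

```cpp
#include <cstdint>
#include <cstring>

static_assert (sizeof (float) == sizeof (std::uint32_t), "sketch assumes 32-bit float");

// All bits set when a > b, all bits clear otherwise (unsigned wrap-around).
inline std::uint32_t greaterThanMask (float a, float b)
{
    return 0u - static_cast<std::uint32_t> (a > b);
}

inline std::uint32_t toBits (float x)
{
    std::uint32_t u;
    std::memcpy (&u, &x, sizeof u);
    return u;
}

inline float fromBits (std::uint32_t u)
{
    float x;
    std::memcpy (&x, &u, sizeof x);
    return x;
}

// Keeps ifTrue where the mask is set and ifFalse where it is clear.
inline float selectByMask (std::uint32_t mask, float ifTrue, float ifFalse)
{
    return fromBits ((toBits (ifTrue) & mask) | (toBits (ifFalse) & ~mask));
}

// Mask-based equivalent of: x > limit ? limit : x
inline float clampAbove (float x, float limit)
{
    return selectByMask (greaterThanMask (x, limit), limit, x);
}
```

Both candidate results are computed unconditionally and the mask decides which one survives, so no lane needs its own branch.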
For the purpose of this tutorial we will optimise an IIR filter using SIMD, so let's start by taking a look at the IIR filter implementation.
In the SIMDTutorialFilter class, we first define member variables such as parameters for our filter as shown here:
Defining the IIR filter object within a ProcessorDuplicator converts our mono processor into a multi-channel one automatically, so we do not have to call the prepare(), process() and reset() functions on each channel individually. We also define the parameters of the filter, such as the type of pass filter, the cutoff frequency and the sharpness Q of the filter.
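The duplication idea can be illustrated with a small plain C++ sketch that does not use JUCE. `Duplicator`, `OnePole` and `OnePoleState` are hypothetical types invented here; they only imitate the pattern of one mono processor per channel with a single shared state and make no claim about how dsp::ProcessorDuplicator is written.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical shared parameter block for the mono processor below.
struct OnePoleState { float a = 0.5f; };

// Hypothetical mono processor: a one-pole smoother, y = x + a * (y_prev - x).
struct OnePole
{
    explicit OnePole (std::shared_ptr<OnePoleState> s) : state (std::move (s)) {}

    void reset() { z = 0.0f; }

    float processSample (float x)
    {
        z = x + state->a * (z - x);
        return z;
    }

    std::shared_ptr<OnePoleState> state;
    float z = 0.0f;
};

// Holds one mono processor per channel. All of them read the same state
// object, so a parameter change is made once for every channel.
template <typename Mono, typename State>
class Duplicator
{
public:
    Duplicator() : state (std::make_shared<State>()) {}

    void prepare (std::size_t numChannels)
    {
        processors.clear();

        for (std::size_t ch = 0; ch < numChannels; ++ch)
            processors.emplace_back (state);
    }

    void reset()
    {
        for (auto& p : processors)
            p.reset();
    }

    // channels[ch] points to numSamples samples, processed in place.
    void process (float* const* channels, std::size_t numSamples)
    {
        for (std::size_t ch = 0; ch < processors.size(); ++ch)
            for (std::size_t i = 0; i < numSamples; ++i)
                channels[ch][i] = processors[ch].processSample (channels[ch][i]);
    }

    std::shared_ptr<State> state;

private:
    std::vector<Mono> processors;
};
```

Each channel keeps its own filter memory (`z`), while the parameters live in one place.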
In the updateParameters() function, we make sure that the parameters of the filter are updated when the on-screen controls are modified:
Every time a parameter is modified, we create a new state for the IIR filter with a new set of coefficients depending on the sample rate, cutoff frequency and Q. The DSP module provides us with handy coefficients for our three filter types by using the makeLowPass(), makeHighPass() and makeBandPass() functions respectively.
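To show how a coefficient set can depend on the sample rate, cutoff frequency and Q, here is a textbook second-order low-pass derived with the bilinear transform, written in plain C++. `makeLowPassSketch`, `dcGain` and the `Biquad` alias are hypothetical names, and the sketch is not presented as the formula that makeLowPass() uses internally.

```cpp
#include <array>
#include <cmath>

// Normalised biquad coefficients { b0, b1, b2, a1, a2 } with a0 divided out.
using Biquad = std::array<double, 5>;

// Bilinear transform of the analogue prototype H(s) = 1 / (s^2 + s/Q + 1).
inline Biquad makeLowPassSketch (double sampleRate, double cutoffHz, double q)
{
    const double pi = 3.14159265358979323846;
    const double n  = 1.0 / std::tan (pi * cutoffHz / sampleRate);
    const double c  = 1.0 / (1.0 + n / q + n * n);

    return { { c, 2.0 * c, c,
               2.0 * c * (1.0 - n * n),
               c * (1.0 - n / q + n * n) } };
}

// Gain at 0 Hz, obtained by evaluating the transfer function at z = 1.
inline double dcGain (const Biquad& k)
{
    return (k[0] + k[1] + k[2]) / (1.0 + k[3] + k[4]);
}
```

A low-pass should pass DC unchanged, so `dcGain` evaluating to 1 is a quick sanity check on any new set of coefficients.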
In the prepare() function, we set the sample rate from the ProcessSpec object, set the IIR filter coefficients for the default case of a low pass filter and prepare the filter using the prepare() function with information on the processing context:
Processing the audio file with the filter is straightforward: in the process() function, we call process() on the filter with a context in which a single block is used for both the input and the output:
Finally, we reset the filter by calling reset() on it in the reset() function:
Let's start optimising this IIR filter now.
Before optimising the code of our IIR filter, we need to ensure that SIMD is available on our system. Use the JUCE_USE_SIMD macro to check whether SIMD is available on your target by wrapping the whole filter implementation like so:
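As a standalone sketch of the guard pattern, the snippet below compiles with or without JUCE. `simdFilterEnabled` and `filterVariant` are hypothetical names; in a real project the macro would come from the JUCE headers, and the guarded region would contain the filter class itself.

```cpp
// In this standalone sketch JUCE_USE_SIMD stays undefined unless the build
// defines it, and an undefined macro evaluates to 0 inside #if.
#if JUCE_USE_SIMD
 // The SIMD-optimised filter implementation would live here.
 constexpr bool simdFilterEnabled = true;
#else
 // Fallback: keep the scalar implementation.
 constexpr bool simdFilterEnabled = false;
#endif

inline const char* filterVariant()
{
    return simdFilterEnabled ? "SIMD" : "scalar";
}
```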
Let's first define member variables for the IIR filter as well as AudioBlock and HeapBlock objects to facilitate the processing at the bottom of our SIMDTutorialFilter class:
Define the IIR coefficients as a pointer and the filter as a unique pointer using the SIMDRegister class to wrap the sample type [1]. Create an AudioBlock to store interleaved data using the SIMDRegister class to wrap the sample type and another AudioBlock for zero data used later to store the output block [2]. Allocate HeapBlock objects to hold the corresponding AudioBlock objects and some channel pointers with the size of the number of elements in a SIMDRegister vector [3].
In the prepare() function, set the sample rate as before and calculate the default coefficients for the filter [4] . Reset the filter by instantiating a new IIR filter with a SIMDRegister wrapper around the sample type and the coefficients defined earlier [5] as follows:
Create the AudioBlock objects for the interleaved data and the zero data by allocating the corresponding HeapBlock objects defined earlier [6]. The interleaved data block only needs one channel and the maximum block size is retrieved from the context information. The zero data block takes the size of the SIMDRegister vector and is cleared before processing. The filter is prepared by reducing the number of channels to mono on the present context information [7] as the multi-channel samples will be interleaved later and processed as one channel.
Finally, in the process() function we will interleave the samples for optimised processing as follows:
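The interleaving step itself can be sketched in plain C++ without JUCE. The functions `interleavePadded` and `deinterleave` and the constant `numLanes` are hypothetical; the lane count of four is an assumption, since the real register width depends on the CPU. Lanes without a matching input channel are filled with zeros, in the spirit of the zero data block described above.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t numLanes = 4; // assumed register width in floats

// Packs up to numLanes planar channels (LLL..., RRR...) into frames of
// numLanes samples each (L R 0 0, L R 0 0, ...).
inline std::vector<float> interleavePadded (const float* const* channels,
                                            std::size_t numChannels,
                                            std::size_t numSamples)
{
    std::vector<float> frames (numSamples * numLanes, 0.0f);
    const std::size_t used = numChannels < numLanes ? numChannels : numLanes;

    for (std::size_t i = 0; i < numSamples; ++i)
        for (std::size_t ch = 0; ch < used; ++ch)
            frames[i * numLanes + ch] = channels[ch][i];

    return frames;
}

// Copies the first numChannels lanes of every frame back to planar channels.
inline void deinterleave (const std::vector<float>& frames,
                          float* const* channels,
                          std::size_t numChannels,
                          std::size_t numSamples)
{
    const std::size_t used = numChannels < numLanes ? numChannels : numLanes;

    for (std::size_t i = 0; i < numSamples; ++i)
        for (std::size_t ch = 0; ch < used; ++ch)
            channels[ch][i] = frames[i * numLanes + ch];
}
```

After interleaving, every frame holds one sample per channel side by side, so a filter operating on whole frames can treat the multi-channel signal as a single stream of vectors.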
The reset() function of the filter remains the same in both cases and the optimisation is complete.
We just have to update the updateParameters() function to account for the new coefficients pointer as follows:
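One way to picture an update through a coefficients pointer is the plain C++ sketch below, which does not use JUCE. `Coeffs`, `Filter` and this `updateParameters` signature are hypothetical. The point illustrated is that the values behind the pointer are overwritten in place, so a filter holding the same pointer picks up the change without being rebuilt.

```cpp
#include <memory>
#include <utility>

// Hypothetical coefficient set for a first-order recursive filter.
struct Coeffs
{
    float b0;
    float a1;
};

// Hypothetical filter that shares ownership of its coefficients.
struct Filter
{
    explicit Filter (std::shared_ptr<Coeffs> c) : coeffs (std::move (c)) {}

    float processSample (float x)
    {
        z = coeffs->b0 * x + coeffs->a1 * z;
        return z;
    }

    std::shared_ptr<Coeffs> coeffs;
    float z = 0.0f;
};

// Overwrites the pointed-to values instead of re-seating the pointer, so
// every holder of the pointer sees the new coefficients.
inline void updateParameters (const std::shared_ptr<Coeffs>& shared, const Coeffs& fresh)
{
    *shared = fresh;
}
```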
The code for this optimised version can be found in the SIMDRegisterTutorial_02.h file of the demo project.
In this tutorial, we have learnt how to optimise DSP code using the SIMDRegister class. In particular, we have: