Tutorial: Optimisation using the SIMDRegister class

Take advantage of the processor's parallelism to perform single instruction multiple data calculations. Optimise your audio applications without introducing concurrency.

Level: Advanced

Platforms: Windows, macOS, Linux

Classes: dsp::SIMDRegister, dsp::IIR, dsp::ProcessorDuplicator, AudioDataConverters, dsp::AudioBlock, HeapBlock

Getting started

Download the demo project for this tutorial here: PIP | ZIP. Unzip the project and open the first header file in the Projucer.

If you need help with this step, see Tutorial: Projucer Part 1: Getting started with the Projucer.

The demo project

The demo project plays a loaded audio file through an IIR filter so that you can hear the processed result. The aim of this optimisation is to see how much CPU load we can save by using SIMD instructions on the same IIR filter.

The demo project window
Note
The code presented here is broadly similar to the SIMDRegisterDemo from the DSP Demo.

SIMD Instructions

SIMD stands for "Single Instruction Multiple Data" and refers to the way modern CPUs can apply a single instruction to a set of data by loading several numbers into wide registers and performing the same calculation on all of them at once. In the world of digital signal processing, this type of parallelism is favoured over other types such as MIMD (Multiple Instruction Multiple Data) because concurrency is problematic in audio code. The audio thread must not contend with other threads for its data, and in most cases the order of operations has to be preserved when processing audio.

SIMD operates on vectors of data instead of individual values, which makes it even more suitable for audio processing as we are used to receiving blocks of data from the audio buffer. SIMD also thrives when we need to apply the same scalar operation to multiple data points, which is very common in DSP algorithms.

General code is usually optimised automatically by the compiler nowadays, but vectorising DSP algorithms is not always trivial. Compilers cannot always work out what an algorithm is trying to do well enough to vectorise it correctly. This task is therefore often performed manually, and the SIMDRegister class is a handy tool for doing so in JUCE.

The SIMDRegister class is convenient because it handles different processor types for you. Depending on the CPU, the size and number of registers can vary and it can quickly become difficult to account for all CPU vendors. This is all handled by the SIMDRegister class and all we need to do is to specify which sets of instructions we want to vectorise in our algorithms.
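As a rough illustration of why the number of elements per register depends on the hardware, the following sketch computes how many values of a given type fit into a register of a given width. The helper name lanesFor is hypothetical and is not part of JUCE; in JUCE the equivalent information comes from SIMDRegister<Type>::size().

```cpp
#include <cstddef>

// Hypothetical helper (not a JUCE function): the number of elements of type T
// that fit into a register of the given width in bits.
template <typename T>
constexpr std::size_t lanesFor (std::size_t registerBits)
{
    return registerBits / (8 * sizeof (T));
}
```

Assuming the usual 4-byte float and 8-byte double, a 128-bit register holds four floats or two doubles, and a 256-bit register holds eight floats.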

Using the SIMDRegister class is relatively straightforward as it essentially acts as a drop-in replacement for primitive types. Let's take a look at a simple example such as this one:

float calculateDSPEffect (float x, float y)
{
    auto z = x + (y * 2.0f);
    return z;
}

This can be easily vectorised by simply wrapping the primitive types with the SIMDRegister class:

SIMDRegister<float> calculateDSPEffect (SIMDRegister<float> x,
                                        SIMDRegister<float> y)
{
    auto z = x + (y * 2.0f);
    return z;
}
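To see why the function body can stay unchanged, it may help to look at a plain C++ stand-in. The sketch below defines a hypothetical Float4 type (four floats with element-wise operators, not the JUCE class and not using real SIMD instructions) and writes the function once as a template, so the same expression compiles for both float and Float4.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-lane stand-in for a SIMD register; this is not the JUCE class.
struct Float4
{
    std::array<float, 4> lane;
};

// Lane-wise addition of two vectors.
inline Float4 operator+ (const Float4& a, const Float4& b)
{
    Float4 r {};
    for (std::size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Every lane multiplied by the same scalar.
inline Float4 operator* (const Float4& a, float s)
{
    Float4 r {};
    for (std::size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * s;
    return r;
}

// The body is identical whether T is float or Float4.
template <typename T>
T calculateDSPEffect (T x, T y)
{
    auto z = x + (y * 2.0f);
    return z;
}
```

Each lane of the Float4 result matches what the scalar version returns for the corresponding pair of inputs, which is the property the SIMDRegister wrapper relies on.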

In DSP code, conditional statements can be costly and branching should generally be avoided where possible. The following function, which chooses between two results with a condition, is therefore a good candidate for SIMD optimisation:

float calculateDSPEffect (float x, float y)
{
    auto z = (x > y ? x + (y * 2.0f) : y);
    return z;
}

Fortunately, the SIMDRegister class provides us with bit masks that allow us to select the correct result as follows:

SIMDRegister<float> calculateDSPEffect (SIMDRegister<float> x,
                                        SIMDRegister<float> y)
{
    auto mask = SIMDRegister<float>::greaterThan (x, y);
    auto z = ((x + (y * 2.0f)) & mask) + (y & (~mask));
    return z;
}
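The mask trick can be emulated in plain C++ to show what happens in each lane. This is only a sketch under stated assumptions: Float4, Mask4, toBits and fromBits are hypothetical names, the loops stand in for single SIMD instructions, and the comparison inside greaterThan would be a branch-free hardware compare on a real SIMD unit.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Float4 = std::array<float, 4>;          // hypothetical 4-lane vector
using Mask4  = std::array<std::uint32_t, 4>;  // one 32-bit mask per lane

// Reinterpret the bits of a float as an integer and back again.
inline std::uint32_t toBits (float f)
{
    std::uint32_t u;
    std::memcpy (&u, &f, sizeof (u));
    return u;
}

inline float fromBits (std::uint32_t u)
{
    float f;
    std::memcpy (&f, &u, sizeof (f));
    return f;
}

// All bits set in a lane where a > b, all bits clear elsewhere.
inline Mask4 greaterThan (const Float4& a, const Float4& b)
{
    Mask4 m {};
    for (std::size_t i = 0; i < 4; ++i)
        m[i] = a[i] > b[i] ? 0xFFFFFFFFu : 0u;
    return m;
}

// Per-lane equivalent of: x > y ? x + (y * 2.0f) : y
inline Float4 calculateDSPEffect (const Float4& x, const Float4& y)
{
    const Mask4 mask = greaterThan (x, y);
    Float4 z {};
    for (std::size_t i = 0; i < 4; ++i)
    {
        // Both results are computed; the mask zeroes the one that is not wanted.
        const float ifTrue  = fromBits (toBits (x[i] + (y[i] * 2.0f)) & mask[i]);
        const float ifFalse = fromBits (toBits (y[i]) & ~mask[i]);
        z[i] = ifTrue + ifFalse;   // one of the two operands is always +0.0f
    }
    return z;
}
```

The design choice to note is that both sides of the condition are always evaluated, so the technique pays off when both expressions are cheap compared with a mispredicted branch.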

For the purpose of this tutorial we will optimise an IIR filter using SIMD, so let's start by taking a look at the IIR filter implementation.

The IIR Filter

In the SIMDTutorialFilter class, we first define member variables such as parameters for our filter as shown here:

    dsp::ProcessorDuplicator<dsp::IIR::Filter<float>, dsp::IIR::Coefficients<float>> iir;

    ChoiceParameter typeParam { { "Low-pass", "High-pass", "Band-pass" }, 1, "Type" };
    SliderParameter cutoffParam { { 20.0, 20000.0 }, 0.5, 440.0f, "Cutoff", "Hz" };
    SliderParameter qParam { { 0.3, 20.0 }, 0.5, 0.7, "Q" };

    std::vector<DSPParameterBase*> parameters { &typeParam, &cutoffParam, &qParam };
    double sampleRate = 0.0;
};

Defining the IIR filter object within a ProcessorDuplicator converts our mono processor into a multi-channel one automatically, so we do not have to call the prepare(), process() and reset() functions on each channel individually. We also define the parameters of the filter: the filter type, the cutoff frequency and the Q, which controls the sharpness of the filter.

In the updateParameters() function, we make sure that the parameters of the filter are updated when the on-screen controls are modified:

void updateParameters()
{
    if (sampleRate != 0.0)
    {
        auto cutoff = static_cast<float> (cutoffParam.getCurrentValue());
        auto qVal   = static_cast<float> (qParam.getCurrentValue());

        switch (typeParam.getCurrentSelectedID())
        {
            case 1:  *iir.state = *dsp::IIR::Coefficients<float>::makeLowPass  (sampleRate, cutoff, qVal); break;
            case 2:  *iir.state = *dsp::IIR::Coefficients<float>::makeHighPass (sampleRate, cutoff, qVal); break;
            case 3:  *iir.state = *dsp::IIR::Coefficients<float>::makeBandPass (sampleRate, cutoff, qVal); break;
            default: break;
        }
    }
}

Every time a parameter is modified, we create a new state for the IIR filter with a new set of coefficients depending on the sample rate, cutoff frequency and Q. The DSP module provides us with handy coefficients for our three filter types by using the makeLowPass(), makeHighPass() and makeBandPass() functions respectively.
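To give an idea of what such a coefficient calculation involves, here is a minimal sketch of a textbook second-order low-pass design obtained with the bilinear transform. It is an assumption for illustration only: the names BiquadCoefficients and makeLowPassSketch are hypothetical, and the formula is not guaranteed to match what makeLowPass() does internally.

```cpp
#include <cmath>

// Hypothetical container for normalised second-order coefficients (a0 == 1):
// y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]
struct BiquadCoefficients
{
    double b0, b1, b2, a1, a2;
};

// Textbook bilinear-transform low-pass design, shown for illustration only.
inline BiquadCoefficients makeLowPassSketch (double sampleRate, double cutoff, double q)
{
    const double pi = 3.14159265358979323846;
    const double n  = 1.0 / std::tan (pi * cutoff / sampleRate);  // pre-warped frequency term
    const double c  = 1.0 / (1.0 + n / q + n * n);                // normalisation by a0

    return { c,
             2.0 * c,
             c,
             2.0 * c * (1.0 - n * n),
             c * (1.0 - n / q + n * n) };
}
```

A quick sanity check for a low-pass design of this kind is that the gain at 0 Hz, (b0 + b1 + b2) / (1 + a1 + a2), is 1, while the gain at the Nyquist frequency, which is proportional to b0 - b1 + b2, is 0.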

In the prepare() function, we set the sample rate from the ProcessSpec object, set the IIR filter coefficients for the default case of a low pass filter and prepare the filter using the prepare() function with information on the processing context:

void prepare (const dsp::ProcessSpec& spec)
{
    sampleRate = spec.sampleRate;

    iir.state = dsp::IIR::Coefficients<float>::makeLowPass (sampleRate, 440.0);
    iir.prepare (spec);
}

Processing the audio file with the filter is straightforward: in the process() function, we call process() on the filter with a context in which a single block is used for both the input and the output:

void process (const dsp::ProcessContextReplacing<float>& context)
{
    iir.process (context);
}

Finally, in the reset() function, we reset the filter by calling reset() on it:

void reset()
{
    iir.reset();
}
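For readers who want a picture of what happens behind process() and reset(), the following is a minimal scalar sketch of a second-order filter in transposed direct form II that works in place on a block of samples. BiquadSketch is a hypothetical type written for this illustration; it is not the source of dsp::IIR::Filter.

```cpp
#include <cstddef>

// Hypothetical second-order filter in transposed direct form II.
struct BiquadSketch
{
    float b0 = 1.0f, b1 = 0.0f, b2 = 0.0f;  // feed-forward coefficients
    float a1 = 0.0f, a2 = 0.0f;             // feedback coefficients (a0 == 1)
    float s1 = 0.0f, s2 = 0.0f;             // filter state

    // Filters the block in place, mirroring a "replacing" context where the
    // same memory is used for input and output.
    void process (float* samples, std::size_t numSamples)
    {
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            const float in  = samples[i];
            const float out = b0 * in + s1;

            s1 = b1 * in - a1 * out + s2;
            s2 = b2 * in - a2 * out;

            samples[i] = out;
        }
    }

    // Clears the state so that no history leaks into the next signal.
    void reset()
    {
        s1 = 0.0f;
        s2 = 0.0f;
    }
};
```

With b0 = 1 and a1 = -0.5, for example, an impulse decays by half on every sample, and calling reset() afterwards makes the filter respond to the next impulse as if it had never processed anything.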

Let's start optimising this IIR filter now.

The SIMD-Optimised IIR Filter

Before optimising the code of our IIR Filter, we need to ensure that SIMD is available on our system. Use the JUCE_USE_SIMD macro to check whether you are developing on a SIMD machine by wrapping the whole filter implementation like so:

#if JUCE_USE_SIMD
//==============================================================================
struct SIMDTutorialFilter
{
};
#endif

Let's first define member variables for the IIR filter as well as AudioBlock and HeapBlock objects to facilitate the processing at the bottom of our SIMDTutorialFilter class:

dsp::IIR::Coefficients<float>::Ptr iirCoefficients;                             // [1]
std::unique_ptr<dsp::IIR::Filter<dsp::SIMDRegister<float>>> iir;

dsp::AudioBlock<dsp::SIMDRegister<float>> interleaved;                          // [2]
dsp::AudioBlock<float> zero;

juce::HeapBlock<char> interleavedBlockData, zeroData;                           // [3]
juce::HeapBlock<const float*> channelPointers { dsp::SIMDRegister<float>::size() };

ChoiceParameter typeParam { { "Low-pass", "High-pass", "Band-pass" }, 1, "Type" };
SliderParameter cutoffParam { { 20.0, 20000.0 }, 0.5, 440.0f, "Cutoff", "Hz" };
SliderParameter qParam { { 0.3, 20.0 }, 0.5, 0.7, "Q" };

std::vector<DSPParameterBase*> parameters { &typeParam, &cutoffParam, &qParam };
double sampleRate = 0.0;

Define the IIR coefficients as a pointer and the filter as a unique pointer using the SIMDRegister class to wrap the sample type [1]. Create an AudioBlock to store interleaved data using the SIMDRegister class to wrap the sample type and another AudioBlock for zero data used later to store the output block [2]. Allocate HeapBlock objects to hold the corresponding AudioBlock objects and some channel pointers with the size of the number of elements in a SIMDRegister vector [3].

In the prepare() function, set the sample rate as before and calculate the default coefficients for the filter [4]. Reset the filter by instantiating a new IIR filter with a SIMDRegister wrapper around the sample type and the coefficients defined earlier [5] as follows:

void prepare (const dsp::ProcessSpec& spec)
{
    sampleRate = spec.sampleRate;                                                // [4]

    iirCoefficients = dsp::IIR::Coefficients<float>::makeLowPass (sampleRate, 440.0f);
    iir.reset (new dsp::IIR::Filter<dsp::SIMDRegister<float>> (iirCoefficients)); // [5]

    interleaved = dsp::AudioBlock<dsp::SIMDRegister<float>> (interleavedBlockData, 1, spec.maximumBlockSize); // [6]
    zero        = dsp::AudioBlock<float> (zeroData, dsp::SIMDRegister<float>::size(), spec.maximumBlockSize);

    zero.clear();

    auto monoSpec = spec;
    monoSpec.numChannels = 1;
    iir->prepare (monoSpec);                                                     // [7]
}

Create the AudioBlock objects for the interleaved data and the zero data by allocating the corresponding HeapBlock objects defined earlier [6]. The interleaved data block only needs one channel, and the maximum block size is retrieved from the ProcessSpec. The zero data block has as many channels as there are elements in a SIMDRegister vector and is cleared before processing. The filter is prepared with a copy of the ProcessSpec in which the number of channels is reduced to one [7], as the multi-channel samples will be interleaved later and processed as a single channel.
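The reason a mono filter is enough can be sketched in plain C++: when the sample type is a vector, every state variable of the filter becomes a vector too, so each lane carries its own channel and its own history. The sketch below uses a hypothetical Float4 type and a deliberately simple one-pole smoother, OnePole, rather than the JUCE IIR filter.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-lane value type; each lane carries one audio channel.
struct Float4
{
    std::array<float, 4> lane;
};

inline Float4 operator+ (const Float4& a, const Float4& b)
{
    Float4 r {};
    for (std::size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

inline Float4 operator* (const Float4& a, float s)
{
    Float4 r {};
    for (std::size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * s;
    return r;
}

// One-pole smoother written once and instantiated for float or Float4.
template <typename SampleType>
struct OnePole
{
    SampleType state {};   // previous output, zero-initialised (one value per lane)
    float coeff = 0.5f;    // the same coefficient is shared by all lanes

    SampleType processSample (SampleType in)
    {
        state = (in * (1.0f - coeff)) + (state * coeff);
        return state;
    }
};
```

Running one OnePole<Float4> over frames of four samples gives the same output per lane as running four separate OnePole<float> instances, one per channel, which is why the SIMD filter is prepared as a single channel.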

Finally, in the process() function we will interleave the samples for optimised processing as follows:

void process (const dsp::ProcessContextReplacing<float>& context)
{
    jassert (context.getInputBlock().getNumSamples()  == context.getOutputBlock().getNumSamples());  // [8]
    jassert (context.getInputBlock().getNumChannels() == context.getOutputBlock().getNumChannels());

    auto& input  = context.getInputBlock();                                      // [9]
    auto& output = context.getOutputBlock();
    auto n = input.getNumSamples();
    auto* inout = channelPointers.getData();

    for (size_t ch = 0; ch < dsp::SIMDRegister<float>::size(); ++ch)             // [10]
        inout[ch] = (ch < input.getNumChannels() ? const_cast<float*> (input.getChannelPointer (ch)) : zero.getChannelPointer (ch));

    juce::AudioDataConverters::interleaveSamples (inout, reinterpret_cast<float*> (interleaved.getChannelPointer (0)),
                                                  static_cast<int> (n), static_cast<int> (dsp::SIMDRegister<float>::size())); // [11]

    iir->process (dsp::ProcessContextReplacing<dsp::SIMDRegister<float>> (interleaved)); // [12]

    for (size_t ch = 0; ch < input.getNumChannels(); ++ch)                       // [13]
        inout[ch] = output.getChannelPointer (ch);

    juce::AudioDataConverters::deinterleaveSamples (reinterpret_cast<float*> (interleaved.getChannelPointer (0)),
                                                    const_cast<float**> (inout),
                                                    static_cast<int> (n), static_cast<int> (dsp::SIMDRegister<float>::size())); // [14]
}
  • [8]: First, make sure that the number of samples and the number of channels are the same for the input and output blocks.
  • [9]: Next, retrieve the input and output blocks, the number of samples in a block and the data pointed to by the channel pointers HeapBlock.
  • [10]: For every element of a SIMDRegister, check whether its index corresponds to an input channel and, if so, copy that channel's pointer into the channel pointers HeapBlock. Otherwise there is no input channel for that element, so we copy the corresponding zero data channel pointer instead and the element is filled with silence.
  • [11]: Now we interleave all the samples for the different channels by copying from the channel pointers HeapBlock into the interleaved AudioBlock and specifying the number of samples and the number of channels as the SIMDRegister size.
  • [12]: Process the audio with the filter using the interleaved data in a single block context with a SIMDRegister wrapper on the sample type.
  • [13]: Then, for every input channel, copy the output block channel pointer into the corresponding HeapBlock.
  • [14]: Finally, we deinterleave all the samples for the different channels by copying from the interleaved AudioBlock into the channel pointers HeapBlock and specifying the number of samples and the number of channels as the SIMDRegister size.
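The data movement in steps [10], [11] and [14] can be sketched without JUCE. The functions below are hypothetical stand-ins written for this illustration (they are not the AudioDataConverters implementation): unused lanes are pointed at a shared block of zeros, the planar channels are packed into frames of one sample per lane, and the frames are unpacked again afterwards.

```cpp
#include <cstddef>
#include <vector>

// Packs planar channels into frames: frame i holds sample i of every channel.
inline void interleave (const float* const* channels, float* dest,
                        std::size_t numSamples, std::size_t numChannels)
{
    for (std::size_t ch = 0; ch < numChannels; ++ch)
        for (std::size_t i = 0; i < numSamples; ++i)
            dest[i * numChannels + ch] = channels[ch][i];
}

// Unpacks frames back into planar channels.
inline void deinterleave (const float* source, float* const* channels,
                          std::size_t numSamples, std::size_t numChannels)
{
    for (std::size_t ch = 0; ch < numChannels; ++ch)
        for (std::size_t i = 0; i < numSamples; ++i)
            channels[ch][i] = source[i * numChannels + ch];
}

// Hypothetical helper: builds a list of exactly `lanes` channel pointers,
// using the shared zero block for every lane without a real input channel.
inline std::vector<const float*> padWithSilence (const std::vector<const float*>& inputs,
                                                 const float* zeros, std::size_t lanes)
{
    std::vector<const float*> padded (lanes, zeros);

    for (std::size_t ch = 0; ch < inputs.size() && ch < lanes; ++ch)
        padded[ch] = inputs[ch];

    return padded;
}
```

For a stereo input and four lanes, each frame then contains the left sample, the right sample and two zeros, and the memory of one frame has the same layout as one four-element vector of floats.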

The reset() function of the filter remains the same in both cases. To complete the optimisation, we just have to update the updateParameters() function to account for the new coefficients pointer as follows:

void updateParameters()
{
    if (sampleRate != 0.0)
    {
        auto cutoff = static_cast<float> (cutoffParam.getCurrentValue());
        auto qVal   = static_cast<float> (qParam.getCurrentValue());

        switch (typeParam.getCurrentSelectedID())
        {
            case 1:  *iirCoefficients = *dsp::IIR::Coefficients<float>::makeLowPass  (sampleRate, cutoff, qVal); break;
            case 2:  *iirCoefficients = *dsp::IIR::Coefficients<float>::makeHighPass (sampleRate, cutoff, qVal); break;
            case 3:  *iirCoefficients = *dsp::IIR::Coefficients<float>::makeBandPass (sampleRate, cutoff, qVal); break;
            default: break;
        }
    }
}
Note
The source code for this modified version of the code can be found in the SIMDRegisterTutorial_02.h file of the demo project.
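The updateParameters() function assigns through the coefficients pointer rather than replacing the pointer itself. The sketch below illustrates the difference using std::shared_ptr as a stand-in for the reference-counted pointer type; Coefficients and FilterSketch are hypothetical, heavily reduced types and not the JUCE classes.

```cpp
#include <memory>
#include <utility>

// Hypothetical coefficient set, reduced to a single gain value.
struct Coefficients
{
    float b0 = 1.0f;
};

// Hypothetical filter that shares ownership of the coefficients it was given.
struct FilterSketch
{
    explicit FilterSketch (std::shared_ptr<Coefficients> c)
        : coefficients (std::move (c))
    {
    }

    float processSample (float in) const
    {
        return coefficients->b0 * in;
    }

    std::shared_ptr<Coefficients> coefficients;
};
```

Writing `*shared = newValues` changes the object that the filter also points to, so the filter picks up the new coefficients immediately. Writing `shared = std::make_shared<Coefficients>()` would only rebind the local pointer and leave the filter using the old object.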

Summary

In this tutorial, we have learnt how to optimise DSP code using the SIMDRegister class. In particular, we have:

  • Learnt the advantages of SIMD instructions.
  • Processed a sound file through an IIR filter.
  • Optimised the IIR filter using the SIMDRegister class.
