Audio Processing for Dummies

In my spare time I’m an emergency services volunteer, and one of the tasks our unit has is to run the radio network and keep track of what’s happening. This can be a pretty stressful job, especially when there’s lots of radio traffic, and it’s not unusual to miss words or entire transmissions.

To help with a personal project that could make the job easier I’d like to implement a basic component of audio processing, the Noise Gate.

The basic idea is to scan through an audio stream and split it into individual clips based on volume, similar to the algorithm mentioned on this Rust Audio discourse thread.

The code written in this article is available on GitHub. Feel free to browse through and steal code or inspiration. It’s also been published as a crate on crates.io.

If you found this useful or spotted a bug, let me know on the blog’s issue tracker!

What Even Is Audio?

We’ve all consumed audio media at some point, but have you ever stopped and wondered how it works under the hood?

At its core, digital audio works by rapidly measuring the amplitude of a signal (each measurement is a “sample”), typically 44,100 times per second (this frequency, 44.1 kHz, is called the Sample Rate). These samples are then encoded using Pulse Code Modulation.

According to Wikipedia:

Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. It is the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications. In a PCM stream, the amplitude of the analog signal is sampled regularly at uniform intervals, and each sample is quantized to the nearest value within a range of digital steps.

If it helps, a sample can be thought of as how far a speaker/microphone’s membrane is deflected at a particular point in time.
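To make the "quantized to the nearest value" part concrete, here is a minimal sketch of turning a normalised floating-point sample into a 16-bit integer. The quantize() helper is hypothetical (it is not part of this project or any crate) and assumes the input lies in the range -1.0 to 1.0, clamping anything outside it.

```rust
/// Quantise a normalised sample in the range [-1.0, 1.0] to a 16-bit integer.
///
/// Hypothetical helper for illustration only; out-of-range input is clamped.
fn quantize(sample: f32) -> i16 {
    let clamped = sample.clamp(-1.0, 1.0);
    // Scale to the i16 range and round to the nearest representable step.
    (clamped * f32::from(i16::MAX)).round() as i16
}
```

Real converters often add refinements such as dithering, which this sketch deliberately leaves out.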

It’s not uncommon to record multiple audio tracks at a time, for example imagine multiple microphones were used to provide a sense of direction/perspective (see Sound Localisation for more). These multiple tracks are usually referred to as Channels.

TL;DR: In Rust lingo, you can think of an audio stream as:

type AudioStream = Vec<Frame>;
type Frame = [Sample; N]; // where `N` is the number of channels in the stream
type Sample = i16 | f32; // pseudo-code: a sample is usually either an i16 or an f32
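The aliases above are pseudo-code, so here is a small compilable sketch of the same idea. It assumes the samples arrive interleaved (channel 0, channel 1, channel 0, ...) as i16 values, and frames_from_interleaved() is a made-up name used only for this illustration.

```rust
/// Group interleaved samples into frames of `N` channels.
///
/// Hypothetical helper: any trailing samples that don't fill a whole frame
/// are dropped.
fn frames_from_interleaved<const N: usize>(samples: &[i16]) -> Vec<[i16; N]> {
    assert!(N > 0, "a frame needs at least one channel");

    samples
        .chunks_exact(N)
        .map(|chunk| {
            let mut frame = [0_i16; N];
            frame.copy_from_slice(chunk);
            frame
        })
        .collect()
}
```

For a stereo stream, frames_from_interleaved::<2>(&samples) would give a Vec<[i16; 2]> where each element holds the left and right sample for one instant in time.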

The audio formats you are used to (MP3, WAV, OGG) are just different ways to store an AudioStream on disk, along with some metadata describing the audio (artist, year, etc.), typically using tricks like compression or Delta Encoding to make the resulting file as small as possible.
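As a rough illustration of Delta Encoding, the sketch below stores the difference between consecutive samples instead of the samples themselves. Neighbouring audio samples tend to be close together, so the differences are usually small numbers that compress well. The function names are hypothetical and this is not how any particular file format does it.

```rust
/// Replace each sample with its difference from the previous sample.
/// The first sample is stored relative to zero.
fn delta_encode(samples: &[i16]) -> Vec<i16> {
    let mut previous = 0_i16;
    samples
        .iter()
        .map(|&sample| {
            // Wrapping arithmetic keeps extreme jumps reversible.
            let delta = sample.wrapping_sub(previous);
            previous = sample;
            delta
        })
        .collect()
}

/// Undo `delta_encode()` by accumulating the differences.
fn delta_decode(deltas: &[i16]) -> Vec<i16> {
    let mut previous = 0_i16;
    deltas
        .iter()
        .map(|&delta| {
            previous = previous.wrapping_add(delta);
            previous
        })
        .collect()
}
```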

If you’re wondering why compression is important, these are the numbers for a simple uncompressed audio stream with:

  • 30 seconds of audio
  • 44.1 kHz sample rate
  • 2 channels (e.g. left and right speaker)
  • bit depth of 16 (i.e. the samples are i16)

    sizeof(Sample) = 2 bytes
    sizeof(Frame) = 2 * sizeof(Sample) = 4 bytes
    sizeof(1 second) = sizeof(Frame) * 44100 = 176400 bytes
    full clip = 30 * sizeof(1 second) = 5292000 bytes = 5.3 MB

… That’s a lot of data!
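The same arithmetic works for any uncompressed PCM stream. As a sketch (pcm_size_in_bytes() is a hypothetical helper, and it assumes the bit depth is a multiple of 8):

```rust
/// Size of an uncompressed PCM stream, ignoring any file headers.
fn pcm_size_in_bytes(
    seconds: u64,
    sample_rate: u64,
    channels: u64,
    bits_per_sample: u64,
) -> u64 {
    let bytes_per_sample = bits_per_sample / 8;
    let bytes_per_frame = bytes_per_sample * channels;
    let bytes_per_second = bytes_per_frame * sample_rate;

    bytes_per_second * seconds
}
```

Calling pcm_size_in_bytes(30, 44_100, 2, 16) reproduces the 5,292,000 bytes calculated above.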

Finding Sample Data

If we want to implement a noise gate we’re going to need some sample clips to test it on.

I’ve found the Air Traffic Controller recordings from LiveATC.net are reasonably similar to my target, with the added bonus that they’re publicly available.

One example is the “a turtle of an issue” clip that we download below.

Our end goal is to create a library that can break audio streams up into chunks based on volume without caring where the audio originally came from (MP3 file, microphone, another function, etc.). We’ll start by using the WAV format because it’s simple and a really good crate (hound) already exists for working with WAV files.

You can download the sample clip and convert it to WAV using ffmpeg:

$ mkdir -p data
$ curl "https://forums.liveatc.net/index.php?action=dlattach;topic=15455.0;attach=10441" > a-turtle-of-an-issue.mp3
$ ffmpeg -i a-turtle-of-an-issue.mp3 -ac 1 data/a-turtle-of-an-issue.wav

Implementing the Noise Gate Algorithm

For now, our Noise Gate will have two knobs for tweaking its behaviour:

  • open_threshold - the (absolute) noise value above which the gate should open
  • release_time - how long to hold the gate open after dropping below the open_threshold. This will manifest itself as the gate being in a sort of half-open state for the next release_time samples, where new samples above the open_threshold will re-open the gate.
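Because release_time is measured in samples rather than seconds, the value you want depends on the stream's sample rate. A minimal sketch of the conversion (the function name is made up for this example):

```rust
/// Convert a release time in seconds to a number of samples.
fn release_time_in_samples(seconds: f32, sample_rate: u32) -> usize {
    (seconds * sample_rate as f32).round() as usize
}
```

For example, holding the gate open for 2 seconds on a 44.1 kHz stream means a release_time of 88,200 samples.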

The awesome thing about this algorithm is that it can be represented using a simple state machine.

// src/lib.rs

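// `Copy` lets us pass the state around by value, while `Debug` and `PartialEq`
// are needed for `assert_eq!()` in the tests.
#[derive(Debug, Copy, Clone, PartialEq)]
enum State {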
    Open,
    Closing { remaining_samples: usize },
    Closed,
}

Our state machine diagram looks roughly like this:

graph TD;
    linkStyle default interpolate basis;
    Open[Open];
    Closing(Closing);
    Open-- below threshold -->Closing;
    Open-- above threshold -->Open;
    Closing-- above threshold -->Open;
    Closing-- remaining_samples = 0 -->Closed;
    Closing-- remaining_samples > 0 -->Closing;
    Closed-- above threshold -->Open;
    Closed-- below threshold -->Closed;

We’ll be using some abstractions, namely Frame and Sample from the sample crate, to make the Noise Gate work with multiple channels and any type of audio input.

Let’s define a helper which will take a Frame of audio input and tell us whether all audio channels are below a certain threshold.

// src/lib.rs

use sample::{Frame, Sample, SignedSample};

fn below_threshold<F>(frame: F, threshold: F::Sample) -> bool
where
    F: Frame,
{
    let threshold = abs(threshold.to_signed_sample());

    frame
        .channels()
        .map(|sample| sample.to_signed_sample())
        .map(abs)
        .all(|sample| sample < threshold)
}

fn abs<S: SignedSample>(sample: S) -> S {
    let zero = S::equilibrium();
    if sample >= zero {
        sample
    } else {
        -sample
    }
}
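One edge case worth knowing about: for integer samples, negating the most negative value (i16::MIN for 16-bit audio) overflows, which panics in debug builds. The sketch below is a simplified, standard-library-only take on the same check for frames of i16 samples. It does not use the sample crate at all, and sidesteps the overflow by comparing unsigned magnitudes.

```rust
/// Are all channels in this frame quieter than the threshold?
///
/// Simplified stand-in that only handles `[i16; N]` frames.
fn below_threshold<const N: usize>(frame: [i16; N], threshold: i16) -> bool {
    // `unsigned_abs()` returns a `u16`, so even `i16::MIN` has a valid
    // magnitude (32768).
    let threshold = threshold.unsigned_abs();

    frame.iter().all(|sample| sample.unsigned_abs() < threshold)
}
```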

The State transitions are done using one big match statement and are almost a direct translation of the previous state machine diagram.

// src/lib.rs

fn next_state<F: Frame>(
    state: State,
    frame: F,
    open_threshold: F::Sample,
    release_time: usize,
) -> State {
    match state {
        State::Open => {
            if below_threshold(frame, open_threshold) {
                State::Closing {
                    remaining_samples: release_time,
                }
            } else {
                State::Open
            }
        }

        State::Closing { remaining_samples } => {
            if below_threshold(frame, open_threshold) {
                if remaining_samples == 0 {
                    State::Closed
                } else {
                    State::Closing {
                        remaining_samples: remaining_samples - 1,
                    }
                }
            } else {
                State::Open
            }
        }

        State::Closed => {
            if below_threshold(frame, open_threshold) {
                State::Closed
            } else {
                State::Open
            }
        }
    }
}

There’s a bit more rightward drift here than I’d like, but the function itself is quite self-contained and readable enough.
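If the rightward drift bothers you, one possible alternative is to match on a (state, quiet) tuple. The sketch below is a simplified, standard-library-only version for mono i16 samples (it does not use the Frame trait), and trace() is a hypothetical helper added purely so we can watch the state change over time.

```rust
#[derive(Debug, Copy, Clone, PartialEq)]
enum State {
    Open,
    Closing { remaining_samples: usize },
    Closed,
}

fn next_state(
    state: State,
    sample: i16,
    open_threshold: i16,
    release_time: usize,
) -> State {
    let quiet = sample.unsigned_abs() < open_threshold.unsigned_abs();

    match (state, quiet) {
        // Any loud sample (re)opens the gate, regardless of the current state.
        (_, false) => State::Open,
        (State::Open, true) => State::Closing {
            remaining_samples: release_time,
        },
        (State::Closing { remaining_samples: 0 }, true) => State::Closed,
        (State::Closing { remaining_samples }, true) => State::Closing {
            remaining_samples: remaining_samples - 1,
        },
        (State::Closed, true) => State::Closed,
    }
}

/// Record the state after each sample, starting from a closed gate.
fn trace(samples: &[i16], open_threshold: i16, release_time: usize) -> Vec<State> {
    let mut state = State::Closed;

    samples
        .iter()
        .map(|&sample| {
            state = next_state(state, sample, open_threshold, release_time);
            state
        })
        .collect()
}
```

Whether this reads better than the nested if statements is a matter of taste. The transitions are the same either way.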

That said, as a sanity check it’s a good idea to write some tests exercising each state machine transition.

// src/lib.rs

#[cfg(test)]
mod tests {
    use super::*;

    const OPEN_THRESHOLD: i16 = 100;
    const RELEASE_TIME: usize = 5;

    test_state_transition!(open_to_open: State::Open, 101 => State::Open);
    test_state_transition!(open_to_closing: State::Open, 40 => State::Closing { remaining_samples: RELEASE_TIME });
    test_state_transition!(closing_to_closed: State::Closing { remaining_samples: 0 }, 40 => State::Closed);
    test_state_transition!(closing_to_closing: State::Closing { remaining_samples: 1 }, 40 => State::Closing { remaining_samples: 0 });
    test_state_transition!(reopen_when_closing: State::Closing { remaining_samples: 1 }, 101 => State::Open);
    test_state_transition!(closed_to_closed: State::Closed, 40 => State::Closed);
    test_state_transition!(closed_to_open: State::Closed, 101 => State::Open);
}

When writing these sorts of tests you’ll probably want to minimise boilerplate by pulling the testing code out into a macro. That way you just need to write the case being tested, the inputs, and the expected outputs, and the macro will do the rest.

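This is the definition for test_state_transition!(). Because macro_rules! macros must be defined before they are used, it needs to appear above the invocations inside the tests module:

macro_rules! test_state_transition {
    // Matches `name: starting_state, sample => expected_state`.
    ($name:ident: $from:expr, $sample:expr => $expected:expr) => {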
        #[test]
        fn $name() {
            let start: State = $from;
            let expected: State = $expected;
            let frame: [i16; 1] = [$sample];

            let got = next_state(start, frame, OPEN_THRESHOLD, RELEASE_TIME);

            assert_eq!(got, expected);
        }
    };
}

To implement the Noise Gate, we’ll wrap our state and configuration into a single NoiseGate struct.

// src/lib.rs

pub struct NoiseGate<S> {
    /// The volume level at which the gate will open (begin recording).
    pub open_threshold: S,
    /// The amount of time (in samples) the gate takes to go from open to fully
    /// closed.
    pub release_time: usize,
    state: State,
}

impl<S> NoiseGate<S> {
    /// Create a new [`NoiseGate`].
    pub const fn new(open_threshold: S, release_time: usize) -> Self {
        NoiseGate {
            open_threshold,
            release_time,
            state: State::Closed,
        }
    }

    /// Is the gate currently passing samples through to the [`Sink`]?
    pub fn is_open(&self) -> bool {
        match self.state {
            State::Open | State::Closing { .. } => true,
            State::Closed => false,
        }
    }

    /// Is the gate currently ignoring silence?
    pub fn is_closed(&self) -> bool {
        !self.is_open()
    }
}

We’ll need to declare a Sink trait that can be implemented by consumers of our Noise Gate in the next step.

// src/lib.rs

pub trait Sink<F> {
    /// Add a frame to the current recording, starting a new recording if
    /// necessary.
    fn record(&mut self, frame: F);
    /// Reached the end of the samples, do necessary cleanup (e.g. flush to disk).
    fn end_of_transmission(&mut self);
}
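To get a feel for what implementing Sink looks like, here is a sketch of an in-memory implementation that collects each transmission into its own Vec. The trait is repeated so the snippet compiles on its own, and ClipCollector is a hypothetical type that is not part of the crate.

```rust
trait Sink<F> {
    /// Add a frame to the current recording.
    fn record(&mut self, frame: F);
    /// The current recording has finished.
    fn end_of_transmission(&mut self);
}

/// Collects every transmission into a separate clip.
struct ClipCollector<F> {
    clips: Vec<Vec<F>>,
    current: Vec<F>,
}

impl<F> ClipCollector<F> {
    fn new() -> Self {
        ClipCollector {
            clips: Vec::new(),
            current: Vec::new(),
        }
    }
}

impl<F> Sink<F> for ClipCollector<F> {
    fn record(&mut self, frame: F) {
        self.current.push(frame);
    }

    fn end_of_transmission(&mut self) {
        // Skip empty clips so a stray call doesn't create one.
        if !self.current.is_empty() {
            self.clips.push(std::mem::take(&mut self.current));
        }
    }
}
```

Something along these lines can be handy in tests, where writing clips to disk would just get in the way.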

Processing frames is just a case of iterating over each frame, updating the state, and checking whether we need to pass the frame through to the Sink or detect an end_of_transmission.

// src/lib.rs

impl<S: Sample> NoiseGate<S> {
    pub fn process_frames<K, F>(&mut self, frames: &[F], sink: &mut K)
    where
        F: Frame<Sample = S>,
        K: Sink<F>,
    {
        for &frame in frames {
            let previously_open = self.is_open();

            self.state = next_state(self.state, frame, self.open_threshold, self.release_time);

            if self.is_open() {
                sink.record(frame);
            } else if previously_open {
                // the gate was previously open and has just closed
                sink.end_of_transmission();
            }
        }
    }
}

Measuring Performance

If we want to use the NoiseGate in realtime applications we’ll need to make sure it can handle typical sample rates.

I don’t expect our algorithm to add much in terms of a performance overhead, but it’s always a good idea to check.

The gold standard for benchmarking in Rust is criterion, so let’s add that as a dev dependency.

# Cargo.toml

[dev-dependencies]
criterion = "0.3"

[[bench]]
name = "throughput"
harness = false

We’ll need a Sink implementation which will add as little overhead as possible without being completely optimised out by the compiler.

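// benches/throughput.rs

use criterion::{
    criterion_group, criterion_main, measurement::WallTime, BenchmarkGroup,
    Criterion, Throughput,
};
use hound::WavReader;
use noise_gate::{NoiseGate, Sink};
use sample::{Frame, FromSample, Sample, ToSample};
use std::{fs, path::Path};

// `Default` lets the benchmark create a zeroed counter with `Counter::default()`.
#[derive(Debug, Default)]
struct Counter {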
    samples: usize,
    chunks: usize,
}

impl<F> Sink<F> for Counter {
    fn record(&mut self, _: F) {
        self.samples += criterion::black_box(1);
    }

    fn end_of_transmission(&mut self) {
        self.chunks += criterion::black_box(1);
    }
}

We’ve already downloaded a handful of example WAV files to the data/ directory, so we can register a new benchmark group (a group of related benchmarks which should be graphed together) and register a benchmark for every WAV file in the data/ directory.

// benches/throughput.rs

const DATA_DIR: &str = concat!(env!("CARGO_MANIFEST_DIR"), "/data/");

fn bench_throughput(c: &mut Criterion) {
    let mut group = c.benchmark_group("throughput");

    for entry in fs::read_dir(DATA_DIR).unwrap() {
        let entry = entry.unwrap();
        let path = entry.path();

        if path.is_file() {
            let name = path.file_stem().unwrap().to_str().unwrap();
            add_benchmark(&mut group, name, &path);
        }
    }

    // Tell criterion we're done so it can generate the group's summary.
    group.finish();
}

The setup work for each WAV file benchmark is non-trivial, so we’ve pulled it out into its own function. To set things up we’ll use hound to read the entire audio clip into a Vec<[i16; 1]> in memory and guess a reasonable release_time and noise_threshold.

Then it’s just a case of telling the BenchmarkGroup how many samples we’re working with (throughput) and processing the frames.

// benches/throughput.rs

fn add_benchmark(
    group: &mut BenchmarkGroup<WallTime>,
    name: &str,
    path: &Path,
) {
    let reader = WavReader::open(path).unwrap();

    let desc = reader.spec();
    assert_eq!(desc.channels, 1, "We've hard-coded frames to be [i16; 1]");
    let release_time = 2 * desc.sample_rate as usize;

    let samples = reader
        .into_samples::<i16>()
        .map(|s| [s.unwrap()])
        .collect::<Vec<_>>();

    let noise_threshold = average(&samples);

    group
        .throughput(Throughput::Elements(samples.len() as u64))
        .bench_function(name, |b| {
            b.iter(|| {
                let mut counter = Counter::default();
                let mut gate = NoiseGate::new(noise_threshold, release_time);
                gate.process_frames(&samples, &mut counter);
            });
        });
}

/// Average the magnitude of every channel in every frame to get a rough
/// noise threshold.
fn average<F>(samples: &[F]) -> F::Sample
where
    F: Frame,
    F::Sample: FromSample<f32>,
    F::Sample: ToSample<f32>,
{
    let mut sum = 0.0_f32;
    let mut count = 0_usize;

    for frame in samples {
        for s in frame.channels() {
            // Converting to f32 normalises the sample to roughly [-1.0, 1.0),
            // so we average magnitudes and must not round the result.
            sum += s.to_sample::<f32>().abs();
            count += 1;
        }
    }

    (sum / count as f32).to_sample()
}

Finally, we need to invoke a couple of macros to register the "throughput" benchmark group and create a main function (remember, when declaring the [[bench]] table we set harness = false, which tells Cargo not to use the default test harness that would otherwise provide main() for us).

// benches/throughput.rs

criterion_group!(benches, bench_throughput);
criterion_main!(benches);

These are the WAV files I’ve downloaded to the data/ directory:

$ ls -l data 
.rw-r--r-- 1.6M michael 27 Oct 21:21 a-turtle-of-an-issue.wav
.rw-r--r-- 4.2M michael 27 Oct 21:17 KBDL-B17-Tribute-20191005.wav
.rw-r--r-- 7.6M michael 27 Oct 21:17 N11379_KSCK.wav
.rw-r--r--  12M michael 27 Oct 21:26 tornado-warning-ground.wav
$ file data/*
data/a-turtle-of-an-issue.wav:      RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 22050 Hz
data/KBDL-B17-Tribute-20191005.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 24000 Hz
data/N11379_KSCK.wav:               RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 22050 Hz
data/tornado-warning-ground.wav:    RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz

Now let’s run the benchmarks.

$ cargo bench
     Running target/release/deps/throughput-dbdb305fc8a0e002
Benchmarking throughput/a-turtle-of-an-issue: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 37.5s or reduce sample count to 20
throughput/a-turtle-of-an-issue                                                                             
                        time:   [7.0509 ms 7.1617 ms 7.2892 ms]
                        thrpt:  [113.14 Melem/s 115.15 Melem/s 116.96 Melem/s]
                 change:
                        time:   [-6.5194% -3.2691% -0.1646%] (p = 0.07 > 0.05)
                        thrpt:  [+0.1648% +3.3796% +6.9740%]
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe

...

If you’ve got gnuplot installed, this also generates a report under target/criterion.

On my machine the report says our NoiseGate can process 103.47 million samples per second. This is more than 2000 times faster than a 44.1 kHz stream needs, so it gives us hope that the algorithm won’t add any unnecessary overhead… Of course that just moves the bottleneck from NoiseGate to the caller’s Sink implementation.
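For the curious, the "times faster than we need" figure is just the measured throughput divided by the sample rate of the stream being processed. A tiny sketch, assuming a mono stream (realtime_factor() is a made-up helper):

```rust
/// How many times faster than realtime can we process a mono stream?
fn realtime_factor(samples_per_second: f64, sample_rate: u32) -> f64 {
    samples_per_second / f64::from(sample_rate)
}
```

Plugging in the numbers above, realtime_factor(103.47e6, 44_100) comes out at a little over 2,300.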

Experimenting With Our Sample Data

We’re now at the point where we have a fully implemented Noise Gate. Let’s create an example program for splitting WAV files and see what happens when we point it at our sample data!

Even though it’s an example, we should probably implement proper command-line argument handling to make experimentation easier. By far the easiest way to do this is with the structopt crate.

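// examples/wav-splitter.rs

use hound::{WavReader, WavSpec, WavWriter};
use noise_gate::NoiseGate;
use sample::Frame;
use std::{
    error::Error,
    fs::{self, File},
    io::BufWriter,
    path::PathBuf,
};
use structopt::StructOpt;

#[derive(Debug, Clone, StructOpt)]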
pub struct Args {
    #[structopt(help = "The WAV file to read")]
    pub input_file: PathBuf,
    #[structopt(short = "t", long = "threshold", help = "The noise threshold")]
    pub noise_threshold: i16,
    #[structopt(
        short = "r",
        long = "release-time",
        help = "The release time in seconds",
        default_value = "0.25"
    )]
    pub release_time: f32,
    #[structopt(
        short = "o",
        long = "output-dir",
        help = "Where to write the split files",
        default_value = "."
    )]
    pub output_dir: PathBuf,
    #[structopt(
        short = "p",
        long = "prefix",
        help = "A prefix to insert before each clip",
        default_value = "clip_"
    )]
    pub prefix: String,
}

Now we’ll need a Sink type. The general idea is every time the record() method is called we’ll write another frame to a cached hound::WavWriter. If the WavWriter doesn’t exist we’ll need to create a new one which writes to a file named like output_dir/clip_1.wav. An end_of_transmission() tells us to finalize() the WavWriter and remove it from our cache.

// examples/wav-splitter.rs

pub struct Sink {
    output_dir: PathBuf,
    clip_number: usize,
    prefix: String,
    spec: WavSpec,
    writer: Option<WavWriter<BufWriter<File>>>,
}

impl Sink {
    pub fn new(output_dir: PathBuf, prefix: String, spec: WavSpec) -> Self {
        Sink {
            output_dir,
            prefix,
            spec,
            clip_number: 0,
            writer: None,
        }
    }

    fn get_writer(&mut self) -> &mut WavWriter<BufWriter<File>> {
        if self.writer.is_none() {
            let filename = self
                .output_dir
                .join(format!("{}{}.wav", self.prefix, self.clip_number));
            self.clip_number += 1;
            self.writer = Some(WavWriter::create(filename, self.spec).unwrap());
        }

        self.writer.as_mut().unwrap()
    }
}

impl<F> noise_gate::Sink<F> for Sink
where
    F: Frame,
    F::Sample: hound::Sample,
{
    fn record(&mut self, frame: F) {
        let writer = self.get_writer();

        for channel in frame.channels() {
            writer.write_sample(channel).unwrap();
        }
    }

    fn end_of_transmission(&mut self) {
        if let Some(writer) = self.writer.take() {
            writer.finalize().unwrap();
        }
    }
}

From there the main function is quite simple. It parses some arguments, reads the WAV file into memory, then throws it at our NoiseGate so the Sink can write the clips to the chosen output directory.

// examples/wav-splitter.rs

fn main() -> Result<(), Box<dyn Error>> {
    let args = Args::from_args();

    let reader = WavReader::open(&args.input_file)?;
    let header = reader.spec();
    let samples = reader
        .into_samples::<i16>()
        .map(|result| result.map(|sample| [sample]))
        .collect::<Result<Vec<_>, _>>()?;

    let release_time = (header.sample_rate as f32 * args.release_time).round();

    fs::create_dir_all(&args.output_dir)?;
    let mut sink = Sink::new(args.output_dir, args.prefix, header);

    let mut gate = NoiseGate::new(args.noise_threshold, release_time as usize);
    gate.process_frames(&samples, &mut sink);

    Ok(())
}

Let’s take this for a test-run.

The original clip is data/N11379_KSCK.wav.

Now let’s split it into pieces with our wav-splitter program. At this point I don’t really know what values of noise_threshold or release_time are acceptable for this audio, but I figure 50 and 0.3s should be usable?

$ ./target/release/examples/wav-splitter -o output --threshold 50 --release-time 0.3 data/N11379_KSCK.wav
$ ls output   
clip_0.wav clip_3.wav clip_6.wav clip_9.wav clip_12.wav clip_15.wav
clip_18.wav clip_21.wav clip_1.wav clip_4.wav clip_7.wav clip_10.wav
clip_13.wav clip_16.wav clip_19.wav clip_22.wav clip_2.wav clip_5.wav
clip_8.wav clip_11.wav clip_14.wav clip_17.wav clip_20.wav

Wow, it actually worked on the first try. Now that’s something you don’t see every day.
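If guessing a threshold feels too arbitrary, one option is to look at a clip’s amplitude statistics before picking a value. The sketch below is a hypothetical helper (not part of the crate) which assumes mono i16 samples that have already been loaded into memory, and reports the peak and the mean absolute amplitude.

```rust
/// Returns `(peak, mean)` of the absolute sample values.
///
/// An empty clip yields `(0, 0.0)`.
fn amplitude_stats(samples: &[i16]) -> (u16, f64) {
    if samples.is_empty() {
        return (0, 0.0);
    }

    let peak = samples
        .iter()
        .map(|sample| sample.unsigned_abs())
        .max()
        .unwrap_or(0);
    // Sum in a u64 so long clips can't overflow.
    let total: u64 = samples
        .iter()
        .map(|sample| u64::from(sample.unsigned_abs()))
        .sum();

    (peak, total as f64 / samples.len() as f64)
}
```

A threshold somewhere between the mean and the peak might be a sensible starting point, although that is only a rule of thumb and the right value will depend on how noisy the recording is.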