Deserializing Binary Data Files in Rust

The other day, someone on the Rust user forums posted a question that really nerd-sniped me. They had data generated by a C++ program and wanted to load it into a Rust program, but when asked what format the data was in, the author didn’t provide something like a JSON schema or Protobuf file. Instead, the only description was the definition for a C struct.

A common method for “serializing” data in C is to create a struct and write its bytes directly into a file. “Deserializing” is then just a case of reading the data out of the file and interpreting it as your type. This technique is actually kinda genius when you think about it. It makes no intermediate copies or heap allocations because the OS’s read() function will literally write your data to its destination, and there are no extra dependencies or complicated serialization frameworks involved.

I’m not going to go into this technique’s drawbacks (of which there are many… there is a reason we use things like JSON and Protocol Buffers nowadays, after all). Instead, let’s just focus on how to read these sorts of files.

I’ll be approaching the problem from the perspective of someone who:

Has a large number of these files and can’t change the format for practical reasons (backwards compatibility, time constraints, etc.)
Will be working in a safe environment with data they generated themselves (i.e. we know the files are in the right format and people won’t be providing malicious input)
Is okay with the tool crashing - nobody will die, your servers won’t be hacked, and if push comes to shove we can always open a hex editor and deserialize the data by hand
Just wants a quick-and-dirty solution to their problem

The code written in this article is available on GitHub. Feel free to browse through and steal code or inspiration.

If you found this useful or spotted a bug in the article, let me know on the blog’s issue tracker!

How is the Data Generated?

To give us a better idea of what we are dealing with, let’s create a C program that writes some data directly to a file.

To start with we’ll define a simplified version of the original post’s spkr struct.

// examples/main.c

#include <stdint.h> // uint16_t
#include <stdio.h>  // FILE, fopen(), fwrite(), fread()

typedef struct
{
    char name[2][20];
    char addr1[40];
    char addr2[40];
    char phone[16];
    uint16_t flags;
} spkr;

There are two largely equivalent ways to write binary data to a file: you could use the POSIX write() function to write directly to a file descriptor, or the fwrite() function from stdio.h in the C standard library.

ssize_t write(int fd, const void *buf, size_t count);
size_t fwrite(const void *restrict ptr, size_t size, size_t nmemb, FILE *restrict stream);

Their definitions are similar, but fwrite() is more portable (i.e. it also works on Windows) so that’s what we’ll use.

If we ignore all error handling, this is how you would write a spkr to some my_speaker.dat file:

// examples/main.c

// Save a `spkr` to a file.
void save(const char *filename, const spkr *speaker)
{
    // "wb" opens the file in binary mode so Windows won't translate newlines
    FILE *f = fopen(filename, "wb");
    fwrite(speaker, sizeof(spkr), 1, f);
    fclose(f);
}

Like I said earlier, this method of saving data is simple.

Loading is equally trivial - create a spkr variable on the stack, use fread() to read one spkr worth of data into the variable from the file, then return our new spkr.

// examples/main.c

// Read a `spkr` from a file.
spkr load(const char *filename)
{
    // "rb" opens the file in binary mode so Windows won't translate newlines
    FILE *f = fopen(filename, "rb");

    spkr speaker = {0};
    fread(&speaker, sizeof(spkr), 1, f);
    fclose(f);

    return speaker;
}

Let’s wrap this up in a command-line program so we can play around with it and generate data for our Rust library to use.

$ clang main.c -o main -Wall -Wpedantic
$ ./main
Usage:
	./main generate <output>	write some dumy data to a file
	./main load <filename>		print the contents of a file

$ ./main generate speaker.dat
$ xxd speaker.dat
00000000: 4a6f 7365 7068 0000 0000 0000 0000 0000  Joseph..........
00000010: 0000 0000 426c 6f67 7300 0000 0000 0000  ....Blogs.......
00000020: 0000 0000 0000 0000 3132 3320 4661 6b65  ........123 Fake
00000030: 2053 7472 6565 7400 0000 0000 0000 0000   Street.........
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 4e65 7720 596f 726b 0000 0000 0000 0000  New York........
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 3230 322d 3535 352d  ........202-555-
00000080: 3031 3137 0000 0000 0faa                 0117......

$ ./main load speaker.dat
Name: Joseph Blogs
Address:
	123 Fake Street
	New York
Phone: 202-555-0117
Flags: 0xAA0F
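One detail worth noticing in the output above: the hex dump ends with the bytes 0f aa, while the program prints 0xAA0F. The file was written on a little-endian machine, so the least significant byte of flags is stored first. As a small Rust sketch of that relationship (the flags_from_file_bytes() name is made up for illustration):

```rust
/// Interpret the last two bytes of speaker.dat as the `flags` field,
/// assuming the file was written on a little-endian machine.
fn flags_from_file_bytes(bytes: [u8; 2]) -> u16 {
    u16::from_le_bytes(bytes)
}

fn main() {
    // The bytes exactly as they appear at the end of the hex dump.
    let flags = flags_from_file_bytes([0x0f, 0xaa]);
    assert_eq!(flags, 0xAA0F);

    // Reading the same bytes as big-endian gives a different number.
    assert_eq!(u16::from_be_bytes([0x0f, 0xaa]), 0x0FAA);

    println!("Flags: {:#06X}", flags);
}
```

This matters later on, because copying bytes straight into a struct only gives the right number when the reading machine uses the same byte order as the one that wrote the file.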

Okay, now we’ve got some dummy data and a better understanding of what we are working with. Let’s get started on the Rust version.

Make a Speaker Struct

After creating a new binary crate (cargo new --bin deserializing-binary-data-files) the first order of business is creating a Rust equivalent for spkr, and for this application our struct definition must exactly match spkr. If it doesn’t, our fields won’t match up and we’ll get garbage.

For example, if left to its own devices the Rust compiler might decide that it can generate more efficient code by moving the flags field to the front of the struct and shuffling everything else down by 2 bytes. That would mean trying to read the phone field in Rust would pick up the last 14 bytes of phone followed by the 2 bytes of flags. Not ideal.

Luckily there is a way to explicitly tell the compiler to represent a struct in memory identically to what C would do, #[repr(C)].

// src/main.rs

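#[derive(Debug)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    flags: u16,
}

// `Default` is written by hand because the standard library only implements
// it for arrays of up to 32 elements, so it can't be derived for `[u8; 40]`.
impl Default for Speaker {
    fn default() -> Self {
        Speaker { name: [[0; 20]; 2], addr1: [0; 40], addr2: [0; 40], phone: [0; 16], flags: 0 }
    }
}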

For this, we assume that char is a single 8-bit byte. The C standard guarantees that sizeof(char) is 1, but it deliberately leaves the number of bits in a byte (CHAR_BIT) implementation-defined. With 50 years of C programs in the wild that assume 8-bit bytes, it’s probably safe to treat this as a de-facto standard.
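One way to convince ourselves that the layout really does match is to ask the compiler for the struct’s size and field offsets, then compare them with the hex dump from earlier. This is a minimal sketch rather than anything from the original code: the offsets() helper is a name I’ve made up, and the expected numbers assume 8-bit bytes and a 2-byte-aligned u16.

```rust
use std::mem;

#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    flags: u16,
}

/// Hypothetical helper: the byte offsets of (name, addr1, addr2, phone, flags)
/// measured from the start of a `Speaker`.
fn offsets() -> [usize; 5] {
    let s = Speaker {
        name: [[0; 20]; 2],
        addr1: [0; 40],
        addr2: [0; 40],
        phone: [0; 16],
        flags: 0,
    };
    let base = &s as *const Speaker as usize;

    [
        &s.name as *const _ as usize - base,
        &s.addr1 as *const _ as usize - base,
        &s.addr2 as *const _ as usize - base,
        &s.phone as *const _ as usize - base,
        &s.flags as *const _ as usize - base,
    ]
}

fn main() {
    // 40 + 40 + 40 + 16 + 2 = 138 bytes, the same length as speaker.dat.
    assert_eq!(mem::size_of::<Speaker>(), 138);
    // Each offset lines up with where that field starts in the hex dump
    // (0x00, 0x28, 0x50, 0x78 and 0x88).
    assert_eq!(offsets(), [0, 40, 80, 120, 136]);
    println!("layout matches the C struct");
}
```

If either assertion failed, we would know the Rust definition had drifted from the C one before reading a single file.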

Blindly Assuming the File Contains a Speaker

Next we need to do the actual “deserializing”.

The fread() function in C could read data directly into our spkr variable because it accepts a void* pointer as the destination, and all pointers implicitly coerce to void*.

Unfortunately for us the corresponding API, the std::io::Read trait, insists that our programs are type-safe and will only write into a byte buffer (&mut [u8]).

Instead, when creating a load() constructor for loading a Speaker from a reader we’ll create a Speaker variable where all the fields are set to a “sane” default so we avoid passing an uninitialized buffer to our reader…

// src/main.rs

use std::io::{Error, Read};

impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        // Create a Speaker where all the fields are set to some sane default
        // (typically all zeroes)
        let mut speaker = Speaker::default();
        ...
    }
}

… then tell the compiler to treat our speaker variable as a big byte array that we can read data into…

// src/main.rs

use std::mem;

impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        ...

        unsafe {
            // Get a slice which treats the `speaker` variable as a byte array
            let buffer: &mut [u8] = std::slice::from_raw_parts_mut(
                // a pointer to the first byte of `speaker`
                (&mut speaker as *mut Speaker).cast::<u8>(),
                mem::size_of::<Speaker>(),
            );

            // Read exactly that many bytes from the reader
            reader.read_exact(buffer)?;
            ...
        }
    }
}

… then once we know the speaker has been completely filled with data from the reader we just need to return it.

// src/main.rs

impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        ...

        unsafe {
            ...

            // Our `speaker` has now been updated with data from the reader.
            Ok(speaker)
        }
    }
}

In Rust it’s always a good idea to add a comment above your unsafe blocks noting down your assumptions and why the unsafe block is correct and won’t break memory safety.

// src/main.rs

impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        ...

        // Safety: All the fields in a Speaker are valid for all possible bit
        // combinations.
        unsafe {
            ...
        }
    }
}

The primary reason for this being sound is that a Speaker only contains integers and arrays of integers, and an integer is valid for all possible bit patterns. That means blindly copying bytes from (possibly maliciously crafted) input into a Speaker can’t introduce memory safety issues into our Speaker::load() function. Sure it could give us data that doesn’t make sense, but we’d still have valid byte arrays in our byte array fields and a valid u16 in our flags field.

This assumption isn’t correct in general. It’s one of the reasons the docs for std::mem::transmute() are full of warnings saying there are plenty of better ways to do things and transmute() should be a tool of last resort.

For example, the Speaker struct couldn’t contain a string reference (&str) because references must always be non-null, aligned, point at valid instances of that type, and not outlive the things they refer to. Of the 2⁶⁴ different possible bit patterns on a 64-bit machine, only a small handful may be correct.

As such, it wouldn’t be sound to give our Speaker a &str or Vec<T> field unless we had some outside information which makes extra guarantees. If the caller could provide those guarantees then we would be able to mark the entire Speaker::load() function as unsafe and make memory safety their problem.
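To see the pieces from this section working together, here is a self-contained sketch of the whole load() function. It follows the same steps as above, with two choices of my own: the zeroed() helper is a hypothetical stand-in for the “sane default” constructor, and the test input is built by hand rather than read from speaker.dat.

```rust
use std::io::{Cursor, Error, Read};
use std::mem;

#[allow(dead_code)]
#[derive(Debug)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    flags: u16,
}

impl Speaker {
    /// Hypothetical helper: an all-zero `Speaker` to use as the destination.
    fn zeroed() -> Self {
        Speaker {
            name: [[0; 20]; 2],
            addr1: [0; 40],
            addr2: [0; 40],
            phone: [0; 16],
            flags: 0,
        }
    }

    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        let mut speaker = Speaker::zeroed();

        // Safety: every field is an integer or an array of integers, so all
        // bit patterns are valid. The slice covers exactly the memory owned
        // by `speaker`, and this layout contains no padding bytes.
        unsafe {
            let buffer: &mut [u8] = std::slice::from_raw_parts_mut(
                (&mut speaker as *mut Speaker).cast::<u8>(),
                mem::size_of::<Speaker>(),
            );
            reader.read_exact(buffer)?;
        }

        Ok(speaker)
    }
}

fn main() {
    // 138 bytes of hand-made input: a first name and the two flag bytes.
    let mut data = [0u8; 138];
    data[..6].copy_from_slice(b"Joseph");
    data[136] = 0x0f;
    data[137] = 0xaa;

    let speaker = Speaker::load(Cursor::new(&data[..])).unwrap();
    assert_eq!(&speaker.name[0][..6], b"Joseph");
    // The bytes are copied verbatim, so `flags` uses the host's byte order.
    assert_eq!(speaker.flags, u16::from_ne_bytes([0x0f, 0xaa]));

    // Input that is too short is reported as an error instead of being
    // silently half-read.
    assert!(Speaker::load(Cursor::new(&data[..100])).is_err());
}
```

Note that read_exact() is what protects us from truncated files: it either fills the whole buffer or returns an error.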

Making Things Convenient

Now, C’s printf() is more than happy to take a pointer to some bytes and interpret them as text, but if we tried to print Speaker’s fields at the moment we’d get something useless like [0x4a, 0x6f, 0x73, 0x65, 0x70, 0x68, 0x00, 0x00, 0x00, 0x00] instead of the "Joseph" that we expect.

This is because Rust treats all arrays of bytes as arrays of bytes and doesn’t attach any extra meaning to them. If we want to treat them like strings then we’ll need to use std::str::from_utf8() to convert our byte arrays to a UTF-8 &str.

From the way our C is implemented, if a string field isn’t completely filled with text it will be padded out with zeroes. This is particularly helpful for C programs because it means our fields always have trailing null bytes (as long as we don’t overflow).

In Rust, we don’t want to include these trailing null bytes in the final output (null is a valid UTF-8 character, so they wouldn’t be stripped for us). Let’s make a helper function that takes a byte array, cuts it off at the first null byte, then tries to interpret what’s left as UTF-8.

// src/main.rs

fn c_string(bytes: &[u8]) -> Option<&str> {
    let bytes_without_null = match bytes.iter().position(|&b| b == 0) {
        Some(ix) => &bytes[..ix],
        None => bytes,
    };

    std::str::from_utf8(bytes_without_null).ok()
}
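To get a feel for how this helper behaves, here is a small sketch that runs it against the kinds of input we might find in a field. The byte strings are examples I’ve made up, apart from the zero-padded "Joseph".

```rust
fn c_string(bytes: &[u8]) -> Option<&str> {
    let bytes_without_null = match bytes.iter().position(|&b| b == 0) {
        Some(ix) => &bytes[..ix],
        None => bytes,
    };

    std::str::from_utf8(bytes_without_null).ok()
}

fn main() {
    // A zero-padded field: everything from the first null onwards is dropped.
    assert_eq!(c_string(b"Joseph\0\0\0\0"), Some("Joseph"));

    // A completely full field has no null terminator, so all of it is used.
    assert_eq!(c_string(b"New York"), Some("New York"));

    // A field that was never filled in is just an empty string.
    assert_eq!(c_string(&[0; 16]), Some(""));

    // Bytes that aren't valid UTF-8 can't become a &str, so we get None.
    assert_eq!(c_string(&[0xff, 0xfe, 0x00]), None);
}
```

The second case is the one C code tends to get wrong: a field with no terminator would make printf() run off the end of the array, whereas the slice here always stops at the field boundary.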

With our c_string() function in hand we are now ready to give Speaker some getter methods.

// src/main.rs

impl Speaker {
    pub fn address_line_1(&self) -> Option<&str> { c_string(&self.addr1) }

    pub fn address_line_2(&self) -> Option<&str> { c_string(&self.addr2) }

    pub fn phone_number(&self) -> Option<&str> { c_string(&self.phone) }
}

The name field is a bit more interesting because it contains two strings in an array, presumably the first and last names.

To preserve the semantics that a name contains two strings, we can use a mixture of destructuring and the ? operator to create a getter that returns the first and last names only when they are both valid.

// src/main.rs

impl Speaker {

    pub fn name(&self) -> Option<(&str, &str)> {
        let [first, last] = &self.name;
        let first = c_string(first)?;
        let last = c_string(last)?;

        Some((first, last))
    }
}

Tests

We’ve now got a Speaker::load() constructor and some convenient getter methods for interpreting the fields so I figure it’s time to write some tests.

First, we’ll take the bytes from the speaker.dat generated earlier and save them as a byte literal. Conveniently, the xxd tool has a flag which prints the input in a form that can be included in source code.

$ cat speaker.dat | xxd -i
  0x4a, 0x6f, 0x73, 0x65, 0x70, 0x68, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x42, 0x6c, 0x6f, 0x67,
  0x73, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x31, 0x32, 0x33, 0x20, 0x46, 0x61, 0x6b, 0x65,
  0x20, 0x53, 0x74, 0x72, 0x65, 0x65, 0x74, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x4e, 0x65, 0x77, 0x20,
  0x59, 0x6f, 0x72, 0x6b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x32, 0x30, 0x32, 0x2d, 0x35, 0x35, 0x35, 0x2d, 0x30, 0x31, 0x31, 0x37,
  0x00, 0x00, 0x00, 0x00, 0x0f, 0xaa

All we need to do is create a new test module and paste it into a SPEAKER_DAT constant.

// src/main.rs

#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;

    const SPEAKER_DAT: [u8; mem::size_of::<Speaker>()] = [ ... ];
}

Now we can write a test which deserializes our Joseph Blogs speaker’s information and compares it to what the C code generated.

// src/main.rs

#[cfg(test)]
mod tests {
    ...

    #[test]
    fn deserialize_joe_bloggs() {
        let reader = Cursor::new(&SPEAKER_DAT);

        let got = Speaker::load(reader).unwrap();

        assert_eq!(got.name().unwrap(), ("Joseph", "Blogs"));
        assert_eq!(got.address_line_1().unwrap(), "123 Fake Street");
        assert_eq!(got.address_line_2().unwrap(), "New York");
        assert_eq!(got.phone_number().unwrap(), "202-555-0117");
        assert_eq!(got.flags, 0xAA0F);
    }
}

Our test passes, of course.

$ cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.03s
     Running unittests (/home/michael/Documents/deserializing-binary-data-files/target/debug/deps/reading_data_files-f34629f4e016d364)

running 1 test
test tests::deserialize_joe_bloggs ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s

A Note on Packing

There is this concept called “packing” which is quite important in telling the compiler how to lay out our Speaker and spkr structs in memory. Because we are interpreting a bunch of bytes as a Speaker, it is very important that Speaker and spkr are laid out identically, otherwise we’ll get garbage.

You see, processors really like it when things are lined up in memory correctly and they often need to do extra work when they aren’t lined up (which kills performance) or will just error out altogether (which kills your program). For example, a u8 can be placed at addresses that are multiples of 1 byte (i.e. anywhere), a u32 can be placed at multiples of 4 bytes, and so on.

To deal with this alignment issue, compilers will insert some unused bytes (often called “padding”) between fields to make sure they line up correctly. The #[repr(C)] attribute tells Rust to do this using the same rules as C, keeping the fields in declaration order. If we want to tell the compiler not to insert this spacing we can use #[repr(packed)], which says “this struct’s bytes must be packed together as closely as possible”.

Most binary formats don’t want these padding bytes because files should be as compact as possible, so it’s not uncommon to see a #[repr(packed)] (or its C cousin, __attribute__((packed)), which GCC and Clang also accept as __attribute__((__packed__))) next to struct definitions that use this direct reading/writing method of serializing data.

We get lucky here because the flags field is at offset 136 (which is a multiple of 2 bytes) so we didn’t need to do anything special, but it’s still good to know. It may also help explain why you’ll often see random fields named spare or unused in a C struct definition.
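To make the effect of padding concrete, here is a sketch comparing two hypothetical structs that aren’t part of the speaker format. They have identical fields and differ only in their repr attribute. The expected sizes assume a platform where u32 is 4-byte aligned, which is the common case.

```rust
use std::mem;

// Laid out the way a C compiler would: padding goes after `tag`.
#[allow(dead_code)]
#[repr(C)]
struct Padded {
    tag: u8,
    value: u32,
}

// Same fields, but with all padding removed.
#[allow(dead_code)]
#[repr(C, packed)]
struct Packed {
    tag: u8,
    value: u32,
}

fn main() {
    // 1 byte of `tag`, 3 bytes of padding, then 4 bytes of `value`.
    assert_eq!(mem::size_of::<Padded>(), 8);
    assert_eq!(mem::align_of::<Padded>(), 4);

    // 1 byte of `tag` followed immediately by 4 bytes of `value`.
    assert_eq!(mem::size_of::<Packed>(), 5);
    assert_eq!(mem::align_of::<Packed>(), 1);
}
```

A file written from the padded version would be 3 bytes longer per record than one written from the packed version, so reading one as the other shifts every field after tag.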

Conclusions

It’s not something you’ll need to (or want to) use too often, but knowing how to read/write data directly to some file without needing any complicated serialization frameworks might be something you’ll use some day.

That said, there are a lot of better alternatives out there, with most of them allowing you to write code that has similar performance characteristics with no need for unsafe.
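As a rough illustration of what one such alternative can look like using nothing but the standard library, here is a sketch that reads the same format one field at a time. The load_safe() and read_array() names are made up for this example, and it assumes the files were written on a little-endian machine, as the hex dump earlier suggests.

```rust
use std::io::{Cursor, Error, Read};

#[allow(dead_code)]
#[derive(Debug)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    flags: u16,
}

/// Hypothetical helper: read exactly `N` bytes into a fresh array.
fn read_array<const N: usize>(reader: &mut impl Read) -> Result<[u8; N], Error> {
    let mut buffer = [0; N];
    reader.read_exact(&mut buffer)?;
    Ok(buffer)
}

impl Speaker {
    pub fn load_safe(mut reader: impl Read) -> Result<Self, Error> {
        // Field expressions run in the order they are written, so the reads
        // happen in the same order as the fields appear in the file.
        Ok(Speaker {
            name: [read_array(&mut reader)?, read_array(&mut reader)?],
            addr1: read_array(&mut reader)?,
            addr2: read_array(&mut reader)?,
            phone: read_array(&mut reader)?,
            // Decode the byte order explicitly instead of relying on the host.
            flags: u16::from_le_bytes(read_array(&mut reader)?),
        })
    }
}

fn main() {
    let mut data = [0u8; 138];
    data[20..25].copy_from_slice(b"Blogs");
    data[136..].copy_from_slice(&[0x0f, 0xaa]);

    let speaker = Speaker::load_safe(Cursor::new(&data[..])).unwrap();
    assert_eq!(&speaker.name[1][..5], b"Blogs");
    assert_eq!(speaker.flags, 0xAA0F);
}
```

Because nothing here depends on how Speaker is laid out in memory, this version doesn’t need #[repr(C)] and gives the same answer on big-endian machines. The trade-off is that it only handles formats without padding between fields unless you skip those bytes yourself.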

Personally, I would definitely reach for a better tool (probably a parsing library like nom or binread) if this was an application I cared about or if I didn’t have full control over the input. However, for quick-and-dirty tools where your primary goal is to “do whatever C does”, this technique works pretty well.