The other day, someone on the Rust user forums posted a question that really nerd-sniped me. They had data generated by a C++ program and wanted to load it into a Rust program, but when asked what format the data was in, the author didn’t provide something like a JSON schema or Protobuf file. Instead, all anyone got was the definition for a C struct.
A common method for “serializing” data in C is to create a struct and directly
write its bytes into a file. “Deserializing” is then just a case of reading the
data out of the file and interpreting it as your type. This technique is
actually kinda genius when you think about it. It makes no intermediate copies
or heap allocations because the OS’s read() function will literally write your
data to its destination, and there are no extra dependencies or complicated
serialization frameworks involved.
I’m not going to go into this technique’s drawbacks, of which there are many (there is a reason we use things like JSON and Protocol Buffers nowadays, after all). Instead, let’s just focus on how to read these sorts of files.
I’ll be approaching the problem from the perspective of someone who:
- Has a large number of these files and can’t change the format for practical reasons (backwards compatibility, time constraints, etc.)
- Will be working in a safe environment with data they generated themselves (i.e. we know the files are in the right format and people won’t be providing malicious input)
- Is okay with the tool crashing - nobody will die, your servers won’t be hacked, and if push comes to shove we can always open a hex editor and deserialize the data by hand
- Just wants a quick-and-dirty solution to their problem
The code written in this article is available on GitHub. Feel free to browse through and steal code or inspiration.
If you found this useful or spotted a bug in the article, let me know on the blog’s issue tracker!
How is the Data Generated?
To give us a better idea of what we are dealing with, let’s create a C program that writes some data directly to a file.
To start with we’ll define a simplified version of the original post’s spkr
struct.
// examples/main.c
#include <stdint.h> // uint16_t
#include <stdio.h>  // fopen(), fread(), fwrite(), fclose()

typedef struct
{
    char name[2][20];
    char addr1[40];
    char addr2[40];
    char phone[16];
    uint16_t flags;
} spkr;
There are two largely equivalent ways to write binary data to a file. You could
use the POSIX write() function to write directly to a file descriptor, or the
fwrite() function from stdio.h in the C standard library.
ssize_t write(int fd, const void *buf, size_t count);
size_t fwrite(const void *restrict ptr, size_t size, size_t nmemb, FILE *restrict stream);
Their definitions are similar, but fwrite()
is more portable (i.e. it also
works on Windows) so that’s what we’ll use.
If we ignore all error handling, this is how you would write a spkr
to some
my_speaker.dat
file:
// examples/main.c
// Save a `spkr` to a file.
void save(const char *filename, const spkr *speaker)
{
    // Open in binary mode ("wb") so Windows doesn't translate newline bytes.
    FILE *f = fopen(filename, "wb");
    fwrite(speaker, sizeof(spkr), 1, f);
    fclose(f);
}
Like I said earlier, this method of saving data is simple.
Loading is equally trivial - create a spkr variable on the stack, use
fread() to read one spkr worth of data into the variable from the file, then
return our new spkr.
// examples/main.c
// Read a `spkr` from a file.
spkr load(const char *filename)
{
    // Open in binary mode ("rb") to match how the file was written.
    FILE *f = fopen(filename, "rb");
    spkr speaker = {0};
    fread(&speaker, sizeof(spkr), 1, f);
    fclose(f);
    return speaker;
}
Let’s wrap this up in a command-line program so we can play around with it and generate data for our Rust library to use.
$ clang main.c -o main -Wall -Wpedantic
$ ./main
Usage:
./main generate <output> write some dumy data to a file
./main load <filename> print the contents of a file
$ ./main generate speaker.dat
$ xxd speaker.dat
00000000: 4a6f 7365 7068 0000 0000 0000 0000 0000 Joseph..........
00000010: 0000 0000 426c 6f67 7300 0000 0000 0000 ....Blogs.......
00000020: 0000 0000 0000 0000 3132 3320 4661 6b65 ........123 Fake
00000030: 2053 7472 6565 7400 0000 0000 0000 0000 Street.........
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 4e65 7720 596f 726b 0000 0000 0000 0000 New York........
00000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000070: 0000 0000 0000 0000 3230 322d 3535 352d ........202-555-
00000080: 3031 3137 0000 0000 0faa 0117......
$ ./main load speaker.dat
Name: Joseph Blogs
Address:
123 Fake Street
New York
Phone: 202-555-0117
Flags: 0xAA0F
Okay, now we’ve got some dummy data and a better understanding of what we are working with. Let’s get started on the Rust version.
Make a Speaker Struct
After creating a new binary crate (cargo new --bin deserializing-binary-data-files
) the first order of business is creating a Rust
equivalent for spkr
, and for this application our struct definition must
exactly match spkr
. If it doesn’t, our fields won’t match up and we’ll get
garbage.
For example, if left to its own devices the Rust compiler might decide that it
can generate more efficient code by moving the flags field to the front of
the struct and shuffling everything down. Every other field would then sit 2
bytes later than it does in the file, so reading the phone field in Rust would
pick up the last 14 bytes of the C phone field followed by the 2 bytes of
flags. Not ideal.
Luckily there is a way to explicitly tell the compiler to represent a struct in
memory identically to what C would do, #[repr(C)]
.
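// src/main.rs
#[derive(Debug)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    flags: u16,
}
// `Default` can't be derived here because the standard library only
// implements it for arrays of up to 32 elements, so zero the fields by hand.
impl Default for Speaker {
    fn default() -> Self {
        Speaker { name: [[0; 20]; 2], addr1: [0; 40], addr2: [0; 40], phone: [0; 16], flags: 0 }
    }
}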
For this, we assume that char is a single 8-bit byte. The C standard
guarantees that sizeof(char) is 1, but it leaves the number of bits in a byte
(CHAR_BIT) implementation-defined and only requires at least 8. With 50 years
of C programs in the wild that assume 8-bit bytes, it’s probably safe to say
this is a de-facto standard.
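One cheap way to check the layout assumption is to ask the compiler for the struct's size and a field offset, then compare them with the hex dump from earlier. The following is a minimal sketch rather than code from the original project; the `offset_of_flags()` helper is a hypothetical name introduced for illustration.

```rust
use std::mem;

#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    flags: u16,
}

/// Hypothetical helper: the byte offset of `flags` from the start of a
/// `Speaker`, measured on a real instance using plain pointer arithmetic.
pub fn offset_of_flags() -> usize {
    let speaker = Speaker {
        name: [[0; 20]; 2],
        addr1: [0; 40],
        addr2: [0; 40],
        phone: [0; 16],
        flags: 0,
    };
    let base = &speaker as *const Speaker as usize;
    let field = &speaker.flags as *const u16 as usize;
    field - base
}

fn main() {
    // 40 + 40 + 40 + 16 + 2 bytes, the same length as speaker.dat.
    assert_eq!(mem::size_of::<Speaker>(), 138);
    // `flags` comes straight after `phone`, at 0x88 in the hex dump.
    assert_eq!(offset_of_flags(), 136);
    println!("layout matches the C struct");
}
```

If either assertion fails on your platform, the Rust and C definitions have drifted apart and the loaded data can't be trusted.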
Blindly Assuming the File Contains a Speaker
Next we need to do the actual “deserializing”.
The fread()
function in C could read data directly into our spkr
variable
because it accepts a void*
pointer as the destination, and all pointers
implicitly coerce to void*
.
Unfortunately for us the corresponding API, the std::io::Read
trait,
insists that our programs are type-safe and will only write into a byte buffer
(&mut [u8]
).
Instead, when creating a load()
constructor for loading a Speaker
from a
reader we’ll create a Speaker
variable where all the fields are set to a
“sane” default so we avoid passing an uninitialized buffer to our reader…
// src/main.rs
use std::io::{Error, Read};
impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        // Create a Speaker where all the fields are set to some sane default
        // (typically all zeroes)
        let mut speaker = Speaker::default();
        ...
    }
}
… Then tell the compiler to treat our speaker variable as a big byte
array that we can read data into…
// src/main.rs
use std::mem;
impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        ...
        unsafe {
            // Get a slice which treats the `speaker` variable as a byte array
            let buffer: &mut [u8] = std::slice::from_raw_parts_mut(
                (&mut speaker as *mut Speaker).cast::<u8>(),
                mem::size_of::<Speaker>(),
            );
            // Read exactly that many bytes from the reader
            reader.read_exact(buffer)?;
            ...
        }
    }
}
… then once we know the speaker
has been completely filled with data from
the reader we just need to return it.
// src/main.rs
impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        ...
        unsafe {
            ...
            // Our `speaker` has now been updated with data from the reader.
            Ok(speaker)
        }
    }
}
In Rust it’s always a good idea to add a comment above your unsafe
blocks
noting down your assumptions and why the unsafe
block is correct and won’t
break memory safety.
// src/main.rs
impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        ...
        // Safety: All the fields in a Speaker are valid for all possible bit
        // combinations.
        unsafe {
            ...
        }
    }
}
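Stitched together, the pieces above might look something like the following. This is a sketch rather than the exact code from the repository: the manual `Default` impl, the `flags()` getter, and the small `main()` are assumptions added so the example compiles and runs by itself.

```rust
use std::io::{Cursor, Error, Read};
use std::mem;

#[derive(Debug)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    flags: u16,
}

impl Default for Speaker {
    fn default() -> Self {
        Speaker {
            name: [[0; 20]; 2],
            addr1: [0; 40],
            addr2: [0; 40],
            phone: [0; 16],
            flags: 0,
        }
    }
}

impl Speaker {
    pub fn load(mut reader: impl Read) -> Result<Self, Error> {
        // Start from a fully initialized (all zeroes) value.
        let mut speaker = Speaker::default();

        // Safety: `Speaker` is `#[repr(C)]`, has no padding bytes, and only
        // contains integers and arrays of integers, so every bit pattern
        // written into it is a valid `Speaker`.
        unsafe {
            let buffer: &mut [u8] = std::slice::from_raw_parts_mut(
                (&mut speaker as *mut Speaker).cast::<u8>(),
                mem::size_of::<Speaker>(),
            );
            // Fails with an error if the reader has fewer bytes than needed.
            reader.read_exact(buffer)?;
        }

        Ok(speaker)
    }

    /// Hypothetical getter, added so callers can inspect the flags.
    pub fn flags(&self) -> u16 {
        self.flags
    }
}

fn main() -> Result<(), Error> {
    // A made-up 138-byte input: all zeroes apart from the two flag bytes.
    let mut bytes = [0u8; 138];
    bytes[136] = 0x0f;
    bytes[137] = 0xaa;

    let speaker = Speaker::load(Cursor::new(&bytes[..]))?;
    // The bytes are copied verbatim, so `flags` uses the machine's byte order.
    assert_eq!(speaker.flags(), u16::from_ne_bytes([0x0f, 0xaa]));
    println!("{:?}", speaker);
    Ok(())
}
```

Note that the `unsafe` block only covers the pointer cast and the read; constructing and returning the value are ordinary safe code.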
The primary reason for this being sound is that a Speaker
only contains
integers and arrays of integers, and an integer is valid for all possible bit
patterns. That means blindly copying bytes from (possibly maliciously crafted)
input into a Speaker
can’t introduce memory safety issues into our
Speaker::load()
function. Sure it could give us data that doesn’t make sense,
but we’d still have valid byte arrays in our byte array fields and a valid u16
in our flags
field.
This assumption isn’t correct in general. It’s one of the reasons the docs
for std::mem::transmute()
are full of warnings saying there are
plenty of better ways to do things and transmute()
should be a tool of last
resort.
For example, the Speaker struct couldn’t contain a string reference (&str)
because references must always be non-null, aligned, point at valid instances of
that type, and not outlive the things they refer to. Of the 2^64 different
possible bit patterns a pointer can hold on a 64-bit machine, only a small
handful may be valid.
As such, it wouldn’t be sound to give our Speaker
a &str
or Vec<T>
field
unless we had some outside information which makes extra guarantees. If the
caller could provide those guarantees then we would be able to mark the entire
Speaker::load()
function as unsafe
and make memory safety their problem.
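To make the "not every bit pattern is valid" point concrete, here is a hedged sketch of one way a field with restricted values could be handled: store it as a plain integer in the struct, then validate it in safe code afterwards. The `Kind` enum and its variants are hypothetical and don't appear in the original `spkr` struct.

```rust
use std::convert::TryFrom;

/// Hypothetical enum with only three valid values. Writing an arbitrary
/// byte directly into a field of this type would be undefined behaviour
/// for the other 253 bit patterns.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(u8)]
pub enum Kind {
    Keynote = 0,
    Panel = 1,
    Workshop = 2,
}

impl TryFrom<u8> for Kind {
    /// The rejected raw value is handed back to the caller.
    type Error = u8;

    fn try_from(raw: u8) -> Result<Self, Self::Error> {
        match raw {
            0 => Ok(Kind::Keynote),
            1 => Ok(Kind::Panel),
            2 => Ok(Kind::Workshop),
            other => Err(other),
        }
    }
}

fn main() {
    // A byte that names a real variant converts cleanly...
    assert_eq!(Kind::try_from(1u8), Ok(Kind::Panel));
    // ...while anything else is rejected instead of becoming an invalid enum.
    assert_eq!(Kind::try_from(200u8), Err(200u8));
}
```

The struct that gets filled from the file would keep a `u8` field, and a getter would call `Kind::try_from()` on it.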
Making Things Convenient
Now, C’s printf()
is more than happy to take a pointer to some bytes and
interpret them as text, but if we tried to print Speaker
’s fields at the
moment we’d get something useless like [0x4a, 0x6f, 0x73, 0x65, 0x70, 0x68, 0x00, 0x00, 0x00, 0x00]
instead of the "Joseph"
that we expect.
This is because Rust treats all arrays of bytes as arrays of bytes and doesn’t
attach any extra meaning to them. If we want to treat them like strings then
we’ll need to use std::str::from_utf8()
to convert our
byte arrays to a UTF-8 &str
.
From the way our C is implemented, if a string field isn’t completely filled with text it will be padded out with zeroes. This is particularly helpful for C programs because it means our fields always have trailing null bytes (as long as we don’t overflow).
In Rust, we don’t want to include these trailing null bytes in the final output (null is a valid UTF-8 character, so it won’t be stripped for us), so let’s make a helper function that takes a byte array, cuts it off at the first null byte, then tries to interpret the rest as UTF-8.
// src/main.rs
fn c_string(bytes: &[u8]) -> Option<&str> {
    let bytes_without_null = match bytes.iter().position(|&b| b == 0) {
        Some(ix) => &bytes[..ix],
        None => bytes,
    };
    std::str::from_utf8(bytes_without_null).ok()
}
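To see how the helper behaves at the edges, here is a small self-contained sketch that repeats `c_string()` and runs it against a few hand-made buffers. The inputs are made up for illustration.

```rust
fn c_string(bytes: &[u8]) -> Option<&str> {
    let bytes_without_null = match bytes.iter().position(|&b| b == 0) {
        Some(ix) => &bytes[..ix],
        None => bytes,
    };
    std::str::from_utf8(bytes_without_null).ok()
}

fn main() {
    // A 16-byte field padded with zeroes, like `phone` in the data file.
    let mut phone = [0u8; 16];
    phone[..12].copy_from_slice(b"202-555-0117");
    assert_eq!(c_string(&phone), Some("202-555-0117"));

    // A field that is completely full has no null byte, so we keep it all.
    assert_eq!(c_string(b"full"), Some("full"));

    // A field containing nothing but zeroes is an empty string.
    assert_eq!(c_string(&[0, 0, 0]), Some(""));

    // Bytes that aren't valid UTF-8 give `None` instead of panicking.
    assert_eq!(c_string(&[0xff, 0xfe, 0x00]), None);
}
```

Returning `Option<&str>` means the string borrows from the `Speaker` and no allocation is needed.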
With our c_string()
function in hand we are now ready to give Speaker
some
getter methods.
// src/main.rs
impl Speaker {
    pub fn address_line_1(&self) -> Option<&str> { c_string(&self.addr1) }
    pub fn address_line_2(&self) -> Option<&str> { c_string(&self.addr2) }
    pub fn phone_number(&self) -> Option<&str> { c_string(&self.phone) }
}
The name
field is a bit more interesting because it contains two strings in
an array, presumably the first and last names.
To preserve the semantics that a name
contains two strings, we can use a
mixture of destructuring and the ?
operator to create a getter that returns
the first and last names only when they are both valid.
// src/main.rs
impl Speaker {
    pub fn name(&self) -> Option<(&str, &str)> {
        let [first, last] = &self.name;
        let first = c_string(first)?;
        let last = c_string(last)?;
        Some((first, last))
    }
}
Tests
We’ve now got a Speaker::load()
constructor and some convenient getter methods
for interpreting the fields so I figure it’s time to write some tests.
First, we’ll take the bytes from the speaker.dat
generated earlier and save
them as a byte literal. Conveniently, the xxd
tool has a flag which prints
the input in a form that can be included in source code.
$ cat speaker.dat | xxd -i
0x4a, 0x6f, 0x73, 0x65, 0x70, 0x68, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x42, 0x6c, 0x6f, 0x67,
0x73, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x31, 0x32, 0x33, 0x20, 0x46, 0x61, 0x6b, 0x65,
0x20, 0x53, 0x74, 0x72, 0x65, 0x65, 0x74, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x4e, 0x65, 0x77, 0x20,
0x59, 0x6f, 0x72, 0x6b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x32, 0x30, 0x32, 0x2d, 0x35, 0x35, 0x35, 0x2d, 0x30, 0x31, 0x31, 0x37,
0x00, 0x00, 0x00, 0x00, 0x0f, 0xaa
All we need to do is create a new test module and paste it into a SPEAKER_DAT
constant.
// src/main.rs
#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;
    const SPEAKER_DAT: [u8; mem::size_of::<Speaker>()] = [ ... ];
}
Now we can write a test which deserializes our Joseph Blogs speaker’s information and compares it to what the C code generated.
// src/main.rs
#[cfg(test)]
mod tests {
    ...
    #[test]
    fn deserialize_joe_bloggs() {
        let reader = Cursor::new(&SPEAKER_DAT);
        let got = Speaker::load(reader).unwrap();
        assert_eq!(got.name().unwrap(), ("Joseph", "Blogs"));
        assert_eq!(got.address_line_1().unwrap(), "123 Fake Street");
        assert_eq!(got.address_line_2().unwrap(), "New York");
        assert_eq!(got.phone_number().unwrap(), "202-555-0117");
        assert_eq!(got.flags, 0xAA0F);
    }
}
Our test passes, of course.
$ cargo test
Finished test [unoptimized + debuginfo] target(s) in 0.03s
Running unittests (/home/michael/Documents/deserializing-binary-data-files/target/debug/deps/reading_data_files-f34629f4e016d364)
running 1 test
test tests::deserialize_joe_bloggs ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out; finished in 0.00s
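It can also be worth checking the failure path. `read_exact()` returns an error of kind `UnexpectedEof` when the reader runs out of bytes before the buffer is full, so a truncated file shows up as an error rather than a half-filled struct. The sketch below demonstrates this with a plain 138-byte buffer; `read_speaker_bytes()` and `SPEAKER_SIZE` are hypothetical names used only for this example.

```rust
use std::io::{Cursor, ErrorKind, Read};

/// Size of the C `spkr` struct in bytes (40 + 40 + 40 + 16 + 2).
pub const SPEAKER_SIZE: usize = 138;

/// Hypothetical helper: read exactly one speaker's worth of raw bytes.
pub fn read_speaker_bytes(mut reader: impl Read) -> std::io::Result<[u8; SPEAKER_SIZE]> {
    let mut buffer = [0u8; SPEAKER_SIZE];
    reader.read_exact(&mut buffer)?;
    Ok(buffer)
}

fn main() {
    // Only 100 of the 138 bytes are available, so the read must fail.
    let truncated = [0u8; 100];
    let err = read_speaker_bytes(Cursor::new(&truncated[..])).unwrap_err();
    assert_eq!(err.kind(), ErrorKind::UnexpectedEof);

    // With enough bytes the read succeeds.
    let complete = [0u8; SPEAKER_SIZE];
    assert!(read_speaker_bytes(Cursor::new(&complete[..])).is_ok());
}
```

The same reasoning applies to `Speaker::load()`, since it relies on `read_exact()` in the same way.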
A Note on Packing
There is this concept called “packing” which is quite important in
telling the compiler how to lay out our Speaker
and spkr
structs in memory.
Because we are interpreting a bunch of bytes as a Speaker, it is very important
for us that Speaker and spkr are laid out identically, otherwise we’ll get
garbage.
You see, processors really like it when things are lined up in memory correctly
and they often need to do extra work when they aren’t lined up (which kills
performance) or will just error out altogether (which kills your program). For
example, a u8
can be placed at addresses that are multiples of 1 byte (i.e.
anywhere), a u32
can be placed at multiples of 4 bytes, and so on.
To deal with this alignment issue, compilers will insert some unused bytes
(often called “padding”) between fields to make sure they line up correctly.
With #[repr(C)], Rust follows the same padding rules as the C compiler. If we
want to tell the compiler not to insert this spacing we can use
#[repr(packed)], which says “this struct’s bytes must be packed together as
closely as possible”.
Most binary formats don’t want these padding bytes because files should be as
compact as possible, so it’s not uncommon to see a #[repr(packed)] (or its C
cousin, __attribute__((packed)), which both GCC and Clang accept) next to
struct definitions when they are using this direct reading/writing method of
serializing data.
We get lucky here because the flags field is at offset 136 (which is a
multiple of 2 bytes) so we didn’t need to do anything special, but it’s still
good to know. It may also help explain why you’ll often see random fields named
spare or unused in a C struct definition.
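The effect of padding is easy to observe with `std::mem::size_of()`. The two structs below are hypothetical and exist only to show the difference; the sizes in the comments assume a typical platform where `u32` has 4-byte alignment.

```rust
use std::mem;

/// Laid out like a normal C struct: 3 padding bytes follow `tag` so that
/// `value` starts at a multiple of 4.
#[repr(C)]
pub struct Padded {
    tag: u8,
    value: u32,
}

/// Same fields, but the compiler is told not to insert any padding.
#[repr(C, packed)]
pub struct Packed {
    tag: u8,
    value: u32,
}

fn main() {
    // 1 byte of data + 3 bytes of padding + 4 bytes of data.
    assert_eq!(mem::size_of::<Padded>(), 8);
    assert_eq!(mem::align_of::<Padded>(), 4);

    // 1 + 4 bytes with nothing in between.
    assert_eq!(mem::size_of::<Packed>(), 5);
    assert_eq!(mem::align_of::<Packed>(), 1);
}
```

If the C side writes the packed version and the Rust side reads into the padded one (or vice versa), every field after `tag` ends up shifted by 3 bytes.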
Conclusions
It’s not something you’ll need to (or want to) use too often, but knowing how to read and write data directly to a file without any complicated serialization framework may come in handy some day.
That said, there are a lot of better alternatives out there, with most of them
allowing you to write code that has similar performance characteristics with no
need for unsafe
.
Personally, I would definitely reach for a better tool (probably a parsing
library like nom
or binread
) if this was an application I
cared about or if I didn’t have full control over the input. However, for
quick-and-dirty tools where your primary goal is to “do whatever C does”, this
technique works pretty well.
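As a rough illustration of what "no unsafe" can look like even without reaching for a parsing library, the sketch below takes the 138 raw bytes and copies each field out by offset. The `parse_speaker()` function and `ParsedSpeaker` type are hypothetical names, and reading `flags` as little-endian is an assumption based on the hex dump from earlier (the bytes `0f aa` were printed as `0xAA0F`).

```rust
use std::convert::TryInto;

/// Hypothetical safe counterpart to `Speaker`. No `#[repr(C)]` is needed
/// because the bytes are never reinterpreted in place.
#[derive(Debug, PartialEq)]
pub struct ParsedSpeaker {
    pub name: [[u8; 20]; 2],
    pub addr1: [u8; 40],
    pub addr2: [u8; 40],
    pub phone: [u8; 16],
    pub flags: u16,
}

/// Copy each field out of the raw bytes using the offsets of the C struct.
pub fn parse_speaker(bytes: &[u8; 138]) -> ParsedSpeaker {
    // The `unwrap()` calls can't fail because every range below has exactly
    // the length of the array it is converted into.
    ParsedSpeaker {
        name: [
            bytes[0..20].try_into().unwrap(),
            bytes[20..40].try_into().unwrap(),
        ],
        addr1: bytes[40..80].try_into().unwrap(),
        addr2: bytes[80..120].try_into().unwrap(),
        phone: bytes[120..136].try_into().unwrap(),
        // Assumption: the file was written on a little-endian machine.
        flags: u16::from_le_bytes([bytes[136], bytes[137]]),
    }
}

fn main() {
    // Made-up input containing just a first name and the flag bytes.
    let mut bytes = [0u8; 138];
    bytes[..6].copy_from_slice(b"Joseph");
    bytes[136..].copy_from_slice(&[0x0f, 0xaa]);

    let speaker = parse_speaker(&bytes);
    assert_eq!(&speaker.name[0][..6], &b"Joseph"[..]);
    assert_eq!(speaker.flags, 0xAA0F);
}
```

This costs one extra copy compared to reading straight into the struct, but it spells out the byte order and doesn't depend on how the compiler lays out the struct.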