Writing data to disk: transforming brittle code to robust code with atomic writes

This is the first post in a series where I'll cover about writing robust code that's can tolerate both expected and unexpected failures

Problem identification

Receiving feedback through code reviews is one of the many ways to grow your career as a software developer. But of course, not all feedback hold the same value. Not so useful comments tend to focus on nit picking (e.g. white space); moderately useful comments detect logic or semantic bugs; fairly useful ones help you see problems through a different lens; the best comments open your eyes to issues that you didn't even know existed.

One of the most eye-opening code reviews I submitted during my tenure at Amazon Web Service (AWS) revealed to me the importance of atomic writes to disk.

Example: Brittle Non-Atomic write to disk

Let's take a look at the snippet of Python code below that writes data to disk.

dataset = fetch_data()
...
with open('customers.txt') as fh:
    for each customer in dataset:
        fh.write(...)

At a glance, the above code looks and smells πŸ‘ƒ okay. It's coded with idiomatic Python: the context manager (i.e. with open) cleans up lingering resources for you, automatically closing out the file handle. Awesome. I see code like this all the time. But, can you spot the issue?

The lack of atomicity?

What is an atomic write?

In general, an atomic operation is all or nothing, binary, 0 or 1; the operation has either 1) not yet started or 2) has completed successfully. No gray areas. In the context of writing data to disk, the destination must contain all the data we expect to be present in the file, non-corrupted. Not some of the data β€” all of it.

So how do transform the above code such that we atomically write to disk?

As is stands, the above code is brittle, susceptible to failures. What happens if the program raises an exception mid-write? Or if the server powers off in between one of the read or write operations, leaving the data corrupted? In other words, the code opens us up to leaving the file in an unknown state.

Atomically writing a file

Here's how we go about writing an atomic file.

Steps

  1. Create a temporary file
  2. Write contents to temporary file
  3. Flush buffers
  4. Sync to disk
  5. Rename file.

Example

from x import TempFile
# 1. create temporary file
with open(tempfile) as fh:
    # 2. Write contents to file handle
    fh.write(...)
    # 3. Flush from any runtime or OS buffers
    fh.flush()
    # 4. Sync from memory to disk
    os.fsync(fh.fileno()) 

# 5. Rename and replace destination file
os.rename(tempFile, "customers.txt")

We start the procedure with opening a temporary file; this temporary file becomes the intermediate destination in which we direct our writes. By writing to a temporary file, we leave the ultimate destination file (if it exists) in tact, only replacing the destination file if all the data has been successfully written to the temporary file. Once all writes finished, then we simply rename the temporary file to that of the destination file, an atomic operation in itself.

Summary

Above, I demonstrated one way to apply atomicity. This principle can be applied to many other situations. For example, if you are writing multi-threaded code and accessing shared memory, a thread needs to atomically obtain a lock before modifying the underlying shared data structures.

So, moving forward, when writing or reviewing code, keep the possibility of failures at the fore front of your mind and identify ways you can apply the principle of atomicity to turn fragile code into robust software.

Let's Connect

Let's connect and talk more about software and devops. Follow me on Twitter: @memattchung

References

22