Containers from first principles with Rust

Containers have taken the world by storm. Nowadays, many production environments employ containers in one form or another, running on developers' workstations, locally hosted servers and/or the cloud.
If you have worked as a developer or administrator in the industry, chances are you've worked with modern container technologies such as Docker and Kubernetes, or at least heard of them. From a high-level perspective, a container is an OS-level abstraction: a stripped-down operating system (OS) environment containing one or more applications (apps) of interest, running atop some host in an isolated manner. This isolation provides an extra layer of security compared to running the app(s) directly on the host, and it enables the app(s) to be executed on different hosts (possibly with different hardware) in a consistent, reproducible manner, as long as the hosts all run the same OS kernel. But do you know how containers are implemented from a low-level perspective? I decided to find out by writing my own container runtime (if it can be called that), Ruboxer, in Rust, with the additional goal of exploring Rust as a type- and memory-safe alternative to C for systems programming. If you'd like to learn about the internals of container runtimes, read on!

Before we begin, a few notes on setting up your environment:
  • For non-Linux OSes such as Windows or macOS: Ruboxer depends on a number of Linux-specific features and system calls (syscalls) and therefore cannot be run directly on your host system, so you will need a Linux virtual machine (VM)
  • For Linux-based OSes: Ruboxer implements a very limited subset of features found in proper container runtimes such as Docker and Kubernetes, and as such, provides very limited isolation from the host system. Therefore, you are strongly advised against installing and running Ruboxer on a production system or any system with potentially sensitive or private information. Running it in a throwaway VM provides an extra layer of isolation from your host system in case something goes wrong
  • You may use any hypervisor of your choice; for example, a popular one is VirtualBox. Windows 10 (or later) users may also be aware of a lightweight virtualization technology known as Windows Subsystem for Linux 2 (WSL2), which is not covered in this article - if you decide to use that, YMMV. There are many online tutorials on setting up virtual machines that are easily searchable so the VM setup process will not be covered here.
    At the time of writing, this article has been tested against the following Linux distributions (distros), so it is recommended you install one of them in your VM - if you decide to install another distro, YMMV:
  • Ubuntu 20.04 LTS (desktop or server; desktop is recommended for newcomers to Linux)
  • CentOS 8 Stream
  • The rest of this article takes place on the command line within the VM (or cloud instance), so a basic familiarity with Bash will help (macOS users: Bash behaves much like the zsh you may be used to). Now that we have our VM set up and running, fire up the terminal, then download and install Ruboxer, entering your sudo password when prompted:
  • Ubuntu 20.04 LTS: $ wget https://github.com/DonaldKellett/ruboxer/releases/download/v0.1.0/ruboxer_0.1.0_amd64.deb && sudo apt install ./ruboxer_0.1.0_amd64.deb
  • CentOS 8 Stream: $ wget https://github.com/DonaldKellett/ruboxer/releases/download/v0.1.0/ruboxer-0.1.0-1.el8.x86_64.rpm && sudo dnf install ./ruboxer-0.1.0-1.el8.x86_64.rpm
  • Note: the $/# at the beginning of each line represents your command prompt and should not be entered
    The package you just installed contains a binary ruboxer, plus a man page you can read by entering the following command (press the up and down arrow keys to scroll, 'q' to exit): $ man 8 ruboxer. Anyway, the man page is quite terse so let's go through some examples instead.
    In order to run a container, we first need an image, which is basically a stripped down Linux system containing the app(s) of interest and their dependencies. So let's fetch a customized Python image built for this article (adapted from Eric Chiang's Containers from Scratch):
    $ wget https://github.com/DonaldKellett/containers-from-first-principles-with-rust/releases/download/v0.1.0/rootfs.tar.xz
    (this will take a while to download)
    Now unpack the image:
    $ tar xvf rootfs.tar.xz
    And rename it to something more(?) intuitive:
    $ mv rootfs container1
    We'll also make a second copy of this unpacked image, for reasons we'll see later:
    $ cp -r container1 container2
    (this may take a few seconds)
    Now we're good to go. Print usage instructions for the ruboxer command:
    $ ruboxer --help
    In its simplest form, ruboxer takes in two arguments, the first of which is the absolute path to the unpacked image and the second of which is the command to execute within the container. Assuming that you executed the above commands in your home directory, the absolute path to your unpacked image is $HOME/container1. If not, you may need to modify the commands that follow accordingly to get them to work. Let's first peer into the unpacked image:
    $ ls $HOME/container1
    You should see the following output (exact spacing and indentation depends on terminal width):
    bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
    boot  etc  lib   media  opt  root  sbin  sys  usr
    Now compare this with your host system:
    $ ls /
    The output (exact details may differ):
    bin    dev   lib    libx32      mnt   root  snap      sys  var
    boot   etc   lib32  lost+found  opt   run   srv       tmp
    cdrom  home  lib64  media       proc  sbin  swap.img  usr
    Looks kinda similar, right? An unpacked container image is basically the filesystem for a slimmed down Linux system (ideally) containing just the essentials to run the app(s) of interest.
    Now we run Ruboxer with $HOME/container1 as the image and bash as the command, which will spawn a shell inside the container. sudo is required here to elevate privileges to root - we'll explain why in a moment:
    $ sudo ruboxer $HOME/container1 bash
    Notice that the command prompt changed from something like dsleung@ubuntu2:~$ to something like root@ubuntu2:/# (exact output may differ) - in particular, the $ changed to a # (indicating we're now running as root instead of a normal user) and the tilde ~ changed to / (indicating that we moved from our home directory to what appears to the container as the root directory of the filesystem). This should look familiar to you if you have peered inside a Docker container before, using a command similar to the following (this may not work on your newly installed Linux system if it doesn't have Docker installed and running):
    $ docker container run -it ubuntu:focal /bin/bash
    The only notable difference is that the hostname (everything between the @ and : above) of a Docker container might look something like 23f66e3c1dc8 instead of retaining the hostname of your host system.
    Anyway, let's list the contents of the root directory again, as seen from inside the container:
    # ls /
    You should see the contents of the container, not that of your host system:
    bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
    boot  etc  lib   media  opt  root  sbin  sys  usr
    In fact, as far as the container is concerned, $HOME/container1 is the root directory / - the container cannot (should not) see any files and/or directories outside of the container root $HOME/container1. This brings us to the classic Unix syscall central to implementing containers: chroot()
    chroot()
    chroot() is a syscall (i.e. an API call provided by the OS kernel to applications) that restricts the filesystem view of the currently running process to a particular subdirectory. It is said to virtualize the filesystem, analogous to how containers virtualize the OS and VMs virtualize the hardware. However, it only provides filesystem-level isolation - the chroot()ed process can still view and interact with all processes running on the host system and access all device nodes, for example. Therefore, a container runtime must combine chroot() with other methods of isolation to prevent container processes from escaping the container into the host system.
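    To make this concrete, here is a minimal sketch of how a runtime might perform the chroot step in Rust. It assumes the nix crate as a thin safe wrapper around the underlying syscalls and simplifies error handling; Ruboxer's actual implementation may differ.

    use nix::unistd::{chdir, chroot};

    // Confine the calling process to `new_root` before exec'ing the container
    // command. Requires root privileges (or CAP_SYS_CHROOT), which is one
    // reason we invoke ruboxer with sudo.
    fn enter_root(new_root: &str) -> nix::Result<()> {
        // Restrict the filesystem view of this process to `new_root`...
        chroot(new_root)?;
        // ...then move the working directory inside the new root, since chroot()
        // alone does not change the cwd and would otherwise leave a way out.
        chdir("/")?;
        Ok(())
    }

    After these calls, any command the process goes on to exec sees new_root (here, $HOME/container1) as /.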
    In fact, Ruboxer does a bit more than chroot(). Inside the container, run the following command to mount the proc filesystem at /proc:
    # mount -t proc proc /proc
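    Incidentally, a runtime could also perform this mount on the container's behalf rather than leaving it to the user. Here is a hedged sketch of the equivalent call in Rust, assuming the nix crate (with its mount feature enabled); this is not necessarily how Ruboxer does it:

    use nix::mount::{mount, MsFlags};

    // Equivalent of `mount -t proc proc /proc`, as seen from inside the container root.
    fn mount_proc() -> nix::Result<()> {
        mount(
            Some("proc"),     // source; ignored by procfs, "proc" by convention
            "/proc",          // mount point
            Some("proc"),     // filesystem type
            MsFlags::empty(), // no extra mount flags
            None::<&str>,     // no filesystem-specific options
        )
    }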
    procfs is a pseudo-filesystem used by the Linux kernel to expose system information (such as a list of running processes) to applications. Mounting it within our container allows command-line utilities present in the container such as ps to obtain a list of running processes and display them to the user. Now get the list of running processes (as seen from within the container) by executing:
    # ps aux
    You should see output similar to the following:
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root           1  0.0  0.3  21964  3568 ?        S    08:18   0:00 bash
    root           3  0.0  0.2  19184  2304 ?        R+   08:21   0:00 ps aux
    If you've opened Task Manager in Windows, Activity Monitor in macOS or similar applications for listing running tasks before, you may notice that something seems a bit off. A normal running system has at least a few dozen processes running at any given moment, often hundreds or even thousands of them. Yet here we see only two. Within the container, though, this is good news - it means that our container can only see and interact with the processes belonging to it: the Bash shell that we spawned when starting the container, and ps aux (which would've died by the time you saw the output). This is process-level isolation provided by the Linux kernel through ...
    Process namespaces
    Linux defines the notion of namespaces, which serve as a mechanism for providing an isolated view of various system resources and parameters such as running processes and network configuration. The Linux kernel provides a number of syscalls for creating, entering and manipulating namespaces, such as unshare() and setns(). By default, Ruboxer creates a new process namespace for each container so that the container can only see its own processes, but it also accepts an option --procns-pid for specifying the process ID (PID) of an existing process whose process namespace the container should join instead.
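    As a rough illustration, creating or joining a process namespace from Rust might look like the sketch below, again assuming the nix crate; the exact setns() signature varies between nix versions (this sketch uses the raw file descriptor form), and Ruboxer's own code may differ.

    use nix::sched::{setns, unshare, CloneFlags};
    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    // Place *future children* of this process in a brand-new PID namespace.
    // The caller keeps its original PID; the next child it spawns becomes
    // PID 1 inside the new namespace (like the bash we saw above).
    fn new_pid_namespace() -> nix::Result<()> {
        unshare(CloneFlags::CLONE_NEWPID)
    }

    // Join the process namespace of an existing process instead - roughly what
    // an option like --procns-pid could translate to under the hood.
    fn join_pid_namespace(pid: u32) -> Result<(), Box<dyn std::error::Error>> {
        // Every process exposes handles to its namespaces under /proc/<pid>/ns/
        let ns = File::open(format!("/proc/{}/ns/pid", pid))?;
        setns(ns.as_raw_fd(), CloneFlags::CLONE_NEWPID)?;
        Ok(())
    }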
    To see this in action, open a new terminal session, keeping the original terminal session intact. In the new terminal session, execute $ ps aux | grep bash | grep root to list all processes with the strings "bash" and "root" in them (most of which are actual Bash processes executed by user root). You should see output similar to the following (possibly with occurrences of "bash" highlighted):
    root         999  0.0  0.4  11188  4720 pts/0    S    08:17   0:00 sudo ruboxer /home/dsleung/container1 bash
    root        1001  0.0  0.0   3140   704 pts/0    S    08:18   0:00 ruboxer /home/dsleung/container1 bash
    root        1002  0.0  0.3  21964  3568 pts/0    S+   08:18   0:00 bash
    In my case, the last line of output corresponds to the actual Bash process running in my container. The second column in the output is the PID of the process, so my Bash process has PID 1002. Note that this number is very likely different in your case so note down your number and substitute that for 1002 in the upcoming commands.
    Now, also in that new terminal session, run another container with Ruboxer using $HOME/container2 as filesystem root and bash as the command, but this time we join the process namespace corresponding to PID 1002 (replace with your PID) instead of creating a new process namespace:
    $ sudo ruboxer --procns-pid 1002 $HOME/container2 bash
    Now within this new container, mount procfs at /proc again:
    # mount -t proc proc /proc
    And get a list of all processes:
    # ps aux
    You should see output similar to:
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root           1  0.0  0.3  21964  3568 ?        S+   08:18   0:00 bash
    root           4  0.0  0.3  21960  3480 ?        S    08:53   0:00 bash
    root           6  0.0  0.2  19184  2444 ?        R+   08:54   0:00 ps aux
    Here we see two Bash processes plus a ps aux that already exited by the time we see the output. The first Bash process with PID 1 as seen from within the container is that running in our original container, while the second one with PID 4 (in my case, you may get a different PID number) is the current Bash process in the new container we just started. We can also try to execute # ps aux in our original container again to see this output (exact output may differ):
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root           1  0.0  0.3  21964  3572 ?        S    08:18   0:00 bash
    root           4  0.0  0.3  21960  3480 ?        S+   08:53   0:00 bash
    root           7  0.0  0.2  19184  2396 ?        R+   08:58   0:00 ps aux
    So our original container can see processes running in the new container as well. This is because the new container joined the process namespace of the original container so they can both see and interact with each other's processes, and nothing else.
    We just saw how process namespaces provide process-level isolation so container processes can interact with other processes within the same container but cannot interact with other processes on the host system or within other containers with which they do not share a process namespace. This means that a container process should not be able to (as an example) kill an important system process and bring the whole system down. However, what if a container process tries to eat up all RAM or CPU cycles etc. on the host, leaving the rest of the host system with insufficient resources to perform useful work?
    Memory cgroups
    Linux defines a notion of cgroups, short for "control groups", which enable quotas to be established for various processes on a wide range of system resources. For example, memory cgroups can be used to limit the maximum amount of RAM that a process (or group of processes) can acquire before it is killed by the system.
    There are two major versions of cgroups at the time of writing, known as cgroups v1 and v2 respectively, and most commercial Linux distributions currently still heavily rely on the former. In cgroups v1, resource quotas are read and set through the cgroup pseudo-filesystem exported by the Linux kernel, conventionally mounted under /sys/fs/cgroup with one subdirectory per controller (e.g. /sys/fs/cgroup/memory). By default, Ruboxer enforces a memory limit of 128MiB (base-2) for each container.
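    For a feel of what enforcing such a limit involves, here is a hedged sketch in Rust using only the standard library. The cgroup name is made up for illustration, and the paths assume the conventional cgroups v1 layout for the memory controller:

    use std::fs;
    use std::path::Path;

    // Create a memory cgroup, cap its RAM usage, and move a process into it.
    fn limit_memory(name: &str, limit_bytes: u64, pid: u32) -> std::io::Result<()> {
        let dir = Path::new("/sys/fs/cgroup/memory").join(name);
        // Creating the directory is what creates the cgroup
        fs::create_dir_all(&dir)?;
        // Maximum RAM the member processes may use before the OOM killer steps in
        fs::write(dir.join("memory.limit_in_bytes"), limit_bytes.to_string())?;
        // Writing a PID to cgroup.procs moves that process (and its future
        // children) into the cgroup
        fs::write(dir.join("cgroup.procs"), pid.to_string())?;
        Ok(())
    }

    // e.g. limit_memory("ruboxer-demo", 128 * 1024 * 1024, std::process::id())

    Once a process is in the cgroup, the kernel accounts every page it and its descendants allocate against memory.limit_in_bytes, which is how a default cap like Ruboxer's 128MiB can be enforced.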
    To see this in action, first run the following command to mount devtmpfs (yet another pseudo-filesystem) at /dev in both the original and new containers:
    # mount -t devtmpfs devtmpfs /dev
    Now, in one of the containers (doesn't matter which one), run the following command to print the contents of a memory-hungry application adapted from Eric Chiang's Containers from Scratch to the console:
    # cat /bin/memeater
    This program repeatedly reads zero-filled bytes from /dev/zero (which is only accessible once devtmpfs is mounted) in chunks of 16MiB (base-2), holding on to them until it exhausts all system memory or gets killed by the system. Now run it (in either container) by executing:
    # memeater
    and watch it get killed after the first few dozen MiBs. The output should be similar to the following:
    16MiB
    32MiB
    48MiB
    Killed
    Note that if you run it outside the container, it will just keep printing output until you kill it by pressing Ctrl-C or it eats up all the memory in your system, causing it to hang and become unresponsive.
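    If you're curious what such a program might look like, here is a hypothetical Rust sketch in the same spirit as memeater (it is not the actual source shipped in the image):

    use std::fs::File;
    use std::io::Read;

    fn main() -> std::io::Result<()> {
        const CHUNK: usize = 16 * 1024 * 1024; // 16MiB at a time
        let mut zero = File::open("/dev/zero")?;
        let mut hoard: Vec<Vec<u8>> = Vec::new();
        let mut total_mib = 0;
        loop {
            let mut buf = vec![0u8; CHUNK];
            // Fill the buffer from /dev/zero so every page is actually touched
            zero.read_exact(&mut buf)?;
            // Keep the chunk alive so the memory is never freed
            hoard.push(buf);
            total_mib += 16;
            println!("{}MiB", total_mib);
        }
    }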
    As a final exercise, let's adjust the limit to 512MiB. In the new container, unmount devtmpfs and procfs using umount (NOT a typo!) and exit the container:
    # umount /dev
    # umount /proc
    # exit
    You should see the command prompt change from # back to $. Now restart the new container with adjusted memory limits:
    $ sudo ruboxer --mem-max-bytes 512M $HOME/container2 bash
    Mount devtmpfs and run memeater:
    # mount -t devtmpfs devtmpfs /dev
    # memeater
    Notice how memeater is now allowed to eat much more memory before being killed:
    16MiB
    32MiB
    48MiB
    64MiB
    80MiB
    96MiB
    112MiB
    128MiB
    144MiB
    160MiB
    176MiB
    192MiB
    208MiB
    224MiB
    240MiB
    256MiB
    272MiB
    288MiB
    304MiB
    320MiB
    336MiB
    352MiB
    368MiB
    384MiB
    400MiB
    416MiB
    432MiB
    Killed
    This concludes our article on container internals. In a production-grade container runtime, many more isolation and sandboxing techniques are used to ensure security, including but not limited to:
  • Capabilities
  • Network namespaces
  • Seccomp profiles
  • Mandatory access control (MAC)
    I hope you enjoyed this article :-) You can find Ruboxer's source code at https://github.com/DonaldKellett/ruboxer. It is licensed under the permissive MIT license to encourage you to fork it and learn more about how containers work by making your own modifications to the source code.