Containers from first principles with Rust

Containers have taken the world by storm. Nowadays, many production environments employ containers in one form or another, running on developers' workstations, locally hosted servers and/or the cloud.
If you have worked as a developer or administrator in the industry, chances are you've worked with modern container technologies such as Docker and Kubernetes, or at least heard of them. From a high-level perspective, a container is an OS-level abstraction: a stripped-down operating system (OS) environment containing one or more applications (apps) of interest, running atop some host in an isolated manner. This isolation provides an extra layer of security compared to running the app(s) directly on the host, and it enables the app(s) to be executed on different hosts (possibly with different hardware) in a consistent, reproducible manner, as long as the hosts all run the same OS kernel. But do you know how containers are implemented from a low-level perspective? I decided to find out by writing my own container runtime (if it can be called that), Ruboxer, in Rust, with the additional goal of exploring Rust as a type- and memory-safe alternative to C for systems programming. If you'd like to learn about the internals of container runtimes, read on!

Before we begin, a few notes on setting up your environment:
  • For non-Linux OSes such as Windows or macOS: Ruboxer depends on a number of Linux-specific features and system calls (syscalls) and therefore cannot be run directly on your host system, so you will need a Linux virtual machine (VM)
  • For Linux-based OSes: Ruboxer implements a very limited subset of features found in proper container runtimes such as Docker and Kubernetes, and as such, provides very limited isolation from the host system. Therefore, you are strongly advised against installing and running Ruboxer on a production system or any system with potentially sensitive or private information. Running it in a throwaway VM provides an extra layer of isolation from your host system in case something goes wrong
  • You may use any hypervisor of your choice; for example, a popular one is VirtualBox. Windows 10 (or later) users may also be aware of a lightweight virtualization technology known as Windows Subsystem for Linux 2 (WSL2), which is not covered in this article - if you decide to use that, YMMV. There are many online tutorials on setting up virtual machines that are easily searchable so the VM setup process will not be covered here.
    At the time of writing, this article has been tested against the following Linux distributions (distros), so it is recommended you install one of them in your VM - if you decide to install another distro, YMMV:
  • Ubuntu 20.04 LTS (desktop or server; desktop is recommended for newcomers to Linux)
  • CentOS 8 Stream
  • The rest of this article takes place on the command line within the VM (or cloud instance), so a basic familiarity with Bash will help (macOS users: Bash behaves much like the zsh you may be used to). Now that we have our VM set up and running, fire up the terminal, then download and install Ruboxer, entering your sudo password when prompted:
  • Ubuntu 20.04 LTS: $ wget https://github.com/DonaldKellett/ruboxer/releases/download/v0.1.0/ruboxer_0.1.0_amd64.deb && sudo apt install ./ruboxer_0.1.0_amd64.deb
  • CentOS 8 Stream: $ wget https://github.com/DonaldKellett/ruboxer/releases/download/v0.1.0/ruboxer-0.1.0-1.el8.x86_64.rpm && sudo dnf install ./ruboxer-0.1.0-1.el8.x86_64.rpm
  • Note: the $/# at the beginning of each line represents your command prompt and should not be entered
    The package you just installed contains a binary ruboxer, plus a man page you can read by entering the following command (press the up and down arrow keys to scroll, 'q' to exit): $ man 8 ruboxer. Anyway, the man page is quite terse so let's go through some examples instead.
    In order to run a container, we first need an image, which is basically a stripped down Linux system containing the app(s) of interest and their dependencies. So let's fetch a customized Python image built for this article (adapted from Eric Chiang's Containers from Scratch):
    $ wget https://github.com/DonaldKellett/containers-from-first-principles-with-rust/releases/download/v0.1.0/rootfs.tar.xz
    (this will take a while to download)
    Now unpack the image:
    $ tar xvf rootfs.tar.xz
    And rename it to something more(?) intuitive:
    $ mv rootfs container1
    We'll also make a second copy of this unpacked image, for reasons we'll see later:
    $ cp -r container1 container2
    (this may take a few seconds)
    Now we're good to go. Print usage instructions for the ruboxer command:
    $ ruboxer --help
    In its simplest form, ruboxer takes in two arguments, the first of which is the absolute path to the unpacked image and the second of which is the command to execute within the container. Assuming that you executed the above commands in your home directory, the absolute path to your unpacked image is $HOME/container1. If not, you may need to modify the commands that follow accordingly to get them to work. Let's first peer into the unpacked image:
    $ ls $HOME/container1
    You should see the following output (exact spacing and indentation depends on terminal width):
    bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
    boot  etc  lib   media  opt  root  sbin  sys  usr
    Now compare this with your host system:
    $ ls /
    The output (exact details may differ):
    bin    dev   lib    libx32      mnt   root  snap      sys  var
    boot   etc   lib32  lost+found  opt   run   srv       tmp
    cdrom  home  lib64  media       proc  sbin  swap.img  usr
    Looks kinda similar, right? An unpacked container image is basically the filesystem for a slimmed down Linux system (ideally) containing just the essentials to run the app(s) of interest.
    Now we run Ruboxer with $HOME/container1 as the image and bash as the command, which will spawn a shell inside the container. sudo is required here to elevate privileges to root - we'll explain why in a moment:
    $ sudo ruboxer $HOME/container1 bash
    Notice that the command prompt changed from something like dsleung@ubuntu2:~$ to something like root@ubuntu2:/# (exact output may differ) - in particular, the $ changed to a # (indicating we're now running as root instead of a normal user) and the tilde ~ changed to / (indicating that we moved from our home directory to what appears to the container as the root directory of the filesystem). This should look familiar to you if you have peered inside a Docker container before, using a command similar to the following (this may not work on your newly installed Linux system if it doesn't have Docker installed and running):
    $ docker container run -it ubuntu:focal /bin/bash
    The only notable difference is that the hostname (everything between the @ and : above) of a Docker container might look something like 23f66e3c1dc8 instead of retaining the hostname of your host system.
    Anyway, let's list the contents of the root directory again, as seen from inside the container:
    # ls /
    You should see the contents of the container, not that of your host system:
    bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
    boot  etc  lib   media  opt  root  sbin  sys  usr
    In fact, as far as the container is concerned, $HOME/container1 is the root directory / - the container cannot (should not) see any files and/or directories outside of the container root $HOME/container1. This brings us to the classic Unix syscall central to implementing containers: chroot()
    chroot()
    chroot() is a syscall (i.e. an API call provided by the OS kernel to applications) that restricts the filesystem view of the currently running process to a particular subdirectory. It is said to virtualize the filesystem, analogous to how containers virtualize the OS and VMs virtualize the hardware. However, it only provides filesystem-level isolation - the chroot()ed process can still view and interact with all processes running on the host system and access all device nodes, for example. Therefore, a container runtime must combine chroot() with other methods of isolation to prevent container processes from escaping the container into the host system.
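    To make this concrete, here is a minimal sketch of how a runtime might perform the chroot step in Rust. It assumes the nix crate as a thin safe wrapper around the underlying syscalls and simplifies error handling; Ruboxer's actual implementation may differ.

    use nix::unistd::{chdir, chroot};

    // Confine the calling process to `new_root` before exec'ing the container
    // command. Requires root privileges (or CAP_SYS_CHROOT), which is one
    // reason we invoke ruboxer with sudo.
    fn enter_root(new_root: &str) -> nix::Result<()> {
        // Restrict the filesystem view of this process to `new_root`...
        chroot(new_root)?;
        // ...then move the working directory inside the new root, since chroot()
        // alone does not change the cwd and would otherwise leave a way out.
        chdir("/")?;
        Ok(())
    }

    After these calls, any command the process goes on to exec sees new_root (here, $HOME/container1) as /.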
    In fact, Ruboxer does a bit more than chroot(). Inside the container, run the following command to mount the proc filesystem at /proc:
    # mount -t proc proc /proc
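    Incidentally, a runtime could also perform this mount on the container's behalf rather than leaving it to the user. Here is a hedged sketch of the equivalent call in Rust, assuming the nix crate (with its mount feature enabled); this is not necessarily how Ruboxer does it:

    use nix::mount::{mount, MsFlags};

    // Equivalent of `mount -t proc proc /proc`, as seen from inside the container root.
    fn mount_proc() -> nix::Result<()> {
        mount(
            Some("proc"),     // source; ignored by procfs, "proc" by convention
            "/proc",          // mount point
            Some("proc"),     // filesystem type
            MsFlags::empty(), // no extra mount flags
            None::<&str>,     // no filesystem-specific options
        )
    }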
    procfs is a pseudo-filesystem used by the Linux kernel to expose system information (such as a list of running processes) to applications. Mounting it within our container allows command-line utilities present in the container such as ps to obtain a list of running processes and display them to the user. Now get the list of running processes (as seen from within the container) by executing:
    # ps aux
    You should see output similar to the following:
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root           1  0.0  0.3  21964  3568 ?        S    08:18   0:00 bash
    root           3  0.0  0.2  19184  2304 ?        R+   08:21   0:00 ps aux
    If you've opened Task Manager in Windows, Activity Monitor in macOS or similar applications for listing running tasks before, you may notice that something seems a bit off. A normal running system has at least a few dozen processes running at any given moment, often hundreds or even thousands of them. Yet here we see only two. Within the container, though, this is good news - it means that our container can only see and interact with the processes belonging to it: the Bash shell that we spawned when starting the container, and ps aux (which would've died by the time you saw the output). This is process-level isolation provided by the Linux kernel through ...
    Process namespaces
    Linux defines the notion of namespaces, which serve as a mechanism for providing an isolated view of various system resources and parameters such as running processes and network configuration. The Linux kernel provides a number of syscalls for creating, entering and manipulating namespaces, such as unshare() and setns(). By default, Ruboxer creates a new process namespace for each container so that the container can only see its own processes, but it also accepts an option --procns-pid for specifying the process ID (PID) of an existing process whose process namespace the container should join instead.
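    As a rough illustration, creating or joining a process namespace from Rust might look like the sketch below, again assuming the nix crate; the exact setns() signature varies between nix versions (this sketch uses the raw file descriptor form), and Ruboxer's own code may differ.

    use nix::sched::{setns, unshare, CloneFlags};
    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    // Place *future children* of this process in a brand-new PID namespace.
    // The caller keeps its original PID; the next child it spawns becomes
    // PID 1 inside the new namespace (like the bash we saw above).
    fn new_pid_namespace() -> nix::Result<()> {
        unshare(CloneFlags::CLONE_NEWPID)
    }

    // Join the process namespace of an existing process instead - roughly what
    // an option like --procns-pid could translate to under the hood.
    fn join_pid_namespace(pid: u32) -> Result<(), Box<dyn std::error::Error>> {
        // Every process exposes handles to its namespaces under /proc/<pid>/ns/
        let ns = File::open(format!("/proc/{}/ns/pid", pid))?;
        setns(ns.as_raw_fd(), CloneFlags::CLONE_NEWPID)?;
        Ok(())
    }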
    To see this in action, open a new terminal session, keeping the original terminal session intact. In the new terminal session, execute $ ps aux | grep bash | grep root to list all processes with the strings "bash" and "root" in them (most of which are actual Bash processes executed by user root). You should see output similar to the following (possibly with occurrences of "bash" highlighted):
    root         999  0.0  0.4  11188  4720 pts/0    S    08:17   0:00 sudo ruboxer /home/dsleung/container1 bash
    root        1001  0.0  0.0   3140   704 pts/0    S    08:18   0:00 ruboxer /home/dsleung/container1 bash
    root        1002  0.0  0.3  21964  3568 pts/0    S+   08:18   0:00 bash
    In my case, the last line of output corresponds to the actual Bash process running in my container. The second column in the output is the PID of the process, so my Bash process has PID 1002. Note that this number is very likely different in your case so note down your number and substitute that for 1002 in the upcoming commands.
    Now, also in that new terminal session, run another container with Ruboxer using $HOME/container2 as filesystem root and bash as the command, but this time we join the process namespace corresponding to PID 1002 (replace with your PID) instead of creating a new process namespace:
    $ sudo ruboxer --procns-pid 1002 $HOME/container2 bash
    Now within this new container, mount procfs at /proc again:
    # mount -t proc proc /proc
    And get a list of all processes:
    # ps aux
    You should see output similar to:
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root           1  0.0  0.3  21964  3568 ?        S+   08:18   0:00 bash
    root           4  0.0  0.3  21960  3480 ?        S    08:53   0:00 bash
    root           6  0.0  0.2  19184  2444 ?        R+   08:54   0:00 ps aux
    Here we see two Bash processes plus a ps aux that already exited by the time we see the output. The first Bash process with PID 1 as seen from within the container is that running in our original container, while the second one with PID 4 (in my case, you may get a different PID number) is the current Bash process in the new container we just started. We can also try to execute # ps aux in our original container again to see this output (exact output may differ):
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root           1  0.0  0.3  21964  3572 ?        S    08:18   0:00 bash
    root           4  0.0  0.3  21960  3480 ?        S+   08:53   0:00 bash
    root           7  0.0  0.2  19184  2396 ?        R+   08:58   0:00 ps aux
    So our original container can see processes running in the new container as well. This is because the new container joined the process namespace of the original container so they can both see and interact with each other's processes, and nothing else.
    We just saw how process namespaces provide process-level isolation so container processes can interact with other processes within the same container but cannot interact with other processes on the host system or within other containers with which they do not share a process namespace. This means that a container process should not be able to (as an example) kill an important system process and bring the whole system down. However, what if a container process tries to eat up all RAM or CPU cycles etc. on the host, leaving the rest of the host system with insufficient resources to perform useful work?
    Memory cgroups
    Linux defines a notion of cgroups, short for "control groups", which enable quotas to be established for various processes on a wide range of system resources. For example, memory cgroups can be used to limit the maximum amount of RAM that a process (or group of processes) can acquire before it is killed by the system.
    There are two major versions of cgroups at the time of writing, known as cgroups v1 and v2 respectively, and most commercial Linux distributions currently still heavily rely on the former. In cgroups v1, resource quotas are read and set through the cgroup pseudo-filesystem exported by the Linux kernel, conventionally mounted under /sys/fs/cgroup with one subdirectory per controller (e.g. /sys/fs/cgroup/memory). By default, Ruboxer enforces a memory limit of 128MiB (base-2) for each container.
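    For a feel of what enforcing such a limit involves, here is a hedged sketch in Rust using only the standard library. The cgroup name is made up for illustration, and the paths assume the conventional cgroups v1 layout for the memory controller:

    use std::fs;
    use std::path::Path;

    // Create a memory cgroup, cap its RAM usage, and move a process into it.
    fn limit_memory(name: &str, limit_bytes: u64, pid: u32) -> std::io::Result<()> {
        let dir = Path::new("/sys/fs/cgroup/memory").join(name);
        // Creating the directory is what creates the cgroup
        fs::create_dir_all(&dir)?;
        // Maximum RAM the member processes may use before the OOM killer steps in
        fs::write(dir.join("memory.limit_in_bytes"), limit_bytes.to_string())?;
        // Writing a PID to cgroup.procs moves that process (and its future
        // children) into the cgroup
        fs::write(dir.join("cgroup.procs"), pid.to_string())?;
        Ok(())
    }

    // e.g. limit_memory("ruboxer-demo", 128 * 1024 * 1024, std::process::id())

    Once a process is in the cgroup, the kernel accounts every page it and its descendants allocate against memory.limit_in_bytes, which is how a default cap like Ruboxer's 128MiB can be enforced.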
    To see this in action, first run the following command to mount devtmpfs (yet another pseudo-filesystem) at /dev in both the original and new containers:
    # mount -t devtmpfs devtmpfs /dev
    Now, in one of the containers (doesn't matter which one), run the following command to print the contents of a memory-hungry application adapted from Eric Chiang's Containers from Scratch to the console:
    # cat /bin/memeater
    This program repeatedly reads zero-filled bytes from /dev/zero (which is only accessible once devtmpfs is mounted) in chunks of 16MiB (base-2), holding on to them until it exhausts all system memory or gets killed by the system. Now run it (in either container) by executing:
    # memeater
    and watch it get killed after the first few dozen MiBs. The output should be similar to the following:
    16MiB
    32MiB
    48MiB
    Killed
    Note that if you run it outside the container, it will just keep printing output until you kill it by pressing Ctrl-C or it eats up all the memory in your system, causing it to hang and become unresponsive.
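    If you're curious what such a program might look like, here is a hypothetical Rust sketch in the same spirit as memeater (it is not the actual source shipped in the image):

    use std::fs::File;
    use std::io::Read;

    fn main() -> std::io::Result<()> {
        const CHUNK: usize = 16 * 1024 * 1024; // 16MiB at a time
        let mut zero = File::open("/dev/zero")?;
        let mut hoard: Vec<Vec<u8>> = Vec::new();
        let mut total_mib = 0;
        loop {
            let mut buf = vec![0u8; CHUNK];
            // Fill the buffer from /dev/zero so every page is actually touched
            zero.read_exact(&mut buf)?;
            // Keep the chunk alive so the memory is never freed
            hoard.push(buf);
            total_mib += 16;
            println!("{}MiB", total_mib);
        }
    }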
    As a final exercise, let's adjust the limit to 512MiB. In the new container, unmount devtmpfs and procfs using umount (NOT a typo!) and exit the container:
    # umount /dev
    # umount /proc
    # exit
    You should see the command prompt change from # back to $. Now restart the new container with adjusted memory limits:
    $ sudo ruboxer --mem-max-bytes 512M $HOME/container2 bash
    Mount devtmpfs and run memeater:
    # mount -t devtmpfs devtmpfs /dev
    # memeater
    Notice how memeater is now allowed to eat much more memory before being killed:
    16MiB
    32MiB
    48MiB
    64MiB
    80MiB
    96MiB
    112MiB
    128MiB
    144MiB
    160MiB
    176MiB
    192MiB
    208MiB
    224MiB
    240MiB
    256MiB
    272MiB
    288MiB
    304MiB
    320MiB
    336MiB
    352MiB
    368MiB
    384MiB
    400MiB
    416MiB
    432MiB
    Killed
    This concludes our article on container internals. In a production-grade container runtime, many more isolation and sandboxing techniques are used to ensure security, including but not limited to:
  • Capabilities
  • Network namespaces
  • Seccomp profiles
  • Mandatory access control (MAC)
    I hope you enjoyed this article :-) You can find Ruboxer's source code at https://github.com/DonaldKellett/ruboxer. It is licensed under the permissive MIT license to encourage you to fork it and learn more about how containers work by making your own modifications to the source code.