Containers from first principles with Rust
Containers have taken the world by storm. Nowadays, many production environments employ containers in one form or another, running on developers' workstations, locally hosted servers and/or the cloud.
If you have worked as a developer or administrator in the industry, chances are you've worked with modern container technologies such as Docker and Kubernetes, or at least heard of them. From a high-level perspective, a container is an OS-level abstraction: a stripped-down operating system (OS) environment containing one or more applications (apps) of interest, running atop some host in an isolated manner. This provides an extra layer of security compared to running the app(s) directly on the host, and enables the app(s) to execute on different hosts (possibly with different hardware) in a consistent, reproducible manner, as long as the hosts all run the same OS kernel. But do you know how containers are implemented from a low-level perspective? I decided to find out by writing my own container runtime (if it can be called that) in Rust, with the additional goal of exploring Rust as a type- and memory-safe alternative to C for systems programming. If you'd like to learn about the internals of container runtimes, read on!
- For non-Linux OSes such as Windows or macOS: Ruboxer depends on a number of Linux-specific features and system calls (syscalls) and therefore cannot be run directly on your host system
- For Linux-based OSes: Ruboxer implements a very limited subset of the features found in proper container platforms such as Docker and Kubernetes, and as such provides very limited isolation from the host system. Therefore, you are strongly advised against installing and running Ruboxer on a production system or any system with potentially sensitive or private information. Running it in a throwaway VM provides an extra layer of isolation from your host system in case something goes wrong
You may use any hypervisor of your choice; for example, a popular one is VirtualBox. Windows 10 (or later) users may also be aware of a lightweight virtualization technology known as Windows Subsystem for Linux 2 (WSL2) which is not supported in this article - if you decide to use that, YMMV. There are many online tutorials on setting up virtual machines that are easily searchable so the VM setup process will not be covered here.
At the time of writing, this article has been tested against the following Linux distributions (distros) so it is recommended you install one of the following distros in your VM - if you decide to install another distro, YMMV:
- Ubuntu 20.04 LTS (desktop or server; the desktop edition is recommended for newcomers to Linux)
- CentOS 8 Stream
The rest of this article will be done at the command line within the VM (or cloud instance), so a basic familiarity with Bash will help (macOS users: bash is similar to zsh). Now that we have our VM set up and running, fire up the terminal, then download and install Ruboxer, entering your sudo password when prompted:
- Ubuntu 20.04 LTS:
$ wget https://github.com/DonaldKellett/ruboxer/releases/download/v0.1.0/ruboxer_0.1.0_amd64.deb && sudo apt install ./ruboxer_0.1.0_amd64.deb
- CentOS 8 Stream:
$ wget https://github.com/DonaldKellett/ruboxer/releases/download/v0.1.0/ruboxer-0.1.0-1.el8.x86_64.rpm && sudo dnf install ./ruboxer-0.1.0-1.el8.x86_64.rpm
Note: the $ or # at the beginning of each line represents your command prompt and should not be entered
The package you just installed contains a binary ruboxer, plus a man page you can read by entering the following command (press the up and down arrow keys to scroll, 'q' to exit): $ man 8 ruboxer. The man page is quite terse, so let's go through some examples instead.
In order to run a container, we first need an image, which is basically a stripped-down Linux filesystem containing the app(s) of interest and their dependencies. So let's fetch a customized Python image built for this article (adapted from Eric Chiang's Containers from Scratch):
$ wget https://github.com/DonaldKellett/containers-from-first-principles-with-rust/releases/download/v0.1.0/rootfs.tar.xz
(this will take a while to download)
Now unpack the image:
$ tar xvf rootfs.tar.xz
And rename it to something more intuitive:
$ mv rootfs container1
We'll also make a second copy of this unpacked image, for reasons we'll see later:
$ cp -r container1 container2
(this may take a few seconds)
Now we're good to go. Print usage instructions for the ruboxer command:
$ ruboxer --help
In its simplest form, ruboxer takes two arguments: the first is the absolute path to the unpacked image and the second is the command to execute within the container. Assuming that you executed the above commands in your home directory, the absolute path to your unpacked image is $HOME/container1. If not, you may need to modify the commands that follow accordingly to get them to work. Let's first peer into the unpacked image:
$ ls $HOME/container1
You should see the following output (exact spacing and indentation depend on terminal width):
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
Now compare this with your host system:
$ ls /
The output (exact details may differ):
bin dev lib libx32 mnt root snap sys var
boot etc lib32 lost+found opt run srv tmp
cdrom home lib64 media proc sbin swap.img usr
Looks kinda similar, right? An unpacked container image is basically the filesystem for a slimmed down Linux system (ideally) containing just the essentials to run the app(s) of interest.
Now we run Ruboxer with $HOME/container1 as the image and bash as the command, which will spawn a shell inside the container. sudo is required here to elevate privileges to root - we'll explain why in a moment:
$ sudo ruboxer $HOME/container1 bash
Notice that the command prompt changed from something like dsleung@ubuntu2:~$ to something like root@ubuntu2:/# (exact output may differ) - in particular, the $ changed to a # (indicating we're now running as root instead of a normal user) and the tilde ~ changed to / (indicating that we moved from our home directory to what appears to the container as the root directory of the filesystem). This should look familiar to you if you have peered inside a Docker container before, using a command similar to the following (this may not work on your newly installed Linux system if it doesn't have Docker installed and running):
$ docker container run -it ubuntu:focal /bin/bash
The only notable difference is that the hostname (everything between the @ and : above) of a Docker container might look something like 23f66e3c1dc8 instead of retaining the hostname of your host system.
Anyway, let's list the contents of the root directory again, as seen from inside the container:
# ls /
You should see the contents of the container, not those of your host system:
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
In fact, as far as the container is concerned, $HOME/container1 is the root directory / - the container cannot (and should not) see any files or directories outside of the container root $HOME/container1. This brings us to the POSIX syscall central to implementing containers: chroot().

chroot() is a syscall (i.e. an API call provided by the OS kernel to applications) that restricts the filesystem view of the calling process to a particular subdirectory. It is said to virtualize the filesystem, analogous to how containers virtualize the OS and VMs virtualize the hardware. However, it only provides filesystem-level isolation - the chroot()ed process can still view and interact with all processes running on the host system and access all device nodes, for example. Therefore, a container runtime must combine chroot() with other methods of isolation to prevent container processes from escaping the container into the host system.
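To make this concrete, here is a minimal Rust sketch of the chroot-then-exec sequence a container runtime might perform. This is not Ruboxer's actual source; it assumes the nix crate (a safe Rust wrapper around POSIX syscalls) and must be run as root:

```rust
// Hedged sketch: confine the current process to `rootfs`, then replace the
// process image with the requested command.
use std::convert::Infallible;
use std::ffi::CString;

use nix::unistd::{chdir, chroot, execvp};

fn enter_rootfs_and_exec(rootfs: &str, cmd: &str) -> nix::Result<Infallible> {
    chroot(rootfs)?; // this process now sees `rootfs` as /
    chdir("/")?;     // don't keep a working directory outside the new root
    let prog = CString::new(cmd).expect("command contains a NUL byte");
    execvp(&prog, &[prog.clone()]) // PATH lookup happens inside the new root
}
```

Note the chdir("/") call: without it, the process retains a working directory outside the new root, which is a classic chroot() escape vector.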
In fact, Ruboxer does a bit more than chroot(). Inside the container, run the following command to mount the proc filesystem at /proc:
# mount -t proc proc /proc
procfs is a pseudo-filesystem used by the Linux kernel to expose system information (such as a list of running processes) to applications. Mounting it within our container allows command-line utilities present in the container, such as ps, to obtain a list of running processes and display them to the user.
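Incidentally, a runtime could perform this mount programmatically rather than shelling out to mount(8). A hedged sketch using the nix crate's mount(2) wrapper (an assumption on my part, not necessarily how Ruboxer does it):

```rust
// Sketch: mount procfs at /proc from inside the container.
use nix::mount::{mount, MsFlags};

fn mount_proc() -> nix::Result<()> {
    mount(
        Some("proc"),     // source (conventionally "proc" for procfs)
        "/proc",          // target directory inside the container
        Some("proc"),     // filesystem type
        MsFlags::empty(), // no special mount flags
        None::<&str>,     // no extra mount options
    )
}
```

Now get the list of running processes (as seen from within the container) by executing: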
# ps aux
You should see output similar to the following:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.3 21964 3568 ? S 08:18 0:00 bash
root 3 0.0 0.2 19184 2304 ? R+ 08:21 0:00 ps aux
If you've opened Task Manager in Windows, Activity Monitor in macOS or a similar application for listing running tasks before, you may notice that something seems a bit off. A normal running system should have at least a few dozen processes running at any given moment, often hundreds or even thousands of them. Yet here we see only 2! Within the container, this is good news - it means that our container can only see and interact with the processes belonging to it: the Bash shell that we spawned when starting the container, and ps aux itself (which would've exited by the time you saw the output). This process-level isolation is provided by the Linux kernel through namespaces.
Linux defines the notion of namespaces, which serve as a mechanism for providing an isolated view of various system resources and parameters, such as running processes and network configuration. The Linux kernel provides a number of syscalls for creating, entering and manipulating namespaces, such as unshare() and setns(). By default, Ruboxer creates a new process namespace for each container so that it can only see its own processes, but it also accepts an option --procns-pid for specifying the PID of an existing process whose process namespace the new container should join instead.
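Here is a hedged Rust sketch of both behaviours, again assuming the nix crate (a recent version, whose setns() accepts any file-descriptor type); this is not Ruboxer's actual source:

```rust
use std::error::Error;
use std::fs::File;

use nix::sched::{setns, unshare, CloneFlags};

// Create a fresh PID namespace: children forked after this call are placed
// in the new namespace (the calling process itself stays where it is).
fn new_pid_namespace() -> nix::Result<()> {
    unshare(CloneFlags::CLONE_NEWPID)
}

// Roughly what an option like --procns-pid implies: join the PID namespace
// of an existing process via its /proc/<pid>/ns/pid handle. As with
// unshare(), this takes effect for children forked afterwards.
fn join_pid_namespace(pid: u32) -> Result<(), Box<dyn Error>> {
    let ns = File::open(format!("/proc/{}/ns/pid", pid))?;
    setns(ns, CloneFlags::CLONE_NEWPID)?;
    Ok(())
}
```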
To see this in action, open a new terminal session, keeping the original terminal session intact. In the new terminal session, execute $ ps aux | grep bash | grep root to list all processes with the strings "bash" and "root" in them (most of which are actual Bash processes executed by user root). You should see output similar to the following (possibly with occurrences of "bash" highlighted):
root 999 0.0 0.4 11188 4720 pts/0 S 08:17 0:00 sudo ruboxer /home/dsleung/container1 bash
root 1001 0.0 0.0 3140 704 pts/0 S 08:18 0:00 ruboxer /home/dsleung/container1 bash
root 1002 0.0 0.3 21964 3568 pts/0 S+ 08:18 0:00 bash
In my case, the last line of output corresponds to the actual Bash process running in my container. The second column in the output is the PID of the process, so my Bash process has PID 1002. Note that this number is very likely different in your case, so note down your number and substitute it for 1002 in the upcoming commands.
Now, also in that new terminal session, run another container with Ruboxer using $HOME/container2 as the filesystem root and bash as the command, but this time joining the process namespace corresponding to PID 1002 (replace with your PID) instead of creating a new process namespace:
$ sudo ruboxer --procns-pid 1002 $HOME/container2 bash
Now, within this new container, mount procfs at /proc again:
# mount -t proc proc /proc
And get a list of all processes:
# ps aux
You should see output similar to:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.3 21964 3568 ? S+ 08:18 0:00 bash
root 4 0.0 0.3 21960 3480 ? S 08:53 0:00 bash
root 6 0.0 0.2 19184 2444 ? R+ 08:54 0:00 ps aux
Here we see two Bash processes, plus a ps aux that had already exited by the time we saw the output. The first Bash process, with PID 1 as seen from within the container, is the one running in our original container, while the second, with PID 4 (in my case; you may get a different number), is the current Bash process in the new container we just started. We can also execute # ps aux in our original container again to see this output (exact output may differ):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.3 21964 3572 ? S 08:18 0:00 bash
root 4 0.0 0.3 21960 3480 ? S+ 08:53 0:00 bash
root 7 0.0 0.2 19184 2396 ? R+ 08:58 0:00 ps aux
So our original container can see processes running in the new container as well. This is because the new container joined the process namespace of the original container, so the two can see and interact with each other's processes - but with nothing else on the system.
We just saw how process namespaces provide process-level isolation: container processes can interact with other processes within the same container, but not with processes on the host system or in other containers with which they do not share a process namespace. This means that a container process should not be able to, for example, kill an important system process and bring the whole system down. However, what if a container process tries to eat up all the RAM or CPU cycles on the host, leaving the rest of the host system with insufficient resources to perform useful work?
Linux defines a notion of cgroups, short for "control groups", which enable quotas to be established for processes across a wide range of system resources. For example, memory cgroups can be used to limit the maximum amount of RAM that a process (or group of processes) can acquire before it is killed by the system.
There are two major versions of cgroups at the time of writing, known as cgroups v1 and v2 respectively, and most commercial Linux distributions currently still rely heavily on the former. In cgroups v1, resource quotas can be read and set through the cgroup pseudo-filesystem exported by the Linux kernel, typically mounted under /sys/fs/cgroup. By default, Ruboxer enforces a memory limit of 128MiB (base-2) for each container.
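As a hedged illustration of the mechanism (assumed paths and a hypothetical cgroup name, not Ruboxer's actual source), a runtime could impose such a limit in cgroups v1 with a handful of file writes:

```rust
use std::fs;
use std::process;

// Sketch: cap the memory of the current process (and its future children)
// via the cgroups v1 memory controller. Requires root privileges.
fn limit_memory(limit_bytes: u64) -> std::io::Result<()> {
    // Hypothetical cgroup name for this demo.
    let cg = "/sys/fs/cgroup/memory/ruboxer-demo";
    fs::create_dir_all(cg)?;
    // Once the cgroup exceeds this limit, the kernel's OOM killer steps in.
    fs::write(format!("{}/memory.limit_in_bytes", cg), limit_bytes.to_string())?;
    // Move the current process into the cgroup.
    fs::write(format!("{}/cgroup.procs", cg), process::id().to_string())?;
    Ok(())
}
```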
To see this in action, first run the following command to mount devtmpfs (yet another pseudo-filesystem) at /dev in both the original and new containers:
# mount -t devtmpfs devtmpfs /dev
Now, in one of the containers (doesn't matter which one), run the following command to print the contents of a memory-hungry application adapted from Eric Chiang's Containers from Scratch to the console:
# cat /bin/memeater
This program repeatedly reads zero-filled bytes from /dev/zero (which can only be accessed once devtmpfs is mounted) in chunks of 16MiB (base-2), until it exhausts all system memory or gets killed by the system. Now run it (in either container) by executing:
# memeater
and watch it get killed after the first few dozen MiBs. The output should be similar to the following:
16MiB
32MiB
48MiB
Killed
Note that if you run it outside the container, it will just keep printing output until you kill it by pressing Ctrl-C or it eats up all the memory in your system, causing the system to hang and become unresponsive.
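For reference, a program with memeater's observed behaviour could be sketched in Rust as follows (the bundled binary is adapted from Eric Chiang's Containers from Scratch; this is not its actual source):

```rust
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    const CHUNK: usize = 16 * 1024 * 1024; // 16MiB (base-2)
    let mut zero = File::open("/dev/zero")?;
    let mut hoard: Vec<Vec<u8>> = Vec::new();
    loop {
        let mut chunk = vec![0u8; CHUNK];
        zero.read_exact(&mut chunk)?; // touch every page so it stays resident
        hoard.push(chunk); // never freed: memory use grows without bound
        println!("{}MiB", hoard.len() * 16);
    }
}
```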
As a final exercise, let's adjust the limit to 512MiB. In the new container, unmount devtmpfs and procfs using umount (NOT a typo!) and exit the container:
# umount /dev
# umount /proc
# exit
You should see the command prompt change from # back to $. Now restart the new container with an adjusted memory limit:
$ sudo ruboxer --mem-max-bytes 512M $HOME/container2 bash
Mount devtmpfs and run memeater:
# mount -t devtmpfs devtmpfs /dev
# memeater
Notice how memeater is now allowed to eat much more memory before being killed:
16MiB
32MiB
48MiB
64MiB
80MiB
96MiB
112MiB
128MiB
144MiB
160MiB
176MiB
192MiB
208MiB
224MiB
240MiB
256MiB
272MiB
288MiB
304MiB
320MiB
336MiB
352MiB
368MiB
384MiB
400MiB
416MiB
432MiB
Killed
This concludes our article on container internals. In a production-grade container runtime, many more isolation and sandboxing techniques are used to ensure security, including but not limited to:
- Capabilities
- Network namespaces
- Seccomp profiles
- Mandatory access control (MAC)
I hope you enjoyed this article :-) You can find Ruboxer's source code at https://github.com/DonaldKellett/ruboxer, licensed under the permissive MIT license, so feel free to fork it and learn more about how containers work by making your own modifications to the code.
- Ruboxer on GitHub: https://github.com/DonaldKellett/ruboxer
- Eric Chiang's "Containers from Scratch": https://ericchiang.github.io/post/containers-from-scratch/
- chroot(2) man page: https://man7.org/linux/man-pages/man2/chroot.2.html
- unshare(2) man page: https://man7.org/linux/man-pages/man2/unshare.2.html
- setns(2) man page: https://man7.org/linux/man-pages/man2/setns.2.html
- cgroups(7) man page: https://man7.org/linux/man-pages/man7/cgroups.7.html
- capabilities(7) man page: https://man7.org/linux/man-pages/man7/capabilities.7.html
- seccomp(2) man page: https://man7.org/linux/man-pages/man2/seccomp.2.html
- MAC (Wikipedia): https://en.wikipedia.org/wiki/Mandatory_access_control