The dark side of Docker: Avoid the “latest” tag
We rely on Docker, and it’s one of our favorite technologies. But using Docker for commercial software shows some rough edges that we found the hard way.
It seems like the “latest” tag should work, but it really doesn’t, except in the simplest cases. We didn’t realize how crazy broken this could be until one night we were on a Zoom call and watched a new customer install software that was three months old, when there were 10 newer builds available. Not a good feeling.
We’re not the first to recognize there were major things wrong with the latest tag, but we really didn’t see these problems at first. We’ve now gone the other way and removed our “latest” tags from our Docker images. There were a couple of things that put us over the edge.
The “latest” pattern bends or breaks the kind of caching that every CDN wants to do, because over time this leads to multiple artifacts with the same name but different content. What CDNs really like is immutable artifacts with unique version numbers. What’s really awful is that if the latest tag fails due to this kind of caching, you’ll never know. The latest tag will always resolve to something, even if that something is months old because of some lame caching issue.
The other crazy thing about “latest” is that the actual version number is not easy for users to find. Let’s assume you publish an image under a specific version name and with the “latest” tag. After install, the output of docker images will show “latest” and not the version number, even if a newer version of the container is installed. So how are users supposed to figure out what versions are actually installed? By falling back to comparing the image id hashes, which is an insane use of people’s time.
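One way to spare users the hash comparison is to record the version inside the image itself. The sketch below is an illustration rather than our exact build script: the image name `myapp` and version `1.4.2` are made up, and it assumes the standard OCI label key `org.opencontainers.image.version` is an acceptable place to store the version.

```shell
# Build with the version recorded both as the tag and as an OCI label.
# Usage (hypothetical names): build_versioned myapp 1.4.2
build_versioned() {
  docker build \
    --tag "$1:$2" \
    --label "org.opencontainers.image.version=$2" \
    .
}

# Print the version label of a local image, whatever tag it was pulled under.
# Usage (hypothetical name): image_version myapp:1.4.2
image_version() {
  docker image inspect \
    --format '{{ index .Config.Labels "org.opencontainers.image.version" }}' \
    "$1"
}
```

Because the label travels with the image, it should still report the real version even if someone retags the image locally.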
We get it — it feels like “latest” as documented wants to be an alias for the most recent container…but because this is actually implemented as a named tag, this leaves the door open for race conditions, CDN misfires, and confusion. Just don’t do it.
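If you want to enforce this in a release script, a small guard is enough. This is a minimal sketch under our own assumptions: `is_pinned_ref` and `publish` are hypothetical helper names, and the registry and image names in the comments are placeholders.

```shell
# Return 0 only if the image reference carries an explicit, non-"latest" tag
# (or an immutable digest).
is_pinned_ref() {
  local last
  last="${1##*/}"   # drop registry/namespace so a registry port is not mistaken for a tag
  case "$last" in
    *@sha256:*) return 0 ;;   # digest references are immutable
    *:latest)   return 1 ;;   # explicit "latest"
    *:?*)       return 0 ;;   # any other non-empty tag
    *)          return 1 ;;   # no tag at all: Docker would default to "latest"
  esac
}

# Build and push, but only for pinned references.
# Usage (hypothetical): publish registry.example.com/acme/myapp:1.4.2
publish() {
  if ! is_pinned_ref "$1"; then
    echo "refusing to publish unpinned reference: $1" >&2
    return 1
  fi
  docker build --tag "$1" . && docker push "$1"
}
```

The guard only checks the shape of the reference. It does not stop someone from pushing a different image under a version tag that already exists, so immutable version numbers remain a team convention.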
Docker images are built in layers, and Docker aggressively caches these layers to avoid having to rebuild or download bits that haven’t changed. The problem is that this caching is based on the text of the instructions in the Dockerfile, not on the results of running them.
Where you’re pretty much guaranteed to see this issue is when applying security and package updates. You’ll be wondering why “apt upgrade” worked at first but then fails to pick up any more updates. The reason is that the “apt upgrade” script didn’t change, so Docker will happily and silently use results from an earlier build, while you spend hours tearing your hair out.
This is easy to work around by passing the --no-cache flag to docker build, but once you start doing this everywhere, you’ll be left wondering why the default behavior for Docker is to be silently vulnerable by relying on cached layers that are missing security patches.
Perhaps the worst observation here is that building with --no-cache hasn’t really cost us anything. Our images are relatively small and so we weren’t getting a lot of benefit from that aggressive caching anyway.
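For reference, a cache-free build can be wrapped up as below. This is a sketch, not our actual build script: `rebuild_fresh` is a hypothetical helper name, and adding `--pull` is our own suggestion on top of the `--no-cache` workaround described above.

```shell
# Rebuild an image without trusting anything cached locally.
# Usage (hypothetical name): rebuild_fresh myapp:1.4.2
rebuild_fresh() {
  # --no-cache: ignore every cached layer, so package-update steps really re-run.
  # --pull:     also re-fetch the base image instead of reusing the local copy.
  docker build --no-cache --pull --tag "$1" .
}
```

Without `--pull`, a stale base image on the build machine can reintroduce the same problem one level down.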
Docker espouses the idea of each container having a single command that it executes. This is great for simple cases, but let’s say that you have three lightweight services that run together. Some folks would argue that this should always be broken down into three separate containers, but that doesn’t come without cost. Now you have 3X as many components to build/manage/version, and are more limited in how these processes can interact with each other, because they are completely separate. Refactoring these services into a different configuration means heavyweight changes to the build & packaging. So there’s no one-size-fits-all answer here.
Our experience is that the ugliest bit around running multiple lightweight services is the scripting that’s involved. We started down the path of doing custom scripting for this, and found issues around startup order and race conditions that quickly became a distraction. This led to a pretty spirited debate about whether we needed to throw out everything and start with a single-process set of containers, which would have themselves been subject to the same ordering and race condition pitfalls.
Instead we kept our services together and standardized on supervisord as the main control point for the container. This allows us to run multiple services in a single-container package with confidence, without writing any custom scripts. Easy peasy.
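To give a feel for what this looks like, here is a minimal sketch that writes out a supervisord configuration for three services. The program names and paths (`api`, `worker`, `scheduler` under `/opt/acme/bin`) are invented for illustration and are not our real services, and `write_supervisord_conf` is a hypothetical helper.

```shell
# Write a minimal supervisord config for three services to the given path.
# Usage (hypothetical path): write_supervisord_conf /etc/supervisord.conf
write_supervisord_conf() {
  cat > "$1" <<'EOF'
[supervisord]
nodaemon=true ; stay in the foreground so the container keeps running

[program:api]
command=/opt/acme/bin/api
priority=10 ; lower priority values are started first
autorestart=true

[program:worker]
command=/opt/acme/bin/worker
priority=20
autorestart=true

[program:scheduler]
command=/opt/acme/bin/scheduler
priority=30
autorestart=true
EOF
}
```

The container’s single command then becomes something like `supervisord -c /etc/supervisord.conf`. Note that `priority` only controls the order in which programs are started. It does not wait for one service to be ready before starting the next, so services still need to tolerate a dependency that is briefly unavailable.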
P.S. Why not docker-compose or Kubernetes? Because those are heavy and complex. You don’t need a chainsaw to open a can of tuna.
When Docker started gaining a lot of steam, a lot of the justification was that Docker containers would be much smaller and have less overhead than the traditional virtual machines that they replace. You won’t have to run an extra copy of Ubuntu, you’ll have a teeny container instead! Unfortunately this is far from guaranteed.
There’s also a myth that Alpine, as the most obvious teeny distribution, is universally better than other distros. It’s tiny, so it must be faster, right? Not necessarily! We saw significant performance regressions on Alpine that we didn’t see on Red Hat or Ubuntu, because Alpine uses musl libc rather than glibc. Ultimately, we decided that the extra time and energy to tune our configuration on Alpine was worth it to get the smallest download size possible. But this was a significant time investment, and it wasn’t obvious how to do it at first.
Docker makes it possible to build very small and performant containers, but most of the burden to streamline performance and download size still falls on the container developers. It’s really easy to publish a multi-GB image based on the same distro that you used for development. Getting to a featherweight image still requires skill and care.
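A reasonable first step is simply measuring where the bytes go. The helpers below are a sketch with hypothetical names (`image_size`, `layer_sizes`) wrapped around standard Docker commands. They only report sizes and do not shrink anything.

```shell
# Show the total size of a local image.
# Usage (hypothetical name): image_size myapp:1.4.2
image_size() {
  docker image ls --format '{{.Repository}}:{{.Tag}}  {{.Size}}' "$1"
}

# Show the size of each layer next to the Dockerfile instruction that created it.
# Usage (hypothetical name): layer_sizes myapp:1.4.2
layer_sizes() {
  docker history --no-trunc --format '{{.Size}}  {{.CreatedBy}}' "$1"
}
```

The per-layer view usually makes it obvious which instruction dragged in the compilers, caches, or package indexes that don’t belong in a shipped image.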
Do any of these issues keep us from using and recommending Docker? Nope, most of these are easy to avoid, if you know what pitfalls to watch out for in advance. Hopefully this post saves you some time!