Reaction to Ably's viral blog post and subsequent outage
A few days ago, Ably engineer Maik Zumstrull published a now-viral article, “No, we don’t use Kubernetes,” detailing how the company manages large-scale computing. Kubernetes naysayers gloated.
Then yesterday, somewhat ironically, the popularity of that blog post drove enough traffic to cause an outage at Ably. Kubernetes fans gloated.
Today I offer my 2¢ on the whole thing.
First, I didn’t even read the article until this morning. I’ve grown tired of the constant bickering about Kubernetes vs no Kubernetes, as both sides of the debate tend to suffer from the exact same problem: projecting local, situational success to the status of universal truth.
But I digress. I finally read the original article today, and here’s my take:
The point of the article is not, per se, that nobody should use Kubernetes. Rather, it’s framed more as an explanation of how and why Ably, in their specific context, does not use Kubernetes. This is great. Here’s a key sentence from the article that I think serves as a great thesis statement:
[With Kubernetes] we would be doing mostly the same things, but in a more complicated way.
Notice there’s no statement of universal truth here. It’s a contextual statement about Ably’s situation. And based on the preceding description, this conclusion is probably correct.
Now the article does go on to disparage the complexity inherent in managing Kubernetes:
To move to Kubernetes, an organization needs a full engineering team just to keep the Kubernetes clusters running, and that’s assuming a managed Kubernetes service.
I believe this is going too far. I’ve never seen a small company with a team dedicated to managing Kubernetes. At Ably’s scale, that may well be necessary, but as a general claim, especially with managed Kubernetes services, it simply isn’t true. It also ignores the fact that Ably’s home-grown automation requires management, too.
So by and large, I like the article as a case study in a viable alternative to Kubernetes. What I disagree with is projecting this successful case study onto the general, universal case.
And finally, before I close: The outage they experienced most likely would not have been mitigated if they had been using Kubernetes. So gloating is not warranted here, either. According to their public incident details:
The cause was a cloudflare misconfiguration that meant that recent blog posts were not being cached, which put unexpected stress on the site.
If you enjoyed this message, subscribe to The Daily Commit to get future messages to your inbox.