What is your error budget?

Part of the reason I built The Slate is that I've seen this pattern too many times.

Service A (the one people interact with) relies on Services B and C. Services B and C constantly fail, but it is Service A that gets a bad rap with its users. The teams responsible for B and C don't know about the impact they are having on Service A. The team running Service A doesn't know the error budgets of Services B and C, which commit to much lower availability.

What's an error budget?

Put simply, an error budget is the maximum amount of time that a team is "happy" to allow a system or service to fail. Usually you'll see it expressed as a percentage of availability (e.g. 99.99% uptime!), and the budget is whatever is left over: 99.99% uptime leaves 0.01% of the time for failure.

If you break those percentages down into allowed downtime per month, you get:

  • 99.000% is 7h 18m 18s per month
  • 99.900% is 43m 50s per month
  • 99.990% is 4m 23s per month
  • 99.999% is 26s per month
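Those figures are simple arithmetic, and a rough sketch makes it easy to work out the allowance for any other target. The function name `allowed_downtime` is made up for illustration, and the month length (365.25 days divided by 12) is an assumption, chosen because it reproduces the numbers in the list above.

```python
from datetime import timedelta

# Assumption: an "average" month is 365.25 / 12 days.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def allowed_downtime(availability_percent: float,
                     period: timedelta = AVERAGE_MONTH) -> timedelta:
    """Return the downtime an availability target permits over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    # Round to whole seconds so the result reads like the list above.
    return timedelta(seconds=round(period.total_seconds() * unavailable_fraction))


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% allows {allowed_downtime(target)} of downtime per month")
```

Passing a different `period` (say, `timedelta(days=7)`) gives the weekly allowance instead.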

Some of these are pretty tough targets to hit, and they might not be required for your business. I'd say a marketing site could easily handle an error budget of 99.9%. An e-commerce site might need to be 99.99%.

Just remember, five 9s (99.999%) can be significantly more expensive to achieve than four 9s (99.99%), particularly if you don't have the budget, resources or business support to get there.

What about those upstream services?

The services you depend on might have a different error budget. As in my earlier example, say Service A had a budget of four 9s, and its team had the funding, processes and resources in place to achieve that. But the other services had error budgets of two 9s.
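To put a rough number on that mismatch, here is a minimal sketch of the best availability Service A can expect. It makes two simplifying assumptions: Service A is only up when every dependency is up (no fallback), and failures are independent of each other. The function name `combined_availability` is made up for illustration.

```python
from math import prod
from typing import Sequence


def combined_availability(own_percent: float,
                          dependency_percents: Sequence[float]) -> float:
    """Best-case availability (in percent) when every dependency must be up.

    Assumes independent failures and no fallback or caching.
    """
    fractions = [own_percent / 100] + [p / 100 for p in dependency_percents]
    return prod(fractions) * 100


# Service A targets four 9s; Services B and C each target two 9s.
print(round(combined_availability(99.99, [99.0, 99.0]), 3))  # prints 98.0
```

Under those assumptions, a four 9s service sitting on top of two services at two 9s ends up at roughly 98%, which is lower than either dependency on its own.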

As the owner of Service A, you have a choice. You might be able to build Service A to be resilient to downtime in Service B and/or C. But if you can't, then you've got two options:

  • Reconsider your error budget to reflect your dependencies' budgets
  • Speak to the other services' teams about their error budgets, and whether it is possible to improve them

Who is happy with downtime?!

No one really is. But you have to be realistic: not every service is, or should be, five 9s. Outages happen. How we respond to them, learn from them and adapt to them is the vital part here.

The first step in dealing with them, however, is being transparent about what each service's error budget is, and how it might affect the services that depend on it.
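One way to picture that transparency is a shared catalogue where each service publishes its target and its dependencies, so that mismatches can be spotted automatically. The sketch below is purely illustrative: the `Service` class, the `weaker_dependencies` function and the catalogue contents are all hypothetical, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    """One entry in a hypothetical service catalogue."""
    name: str
    target_percent: float  # published availability target
    depends_on: List[str] = field(default_factory=list)


def weaker_dependencies(service: Service,
                        catalogue: Dict[str, Service]) -> List[Service]:
    """Return the dependencies whose target is lower than the service's own."""
    return [
        catalogue[dep]
        for dep in service.depends_on
        if catalogue[dep].target_percent < service.target_percent
    ]


# Example data matching the scenario in this post.
catalogue = {
    "A": Service("A", 99.99, depends_on=["B", "C"]),
    "B": Service("B", 99.0),
    "C": Service("C", 99.0),
}

for dep in weaker_dependencies(catalogue["A"], catalogue):
    print(f"Service A targets {catalogue['A'].target_percent}% but depends on "
          f"Service {dep.name} at {dep.target_percent}%")
```

Even a check this small gives the team running Service A something concrete to take to the teams behind B and C.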

This is one very small aspect of the governance of your microservices, but it can easily increase your reliability and resilience.
