It is sometimes the case that you are presented with a debugging problem that seems hopelessly complex. You hear a bug report, or see something weird happening, and you think "WTF?" This post is about that.

Now to be sure, if you are intimately familiar with your own code, you will often be told of some odd behaviour that seems mysterious to a support engineer looking in from the outside, and you will immediately feel that "click" in your head and know more or less exactly what is going on. This post is not about that.

How do you go about finding the root cause of a problem in a complex system, involving (say) one or more database tables, multiple micro-services talking to each other, and a third-party software package that you did not write and whose internals you know only a little about? Reduce the scope of what you are looking at.

There are two major aspects to reducing the scope of the problem, and both need to be done for you to succeed. The order in which you do them is interchangeable and depends entirely on the nature of the problem, but eventually you'll need both. The aspects you need to focus on are:

  1. Reducing the implicated components of the system to a single component.
  2. Reducing the manifestation of the problem to as tight a use-case as possible.

Reducing problem "surface area"

When you have a complex system involving a lot of moving parts, you absolutely need to narrow down the scope of the issue to as small a number of parts as possible (ideally a single one) before digging in to find the root cause. In the hypothetical case from above: is the problem related to data (a wrong value in a database, or an incorrect assumption about what a value means or when it is written)? Is it in the communication between the services (are they living up to their service contracts, and is data being transformed correctly as it moves from one to the next)? Or is the third-party software at fault (is there a bug or, more likely, are you feeding it bad data or assuming a behaviour it does not actually have)?

This type of reduction is critically important to allow you to then further narrow down the cause of the issue, in part because you can then involve other team members once you know, roughly, where the problem lies, whether it be DBAs, service authors, or third-party support. And as you'll see in a moment, this reduction can feed the other key reduction.

Note that if you have a monolithic software system, the same kind of reduction is possible because the same types of abstractions are almost certainly going to exist between your data layer, communication between your internal components, and reliance on third-party libraries or external systems (including the OS). Any system of moderate to high complexity will necessarily involve the cooperation of separate components, and your goal is to find the one that is broken.

And while the tone of this post is focused on back-end systems (since that is my thing), the exact same principle applies to UI components and UX flows.

Reducing the problem manifestation

In order to effectively reduce the problem surface area, you are going to need to reproduce the problem, because doing it by inspection based on verbal reports and log files is only going to get you so far (i.e., not very). Surface area reduction is probably going to involve some kind of binary search through the components that are involved, which means you are going to have to reproduce the problem over and over again. And for an issue that takes a large number of steps, or an extended period of time to manifest, that can be a real problem.
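As a rough sketch of what that binary search can look like, assume the components can be modelled as an ordered chain of stages that you can run one prefix at a time, and that you have some check that tells you whether the intermediate data still looks sane. Everything here (the stage functions, the record fields, the `looks_ok` check) is made up for illustration. The search also assumes the input is good, the final output is bad, and that once the data goes bad it stays bad.

```python
from typing import Any, Callable, Sequence

Stage = Callable[[Any], Any]


def run_prefix(stages: Sequence[Stage], data: Any, count: int) -> Any:
    """Run only the first `count` stages and return the intermediate result."""
    for stage in stages[:count]:
        data = stage(data)
    return data


def first_bad_stage(stages: Sequence[Stage], data: Any,
                    looks_ok: Callable[[Any], bool]) -> int:
    """Return the index of the first stage whose output fails `looks_ok`.

    Assumes the raw input passes the check, the full chain fails it,
    and that bad data stays bad through the later stages.
    """
    lo, hi = 0, len(stages)  # output is good after `lo` stages, bad after `hi`
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if looks_ok(run_prefix(stages, data, mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1


# Hypothetical stages standing in for real components.
def add_tax(rec: dict) -> dict:
    return {**rec, "tax": rec["amount"] * 0.2}


def to_cents(rec: dict) -> dict:
    return {**rec, "amount": rec["amount"] * 100}


def strip_internal(rec: dict) -> dict:
    # Deliberate bug for the demo: drops the "id" field.
    return {k: v for k, v in rec.items() if k != "id"}


def add_label(rec: dict) -> dict:
    return {**rec, "label": "invoice"}


STAGES = [add_tax, to_cents, strip_internal, add_label]

culprit = first_bad_stage(STAGES, {"id": 7, "amount": 10}, lambda rec: "id" in rec)
print(STAGES[culprit].__name__)
```

Each probe re-runs the chain from the start, which is exactly why a cheap, fast reproduction matters: the number of probes is logarithmic in the number of stages, but every probe costs one full reproduction up to that point.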

So a second essential reduction is to make reproducing the problem as simple as possible. This has two key benefits:

  1. Having a simple representative scenario allows you to repeat the issue efficiently, making the reduction of the surface area as quick as possible.
  2. Having a simple representative scenario allows you to definitively close the issue, by demonstrating before/after behaviour that can be replicated by QA and in unit or automation tests.
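To make the second benefit concrete, here is a minimal sketch of a reduced scenario captured as a pair of pytest-style tests. The bug itself is invented for the example (a key lookup that misses when the key has a trailing space), as are all of the function names. The point is only the shape: the whole reproduction fits in one small function, and the same function demonstrates both the "before" and the "after".

```python
def normalise_key_before(key: str) -> str:
    # Hypothetical buggy behaviour: trailing whitespace is kept, so lookups miss.
    return key.lower()


def normalise_key_after(key: str) -> str:
    # Hypothetical fix: strip surrounding whitespace before comparing.
    return key.strip().lower()


def reduced_scenario(normalise) -> bool:
    """The whole repro: one stored key, one lookup key with a trailing space."""
    stored = {normalise("ABC-1"): "widget"}
    return normalise("abc-1 ") in stored


def test_before_shows_the_bug():
    # The lookup misses with the old behaviour.
    assert reduced_scenario(normalise_key_before) is False


def test_after_closes_the_issue():
    # The same scenario succeeds once the fix is in place.
    assert reduced_scenario(normalise_key_after) is True
```

In a real code base you would not keep the buggy version around. The "before" test is what you run against the unfixed build to prove the scenario really does fail, and the "after" test stays in the suite as the regression check.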

It is easy to see how having a simple exhibit of the issue can help. Part of the art of debugging, however, is in finding that simple case. Again, sometimes this will be obvious, as long as it is on your list of things to do for any debugging problem. But sometimes it is not, and so this reduction is often done iteratively with the reduction of the surface area, until both converge on a case that demonstrates the meat of the issue without necessarily having to go through a complex set of steps.

There is a subtle aspect to this that you need to be mindful of: the representative scenario must faithfully reproduce the problem. If you are creating a proxy scenario that reduces the problem, you need to make absolutely sure it is faithful to the original and is not masking some other problem because you believe the problem is one thing when it is really another. This bias is always there, so when you think you are done, fact-check your result against the real, reported issue.

I was recently debugging an issue where the problem could not be reproduced. Not in the lab, not on my machine. But the issue was 100% reliably reproducible on a customer system. And I could not connect to the customer system to step through the issue. In this case, reduction was critical to solving the issue. Through a report of the problem, and limited log files, I was able to narrow down the issue to a single component, although it was still a fairly big net. But having done that reduction, I was then able to write a simple command-line tool that ran the same set of steps on the same component as the larger system did, with the logging cranked up. This tool was then provided to the customer to run on their system, and the log output immediately highlighted the root cause.
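I can't share the actual tool, but the general shape of such a thing is easy to sketch. What follows is not the tool I wrote, just a minimal illustration of the idea: the step functions, the `target` argument and the `--log-file` option are all hypothetical stand-ins for whatever the larger system really does to the suspect component.

```python
import argparse
import logging
from typing import Optional, Sequence

log = logging.getLogger("repro")


# Hypothetical stand-ins for the calls the larger system makes on the component.
def open_resource(ctx: dict) -> None:
    ctx["handle"] = f"handle:{ctx['target']}"


def read_settings(ctx: dict) -> None:
    ctx["settings"] = {"timeout": 30}


def perform_operation(ctx: dict) -> None:
    ctx["result"] = len(ctx["handle"]) * ctx["settings"]["timeout"]


STEPS = [open_resource, read_settings, perform_operation]


def run(target: str) -> int:
    """Run each step in order, logging the full state before and after it."""
    ctx = {"target": target}
    for step in STEPS:
        log.debug("before %s: %r", step.__name__, ctx)
        try:
            step(ctx)
        except Exception:
            # log.exception records the full traceback in the log output.
            log.exception("step %s failed", step.__name__)
            return 1
        log.debug("after %s: %r", step.__name__, ctx)
    log.info("all %d steps completed", len(STEPS))
    return 0


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(
        description="Replay the suspect component's steps with verbose logging.")
    parser.add_argument("target", help="what to run the steps against")
    parser.add_argument("--log-file", default="repro.log",
                        help="where to write the debug log")
    args = parser.parse_args(argv)
    logging.basicConfig(filename=args.log_file, level=logging.DEBUG,
                        format="%(asctime)s %(levelname)s %(message)s")
    return run(args.target)
```

Calling `sys.exit(main())` under the usual `if __name__ == "__main__":` guard turns this into a script the customer can run and then send back the log file. Logging the state on both sides of every step is deliberately noisy: when you only get one shot at a remote run, you want the log to answer questions you have not thought of yet.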

Lucky? That it was resolvable the first time, maybe. But the critical power move was reducing the problem scope and encapsulating it in an easily runnable test case. Debugging 101. And you can probably appreciate that if it had not been resolvable on the first iteration, then the log files would have provided additional data to allow a further reduction to take place.

The goal of these posts is not necessarily to reveal some deep secrets that only a select few initiates are aware of. Rather, it is to provide some clear guidance on steps that can take you from a so-so debugger to someone who is called in on the "hard problems". Reducing a problem into simpler steps, and isolating a problem in a complex interdependent chain of components to a single component, seem self-evident. But time and again, I have seen developers struggle with a case by sticking to the reported issue, and not digging in. And if you are a tester, you will earn a ton of gratitude when you hand something to a dev with some of this reduction already done. The dev will fact-check you, of course, and you should not take offence at that, but you know that if they are a serious debugger, they'll be doing that anyway.