Pulling a Thread

As an elite debugger, you are conditioned to expect data that highlights a specific problem that needs to be resolved, or to ask for it if it is not there. But there are times when some data is provided through honest attempts but it is still...lacking. What to do?

One approach is to throw up a wall and say, "Sorry, no data, no work."

That is reasonable if, time and time again, you get tossed issues that lack sufficient data. But it is also not very professional.

Sometimes I let frustration get the better of me and push back in this way but, more often than not, I try and make do with what is there and see if I can uncover something useful. It may, rarely, lead to a complete solution but more often it leads to deeper questions that can be asked, and more useful data gathered.

This is about that rare case.

Problem scenario

Imagine a scenario where there is some high level metric computed by a consumer of your data, and you don't really know what is behind that metric. But those that do are pointing at something and saying:

"See this number. It is wrong."

Note that they don't say, "it should be THIS", but only that it is wrong.

Then they present a whole bunch of other examples of the high level metric being incorrect, while not giving the context for THAT NUMBER that is wrong. Rabbit holes are traversed, and other random bits of "helpful" examples are thrown into the pot before it lands in your lap. How do you make any headway with this, let alone solve it?

Again, you could say that you need more, specific, data. When was the last time it worked? What's changed? Did it ever work?

But there is that nuggest way up in the data, where "this number" was wrong. While you don't know what is wrong about it (yet) you can at least try and replicate that number because you have enough context to do so, and see where that leads. It is a small thread, but you pull on it a bit.

The first tug

So you assemble whatever extra data you need to get to that number, and try and recreate it. You account for time zones and other miscellanea. Thankfully you also have database snapshots that drive the computations. And luckily, while THAT NUMBER is usually an aggregation of a lot of rows of data, this sample case turns out to have exactly one row.

I say "luckily" although experience led me to pick that specific case of the wrong number because I saw that a single row of data was feeding it. There were others that had two, or ten, but the single row case was the ultimate reduction of the problem.

And, lo and behold, you get that number. Exactly.

Just as significant, you get that number in the context of other numbers output by the same processing for that single row. And you can now see the intermediate steps that are used to generate that number, and can step through those intermediate results.

You still don't know why (or even if) THAT NUMBER is wrong, but you now have something to dig into. All because you pulled on a thread.

The Great Unravelling

Looking at the other numbers, it becomes immediately obvious where THAT NUMBER comes from, as a sum of a few others. So, are those other numbers correct? Hmm.

Not really expecting much, you realize that there is another way these numbers can be generated, a different process that should yield the same results. For shits and giggles, you run that. And amazingly, the numbers are different. Actually to be precise, all the numbers are the same except for THAT NUMBER.

Looking closely at the code behind both computations, you uncover a subtle but now obvious difference. And looking at the code history, you see that nothing has changed in the errant code's processing, but that the other process has in fact been updated, and that change was not propagated to the broken one. (This, by the way, is one very good reason to not fork code in the name of expediency.)

So now we not only know why the number is different but we have a counter example that shows a (presumably) updated calculation. Going back to the root symptom, it is alleged that THAT NUMBER should be smaller and, indeed, the new calculation is smaller.

Finally, you dig into the detailed specification of what these numbers are supposed to represent (not easily, it turns out) and see that, indeed, the new calculation is the correct one. Yay.

This scenario was not cooked up to make a point, but actually happened. And you might be thinking that I got very lucky. In a sense, there was some luck but, frankly, this was a success as much by asking the right questions as it was by luck. I had no idea what reproducing a single number in a sea of data (and there was a LOT of data collected in the nearly two weeks before this landed on me) would lead to, but it was the only thing I could concretely grab onto...and pull.

And as I said at the outset, it would have been easy to say there is no useful data here, and ship it back with a request to get the "right" data.

But instead of delaying yet again, I dug in for several hours and now we have a path to resolution.

Thread pulling FTW!