Debugging 501
In the past few posts, I have tried to outline some of the key "power moves" that separate top debuggers from the rest. This post will be the first of the "Mastery" series, where we will either do a deep dive into a specific aspect of the power moves, or a deep dive into a real-life scenario to show how they may be applied. Today, we will go over a particularly nasty issue that I have been working on this past week.
Without giving away too many details, the scenario is as follows. A very large customer has been having an issue with some data that comes through an integration pipeline between our system and an external system. The issue is that sometimes the numbers are plain wrong. Now, this pipeline has a backup strategy, whereby temporary inaccuracies are supposed to be patched up nightly, and yet the answers are still wrong. Why?
The crux of this issue is to identify why the backup strategy is not doing what it is supposed to. The infrequent, temporary glitches in the numbers are a known issue, and fixing them at the source has deep implications, so the "plan B" of a backup is both sensible and acceptable to the customer. That it does not work is simply unacceptable.
Looking at the big picture, we can immediately identify a number of failure scenarios, such as:
- The backup strategy needs to be explicitly enabled, so it is possible that it is simply not turned on.
- The backup strategy is on, but is failing somehow.
- The backup strategy is on, but is also getting the wrong answer.
Since we know, unfortunately, that the problem can be reproduced by the customer, the first question to ask is "show me the logs". And here is where we hit our first roadblock. There are several separate systems at work here, each with its own logging, and not all of them are under our control. Yes, even on the external system we can turn the logging up, but that needs to be done manually. What is worse, though, is that the logs are aged out ridiculously quickly, meaning that we do not have any logs covering the time period where the backup should be running.
So step one is to turn on deeper logging on that external system, and archive the logs so we can catch the time frame that we need. Curiously, despite the obviousness of this step, it was not until the problem reached my attention that the request to see the logs was made.
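To make this concrete, here is a minimal sketch of the kind of archiving job we asked for, written in Python. The paths, file pattern, and schedule are all hypothetical stand-ins; the real external system has its own layout and retention policy.

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical locations; the real system's paths and retention policy differ.
LOG_DIR = Path("/var/log/external-system")
ARCHIVE_DIR = Path("/var/log/external-system/archive")

def archive_logs() -> None:
    """Compress log files into an archive directory before the system's
    aggressive retention policy deletes them."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for log_file in LOG_DIR.glob("*.log"):
        target = ARCHIVE_DIR / (log_file.name + ".gz")
        if target.exists():
            continue  # already archived on a previous run
        with log_file.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)

if __name__ == "__main__":
    archive_logs()  # typically run from a scheduler (e.g. cron), well inside the retention window
```

Nothing clever, and that is rather the point: the hard part was getting agreement to run it, not writing it.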
Unfortunately, there were three wrinkles in that plan:
- The logging won't catch the issue for 24h.
- The request to archive logs on a production system needs to go through proper channels.
- Despite going through channels, the log archiving was still not enabled in time, so we now won't have logs for 24h + 24h.
Rather than waste time waiting for logs that may, or may not, provide the answer we are looking for, we can dig into logs that we have more control over on "our side", logs that are thankfully already archived and searchable. The goal of this would be to reduce the problem surface area, eliminating some of the failure modes.
And that is where we get our first clue as to what might be amiss. There is no record in our logs of the external system reaching out to do the "plan B" backup. Double-checking multiple days' worth of logs gives the same answer. This customer does not have a "plan B" running, successfully or at all.
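The check itself is nothing fancy. It amounts to something like the sketch below, where the marker string and paths are hypothetical stand-ins for what the plan B request actually looks like in our archived logs:

```python
from pathlib import Path

# Hypothetical marker and path; our side's logs are already archived and searchable.
PLAN_B_MARKER = "POST /backup/reconcile"
LOG_ARCHIVE = Path("/var/log/our-system/archive")

def days_with_plan_b_requests() -> set[str]:
    """Scan several days of archived logs and report which days contain
    any evidence of the external system calling the backup endpoint."""
    seen: set[str] = set()
    for log_file in sorted(LOG_ARCHIVE.glob("app-*.log")):
        with log_file.open(errors="replace") as fh:
            if any(PLAN_B_MARKER in line for line in fh):
                seen.add(log_file.stem)  # e.g. "app-2024-05-14"
    return seen

print(days_with_plan_b_requests() or "no plan B requests found on any day")
```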
Now, we have been told that yes, the "plan B" is on for this customer. But we question everything. How can it be on when the logs clearly show it is not? Enter another wrinkle: the configuration of these external systems is kept pretty tightly under wraps. It takes the right kind of access to see what is there, and so that introduces more delays while access is obtained.
Eventually, though, access is obtained, and on very close inspection the configuration appears to be correct: plan B should be running. Hmm. We really, really need those logs! So we double-check that archiving is, indeed, working now, and all we need to do is wait.
But again, in order to not waste time, there are still some facts that can be cleared up. The next most important fact to uncover is the answer to the question: what if the "plan B" is running (even though the logs say otherwise) and we are getting the wrong answer? So we set up a test where we simulate the exact same interface to pull the backup data from our system, and compare the results with the reported errant data. It is tedious work to compare data items line for line, but the results are unambiguous: the backup should repair the bad data if it is called.
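If you want a feel for what that comparison looks like, here is a rough sketch. The file format, column names, and helper names are invented for illustration; the real records came back over the simulated backup interface itself.

```python
import csv

def load_rows(path: str) -> dict[str, str]:
    """Index rows by their key column so records can be compared pairwise.
    The column names here are hypothetical stand-ins for the real schema."""
    with open(path, newline="") as fh:
        return {row["record_id"]: row["value"] for row in csv.DictReader(fh)}

def diff_datasets(simulated_path: str, reported_path: str) -> list[str]:
    """Compare what the backup interface would deliver against the errant
    data the customer reported, one record at a time."""
    simulated = load_rows(simulated_path)
    reported = load_rows(reported_path)
    mismatches = []
    for record_id, reported_value in reported.items():
        expected = simulated.get(record_id)
        if expected is None:
            mismatches.append(f"{record_id}: missing from backup output")
        elif expected != reported_value:
            mismatches.append(f"{record_id}: backup says {expected}, customer sees {reported_value}")
    return mismatches
```

In our case the mismatches lined up exactly with the known-bad records, which is what told us the backup would repair the data if it ever ran.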
Again, we are left to conclude that the backup is not running because: (a) there is no record of it running, and (b) if it were running, the data would be fixed. We really, really need those logs!
Finally, the second 24h ticks over and we can now see what is going on through the archived logs. The backup was initiated by the external system, but the request threw a timeout exception. (As an aside, the logs for this particular system are very good, and overlapping jobs are clearly distinguished by unique identifiers.) So if the job actually ran, why did it not appear in the other logs?
Looking back at the logs on our system, we still don't see that request coming in, but we do see an error 4 minutes later about a socket disconnecting. The request timeout on the external system was only 1 minute, so this socket exception must be our side trying to write data after the external system gave up and shut the request down. Frustratingly, there is absolutely no identifier in the log as to the source being handled at the time of that later exception, but the timing is too close to be a coincidence. A trace/span ID would have been really helpful here.
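Without a shared identifier, the correlation boils down to arithmetic on timestamps. Here is a toy version of that reasoning; the timestamps are made up for illustration, while the one-minute timeout and the four-minute gap are the real figures from the logs.

```python
from datetime import datetime, timedelta

# Illustrative timestamps pulled from the two log sources; formats vary per system.
external_request_start = datetime.fromisoformat("2024-05-14T02:00:05")
external_timeout = timedelta(minutes=1)      # the external system gave up here
our_socket_error = datetime.fromisoformat("2024-05-14T02:04:12")

def plausibly_related(request_start: datetime, error_time: datetime,
                      window: timedelta = timedelta(minutes=10)) -> bool:
    """With no trace/span ID to join on, the best we can do is ask whether our
    error falls shortly after the external request began, and after its timeout
    expired (when writes to the abandoned socket would start to fail)."""
    return request_start + external_timeout <= error_time <= request_start + window

print(plausibly_related(external_request_start, our_socket_error))  # True: suspicious timing
```

It proves nothing on its own, but it is exactly the kind of circumstantial evidence that a propagated trace/span ID would have made unnecessary.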
But at this point, we have a smoking gun that, frustratingly, we would have had two days earlier if logs had been correctly handled on the external system. Two key takeaways from this exercise are:
- We need to have a formal log archiving process on the external system so that we always have searchable logs on hand, without having a gazillion log files on the file system.
- We need to ensure that our system logs context with exceptions, so we know what it was working on when it barfed (see the sketch after this list).
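That second takeaway costs almost nothing to implement. A minimal sketch using Python's standard logging, with hypothetical names for the handler and the work item:

```python
import logging

logger = logging.getLogger("backup")

def process_source(source_id: str, payload: bytes) -> None:
    """Do the actual work for one data source; the details don't matter here."""
    ...

def handle_backup_item(source_id: str, payload: bytes) -> None:
    try:
        process_source(source_id, payload)
    except Exception:
        # logger.exception records the stack trace; the message carries the
        # context we were missing in this investigation: *which* source was
        # being handled when things blew up.
        logger.exception("backup processing failed for source %s", source_id)
        raise
```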
The problem has yet to be solved, and I'll be sure to update here when it is. But at this point the back of it has been broken. We know what the root cause is, and so we can try to attack it from several different angles. The technical details of the solution are not important here, but hopefully it has been instructive to see the principles I have been writing about applied to this case.
Specifically, we saw how vitally important complete logs are, and how, if the logs lack contextual data, we can miss important facts. We also saw how being able to simulate what the external system was doing allowed us to eliminate one of the possible failure modes. It didn't solve the problem, but it reduced the number of possible causes. In retrospect, those simulated calls also took a very long time, so they will be an important tool in validating possible optimizations to the backup processing. And finally, by insisting on seeing the actual configuration rather than simply accepting an important "fact", we were able to have confidence that the logs (when they eventually arrived) would tell the whole truth.
The most important moral, however, is this. Log files are like backups. When you need them, you really need them, so it is important to ensure they are capturing the data you need them to be capturing. A log file that is deleted so quickly that its window of relevance is impossibly small is, frankly, a waste of disk space. If you can influence log file retention for systems you may be asked to debug, exercise that influence and make it right.