23
Elephant in the Blameless War Room: Accountability
We’ve always advocated that every company can benefit from a blameless culture. Fostering a blameless culture can strengthen your organization in many ways, from employee retention to developer velocity and innovation. However, there’s an elephant in the room when we talk about blamelessness with executives: accountability. When things go wrong, people still need to get fired, right?
In a discussion with Ajay Varia, former VPE of MasterClass and COO of Emeritus, he shared an example of a real incident and how he held space for a blameless resolution without sacrificing accountability. An engineer was making changes to the administrative panel of what they thought was a testing environment. Regrettably, it turned out to actually be controlling the production environment. Their changes caused a significant outage for the service.
Imagine an executive pressing to know who was responsible for the outage. How would you respond to this demand for accountability while maintaining the ideal of blamelessness? In this blog post, we’ll look at:
- What blaming executives want when they blame
- How to skillfully respond to demands for blame
- When is accountability fair game?
- How to be blamelessly accountable
Although we might not agree with their blameful approach when they ask “Who’s responsible for this incident and what should we do about this person?”, we need to remember that the executive’s goal is the same as ours: to solve the problem and ensure the company’s success. Just like us, they take responsibility for the situation and are eager to restore the service to health.
The executive likely has three goals in mind: dealing with the person involved, resolving and preventing the incident, and restoring trust with affected stakeholders. Given their distance from the day-to-day context of the incident, they may see blaming an individual as one of the only ways to meet their goals.
To understand what an executive wants to achieve when they look for someone to blame, we must empathize with their perspective. They may have certain assumptions about the situation or how their actions will affect it. We must account for those assumptions and meet them where they are so we can skillfully and constructively respond to their questions.
This incident should never have happened. The executive may not realize that failure is inevitable in complex systems. They may assume that if people did their jobs right, then there should be no incidents or outages. Therefore, the person causing the incident must not care or not be skilled enough.
Punishment will deter others from making the same mistake. If they assume the mistake was made because of negligence, they could see punishment as a “wake-up call” that would make other engineers more dutiful. They may think that remembering the punished engineer would make other engineers spend more time double-checking which admin panel they access.
A skillful engineer would never make this mistake, and therefore the person involved must be unskilled. A lack of context around the issue may lead the executive to assume that it resulted entirely from individual error, with no systemic factors. With our admin panel mixup example, they might assume that any good engineer would be able to consistently identify the production panel.
Removing the person removes the problem. Given their first assumption, it follows that they’d see this as an effective solution. If they removed the unskilled engineer responsible and only skilled engineers remained, the problem wouldn’t reoccur.
Without punishment, the engineer won’t appreciate that they made a mistake. When dealing with the stress of the incident, it’s easy for the executive to feel that they’re alone in their frustration. They may believe that the engineer lacks the perspective to understand the significance of the incident and feel bad about it. They may see punishment as a way of conveying the impact to the engineer.
Punishment is the most persuasive way to alleviate customer concerns and restore trust. Even if the executive believes in blamelessness and addressing systemic causes, they may worry that their stakeholders don’t. They may assume that only removing the involved person will satisfy stakeholders worried about the incident reoccurring. Blame-heavy press releases reflect this concern.
Stakeholders may expect punishment to maintain fairness. Some stakeholders might assume that the incident was due to negligence. These stakeholders could include other teams in your organization impacted by the incident, such as customer success. Since those teams may be measured on churn that results directly from the incident, and customer success managers are held accountable for that churn, the executive could assume that the engineering team needs to face similar negative consequences to keep accountability fairly distributed across the company. They could see punishment as the best method to achieve this.
Given these assumptions and the deep desire to resolve the problem, it makes sense that the executive would look to blame. To respond to this demand without blame, you have to convince the executive that their goals will still be met. You must assure them that their concerns about the person involved, the incident itself, and the stakeholders will all be addressed.
Engineers in fight-or-flight mode cannot problem-solve well. Hunting for someone to blame during an incident will likely slow the resolution down, not speed it up. Engineers who are stressed about their job security will likely not be able to resolve complex problems as quickly as they would if their minds could focus on the incident at hand. Since engineering is about solving complex problems, blame could harm an engineer’s overall productivity and effectiveness.
A systemic change is more enduring and beneficial. Digging into systemic causes can provide insights into the most fundamental assumptions about your organization. It will lead to changes that help regardless of who is on the team or who is at the helm. In this case, Ajay asked, “Why do the testing and production environment look so similar? Can we create a big red banner warning the team when they are making changes to production?”
Complex system failures are inevitable. There will always be incidents we cannot prepare for, but it’s not all hopeless. We can take measures to get better at detection, get faster at mitigation, and get more proactive with prevention. We can measure improvements across these three areas and focus reliability efforts on the most revenue-critical user journeys. Expecting engineers to keep the service running 100% of the time is not realistic; even our biological hearts do not have 100% reliability. Help the executive see incidents as unplanned investments in reliability.
Anyone in that position could have made the same mistake. When explaining the systemic causes, show how they would apply equally to any engineer, even the most senior architect. Therefore, punishing the specific person who happened to be there won’t necessarily prevent the issue in the future.
We could make this mistake again with a different person. Since anyone could have made the mistake, it could easily happen again unless something systemic changes.
No one wanted this outcome, least of all the engineer involved. Assure executives that everyone on the team is aligned on the importance of reliability and customer trust. Re-establish trust with the executive by demonstrating that the engineer and the team overall clearly understand the impact of the incident and feel committed to improving the system. Emphasize the resulting action plan as proof of this commitment.
Even if you don’t directly interface with the customers or other stakeholders, you can advise the executive to respond in these ways:
Our action plan will inspire confidence. If the goals and timeline of the action items are communicated to stakeholders, they’ll see how the plan creates a more reliable solution than blame.
We can acknowledge their pain without blame. There are ways to show stakeholders that you understand the pain the outage has caused them without retribution. Ajay recommended that people in leadership positions hear out everything from the stakeholder’s perspective. It isn’t about finding a scapegoat, but making sure that the stakeholders understand they aren’t being dismissed or trivialized.
By responding to a demand for blame with these assurances, you can convince the executive that their goals will be met without needing to resort to blame.
There could still be situations where someone needs to be held personally accountable. It may be the best option to advance systemic changes. There are some prerequisites to meet before holding a person accountable.
Here are some questions to ask yourself:
- Were expectations for this person’s job clear, realistic and documented?
- Are there multiple mistakes caused as a direct result of this person’s lack of skill, good intentions, or earnest effort, and little to nothing else?
- Have you shared feedback about gaps in their performance on a consistent basis?
- Do you and/or other members of your team have reasons to believe that this person is not sufficiently coachable?
- Do you have consistent and reliable evidence that this person cannot be trusted to meet the explicitly stated expectations associated with their role?
- Does your organization’s culture acknowledge that complex system failures are inevitable?
- Did you look for contributing factors of their mistakes?
If the answer to all of the above questions is yes, then holding the person accountable is fair.
However, as you can see, the traditional definition of accountability - attribution and punishment - falls under performance management, which should be an intentional process, separate from incident resolution. Assigning blame for inevitable system failures is not an appropriate substitute for performance management.
Accountability isn’t incompatible with blamelessness. In fact, true accountability - forward-looking ownership of making the system better - requires blamelessness.
Blame is an easy way out. It allows people to punish a “responsible” individual and call it a day. But what the company really needs is for leaders and teams to do the hard work of solving the complex and nuanced challenges of your system.
True accountability incorporates nuance and faces forward.
For our example of the production panel being mistaken for the testing panel, here’s how Ajay took accountability as a leader. He asked:
- Why do the admin panels for the production and testing environments look so similar? Should production have a big flashing banner reminding you that you’re working in production?
- Should a single person be able to make changes to admin in the production environment? Should there be a two-person verification system?
- Should every engineer be given the ability to make changes on the production admin panel? Maybe most engineers only need to make changes in testing.
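To make these questions more concrete, here is a minimal Python sketch of how the three safeguards (a production banner, limited production access, and a two-person check) might fit together. It is only an illustration under our own assumptions, not a description of what Ajay’s team built: `ChangeRequest`, `PRODUCTION_EDITORS`, `banner`, and `can_apply` are all hypothetical names invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed admin-panel change (hypothetical model for illustration)."""
    author: str
    environment: str  # "testing" or "production"
    description: str
    approvers: set[str] = field(default_factory=set)


# Hypothetical allow-list: most engineers only need to change testing.
PRODUCTION_EDITORS = {"alice", "bob"}


def banner(environment: str) -> str:
    """Return a loud warning for production and nothing for other environments."""
    if environment == "production":
        return "*** WARNING: YOU ARE CHANGING PRODUCTION ***"
    return ""


def can_apply(change: ChangeRequest) -> tuple[bool, str]:
    """Decide whether a change may be applied, and explain why or why not."""
    if change.environment != "production":
        return True, "non-production change"
    if change.author not in PRODUCTION_EDITORS:
        return False, "author lacks production access"
    # Two-person rule: the approver must be someone other than the author
    # and must also hold production access.
    reviewers = (change.approvers - {change.author}) & PRODUCTION_EDITORS
    if not reviewers:
        return False, "needs approval from a second person"
    return True, "approved by " + ", ".join(sorted(reviewers))
```

In this sketch, a production change by `alice` is rejected until someone else on the allow-list, such as `bob`, approves it, while changes to testing pass straight through. The point is that the system, not any one person’s vigilance, catches the mixup.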
We are sure you can imagine how confidence-inspiring the ensuing follow-up action items are. Ajay showed that companies don’t have to sacrifice accountability to have a blameless culture, nor do they have to default to blame to uphold accountability.
It takes incredible empathy, stress-tolerance, and critical thinking to get blamelessness and accountability working together in harmony, but it is possible.
So don’t hide the elephant, ride it!