It’s been a tough week for me. My dreams were literally shattered last Sunday morning when I awoke far too early to a cacophony of alerts from email and Slack that rarely precedes good news. My eyes still closed, I felt around the nightstand for my phone, my right upper eyelid ungluing from the lower, half open to soften the instant beam of light, mere inches from my face, that was about to fracture the darkness. Jira, one of the core business tools at our company, was down. It was 6:24 am.
The remainder of the day, nearly 10 hours, was spent trying to get hold of Atlassian, the company that makes Jira. I sent many emails to many people. I submitted tickets to their help desk. I enlisted the support of others, and still others volunteered, reaching out to their networks and submitting tickets on my behalf. I even reached out blindly on LinkedIn to four people, two of whom eventually responded, one of whom responded the same day, and none of whom could fix the problem. At 4 pm Pacific time - 9 am on Monday in Sydney, Australia, where Atlassian is headquartered - Atlassian’s website live chat feature opened, and I was first in the queue, where a rockstar support agent named Sheng solved all of life’s problems. I could finally relax for a few hours.
And thus came Monday, determined to Monday, as it is wont to do. Tuesday was marginally better. On Wednesday, traditionally the day when work gets over the hump and starts to wind down ever so slightly in preparation for the glorious relief that is Friday afternoon, my internet service provider had an outage, so I hightailed it to Starbucks for a few hours to get online. It could have been worse - and then it was. My boss, the CFO, called me at 5 pm to tell me that his computer, his second computer in as many months, was on the fritz. And he was on-site at an executive conference out of state, and there were two days of the conference left. The scramble begins. Yesterday, Thursday, more debacles: Getting into and out of our co-working space throughout the day was a challenge. Making sure my boss was up and running with his new computer, which was hand-delivered Thursday morning via a red-eye on Wednesday, was my focus for the morning hours. Sprinkled throughout the rest of Thursday were one-on-one meetings with my team and a smattering of other meetings with colleagues and vendors, culminating with - on Thursday evening - our purchase of CrowdStrike.
And thus came Friday. The headlines were positively apocalyptic. Largest IT outage in history. How CrowdStrike knocked the world offline. CrowdStrike outage sparks global chaos. Rather anticlimactically as far as this post goes, my company was not affected by the CrowdStrike bug. We decided on CrowdStrike as our endpoint detection and response tool after bad experiences with previous tools, and we’ve been testing it for the past several months (which went perfectly fine). We finalized the purchase last night, so we haven’t deployed it to our fleet of computers yet. But this isn’t a post about what CrowdStrike is or how the outage happened, or the superhuman efforts of the IT staff around the globe who are working to get you back in the air, booked into your hotel, or sent an ambulance. You can read all about that at the links above, and from basically any news outlet on Earth at this point. In keeping with the theme of this blog, this post is about the leadership lessons we can take away from this debacle.
Communication
“I attribute my success to this - I never gave or took any excuse.” - Florence Nightingale, English social reformer and founder of modern nursing.
CrowdStrike CEO George Kurtz did a good job communicating under the extremely difficult circumstances he’s facing. Nobody’s ever going to get communication just right, and that’s especially true with crisis communication, but unlike Boeing CEO Dave Calhoun’s response after the Alaska Airlines door plug blowout, Kurtz took responsibility for the bug and didn’t use euphemism or the passive voice to try to absolve CrowdStrike or himself. Despite his fluency in Executive Spin, you can see Calhoun’s discomfort answering straightforward (not even hardball) questions, while Kurtz spoke openly, honestly, and even briefly choked up, engendering the momentary sympathy of one of the reporters. Taking the same responsibility for a losing strategy as you do for a winning one shows that you stand behind your decisions and helps build a culture of accountability where people own their decisions and their actions and where credit is shared and blame is accepted.
But both verbal and nonverbal communication are crucial. Calhoun’s interviews were polished glass, streak-free, set ever so carefully and transparently in front of the living artwork of a finely-tuned manufacturing facility. The most prepared part of Kurtz’s interview looked to be his spiked hair. But this unpolished appearance can serve an important purpose. By eschewing a virtual background (or a background of any kind) and other accoutrements, like lights or a microphone, Kurtz brought more attention to his facial expressions, gestures, tone, and eye contact, all of which can often convey emotions and attitudes more powerfully than words can, serve to reinforce verbal messaging, and signal authenticity to the viewer. Here I am, just me, talking to you.
Lessons: Speak honestly and acknowledge the situation. Assume responsibility for it. Act with a clear sense of ownership, and promote a sense of urgency that establishes and enforces accountability. Express authenticity in both your speech and your appearance.
Outcomes, Judgment, and Bias
Ultimately, someone (likely many people) is going to be judged for the outcome of today’s events, which will take weeks to fully recover from. The questions will be broad (“How did this happen?”) and specific (“Who approved the pull request?”). Heads will roll, and the Street will get its pound of flesh. But it’s unwise to evaluate the performance of an individual, team, or indeed even a company on outcome alone, even if, as crazy as it sounds, the outcome is grounded airplanes, 911 not working, hospital systems down, hotels out of commission, and every major news outlet in the world making your company the subject of cataclysmic headlines.
A more effective way to evaluate the efficacy and performance of individuals, teams, and companies is to evaluate judgment, not necessarily outcome. This isn’t to say outcomes are unimportant; of course they matter. But outcomes cannot always be controlled. There may be unforeseen or unavoidable external factors that prevent an expected or positive outcome. The important part is that the judgment used to achieve the outcome was sound. When rewards are based on outcomes alone, people will hide a bad result by escalating commitment, but when rewards are determined by a sound decision-making process, people are more motivated to make the best possible decision at every stage, whether or not an earlier decision proved to be correct.
As leaders, when we encounter extraordinary situations, we have to be particularly careful not to let our cognitive biases affect our own judgment. The presumed association bias is quite common across industries, but particularly in tech. It states that when the probability of two events co-occurring is judged by the availability (or, how readily something is “available” in memory) of perceived co-occurring instances in our minds, we usually assign an inappropriately high probability that the two events will co-occur again. In the case of CrowdStrike, we can ask the question, “Do CrowdStrike updates cause computer crashes?” There are always at least four separate situations to consider when assessing the association between two dichotomous events:
CrowdStrike update and computer crash (A and B)
CrowdStrike update and no computer crash (A and not B)
No CrowdStrike update and computer crash (Not A and B)
No CrowdStrike update and no computer crash (Not A and not B)
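For the technically inclined, here is a minimal sketch of why all four situations matter. The counts are entirely made up for illustration (they are not CrowdStrike data or anyone else’s), and the class and method names are my own. The idea is simply that you can’t judge whether updates are associated with crashes from the first cell alone; you have to compare the crash rate with an update against the crash rate without one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """Counts for the four situations; all names here are illustrative."""
    update_crash: int      # A and B
    update_ok: int         # A and not B
    no_update_crash: int   # not A and B
    no_update_ok: int      # not A and not B

    def p_crash_given_update(self) -> float:
        # Share of updated machines that crashed.
        return self.update_crash / (self.update_crash + self.update_ok)

    def p_crash_given_no_update(self) -> float:
        # Share of non-updated machines that crashed.
        return self.no_update_crash / (self.no_update_crash + self.no_update_ok)

    def risk_difference(self) -> float:
        # Positive: crashes are more common with updates; negative: less common.
        return self.p_crash_given_update() - self.p_crash_given_no_update()


# Hypothetical counts, invented purely for this example.
table = Contingency(update_crash=2, update_ok=998,
                    no_update_crash=5, no_update_ok=995)

print(f"P(crash | update)    = {table.p_crash_given_update():.3f}")
print(f"P(crash | no update) = {table.p_crash_given_no_update():.3f}")
print(f"Risk difference      = {table.risk_difference():.3f}")
```

With these invented numbers, the two crashes that followed an update are the ones everybody remembers, yet the machines that skipped the update crashed more often. Looking only at the first cell would lead to exactly the wrong conclusion.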
It’s amazing to me how often our guidance in IT is to update your operating system or software because it’s likely to fix a problem, but how reluctant people are to do it because an update broke something once they-can’t-remember-when. This is the presumed association bias at work. The ease of recall bias is another bias rooted in “availability”: when a person judges the frequency with which an event occurs by the availability of its instances, an event whose instances are more easily recalled will appear to be more frequent than an event of equal frequency whose instances are less easily recalled.
For example, vivid instances of a computer crashing will be most easily recalled from memory, will appear more numerous than they are, and are likely to be judged differently than, say, a computer that’s lagging because an update reduced performance rather than completely disabled the computer. And, because of our susceptibility to vividness and recency, we are prone to overestimating the likelihood of unlikely events. If you witness a house burning, you're likely to overestimate the possibility of a house fire. This endangers our judgment because we are likely to weigh options sub-optimally or incorrectly, and leads us to create value judgments that are undeserved.
As we evaluate the CrowdStrike situation through the lens of leadership, we need to debias our judgment as much as we can (the best way to do that is by being aware of our biases and doing the cognitive work to minimize them). We should remember that CrowdStrike is a leader in the industry and has a lengthy track record of stability, and that even in the very best of circumstances, outcomes cannot always be controlled, so we must evaluate the judgment that contributed to the outcome (Was there a code review process that was followed? Was testing thorough enough?). We also need to understand that basic judgmental biases are unlikely to correct themselves over time, so that cognitive work has to be deliberate and ongoing.
Finally, part of our evaluation should focus on the response to the crisis. When my team and I are planning a deployment, we consider that outcomes cannot always be controlled, so if the unexpected does occur we need to be ready to swarm the problem, all hands on deck, until it is resolved quickly and efficiently. We call this our rapid response plan, and we create one for nearly all deployments. Kurtz mentioned in his interview that CrowdStrike will be unrelenting in their support to get customers back up and running. A good response is a signal that good judgment was used in the planning process.
Lessons: Evaluate the decision-making process rather than just the outcome, and avoid escalating commitment, instead favoring making the best possible decision at every opportunity. Provide unrelenting support during a crisis, which is indicative of good planning and judgment. Foster a culture that values sound decision-making, is prepared for crises, and prizes continual improvement.
While the globally disruptive nature of today’s event is highly consequential for CrowdStrike, it’s a good opportunity to evaluate the behaviors and actions of the company, and particularly CEO George Kurtz. While there’s always a risk of new instability introduced through updates, it would behoove us to not judge immediately based on the impact and the scale of that impact (it is, by all standards, absolutely monumental), but to consider other qualities that provide a more nuanced and textured landscape of the situation, which helps us as leaders self-improve.