As you and most of the world are well aware, Facebook suffered a global outage on October 4th.
Over a period of only a few minutes, an “error of [their] own making” effectively disconnected their entire network from the rest of the world.
Facebook Engineering published a detailed analysis of what happened, but here’s the gist of it: during a routine maintenance action, an engineer issued a command “with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in [their] backbone network.”
The backbone failure caused the company’s internally managed DNS servers to withdraw their BGP (Border Gateway Protocol) advertisements, essentially making the entire Facebook network invisible to the Internet.
As if that wasn’t enough, those DNS servers and that network backbone not only served Facebook customers but also provided connectivity to the internal Facebook systems that enabled everything from intra-organizational communications to building access.
In short, they had a ghastly mess.
While most enterprise leaders will never operate a technology stack at the scale or complexity of Facebook’s, there are still several important lessons you can learn from both what caused this outage and how Facebook responded to it.
Lesson #1: Balance is Critical
As you might expect, Facebook prepares for these types of events and practices how to respond to them. They identified the issue and initiated restoration activities quickly. So, why did it take such a well-disciplined organization so long to recover?
Ironically, it was the company’s efforts to improve security that were the culprit. “We’ve done extensive work hardening our systems to prevent unauthorized access,” wrote Santosh Janardhan, Facebook’s VP of Infrastructure. “And it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making.”
Therein lies the first lesson: always be mindful of the balance struck by each trade-off you make. In this case, Facebook accepted a slower recovery in exchange for stronger day-to-day security — a trade-off Janardhan believes is worth it.
Likewise, the BGP advertisement withdrawals are an automatic function of the DNS servers whenever they lose access to the network. The intention is to protect the customer experience by not routing traffic to an unhealthy connection. In this case, however, that same customer-protecting safeguard caused all of the DNS servers to disconnect simultaneously, kicking off the cascade of issues.
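To make the pattern concrete, here is a minimal, purely illustrative sketch of that kind of health-based withdrawal logic in Python. It is an assumption-laden toy, not Facebook’s actual tooling: the hostnames, ports, and function names are invented, and in a real deployment the announce/withdraw decision would be handed to a BGP daemon rather than a print statement.

```python
# Hypothetical sketch of health-based BGP withdrawal (not Facebook's code).
# Each DNS node advertises its anycast prefix only while its own view of
# the backbone looks healthy, and withdraws it otherwise.
import socket
import time

# Invented example endpoints standing in for backbone health-check targets.
BACKBONE_PROBES = [
    ("backbone-probe-1.example.net", 443),
    ("backbone-probe-2.example.net", 443),
]

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the target succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def backbone_is_healthy() -> bool:
    """This node's local verdict on backbone health."""
    return all(reachable(host, port) for host, port in BACKBONE_PROBES)

def advertise_prefix() -> None:
    # Placeholder for telling the local BGP daemon to announce the prefix.
    print("advertising anycast prefix")

def withdraw_prefix() -> None:
    # Placeholder for telling the BGP daemon to withdraw the prefix.
    print("withdrawing anycast prefix")

if __name__ == "__main__":
    while True:
        if backbone_is_healthy():
            advertise_prefix()
        else:
            # Each node decides independently, so a backbone-wide failure
            # makes every node withdraw at the same time -- the cascade
            # described above.
            withdraw_prefix()
        time.sleep(5)
```

The important property is that each node reaches its verdict independently: when the shared backbone itself fails, every node’s health check fails at the same moment, and so do all of the withdrawals.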
To be clear, the intention to harden systems and automate BGP withdrawal comes from the right place. But the point is that every decision involves a trade-off and a balancing act — so make them mindfully and intentionally.
Lesson #2: Respect the Law of Unintended Consequences
While Janardhan believes that the trade-offs that led to this outage were worth it (and I’d tend to agree), it remains a painful lesson in unintended consequences.
Despite the company’s robust testing and consistent “storm drills,” it never conducted a drill in which the entire backbone went offline. Likewise (I suspect), no one ever contemplated what would happen if every DNS server detected a bad network connection at the same time.
While I’m not suggesting that you will ever be able to dream up every possible negative outcome or consequence, I’m willing to bet that you’ve left a lot of stones unturned in your own contingency planning efforts.
I have long been a fan of Systems Thinking, which is essentially a mental model for contemplating unintended consequences in the form of systemic interactions. Despite being popularized by Peter Senge more than three decades ago, few enterprises have fully embraced it as a discipline.
The lesson from Facebook’s outage is that almost every decision you make has the potential for an unintended — and catastrophic — consequence. So, dig deep, analyze the potential systemic ramifications, and plan for the worst.
Lesson #3: Human Error Reigns Supreme
One of the less reported elements of the outage is that there were not one, but two human errors at its root.
The first is that an unnamed engineer issued an errant command that started this whole mess. We don’t know if they fat-fingered something, if they got confused, or exactly what happened — but we do know that it was their mistake that gave the boulder the nudge it needed to start rolling down the hill.
But being the highly disciplined operation it is, Facebook had a tool that it had built specifically to protect against this situation. While the company hasn’t provided many details, this tool is supposed to audit commands to prevent such a mistake.
The problem is that there was a “bug in that audit tool [that] prevented it from properly stopping the command.”
While the company’s report doesn’t say this, it seems very likely that the “bug” resulted from yet another human error.
All of this is to say that while enterprise IT leaders love to focus on problems with the technology, the most significant source of risk is still what it was twenty years ago when I was running an operations team: human error.
There is no magic wand to eliminate human error. Still, enterprise leaders would be wise to put a little more attention on the training, communication, and cultural dynamics that most often lead to or exacerbate it.
Lesson #4: Relentlessly Pursue the Single Point of Failure
Facebook, like all hyperscalers, is legendary for its level of redundancies. Yet, even it was not immune to the law of computing that says, “If there is a single point of failure, the universe will find it.”
No matter how many redundancies you build, there will almost always be another single point of failure to find and eliminate.
In this case, it was Facebook’s network backbone itself.
In fairness, that’s not an easy thing to make redundant. Still, it seems a safe bet that Facebook will implement some sort of logic to create a virtual redundancy whereby the backbone won’t allow itself to go down all at once again.
Frankly, I’m not technical enough to know how that might work (and perhaps it’s not possible, but I doubt it). But I’m also willing to bet that even once they’ve solved for that single point of failure, there will still be another one lurking somewhere no one is looking for it.
The message for enterprise IT leaders is that you must be relentless in your pursuit and elimination of single points of failure. The minute you think you’ve found them all, well, that’s most likely when you’ll have your own Facebook moment.
So just keep swimming (sorry, I’m on a Disney kick, apparently).
The Intellyx Take: Own the Failure
While the first four lessons are notes of caution, the fifth and final lesson is something that you should emulate if and when you suffer from an outage or similar catastrophic event: own it.
Facebook’s engineering team was not only responsive but also transparent. They have been relatively forthcoming about the cause, the sequence of events, their actions, and the lessons they’ve learned from this event.
While the outage clearly had severe negative impacts on everything from customer trust to revenue to market cap — and came on top of what was already a rough week for the company — its technical leadership team handled the situation with class and dignity.
They neither ducked responsibility nor tried to sugarcoat anything. They just put it all out there and tried to learn from it.
It’s a lesson that every enterprise leader should take to heart because, as Eva Galperin, the Director of Cybersecurity for the Electronic Frontier Foundation, put it, “the internet is held together with bubblegum and string.”
It’s not a question of if this will happen to you in some fashion. It’s only a matter of time before it’s your turn in the hot seat — and when that turn comes, the only question that will matter is whether or not you’ve learned these lessons.
© 2021, Intellyx, LLC. Intellyx publishes the weekly Cortex and Brain Candy newsletters, and advises business leaders and technology vendors on their digital transformation strategies. Intellyx retains editorial control over the content of this document. At the time of writing, none of the companies mentioned in this article are Intellyx customers. Image credit: Eduardo Woo.