How Facebook disappeared from the Internet

Cristina De Luca

October 05, 2021

At 15:51 UTC yesterday, social media began to boil over: Facebook and its affiliated services WhatsApp and Instagram were all down. Their DNS names stopped resolving and their infrastructure IPs were unreachable. It was as if someone had “pulled the cables” of their data centers all at once and disconnected them from the Internet.
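What “stopped resolving” means in practice is easy to see from the outside. The sketch below is not Facebook’s or Cloudflare’s tooling, just an illustration, using only Python’s standard library, of how an outside observer could have confirmed that the names no longer resolved:

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname currently resolves to at least one IP address."""
    try:
        # getaddrinfo asks the system's configured DNS resolver.
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # A failed lookup (as seen during the outage) surfaces here.
        return False

for name in ("facebook.com", "whatsapp.com", "instagram.com"):
    print(name, "resolves" if resolves(name) else "does NOT resolve")
```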

By the end of the day, Facebook had revealed some details of what had happened internally. Today, it went further in its explanation.

During routine maintenance, someone issued a command intended to assess the availability of global backbone capacity. The command unintentionally took down all backbone connections, effectively disconnecting Facebook’s data centers around the world.

“Our systems are designed to audit commands like these to prevent errors like this, but a bug in the auditing tool prevented it from stopping the command,” explained Santosh Janardhan, Facebook’s VP of Engineering and Infrastructure, in a new post on the company’s blog.

The problem was simple, but the solution was complicated. “The underlying cause of this outage also impacted many of the internal tools and systems we use in our daily operations, complicating our attempts to diagnose and resolve the problem quickly,” Janardhan said. “As our engineers worked to figure out what was happening and why, they faced two major obstacles: first, our data centers could not be accessed through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we normally use to investigate and resolve outages like this,” he added.

Cloudflare also published a detailed blog post about what happened, from an external perspective, showing a rapid burst of BGP updates just before the outage began. “With these outages,” wrote Cloudflare’s Tom Strickx and Celso Martinho, “Facebook and its sites effectively disconnected from the Internet.”
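That burst of updates is visible in public route collectors, so anyone can reconstruct a similar external view. The sketch below is only an illustration: it assumes the open-source pybgpstream library, a RIPE RIS collector, and 129.134.0.0/16 (a Facebook-operated address range) as an example prefix, and simply counts announcements versus withdrawals around the time of the outage:

```python
import collections

import pybgpstream  # Python bindings for CAIDA's BGPStream

# Time window, collector, and prefix are illustrative choices, not Cloudflare's setup.
stream = pybgpstream.BGPStream(
    from_time="2021-10-04 15:30:00",
    until_time="2021-10-04 16:10:00",
    collectors=["rrc00"],                 # RIPE RIS multihop collector
    record_type="updates",
    filter="prefix more 129.134.0.0/16",  # example Facebook-operated range
)

# "A" = announcement, "W" = withdrawal; a spike of withdrawals marks the disconnect.
counts = collections.Counter(elem.type for elem in stream)
print(counts)
```

When prefixes like these are withdrawn, the rest of the Internet simply has no route to the addresses behind them, which is why both the services themselves and the DNS infrastructure hosted on Facebook’s network vanished at the same time.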

After backbone network connectivity was restored across Facebook’s data center regions, everything came back up with it. But the problem wasn’t over: reactivating services all at once could cause another round of crashes due to a sudden surge in traffic. In addition, “individual data centers were reporting drops in power usage in the tens of megawatt range, and suddenly reversing that drop in power consumption could put everything from electrical systems to caches at risk,” Janardhan commented.

“Fortunately, this is an event we are well prepared for thanks to the ‘storm’ exercises we have been running for a long time. In a storm exercise, we simulate a major system failure by taking an entire service, data center, or region offline, testing all the infrastructure and software involved. The experience with these exercises gave us the confidence and experience to get things back online and carefully manage the increased loads. In the end, our services were back up and running relatively quickly, with no further system-wide failures,” he wrote.

The internet, of course, loved the whole thing. There were plenty of jokes about MySpace. But some also saw the incident as an opportunity to learn and improve. For example, how about conducting a broad review to understand how to make your own systems more resilient?

Many people also used the outage as a chance to call for a decentralized internet (and, in some cases, to try to sell people on their own blockchain-based social apps). “Congratulations to @Facebook for giving us a very real demonstration of why the move to a decentralized Web 3 is necessary and indeed inevitable,” tweeted Polkadot founder and Ethereum co-creator Gavin Wood.