If you awoke to no Facebook, Instagram, or WhatsApp, you weren’t alone. What went wrong?
NSW may have enjoyed a long weekend, but on the morning back you couldn’t easily see what people had been up to over the break. Anyone who used Facebook or Instagram, or chatted on Facebook’s Messenger and WhatsApp services, was out of luck because the systems were down.
Facebook did eventually get it all working again, but the damage had been done, wiping billions from the value of Facebook’s stock, and even forcing Facebook to apologise on Twitter:
We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience.
— Facebook (@Facebook) October 4, 2021
It wasn’t just a problem with Facebook’s apps, but also with Facebook internally, as one New York Times reporter noted that some Facebook employees couldn’t enter buildings. Essentially, the badge check system wasn’t working, much like the rest of Facebook.
This wasn’t a Facebook scam, but rather a problem inside of one of the world’s biggest social media players. What went wrong, and why did Facebook stop working worldwide?
Facebook’s servers were cut off from the world
When you access the internet, you’re doing so by having your device talk to a series of other devices where information is stored. Think of it as a remote library, where your phone or computer talks to another set of computers to pull up all that information and send it to your device.
However, to access this information, your device first needs an index that tells it where to go to request that information.
When a computer makes a call for that, it turns to the DNS, the “Domain Name System”, which points your device to the computer or set of computers it needs to talk to in order to get the information it needs. There’s more to it than that, but that’s the simple version of what’s going on.
Yet when DNS fails or goes down, your device loses access to that index, which means computers can’t find where they should go to talk to each other, and things don’t work.
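For the curious, here is a minimal sketch in Python of what that lookup step looks like from your device’s side. It is only an illustration: the lookup function and its resolver parameter are names made up for this example, and it leans on Python’s standard socket.gethostbyname call to do the actual DNS query.

```python
import socket


def lookup(hostname, resolver=socket.gethostbyname):
    """Return the IP address for hostname, or None if the name can't be resolved.

    `resolver` is a hypothetical hook added for this sketch so the lookup
    step can be swapped out; by default it uses the system's DNS resolution.
    """
    try:
        return resolver(hostname)
    except socket.gaierror:
        # Raised when name resolution fails, for example when the DNS
        # servers for that name can't be reached.
        return None


if __name__ == "__main__":
    address = lookup("facebook.com")
    if address is None:
        print("DNS lookup failed: no idea where to send the request")
    else:
        print("facebook.com is at", address)
```

When the lookup comes back empty, an app or browser has no address to connect to, which is roughly what devices around the world ran into during the outage.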
For Facebook’s failure on October 5, the DNS issues were only one part of the problem. Another piece, the Border Gateway Protocol (shortened to “BGP”), works out the best route for traffic to reach Facebook’s systems, and this stopped working. Ars Technica noted in its story that a Cloudflare vice president saw the BGP routes were pulled, as indicated in his tweet below:
— Dane Knecht (@dok2001) October 4, 2021
While this might read as a lot of jargon, the long and short of it is that Facebook updated some technology inside its network, and the change cut its systems off from the rest of the web. It wasn’t so much that Facebook’s massive collection of servers had stopped talking or providing access to each other, but rather that the pathway to these servers wasn’t working.
To put it simply, it was as if you needed to find a book using a library catalogue’s book card system, and the path to the shelves holding those cards and the Dewey Decimal System was demolished temporarily.
With no way through, you can’t find what you’re looking for, and neither could the apps or web browsers looking for Facebook, Messenger, Instagram, or WhatsApp.
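The idea of a pulled route can be sketched with a toy routing table in Python. This is a simplified model rather than real BGP: the RoutingTable class, the “peer-a” label, and the 192.0.2.0/24 address block (a range reserved for documentation) are all assumptions made for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy stand-in for the routes the rest of the internet has learned."""

    def __init__(self):
        self.routes = {}  # maps a network prefix to a next-hop label

    def announce(self, prefix, next_hop):
        # A network advertises "you can reach these addresses through me".
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # The advertisement is pulled; the servers may still be running.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def next_hop(self, address):
        # Pick the most specific announced prefix that contains the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no known path to this address
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "peer-a")
before = table.next_hop("192.0.2.10")   # a path exists
table.withdraw("192.0.2.0/24")
after = table.next_hop("192.0.2.10")    # the path is gone
```

In this sketch nothing happens to the machine at 192.0.2.10 itself. Only the directions to it disappear, which is the situation the library analogy describes.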
UPDATE: Facebook has come out with a statement on what went wrong, and while it hasn’t used the terms “DNS” or “BGP”, the outcome sounds remarkably like the technologies needed for both, noting that:
configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication
In short, the technology (BGP) that coordinated traffic to its systems saw a change, and that managed to break everything in the process.