A Blog by Jonathan Low

 

Oct 13, 2021

Why the Internet Keeps Breaking

The internet has become too centralized, reliant on just a few systems, companies and sources of data. 

When one fails, it increasingly has a cascading effect. JL

Joe Tidy reports in the BBC:

Widespread outages are becoming more frequent and more disruptive. "One of the things that we've seen in the last several years is an increased reliance on a small number of networks and companies to deliver large portions of Internet content." More often than not, the cause is a mundane case of human error, compounded by the way the internet is held together with a complex set of outdated systems. The internet has become too centralised, i.e. too much data comes from a single source. That needs to be reversed with systems that have multiple nodes so that no one failure can stop a service from working.

I doubt Mark Zuckerberg reads the comments people leave on his Facebook posts.

But, if he did, it would take him approximately 145 days, without sleep, to wade through the deluge of comments left for him after he apologised for the meltdown of services last week.

"Sorry for the disruption today," the Facebook founder and chief executive posted, following almost six hours of Facebook, WhatsApp and Instagram being offline.

Facebook blamed a routine maintenance job for the disruption - its engineers had issued a command that unintentionally disconnected Facebook data centres from the wider internet.

Around 827,000 people responded to Mr Zuckerberg's apology.

The messages ranged from the amused: "It was terrible, I had to talk to my family," commented one Italian user, to the confused: "I took my phone into the repair shop thinking it was broken," wrote someone from Namibia.

And, of course, the very upset and angry: "You cannot have everything shut down at the same time. The impact is unprecedented," one Nigerian businessman posted. Another from India asked for compensation for the disruption to their business. What is clear now, if it wasn't obvious already, is just how reliant billions of people have become on these services - not just for fun but also for essential communication and trading.

What is also clear is that this is far from being a one-off situation: experts suggest widespread outages are becoming more frequent and more disruptive.

"One of the things that we've seen in the last several years is an increased reliance on a small number of networks and companies to deliver large portions of Internet content," says Luke Deryckx, Chief Technical Officer at Down Detector.

"When one of those, or more than one, has a problem, it affects not just them, but hundreds of thousands of other services," he says. Facebook, for instance, is now used to sign in to a range of different services and devices, such as smart televisions.

"And so, you know, we have these sort of internet 'snow days' that happen now," Mr Deryckx says. "Something goes down [and] we all sort of look at each other like 'well, what are we going to do?'"

Mr Deryckx and his team at Down Detector monitor web services and websites for disruption. He says that widespread outages affecting major services are becoming more frequent and more serious.

"When Facebook has a problem, it creates such a big impact for the internet but also the economy, and, you know... society. Millions, or potentially hundreds of millions, of people are just sort of sitting around waiting for a small team in California to fix something. It's an interesting phenomena that has grown in the last couple of years."


Significant meltdowns

  • October 2021: A "configuration error" brought down Facebook, Instagram and WhatsApp for nearly 6 hours. Other sites like Twitter were also disrupted due to the surge of new visits to their apps.
  • July 2021: More than 48 services, including Airbnb, Expedia, Home Depot and Salesforce, were down for around an hour after a bug with the Domain Name System (DNS) at content delivery company Akamai. It followed a similar outage at the company a month earlier.
  • June 2021: Amazon, Reddit, Twitch, GitHub, Shopify, Spotify and several news sites were down for around an hour after a previously unknown bug was triggered accidentally by a customer at cloud computing service provider Fastly.
  • December 2020: Gmail, YouTube, Google Drive and other Google services went down simultaneously for around 90 minutes because of what the company described as an "internal storage quota issue".
  • November 2020: A technical problem with one of Amazon Web Services' facilities in Virginia, USA, impacted thousands of third-party online services for several hours, mostly in North America.
  • March 2019: Facebook, Instagram and WhatsApp all went down or were severely disrupted for around 14 hours after a "server configuration change". Some other sites that use Facebook for logins, including Tinder and Spotify, were also affected.

Inevitably, at some stage, during a large outage of services, people worry that the disruption is the result of some sort of cyber-attack.

But experts suggest, more often than not, it's down to a more mundane case of human error, compounded, they say, by the way the internet is held together with a complex set of outdated and fiddly systems.

During the Facebook outage, experts joked on social media platform Twitter that some of the usual suspects behind outages are "older than the Spice Girls" and "designed on the back of a napkin".

Internet scientist Professor Bill Buchanan agrees with this characterisation: "The internet isn't the large-scale distributed network that DARPA (the Defense Advanced Research Projects Agency), the original architects of the internet, tried to create, which could withstand a nuclear-strike on any part of it.

"The protocols it uses are basically just the ones that were drafted when we connected to mainframe computers from dumb terminals. A single glitch in its core infrastructure can bring the whole thing crashing to the floor."

Professor Buchanan says improvements can be made to make the internet more resilient, but that many of the fundamentals of the net are here to stay for better or worse.

"In general, the systems work and you can't just switch certain protocols of the internet 'off' for a day, to try to remake them," he says.

Instead of trying to rebuild the systems and structure of the internet, Professor Buchanan thinks we need to improve the way we use it to store and share data, or risk more mass outages in the future.

He argues that the internet has become too centralised, meaning that too much data comes from a single source. That trend needs to be reversed with systems that have multiple nodes, he explains, so that no one failure can stop a service from working.
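
To make the multi-node idea concrete, here is a minimal Python sketch. It is an illustration only, not something Professor Buchanan or any of the companies mentioned has described. The function name fetch_with_failover and the two stand-in nodes are hypothetical, and each "node" is assumed to be a simple callable that either returns content or raises an error.

```python
from typing import Callable, Sequence


class AllNodesFailed(Exception):
    """Raised when every node in the pool fails to answer."""


def fetch_with_failover(nodes: Sequence[Callable[[], str]]) -> str:
    """Try each node in turn and return the first successful response."""
    errors = []
    for node in nodes:
        try:
            return node()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    # Only reached when no node answered.
    raise AllNodesFailed(f"{len(errors)} node(s) failed")


# Hypothetical stand-ins for two independent copies of the same service.
def broken_node() -> str:
    raise ConnectionError("node unreachable")


def healthy_node() -> str:
    return "page content"


# One node is down, but the service still answers from the other.
result = fetch_with_failover([broken_node, healthy_node])
```

The service fails only if every node fails at once, which is the property the centralised setups described above lack.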

There is a silver lining here. Although significant internet outages affect users' lives and businesses, they can also, ultimately, help to improve the resilience of the internet and the web services plugged into it.

For example, Forbes estimates that Facebook lost $66m (£48.5m) during the six-hour outage from the suspension, or exodus, of advertisers on the site. That sort of loss is likely to focus the minds of senior executives on preventing it from happening again.

"They lost a huge amount of money in that day, not just in their stock price but in their operational revenues," according to Mr Deryckx.

"And if you look at outages caused by content delivery networks like Fastly and Cloudflare, they also lost a huge number of customers to the competition. So, I think these operators are doing everything they can to keep things online."


