The Low-Down: Amazon Web Services Outage Reveals Risk Posed By Lack of Backup Across Internet

Becoming dependent on any one provider of anything is an unsustainable risk, especially when it comes to critical assets like technology infrastructure and data storage. In an era of consolidation and dominance, too many enterprises may be discovering that they are too reliant on one vendor. JL

Nat Levy reports in GeekWire:

Lasting five hours, the outage knocked out access to a litany of websites and apps that run on AWS, including Expedia, Slack, Medium and the U.S. Securities and Exchange Commission. The most important takeaway is the necessity of redundancy in cloud storage. All technology fails at some point. Large swaths of the internet went down, but other sites didn’t: those that had their data spread across multiple regions. Experts emphasized using multiple cloud providers to store data.
The digital snow day is over, as Amazon Web Services has fixed the issues with its Simple Storage Service, or S3 for short, that crippled significant chunks of the internet Tuesday.
Starting a little after 9:30 a.m. Pacific time Tuesday, and lasting close to five hours, the S3 cloud storage service experienced “high error rates.” This outage knocked out access to a litany of websites and apps that run on AWS, including but not limited to Expedia, Slack, Medium and the U.S. Securities and Exchange Commission. The outage even temporarily affected the AWS service health dashboard, which displays outages and events.
Amazon has not fully detailed what caused the high error rates. Nick Kephart, senior director of product marketing for San Francisco-based network intelligence company ThousandEyes, monitored the outage throughout the day. He said information could get into Amazon’s overall network, but attempting to establish a network connection with the S3 servers was like hitting a wall. It stopped all traffic dead in its tracks. So any site or app that hosted data, images or other information on S3 was affected.
Without having access to Amazon’s servers, Kephart couldn’t say why it became impossible to connect with the S3 servers. He said it isn’t clear whether a human error, an infrastructure failure, a configuration problem or an automation issue caused the problem. But he theorized it was a pretty complicated malfunction, given how widely the outage spread.
“It wasn’t just the system completely misbehaving but something deeper in the infrastructure that caused these problems,” Kephart said.
ThousandEyes also produced a visualization to show the extent of the outage and all the interactions within the AWS network.
As to why the outage was so widespread, Amazon’s status as cloud king, with a market share of more than 40 percent, comes into play. Another factor, Kephart said, is the way AWS programs are built on top of each other, meaning that S3 going down impacts other services.
“Amazon Web Services builds many of their individual services on building blocks built on each other,” Kephart said. “S3 is one of the very fundamental building blocks of AWS. When S3 fails, many, many, many other services fail alongside because they are all built on top of S3.”
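The cascade Kephart describes can be illustrated with a small sketch. The dependency map below is entirely hypothetical (the service names are made up for illustration and do not describe AWS internals); the function simply walks the map to find everything that sits, directly or indirectly, on top of a failed building block.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it is built on.
# These names are illustrative assumptions, not a description of AWS internals.
DEPENDS_ON = {
    "storage": [],
    "queue": ["storage"],
    "functions": ["storage", "queue"],
    "dashboard": ["functions"],
    "dns": [],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each service, which services are built on top of it?
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    pending = deque([failed])
    while pending:
        current = pending.popleft()
        for service in dependents[current]:
            if service not in affected:
                affected.add(service)
                pending.append(service)
    return affected


print(sorted(affected_by("storage", DEPENDS_ON)))  # ['dashboard', 'functions', 'queue']
```

In this toy map, a failure in the single "storage" block takes three other services with it, while "dns", which does not depend on it, is untouched.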
Now that the issues have been worked out, the question turns to what can be learned from this outage. Several experts surveyed by GeekWire say the most important takeaway from this event is the necessity of redundancy in cloud storage. Shawn Moore, CTO of Orlando-based web experience platform Solodev, said all technology fails at some point. Large swaths of the internet went down in Tuesday’s outage, but other sites and apps didn’t experience any disruption. Those are the ones that had their data spread across multiple regions.
“The ones who have fully embraced Amazon’s design philosophy to have their website data distributed across multiple regions were prepared,” Moore said. “This is a wakeup call for those hosted on AWS and other providers to take a deeper look at how their infrastructure is set up and emphasizes the need for redundancy – a capability that AWS offers, but it’s now being revealed how few were actually using.”
David Linthicum is senior vice president at Cloud Technology Partners, a company based in Boston that helps enterprises migrate their data to cloud storage providers like AWS, Microsoft Azure and Google Cloud. He said the outage seems like an isolated incident, something that is bound to happen occasionally.
“Systems fail, and from time to time clouds will fail,” he said. “Amazon’s ability to get things up and running quickly, and get back to business, will be the real test.”
Linthicum went on to say that he doesn’t think Tuesday’s outage will keep people from using cloud storage.
“Amazon Web Services, and the other public cloud providers, pretty much stay on top of their operations,” he said. “Certainly much better than enterprises do.”
In addition to pushing redundancy and hosting data at multiple centers in different regions, experts emphasized using multiple cloud providers to store data. Not only does that protect customers from a system-wide outage, it can also let users switch between providers as cost dictates.
Akash Nankani, a former lead program manager at Microsoft, founder of NanSoft Studios and creator of the government filing tracking site SECGems, said he tries to make his products “provider agnostic,” so that if an incident like Tuesday’s AWS outage went on for a long time, he could make a quick change to remove AWS dependency.
“In my view, every business should ask this question to themselves: ‘If tomorrow, for whatever reason (valid or invalid), if Amazon (or any other provider that you depend on) decides to ban/blacklist my account or business, how will I deal with it? How soon before I can recover from it? And have I pro-actively tested this scenario before it occurs?'”
“While I have a great deal of respect for Amazon/Microsoft/Google/IBM Bluemix/OVH, etc. and have used/experimented with all of them, from a business continuity perspective, I think investing in ‘multi-provider’ support is more important than ‘multi-region.’ This also comes with the benefit of dynamically switching to lowest cost provider as well as dealing with provider/regional outage.”
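As a rough illustration of the “provider agnostic” idea, here is a minimal sketch, assuming each provider or region can be wrapped behind the same two-method interface. Every name in it (`StorageError`, `InMemoryStore`, `RedundantStore`) is hypothetical; the in-memory class stands in for a real provider, which would instead wrap that provider's own SDK.

```python
class StorageError(Exception):
    """Raised when a backend cannot serve a request."""


class InMemoryStore:
    """Hypothetical stand-in for one provider or region."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self._objects = {}

    def put(self, key, data):
        if not self.available:
            raise StorageError(f"{self.name} is unavailable")
        self._objects[key] = data

    def get(self, key):
        if not self.available:
            raise StorageError(f"{self.name} is unavailable")
        if key not in self._objects:
            raise StorageError(f"{key!r} not found in {self.name}")
        return self._objects[key]


class RedundantStore:
    """Writes to every reachable backend and reads from the first that answers."""

    def __init__(self, backends):
        self.backends = list(backends)

    def put(self, key, data):
        written = 0
        for backend in self.backends:
            try:
                backend.put(key, data)
                written += 1
            except StorageError:
                continue  # keep going; another backend may still accept the write
        if written == 0:
            raise StorageError("no backend accepted the write")
        return written

    def get(self, key):
        for backend in self.backends:
            try:
                return backend.get(key)
            except StorageError:
                continue  # fall through to the next provider or region
        raise StorageError(f"no backend could serve {key!r}")


primary = InMemoryStore("provider-a")
secondary = InMemoryStore("provider-b")
store = RedundantStore([primary, secondary])

store.put("logo.png", b"image bytes")
primary.available = False       # simulate an outage at the first provider
print(store.get("logo.png"))    # still served, from provider-b
```

The same wrapper works whether the backends are two regions of one provider or two different providers, which is the distinction Nankani draws between “multi-region” and “multi-provider” support. A production version would also need to handle consistency between copies, which this sketch ignores.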


A Blog by Jonathan Low

Mar 1, 2017
