The collapse of Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2) left several large websites out of commission on Thursday.
Amazon reportedly attributed the problems to what it called “a networking event” that caused runaway re-mirroring of Elastic Block Store volumes. The resulting cascade took down hundreds, possibly thousands, of websites, including Foursquare, Hootsuite, Quora and Reddit.
Other details remained unclear.
Although AWS has multiple regions and availability zones, a system designed to prevent a single point of failure, an apparent engineering flaw allowed the mass outage to occur.
Most affected websites were up and running by Friday, but there was still a scattering of unresolved problems, according to news reports.
Amazon did not respond to the E-Commerce Times’ request for comment by press time.
Cloudburst Bound to Come
Amazon’s own infrastructure is designed to be bulletproof, even during the holiday season, but the cloud technology the company has extended to its Web services customers is apparently not quite so robust.
“This is inevitable,” technology project manager and Geek 2.0 blogger Steven Savage told TechNewsWorld. “Cloud computing is still developing, and EC2 is used by a lot of companies. There would inevitably be some big, public outage — indeed I expected some very public failure of the cloud to happen in the near future… . However, this outage is much, much larger than I’d expected and took longer to recover from.”
While Amazon’s failure may prompt skepticism regarding cloud computing, the trend toward the cloud won’t slow down quickly. It’s just too useful and too convenient.
“This outage is going to deliver a black eye to cloud computing but isn’t going to change the existing trends because the trends are powerful — speed, ease, and reduction of cost,” said Savage. “People who work in IT are going to be battling for weeks or months to clear up misconceptions about this.”
AWS performed poorly in terms of communications with its customers, observed Savage. “The general consensus I see is that their communications were too little, not fast enough, and not detailed enough. There also wasn’t a sense of urgency. People don’t seem happy — and can you blame them? Amazon engaged better than some companies — the infamous Playstation Network outage about a year or so ago comes to mind — but not well enough.”
AWS may need to further consider worst-case scenarios for its service, suggested Savage.
In its promo material, AWS insists that “availability zones are distinct locations that are engineered to be insulated from failures in other availability zones.” That didn’t hold true as the crash cascaded across zones.
“I can’t speak on the exact arrangement of their technologies, but in my past experiences with data centers, frankly, people are not paranoid enough nor do they think worst-case scenario enough,” said Savage. “Caution in design is not the proper approach — assuming the worst is, because in an age of high-speed, high-technology everywhere, the worst will happen by the odds.”
Amazon will need to learn from the crash to avoid future snafus, and sharing its knowledge could improve both its public image and cloud computing in general.
“Amazon needs to assess their technical and failover architecture and publicly communicate how they’ll fix this and improve it,” said Savage. “In fact, they could share their findings publicly as a sort of public service and gain additional credibility on it. The feedback alone will help them further.”
The outage gives Amazon’s competitors a chance to capture some of its clients.
“Expect competitors to swoop on EC2 users after this,” predicted Savage. “Amazon just became spectacularly vulnerable. Amazon can’t let this happen again. It’s embarrassing and destroys confidence — which is a major cornerstone of their strategy.”
Hey, Crashes Happen
When your electricity goes out, you don’t vow to quit using electricity. If the cloud goes down, you don’t quit the cloud.
“It is certainly a sign of the times that cloud outages get more attention today than even electric outages,” said Al Hilwa, program director for applications development software at IDC.
“Outages happen. They happen inside and outside the firewall,” he told TechNewsWorld. “However, when you have big popular and centralized services, they are going to create mass disruption.”
The Amazon crash isn’t likely to put a dent in the move to the cloud, in Hilwa’s view. These are the early days of cloud computing, and crash or no crash, our dependence on cloud computing is only going to increase.
“The levels of services will improve over time as service providers learn,” he said. “[Amazon] should certainly try to communicate better about it, however. In these situations, usually every ounce of their energy is sucked into restoring services as quickly as possible.”