On April 21st, 2011, a thunderstorm hit Amazon AWS. The storm lasted almost 4 days (timeline documented by @ericmkidd). Later, the Amazon AWS team posted a detailed post-mortem analysis of what happened. It turned out that an incorrect network configuration change caused traffic to be redirected to the lower-capacity redundant EBS network. This network serves as a back-up network that allows EBS nodes to reliably communicate with other nodes in the EBS cluster and provides overflow capacity for data replication. The redirect started a cascading failure across EBS volumes and RDS. It also impacted EC2 and CloudWatch.
Some of their customers, such as Quora and Reddit, were brought down. Others, like Netflix and SimpleGeo, largely survived. Since then, many of them have posted very insightful analyses of what happened, how they reacted, and what they learned from this outage. You can find many of those writings in a nice list compiled by @toddhoffious.
Outages are inevitable. We all know that’s true. We don’t have to like it, but we have to live with it. But thinking that good design can eliminate failure outright is naive. Demanding that our providers adapt their services to eliminate specific failures is where the rubber meets the road.
That’s absolutely correct. Outages are part of cloud life. It’s the service provider’s goal to eliminate them, but they will happen. By now, “designing for failure” is a well-known principle for implementing a cloud application. I want to point out one more thing that any enterprise that uses a public cloud needs to do diligently: you have to proactively and holistically monitor your services deployed in the public cloud. You need a good weather forecast system.
What I mean by “proactively” is that you can’t just wait for your service provider to send you alerts. For example, Amazon first acknowledged its EBS problem at 1:41am PDT. But customers who took a proactive approach detected the abnormality well before that. Heroku’s monitoring system raised an alert at 1:15am, almost half an hour earlier than Amazon’s first status update. That lead gives you critical time to mobilize your ops team, investigate, and prepare a remediation plan.
But to make sure your alerts are meaningful and actionable, you also need extensive monitoring coverage. Bizo attributed its survival of this outage first of all to its services being well monitored, through a combination of external verification and CloudWatch data. Your monitoring system should be able to monitor data provided by the cloud provider, such as CloudWatch. Most of the time, these data are critical for giving you a view of your cloud resources. But in this case, the CloudWatch service in that region was also affected. To guard against that, you also need either in-host/in-OS monitoring or an external verification service to tell you the performance status of your service. If your monitoring system supports both, you will get even bigger benefits in deployment and maintenance. It also makes the analysis described below much more effective.
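To make the point about coverage concrete, here is a minimal sketch of how readings from several independent monitoring sources might be combined, so that losing one source (as happened with CloudWatch in the affected region) lowers visibility rather than silencing alerts. The `Signals` type, the `assess` function, and the status strings are all hypothetical names chosen for illustration; they are not part of CloudWatch or any real monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """One reading per monitoring source; None means the source is unavailable."""
    provider_metrics_ok: Optional[bool]  # e.g. data reported by the cloud provider
    in_host_ok: Optional[bool]           # e.g. an agent running inside the OS
    external_check_ok: Optional[bool]    # e.g. a probe from outside the cloud


def assess(signals: Signals) -> str:
    """Combine the sources into a single status string (illustrative policy)."""
    readings = [
        signals.provider_metrics_ok,
        signals.in_host_ok,
        signals.external_check_ok,
    ]
    available = [r for r in readings if r is not None]

    if not available:
        # Every source is dark: we cannot claim the service is healthy.
        return "unknown"
    if not all(available):
        # At least one source that is still reporting sees a problem.
        return "alert"
    if len(available) < len(readings):
        # Healthy as far as we can tell, but one or more sources are missing.
        return "ok-reduced-visibility"
    return "ok"
```

With this policy, a failing external check still raises an alert when the provider's own metrics are unavailable, which is exactly the situation where relying on a single source would leave you blind.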
Collecting the data is only the first step. You have to know how to analyze those data holistically to find the probable root cause. When your service uses lots of resources or has a complicated architecture, you will be overwhelmed by alerts when a failure storm like the Amazon outage happens. A sophisticated but intuitive analytic engine can help you correlate those events, learn the behavior, and pinpoint the probable root cause. This result, with the help of well-defined policies and runbooks, increases the self-resiliency of your cloud applications.
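As a rough illustration of what correlating events can mean, the sketch below groups the alerts that fire within a short window and counts which shared dependency shows up most often. This is a deliberately simple stand-in for a real analytic engine, not a description of any particular product. The `Alert` type, the `probable_root_causes` function, the five-minute window, and the resource names in the test data are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Alert:
    timestamp: float              # seconds since some reference point
    service: str                  # the service that raised the alert
    depends_on: Tuple[str, ...]   # resources this service relies on


def probable_root_causes(
    alerts: List[Alert], window_seconds: float = 300.0
) -> List[Tuple[str, int]]:
    """Rank dependencies by how many alerts in the first window share them."""
    if not alerts:
        return []

    start = min(a.timestamp for a in alerts)
    counts: Counter = Counter()
    for alert in alerts:
        # Only consider alerts that belong to the same burst.
        if alert.timestamp - start <= window_seconds:
            # Count each dependency once per alert.
            for dep in set(alert.depends_on):
                counts[dep] += 1

    # Most frequently shared dependency first.
    return counts.most_common()
```

If three services all alert within a few minutes and all three depend on the same storage cluster, that cluster rises to the top of the list, which gives the ops team one place to look first instead of three separate alerts to chase.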
As @mkrigsman mentioned in his examination of Amazon’s cloud failure from a CIO perspective,
Outsourcing does not relieve enterprise buyers of responsibility to manage their own destiny. For cloud computing, this means the enterprise must design applications for resiliency while planning for disaster recovery.
I will add that enterprises must also design their operations management solution to proactively and holistically detect and isolate the probable root cause of problems in their cloud applications. This will ultimately drive the resiliency of your cloud services.