As many of you are no doubt aware, Microsites were unavailable from approximately 2:00 PM EDT on October 22 until 7:30 AM EDT on October 23. The cause of this outage was a widespread, catastrophic failure of the Amazon Cloud Services used to host Microsites. Here is what happened, what was done to fix it, and how we will be modifying our infrastructure in response to the event.
First though, I want to provide some background on why we have selected Amazon Web Services (AWS) for our hosting. There are 3 main reasons:
- Uptime, reliability, and automated backups
- Ease of Scalability
- Managed Infrastructure
The first, uptime, reliability, and automated backups, may seem out of place given the recent downtime. However, over the last 12 months we have seen an uptime of 99.79%, even with Monday’s outage. Overall, this has provided much higher stability than if we were managing our own hardware, and it has done so at lower cost and with lower management overhead. In addition, Amazon provides backup solutions that keep our data safe, even in the case of catastrophic failures. For example, one of the systems that failed on Monday was the database behind our .JOBS Microsite product. However, no site configurations or user profiles were lost.
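As a rough check on the 99.79% figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the 365-day window and the exact start and end times are assumptions (the outage times reported above are approximate), and the function name is made up for this example.

```python
from datetime import datetime

# Assumption: the 12-month window is treated as 365 days.
HOURS_PER_YEAR = 365 * 24


def downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_hours


# Approximate outage window from this post (both times EDT).
outage_start = datetime(2012, 10, 22, 14, 0)
outage_end = datetime(2012, 10, 23, 7, 30)
outage_hours = (outage_end - outage_start).total_seconds() / 3600

total_downtime = downtime_hours(99.79)

print(f"Downtime implied by 99.79% uptime: {total_downtime:.1f} hours")
print(f"Length of this outage: {outage_hours:.1f} hours")
```

Under those assumptions, 99.79% uptime works out to about 18.4 hours of downtime over the year, of which this outage accounts for roughly 17.5.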
The second is the ease of scalability. As our traffic load changes, we can add servers, databases, storage, staging environments, and load balancers in minutes. There is no need to purchase and configure hardware; we can scale our system to twice its normal capacity (and back) in less than an hour. This is simply impossible in a normal data center environment without maintaining unused and expensive hardware on a standby basis.
The third reason is related to the second. Managed Infrastructure means we do not have to be concerned with the mundane issues of maintaining a data center for Microsites. This includes everything from purchasing and installing servers to replacing worn and outdated equipment, along with the time and man hours needed to do all of the above.
This leads us to what happened on Monday the 22nd. For reasons that are still being researched, there was a failure of Amazon’s storage volume service, known as EBS. This service was responsible for serving out all of the templates, CSS, images, and JavaScript for Microsites. It was these files that first revealed the problem, as our load times spiked dramatically (from our normal 3 seconds to 15 or 20 seconds, and then to minutes) over the course of half an hour. Very quickly, other services began to fail, including the databases, the servers, and even the admin console used to manage everything. This failure was localized to Amazon’s Northern Virginia data center, and impacted many sites besides Microsites, including Foursquare, Reddit, and Coursera.
Amazon moved quickly to repair the data center, but the problems were widespread, and it took them several hours to make any progress at all. Once they had, they began to move methodically through the data center and rebuild all of the instances. However, this took a long time, and the root of the issue appears to have been in the same cluster in which Microsites are hosted (Availability Zone US-EAST-1d for the curious). As a result, the databases for Microsites were down much longer than the servers that host them, or sites that were in other clusters. Once Amazon corrected the issues with the servers and databases, Microsites came back up immediately.
On our end, we worked late into the night and early the next morning attempting to bring up backup servers and restore databases. However, the problems with the cluster meant that even new databases from backups could not be initialized until the cluster was repaired. As a result, we were limited in our ability to restore service until the underlying issue was corrected.
This leads directly into how this is going to change our infrastructure plan. While 99.79% uptime is good, we did learn a very important lesson about putting all our eggs in one basket, and we can do better. In order to do so, we will not be replacing Amazon, but adding an additional layer of backups and redundancy to the Microsite Deployment Environment that will allow us to gracefully handle any future Amazon outages. This will be done in three ways:
- We will immediately be creating servers in additional Amazon clusters. This will include at least some servers in Amazon’s Oregon Data Center, and possibly in their European Data Center. This will keep any individual data center from being a single point of failure.
- We will be building a small cluster of servers in our own data center as a last-resort fallback system. Previously, the cost-to-benefit ratio for this type of system ruled out creating it, but recent infrastructure changes in our data center combined with the current situation change that equation. This server cluster, while not normally a part of serving Microsites, will be available should the entirety of Amazon become unavailable.
- We will be researching alternative Cloud Services to run alongside the Amazon clusters. This will allow us to split load out over not just Amazon systems, but also additional data centers that are completely separate from Amazon. This will offer insulation against DDoS attacks or hardware failure at either facility.
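To make the layered fallback idea concrete, here is a minimal Python sketch of choosing the first healthy hosting tier from an ordered list. It is an illustration only, not a description of our actual deployment: the `Backend` type, the `pick_backend` function, the tier names, and the stubbed health checks are all hypothetical, and a real health check would probe the servers over the network.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Backend:
    """A hosting tier plus a callable that reports whether it is usable."""
    name: str
    is_healthy: Callable[[], bool]


def pick_backend(backends: Sequence[Backend]) -> Optional[Backend]:
    """Return the first healthy backend in priority order, or None."""
    for backend in backends:
        try:
            if backend.is_healthy():
                return backend
        except Exception:
            # A health check that raises is treated as an unhealthy tier.
            continue
    return None


# Hypothetical tiers, highest priority first. Health checks are stubbed
# to simulate an outage in the primary data center.
tiers = [
    Backend("aws-northern-virginia", lambda: False),
    Backend("aws-oregon", lambda: True),
    Backend("secondary-cloud", lambda: True),
    Backend("own-data-center", lambda: True),
]

chosen = pick_backend(tiers)
print(chosen.name if chosen else "no backend available")
```

In this simulation the Northern Virginia tier is marked down, so the selection falls through to the Oregon tier. The ordering reflects the plan above: other Amazon clusters first, then a separate cloud host, with our own data center as the last resort.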
The timeline for the first modification is immediate, and will be in place by the end of 4th Quarter 2012. The timeline for the second and third modifications is longer and fuzzier, but the hope is to have our own data center ready by Q2 2013, and a secondary cloud host online by Q3 2013.
The goal of all of this is stability and availability. It is our goal, and we plan on making it happen.