Reddit, Foursquare, EngineYard and Quora were among the many sites that went down recently due to a rather prolonged outage of Amazon's cloud services. On Thursday, April 21, when Amazon Elastic Block Store (EBS) went offline, it took down many of the Web and database servers that depended on that storage. With Amazon working aggressively to set things right, most of the services were restored by Sunday, April 24. As promised, and as would be expected, Amazon has now come out with a detailed explanation of what went wrong, why the failure was so widely felt and why it took so long to restore all the services. Some say that, measured against Amazon's promised availability, this lengthy outage means Amazon may need to maintain full availability for more than a decade to adhere to its promised service level commitments.
Now, let’s examine what happened and how. To start with some basics: Amazon has its facilities spread around the world. Most users would know that its cloud computing data centers are in five different locations: Virginia, Northern California, Ireland, Singapore, and Tokyo. Within each of these regions, the cloud services are further separated into what Amazon calls Availability Zones. Each availability zone is self-contained, with its own physically and logically separate group of computers. Amazon explains that such an arrangement helps customers choose the right level of redundancy as appropriate to their own needs. Customers who want maximum robustness can pay a premium to host across multiple regions; the logic is that hosting in multiple availability zones within the same region should provide comparable robustness to hosting across multiple regions, but with much better economics for the customer.
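To make the region and zone distinction concrete, here is a minimal sketch using the boto3 Python SDK (which post-dates this incident; it is my illustration, not anything from Amazon's post-mortem). It lists the availability zones of a region and spreads instances across them; the region name, AMI ID and instance type are placeholders.

```python
# Illustrative sketch: enumerate Availability Zones in a region and
# launch one instance per zone so no single zone failure takes everything down.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # the Virginia region

# List the zones currently available in this region.
zones = [
    z["ZoneName"]
    for z in ec2.describe_availability_zones()["AvailabilityZones"]
    if z["State"] == "available"
]

AMI_ID = "ami-12345678"  # placeholder image ID

for zone in zones:
    # Placement pins each instance to a specific Availability Zone.
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```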
Amazon offers several services as part of this arrangement, and Elastic Block Store (EBS) is an important one. With EBS, Amazon provides mountable disk volumes to virtual machines running on the better-known Elastic Compute Cloud (EC2). This is quite attractive to customers, as the service gives virtual machines large amounts of reliable storage – typically used for database hosting and the like. The power of this feature can be seen in the fact that, beyond its use from EC2, another Amazon service, the Amazon Relational Database Service (RDS), also uses it as a data store. For high availability, Amazon has designed EBS to replicate data between multiple systems. Given the volume and variety involved, this process is highly automated: if for some reason an EBS node loses connection to its replica, alternate storage within the same zone is instantly made available so replication can continue.
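As a rough illustration of how EBS fits with EC2 (again a hedged sketch using the boto3 SDK, not drawn from the original post or from Amazon's report): an EBS volume is created inside a single availability zone and attached to an instance running in that same zone, after which the guest OS can format and mount it. The instance ID, size and zone below are placeholders.

```python
# Illustrative sketch: create an EBS volume and attach it to a running instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# EBS volumes live inside a single Availability Zone and can only be
# attached to instances running in that same zone.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,          # size in GiB
    VolumeType="gp2",
)

# Wait until the new volume is ready before attaching it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Expose the volume to the instance as a block device the guest OS can mount.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",
)
```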
As per Amazon, during routine maintenance in its Virginia operations on April 21, engineers were trying to make a change to the network configuration of the zone. As part of the process, traffic for the affected routers was apparently moved onto a low-capacity network instead of the intended backup. The low-capacity network is meant for inter-node communication, not for large-scale replication and data transfer between systems, and the additional traffic caused it to malfunction. With the primary network brought down for maintenance and the secondary network malfunctioning, the EBS nodes lost their ability to replicate for want of reachable peers. This is where the unintended consequences of automation began to rear their ugly head. Every node in the network acted as if it were at risk and began frantically looking for available nodes with free space for replication. While Amazon tried to restore the primary network, the damage had by then been done: all the available space within the cluster was already used up, while the remaining nodes continued their search for nodes with free space – free space that simply was not available.
This massive deadlock of nodes hunting for replicas, with no free space to be found, degraded the control system’s performance. The control system’s performance issues, in turn, severely impacted the execution of new service requests such as creating a new volume. A long backlog built up for the slow control system to work through, and over time this reached catastrophic proportions, with some requests beginning to come back with failure messages. Now comes the second and most crucial part of the outage – unlike other services, the control system spans the region, not the individual availability zones. The impact was therefore felt across different availability zones. Remember the idea of a Single Point Of Failure? It was proven here in its full might.
Slowly and deliberately, Amazon began the course correction – by tending to the control system and by adding more nodes to the cluster. Over time the backlog on the control system cleared, though it took painful effort and a lot of time. Outages of public cloud systems have made news in the past, but clearly, with time, the body of knowledge and maturity levels ought to improve things. Cloud service providers make high availability the cornerstone of their offerings, and this outage in many ways calls such claims into question. Even as this outage hit Amazon’s Virginia operations, many AWS users managed to keep their systems available. A majority of those installations had fallbacks in the form of multi-region and multi-zone coverage. Such moves necessarily bring the cost-and-complexity equation into consideration.
It’s a little odd to see that when the problem of node unavailability occurred, Amazon’s own systems behaved almost like a denial-of-service attack within their environment. Amazon now claims that this aspect of its crisis response has been set right, but one may have to wait until the next outage to see what else could give way. It may be noted that Amazon’s cloud services suffered a major outage in 2008, and the failure pattern looks somewhat similar upon diagnosis. Clearly, the systems need to operate differently under different circumstances – while it is normal for nodes to keep replicating in response to storage or access concerns, the system ought to exhibit different behavior in a different kind of crisis. With the increasing adoption of public cloud services, the volume, complexity and range of workloads will certainly increase, and the systems will get tested under varying circumstances for availability and reliability. All business and IT users will seek answers to such questions as they consider moving their workloads onto the cloud.
It is interesting to see how Netflix, a poster user of Amazon cloud services, managed to survive this outage. Netflix says, “When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3 and Cassandra services that we do depend upon were not affected by the outage”. Netflix admits that its service ran without intervention, but with a higher than usual error rate and higher latency than normal through the morning, which is the low-traffic time of day for Netflix streaming. Among the major engineering decisions implemented to avoid such outages are designing applications to be stateless and maintaining multiple redundant hot copies of data spread across zones. Netflix calls its solution “Cloud Solutions for the Cloud”: the claim is that instead of fork-lifting existing applications from their data centers to Amazon's and simply using EC2, this approach fully embraces the cloud paradigm. Essentially, Netflix has automated its zone fail-over and recovery process and hosted its services in multiple regions, while reducing its dependence on EBS.
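A minimal sketch of the fail-over idea Netflix describes – keeping redundant hot copies in several zones and reading from another zone when one fails – might look like the following. This is purely my illustration of the pattern, not Netflix's code; the replica callables and zone names are assumptions.

```python
# Illustrative sketch of zone fail-over for reads against redundant hot copies.
class ZoneFailoverReader:
    def __init__(self, replicas_by_zone, local_zone):
        # replicas_by_zone maps a zone name to a callable that reads a key
        # from that zone's replica and raises an exception on failure.
        self.replicas_by_zone = replicas_by_zone
        self.local_zone = local_zone

    def read(self, key):
        # Prefer the local zone for latency, then try the remaining zones.
        zones = [self.local_zone] + [
            z for z in self.replicas_by_zone if z != self.local_zone
        ]
        last_error = None
        for zone in zones:
            try:
                return self.replicas_by_zone[zone](key)
            except Exception as err:   # a zone outage surfaces here as errors/timeouts
                last_error = err       # remember it and fail over to the next zone
        raise last_error               # every zone failed; surface the last error
```

The price of this resilience is exactly what the post notes: every hot copy must be kept in sync across zones, which raises cost and operational complexity.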
Clearly there are ways to get the best of the cloud – except that some of them come with different economics and call for greater ability to engineer and manage operations. Amazon may have to increase the level of transparency around its design, and its operational metrics need to cover many more areas of operations than the narrow set users get to see now. To sum up, I would hesitate to call the AWS outage a failure of the cloud, but the journey into the cloud calls for more preparation and better thought-out design on the user’s side.
As enterprises concentrate on growth, they remain vigilant about costs and operational efficiencies – coming out of recession, even in times of high growth and radiant optimism. Such a model of growth provides IT with a lot of fresh opportunities to adapt and innovate. More than ever, this new model of growth mandates that IT raise its strategic importance to the business rather than be content to focus on delivering generic business plans. In many ways the tenor of change is set by this changing context – with budgets still tight, CIOs are being forced to “think and act different”. One critical way to do that is to follow the time-tested model of being creative in discarding the past while taking a bold and fresh approach to creating a new future for IT within enterprises.
The classic way of conceptualizing a new IT organization and its contribution to the enterprise starts by thinking aloud, typically by asking the question “What if?” Cloud now sits at or near the top of every CIO’s and IT organization’s radar: some expect more than half of new workloads to move naturally to the cloud, on top of the expectation that a majority of applications and infrastructure inside enterprises will move to the cloud over the next few years. That forces enterprises to think through, quickly, the future possibilities for IT in terms of alternative models and the cloud.
Let’s face the facts: CIOs of large organizations have to manage the burden of the past in terms of legacy systems, data and the processes set around them. These come with huge maintenance costs and in many cases impose restrictions on flexibility and extensibility. More importantly, in many cases these systems get in the way of contributing to business agility in an increasingly dynamic world of business across industries.
The cloud model is increasingly being adopted by companies looking to lower costs, improve scalability and enhance flexibility. Many different models of cloud adoption abound, varying by size, maturity, expectations, the nature of the industry and so on. But all agree on one need: cloud services must be well integrated with existing legacy systems. Some are choosing a hybrid approach between online and on-premise services as a low-risk way to test the benefits, and these too depend on tight integration with the legacy estate.
Possibilities include selectively letting go of the past, unlocking resources, realigning priorities and setting new directions to create more space for innovation and greater business value. Some CIOs see this as an opportunity to look beyond delivery models toward providing strategic advantage to the business through sophisticated information and insights. Cloud-centric technologies are a big driver in enabling IT to take center stage in support of innovation, business growth and delivered value.
For enterprises and the CIO, this journey is replete with possibilities and challenges: the upside could be alluringly high, but the downside fall could be steep if strategies are not carefully formulated and well executed. After all, we are living in an era where technology edge is almost equivalent to business edge, and this warrants a new approach to business technology architecture and strategy. So when Saugatech published its interview with Mike Wilens on Fidelity’s cloud journey, I got really interested.
In a very detailed discussion, Mike covers a series of topics and brings out the fact that while people talk a lot about lock-in, reliability and security in the cloud, these concerns are manageable with good engineering and good planning, and it’s not really all that scary – the cloud is indeed doable and can be a key enabler of business innovation and enterprise agility. The key here is that the cloud is genuinely revolutionary in how we think about application delivery and infrastructure.
In the discussions published as part of cloud leadership strategies, Mike outlines the approach to the cloud, the execution plan and the alignment to business needs. Covering all aspects of the cloud journey within Fidelity, there are lots of important insights coming out of actual experience. Starting with foundational issues such as standards, cost avoidance and experimenting with new capabilities in the cloud, the discussion then extends to centralization versus decentralization. Inside Fidelity, the cloud model is slowly altering the degree of decentralization with a view to lowering costs without compromising the ability to innovate around business needs. The key insight here is reducing risk and cost while not inhibiting the innovation that can lead to top-line growth. Moving on to the more interesting aspects of cloud and business, the discussions revolve around business innovation, governance frameworks and balancing opportunity and risk. In areas like collaboration and social computing tools, determining what can be used internally – and, given regulatory standards, how such tools are used externally – requires very carefully considered solutions. Wilens points out that the standards that have evolved to give public clouds their economies of scale are now becoming available for private clouds as well.
The cloud can be a big platform for testing out and piloting new ideas that can then be scaled out and scaled up – this pilot footprint on the cloud should be actively pursued at all times. For example, Wilens believes that migration to mobile devices, and the related implications for the presentation layers of any technology infrastructure, will be implemented within cloud-based technologies, private or public. Similarly, co-opting startup partners to try out new operational and innovative models and to scale them up on the cloud infuses new dynamics into developing partnerships and new offerings. Fidelity has found that private cloud portals can give its clients access to financial information while still maintaining the on-premise, legacy, mainframe record-keeping systems. This reinforces the view that hybrid solutions leveraging the data of on-premise systems will soon become the norm.
The best practices discussed range from adopting de-facto cloud standards – for example, cloud infrastructure coalescing around the LAMP stack – onwards. Some other notable insights include:
- Creating shared services with a common platform, look and feel
- Using the cloud as a testing environment
- Partitioning of clouds – deciding where confidential and mission-critical data should be kept, on-premise or outside
- Deciding what volume of new workloads to push onto the cloud – particularly in long-standing industries like financial services, where a lot of data tends to sit on old but reliable platforms
Operating at both ends of the stack with a robust risk management plan and governance makes the cloud an indispensable framework for IT, innovation and business agility. I recommend reading this interview: it demonstrates that with a good strategy and well-laid-out execution of that strategy, even in a fast-moving but highly regulated industry with a lot of legacy systems in place, clouds can be successfully and progressively deployed, with demonstrable results in providing flexibility and making the business agile.