How to Prevent Cascading Errors From Causing an Outage Storm in Your Cloud Environment

Last week, we talked about how shared resource pools change the way IT operates the cloud environment. We discussed how to avoid false positives and save maintenance costs by measuring pool decay. Today, I am going to explain how you can avoid another major challenge in cloud operations: the outage storm.

An outage storm is typically caused by cascading errors and the lack of a mechanism to detect them. Chances are you are familiar with this issue. In April 2011, Amazon AWS experienced a week-long outage on many of its AWS service offerings. I examined this incident in the article “Thunderstorm from Amazon reminded us the importance of weather forecast”. In a nutshell, a human error diverted major network traffic to a low-bandwidth management channel. This flooded the communication between many EBS nodes. Because of the built-in automation process, these nodes started to replicate themselves unnecessarily and quickly consumed all the storage resources in the availability zone. Eventually this brought down not only EBS but all the other services relying on it. Almost a year later, Microsoft Azure experienced a day-long outage. This time, a software glitch triggered unnecessary built-in automation processes and brought down the server nodes. You can see the similarity between these two incidents. An error happened and unintentionally triggered automation processes that were built for a different purpose. The outage storm, without any warning, brings your cloud down.

So how can you detect and stop the cascade as soon as possible? Let’s look at these two incidents. The environment seemed normal at the onset. The capacity in the pool seemed good. I/O was normal. The services running from these pools were not impacted. You felt everything was under control since you were monitoring the availability of each of those resources. Suddenly, you started to notice a number of events showing up on your screen. While you were trying to make sense of these events, more and more of them came in, alerting you that many devices were no longer available. Before long, service help desk tickets flooded in. Customers started to complain that a large number of their services were experiencing performance degradation. Everything happened so fast that you had no time to understand the root cause and make the necessary adjustments. Sounds like a nightmare?

How can one prevent that from happening? My suggestion is that you need to do two things. One, you need to measure the pool health. In particular, you need to monitor the distribution of health status across its member resources. How many of them are in trouble? Do you see any trend in how the trouble propagates? What is the rate of this propagation? Case in point: the Azure incident could have lasted longer and impacted more customers if the Microsoft team hadn’t implemented its “human investigate” threshold. Even so, it lasted more than 12 hours. The main reason was that these thresholds relied on availability monitoring through periodic pings, and it took three timeouts in a row to trigger the threshold of the pool, which delayed the alert. So if you want to detect a storm at the onset, the second thing you need to do is detect abnormal behavior in the member resources, not just failed pings. When you combine these two measurements, each device reports its abnormal health status, and the pool detects changes in the health distribution among its member resources. You, as an IT operations person, can set up rules that alert you when the health distribution crosses a critical threshold.
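
To make the two measurements a bit more concrete, here is a minimal Python sketch of how they could fit together. It is only an illustration, not how any particular product works: the function names, the z-score test used to flag abnormal behavior, and the threshold values (20% of members abnormal, or a spread of 0.001 per second) are all my own assumptions.

```python
from dataclasses import dataclass
from statistics import mean, stdev


def is_abnormal(history, latest, z_threshold=3.0):
    """Flag a reading that deviates strongly from the device's own baseline.

    A simple z-score test stands in for "learning the behavior" here;
    the threshold of 3.0 is an assumed value.
    """
    if len(history) < 2:
        return False  # not enough data to learn a baseline
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > z_threshold


@dataclass
class PoolSnapshot:
    timestamp: float          # seconds
    abnormal_fraction: float  # share of pool members flagged abnormal


def snapshot(timestamp, abnormal_flags):
    """Summarize the health distribution of a pool at one point in time."""
    flags = list(abnormal_flags)
    fraction = sum(flags) / len(flags) if flags else 0.0
    return PoolSnapshot(timestamp, fraction)


def propagation_rate(previous, current):
    """Change in abnormal fraction per second between two snapshots."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        return 0.0
    return (current.abnormal_fraction - previous.abnormal_fraction) / elapsed


def should_alert(previous, current, max_fraction=0.2, max_rate=0.001):
    """Alert when too many members are abnormal or trouble spreads too fast.

    Both limits are hypothetical and would need tuning per pool.
    """
    return (current.abnormal_fraction >= max_fraction
            or propagation_rate(previous, current) >= max_rate)


# Example: ten devices, each compared against a learned latency baseline (ms).
baseline = [10.0, 11.0, 9.0, 10.0, 10.0]
readings = [10.5, 48.0, 9.8, 52.0, 10.1, 10.2, 9.9, 10.0, 10.3, 10.1]
flags = [is_abnormal(baseline, r) for r in readings]

before = snapshot(0.0, [False] * len(readings))
after = snapshot(60.0, flags)
print(after.abnormal_fraction, should_alert(before, after))
```

In this example every device still answers pings, so an availability check alone would see nothing wrong. The pool-level rule fires because two of the ten members behave abnormally and the trouble appeared within one minute.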

How does this benefit you? First, you get an alert as soon as that threshold is crossed, even if the overall performance and capacity of the pool seem good. You then have enough time to respond, for example by diverting services to another pool or quarantining the troubled devices. In addition, you won’t be swamped by massive numbers of alerts from each affected device and left guessing which one to look at first. You can run root cause analysis right from that alert at the pool level.

The cloud is built with automation as the main mechanism to ensure its elasticity and agility. But occasionally, as in these two incidents, errors can amplify their damage by cascading very quickly through that automation. Because this is inherent in how the cloud works, outage storms are more common than you think. If you operate a cloud environment, chances are you will face one pretty soon. You need a solution that can detect resource health by learning each resource’s behavior and can measure the change in the distribution of health status at the pool level. The shared pool changes how you operate your cloud environment. Operations solutions need to evolve to help you better measure pool decay and detect outage storms. Cloud washing is not going to cut it.

To see how it works in the real world, you can visit booth 701 at this year’s VMworld. You can see a demo there and get some ideas about how you would approach these problems. If you want to discuss this with me, please let the booth staff know.

VMworld 2011 Day 3 Highlights

There was no keynote today at VMworld. But to me, it was a pretty insightful day. It started with one great conversation, followed by two case study sessions. And I am going to make three predictions for next year’s VMworld at the end of this blog.

One great conversation

During breakfast, I met a couple of guys who have been doing customer implementations for their entire careers. They gave me a few very insightful views. First, although there are thousands of metrics that one can measure in an IT environment, only a few of them are critical, and those cover 80% of the whole IT picture. The problem is that most IT staff are overwhelmed by the number of metrics and have a false sense of security because of that. A good performance tool should reduce the noise and give them a small set of the right KPIs that let IT truly understand what is going on and what will happen, so they can act.

The other lesson I learned from this conversation is that the line of business (LOB) needs to know these KPIs as well, so that, one, IT can establish service level agreements and, two, the LOB can understand and drive IT supply by business demand. A dashboard and report on both service and infrastructure performance that is simple but rich, interactive but not overwhelming, will definitely help.

Two case study sessions

After that I attended two case study sessions and heard real-world stories about how people build a cloud and what kinds of challenges they are facing. The cloud infrastructure is complex. One principle these companies all agreed on is to keep it simple. That’s why fabric infrastructure is playing an important role in building the cloud. It also requires the management tool to be simple, which means hiding the complexity under the hood and giving only relevant, key information to the user.

Another principle mentioned is to make it scale. This means to scale not only internally – cost and infrastructure – but also externally – allowing cloud end users to scale their demand.

In addition, they all mentioned the “noisy neighbor” problem. VMware is handling it within the VMware ESXi host. But what about higher levels of abstraction, such as clusters or heterogeneous pools, which can span multiple hypervisors, physical machines, or locations?

Three Predictions, but wait…

Before I place my three bets, I want to mention that this is my third day in a row relying entirely on my iPad, for both work and personal use, without touching a PC or remotely using any PC apps. I am not a fan of using PC apps on mobile devices. PC apps are designed and optimized for the PC: large screen, high resolution, mouse, and keyboard. Do you see any of these on your tablet? Think about it: why did people develop GUI-based PC apps instead of simply porting text-based apps from the “dumb” terminal era? The same reasoning applies here, when you have a totally new form factor (10-inch screen, lightweight, etc.) and a new way (the finger) for users to interact with their apps.

OK, enough of this. Let’s talk about my predictions for the next VMworld.

Three Predictions

1. At least 1/3 of IT management vendors will have a solution for cloud-specific challenges: shared pools, large scale, highly mixed and dynamic workloads, etc. Customers are already asking for it, and vendors will have had a cycle to produce the solution by this time next year.

2. Many customers, especially the early adopters, will have passed their day 1 – building the cloud, provisioning, etc. They will look into day 2 – operations, optimization, etc.

3. The temperature outside the VMworld building will be 30 degrees lower than this year. I am very confident this is a sure bet.

As for me, I have been chanting “cloud!” for a year since my last break. Now it is time for me to see the real cloud. The difference is that, this time, there will be no internet, no phone, and no PC (yes, I will carry my iPad) for a couple of weeks. All I will see is ocean, sky, and, of course, cloud.

Bye Bye VMworld…

And Hello …

Photo used under Creative Commons from tata_aka_T. I will replace it with my own photo when I come back.

VMworld 2011 Day 2 Highlights

This log was written over the course of the last 15 hours, capturing the live events of my day 2 at VMworld.

Today, the first event of the morning was Steve’s keynote. At previous VMworlds, he has always played the role of showing products and exciting the audience. Today was no different. Using a whiteboard and Post-it notes was a cool way to show the concepts. He showed VMware’s new desktop projects, ThinApp Factory and Horizon. It is a service catalog, but mainly for desktop apps. I am not sure how this is different from a universal service catalog that should cover all the services I want to use, desktop or not. He also showed Project Octopus. The concept is not new. Enterprise content management has always tried to do this but failed. Project Octopus is essentially a provisioned Dropbox managed by IT. The live demo was pretty cool: Vittorio, VP of end user computing product management, showed a day in the life using these products. As expected, Steve also showed the virtual phone. I think the problem it tries to solve is legitimate: using your personal smartphone for work. But it is a very difficult problem to solve. How do you handle the coexistence of contacts, email, Twitter accounts, blogs, and phone numbers from both work and personal life?

Steve then shifted his focus to vSphere itself. That really is VMware’s crown jewel. In addition to the monster VM, vSphere 5 largely focuses on storage. Paul and Steve both emphasized that it is a high-quality release, with 1 million development hours and 2 million QA hours spent on it. Steve also talked about the “noisy neighbor” problem and how vSphere 5 helps alleviate it. This is quite an important problem, particularly in the cloud shared pool. I have talked about it from an operations perspective in an earlier blog. We will talk about it a lot more in the future.

To complete the story, he moved to a higher layer: management. This is where VMware has made many acquisitions and has focused over the last couple of years. He first spent several minutes talking about agentless discovery. Then he moved on to vCloud Director and a little bit of vCenter Operations, with sneak peeks of their upcoming features.

Overall, a great presentation and a great demo. He spent half of the time talking about end user computing and implied that virtualization (of your PC, server, apps, data, and phone) is the key to solving everything. Will that be true?

Today was the first time the solution exchange hall was open for a full day. Many vendors demonstrated great infrastructure and management products. The virtualization market is very mature. Consumers are deploying VMs in mission-critical services. Vendors have gone through a few innovation cycles and now produce many great products. However, in the cloud world, things are different. It is still emerging. Consumers, at this moment, are a little bit ahead of the curve in demanding solutions for their newly met cloud management challenges. Most vendors, on the other hand, still don’t have a real cloud management solution. That’s why cloud washing is popular. I hope that by this time next year we will have a range of better solutions.

To end the day, how could I not show one of the purpose-built cloud solution booths? We want to see more like this next year.

VMworld 2011 Day 1 Highlights

The highlight of day 1 of VMworld 2011 was the keynote by VMware’s CEO, Paul Maritz. He gave a brief history of IT and claimed that we are in the third revolution, the “Cloud Era”, following the mainframe era and the client/server era. He mentioned several interesting data points. For example, over 50% of workloads today are running in virtualized environments. We don’t know how this figure was measured. Many felt that we are not at that level yet. Nevertheless, it is not controversial to say that a significant portion of today’s workloads, and especially the majority of new workloads, are running virtualized.

During the keynote, Paul announced new versions of several VMware products, including View 5.0 and vFabric Data Director. However, since VMware announced vSphere 5 and updates to several other products, including vCloud Director, a couple of months ago, there was no big new splash. Paul mentioned a new “vSphere Infrastructure and Operations Suite”, which is essentially a new suite packaging of vSphere, vCD, vCenter Ops, etc. One thing worth noticing is that there was no new vCenter Ops announcement. Paul mentioned that VMware is taking the approach of releasing the suite as a whole. He casually mentioned that there could be a 5.1 suite release next year. During the speech, Paul revealed that VMware is working on a virtualized mobile phone. It is an interesting concept, but I did not fully comprehend why it is important and how it could be used in IT. We were promised a demo in tomorrow’s keynote.

I also attended several sessions, mostly around operations. I felt that vendors are generally behind the curve. Cloud washing is still popular among vendors. “Virtualization + Automation” is still the message used to sell cloud IT solutions. But the audience keeps asking about specific cloud challenges. How do I manage my shared pools? How do I deal with transient and mixed workloads? My guess is that vendors know about those distinct cloud challenges; it is just that not many vendors have produced a purpose-built solution yet. The customer, in this cycle of IT revolution, understands the value and meaning of the new era much faster than in any revolution we have seen before. This is largely driven by the massive knowledge sharing through web 2.0 and social media. For vendors, whoever can quickly build a truly cloud-focused solution could win the market.

Tomorrow will be a long day. I will look around to see who has a solution now that directly addresses cloud-specific problems (hint: think “shared pool”).

What Can We Expect from VMworld 2011?

This is the first time that I have actually written a blog during a flight. I am en route to Sin City. But I am not in the mood to give my money to the house (I am sure you know that not many games in Vegas give you the edge). I am heading to VMworld.

Looking out of the window, all I can see is clear sky and desert. No cloud. But I can hear people chanting “cloud! cloud!” on the horizon. I am sure cloud will be a big theme at this VMworld. But we have talked about cloud for a while. What will be new this time?

At last year’s VMworld and Cloud Expo, my general sense was that people had started to understand the concept, but few had implemented it. In the last several months, I have felt the momentum as I got many inquiries from our customers with very specific questions on building and running the cloud. Thanks to social media, the education cycle of any new technology phenomenon has become shorter than ever. Customers have become much more intelligent, even more so than many vendors. Cloud washing no longer works.

So what can we expect from VMworld? Any new announcements from VMware? Is the industry still focusing on cloud provisioning and orchestration? How will cloud customers think about or practice operations management? Are any other new trends emerging? We will find out this week.
