Automation Creates New Challenges in Cloud Operations

When technology advances, it creates enormous value and allows society to run much more efficiently. Automation is one such advance. From our daily lives to manufacturing, automation improves our productivity and lets us do more things in less time and more accurately. But it also has side effects. Automation magnifies both the positive and the negative aspects when it is hard, by design, for humans to intervene.

Consider the saw. While it takes a lot of time and effort to cut a log with a hand saw, the job feels much easier with a chainsaw. Why? Because the chainsaw uses an engine to efficiently transfer energy to a cutting chain that runs at a much higher speed. The automation in the chainsaw improves the productivity of anyone who needs to cut a log. But if you use it improperly or have an accident, the damage a chainsaw can cause is far greater than what a hand saw can do.

It is the same in IT operations. Automation makes IT operations more efficient, but mistakes made by humans and machines can easily cascade and do much more damage. The Amazon outage storm of 2011 is a perfect example. The automated script for EBS mirroring is an innocent process on its own, yet it acted as the catalyst for the outage storm. This is the nature of the cloud, which is built on top of massive automation.

How should you react to this? First, accept it. It has happened to many public cloud providers, and chances are it will happen in your private cloud environment. The important thing is to spot it quickly and stop the cascade before it causes major damage. The operations management tool you choose for your automated environment should give you this edge. Monitoring every resource supporting your cloud in the traditional way won't cut it; it simply produces too much data, to the extent that you cannot grasp what the data really mean. Look for a tool that gives you insight into your cloud environment without burying you, and your productivity, under a mass of meaningless data.

How to Prevent Cascading Errors From Causing an Outage Storm in Your Cloud Environment

Last week, we talked about how shared resource pools change the way IT operates the cloud environment. We discussed how to avoid false positives and save maintenance costs by measuring pool decay. Today, I am going to explain how you can avoid another major challenge in cloud operations – the outage storm.

An outage storm is typically caused by cascading errors and the lack of a mechanism to detect those errors. Chances are you are not unfamiliar with this issue. In April 2011, Amazon AWS experienced a week-long outage across many of its AWS service offerings. I examined this incident in the article – thunderstorm from Amazon reminded us the importance of weather forecast. In a nutshell, a human error diverted major network traffic to a low-bandwidth management channel. This flooded the communication between many EBS nodes. Because of a built-in automation process, these nodes started to unnecessarily replicate themselves and quickly consumed all the storage resources in the availability zone. Eventually it brought down not only EBS but all other services relying on it. Almost a year later, Microsoft Azure experienced a day-long outage. This time, a software glitch triggered unnecessary built-in automation processes and brought down the server nodes. You can see the similarity between these two incidents: an error occurred and unintentionally triggered automation processes that were built for a different purpose. The outage storm, without any warning, brings your cloud down.

So how can you detect and stop the cascading as soon as possible? Let's look at these two incidents. The environment seemed normal during the onset. The capacity in the pool looked good. I/O was normal. The services running from these pools were not impacted. Everything felt under control, since you were monitoring the availability of each of those resources. Suddenly, you started to notice a number of events showing up on your screen. While you were trying to make sense of these events, more and more came in, alerting you that many devices were no longer available. Before long, service desk tickets flooded in. Customers started to complain that a large number of their services were experiencing performance degradation. Everything happened so fast that you had no time to understand the root cause and make the necessary adjustments. Sound like a nightmare?

How can you prevent that from happening? My suggestion is that you need to do two things. First, you need to measure pool health. In particular, you need to monitor the distribution of health status across the pool's member resources. How many of them are in trouble? Do you see a trend in how the trouble propagates? What is the rate of that propagation? Case in point: the Azure incident could have lasted longer and impacted more customers if the Microsoft team hadn't implemented its "human investigate" threshold. But it still lasted more than 12 hours. The main reason was that these thresholds relied on availability monitoring through periodic pings, and it took three timeouts in a row to trigger the pool's threshold, which delayed the alert. So if you want to detect a storm at its onset, the second thing you need to do is detect abnormal behavior in the member resources, not just failed pings. Combining these two measurements, each device can reflect its abnormal health status and the pool can detect changes in the health distribution among its member resources. You, as an IT operations person, can set up rules to alert you when the health distribution crosses a critical threshold.
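
To make this concrete, here is a minimal sketch in Python of how such a pool-level check could work. The health states, thresholds, and field names are my own invented examples, not any particular product's API; the point is that the alert is driven by the distribution of member health (including behavior anomalies) and its rate of change, rather than by ping timeouts alone.

```python
from collections import Counter

# Hypothetical health states reported per member resource.
HEALTHY, DEGRADED, FAILED = "healthy", "degraded", "failed"

def pool_health_alert(members, prev_unhealthy_ratio,
                      ratio_threshold=0.10, growth_threshold=0.02):
    """Alert when the share of unhealthy members, or the rate at which that
    share grows between polling intervals, crosses a threshold.

    `members` is a list of dicts like:
        {"id": "node-42", "state": HEALTHY, "behavior_anomaly": False}
    where `behavior_anomaly` comes from a learned per-resource behavior
    baseline, not from a simple ping timeout.
    """
    state_counts = Counter(m["state"] for m in members)
    # A member counts as unhealthy if its state is bad OR its behavior
    # (I/O latency, replication rate, queue depth, ...) is abnormal,
    # even though it still answers pings.
    unhealthy = sum(1 for m in members
                    if m["state"] != HEALTHY or m["behavior_anomaly"])
    ratio = unhealthy / max(len(members), 1)
    growth = ratio - prev_unhealthy_ratio   # propagation rate per interval

    alert = ratio >= ratio_threshold or growth >= growth_threshold
    return alert, ratio, growth, state_counts

# Example: a 200-node pool where 8 nodes still look "up" but behave abnormally.
members = [{"id": f"node-{i}", "state": HEALTHY, "behavior_anomaly": i < 8}
           for i in range(200)]
print(pool_health_alert(members, prev_unhealthy_ratio=0.01))
```

The key design choice is that the rule fires on the distribution and its trend at the pool level, which is exactly what a flood of per-device availability alerts cannot tell you.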

How does this benefit you? First, you get an alert as soon as that threshold is crossed, even if the overall performance and capacity of the pool still look good. You then have enough time to respond, for example by diverting services to another pool or quarantining the troubled devices. In addition, you won't be swamped by massive alerts from each affected device, trying to guess which one to look at first. You can execute root cause analyses right from that alert, at the pool level.

The cloud is built with automation as the main mechanism to ensure its elasticity and agility. But occasionally, as in these two incidents, errors can amplify their damage very quickly by cascading through that automation. Because of this inherent nature, outage storms happen more often than you think. If you operate a cloud environment, chances are you will face one soon. You need a solution that can detect resource health by learning resource behavior and can measure changes in the distribution of those health statuses at the pool level. The shared pool changes how you operate your cloud environment. Operations solutions need to evolve to help you better measure pool decay and detect outage storms. Cloud-wash is not going to cut it.

To see how this works in the real world, visit booth 701 at this year's VMworld. You can see a demo there and get some ideas on how you would approach these problems. If you want to discuss this with me, please let the booth staff know.

Puzzle Pieces vs. LEGO Bricks: How Shared Resource Pools Changed Everything

Jun 23, 2012 is the 100th birthday of Alan Turing. Seventy-six years ago, Turing, just 24 years old, designed an imaginary machine to answer an important question: are all numbers computable? In doing so, he actually designed a simple but also the most powerful computing model known to computer scientists. To honor Turing, two scientists, Jeroen van den Bos and Davy Landman, constructed a working Turing machine. It is not the first time such a machine has been built. The interesting thing this time is that the machine was built entirely from a single LEGO Mindstorms NXT set.

The modern LEGO brick design was developed in 1958. It was a revolutionary concept. The first LEGO brick, built 54 years ago, still interlocks with bricks made today to construct toys and even the Turing machine. When you want to build a LEGO toy or machine, you don't need to worry about when and where the bricks were manufactured. You focus on the thing you are building, which standard shapes you need, and how many bricks of each. And you can get them in any LEGO store, no matter what you are building.

Sound familiar? This is very similar to how one builds a cloud service using resources in a shared fabric pool. You don't care which clusters or storage arrays host these resources. All you care about are the types (e.g. 4-CPU vs. 8-CPU VM) and service levels (e.g. platinum vs. gold) these resources need to support. Instead of treating each element device, such as compute hosts or storage arrays, as a key building block, IT now needs to focus on the logical layer that provides computing power to everything running inside the cloud – VMs, storage, databases, and application services. This new way of building services changes everything about how to measure, analyze, remediate and optimize the resources shared within the cloud's fabric pool.
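
As an illustration only (the service name, resource types, and tier names below are invented for the example), a request against a shared fabric pool can be expressed purely in terms of what is needed, never in terms of which host or array will supply it:

```python
# Hypothetical request for cloud resources: we name only the resource
# types and service levels we need, never the clusters or arrays that
# will actually host them. The pool's placement engine decides that.
service_request = {
    "service": "order-processing",
    "resources": [
        {"type": "vm",      "size": "8cpu-32gb", "count": 4, "tier": "platinum"},
        {"type": "storage", "size": "2tb",       "count": 1, "tier": "gold"},
    ],
}
```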

To understand why we need to shift our focus to pools and away from element devices, let's talk about another popular toy – the jigsaw puzzle. Last year, I bought a 3D earth jigsaw puzzle set for my son, who was three years old at the time. He was very excited: he had just taken a trip to Shanghai and was expecting a trip to Disney World, and he was eager to learn about all the places he had visited and would be visiting. So he and I (well, mostly I) built the earth from all those puzzle pieces. The final product was a great sphere constructed from 240 pieces. We enjoyed it for two weeks, until one of the pieces went missing. How can you blame a three-year-old boy who wanted to redo the whole thing by himself? Now here is the problem: unlike the two scientists who used LEGO bricks to build the Turing machine, I can't simply go to a store and buy that missing piece. I need to somehow find it or call the manufacturer to send me a replacement. In IT, this is called incident-based management. When all your applications are built on dedicated infrastructure devices, you can customize those devices, and the way they are put together, to the particular needs of each application. If one of those devices has an issue, it impacts the overall health of that application. So you file a ticket, and the operations team does triage, isolation, and remediation.

In a cloud environment with shared resource pools, things happen differently. Since the pool is built with standard blocks and is shared by applications, you have the ability, through the cloud management system, to set policies that move VMs or logical disks around if their underlying infrastructure blocks are hit by issues. So a small percentage of unhealthy infrastructure blocks doesn't necessarily need immediate triage and repair. If you monitor only the infrastructure blocks themselves, you will be overwhelmed by alerts that do not necessarily impact your cloud services. Responding to all these alerts immediately increases your maintenance costs without necessarily improving your service quality. Google did a study on the failure rate of their storage devices and found that the AFR (annual failure rate) of those devices is 8%. Assuming Google has 200,000 storage devices (in reality, it may have more than that), you will have a storage alert somewhere in your environment roughly every half hour. How expensive is it to keep a dedicated team doing triage and fixing those problems?
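
The arithmetic behind "roughly every half hour" is worth spelling out, using the assumed fleet size of 200,000 devices:

```python
devices = 200_000        # assumed fleet size for the example
afr = 0.08               # 8% annual failure rate from the study cited above

failures_per_year = devices * afr            # 16,000 failures per year
hours_per_year = 365 * 24                    # 8,760 hours
hours_between_failures = hours_per_year / failures_per_year

print(round(hours_between_failures, 2))      # ~0.55 hours, i.e. about one alert every half hour
```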

So how do we know when services hosted in a pool will be impacted? We give this problem a name – pool decay. You need to measure the decay state: the combination of the pool's own performance behavior and the distribution of unhealthy building blocks underneath it. In this way, you can tell how the pool, as a single unit, performs and how much ability it has left to provide computing power to the hosted services. When you go out to look for a solution that truly understands the cloud, check whether it can detect pool decay without giving you excessive false positives. Otherwise, you will just get a solution that is cloud-washing.
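
Here is a minimal sketch of what a decay measurement might look like, assuming we already track a performance baseline for the pool itself and a health flag per building block. The function name, thresholds, and labels are invented for the example, not a specific product's behavior:

```python
def pool_decay_state(pool_latency_ms, baseline_latency_ms,
                     block_health, unhealthy_limit=0.15, perf_limit=1.5):
    """Combine the pool's own performance behavior with the distribution
    of unhealthy building blocks beneath it.

    `block_health` is a list of booleans, True = healthy block.
    Returns one pool-level label instead of one alert per failed block.
    """
    unhealthy_ratio = block_health.count(False) / max(len(block_health), 1)
    perf_ratio = pool_latency_ms / max(baseline_latency_ms, 1e-9)

    if unhealthy_ratio >= unhealthy_limit and perf_ratio >= perf_limit:
        return "decayed"      # the pool can no longer absorb the failures
    if unhealthy_ratio >= unhealthy_limit or perf_ratio >= perf_limit:
        return "decaying"     # worth watching, not yet service-impacting
    return "healthy"          # a few bad blocks; the pool still absorbs them

# 6 bad blocks out of 100 and latency close to baseline: no ticket needed yet.
print(pool_decay_state(12.0, 10.0, [True] * 94 + [False] * 6))
```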

Back to my missing piece in the 3D jigsaw set: I finally found it under the sofa. But lesson learned, I now buy my boy LEGO sets instead.

Next week, we will examine how the resource pool, combined with automation, introduces another well-known challenge – the outage storm. Stay tuned.

Cloud Operations Management Solution Takes On Today’s IT Challenge

I haven't posted a blog since last VMworld. One reason is that I had two great vacations on beautiful Caribbean islands. But most of the time, I was working with a team of excellent, talented people to finish a project that allows IT to run its cloud environment with high confidence. Today, I am very proud to say: we did it.

I have talked about how cloud computing poses new challenges for IT operations and why proactive performance management is even more important now. Today, we launched the next generation of our cloud operations management solution, providing a set of capabilities to help IT take on those new challenges. These capabilities range from a cloud panorama showing the performance health of every shared fabric resource in your environment to automated workflows that optimize those resources for provisioned workloads.

Actionable Views

A cloud environment is complex. Not only do you have to manage the infrastructure resources, such as storage, network, and compute, but you also need to understand how they collectively power cloud services through shared pools. Many approaches you can find in the market today try to collect and show as much data as possible. We believe this is not efficient and actually prevents you from spotting the real issues and risks in your cloud environment. This new release gives IT operations and administration teams an actionable view – the cloud panorama. The cloud panorama not only summarizes the data, organized the way you see your cloud (e.g. pools, containers, etc.), but also allows you to act on what those data tell you.

High-precision Root Cause Analyses

The data is important, but the meaning of the data is even more important. What operations staff want to understand is what these data really mean for their day-to-day job. This is where analytics comes in. Analytics for performance and capacity data is not new. What is unique about the analytics enhanced in this new release is that, for the first time, an analytics engine can provide insight into how shared pools in the cloud power highly automated cloud services. Lack of this type of insight causes serious problems; think about last year's Amazon AWS outage and this year's Microsoft Azure disruption. In the coming blogs, I will explain why it matters to you and how you can execute high-precision root cause analyses to prevent this type of outage from happening in your cloud environment.

Intelligent Workflows

When end users ask for a new cloud service, such as new VMs, a new database instance, or new storage space, they get it almost instantly, because the provisioning of these services is automated. The challenge for cloud operators is ensuring these services run as expected from the get-go. Manually identifying, deploying and configuring monitoring agents for these services is not an option. In this new release, we enable you to automatically deploy and configure your monitoring agents during the provisioning of the service. By doing so, all your cloud services become instant-on and instant-assured. In addition, when a service is provisioned, the solution tells the provisioning engine how to optimize the workload, leveraging the workload patterns it has learned and the capacity supply it knows about. Finally, the solution analyzes the data it collects and provides showback reports.
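
The product details are beyond this post, but the general idea of wiring assurance into provisioning can be sketched generically. Everything below, the function names, the agent, the sample service, is hypothetical and only illustrates the flow, not a real API:

```python
# Hypothetical post-provisioning hook: as soon as the platform creates the
# VMs for a service, deploy and configure monitoring so the service is
# assured from the moment it is handed to the end user.

def deploy_monitoring_agent(address):
    print(f"deploying agent to {address}")          # placeholder for a real push

def configure_agent(address, service_name, pool):
    print(f"configuring {address}: service={service_name}, pool={pool}")

def on_service_provisioned(service):
    for vm in service["vms"]:
        deploy_monitoring_agent(vm["address"])
        configure_agent(vm["address"], service["name"], service["pool"])

# Called by the provisioning engine once the request completes.
on_service_provisioned({"name": "payroll-db", "pool": "gold-pool-1",
                        "vms": [{"address": "10.0.1.21"}, {"address": "10.0.1.22"}]})
```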

Cloud computing gives IT tremendous advantages in providing services to its end users, but it also creates new challenges that IT operations teams have to face. Over the past year, we at BMC worked very hard to understand those new needs. Today, we are excited to announce this new release of the cloud operations management solution. Through its actionable views, high-precision root cause analyses, and intelligent workflows, this release enables IT to confidently power the cloud, control the cloud, and take intelligent action to ensure high-quality service delivery. Take a look at the clickable demo my colleague Thad did and check out the product page, particularly that 2-minute explainer. We will get into more details in the coming weeks.

VMworld 2011 Day 3 Highlights

There was no keynote today at VMworld, but to me it was a pretty insightful day. It started with one great conversation, followed by two case study sessions. And I am going to make three predictions for next year's VMworld at the end of this blog.

One great conversation

During breakfast, I met a couple of guys who have been doing customer implementations for their entire careers. They gave me a few very insightful views. First, although there are thousands of metrics one can measure in an IT environment, only a few of them are critical, and those cover 80% of the whole IT picture. The problem is that most IT staff are overwhelmed by the number of metrics and gain a false sense of security because of it. A good performance tool should reduce the noise and give them a small set of the right KPIs that let IT truly understand what is going on, and what will be, so they can act.

The other lesson I learned from this conversation is that the line of business (LOB) needs to know these KPIs as well, so that, one, IT can establish service level agreements and, two, the LOB can understand and drive IT supply from business demand. A simple but rich, interactive but not overwhelming dashboard and report on both service and infrastructure performance would definitely help.

Two case study sessions

After that, I attended two case study sessions and heard real-world stories about how people build clouds and the kinds of challenges they face. Cloud infrastructure is complex. One principle these companies all agreed on is to keep it simple. That's why fabric infrastructure plays an important role in building the cloud. It also requires the management tool to be simple, which means hiding the complexity under the hood and giving only relevant, key information to the user.

Another principle mentioned was to make it scale. This means scaling not only internally – cost and infrastructure – but also externally, allowing cloud end users to scale their demand.

In addition, they all mentioned the "noisy neighbor" problem. VMware handles it within the VMware ESXi host. But what about higher abstraction levels, such as clusters or heterogeneous pools that can span multiple hypervisors, physical hosts, or locations?

Three Predictions, but wait…

Before I place my three bets, I want to mention that this is my third day in a row relying entirely on my iPad, without touching any PC or PC apps remotely, for both work and personal use. I am not a fan of using PC apps on mobile devices. PC apps are designed and optimized for the PC – large screen, high resolution, mouse, and keyboard. Do you see any of these on your tablet? Think about it: why did people develop GUI-based PC apps instead of simply porting the text-based apps from the "dumb" terminal era? The same reasoning applies here, where you have a totally new form factor (10-inch screen, lightweight, etc.) and a new way (the finger) for users to interact with their apps.

OK, enough of this; let's talk about my predictions for next VMworld.

Three Predictions

1. At least a third of IT management vendors will have a solution addressing cloud-specific challenges – shared pools, large scale, highly mixed and dynamic workloads, etc. Customers are already asking for it, and vendors will have had the cycles to produce solutions by this time next year.

2. Many customers, especially the early adopters, will have passed their day 1 – building the cloud, provisioning, etc. They will look into day 2 – operations, optimization, etc.

3. The temperature outside the VMworld building will be 30 degrees lower than this year. I am very confident this is a sure bet.

As for me, I have been chanting "cloud!" for a year since my last break. Now it is time for me to see real clouds. The difference is that, this time, there will be no internet, no phone, and no PC (yes, I will carry my iPad) for a couple of weeks. All I will see is ocean, sky, and, of course, clouds.

Bye Bye VMworld…

And Hello …


VMworld 2011 Day 2 Highlights

This blog was written over the course of the last 15 hours, capturing the live events of my day 2 at VMworld.

Today, the first event of the morning was Steve's keynote. At previous VMworlds, he has always played the role of showing products and exciting the audience. Today was no different. Using a whiteboard and post-its is a cool way to present a concept. He showed VMware's new desktop projects – ThinApp Factory and Horizon. It is a service catalog, but mainly for desktop apps. I am not sure how this is uniquely different from a universal service catalog that should cover all the services I want to use, desktop or not. He also showed Project Octopus. The concept is not new; enterprise content management has always tried to do this and failed. Project Octopus is essentially a provisioned Dropbox managed by IT. The live demo was pretty cool, with Vittorio, VP of end user computing product management, showing a day in the life using these products. As expected, Steve also showed the virtual phone. I think the problem it tries to solve is legitimate – using your personal smartphone for work – but it is a very difficult problem to solve. How do you handle the coexistence of contacts, email, Twitter accounts, blogs, and phone numbers from my work and personal lives?

Steve then shifted his focus to vSphere itself. That really is VMware's crown jewel. In addition to the monster VM, vSphere 5 largely focuses on storage. Paul and Steve both emphasized that it is a high-quality release, with 1 million development hours and 2 million QA hours spent on it. Steve also talked about the "noisy neighbor" problem and how vSphere 5 helps alleviate it. This is quite an important problem, particularly in the cloud's shared pools. I have talked about it from the operations perspective in an earlier blog. We will talk about it a lot more in the future.

To complete the story, he moved the topic to a higher layer – management. This is where VMware has made many acquisitions and focused over the last couple of years. He first spent several minutes talking about agentless discovery, and then moved on to vCloud Director and a little bit of vCenter Operations, with sneak peeks at their upcoming features.
Overall, a great presentation and great demos. He spent half of the time talking about end user computing, implying that virtualization (of your PC, servers, apps, data, and your phone) is the key to solving everything. Will that be true?

Today was the first time the Solutions Exchange hall was open for the full day. Many vendors demonstrated great infrastructure and management products. The virtualization market is very mature. Consumers are deploying VMs for mission-critical services. Vendors have been through a few innovation cycles and now produce many great products. In the cloud world, however, things are different. It is still emerging. Consumers, at this moment, are a little ahead of the curve, demanding solutions for the cloud management challenges they have just met. Most vendors, on the other hand, still don't have a real cloud management solution. That's why cloud washing is popular. I hope that by this time next year we will have a range of better solutions.

To end the day, how could I not show one of the purpose-built cloud solution booths? We want to see more like this next year.

VMworld 2011 Day 1 Highlights

The highlight of day 1 of VMworld 2011 was VMware CEO Paul Maritz's keynote. He talked through a brief history of IT and claimed that we are in the third revolution – the "Cloud Era" – following the mainframe era and the client/server era. He mentioned several interesting data points, for example that over 50% of workloads today run in virtualized environments. We don't know how this figure was measured, and many felt that we are not at that level yet. Nevertheless, it won't be controversial to say that a significant portion of today's workloads, and especially the majority of new workloads, run virtualized.

During the keynote, Paul announced several new versions of VMware products, including View 5.0 and vFabric Data Director. However, since VMware announced vSphere 5 and updates to several other products, including vCloud Director, a couple of months ago, there was no big new splash. Paul mentioned a new "vSphere Infrastructure and Operations Suite", which is essentially a new suite packaging of vSphere, vCD, vCenter Ops, etc. One thing worth noticing is that there was no new vCenter Ops announcement. Paul mentioned that VMware is taking the approach of releasing the suite as a whole, and casually mentioned that it could be a 5.1 suite release next year. During the speech, Paul revealed that VMware is working on a virtualized mobile phone. It is an interesting concept, but I did not fully comprehend why it is important and how it could be used in IT. We were promised a demo in tomorrow's keynote.

I also attended several sessions, mostly around operations. I felt that vendors are generally behind the curve. Cloud washing is still popular among vendors; "virtualization + automation" is still the message used to sell cloud IT solutions. But audiences keep asking questions about specific cloud challenges: How do I manage my shared pools? How do I deal with transient and mixed workloads? My guess is that vendors know those distinct cloud challenges; it is just that not many of them have produced a purpose-built solution yet. The customer, in this cycle of IT revolution, understands the value and meaning of the new era much faster than in any revolution we have seen before. This phenomenon is largely driven by massive knowledge sharing through web 2.0 and social media. For vendors, whoever can quickly build a truly cloud-focused solution could win the market.

Tomorrow will be a long day. I will check around to see who now has a solution that directly addresses cloud-specific problems (hint: think "shared pool").
