Automation Creates New Challenges in Cloud Operations

When technology advances, it creates enormous value and lets society run much more efficiently. Automation is one such advance. From daily life to manufacturing, automation improves our productivity and allows us to do more things in less time and more accurately. But it also has a side effect. Automation magnifies both the positive and the negative when it is hard, by design, for humans to intervene.

Consider the saw. While it takes a lot of time and effort to cut a log with a hand saw, the job is much easier with a chainsaw. Why? Because the chainsaw’s engine efficiently transfers energy to the cutting chain, which runs at a much higher speed. The automation in the chainsaw improves the productivity of anyone who needs to cut a log. But if you use it improperly or have an accident, the damage a chainsaw can cause is much greater than anything a hand saw can do.

The same is true in IT operations. Automation makes IT operations more efficient, but mistakes made by humans and machines can easily cascade and do much more damage. The Amazon outage storm of 2011 is a perfect example. The automated script for EBS mirroring is an innocent process in itself, but it acted as the catalyst for the outage storm. This is the nature of the cloud, which is built on top of massive automation.

How should you react? First, accept it. It has happened at many public cloud providers, and chances are it will happen in your private cloud environment too. The important thing is to spot it quickly and stop the cascade before it causes major damage. The operations management tool you choose for your automated environment should give you this edge. The traditional approach of monitoring every resource that supports your cloud won’t cut it. It produces so much data that you won’t be able to grasp what the data really means. Look for a tool that gives you insight into your cloud environment without burying you and your productivity under masses of data.

How to Prevent Cascading Errors From Causing an Outage Storm in Your Cloud Environment

Last week, we talked about how shared resource pools change the way IT operates the cloud environment. We discussed how to avoid false positives and save maintenance costs by measuring pool decay. Today, I am going to explain how you can avoid another major challenge in cloud operations: the outage storm.

An outage storm is typically caused by cascading errors and the lack of a mechanism to detect them. Chances are you are familiar with this issue. In April 2011, Amazon AWS experienced a week-long outage across many of its service offerings. I examined this incident in the article “Thunderstorm from Amazon reminded us the importance of weather forecast”. In a nutshell, a human error diverted major network traffic to a low-bandwidth management channel. This flooded the communication between many EBS nodes. Because of the built-in automation, these nodes started to replicate themselves unnecessarily and quickly consumed all the storage resources in the availability zone. Eventually this brought down not only EBS but all the other services relying on it. Almost a year later, Microsoft Azure experienced a day-long outage. This time, a software glitch triggered unnecessary built-in automation and brought down the server nodes. You can see the similarity between these two incidents. An error occurred and unintentionally triggered automation processes that were built for a different purpose. The outage storm, without any warning, brings your cloud down.

So how can you detect and stop the cascade as soon as possible? Let’s look at these two incidents. The environment seemed normal at the onset. The capacity in the pool looked good. I/O was normal. The services running on these pools were not impacted. You felt everything was under control, since you were monitoring the availability of each of those resources. Suddenly, you started to notice a number of events showing up on your screen. While you were trying to make sense of them, more and more events came in, alerting you that many devices were no longer available. Before long, help desk tickets flooded in. Customers started to complain that a large number of their services were experiencing performance degradation. Everything happened so fast that you had no time to understand the root cause and make the necessary adjustments. Sound like a nightmare?

How can you prevent that from happening? My suggestion is that you need to do two things. First, you need to measure the pool health. In this case in particular, you need to monitor the distribution of health status across its member resources. How many of them are in trouble? Do you see any trend in how the trouble propagates? What is the rate of propagation? Case in point: the Azure incident could have lasted longer and impacted more customers if the Microsoft team hadn’t implemented its “human investigate” threshold. Still, it lasted more than 12 hours. The main reason was that these thresholds relied on availability monitoring through periodic pings, and it took three timeouts in a row to trigger the pool’s threshold, which delayed the alert. So if you want to detect a storm at its onset, the second thing you need to do is detect abnormal behavior in the member resources, not just failed pings. With these two measurements combined, each device can report its abnormal health status, and the pool can detect changes in the health distribution among its member resources. You, as an IT operations person, can set up rules that alert you when the health distribution crosses a critical threshold.
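As a rough illustration of the pool-level rule described above, here is a minimal Python sketch. The `PoolSnapshot` shape, the function names, and both threshold values are assumptions made up for this example; they are not taken from any product or from the Amazon or Azure systems.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PoolSnapshot:
    """Health of every member resource at one point in time (hypothetical shape)."""
    timestamp: float          # seconds
    health: Dict[str, bool]   # member id -> True if the member looks healthy


def unhealthy_fraction(snapshot: PoolSnapshot) -> float:
    """Share of pool members currently reported as unhealthy."""
    if not snapshot.health:
        return 0.0
    unhealthy = sum(1 for ok in snapshot.health.values() if not ok)
    return unhealthy / len(snapshot.health)


def propagation_rate(previous: PoolSnapshot, current: PoolSnapshot) -> float:
    """Change in the unhealthy fraction per minute between two snapshots."""
    elapsed_minutes = (current.timestamp - previous.timestamp) / 60.0
    if elapsed_minutes <= 0:
        raise ValueError("snapshots must be in chronological order")
    delta = unhealthy_fraction(current) - unhealthy_fraction(previous)
    return delta / elapsed_minutes


def should_alert(previous: PoolSnapshot, current: PoolSnapshot,
                 max_fraction: float = 0.10,
                 max_rate_per_minute: float = 0.02) -> bool:
    """Alert if too many members are unhealthy or trouble is spreading too fast.

    Both thresholds are illustrative defaults, not recommended values.
    """
    return (unhealthy_fraction(current) >= max_fraction
            or propagation_rate(previous, current) >= max_rate_per_minute)
```

The point of checking the rate as well as the level is that a pool with only 5% unhealthy members can still deserve attention if it went from 0% to 5% within a minute.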

How does this benefit you? First, you get an alert as soon as that threshold is crossed, even if the overall performance and capacity of the pool still look good. You then have enough time to respond, for example by diverting services to another pool or quarantining the troubled devices. In addition, you won’t be swamped by massive numbers of alerts from each affected device, trying to guess which one to look at first. You can run root cause analysis right from that alert at the pool level.

The cloud is built with automation as the main mechanism for ensuring its elasticity and agility. But occasionally, as in these two incidents, errors cascade through that automation and amplify their damage very quickly. Because this is inherent in how the cloud works, outage storms are more common than you think. If you operate a cloud environment, chances are you will face one soon. You need a solution that can detect resource health by learning each resource’s behavior and can measure how the distribution of health status changes at the pool level. The shared pool changes how you operate your cloud environment. Operations solutions need to evolve to help you better measure pool decay and detect outage storms. Cloud-washing is not going to cut it.

To see how it works in the real world, you can visit booth 701 at this year’s VMworld. You can see a demo there and get some ideas on how to approach these problems. If you want to discuss this with me, please let the booth staff know.

Puzzle Pieces vs. LEGO Bricks: How Shared Resource Pools Changed Everything

June 23, 2012 is the 100th birthday of Alan Turing. Seventy-six years ago, Turing, just 24 years old, designed an imaginary machine to answer an important question: are all numbers computable? In doing so, he designed a simple model that is nonetheless the most powerful computing model known to computer scientists. To honor Turing, two scientists, Jeroen van den Bos and Davy Landman, constructed a working Turing machine. It is not the first time such a machine has been built. What is interesting this time is that the machine was built entirely from a single LEGO Mindstorms NXT set.

The modern LEGO brick design was developed in 1958. It was a revolutionary concept. The first LEGO bricks made 54 years ago still interlock with those made today to build toys and even a Turing machine. When you want to build a LEGO toy or machine, you don’t need to worry about when and where the bricks were manufactured. You focus on the thing you are building, which standard shapes you need, and how many bricks of each. And you can get them in any LEGO store, no matter what you are building.

Sound familiar? This is very similar to how you build a cloud service using resources in a shared fabric pool. You don’t care which clusters or storage arrays host these resources. All you care about is the types (e.g. 4-CPU vs. 8-CPU VMs) and the service levels (e.g. platinum vs. gold) these resources need to support. Instead of treating individual element devices, such as compute hosts or storage arrays, as the key building blocks, IT now needs to focus on the logical layer that provides computing power to everything running inside the cloud: VMs, storage, databases, and application services. This new way of building services changes everything about how to measure, analyze, remediate, and optimize the resources shared within the fabric pool in the cloud.

To understand why we need to shift our focus to pools and away from element devices, let’s talk about another popular toy: the jigsaw puzzle. Last year, I bought a 3D earth jigsaw puzzle for my son, who was 3 years old at the time. He was very excited, as he had just taken a trip to Shanghai and was expecting a trip to Disney World. He was eager to learn about all the places he had been and would be visiting. So he and I (well, mostly I) built the earth using all those puzzle pieces. The final product was a great sphere made of 240 pieces. We enjoyed it for 2 weeks until one of the pieces went missing. How can you blame a 3-year-old boy who wanted to redo the whole thing by himself? Now here is the problem: unlike those two scientists who used LEGO bricks to build the Turing machine, I can’t easily go to a store and buy that missing piece. I need to somehow find it or call the manufacturer to send me a replacement. In IT, this is called incident-based management. When all your applications are built on dedicated infrastructure devices, you can customize those devices, and the way they are put together, to fit the particular needs of each application. If one of those devices has an issue, it impacts the overall health of that application. So you file a ticket, and the operations team does triage, isolation, and remediation.

In a cloud environment with shared resource pools, things work differently. Since the pool is now built with standard blocks and is shared by applications, you have the ability, through the cloud management system, to set policies that move VMs or logical disks around if their underlying infrastructure blocks run into issues. So a small percentage of unhealthy infrastructure blocks doesn’t necessarily need immediate triage and repair. If you monitor only the infrastructure blocks themselves, you will be overwhelmed by alerts that do not necessarily impact your cloud services. Responding to all these alerts immediately increases your maintenance costs without necessarily improving your service quality. Google did a study on the failure rate of its storage devices and found that the AFR (annual failure rate) of those devices is 8%. Assuming Google has 200,000 storage devices (in reality, it may have more than that), you will have a storage alert somewhere in your environment roughly every half hour. How expensive is it to have a dedicated team constantly triaging and fixing those problems?
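The “every half hour” figure follows directly from the two numbers above. Here is the arithmetic as a small Python sketch; it assumes failures are spread evenly over the year, which is a simplification, and the function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def minutes_between_failures(device_count: int, annual_failure_rate: float) -> float:
    """Average gap between device failures, assuming they are evenly spread."""
    failures_per_year = device_count * annual_failure_rate
    return HOURS_PER_YEAR * 60 / failures_per_year


# 200,000 devices at an 8% AFR is about 16,000 failures a year,
# or one failure roughly every 33 minutes.
interval = minutes_between_failures(200_000, 0.08)
print(f"about {interval:.0f} minutes between failures")
```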

So how do we know when the services hosted in a pool will be impacted? We give this problem a name: pool decay. You need to measure the decay state, which is the combination of the performance behavior of the pool itself and the distribution of unhealthy building blocks underneath it. This way, you can tell how the pool performs as a single unit and how much ability it has to provide computing power to the hosted services. When you go looking for a solution that truly understands the cloud, you need to check whether it can detect pool decay without giving you excessive false positives. Otherwise, you will just get a solution that is cloud-washing.
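To make the idea of a decay state a little more concrete, here is one hypothetical way to combine the two signals in Python. The state names, the `tolerated_fraction` default, and the rule itself are assumptions for illustration only, not a description of how any particular product computes pool decay.

```python
from enum import Enum


class DecayState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DECAYED = "decayed"


def pool_decay_state(unhealthy_fraction: float,
                     performance_abnormal: bool,
                     tolerated_fraction: float = 0.05) -> DecayState:
    """Classify a pool from two signals (illustrative rule only).

    unhealthy_fraction:   share of underlying blocks that are unhealthy
    performance_abnormal: True if the pool as a whole deviates from its
                          normal performance behavior
    tolerated_fraction:   share of unhealthy blocks the pool can absorb
                          without anyone needing to react (assumed value)
    """
    blocks_over_limit = unhealthy_fraction > tolerated_fraction
    if blocks_over_limit and performance_abnormal:
        return DecayState.DECAYED
    if blocks_over_limit or performance_abnormal:
        return DecayState.DEGRADED
    return DecayState.HEALTHY
```

With a rule like this, a handful of failed disks in a large pool that still performs normally stays `HEALTHY` and raises no alert, which is the false positive the paragraph above wants to avoid.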

Back to the missing piece of my 3D jigsaw puzzle: I finally found it under the sofa. But lesson learned. I now buy my boy LEGO sets instead.

Next week, we will examine how resource pools combined with automation introduce another well-known challenge: the outage storm. Stay tuned.

Cloud Operations Management Solution Takes On Today’s IT Challenge

I haven’t posted a blog since last VMworld. One reason is that I had two great vacations on beautiful Caribbean islands. But most of the time, I was working with a team of excellent, talented people to finish a project that allows IT to run its cloud environment with high confidence. Today, I am very proud to say: we did it.

I have talked about how cloud computing poses new challenges in IT operations and why proactive performance management is even more important now. Today, we launched our next-generation cloud operations management solution, which provides a set of capabilities to help IT take on those new challenges. These capabilities range from a cloud panorama showing performance health for every shared fabric resource in your environment to automated workflows that allow those resources to be optimized for provisioned workloads.

Actionable Views

A cloud environment is complex. Not only do you have to manage the infrastructure resources, such as storage, network, and compute, but you also need to understand how they collectively power cloud services through shared pools. Many approaches on the market today try to collect and show as much data as possible. We believe this is inefficient and actually prevents you from spotting the real issues and risks in your cloud environment. This new release gives IT operations and administration teams an actionable view: the cloud panorama. The cloud panorama not only summarizes the data, organized the way you see the cloud (e.g. pools, containers, etc.), but also allows you to act on what you learn from that data.

High-precision Root Cause Analyses

Data is important. But the meaning of the data is even more important. What operations staff want to understand is what the data really means for their day-to-day job. This is where analytics comes in. Analytics for performance and capacity data is nothing new. What is unique about the analytics in this new release is that, for the first time, an analytics engine can provide insight into how shared pools in the cloud power highly automated cloud services. Lacking this type of insight causes serious problems. Think about last year’s Amazon AWS outage and this year’s Microsoft Azure disruption. In the coming blogs, I will explain why it matters to you and how you can run high-precision root cause analyses to prevent this type of outage from happening in your cloud environment.

Intelligent Workflows

When end users ask for a new cloud service, such as new VMs, a new database instance, or new storage space, they get it almost instantly. This is because the provisioning of these services is automated. The challenge for cloud operators is how to ensure these services run as expected from the get-go. Manually identifying, deploying, and configuring monitoring agents for these services is not an option. In this new release, we enable you to automatically deploy and configure your monitoring agents while the service is being provisioned. As a result, all your cloud services will be instant-on and instant-assured. In addition, when a service is provisioned, the solution tells the provisioning engine how to optimize the workload, leveraging the workload pattern it has learned and the capacity supply it knows about. Finally, the solution analyzes the data it collects and provides showback reports.

Cloud computing gives IT tremendous advantages in delivering services to its end users. But it also creates new challenges that IT operations teams have to face. Over the past year, we at BMC worked very hard to understand those new needs. Today, we are excited to announce this new release of our cloud operations management solution. Through its actionable views, high-precision root cause analyses, and intelligent workflows, this release enables IT to confidently power the cloud, control the cloud, and take intelligent action to ensure high-quality service delivery. Take a look at the clickable demo my colleague Thad put together and browse the product page, particularly the 2-minute explainer. We will get into more details in the coming weeks.

Proactive Performance Management in the Cloud

Today, BMC launched cloud operations capabilities to safeguard cloud users. As part of that, we released the latest version of our proactive performance management product. This release includes many new things, but the one I am really excited about is its focus on cloud operations.

Photo used under Creative Commons from danshouse

Behavior learning in the cloud

The cloud, compared to traditional IT infrastructure, has unique characteristics that require a different approach to managing day-to-day operations. Its mixed workloads, elastic nature, and service-centric principles mean that no static, reactive, and disparate operations solution will be able to generate enough power to propel the new cloud engine. That’s why I am excited to see that we are focusing on tuning the analytics engine to better understand this new set of behaviors of cloud services and resources. For example, the engine can now support hundreds of VMs provisioned per hour and readjust its behavior learning within minutes. This release is just the first step in that direction. But the team at BMC worked really hard to learn from customers and the market, and assessed our current knowledge, which is based on many years of successfully applying the behavior learning engine in virtualized environments. There is much we can leverage and some we can’t. But that’s the point. The cloud is different from anything we have seen so far, virtualization or not. We have learned from customers that behavior learning capabilities are generating more and more value in a cloud environment where dynamics rule the world, from the IT process, to the service, to the infrastructure resources.
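For readers unfamiliar with the idea, behavior learning in its simplest form means deriving a baseline from a metric’s recent history and flagging values that stray too far from it. The Python sketch below shows only that basic idea; it is not the algorithm used by the BMC engine, and the function name and the three-sigma default are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_abnormal(history: Sequence[float], value: float, sigmas: float = 3.0) -> bool:
    """Flag a value that falls outside the band learned from recent history."""
    if len(history) < 2:
        return False  # not enough data to learn a baseline yet
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) > sigmas * spread


# CPU utilization (%) of a VM over the last few samples
recent = [50, 52, 48, 51, 49]
print(is_abnormal(recent, 51))  # within the learned band
print(is_abnormal(recent, 90))  # far outside the learned band
```

Unlike a static threshold such as “alert above 80%”, the band here moves with the workload, which matters when VMs are provisioned and resized constantly.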

Get your value fast

The cloud market is developing rapidly. Our customers who want to compete in this space need to make their offerings available fast. So getting the operations solution up and running quickly is one of the focuses of this release. There is now a guided wizard that lets you plan, install, and configure all the necessary cloud management pieces, including lifecycle management and proactive performance management. One of our customers used to take 2 weeks to get the whole cloud management solution up and running in a small environment. Now they have done it in just 2 days, with an even more robust solution.

Scalability for cloud deployment

In this release, we also address the difference in scale between a typical service provider cloud environment and an enterprise data center. The solution is now able to serve performance data from 50,000 cloud devices to hundreds of concurrent users. This means that not only can administrators see and act upon the data, but cloud end users can get that data too, just like what you get from Amazon CloudWatch.

Public cloud monitoring for enterprise

Speaking of Amazon CloudWatch, many of our customers who have deployed their services into EC2 monitor that data constantly. What they couldn’t do is translate the data into actionable insight automatically. Now, we are providing out-of-the-box capabilities (a “knowledge module”, if you are familiar with our product) that allow enterprises to pull in the performance data from CloudWatch for their instances and feed it into our behavior learning engine. You can even build a service that spans both your private cloud and Amazon EC2 and use our solution to measure its availability, business impact, and workload by leveraging both our remote and in-guest monitoring capabilities on those public cloud instances. In addition, we let you monitor Microsoft Azure remotely if you are building applications in its PaaS environment.

Of course, this is just a subset of the new features we put into this release. We will start sharing more information in the coming weeks. I will be at VMworld next month, where you can meet us and see our demo at the BMC booth. I look forward to meeting you there and chatting more about how cloud operations could evolve.

Four Indicators to Measure Your Cloud Operations

When you drive a car, you want to know how fast you are going, how much fuel is left, and whether your engine is running normally. That’s where the dashboard plays an important role. The same is true for your cloud operations. For a cloud administrator to know whether the cloud is running fine, he or she needs accurate and meaningful operational indicators. Four indicators are essential if you want a complete picture of your cloud operations. You need these indicators at each level of your cloud environment: devices, pools, and services.

1. Workload/Performance

Just as the speedometer tells you how fast or slow your car is going, you need an indicator to tell you how fast or slow your cloud resources, including compute, storage, and network, are performing. There are two types. One measures utilization or workload, which is a snapshot of how much of the total capacity is currently in use. For example, the CPU and memory utilization of a compute resource pool belong to this category. The other measures the “performance”, fast or slow, of the resource, such as disk I/O or network I/O. In the cloud, you need an indicator that measures workload and performance not only at the VM or storage array level but also at the resource pool, pod, or even tenant level.
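As a small illustration of measuring the same workload indicator at different levels, the Python sketch below rolls per-VM CPU figures up to the pool or tenant level. The record fields (`pool`, `tenant`, `cpu_used_ghz`, `cpu_allocated_ghz`) and the function name are assumptions invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def utilization_by(records: Iterable[Mapping], level: str) -> Dict[str, float]:
    """Aggregate per-VM CPU usage into utilization per group (pool, tenant, ...)."""
    used = defaultdict(float)
    allocated = defaultdict(float)
    for record in records:
        group = record[level]
        used[group] += record["cpu_used_ghz"]
        allocated[group] += record["cpu_allocated_ghz"]
    # Skip groups with no allocation to avoid dividing by zero
    return {g: used[g] / allocated[g] for g in used if allocated[g] > 0}


vms = [
    {"pool": "gold", "tenant": "a", "cpu_used_ghz": 2.0, "cpu_allocated_ghz": 4.0},
    {"pool": "gold", "tenant": "b", "cpu_used_ghz": 1.0, "cpu_allocated_ghz": 4.0},
    {"pool": "silver", "tenant": "a", "cpu_used_ghz": 3.0, "cpu_allocated_ghz": 4.0},
]
print(utilization_by(vms, "pool"))
print(utilization_by(vms, "tenant"))
```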

2. Availability

This indicator tells you whether your cloud resources are up and available, measured against your operations goals. For service providers, the service level agreement (SLA) is a key element of their operations. In a private cloud, the IT department often has SLAs with business units as well. Being able to measure accurately against the cloud service level target is crucial in both private and public clouds.

3. Capacity

While the speedometer lets you act in real time, pressing the accelerator or the brake, the fuel gauge tells you how far you can go and whether you need to fill your tank at the next stop. In the cloud, you need to know not only how much workload you have now but also how much capacity you will have to add to meet demand. This is what a good capacity indicator tells you.
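One simple way to turn usage history into a “fuel gauge” is to fit a straight line through recent daily usage and see when it reaches capacity. The Python sketch below does this with an ordinary least-squares fit; the function name is made up, and a linear trend is an assumption that real workloads will often violate.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate days left before usage reaches capacity, using a linear trend.

    daily_usage holds one sample per day, oldest first.
    Returns None if there is too little data or usage is not growing.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage) / n
    # Least-squares slope and intercept of usage against day index
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_usage))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (n - 1))  # days counted from the latest sample


# Storage used (TB) over five days, growing 10 TB a day, in a 200 TB pool
print(days_until_full([100, 110, 120, 130, 140], 200))
```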

4. Health

Even if your car is running fine now and you have plenty of fuel, you won’t be happy if a tire suddenly blows out or your engine suddenly dies in the middle of the road. That’s why there are sensors and warning indicators on your dashboard, such as the low tire pressure or check engine light. You want an early warning so you can have the car checked before you go on the road. You need the same thing in cloud operations. A good health status (which is worth a separate article in itself) takes into consideration all the external events, behavior events (based on intelligent baselines and thresholds), the capacity situation, and the anticipated workload and demand to give you an accurate and predictive indicator of your cloud health.

One could argue that many tools today already provide some of those indicators. But you can’t take these indicators separately. They are all related to each other. Any single one reflects a part, not the whole picture, of your operations status. The following questions can help you identify a solution you can rely on.

  • Can I get a single solution that shows all four indicators holistically?
  • Are these indicators measuring the most important part of the cloud context – pools and services?
  • How can these indicators accurately reflect the dynamics in the cloud?
  • Can I get a predictive status that frees me from acting reactively?

You need a dashboard when you drive a car. Similarly, you need these indicators to drive your cloud operations.

End To End Service Level Management – A Key to Your Cloud Success.

Recently, I visited several large service providers who are rolling out their cloud offerings. One common question from them was how to measure the service level agreement in a way that tells them not only its financial impact but also its operations impact. By operations impact, they meant how they can operationally measure, maintain, and report the service level to their customers.

In many people’s eyes, the SLA is a number. But the real question is what that number means and why it matters. In the cloud, where the customer’s business relies on the service the cloud delivers, the answers become even more important. The SLA in the cloud should reflect the perspectives of all constituencies. Cloud users care about the real experience they get from the cloud. The availability may be good (it is calculated over a period of time, monthly or annually). But what about the service response time? Service providers, on the other hand, worry about their commitment to the customer. Proactively minimizing downtime and identifying potential problems becomes important.
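To see how a good availability number can hide a poor user experience, consider this small Python sketch. It computes availability over a period and a nearest-rank percentile of response times; the function names and the sample figures are made up for illustration.

```python
import math
from typing import Sequence


def availability_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Availability over a reporting period, as a percentage."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty sample."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# A 30-day month with 43.2 minutes of downtime is roughly "three nines"
month_minutes = 30 * 24 * 60
print(availability_percent(43.2, month_minutes))

# Response times in seconds: most requests are fast, but 5% take 3 seconds
response_times = [0.2] * 95 + [3.0] * 5
print(percentile(response_times, 95), percentile(response_times, 99))
```

In this made-up month the provider meets a 99.9% availability target, yet one request in twenty takes 3 seconds, which is the gap between the provider’s view and the user’s view described above.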

As the diagram shows, these two perspectives both have to be reflected in your service level management. One is related to the other. You’ve got to have an end-to-end view and at the same time understand the perspectives of both customers and providers. You’ve got to proactively measure, manage, and maintain the service levels. And you can’t do that manually in the cloud, where constant change is the norm. You need a solution that ties all these perspectives together and proactively maintains them. For service providers, this is the key to keeping costs down and customers happy.
