Automation Creates New Challenges in Cloud Operations

When technology advances, it creates enormous value and allows society to run more efficiently. Automation is one such advance. From daily life to manufacturing, automation improves our productivity, letting us do more in less time and with greater accuracy. But it also has a side effect: automation magnifies both the positive and the negative, especially when it is hard, by design, for humans to intervene.

Consider the saw. Cutting a log with a hand saw takes a lot of time and effort; a chainsaw makes the same job far easier. Why? Because the chainsaw's engine transfers energy efficiently to a cutting chain that runs at much higher speed. The automation in the chainsaw improves the productivity of anyone who needs to cut a log. But if you misuse it or have an accident, the damage a chainsaw can cause is far greater than anything a hand saw can do.

The same is true in IT operations. Automation makes IT operations more efficient, but mistakes made by humans and machines can cascade quickly and do far more damage. The Amazon outage storm of 2011 is a perfect example: the automated EBS mirroring process is innocent by itself, yet it acted as the catalyst for the outage storm. This is the nature of the cloud, which is built on top of massive automation.

How should you react? First, accept it. It has happened to many public cloud providers, and chances are it will happen in your private cloud environment too. What matters is spotting it quickly and stopping the cascade before it causes major damage. The operations management tool you choose for your automated environment should give you this edge. Monitoring every resource supporting your cloud in the traditional way won't cut it; it produces so much data that you can't grasp what the data actually means. Look for a tool that gives you insight into your cloud environment without burying you, and your productivity, under a mass of meaningless data.

How Can You Prevent Cascading Errors From Causing an Outage Storm in Your Cloud Environment?

Last week, we talked about how shared resource pools change the way IT operates a cloud environment, and how measuring pool decay helps you avoid false positives and reduce maintenance costs. Today, I am going to explain how you can address another major challenge in cloud operations: the outage storm.

An outage storm is typically caused by cascading errors and the lack of a mechanism to detect them. Chances are you are familiar with this issue. In April 2011, Amazon AWS experienced a week-long outage across many of its service offerings. I examined this incident in the article "Thunderstorm from Amazon Reminded Us the Importance of Weather Forecast." In a nutshell, a human error diverted major network traffic to a low-bandwidth management channel, flooding the communication between many EBS nodes. Because of a built-in automation process, these nodes started to replicate themselves unnecessarily and quickly consumed all the storage resources in the availability zone. Eventually this brought down not only EBS but every other service relying on it. Almost a year later, Microsoft Azure experienced a day-long outage. That time, a software glitch triggered an unnecessary built-in automation process and brought down server nodes. You can see the similarity between the two incidents: an error occurred and unintentionally triggered automation processes that were built for a different purpose. The outage storm, without any warning, brings your cloud down.

So how can you detect and stop the cascade as soon as possible? Let's look at these two incidents. At the onset, the environment seemed normal. Pool capacity looked good, I/O was normal, and the services running from these pools were not impacted. Everything felt under control because you were monitoring the availability of each individual resource. Suddenly, events started showing up on your screen. While you were trying to make sense of them, more and more events poured in, alerting you that many devices had lost availability. Before long, service desk tickets swamped in, and customers started complaining that large numbers of their services were experiencing performance degradation. Everything happened so fast that you had no time to understand the root cause and make the necessary adjustments. Sound like a nightmare?

How can you prevent that from happening? My suggestion is that you need two things. One, you need to measure pool health; in particular, you need to monitor the distribution of health status across the pool's member resources. How many of them are in trouble? Do you see a trend in how the trouble propagates? What is the rate of propagation? Case in point: the Azure incident could have lasted longer and impacted more customers if the Microsoft team hadn't implemented its "human investigate" threshold. Even so, it lasted more than 12 hours, mainly because those thresholds relied on availability monitoring through periodic pings, and it took three timeouts in a row to trigger the pool threshold, which delayed the alert. So if you want to detect the storm at its onset, the second thing you need to do is detect abnormal behavior in the member resources, not just failed pings. Combining these two measurements, each device can reflect an abnormality-aware health status, and the pool can detect changes in the health distribution across its members. You, as an IT operations person, can set up rules to alert you when the health distribution crosses a critical threshold. A minimal sketch of this idea follows.
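Here is a minimal sketch, in Python, of what such a pool-level rule might look like. The health states, thresholds, and propagation-rate window are illustrative assumptions, not the behavior of any particular product.

```python
from collections import Counter

# Illustrative thresholds -- tune these for your own environment.
UNHEALTHY_FRACTION_ALERT = 0.10   # alert when >10% of members are abnormal
PROPAGATION_RATE_ALERT = 0.05     # alert when the abnormal share grows >5% per interval

def pool_health_alert(previous_states, current_states):
    """Compare two snapshots of member health ('ok' or 'abnormal') and decide
    whether the pool-level health distribution warrants an alert."""
    prev = Counter(previous_states.values())
    curr = Counter(current_states.values())
    total = len(current_states) or 1

    unhealthy_fraction = curr["abnormal"] / total
    propagation_rate = (curr["abnormal"] - prev["abnormal"]) / total

    if unhealthy_fraction > UNHEALTHY_FRACTION_ALERT:
        return f"ALERT: {unhealthy_fraction:.0%} of pool members are abnormal"
    if propagation_rate > PROPAGATION_RATE_ALERT:
        return f"ALERT: abnormality spreading at {propagation_rate:.0%} per interval"
    return None

# Example: 3 of 40 members flip to 'abnormal' between two polling intervals.
before = {f"node-{i}": "ok" for i in range(40)}
after = dict(before, **{"node-1": "abnormal", "node-2": "abnormal", "node-3": "abnormal"})
print(pool_health_alert(before, after))  # propagation alert fires before 10% is ever reached
```

The point is not the specific numbers but that the rule fires on the shape of the distribution and its rate of change, well before every device is individually down.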

How does this benefit you? First, you get an alert as soon as that threshold is crossed, even while the overall performance and capacity of the pool still look good. You then have enough time to respond, for example by diverting services to another pool or quarantining the troubled devices. In addition, you won't be swamped by a flood of alerts from each affected device, guessing which one to look at first. You can run root cause analysis directly from that single pool-level alert.

The cloud is built with automation as the main mechanism for elasticity and agility. But occasionally, as in these two incidents, errors amplify their damage by cascading very quickly through that automation. Because of this inherent nature, outage storms happen more often than you think. If you operate a cloud environment, chances are you will face one soon. You need a solution that can detect resource health by learning behavior and that can measure changes in the distribution of health status at the pool level. The shared pool changes how you operate your cloud environment, and operations solutions need to evolve to help you measure pool decay and detect outage storms. Cloud-washing is not going to cut it.

To see how this works in the real world, visit booth 701 at this year's VMworld. You can see a demo there and get some ideas on how you would approach these problems. If you want to discuss this with me, please let the booth staff know.

Puzzle Pieces vs. LEGO Bricks: How Shared Resource Pools Changed Everything

June 23, 2012 marks the 100th birthday of Alan Turing. 76 years ago, Turing, just 24 years old, designed an imaginary machine to answer an important question: are all numbers computable? In doing so, he defined a simple yet the most powerful computing model known to computer scientists. To honor Turing, two scientists, Jeroen van den Bos and Davy Landman, constructed a working Turing machine. It is not the first time such a machine has been built; the interesting thing this time is that it was built entirely from a single LEGO Mindstorms NXT set.

The modern LEGO brick design was developed in 1958, and it was a revolutionary concept. A LEGO brick made 54 years ago still interlocks with bricks made today to construct toys and even a Turing machine. When you build a LEGO toy or machine, you don't need to worry about when or where the bricks were manufactured. You focus on the thing you are building, which standard shapes you need, and how many bricks of each. And you can get them at any LEGO store, no matter what you are building.

Sound familiar? This is very similar to how one builds a cloud service using resources from a shared fabric pool. You don't care which clusters or storage arrays host those resources. All you care about is the types (e.g., 4-CPU vs. 8-CPU VMs) and the service levels (e.g., platinum vs. gold) those resources need to support. Instead of treating individual element devices, such as compute hosts or storage arrays, as the key building blocks, IT now needs to focus on the logical layer that provides computing power to everything running inside the cloud: VMs, storage, databases, and application services. This new way of building services changes everything about how we measure, analyze, remediate, and optimize the resources shared within the fabric pool.

To understand why we need to shift our focus to pools and away from element devices, let's talk about another popular toy: the jigsaw puzzle. Last year, I bought a 3D Earth jigsaw puzzle for my son, who was three years old at the time. He was very excited: he had just taken a trip to Shanghai and was expecting a trip to Disney World, and he was eager to learn about all the places he had been and would be visiting. So he and I (well, mostly I) built the Earth out of puzzle pieces. The final product was a great sphere constructed from 240 pieces. We enjoyed it for two weeks, until one of the pieces went missing. How can you blame a three-year-old boy who wanted to redo the whole thing by himself? Here is the problem: unlike the two scientists who built the Turing machine from LEGO bricks, I can't simply go to a store and buy that one missing piece. I have to somehow find it or call the manufacturer for a replacement. In IT, this is called incident-based management. When all your applications are built on dedicated infrastructure devices, you customize those devices, and the way they are put together, to the particular needs of each application. If one of those devices has an issue, it impacts the overall health of that application. So you file a ticket, and the operations team does triage, isolation, and remediation.

In a cloud environment with shared resource pools, things happen differently. Because the pool is built from standard blocks and is shared by applications, you have the ability, through the cloud management system, to set policies that move VMs or logical disks around if the infrastructure blocks underneath them are hit by issues. So a small percentage of unhealthy infrastructure blocks doesn't necessarily need immediate triage and repair. If you monitor only the infrastructure blocks themselves, you will be overwhelmed by alerts that don't necessarily impact your cloud services, and responding to all of them immediately increases your maintenance costs without necessarily improving your service quality. Google did a study on the failure rate of its storage devices and found an AFR (annual failure rate) of about 8%. Assuming Google has 200,000 storage devices (in reality, it likely has more), you would see a storage alert somewhere in your environment roughly every half hour. How expensive is it to keep a dedicated team doing triage and fixing those problems?
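The arithmetic behind that "every half hour" figure is simple enough to check; the device count and AFR are the assumptions stated above.

```python
devices = 200_000          # assumed fleet size
afr = 0.08                 # 8% annual failure rate
hours_per_year = 365 * 24  # 8,760

failures_per_year = devices * afr                        # 16,000 failures a year
hours_between_failures = hours_per_year / failures_per_year
print(f"{hours_between_failures:.2f} hours between failures")  # ~0.55 h, roughly every 33 minutes
```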

So how do we know when services hosted in a pool will be impacted? We give this problem a name: pool decay. You need to measure the decay state, the combination of the performance behavior of the pool itself and the distribution of unhealthy building blocks underneath it. That way you can tell how the pool, as a single unit, performs and how much capacity it still has to provide computing power to the services it hosts. When you go looking for a solution that truly understands the cloud, check whether it can detect pool decay without giving you excessive false positives. Otherwise, you will just end up with a solution that is cloud-washing. A rough sketch of such a decay measure follows.
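As a rough illustration of how the two signals could be combined, and purely as an assumption rather than a product formula, a decay score might weight a pool-level performance anomaly score against the fraction of unhealthy blocks:

```python
def pool_decay_score(perf_anomaly, unhealthy_fraction, perf_weight=0.6):
    """Combine pool-level performance abnormality (0..1) with the share of
    unhealthy member blocks (0..1) into one decay score. The weighting is
    an illustrative assumption."""
    return perf_weight * perf_anomaly + (1 - perf_weight) * unhealthy_fraction

# A pool whose aggregate latency behavior is mildly abnormal (0.2)
# while 15% of its building blocks are unhealthy:
print(pool_decay_score(perf_anomaly=0.2, unhealthy_fraction=0.15))  # 0.18
```

The useful property is that neither a few bad blocks alone nor a mild performance wobble alone pushes the score up much, but the two together do.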

Back to my missing piece in the 3D jigsaw set: I finally found it under the sofa. But lesson learned, I now buy my boy LEGO sets instead.

Next week, we will examine how resource pools combined with automation introduce another well-known challenge: the outage storm. Stay tuned.

Cloud Operations Management Solution Takes On Today’s IT Challenge

I haven't posted since last VMworld. One reason is that I took two great vacations on beautiful Caribbean islands. But most of the time, I was working with a team of excellent people to finish a project that allows IT to run its cloud environment with high confidence. Today, I am very proud to say: we did it.

I have talked about how cloud computing poses new challenges for IT operations and why proactive performance management is now even more important. Today, we launched the next generation of our cloud operations management solution, with a set of capabilities to help IT take on those challenges. They range from a cloud panorama showing the performance health of every shared fabric resource in your environment to automated workflows that optimize those resources for provisioned workloads.

Actionable Views

A cloud environment is complex. Not only do you have to manage the infrastructure resources, such as storage, network, and compute, but you also need to understand how they collectively power cloud services through shared pools. Many approaches on the market today try to collect and show as much data as possible. We believe this is not efficient and actually prevents you from spotting the real issues and risks in your cloud environment. This new release gives IT operations and administration teams an actionable view: the cloud panorama. It not only summarizes data organized the way you see the cloud (pools, containers, and so on) but also lets you act on what you learn from that data.

High-Precision Root Cause Analyses

The data is important, but the meaning of the data is even more important. What operations staff want to understand is what the data really means for their day-to-day jobs. This is where analytics comes in. Analytics for performance and capacity data is not new. What is unique about the analytics enhanced in this release is that, for the first time, an analytics engine provides insight into how shared pools in the cloud power highly automated cloud services. Lacking this type of insight causes serious problems; think about last year's Amazon AWS outage and this year's Microsoft Azure disruption. In coming posts, I will explain why this matters to you and how you can execute high-precision root cause analyses to prevent this type of outage in your own cloud environment.

Intelligent Workflows

When end users ask for a new cloud service, such as new VMs, a new database instance, or new storage space, they get it almost instantly, because the provisioning of these services is automated. The challenge for cloud operators is ensuring these services run as expected from the get-go. Manually identifying, deploying, and configuring monitoring agents for these services is not an option. In this new release, we enable you to automatically deploy and configure your monitoring agents during the provisioning of the service. By doing so, all your cloud services become instant-on and instant-assured. In addition, when a service is provisioned, the solution tells the provisioning engine how to optimize the workload, leveraging the workload patterns it has learned and the capacity supply it knows about. Finally, the solution analyzes the data it collects and provides showback reports.

Cloud computing gives IT tremendous advantages in delivering services to end users, but it also creates new challenges that IT operations teams have to face. Over the past year, we at BMC worked very hard to understand those new needs. Today, we are excited to announce this new release of our cloud operations management solution. Through its actionable views, high-precision root cause analyses, and intelligent workflows, this release enables IT to confidently power the cloud, control the cloud, and take intelligent action to ensure high-quality service delivery. Take a look at the clickable demo my colleague Thad put together and check out the product page, particularly the 2-minute explainer. We will get into more detail in the coming weeks.

Proactive Performance Management in the Cloud

Today, BMC launched cloud operations capabilities to safeguard cloud users. As part of that, we released the latest version of our proactive performance management product. There are many new things in this release, but the one I am most excited about is its focus on cloud operations.


Behavior learning in the cloud

The cloud, compared to traditional IT infrastructure, has unique characteristics that require a different approach to day-to-day operations. Its mixed workloads, elastic nature, and service-centric principles mean that a static, reactive, and disparate operations solution won't generate enough power to propel the new cloud engine. That's why I am excited to see us focusing on tuning the analytics engine to better understand this new set of behaviors of cloud services and resources. For example, the engine can now support hundreds of VMs provisioned per hour and readjust its behavior learning within minutes. This release is just the first step in that direction. The team at BMC worked hard to learn from customers and the market, and assessed our current knowledge based on many years of successfully applying the behavior learning engine in virtualization environments. There is much we can leverage and some we can't, but that's the point: the cloud is different from anything we have seen so far, virtualization or not. We have learned from customers that behavior learning capabilities generate more and more value in a cloud environment where dynamics rule the world, from IT processes to services to infrastructure resources.

Get your value fast

The cloud market is developing rapidly, and customers who want to compete in this space need to make their offerings available fast. So making sure the operations solution can get up and running quickly was one focus of this release. There is now a guided wizard that lets you plan, install, and configure all the necessary cloud management pieces, including lifecycle management and proactive performance management. One of our customers used to take two weeks to get the whole cloud management solution up and running in a small environment; they have now done it in just two days, with an even more robust solution.

Scalability for cloud deployment

In this release, we also address the scalability difference between a typical service provider cloud environment and an enterprise data center. The solution can now serve performance data from 50,000 cloud devices to hundreds of concurrent users. This means not only that administrators can see and act on the data, but also that cloud end users can get that data themselves, much as they can with Amazon CloudWatch.

Public cloud monitoring for enterprise

Speaking of Amazon CloudWatch, many of our customers who have deployed services into EC2 monitor that data constantly. What they couldn't do was translate that data into actionable insight automatically. Now, we are providing out-of-the-box capabilities (a "knowledge module," if you are familiar with our product) that let enterprises pull CloudWatch performance data for their instances and feed it into our behavior learning engine. You can even build a service that spans both your private cloud and Amazon EC2 and use our solution to measure its availability, business impact, and workload, leveraging both our remote and in-guest monitoring capabilities on those public cloud instances. In addition, we also let you monitor Microsoft Azure remotely if you are building applications in its PaaS environment.
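For readers who want a feel for what pulling CloudWatch data involves, here is a minimal sketch using the public AWS SDK for Python (boto3). This is not our knowledge module, and the instance ID is a placeholder.

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Fetch the last hour of average CPU utilization for one EC2 instance,
# the kind of time series a behavior learning engine would baseline.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute samples
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```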

Of course, this is just a subset of the new features in this release. We will start sharing more information in the coming weeks. I will be at VMworld next month; you can meet us and see our demo at the BMC booth. I look forward to meeting you there and chatting more about how cloud operations can evolve.

Four Indicators to Measure Your Cloud Operations

When you drive a car, you want to know how fast you are going, how much fuel is left, and whether the engine is running normally. That's where the dashboard plays an important role. The same is true for your cloud operations. For a cloud administrator to know whether the cloud is running fine, he or she needs accurate and meaningful operational indicators. Four indicators are essential for a complete picture of your cloud operations, and you need them at each level of your cloud environment: devices, pools, and services.

1. Workload/Performance

Just as the speedometer tells you how fast or slow your car is going, you need an indicator that tells you how your cloud resources, including compute, storage, and network, are performing. This comes in two types. One measures utilization or workload, a snapshot of current usage against total capacity; CPU and memory utilization of a compute resource pool belong to this category. The other measures the "performance," fast or slow, of the resource, such as disk I/O or network I/O. In the cloud, you need an indicator that measures workload and performance not only at the VM or storage array level but also at the resource pool, pod, or even tenant level.

2. Availability

This indicator tells you whether cloud resources are up and available relative to your operational goals. For a service provider, the service level agreement (SLA) is a key operational element; in a private cloud, the IT department often has SLAs with business units as well. Being able to measure the cloud service level target accurately is crucial in both private and public clouds.
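As a simple illustration of what measuring this accurately looks like, the sketch below converts measured downtime into monthly availability and into the downtime budget remaining against an assumed 99.95% target.

```python
def monthly_availability(downtime_minutes, days_in_month=30):
    total_minutes = days_in_month * 24 * 60
    return 1 - downtime_minutes / total_minutes

target = 0.9995                        # assumed SLA target of 99.95%
budget = (1 - target) * 30 * 24 * 60   # about 21.6 minutes of allowed downtime per month

observed = monthly_availability(downtime_minutes=12)
print(f"availability: {observed:.4%}, downtime budget left: {budget - 12:.1f} min")
```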

3. Capacity

While the speedometer lets you act in real time, pressing the accelerator or the brake, the fuel gauge tells you how far you can go and whether you need to fill your tank at the next stop. In the cloud, you not only need to know how much workload you have now, you also need to prepare for how much capacity you will have to add to meet demand. This is what a good capacity indicator tells you.

4. Health

Even if your car is running fine now and you have plenty of fuel, you won't be happy if a tire suddenly blows out or the engine dies in the middle of the road. That's why your dashboard has sensors and warning indicators, such as the low tire pressure and check engine lights; you want an early warning so you can get the problem checked before you go on the road. You need the same thing in cloud operations. A good health status (itself worth a separate article) takes into account external events, behavior events (based on intelligent baselines and thresholds), the capacity situation, and anticipated workload and demand to give you an accurate and predictive indicator of your cloud's health.
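To make the idea concrete, here is a minimal, assumed scoring scheme that folds those inputs into one status; the weights and bands are illustrative, not a product formula.

```python
def health_status(external_events, behavior_events, capacity_headroom, demand_growth):
    """Fold four signals into one status. Event counts are per interval;
    capacity_headroom and demand_growth are fractions between 0 and 1."""
    score = 100
    score -= 10 * external_events            # e.g., hardware or network alerts
    score -= 15 * behavior_events            # deviations from learned baselines
    score -= 30 * max(0, demand_growth - capacity_headroom)  # demand outpacing headroom

    if score >= 80:
        return "green"
    if score >= 50:
        return "yellow"
    return "red"

# Two baseline deviations, 10% headroom, demand growing 25% per period:
print(health_status(external_events=0, behavior_events=2,
                    capacity_headroom=0.10, demand_growth=0.25))  # yellow
```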

One could argue that many tools today already provide some of these indicators. But you can't take them separately; they are all related to each other, and any one of them reflects only a part, not the whole picture, of your operational status. The following questions can help you identify a solution you can rely on.

  • Can I get a single solution that shows all four indicators holistically?
  • Do these indicators measure the most important parts of the cloud context: pools and services?
  • How accurately do these indicators reflect the dynamics of the cloud?
  • Can I get a predictive status that frees me from acting reactively?
You need a dashboard when you drive a car. Similarly, you will need these indicators to drive your cloud operations.

End To End Service Level Management – A Key to Your Cloud Success.

Recently, I visited several large service providers who are rolling out their cloud offerings. One common question from them was how to measure the service level agreement in a way that captures not only its financial impact but also its operational impact, by which they meant how to operationally measure, maintain, and report the service level delivered to their customers.

In many people's eyes, the SLA is just a number. The real question is what that number means and why it matters. In the cloud, when a customer's business relies on the service the cloud delivers, the answers become even more important. The SLA in the cloud should reflect the perspectives of all constituencies. Cloud users care about the real experience they get from the cloud: availability may look good (it is calculated over a period of time, monthly or annually), but what about service response time? Service providers, on the other hand, worry about their commitments to customers, so proactively minimizing downtime and identifying potential problems becomes important.

As the diagram here shows, both perspectives have to be reflected in your service level management; each is related to the other. You've got to have an end-to-end view and, at the same time, understand the perspectives of both customers and providers. You've got to proactively measure, manage, and maintain service levels, and you can't do that manually in a cloud where dynamism is the norm. You will need a solution that ties these different perspectives together and proactively maintains them. For service providers, this is the key to keeping costs down and customers happy.

Enterprise IT Puts Its Eye On Cloud Operations Management Solutions

Every year, we have several customer advisory meetings where we meet a group of our customers to share our progress and vision, but most importantly to seek valuable feedback from them on whether we are addressing their biggest challenges in our coming releases.

Last time, when we asked members of the advisory board what they would like to hear about, they told us by a large margin that they want to understand how the cloud will impact the ways, processes, and tools they use to operate IT.

Today, we meet again, and we will explain how the elasticity, responsiveness, and efficiency of the cloud require an evolved cloud operations management solution. This will also be a good opportunity for me to share our vision and explore several areas where we are making a big push. Most interestingly, I will conduct a survey so we can do a simplified conjoint analysis to gauge the real value of those areas: how they solve customer challenges in cloud operations, and how much benefit they deliver from the customer's perspective.

To prepare for the meeting, we ran a short survey. We found that the cloud adoption rate is very much in line with the data published by other vendors and analysts: in our case, about 50% reported they either have a cloud in production or have one in POC, and another 25% will consider it within a year. The latest hiring statistics seem to confirm this as well. Many of those doing cloud in our survey also indicated they are using operations tools. Some use the tools they already had in their traditional data center environment; some are looking for cloud-specific solutions. Nevertheless, almost all of them told us they are looking for a single pane of glass that gives them the whole picture of the cloud, including data and actionable processes for performance, availability, and capacity. We will find out more when I get the chance to discuss these points with customers at the meeting. I am excited and looking forward to it.

The Momentum You Can’t Ignore

Two years ago, cloud computing was just a new concept. To some extent, many enterprises were still skeptical. The industry was still trying to figure out whether the cloud is a technology, a marketing term, or a business model. And Larry had fun bashing it. He was being sarcastic at the time, but he touched on a good point: what is a cloud?

Today, the landscape has changed. Talking with many of our customers over the past year, I can feel the momentum. Initially, most of the questions I was asked were about what the cloud is; I call that stage day 0. Then, six months ago, the questions turned more and more toward the use and implementation of the cloud. GoGrid recently surveyed its users and found that 45% of respondents use IaaS in some way, shape, or form. Ovum also surveyed multinational clients and found the same trend. At the same time, many cloud delivery solutions, such as BMC's Cloud Lifecycle Management, VMware's vCloud Director, and even Oracle's own offering, are coming onto the market. These solutions help companies build a cloud and get it up and running. This is stage day 1.

In the last couple of months, I have started to hear questions such as:

  • “How can I make sure my cloud environment runs without outages?”
  • “How do I know whether my cloud services are at an acceptable availability level?”
  • “How can I fully automate my cloud operations processes so that I can realize the cloud promise: higher quality of service with lower operating costs?”

These questions reflect the fact that these customers have embraced the concept of the cloud and are ready to take advantage of its full value. They have graduated from day 0, are busy with day 1, and are ready to move to day 2, where cloud operations management will play a major role in helping them realize the value of the cloud.

How Does Operations Management Evolve in the Cloud? Part 2

Part 2 – 6 Capabilities in the New Generation of Cloud Operations Management

In part 1 of this topic, I talked about the paradigm shift from boxes, applications, and ownership in the classic data center to the new cloud model, where pools, services, and sharing drive operations and deliver the most value. In this part, I want to examine six new capabilities that operations management must evolve in order to deliver that value in the cloud.

1. Operate on the “pools”

Traditionally, operations management solutions have extensive coverage of individual servers, storage arrays, and network devices. In the cloud, however, all of these become background, and you need to operate at the "pool" level. You have to look beyond what you monitor at the individual device level and make sure you have immediate access to the operational status of the pool. That status could be aggregated workload (current usage) and capacity (past usage and future projection). More important, it needs to accurately reflect the underlying health condition of the pool, because individual component availability is not the same as pool availability. For example, when one ESX host in a pool has a problem, the VMs on that host can be migrated to other hosts in the same pool (through vMotion, Live Migration, etc.) as long as the pool still has the capacity; one unavailable host doesn't mean the pool is unavailable. The operations management solution you use should understand the behavior of the pool and report health status accordingly, along the lines of the sketch below.
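As a minimal sketch of that distinction, the assumed rule below treats the pool as available, though degraded, whenever the surviving hosts can still absorb the demand of a failed host's VMs; the capacity numbers are illustrative.

```python
def pool_status(hosts, failed, vm_demand_ghz):
    """hosts: {host_name: capacity_ghz}; failed: set of failed host names;
    vm_demand_ghz: total CPU demand of all VMs placed in the pool."""
    surviving_capacity = sum(cap for name, cap in hosts.items() if name not in failed)
    if not failed:
        return "healthy"
    if surviving_capacity >= vm_demand_ghz:
        return "degraded but available"   # VMs can be migrated onto surviving hosts
    return "pool at risk"                 # not enough headroom to absorb the failure

hosts = {"esx-01": 40, "esx-02": 40, "esx-03": 40, "esx-04": 40}   # GHz per host
print(pool_status(hosts, failed={"esx-02"}, vm_demand_ghz=95))     # degraded but available
```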

2. Monitor elastic service

The cloud is all about elasticity, and that means several things. First of all, services expand and contract dynamically based on demand, and your operations management needs to adapt fully to this dynamic nature. For example, if you are monitoring the performance of a service, you need to make sure your monitoring coverage expands or contracts with the service, and does so automatically. You can't expect a manual process to figure out where to deploy your monitoring capabilities. Your operations management solution needs to know the configuration of the service and automatically deploy or remove the necessary agents, as sketched below.
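Conceptually, that means hooking agent deployment to the provisioning events themselves. Here is a minimal, assumed sketch; the event shape and the deploy_agent/remove_agent helpers are hypothetical placeholders for whatever agent tooling you use.

```python
def deploy_agent(vm_id, profile):
    print(f"deploying '{profile}' monitoring agent to {vm_id}")

def remove_agent(vm_id):
    print(f"removing monitoring agent from {vm_id}")

def on_provisioning_event(event):
    """Called by the cloud management system for each lifecycle event,
    so monitoring coverage tracks the service as it grows and shrinks."""
    if event["type"] == "vm_provisioned":
        deploy_agent(event["vm_id"], profile=event["service_tier"])
    elif event["type"] == "vm_decommissioned":
        remove_agent(event["vm_id"])

on_provisioning_event({"type": "vm_provisioned", "vm_id": "vm-1021", "service_tier": "gold"})
on_provisioning_event({"type": "vm_decommissioned", "vm_id": "vm-0987"})
```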

Secondly, and especially for an enterprise building a private cloud, you want to cover both your cloud and non-cloud resources. Why? Because chances are you have multi-tiered applications whose static, legacy parts, such as the database or persistence layer, are still deployed on physical boxes. In that situation, you want to monitor your service no matter where its resources are located, cloud or non-cloud. In addition, such a management solution should natively understand the different behaviors of the different environments.

Furthermore, when you have resources in both private and public clouds, you need a solution that lets you monitor your services seamlessly on both sides and that supports inter-cloud service migration. At the end of the day, you want your service monitored no matter where its resources are located. Your operations management solution should know their location and understand their behavior accordingly.

3. Detect the issue before it happens

Compared to the traditional data center, workloads in the cloud vary more widely because of its elastic nature. When service responsiveness matters, relying on reactive alerts or events is not an option, particularly for service providers that need to support high-level SLAs. You need to know about an issue before it even happens. How? Your monitoring solution should learn the behavior of your cloud infrastructure and your cloud services. This is not new, but in the cloud, device-level behavior evolves more rapidly and with less conformity, so your solution should be able to learn behavior at the pool and service levels. Based on that understanding, it should give you predictive warnings, for example on capacity, that allow you to isolate a problem before it impacts your customers.
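As a minimal sketch of what learning behavior and giving predictive warnings can mean, the example below baselines a metric with its mean and standard deviation, flags values outside an assumed three-sigma band, and projects a simple linear capacity trend. The window sizes and thresholds are illustrative.

```python
from statistics import mean, stdev

def behavior_alert(history, latest, sigma=3.0):
    """Flag the latest sample if it falls outside the learned baseline band."""
    mu, sd = mean(history), stdev(history)
    if sd and abs(latest - mu) > sigma * sd:
        return f"abnormal: {latest} is outside {mu:.1f} ± {sigma}σ (σ = {sd:.1f})"
    return None

def intervals_until_full(usage_history, capacity):
    """Project a simple linear trend to warn before capacity runs out."""
    growth = (usage_history[-1] - usage_history[0]) / (len(usage_history) - 1)
    if growth <= 0:
        return None
    return (capacity - usage_history[-1]) / growth

iops = [510, 495, 502, 498, 505, 500, 507, 499]
print(behavior_alert(iops, latest=780))                           # abnormal spike
print(intervals_until_full([60, 64, 68, 72, 76], capacity=100))   # ~6 intervals of headroom left
```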

Speaking of problems, when you try to pinpoint one, make sure you have done proper root cause analysis. This becomes even more critical in the cloud, where large numbers of resources are involved. Amazon's outage a few weeks ago is a good example. According to the Amazon service dashboard, a network event caused many EBS nodes in that region to think they had lost their mirrors, and the central management policy kicked in to reconfigure them. As a service provider sitting at your monitoring dashboard, you would suddenly see a sea of red alerts. Even though the network alert is somewhere among them, chances are you are not going to notice it. Your operations management solution should intelligently detect the root cause and highlight that network event in your dashboard or to the remediation process.

4. Make holistic operations decisions

In the cloud, you have to manage more types of constructs in your environment. In addition to servers, OSs, and applications, you will have compute pools, storage pools, network containers, services, and, for service providers, tenants. These new constructs are interrelated; you can't view their performance and capacity data separately, you have to look at them holistically. Take Amazon's recent outage as an example again. The root cause was in the network, but it affected the storage pool immediately, and that had a huge impact on EC2 instances, CloudWatch, and RDS, as well as on many customers. If you treat those symptoms separately, you won't have a solid plan to recover quickly from such an outage. You had better know who your critical customers are and which services they use, so you can focus on recovering them at a higher priority. You may also want to send alerts to affected customers to proactively let them know about the issue. Your operations management solution should give you a panoramic view of all these aspects and their relationships. Not only does that let you isolate the problem quickly, it also saves you a lot of money when you know whose SLAs you cannot afford to breach and can focus your team on them first.

5. Enable self-service for operations

To give your customers a better experience and reduce your support costs, one of the best options is to give them constant feedback. Traditionally, performance data is generally not available to the end user. In the cloud, you have a larger number of users and service requests with a relatively lower ratio of administrators, so you want to minimize false alarms and routine manual requests. The best way is to let end users see the performance and capacity data surrounding their services. You can also let them define the KPIs to monitor, the threshold levels they want to set, and the routine remediation processes they want to trigger (such as auto-scaling). Your operations management solution should let you easily plug that data into your end user portal, along the lines of the sketch below.
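What a user-defined KPI with a remediation hook might look like, as a purely illustrative configuration; the metric name and the scale_out action are assumptions, not features of any particular portal.

```python
# A tenant-defined KPI: the metric to watch, its threshold, and the remediation to trigger.
user_kpi = {
    "service": "web-frontend",
    "metric": "avg_response_time_ms",
    "threshold": 250,            # alert and remediate above this value
    "remediation": "scale_out",  # hypothetical routine exposed through the portal
    "max_instances": 12,
}

def evaluate_kpi(kpi, observed_value, current_instances):
    if observed_value <= kpi["threshold"]:
        return "ok"
    if kpi["remediation"] == "scale_out" and current_instances < kpi["max_instances"]:
        return f"trigger scale_out for {kpi['service']}"
    return f"alert tenant: {kpi['metric']} = {observed_value}"

print(evaluate_kpi(user_kpi, observed_value=310, current_instances=8))
```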

6. Make cloud services resilient

This is the ultimate goal of your cloud operations management. If you have a solution that understands the behavior of your services and proactively pinpoints potential issues, the natural next step is to want that solution to automatically isolate and eliminate the problem. It sounds simple, but it is more complicated than it appears on the surface. First, you need to make sure your solution really does have accurate behavior learning and analytic capabilities. Second, you still need to keep humans in control through well-defined policies, whether via an automated policy engine or a human interactive process. Last, your solution should plug seamlessly into other lifecycle management solutions, such as provisioning, change management, and service request. Operations management alone can't make your cloud resilient; you also need the right architectural design (e.g., designing for failure) to start with and a good management process that reflects the paradigm shift.

By no means do these six capabilities cover every aspect of the new generation of cloud operations management, but they are a good start based on what we have heard from our customers and leading cloud providers. What is the most important shift in your operations management? What critical capabilities do you think a cloud operations management solution should have? I welcome your thoughts.
