Part 2 – 6 Capabilities in the New Generation of Cloud Operations Management
In part 1 of this topic, I talked about the paradigm shift from boxes, applications, and ownership in the classic data center to the new cloud model, where pools, services, and sharing drive operations and deliver powerful value. In this part, I want to examine 6 new capabilities that operations management must develop in order to deliver that value in the cloud.
1. Operate on the “pools”
Traditionally, operations management solutions have focused heavily on individual servers, storage arrays, and network devices. In the cloud, however, all of these fade into the background and you need to operate at the “pool” level. You have to look beyond what you are monitoring at the individual device level and make sure you have immediate access to the operational status of the pool. That status could be aggregated workload (current usage) and capacity (past usage and future projection). More importantly, it needs to accurately reflect the underlying health of the pool, because the availability of an individual component is not the same as the availability of the pool. For example, when one ESX host in the pool has a problem, the VMs on that host can be migrated to other hosts in the same pool (through vMotion, Live Migration, etc.) as long as the pool still has the capacity. So one unavailable host doesn’t mean the pool is unavailable. The operations management solution you are using should understand the behavior of the pool and report health status based on it.
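To make the host-versus-pool distinction concrete, here is a minimal sketch in Python. It is only an illustration: the `Host` class, the `pool_status` function, and the three status labels are names I made up, and capacity is reduced to a single number per host, which ignores whether each VM actually fits on a surviving host.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    capacity: float       # what the host can supply (hypothetical single unit)
    used: float           # workload currently placed on the host
    healthy: bool = True


def pool_status(hosts):
    """Report pool health from pooled capacity, not from any single host."""
    healthy = [h for h in hosts if h.healthy]
    demand = sum(h.used for h in hosts)        # load that must run somewhere
    supply = sum(h.capacity for h in healthy)  # capacity still available
    if not healthy or demand > supply:
        return "unavailable"
    if len(healthy) < len(hosts):
        return "degraded"   # a host is down, but its VMs still fit elsewhere
    return "healthy"


pool = [
    Host("esx1", capacity=100, used=60),
    Host("esx2", capacity=100, used=30, healthy=False),
    Host("esx3", capacity=100, used=20),
]
print(pool_status(pool))
```

With one of three hosts down, the pool is reported as degraded rather than unavailable, because the remaining two hosts can absorb the total load.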
2. Monitor elastic services
The cloud is all about elasticity. That means several things. First of all, your services will dynamically expand and contract based on demand. Your operations management needs to fully adapt to this dynamic nature. For example, if you are monitoring the performance of a service, you need to make sure your monitoring coverage expands or contracts with the service, and does so automatically. This means you can’t rely on a manual process to figure out where monitoring is needed and deploy it to the target. Your operations management solution needs to know the configuration of that service and automatically deploy or remove the necessary agents.
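One way to picture coverage that follows the service is as a reconciliation step: compare the current members of the service with the set of instances that already have an agent, then act on the difference. The sketch below assumes both lists are available from somewhere. The `reconcile` function and the instance names are hypothetical, and the actual agent deployment is left out.

```python
def reconcile(service_members, monitored):
    """Return (to_deploy, to_remove) so monitoring coverage matches the service."""
    members = set(service_members)
    covered = set(monitored)
    to_deploy = sorted(members - covered)   # new instances without an agent
    to_remove = sorted(covered - members)   # agents on instances that are gone
    return to_deploy, to_remove


# The service has scaled out to vm1..vm3, while agents exist on vm2 and vm4.
deploy, remove = reconcile(["vm1", "vm2", "vm3"], ["vm2", "vm4"])
print(deploy, remove)
```

Running this on every configuration change, or on a short timer, keeps coverage in step with the service without anyone tracking it by hand.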
Secondly, especially for an enterprise that is building a private cloud, you want to cover both your cloud and non-cloud resources. Why? Because chances are you have a multi-tiered application whose static, legacy parts, such as the database or persistence layer, are still deployed on physical boxes. If you are in such a situation, you want to monitor your service no matter where its resources are located, cloud or non-cloud. In addition, such a management solution should natively understand the different behavior of each environment.
Furthermore, when you have resources in both private and public clouds, you need to look for a solution that lets you monitor your services on both sides seamlessly and supports inter-cloud service migration. At the end of the day, you want your service to be monitored no matter where its resources are located. Your operations management solution should know their location and understand their behavior accordingly.
3. Detect the issue before it happens
Compared with the traditional data center, workloads in the cloud vary more widely because of the cloud’s elastic nature. When service responsiveness is important, relying on reactive alerts or events is not an option for supporting a high-level SLA, particularly for service providers. You need to know about the issue before it even happens. How do you do that? Your monitoring solution should know how to learn the behavior of your cloud infrastructure and your cloud services. This is not new. But in the cloud, device-level behavior changes more rapidly and less uniformly. Your solution should have the ability to learn behavior at the pool and service levels. Based on that understanding, it should give you predictive warnings, such as on capacity, so that you can isolate the problem before it impacts your customers.
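As a very small example of a predictive capacity warning, the sketch below fits a straight line to recent pool usage samples and estimates how many sampling periods remain before the trend reaches capacity. This is a deliberately naive stand-in for real behavior learning: a linear trend is my assumption, and the `periods_until_full` function and the sample numbers are invented for illustration.

```python
def periods_until_full(usage, capacity):
    """Estimate sampling periods left before a linear usage trend hits capacity.

    Returns None when there is too little data or usage is not growing.
    """
    n = len(usage)
    if n < 2:
        return None
    # Ordinary least-squares fit of usage against sample index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (capacity - intercept) / slope  # index where the trend meets capacity
    return max(0.0, crossing - (n - 1))        # measured from the latest sample


remaining = periods_until_full([50, 55, 60, 65, 70], capacity=100)
print(remaining)
```

A warning could then be raised whenever the estimate drops below the lead time you need to add capacity. Real workloads are seasonal and bursty, so a production solution would need a richer model than a straight line.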
Speaking of problems, when you try to pinpoint one, make sure you have done proper root cause analysis. This becomes even more critical in the cloud, where a large number of resources are involved. The Amazon outage that happened a few weeks ago is a good example. According to the Amazon service dashboard, a network event caused many EBS nodes in the affected region to think they had lost their mirrors, and the central management policy kicked in to reconfigure them. As a service provider sitting in front of your monitoring dashboard, you would probably see a sea of red alerts suddenly appear. Even though that network alert is among them, chances are you are not going to notice it. Your operations management solution should intelligently detect the root cause and highlight that network event in your dashboard or pass it to the remediation process.
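A simple way to cut through a sea of red alerts is to use dependency information: an alerting component that depends on another alerting component is more likely a symptom than a cause. The sketch below applies that rule. The `probable_root_causes` function, the component names, and the dependency map are all hypothetical, only loosely modeled on the scenario above, and the map is assumed to be acyclic.

```python
def probable_root_causes(alerting, depends_on):
    """Keep only alerting components with no alerting dependency upstream."""
    alerting = set(alerting)

    def has_alerting_upstream(node):
        # Walk the dependencies of `node` transitively.
        seen, stack = set(), list(depends_on.get(node, []))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting:
                return True
            stack.extend(depends_on.get(dep, []))
        return False

    return sorted(n for n in alerting if not has_alerting_upstream(n))


# Hypothetical topology: each component maps to what it depends on.
depends_on = {
    "ebs-node-1": ["network"],
    "ebs-node-2": ["network"],
    "ec2-instance-1": ["ebs-node-1"],
    "rds-1": ["ebs-node-2"],
}
alerts = ["ec2-instance-1", "rds-1", "ebs-node-1", "ebs-node-2", "network"]
print(probable_root_causes(alerts, depends_on))
```

Five alerts collapse to a single candidate, the network. Real root cause analysis also has to weigh timing and incomplete topology data, so treat this as the basic idea only.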
4. Make holistic operations decisions
In the cloud, you have to manage more types of constructs in your environment. In addition to servers, OSs, and applications, you will have compute pools, storage pools, network containers, services, and, for service providers, tenants. These new constructs are interrelated. You can’t view their performance and capacity data separately. You have to look at it holistically. Take Amazon’s recent outage as an example again. The root cause was in its network, but it affected the storage pool immediately. That in turn had a huge impact on its EC2 instances, CloudWatch, and RDS services, as well as on many of its customers. If you treat those symptoms separately, you won’t have a solid plan to recover quickly from the outage. You had better know who your critical customers are and which services they use so you can recover them first. You may also want to alert affected customers proactively to let them know about the issue. Your operations management solution should give you a panoramic view of all these aspects and their relationships. Not only does it let you quickly isolate the problem, it also saves you a lot of money if you know whose SLAs you don’t want to breach and can focus your team on handling them first.
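To illustrate how related constructs can feed one decision, the sketch below walks a dependency map downstream from a failed construct, finds the tenants whose services are touched, and orders them by SLA exposure. The function names, the construct names, and the `penalty_per_hour` field are assumptions made up for this example.

```python
from collections import deque


def impacted(root, dependents):
    """All constructs reachable downstream of a failed construct."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def recovery_order(root, dependents, tenants):
    """Tenants touched by the failure, highest SLA penalty first."""
    hit = impacted(root, dependents)
    affected = [t for t in tenants if hit & set(t["services"])]
    ranked = sorted(affected, key=lambda t: t["penalty_per_hour"], reverse=True)
    return [t["name"] for t in ranked]


# Hypothetical relationships: each construct maps to what depends on it.
dependents = {
    "network": ["storage-pool"],
    "storage-pool": ["svc-a", "svc-b"],
}
tenants = [
    {"name": "acme", "services": ["svc-a"], "penalty_per_hour": 500},
    {"name": "globex", "services": ["svc-c"], "penalty_per_hour": 900},
    {"name": "initech", "services": ["svc-b"], "penalty_per_hour": 1200},
]
print(recovery_order("network", dependents, tenants))
```

The same list can drive proactive notifications to the affected tenants, while the tenant whose service is untouched is left alone.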
5. Enable self-service for operations
To give your customers a better experience and reduce your support costs, one of the best options is to give them constant feedback. Traditionally, performance data has generally not been available to the end user. In the cloud, you have a larger number of users and service requests and relatively few administrators. You want to minimize “false alarms” and routine manual requests. The best way is to let your end users see the performance and capacity data for their services. You can also let your users define the KPIs to monitor, the thresholds they want to set, and routine remediation processes they want to trigger (such as auto-scaling). Your operations management solution should allow you to easily plug that data into your end user portal.
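User-defined KPIs, thresholds, and remediation triggers can be pictured as small rules that the portal stores and the monitoring side evaluates against each new sample. The sketch below is one possible shape for that. The `KpiRule` class, the `evaluate` function, and the metric names are hypothetical, and the remediation action is just a callback standing in for something like an auto-scaling request.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class KpiRule:
    metric: str                    # KPI chosen by the end user
    threshold: float               # level the user wants to be warned at
    action: Optional[Callable[[str, float], None]] = None  # optional remediation


def evaluate(rules, sample):
    """Return the metrics that breached their threshold, firing any actions."""
    breached = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            breached.append(rule.metric)
            if rule.action is not None:
                rule.action(rule.metric, value)
    return breached


triggered = []
rules = [
    KpiRule("cpu_pct", 80, action=lambda m, v: triggered.append((m, v))),
    KpiRule("latency_ms", 200),
]
result = evaluate(rules, {"cpu_pct": 91, "latency_ms": 120})
print(result, triggered)
```

Only the CPU rule fires here, and its callback records the request that a real system would hand to the scaling process.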
6. Make cloud services resilient
This is the ultimate goal of your cloud operations management. If you have a solution that can understand the behavior of your services and proactively pinpoint potential issues, the natural next step is to have that solution automatically isolate and eliminate the problem. It sounds simple, but it is more complicated than it appears on the surface. First, you need to make sure your solution really does have accurate behavior learning and analytic capabilities. Second, you still need to keep humans in control through well-defined policies, whether they are enforced by an automated policy engine or by an interactive human process. Last, your solution should plug seamlessly into other life cycle management solutions, such as provisioning, change management, and service request management. Operations management alone can’t make your cloud resilient. You need the right architectural design (e.g. designing for failure) to start with, and a good management process that reflects the paradigm shift, to ensure your success.
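Keeping humans in control through policy can be as simple as a gate in front of every automated remediation. The sketch below assumes a policy with three parts: actions that may run automatically, actions that are never allowed, and a risk ceiling for automation. The `decide` function, the policy keys, and the action names are all hypothetical.

```python
def decide(action, risk, policy):
    """Return 'auto', 'approval', or 'blocked' for a proposed remediation."""
    if action in policy.get("blocked", ()):
        return "blocked"
    if action in policy.get("auto", ()) and risk <= policy.get("max_auto_risk", 0):
        return "auto"
    return "approval"   # anything else waits for a human decision


policy = {
    "auto": {"restart_service", "scale_out"},
    "blocked": {"delete_volume"},
    "max_auto_risk": 2,
}
print(decide("scale_out", 1, policy))
print(decide("scale_out", 3, policy))
print(decide("delete_volume", 1, policy))
```

Defaulting to approval for anything the policy does not explicitly cover is the conservative choice: automation only acts where someone has decided in advance that it may.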
By no means do these 6 capabilities cover all aspects of the new generation of cloud operations management. But they are a good start based on what we have heard from our customers and leading cloud providers. What are the most important shifts in your operations management? What critical capabilities do you think a cloud operations management solution should have? I welcome your thoughts.