Two tales, the same ending
When I was in college, I refilled my car only when the gas tank was empty, because I wanted to stretch my limited cash. That meant watching the fuel gauge on the dashboard as its needle approached “E,” for empty. But there was a twist. A car typically had about one more gallon left even when the needle pointed to “E.” So every time the gauge told me the tank was empty, I would drive another 20 miles or so before pulling into a gas station. One afternoon, I was driving on the highway when the engine suddenly stalled. I realized the tank had run out of gas. I had to walk a mile to a gas station and bring the gas back in a can. It turned out I had misestimated how far I could drive once the needle pointed to “E.” I guessed wrong because that day I had driven up a few steep slopes, which made the car burn fuel faster than its usual rate.
Joe (not his real name), one of our customers, had set 80% as the alert threshold for his cluster’s disk space usage. It was usually a good indicator for spotting a possible capacity shortage. A few weeks earlier, Joe had turned on VM snapshotting, which increased the rate at which disk space was consumed. The cluster had been above 80% for two weeks running. Since his cluster ran mainly VDI, Joe had always assumed that the remaining 20% of capacity was enough to last a few more months. So he ignored the alert and focused on other issues. But the increased consumption rate cut the time the cluster could keep running from months to two weeks. Soon, Joe got calls from his VDI users. They couldn’t sign in to their sessions. It took him a while to realize that the cluster had run out of capacity, and even longer to figure out that the cause was snapshotting, for which he had set the frequency too high.
Do you see the similarity between these two stories?
Mine is about gas and a car stopped in the middle of a highway. Joe’s is about disk capacity and end users who couldn’t use their computers. But there are similarities. Both started with guesswork that went wrong. Both ended with an unpleasant consequence. The difference is that the impact in Joe’s case was much bigger: it affected the operations of the entire company.
Information vs. signal
Let’s go back to the car. Nowadays, almost all cars show a new indicator on the dashboard. Some car manufacturers call it “mileage left,” while others call it “cruise range.” It tells you how many miles you can drive before the gas tank runs empty. No more guesswork, and no more stopping in the middle of the highway.
What’s the difference between a fuel gauge and the “mileage left” indicator? The former is a piece of information. The latter is a signal.
According to the OED, information is “facts provided or learned about something or someone.” The recipients still need to process it and figure out what they should do about it. Guesswork is inherently involved. A signal, on the other hand, is “an indication of a situation” or “an event or statement that provides the impulse for an occurrence.” In other words, a signal is something that triggers an action.
The fuel gauge reading and the usage percentage of a cluster are facts. The driver and the administrator have to process them and turn them into something they can act on. That requires them to estimate, if they can, based on their best knowledge of the surrounding situation. When the situation changes, the original estimate can be way off.
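To make the difference concrete, here is a minimal sketch in Python of how the same usage percentage can translate into very different amounts of time left. The function name `days_until_full`, the capacity, and the consumption rates are all hypothetical values chosen for illustration, and the straight-line extrapolation is a deliberate simplification, not the way any product computes its forecast.

```python
def days_until_full(capacity_tb: float, used_tb: float, rate_tb_per_day: float) -> float:
    """Naive linear estimate of the days left before a resource fills up."""
    if rate_tb_per_day <= 0:
        # Usage is flat or shrinking, so a linear estimate never hits the limit.
        return float("inf")
    return max(capacity_tb - used_tb, 0.0) / rate_tb_per_day


# Hypothetical cluster: 100 TB in total, 80 TB used.
# The "fact" (80% used) is identical in both cases below.
capacity_tb, used_tb = 100.0, 80.0

slow = days_until_full(capacity_tb, used_tb, rate_tb_per_day=0.2)  # assumed old rate
fast = days_until_full(capacity_tb, used_tb, rate_tb_per_day=1.5)  # assumed new rate

print(f"80% used at 0.2 TB/day: about {slow:.0f} days left")
print(f"80% used at 1.5 TB/day: about {fast:.0f} days left")
```

The percentage alone never changes between the two calls. Only the consumption rate does, and that is exactly the part the administrator would otherwise have to guess.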
As I pointed out in our last blog, today’s administrators face many challenges. They manage many moving parts. They process the massive amount of information that all their tools provide. And they are under pressure to support more applications with fewer IT staff. Put all these together, and the chance that an administrator misestimates the behavior of the infrastructure is high. Many times, those misestimations are costly to the business.
Modeled on the “mileage left” signal on your car’s dashboard, we developed a similar “signal” that tells IT staff when they need to act to clean, optimize, or expand their infrastructure to match the business growth. IT staff only need to know a single indicator: “runway.”
“Runway” in Prism means the number of days that the cluster can sustain the running workloads. It is X-Fit’s first application in Prism. You can find the runway number for each cluster in several places: the dashboard, the capacity detail tab, and the planning page. These numbers assume that the workload behavior won’t change. If you want to know the runway when the workload changes its behavior in the future, you can get that from the “Just-in-time forecast” page, which I will show in our next blog.
Prism measures runway in three layers: the CPU, the memory, and the disk space. The cluster runway is the shortest of these three runways. For example, if a cluster has 50 days left for memory but 180 days left for both CPU and disk space, Prism reports a cluster runway of 50 days.
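The “shortest of the three” rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, using the numbers from the example; the function name `cluster_runway` and the dictionary keys are hypothetical and do not correspond to any Prism API.

```python
def cluster_runway(runways_days: dict[str, int]) -> tuple[str, int]:
    """Return the constraining resource and the cluster runway (the shortest one)."""
    resource = min(runways_days, key=runways_days.get)
    return resource, runways_days[resource]


# Per-resource runways from the example in the text, in days.
runways = {"cpu": 180, "memory": 50, "disk": 180}

resource, days = cluster_runway(runways)
print(f"Cluster runway: {days} days (limited by {resource})")
```

Returning the constraining resource along with the number is a small design choice: it points straight at what needs attention first.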
Prism generates the runway number for a cluster once every 24 hours. The calculation factors in seasonality, noise, and capacity events (such as adding a new node or turning on compression). For disk space, Prism looks at the consumption rate of each storage container as well as the commonly shared capacity, and rolls them up to the cluster level.
A user can drill down into the details of each runway number. For example, by looking at the disk space runway chart at the cluster level, the user can see the types of storage and how they are being consumed. The user can then drill down into each container for more detail.
Furthermore, the user can set up an alert on the runway. The default out-of-the-box threshold is 90 days. Users can change that threshold based on their purchasing cycle or company policies.
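As a rough sketch of how such a threshold works, the check below flags a cluster whose runway has dropped under the configured number of days. The 90-day default comes from the text; the function name `runway_alert`, the strict “less than” comparison, and the 120-day purchasing cycle are assumptions made for illustration, not Prism’s actual alert logic.

```python
DEFAULT_THRESHOLD_DAYS = 90  # default threshold mentioned in the text


def runway_alert(runway_days: int, threshold_days: int = DEFAULT_THRESHOLD_DAYS) -> bool:
    """Return True when the runway has dropped below the alert threshold.

    Assumption: the alert fires when the runway is strictly below the threshold.
    """
    return runway_days < threshold_days


# With the default threshold, a 50-day runway raises an alert; 180 days does not.
print(runway_alert(50))
print(runway_alert(180))

# Hypothetical company whose purchasing cycle takes 120 days: a 100-day runway
# is fine under the default but already too short for this team.
print(runway_alert(100))
print(runway_alert(100, threshold_days=120))
```

Tying the threshold to the purchasing cycle means the alert arrives while there is still time to order and install new capacity.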
The video below shows how you can use runway in Prism.
Next week, I will show you what you can do when you receive a runway alert or when you need to add more workloads into the infrastructure.
Disclaimer: This blog is personal and reflects the opinions of the author, not necessarily those of Nutanix. It has not been reviewed nor approved by Nutanix.