Although interrelated, each of these measures provides a different insight into what is going wrong, and each must be assessed accurately. Variations in these performance variables can be caused by malfunctions on your end, on your cloud service provider’s end, or somewhere in between. The sooner you can diagnose where the problem is, the sooner it will be fixed.
At the same time, not all problems are created equal. Many SLAs specify how much time can elapse before a provider is obligated to respond. In the industry, the time between when the provider answers the call and when the problem is fixed is known as the resolution time. Resolution times are typically based on the severity of the problem, which is ranked on a three- or four-point scale such as:
- Mission critical
- Same day
- Within a week
- Whenever the provider can get to it
To maintain a productive working relationship during times of stress and crisis, it is important not to be the boy who cried wolf and inflate reported response times to get results sooner. Accurate measurement has a direct bearing on the level of escalation.
Once the crisis has passed and the problem has been fixed, it is important to build Root Cause Analysis (RCA) reporting into the process. This encourages customers and providers to examine the deeper underlying causes of outages with the hope of avoiding them in the future.
Time factors
Adding to the challenge of monitoring performance is the factor of time, which is not always as straightforward to calculate as it might seem. Some SLAs specify a 30-day month, some a 28-day month, others a 31-day month, and a few even a calendar month (which makes February a good month for providers). As the table above indicates, there are two more problems with determining times.
First, like rust, cloud services never sleep, so expecting a single person or even a small team to monitor uptime manually is unrealistic. Second, the amount of allowable downtime, especially in the case of 99.999% uptime, is so small that you could turn your head and miss a violation of the service level agreement.
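To make the arithmetic concrete, here is a minimal sketch of how the downtime allowance changes with the month length an SLA specifies. The function name and the uptime levels other than 99.999% are our own illustrative choices, not terms taken from any particular SLA.

```python
def allowed_downtime_seconds(uptime_percent: float, days_in_period: int) -> float:
    """Seconds of downtime permitted in a period at a given uptime percentage."""
    period_seconds = days_in_period * 24 * 60 * 60
    # The downtime allowance is the share of the period not covered by the guarantee.
    return period_seconds * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Month lengths mentioned in the text; uptime levels are illustrative.
    for days in (28, 30, 31):
        for uptime in (99.9, 99.99, 99.999):
            budget = allowed_downtime_seconds(uptime, days)
            print(f"{days}-day month at {uptime}% uptime: {budget:.1f} seconds allowed")
```

At 99.999% uptime, a 30-day month allows only about 26 seconds of downtime, and a 28-day month allows about 24 seconds.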
The solution
Hopefully by this point you have reached the same conclusion as the authors: manual monitoring of uptime and downtime is not a good practice. Fortunately, there are some very good vendors of tools for measuring performance; we have included a list of some of them below.
In our experience, taking the time to research the vendors and make a good purchase decision at the beginning will pay many dividends later, especially in terms of a better working relationship with the provider and ultimately more uptime. Some of the features to look for include:
- Dashboard, the more real time the better
- Periodic measurement that conforms with the SLA
- Measurement against all of the SLA's terms
- Reporting against all SLA terms
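As a rough illustration of periodic measurement and reporting against an SLA term, the sketch below turns a series of pass/fail probe results into a measured uptime figure and compares it with a target. It assumes one probe per fixed interval and counts each failed probe as a full interval of downtime. The function names, the 60-second interval, and the targets are hypothetical and do not come from any vendor's tool.

```python
from typing import Dict, Sequence, Union


def uptime_percent(results: Sequence[bool]) -> float:
    """Percentage of probes that succeeded."""
    if not results:
        raise ValueError("no probe results to measure")
    return 100.0 * sum(results) / len(results)


def sla_report(
    results: Sequence[bool], target_percent: float, interval_seconds: int
) -> Dict[str, Union[float, int, bool]]:
    """Summarize measured uptime against a target uptime percentage."""
    failures = len(results) - sum(results)
    measured = uptime_percent(results)
    return {
        "measured_percent": measured,
        # Assumption: each failed probe stands for one whole interval of downtime.
        "downtime_seconds": failures * interval_seconds,
        "meets_sla": measured >= target_percent,
    }


if __name__ == "__main__":
    # A 30-day month probed every 60 seconds, with 5 failed probes.
    probes = [True] * (30 * 24 * 60 - 5) + [False] * 5
    print(sla_report(probes, target_percent=99.9, interval_seconds=60))
    print(sla_report(probes, target_percent=99.99, interval_seconds=60))
```

The probe interval limits what can be verified. A 60-second interval cannot confirm a 99.999% target, because a single failed probe already exceeds the roughly 26 seconds of downtime that target allows in a 30-day month.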
It might not be a bad idea to choose a measurement vendor during the SLA negotiation process and use the metrics they provide to define the technical terms of the SLA.
Cloud vendors are providing customers with near-real-time reporting, but it may not be enough. Measurement of the event on your end and on your cloud service provider’s end can be very different. You may also end up working with more than one cloud provider. We suggest using a neutral third party to measure performance and report the facts to both provider and customer.
Conclusion
Having a common understanding of what the problem is, when it needs to be resolved, and the steps necessary to fix it will also avoid some of the major causes of provider and customer friction.
“Transparency builds trust,” said Mark Rivington, VP of Technology at Nimsoft, which was recently acquired by CA for $350 million.
The goal of accurate measurement, as with the SLA itself, is to build the capability of the customer and the provider to respond to crises as a team. In the next issue of Cloudbook Magazine, we will explore what to do when things go badly.
About J Bruce Daley
Founder & CTO at Test Common, Inc
A recognized expert in software, Bruce Daley has founded or co-founded six enterprises with very different business models: a publication (The Siebel Observer), a radio business (eCommerce Update), an event (The Enterprise Software Summit), a consulting business (Great Divide Research), an investment advisory firm (Rabbit Ears Capital Advisors), and a social network to test software (Test Common). His publications have been read in over 34 countries and he has a patent (pending) for software testing.
About Alan Rudolph
Senior Vice President at Polycom
Alan is an expert on the economics of cloud computing and in the acquisition and integration of consulting companies. Alan Rudolph has been actively involved in the successful implementation of applications and the building of consulting practices for over 25 years. He was a Managing Director at ACS responsible for the company’s Applications Solutions Group. Prior to coming to ACS, he was director of product delivery at Corio before and after its acquisition by IBM. Prior to that, Mr. Rudolph served as COO of Planalytics, a business intelligence company, where he was recruited to reorganize the company’s sales and marketing, product development, and financial operations.