Business-as-usual mode revolves around the normal execution of operations to ensure a reliable service. The data center is up and running, and the metrics for downtime, latency, and other measurements are being met. There is a lot of work that needs to be done to maintain this order, but the pace of events is steady and everyone knows what to expect.
The danger in this mode is to be lulled into thinking that just because things are going well you don’t have to be on top of what is going on. During these routine times the SLA can best be managed by generating reports and reviewing them regularly, whether every day, week or month. The single best piece of advice we can offer during this mode is to put your eggs in one basket and then watch that basket! Although a visit to a data center can be a very calming experience – most data centers are clean, organized, well-lit and quiet – don’t let the peacefulness fool you! Underneath the order, a data center is in a continuous state of flux. New applications are being implemented, new versions of software installed; new equipment is being brought online as old equipment is upgraded or replaced.
In normal times the routine makes this continual change easy to ignore, so it becomes particularly important not to lose sight of the bigger picture. From a philosophical perspective, software is designed and data collected to mirror the world. As the world is ever-changing, so too must the cloud’s configuration change. Both the customer and the service provider need to be aware of what is taking place on both sides of the phone at all times. How often you need to watch the basket depends on a number of factors.
Review periods
The first criterion for determining how often reports should be reviewed is the size of the relationship between customer and service provider. This is usually, but not always, based on the dollar size of the agreement. It is simply good business from the service provider’s perspective to concentrate on its best customers. Some of the factors that might give smaller customers more precedence are the level of support they have contracted, the references they have given to prospective clients, and the strength of their personal relationships with the vendor.
The client’s perspective about the size of the relationship may be very different from the service provider’s. IBM may account for a large percentage of your spend, but your spend may not be a very big part of IBM’s revenues. So as a rule of thumb, the larger the relationship, the more frequent the review.
Also playing into the size and importance of the relationship from the client’s perspective is the criticality of the application. An application in production, say an Enterprise Resource Planning (ERP) system, may be mission-critical (that is, if the application stops working, the organization stops as well), but an application being tested or developed may not be as critical. So production-level applications need shorter periods between reviews.
Also playing into the calculation of review periods are the tiers of service the customer has contracted. Many service providers have a gold, silver, and bronze program (with an occasional leap into platinum, but for marketing reasons never down into iron). Some clients pay more to be in a higher tier of service, and the tier may determine how often the service provider is willing to meet with the customer to review reports (although of course the customer may review them anytime without the provider’s participation).
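The factors above (size of the relationship, criticality of the application, and tier of service) can be combined into a simple scoring heuristic. The following is only a minimal sketch: the `Relationship` type, the spend thresholds, the tier points, and the score cut-offs are all hypothetical values chosen for illustration, not figures from any actual agreement.

```python
from dataclasses import dataclass

# Hypothetical points per service tier; real programs will differ.
TIER_POINTS = {"bronze": 0, "silver": 1, "gold": 2, "platinum": 3}


@dataclass
class Relationship:
    annual_spend_usd: float  # dollar size of the agreement
    tier: str                # contracted tier of service
    production: bool         # True if the application is in production


def review_interval_days(rel: Relationship) -> int:
    """Suggest how many days may pass between report reviews.

    A larger relationship, a higher tier, and a production
    application all push toward more frequent reviews.
    """
    score = TIER_POINTS[rel.tier]
    # Assumed spend thresholds, for illustration only.
    if rel.annual_spend_usd >= 1_000_000:
        score += 2
    elif rel.annual_spend_usd >= 100_000:
        score += 1
    if rel.production:
        score += 2
    # Assumed cut-offs mapping the score to daily, weekly, or monthly.
    if score >= 5:
        return 1
    if score >= 3:
        return 7
    return 30
```

Under these assumptions, a gold-tier production ERP system with a large contract would be reviewed daily, while a bronze-tier test environment with a small contract would be reviewed monthly. The softer factors mentioned earlier, such as references and personal relationships, are deliberately left out because they resist scoring.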
To be most effective, there should be two types of meetings: tactical and strategic. Tactical meetings are the more frequent of the two; if the application is big and critical, they may take place every day. They should have a formal agenda and focus on changes taking place either in the data center or on the client side. The tactical meeting does not take the place of a formal change request and follow-up process; its purpose is to get everyone on the same page about what formal requests to expect and to anticipate what impact these changes could have on the service. Any recent interruptions of service should also be discussed in these meetings.
The other type of meeting, which may take place biweekly, monthly, or quarterly, is a strategic review. Service and performance metrics should be analyzed and reviewed, and large client initiatives or major data center changes discussed. The goal of the strategic meetings is to identify how the continuous change in the data center can lead to continuous improvements in operations. Ways to leverage the experiences of other clients should be noted, since this is one of the major advantages of any managed service, but most especially of cloud services.
Although distance can often make the heart grow fonder, relationships tend to go stale if they are not maintained with human contact. The authors are big believers in dialogue on a regular basis and feel at a minimum there should be one face-to-face meeting every year. More frequent meetings are not a bad idea, depending on location and the size of the travel budget. Between face-to-face meetings, reviews can be done over the telephone or on the internet, but never having met someone in person can make things more difficult when times are not routine.
Crisis Mode
Crises are caused by an unstable situation arising in the data center that creates an immediate danger or unusual difficulty that grows over time. Hopefully you will never have to endure such a catastrophic crisis as the Deepwater Horizon oil spill in the Gulf of Mexico, but you very well may be faced with the cloud service being down for a period of time beyond the bounds of the SLA. If the technical staff does not know what is causing the system to fail or how to fix it, you will be in a full-blown crisis.
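To make "beyond the bounds of the SLA" concrete, it helps to translate an availability percentage into an allowance of downtime. The sketch below assumes the SLA states availability as a percentage over a fixed measurement period; the function names and the 30-day default are illustrative assumptions, and a real agreement may define the period and any exclusions (such as planned maintenance) differently.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


def sla_breached(downtime_minutes: float, availability_pct: float,
                 period_days: int = 30) -> bool:
    """True if the observed downtime exceeds the allowance."""
    return downtime_minutes > allowed_downtime_minutes(availability_pct, period_days)
```

For example, a 99.9 percent target over a 30-day period allows roughly 43 minutes of downtime, so a single one-hour outage would already put the service outside the agreement.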
The atmosphere during a crisis has a very different feel from the normal day-to-day. The pace of events is much quicker. Although crises inspire some to act heroically, not everyone can be relied upon to deal well with the pressure, uncertainty and risk that are to a crisis what darkness, rain and wind are to a storm. That is why it is a good idea to anticipate these situations in the Service Level Agreement and outline a formal procedure for escalation. Escalation comes from the word escalator and means to increase in size, intensity or scope. In the context of a cloud computing SLA it means requesting action from more senior staff in the chain of command or from other specialists. So that the purpose of the escalation and the nature of the required response are absolutely clear to all parties involved, having an agreed-upon written procedure is invaluable. The purpose of escalation is not to prevent the unpredictable but to bring more resources to bear on the problem in the hope that it will be resolved sooner.
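A written escalation procedure often takes the form of a ladder: the longer a problem stays unresolved, the more senior the people who are brought in. The following is a minimal sketch of that idea. The roles and time thresholds in `LADDER` are hypothetical placeholders; an actual SLA would name its own contacts and timings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unresolved before this step applies
    contact: str        # role to be engaged at this step


# Hypothetical ladder, ordered from least to most senior.
LADDER = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(30, "operations manager"),
    EscalationStep(120, "director of service delivery"),
    EscalationStep(240, "executive sponsor"),
]


def contacts_to_engage(minutes_unresolved: int,
                       ladder: List[EscalationStep] = LADDER) -> List[str]:
    """Return every role that should be involved by this point.

    Escalation adds people to the effort; it does not remove the
    ones already working the problem.
    """
    return [step.contact for step in ladder
            if minutes_unresolved >= step.after_minutes]
```

Calling `contacts_to_engage(45)` with this ladder returns the on-call engineer and the operations manager. The list is cumulative, which reflects the point that escalation brings more resources to bear instead of handing the problem off.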
A state-of-the-art data center has millions of parts (if you include software), and an attempt to fix it is necessarily going to be an exercise in complex problem-solving. Typically one side or the other cannot fix the problem by itself. It takes both customer and service provider working as one team to achieve a solution. The business of determining and apportioning fault can come later, but until the crisis is resolved, finger-pointing can be an expensive indulgence. In the context of a crisis the golden rule for everyone involved is to avoid being blind-sided, and to avoid blind-siding others.
It is also a good idea to stick to a few simple principles:
1. Over-communicate.
2. Go above and beyond.
3. Follow formal procedures closely.
Renegotiate
Eventually the term of the service level agreement will expire, and it will be time to negotiate a new service level agreement. Start with the objective reports of performance. If at all possible, give them to the other side in writing before the meeting begins.