08 February 2013


DOGMATIC information security metrics

Whereas most of us in the profession see business advantages in having reliable, accurate, truthful data about information security, metrics are occasionally used for less beneficial, less ethical purposes.  There are situations in which information is deliberately used to mislead the recipient, for example where the reporting party wishes to conceal or divert attention from information security issues in their remit.

We have seen this most obviously in the context of regular performance reporting by service providers to their customers against SLAs (Service Level Agreements) and contractual requirements.  IT outsourcers or IT departments typically report "uptime", a metric that sounds straightforward enough at face value but turns out to be something of a minefield for the unwary.

Imagine, for instance, that I, an IT Operations manager for an IT outsourcer, report to you, the relationship manager for my client, that we have achieved our SLA targets of 98% uptime for the last month.  Sounds great, right?  Evidently you have set targets and we have met them.  Fantastic.  Imagine also that I don't just tick a box but provide a fancy performance management report complete with glossy cover, technicolor graphs and numerous appendices replete with lengthy tables showing reams of supporting data about the services.  Furthermore, I have been reporting like this for years, since the start of the contract in fact.  

Buried away in those graphs and tables spread throughout the report are some fascinating facts about the services.  If anyone has the patience and dedication to pore over the numbers, they might discover that the services were in fact unavailable to users several times last month:
  • 7 times for a few minutes each due to server hardware issues (total ½ hour);
  • Once for 1 hour to diagnose the above-noted issues, and once more for 2 hours to replace a faulty power supply (total 3 hours);
  • 31 times for between 1 and 4 hours each for backups (total 50 hours);
  • Once for nearly 2 days for a test of the disaster recovery arrangements (total 40 hours);
  • An unknown number of times due to performance and capacity constraints causing short-term temporary unavailability (total unknown).
The total downtime (more than 93½ hours) was far more than the 2% evidently allowed under the SLA (roughly 15 hours per month), so how come I reported that we achieved our targets?  There are at least five possible reasons (the sketch after this list shows how the first four can shrink 93½ hours of outage into a met target):
  1. Backups and disaster recovery testing are classed as 'allowable downtime', falling outside the defined services covered by the SLA;
  2. The short-term performance and capacity issues were below the level of detection (not recorded) and therefore it is not possible to determine a meaningful downtime figure;
  3. The individual events resulting from hardware glitches were short enough not to qualify as downtime, which is defined in the SLA as something vaguely similar to "identified periods of non-provision of defined services to the customer, outwith those permitted in this Agreement under sections 3 and 4, lasting for at least five (5) minutes on each and every occasion";
  4. Several of the downtime episodes occurred out-of-hours, specifically not within the "core hours" defined in the SLA;
  5. I lied!
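
By way of illustration, here is a minimal sketch in Python of how reasons 1-4 play out in practice.  Everything in it is invented for the example (the outage records, the category names, the core-hours assumptions are not from any real SLA); the point is simply that, applying the SLA's own exclusions to roughly 93½ hours of actual disruption, only a single hour 'counts':

```python
from dataclasses import dataclass

# Illustrative assumptions, invented for this example:
CORE_HOURS_PER_MONTH = 9 * 22      # 9 "core hours" a day, 22 working days
UPTIME_TARGET = 0.98               # the 98% target from the story
MIN_QUALIFYING_MINUTES = 5         # reason 3: shorter outages don't qualify
ALLOWABLE = {"backup", "dr_test"}  # reason 1: 'allowable downtime' classes

@dataclass
class Outage:
    minutes: float
    category: str                  # "fault", "backup" or "dr_test"
    in_core_hours: bool

outages = (
    # Reason 3: seven short hardware glitches of ~4 minutes each (~1/2 hour)
    [Outage(4.3, "fault", True)] * 7
    # The 1-hour diagnosis (core hours) and the 2-hour power-supply
    # replacement, which (reason 4) happened out-of-hours
    + [Outage(60, "fault", True), Outage(120, "fault", False)]
    # Reason 1: 31 overnight backup windows totalling 50 hours
    + [Outage(50 * 60 / 31, "backup", False)] * 31
    # Reasons 1 and 4: the 40-hour disaster recovery test, over a weekend
    + [Outage(40 * 60, "dr_test", False)]
    # Reason 2: the capacity brown-outs were never recorded, so they
    # simply do not appear in the data at all
)

def counts_as_downtime(o: Outage) -> bool:
    """Apply the SLA's exclusions (reasons 1, 3 and 4)."""
    return (o.category not in ALLOWABLE
            and o.in_core_hours
            and o.minutes >= MIN_QUALIFYING_MINUTES)

actual_hours = sum(o.minutes for o in outages) / 60
sla_hours = sum(o.minutes for o in outages if counts_as_downtime(o)) / 60
uptime = 1 - sla_hours / CORE_HOURS_PER_MONTH

print(f"Actual downtime:  {actual_hours:.1f} hours")  # ~93.5 hours
print(f"SLA downtime:     {sla_hours:.1f} hours")     # just the 1-hour diagnosis
print(f"Reported uptime:  {uptime:.1%}")              # comfortably above 98%
```

Each exclusion is defensible on its own; it is their combined effect, and the resulting gulf between the reported number and the users' experience, that makes the metric so misleading.
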
Possibly as a result of complaints from your colleagues and management concerning the service, you may take me to task over my report, and we will probably discuss reasons 1-4 in a fraught meeting (strangely enough, both of us know there is a fifth reason, but we never actually discuss that possibility!).  I will quote the SLA's legalese at you and produce reams of statistics, literally.  You will make it crystal clear that your colleagues are close to revolting over the repeated interruptions to their important business activities, and will do your level best to back me into a corner where I concede that Something Will Be Done.  After thrashing around behind the bike sheds for a while, we will eventually reach a fragile truce, if not mutual understanding and agreement.

Such is life.  

We both know that "uptime" is a poor metric.  Neither of us honestly believes that the 98% target, as narrowly and explicitly specified by our lawyers in the SLA, is reasonable, and we both know that the service is falling short of the customer's expectations, not least because those expectations have almost certainly changed since the SLA was initially drawn up.  However, this is a commercial relationship with a sole supplier, in a situation where finding and transferring to an alternative supplier would impose an infeasibly high cost on you.  I have commitments to my stakeholders to turn a profit on the deal, and you vaguely remember that we were selected on the basis of the low cost of our proposal.  Uptime is not, in fact, the real issue here, but merely a symptom and, in this case, a convenient excuse for you and me to thrash out our differences every so often and report back to our respective bosses that we are On Top Of It.

Uptime has been used in this way for decades, pre-dating the upsurge in IT outsourcing.  It has never been a particularly PRAGMATIC metric.  It is almost universally despised and distrusted by those on both sides of the report.  And yet there it remains, laughing at us from the page.
