The bread and butter of a company that offers services is, of course, just that: its services. Without them, it loses its raison d'être – so one would expect such companies to take every precaution to prevent service outages, which they no doubt do. And yet we see major outages of all kinds of services on nearly a weekly basis.
Those outages are not to be dismissed lightly; research by IDC shows that infrastructure failure can cost large companies as much as $100,000 per hour, while failure of critical applications could cost as much as $1 million an hour.
Just as bad is the “loss of face” many companies experience, as customers spread word of the failure – and their dissatisfaction – throughout social media sites.
Despite the stakes, though, service outages seem to be common at banks, service sites, and many others.
The question is: why? Are companies simply failing to take the necessary precautions? And assuming that is not the case, what additional steps can they take to ensure that they do not fall victim to service outages? What can they do to protect – or restore – their reputations?
One thing that many companies seem to do – and that probably does not benefit them in the long run – is play the “blame game.”
Enterprise networks are complicated creatures, and in companies with an IT staff of dozens or even hundreds of employees, there are plenty of heads to roll.
The first priority, of course, is to get services up and running. But after that begins the finger-pointing, recriminations, and investigations – a process that could go on for months, with reports, e-mails, and accusations flying back and forth.
Those who are stuck with the blame will seek out all manner of ways to defend themselves, blaming supervisors or outside factors.
Each outage is different, but the recrimination – and, often, termination – of the "responsible" party seems to follow a very similar pattern, no matter where it takes place.
But who is really to blame? Should the mistake of one employee really have that much impact on a corporate-wide network? Did the employee not have supervisors?
Certainly the company had an organised remediation plan or fallback system that would ensure that services were still available; why didn’t it work?
And isn't the person who was responsible for ensuring that such plans were in place, and who failed to do so, just as culpable as the hapless employee (usually very low in the corporate pecking order) who gets the blame?
All this really illustrates is the futility of playing the blame game. No one wins, and even if a company manages to pin the blame on an underling, the root problem – the basic reason for the outage – is probably still extant. To escape the blame game, then, companies must take several steps back and institute a system that prevents such outages altogether.
To do that, they may need some extra help. Hardware, software, cloud operations, and much more can impinge on service, and the only way a service provider can get ahead of these issues is by gaining insight into its operations and into how changes and shocks to the system – internal and external – affect service.
The only way to do that effectively is by instituting a predictive approach to IT operations, with processes in place to measure IT readiness and risk on an ongoing basis.
When a configuration update introduces a risk that is detected before it impacts operations, it can turn into a learning experience.
Rather than engaging in finger-pointing, teams can review the details and gain insights into how processes can be further improved looking forward, transforming the organisational culture to focus on operational excellence rather than pinning the blame.
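A predictive check of this kind can be sketched in a few lines. The configuration fields, rule definitions, and function names below are purely illustrative assumptions for the sake of the example, not any particular product's logic:

```python
# Hypothetical sketch: flag risky configuration changes before rollout.
# Field names and rules are invented for illustration only.

def assess_change_risk(proposed_config, baseline_config, rules):
    """Return a list of risk findings for a proposed configuration change."""
    findings = []
    for rule in rules:
        field = rule["field"]
        old, new = baseline_config.get(field), proposed_config.get(field)
        # Only changed fields that trip a rule's check produce a finding.
        if old != new and rule["check"](new):
            findings.append(f"{field}: {rule['message']} (was {old!r}, now {new!r})")
    return findings

# Example rules: flag disabling replication or shrinking the failover pool.
RULES = [
    {"field": "replication_enabled",
     "check": lambda v: v is False,
     "message": "replication disabled - single point of failure"},
    {"field": "failover_nodes",
     "check": lambda v: (v or 0) < 2,
     "message": "fewer than two failover nodes - no redundancy"},
]

baseline = {"replication_enabled": True, "failover_nodes": 3}
proposed = {"replication_enabled": False, "failover_nodes": 3}

for finding in assess_change_risk(proposed, baseline, RULES):
    print("RISK:", finding)  # surfaced before rollout, not after an outage
```

The point of such a check is cultural as much as technical: a finding raised before deployment is a review item, not an incident, so the conversation it triggers is about improving the rule set rather than assigning blame.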
There’s an old saying attributed to “Unknown” (s/he was the source of many nuggets of wisdom): The first time you make a mistake it’s an accident, the second time you make the same mistake it’s on purpose, and the third time you make that same mistake it’s no longer a mistake, it’s a habit.
The greatest lesson companies need to draw from serious service outages is not whom to blame – but to ensure that service outage accidents and even mistakes don’t become habits.
Sourced from Yaniv Valik, VP of Customer Success at Continuity Software