How to avoid the season of IT forgiveness

Depending on the faith, the season of forgiveness falls at different times of year. In the world of e-commerce, it appears that July was the month of sin and salvation, or IT forgiveness.

July was the month, after all, that thousands of merchants on the Etsy platform apologised to their customers, the month that Etsy itself apologised to the public, and the month that Worldpay – the group that caused the apology-fest – apologised for its failings.

The debacle wasn’t Etsy’s fault at all, but in order to placate angry customers, the site profusely apologised for its “wrongdoing”.

If you’ve done something wrong, apologies are important – but shouldn’t we be striving to avoid having anything to apologise for in the first place?

Worldpay indeed owes Etsy an apology, but not just Etsy. It’s not only Etsy and the customers it let down – it’s also their shareholders, their workers, and their families, all of whom will probably suffer in the wake of the damage done to the service’s reputation.

The lesson for Worldpay – and any other service provider – has to be how to avoid getting into a situation where apologies are needed at all. The question for enterprises is: is that even possible?

>See also: Are the Salesforce and AWS outages dirty rain for the cloud?

In a debacle that eventually stretched across three weeks, merchants who use UK-based Worldpay – among them British Airways and the National Lottery – noticed that payments were not going through.

Most affected by the issue were customers of Etsy, the crafts sales platform that sells more than $1.5 billion in merchandise a year. Many Etsy merchants couldn’t process payments for three weeks, resulting in angry e-mails from customers, loss of reputation – and, of course, irretrievable loss of sales.

The specific reason for the outages was not revealed, with Worldpay saying initially that it was a “glitch”, and later blaming it on a software update.

The world may never know the nature of that glitch, of course, but it could have been one of a million things. Perhaps a poorly configured file that stopped functioning when a key piece of software was updated, or a permissions problem that prevented transactions from being executed on a server.

If the history of these kinds of outages in enterprise is any indication, chances are that even the Worldpay people do not know exactly what went wrong.

It took time, but both Etsy and Worldpay apologised to each other, and to the merchants who were caught up in the situation.

On July 19, Worldpay issued a statement saying that it was “experiencing an isolated issue with one of its gateways which is affecting a very small proportion of our customers (substantially less than 1%) and a small proportion of the transactions that we process daily”.

“Efforts to resolve the issues causing settlement delays are ongoing,” the statement continued. “We sincerely apologise for the inconvenience this has caused.”

In another statement the next day, Worldpay said: “We are taking steps to implement changes, with further testing already underway, with the aim of restoring normal operational service as soon as possible, and have proactively communicated with all affected customers. We sincerely apologize for the inconvenience this has caused.”

In its own statement, Etsy told merchants on July 18 that the company was “deeply sorry for the inconvenience and frustration these delays have caused. We thank you for your continued patience and for being part of our community”.

The site repeated the apology on July 25, adding that – nearly four weeks after the problems began on July 1 – it appeared that the glitch had been resolved.

That, of course, was little comfort to the merchants who lost out on sales, and on the goodwill of long-time customers – and who were likely to permanently lose at least some of those customers, despite the apologies they themselves were forced to issue.

While the Etsy debacle garnered a lot of attention because it affected so many people on a consumer-facing site, the same kind of thing goes on every hour of every day on enterprise, government, business and infrastructure sites.

Such outages can be caused by regular day-to-day IT activities – upgrades of hardware, software or cloud-based tools – as well as misconfiguration, bugs (often included in software upgrades), traffic issues, power outages, security issues, and of course human error.

But according to a study by the University of Chicago, the biggest reason for outages is “unknown” – as in, the IT system is too damn complicated for workers to figure out what went wrong. Can anything be done to prevent an “unknown” glitch? Not by people who don’t know what to look for.

>See also: The cloud is great, but what happens when it goes down?

Given modern IT’s ever-growing complexity, there are thousands of things that can go wrong. In order to catch such glitches in a proactive way (as opposed to learning of their existence when all hell breaks loose), more stringent quality controls must be put in place – preferably following each and every change.

This is, obviously, not something that can be done manually. With thousands of virtual machines, millions of configurable items, rapid technology evolution and daily changes, there’s just not enough time.

The only valid approach is to harness the power of automation. Indeed, more and more enterprises in the financial, telco, utility, retail and public sectors have come to rely on daily, automated configuration validation systems.

The guiding principle is to deploy a risk detection engine, coupled with a dynamic knowledge base loaded with relevant risk signatures.

Much like anti-virus tools in the end-point computing arena, such risk detection tools can harness the experience of multiple vendors and enterprises to provide a community-driven knowledge base.

To date, some of the offerings in this field go as far as automating thousands of risk signature checks. With the power of automation, it is possible to proactively detect a huge portion of the issues that today remain dormant in IT – dramatically improving resilience and proactively preventing the next outage.
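To make the idea concrete, here is a minimal sketch in Python of a signature-based check of the kind described above. It is not any vendor's actual product or API: the `RiskSignature` type, the three example signatures, the configuration keys and the host names are all hypothetical, chosen only to show how a knowledge base of risk signatures can be run automatically against every host's configuration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# A host's configuration, modelled as a flat dictionary (an assumption made
# for this sketch; real systems collect far richer data).
Config = Dict[str, Any]


@dataclass(frozen=True)
class RiskSignature:
    """One entry in the knowledge base: a named, machine-checkable risk."""
    name: str
    description: str
    matches: Callable[[Config], bool]  # returns True when the risk is present


# Hypothetical knowledge base with three illustrative signatures.
KNOWLEDGE_BASE: List[RiskSignature] = [
    RiskSignature(
        name="single-gateway",
        description="Only one payment gateway is configured (no failover)",
        matches=lambda c: c.get("gateway_count", 0) < 2,
    ),
    RiskSignature(
        name="version-mismatch",
        description="Deployed software version differs from the expected one",
        matches=lambda c: c.get("app_version") != c.get("expected_version"),
    ),
    RiskSignature(
        name="missing-execute-permission",
        description="Service account cannot execute transactions",
        matches=lambda c: not c.get("can_execute_transactions", False),
    ),
]


def scan(hosts: Dict[str, Config],
         knowledge_base: List[RiskSignature]) -> List[Tuple[str, str]]:
    """Run every signature against every host and list (host, risk) findings."""
    findings = []
    for host, config in hosts.items():
        for signature in knowledge_base:
            if signature.matches(config):
                findings.append((host, signature.name))
    return findings


if __name__ == "__main__":
    # Made-up inventory, purely for demonstration.
    inventory = {
        "pay-01": {"gateway_count": 1, "app_version": "2.1",
                   "expected_version": "2.1", "can_execute_transactions": True},
        "pay-02": {"gateway_count": 2, "app_version": "2.0",
                   "expected_version": "2.1", "can_execute_transactions": False},
    }
    for host, risk in scan(inventory, KNOWLEDGE_BASE):
        print(f"{host}: {risk}")
```

Because the signatures are plain data, new ones can be added to the knowledge base without touching the scanning loop, and the same scan can be rerun after each change rather than waiting for something to break.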

 

Sourced from Gil Hecht, CEO, Continuity Software

Nick Ismail

Nick Ismail is the editor for Information Age. He has a particular interest in smart technologies, AI and cyber security.
