Achieving IT operational perfection takes a nuanced approach

Retailers in this technological era face a dilemma. How can they continue to meet ever-increasing demands for low prices, broader and more complex delivery propositions, and fantastic service, whilst simultaneously improving efficiency and increasing profits?

The answer lies in making some subtle changes to how they use IT infrastructure, and in moving away from SLAs towards a goal of operational perfection.

Decreasing costs while eradicating errors and adding multiple layers of resilience and recovery sounds like an impossible task. And it is, unless retailers stop simply driving tighter and tighter SLAs and start acknowledging, accepting and balancing system failure.

Heresy #1

System failure is inevitable, so rather than trying to avoid it with contracts, embrace it and judge success from the perspective of the customer rather than by the infrastructure availability number.

The best product, the best price, the best operations. Operational excellence as a retail USP has been the mainstay for many brands and retailers over the years – but with ever more complex technical supply chains and customers distributed across the globe, delivering the perfect customer experience is exceptionally hard and costly.

A different perspective is to assume that all things break. Because almost all breakages can be mitigated, the plan should change to focus on delivering a near-perfect operational experience from the customer's perspective. Once it is accepted that things break, the real measure of a system’s goodness becomes how the customer perceives the experience.

Heresy #2

The focus on technical SLAs drives the wrong behaviours. Putting ‘operational perfection’ first makes things better.


Failed SLAs and the associated service credits can also fundamentally miss the economic point: they simply don’t add up and, in fact, can never add up. The cost to a business of a lost sale can be counted in pounds, but the fees of the various vendors who contribute to that transaction are often measured in pence, so credits against those fees cannot make up for the loss. And while everybody is chasing the perfect measurement of failure, they are wasting time that could be spent figuring out how to provide a better consumer experience.

There is a procurement obsession with availability SLAs, which focus on limiting the number of minutes (or transactions) for which the customer is affected by downtime. SaaS providers work hard on availability, scale and security, but these should be considered simply as hygiene factors. The most enlightened retailers and carriers invariably talk instead about the percentage (and absolute number) of customers who don’t have a perfect experience.

Heresy #3

What you measure is less interesting than when and how you measure it.

Online transactions are not measured in hours or minutes but in milliseconds, so averaging website response times across a week is essentially meaningless. From the perspective of the B2C or B2B customer, it is the percentiles that matter, not the averages. It is better to have a few well-considered measures than a host of irrelevant statistics.
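To make the point about percentiles concrete, here is a minimal sketch in Python. The response times are invented for illustration, and the helper `nearest_rank_percentile` is a hypothetical name, not something taken from any particular monitoring product. It shows how an average can hide the slow experience that a small share of customers actually receive.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Return the value at percentile `pct` using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 95 requests answered in 100 ms, 5 requests taking 5 seconds.
response_times_ms = [100] * 95 + [5000] * 5

mean_ms = statistics.mean(response_times_ms)
p50_ms = nearest_rank_percentile(response_times_ms, 50)
p99_ms = nearest_rank_percentile(response_times_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With this made-up data the mean is 345 ms, a figure that no customer actually experienced: the typical request took 100 ms, while one request in twenty took 5,000 ms. The 99th percentile surfaces that slow tail; the weekly average buries it.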


Recognising that we don’t know everything is important. Try to understand the “normal”, then alert and respond when things are simply unexpected. Monitoring second-, third- and nth-order outcomes and consequences (the split of transactions between carriers, consumer feedback and so on) is often more helpful than simple technical measures.
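One simple way to express "understand the normal and alert on the unexpected" is to compare the latest reading with a recent baseline. The sketch below is an illustration only: the three-standard-deviation threshold, the function name `is_unexpected` and the carrier-share figures are all assumptions, and a real system would likely need to account for seasonality and trading peaks.

```python
import statistics


def is_unexpected(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (an assumed, deliberately simple rule)."""
    if len(history) < 2:
        return False  # not enough data yet to say what "normal" looks like
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # History is perfectly flat, so any change at all is unexpected.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


# Hypothetical daily share (%) of transactions routed to one carrier.
carrier_share_history = [40, 42, 41, 39, 40, 41, 38, 40]

print(is_unexpected(carrier_share_history, 41))  # within the usual range
print(is_unexpected(carrier_share_history, 25))  # a sudden drop worth a look
```

Note that the measure here is a second-order outcome (how transactions are split between carriers) rather than a technical one such as CPU load. A sudden shift in that split may point to a problem that no availability figure would reveal.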

Heresy #4

Aggressive scaling just as you need it, and rapid contraction when you don’t, is better for customers than “just in case” capacity.

For online retailers, the worst-case scenario is their website going down. Even a few minutes offline can cost an organisation millions. To avoid this, many have overcompensated for threats to the infrastructure by building in over-capacity and, even after a threat has receded, contracting back slowly. This could be seen as wasteful, but retailers tend to feel that it ensures a good experience for the customer.

Over-capacity often obscures, however, a lack of instantaneous capacity and provides a false sense of security. Subtle but significant changes in provisioning can save money and deal with explosive growth, in particular by developing systems built on heuristics or real-time data that provide valuable insight. Capacity changes are fed in slowly just ahead of demand, then snapped back at the first opportunity to protect margins from being eroded by over-capacity.
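The provisioning pattern described above (capacity fed in gradually just ahead of demand, then snapped back as soon as demand falls) can be sketched as a small decision function. Everything here is an assumption made for illustration: the function name `next_capacity`, the 10% headroom, the step limit and the idea of a single demand forecast are not drawn from any specific platform.

```python
import math


def next_capacity(current_units, forecast_demand, demand_per_unit,
                  headroom_pct=10, max_step_up=2):
    """Return the number of capacity units to run in the next interval.

    Scale-up is limited to `max_step_up` units per interval (fed in slowly);
    scale-down goes straight to what the forecast needs (snapped back).
    """
    # Units needed to serve the forecast plus a small safety margin.
    needed = math.ceil(forecast_demand * (100 + headroom_pct)
                       / (100 * demand_per_unit))
    needed = max(1, needed)
    if needed > current_units:
        return min(needed, current_units + max_step_up)
    return needed


# Hypothetical figures: each unit serves 100 requests per second.
print(next_capacity(current_units=4, forecast_demand=1000, demand_per_unit=100))
print(next_capacity(current_units=10, forecast_demand=300, demand_per_unit=100))
```

In the first call the forecast needs 11 units, but only 2 are added in this interval, which is why the forecast has to run ahead of real demand. In the second call demand has fallen, so capacity drops from 10 units straight to 4 rather than contracting slowly.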


Customer satisfaction is not about 99.99% uptime or holding platform suppliers to SLAs; it is about analysing, monitoring, balancing and making informed decisions. The journey to operational perfection generates margin for suppliers and retailers, and makes for a happy consumer.


Sourced by David Jack, CIO, MetaPack

