For most of us, Christmas time means fun parties, good times, family gatherings and a pleasant, relaxed atmosphere. Not so for Father Christmas and his staff, though.
For them, this is the busy season – and thanks to worldwide supply chains, outsourced manufacturing, international shipping, and micro-marketing via social media and mobile apps, it’s busier and more complicated than ever.
Complications invite mistakes – for all his sleigh-flying powers, Father Christmas, like the retailers he stands in for, is still ‘human’. And when humans come into contact with complicated systems, errors often ensue.
How can retailers – and Father Christmas – protect themselves, their operations and their reputations from service outages, or from snafus that deliver the wrong toy to the wrong child – or worse, deliver nothing at all?
Father Christmas, of course, is a metaphor – but the super-complicated world of holiday gifting is all too real. How does a department store, a major e-commerce site, a toy manufacturer, or any other player in the Christmas shopping stakes get their product to market, reach customers (retail or wholesale), collect payments, and ensure that goods get to where they have to go on time?
Welcome to the world of big data-based e-commerce – the nuts and bolts of what makes the business world run today. Cloud-based sales systems such as Salesforce, shipping schedules coordinated online, and huge cloud-based customer databases help manufacturers locate the ideal wholesale customers for their products. In-depth consumer studies, targeted sales campaigns extrapolated from big data, and advanced logistics systems that coordinate two-day shipping between points thousands of miles apart are just a small sample of the deep technology involved in the gift business.
The infrastructure that makes this possible is run by tech providers large and small, each of which has its own internal systems consisting of hundreds of thousands of computers, servers, databases, and other hardware and software components that keep the business world going.
That all these diverse systems work together – and work at all – is something of a tech miracle, considering the many things that could go wrong. And as it turns out, many things do indeed go wrong.
From hack attacks to major service outages in the online systems businesses depend on, hardware and services fail constantly, causing untold delays and losses for the companies that rely on them.
A recent survey of cloud resiliency at some of America’s top companies shows that there were significant risks of downtime and security breaches in each and every cloud environment tested. Most had multiple downtime risks, and the vast majority (82%) also had data loss risks.
Service outages are all too common, happening multiple times a day – in the study of 100 top enterprises, no fewer than 97% had at least some exposure to service outages.
When a huge outage hits the retail sector, the public hears about it – for example, when Amazon Web Services went down for the better part of a day in March, affecting 13,000 businesses that depended on the S3 servers in Virginia, or when Bank of America ATMs stopped working, as happened in July. But how many people hear about a service outage on a private enterprise cloud system that provides logistics information for shippers – one that holds up shipping for hours because a database was corrupted, costing companies as much as $750,000 a minute? Yet it happens – and the price paid for those outages in lost time and efficiency is made up at the cash register, at both the wholesale and retail level.
The source of these outages is the sheer complexity of modern IT systems. To manufacture, ship, market, and sell the gifts that make Christmas everyone’s favourite shopping season, companies utilise dozens of online services, as well as private and public cloud infrastructure hosting dozens or even hundreds of applications around the world.
These systems are constantly updated, patched and expanded to provide more, faster, safer and better service to users – but often the upgrade process itself is the cause of a service outage. Ask IT experts what can go wrong, and it quickly becomes clear that there are thousands of possible configuration risks to consider:
• Software updates that are not correctly applied to all servers.
• Failure to configure new hardware (e.g., servers, storage, network) according to vendor best practices.
• Single-points-of-failure in network, power, servers or storage – often the result of improper communication between different technical teams.
• Insufficient or infrequent quality testing and resilience testing.
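Risks like the first in that list can be caught mechanically. As a minimal sketch – assuming a hypothetical inventory that maps each server to its installed package versions; none of these names come from any real tool – a drift check might flag servers whose software lags the rest of the fleet:

```python
from collections import Counter

def find_version_drift(inventory):
    """Flag servers running a package version that differs from the one
    most of the fleet runs - a common sign of a partially applied update.

    inventory: dict mapping server name -> dict of package -> version.
    Returns a list of (server, package, found_version, expected_version).
    """
    # Collect every version observed for each package across the fleet.
    versions_by_package = {}
    for server, packages in inventory.items():
        for package, version in packages.items():
            versions_by_package.setdefault(package, []).append(version)

    drift = []
    for server, packages in inventory.items():
        for package, version in packages.items():
            # Treat the version most servers run as the expected one.
            expected, _ = Counter(versions_by_package[package]).most_common(1)[0]
            if version != expected:
                drift.append((server, package, version, expected))
    return drift

# Hypothetical fleet: web-3 missed the openssl update.
fleet = {
    "web-1": {"openssl": "1.1.1k", "nginx": "1.20.1"},
    "web-2": {"openssl": "1.1.1k", "nginx": "1.20.1"},
    "web-3": {"openssl": "1.1.1j", "nginx": "1.20.1"},
}
print(find_version_drift(fleet))  # → [('web-3', 'openssl', '1.1.1j', '1.1.1k')]
```

Real configuration-validation products check thousands of rules across many layers, but the principle is the same: compare the actual state against an expected one, automatically, on every server.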
And so on. The fast pace of change (on an almost daily basis) and the large number of people and technologies involved in advanced IT systems do not make things any easier. Expecting an IT team, however talented, to anticipate and prevent every hidden risk is unrealistic: the number of possible issues grows far faster than any human team can scale to meet.
Many companies set up sandbox environments in production systems to test new software, hardware or services before they are installed, but those isolated test environments do not reflect the full IT environment.
Even one-off audits to search for all hidden risks can take weeks or months. Usually, the only way an IT team finds out there is a problem is when things break. When this happens during the busy shopping days before Christmas, a company that relies on reactive remediation (a.k.a. “fire-fighting”) might as well just close up shop until the spring thaw.
The only way to significantly reduce risk and increase the resilience of IT systems is through deep automation. Indeed, a new breed of tools for continuous IT configuration resilience and quality management has evolved, to make sure IT systems are safely and consistently configured at all times. At their foundation are a few principles:
• Automated documentation and discovery of the IT configuration at all layers (compute, storage, replication, virtualisation, cloud, app servers, databases, etc.)
• A constantly updated knowledge-base that captures vendor best practices, as well as critical configuration risks reported by other enterprises (you should expect such tools to look for thousands of possible issues, and to incorporate the experience of hundreds of customers).
• An automated validation process, triggered after every change, that tests all dependencies and best practices.
• A framework for constant improvement: proactive notification of configuration issues to the relevant subject-matter experts (application, network, storage, virtualisation, etc.), dashboards and score-cards for health and risk, etc.
Once, Father Christmas’ job was easy – but times change. Christmas joy today is heavily dependent on joyless databases and 24/7 data centres, where no day is a holiday. The last thing anyone wants – kid, parent, or IT worker – is the failure of the systems that control shipping, sales, credit and all the other things that go into making the holidays a fun time. To protect those systems, and to ensure that the holidays are joyous for everyone, IT departments need to deploy an automated, knowledge-based quality process.
Sourced by Doron Pinhas, CTO of Continuity Software