Human error in IT: a growing but preventable issue

In late January, an outage at Delta caused hundreds of flight cancellations and delays. More recently, Amazon Web Services’ (AWS) Simple Storage Service (S3) suffered an outage that disrupted services for tens of thousands of customers including popular sites like Netflix, Spotify, Pinterest and Expedia.

When a network goes down, theories run rampant. Was it a software bug? A cyber attack? A systems failure?

Often, however, there is a much simpler explanation: human error. In the case of AWS S3, an employee debugging an issue with the billing system took more servers offline than intended. One wrong keystroke led to major consequences.

IT’s “perfect storm”

AWS and Delta’s outages bring to light a growing problem that impacts companies of all sizes: the resources gap. IT is often faced with stagnant budgets and a lack of qualified applicants.

For instance, Quartz reported that there are “almost 10 times more US computing jobs open right now than there were students who graduated with computer science degrees in 2015.” This is according to data from the National Center for Education Statistics and Code.org.

Unfortunately, despite the lack of resources, developers face increasing pressure to push out new applications faster and change existing systems more quickly in order to stay competitive. This is where IT automation comes into play.

IT automation helps companies coordinate and consolidate IT operations within a consistent and common interface so disparate systems and software become self-acting or self-regulating.

This minimises the amount of manual intervention needed within a workflow and reduces time spent on repetitive tasks. In turn, IT has more time to focus on mission-critical work, the risk of error is greatly reduced and it becomes easier to identify issues as they occur.

Prevent the preventable, reduce the impact of the unavoidable

While mistakes happen and are sometimes inevitable, it’s important that companies put the right processes in place to reduce the chance of human error and shorten the time to remediation when an issue does arise. To do that, companies should take the following steps.

Talk about IT automation before a project is implemented, not after

A common problem is that companies start thinking about how to automate tasks after a project is underway, or even after it is completed, instead of at the very beginning.

This approach means IT goes through the time-intensive task of creating processes manually and producing the required output, and only then thinks about automation.

Instead, companies need to shift their mindsets and ensure automation is the prologue to the discussion whenever a new process is introduced, not a postscript. That way, activities are streamlined and the chance of a mistake being made is reduced.

Determine how many processes are currently automated vs. done manually

Chances are that a significant number of activities are still being done manually through a labour-intensive process called scripting. While IT relies on scripting to connect heterogeneous applications, databases and platforms that were not designed to work together, scripting is also prone to errors.

By taking inventory of the number of manual vs. automated processes within an organisation, companies can get a better picture of what steps need to be taken to bring together automated business processes in a central location.
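As a rough illustration of what such an inventory might look like in practice, the sketch below tallies a list of processes by how they are run today. The process names, the "automated" / "manual script" / "manual" labels and the `summarise` helper are all hypothetical and used only for illustration; a real inventory would come from whatever records an organisation already keeps.

```python
from collections import Counter

# Hypothetical inventory: (process name, how it is run today).
processes = [
    ("nightly database backup", "manual script"),
    ("payroll file transfer", "automated"),
    ("log rotation", "manual script"),
    ("report generation", "automated"),
    ("server patching", "manual"),
]


def summarise(inventory):
    """Count automated vs. non-automated processes in the inventory."""
    counts = Counter(mode for _, mode in inventory)
    total = len(inventory)
    automated = counts.get("automated", 0)
    return {
        "total": total,
        "automated": automated,
        "not_automated": total - automated,
        # Share of processes already automated (0.0 for an empty inventory).
        "automated_share": automated / total if total else 0.0,
    }


if __name__ == "__main__":
    print(summarise(processes))
```

Even a simple count like this gives a baseline against which later automation work can be measured.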

While scripting won’t completely go away (and it shouldn’t), IT automation makes it possible to manage scripts, access revision history, implement version control and utilise granular scheduling capabilities. This helps minimise the risk of business disruptions that are more likely to occur without IT automation in place.

Get insight from every department to get a holistic view of the business

With increased innovation and technology advancements, every department within an organisation is adopting new technologies that can cause an IT headache if they don’t seamlessly integrate with existing infrastructure and legacy systems.

To better understand how IT can coordinate and consolidate IT operations to minimise manual intervention, companies need to look for redundancies, as well as disparate tools, applications and servers that aren’t properly integrated.

Once a complete inventory has been taken, companies can take advantage of IT automation to reduce the time, cost and manual intervention that typically plague IT. Creating a unified console of all automated solutions also reduces the number of places developers need to check for errors when issues do arise.

This frees up time to focus on projects that matter and diminishes the chance of errors caused by human fatigue.

Create safeguards

After AWS determined the cause of the outage, it included the following in its official statement: “We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”

Identifying what safeguards should be in place and what changes need to be approved is critical when putting together an IT automation plan.
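To make the idea concrete, here is a minimal sketch of the two safeguards AWS describes: refusing a removal that would take a subsystem below its minimum required capacity, and removing capacity slowly in small batches. It is not AWS’s tool. The `Subsystem` class, the `plan_removal` function, the `max_per_batch` parameter and the numbers used are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would violate the capacity safeguard."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical minimum required capacity level


def plan_removal(subsystem: Subsystem, requested: int, max_per_batch: int = 2) -> list[int]:
    """Return the batch sizes for removing `requested` servers, or raise if unsafe."""
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if max_per_batch <= 0:
        raise ValueError("max_per_batch must be positive")

    # Safeguard 1: never drop below the subsystem's minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"{subsystem.name}: removing {requested} servers would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, in small batches.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_batch, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=10, min_required=6)
    print(plan_removal(index, 3))  # a small request is split into batches
    try:
        plan_removal(index, 5)  # a mistyped, too-large request is rejected
    except CapacityGuardError as err:
        print(f"blocked: {err}")
```

The point of the design is that an incorrect input fails loudly before anything is taken offline, rather than relying on the operator to notice the mistake.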

The AWS outage was not the first incident that “broke the internet,” and it certainly won’t be the last. But with IT automation to help streamline complex processes and workflows, the likelihood of these shutdowns greatly decreases, especially for those caused by human error. And in the event that they do occur, having an IT automation solution in place reduces the amount of downtime and helps isolate the cause of the issue in a timely manner.

 

Sourced by Jim Manias, vice president at Advanced Systems Concepts, Inc., where he is responsible for the overall market strategy and planning for a range of products.

Nick Ismail

Nick Ismail is a former editor for Information Age (from 2018 to 2022) before moving on to become Global Head of Brand Journalism at HCLTech. He has a particular interest in smart technologies, AI and...
