Can we automate data quality to support artificial intelligence and machine learning?

Over the last decade, companies have begun to grasp and unlock the potential of artificial intelligence (AI) and machine learning (ML). Although the technology is still in its infancy, companies are starting to understand the significant impact it can have, helping them make better, faster and more efficient decisions.

Of course, AI and ML are no silver bullet for businesses looking to embrace innovation. In fact, these algorithms are only as good as their foundations: specifically, quality data.

Without it, businesses will see AI and ML fail at the very objective they were installed to achieve, with the unforeseen consequences of bad data causing irreversible damage to the business in terms of both efficiency and reputation.

But there’s another area of exploration that is ripe for development: can data quality be improved and maintained by automation and machine learning itself?

The risk of poor data quality

From recommending films on streaming services and powering chatbots to informing how supermarkets arrange their shelves and guiding us through major transport hubs, ML influences our lives in ways that were unimaginable a decade ago.

But what happens if the algorithm is set to work on the foundation of poor data quality? The risks in the future could be far more severe than being served a film you don’t like.

If we begin trusting machine learning to improve the discovery and testing of pharmaceuticals, for example, what would happen if a drug were formulated but there were errors in the chemical compound data used to simulate testing? The implications could be grave.

An emerging application of ML which could also be impacted by poor base data is self-driving vehicles. From maps and addresses to how a vehicle reacts to a cyclist, the data used to teach the machine will be crucial to consumer and regulator adoption.

ML algorithms (the sets of rules and calculations that help solve defined problems) can either support the improvement of data quality or, if the possibility of poor data is not considered in their construction, be thrown off by inaccurate data.

Automated data quality

As with any digital transformation, moving from manual to automated and then ‘intelligent’ data quality management will require a long-term plan. Experian has identified four stages in the progression of data management, which we call the Data Management Maturity Curve: Unaware, Reactive, Proactive, and Optimised & Governed. Together they span the full cycle of a data quality strategy.

The assessment has revealed a steady progression up the maturity curve, as organisations begin to realise the potential of the data they hold and take it more seriously. Most intriguingly of all, those that find themselves at the Optimised & Governed stage could be seeing the beginnings of another level, something that can be termed ‘intelligently automated.’

‘Intelligently automated’ refers to having systems and processes in place to help the people responsible for data quality identify where their biggest concerns are. We should all by now be reviewing key performance metrics on a regular basis to identify trends in data quality, perhaps looking at overall completion rates of key attributes, or monitoring for any timing concerns with data receipt or data load stages. But truly understanding your data quality requires a deeper look into the content.

For example, is it enough to say that you have collected a date of birth to meet third-party data requirements in 99% of cases, when a large proportion of the dates you have collected are system-derived and therefore not real dates of birth? This can cause real problems, and the unintended consequences can ripple through your decision-making process.
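The gap between a headline completion rate and a genuine one can be sketched in a few lines of Python. This is a minimal illustration, not a description of any Experian tooling: the function names, the set of assumed system defaults and the 1% frequency threshold are all hypothetical choices.

```python
from collections import Counter
from datetime import date

# Assumed examples of system defaults; a real list would come from the source systems.
KNOWN_DEFAULTS = {date(1900, 1, 1), date(1970, 1, 1)}


def suspicious_dates(dobs, max_share=0.01):
    """Return dates that are known defaults or occur implausibly often."""
    counts = Counter(dobs)
    total = len(dobs)
    flagged = set()
    for value, n in counts.items():
        # Genuine birthdays are spread thinly, so a single date holding
        # more than max_share of all records is treated as system-derived.
        if value in KNOWN_DEFAULTS or n / total > max_share:
            flagged.add(value)
    return flagged


def dob_completion(dobs, max_share=0.01):
    """Return (raw completion rate, rate after excluding suspicious dates)."""
    total = len(dobs)
    if total == 0:
        return 0.0, 0.0
    present = [d for d in dobs if d is not None]
    flagged = suspicious_dates(present, max_share)
    genuine = [d for d in present if d not in flagged]
    return len(present) / total, len(genuine) / total
```

Given 1,000 records where 990 carry a date, 300 of those are 1 January 1900 and the rest are all different, the raw rate would be 99% while the adjusted rate falls to 69%.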

The next steps

Most data quality programmes already contain an element of automation and test and learn. The next stage in this evolution is the use of machine learning to automatically recognise and respond to different types of data — ‘intelligently automated.’

For example, a data management tool could recognise standard information such as an address, email, credit card number or national insurance number with little pre-training or rule writing, then take actions such as validating the entry or flagging a compliance issue to a manager.
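To make the idea concrete, here is a rule-based baseline for recognising a few of these data types. It is deliberately not machine learning: it shows the hand-written rules that an ‘intelligently automated’ tool would aim to reduce. The patterns are simplified assumptions (the national insurance pattern ignores the official prefix exclusions), and the function names are hypothetical.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# Simplified shape only: two letters, six digits, suffix letter A-D.
NINO_RE = re.compile(r"^[A-Z]{2}\d{6}[A-D]$")


def luhn_valid(digits):
    """Luhn checksum, as used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def guess_type(value):
    """Guess what kind of standard information a single value holds."""
    v = value.strip()
    if EMAIL_RE.match(v):
        return "email"
    compact = re.sub(r"[\s-]", "", v)  # drop spaces and hyphens
    if compact.isdigit() and 13 <= len(compact) <= 19 and luhn_valid(compact):
        return "card_number"
    if NINO_RE.match(compact.upper()):
        return "national_insurance_number"
    return "unknown"


def guess_column_type(values, min_share=0.8):
    """Label a column by its dominant value type, or 'mixed' if none dominates."""
    if not values:
        return "unknown"
    label, n = Counter(guess_type(v) for v in values).most_common(1)[0]
    return label if n / len(values) >= min_share else "mixed"
```

A column labelled `card_number` in a free-text notes table, for instance, is the kind of finding that might be flagged to a manager as a compliance issue.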

The ultimate goal is ML for data quality that then improves itself over time. A good example of this is company name — is Tesco PLC the same as Tesco Stores Ltd? What about a part of the Tesco group which does not have the word ‘Tesco’ in the company name?

Grouping commercial entities together can be as simple as looking for the name, or more complex by looking at the detail of company accounts, head office addresses, CEO names, web addresses and other metadata to find associations around the globe.
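The simple end of that spectrum, matching on the name alone, might look like the sketch below. The list of legal suffixes and the overlap measure are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Assumed, non-exhaustive list of legal-form suffixes to ignore.
LEGAL_SUFFIXES = {"plc", "ltd", "limited", "llc", "inc"}


def name_tokens(name):
    """Lower-case a company name, strip punctuation and drop legal suffixes."""
    words = re.sub(r"[^a-z0-9\s]", " ", name.lower()).split()
    return {w for w in words if w not in LEGAL_SUFFIXES}


def name_overlap(a, b):
    """Share of the shorter name's tokens that also appear in the other name."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


print(name_overlap("Tesco PLC", "Tesco Stores Ltd"))  # 1.0
print(name_overlap("Tesco PLC", "Asda Stores Ltd"))   # 0.0
```

A group company whose name does not contain the parent's name would also score 0.0 here, which is exactly why the more complex signals (accounts, addresses, directors, web domains) are needed.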

These kinds of hypotheses are the business challenges that a strong data strategy can support. However, can we move to a place where we can automate this learning and improve our data quality over time with less manual effort, giving our data people more time to analyse and support the business?

That’s the challenge for ML: taking the base rules for data quality, implementing them and then suggesting improvements as real-world changes in data become visible as exceptions or outliers. It’s an emerging subject and one in which we expect to see a great deal of development in the years ahead.
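One simple way to surface such exceptions is to compare the latest value of a data quality metric against its own history. The z-score check below is a statistical baseline rather than a learning system, and the metric (a daily rate of missing values), the threshold of 3 and the function name are assumptions for illustration.

```python
from statistics import mean, stdev


def is_outlier(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard deviations
    from the mean of `history` (which needs at least two values)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is an exception.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily share of records missing a key attribute.
daily_missing_rate = [0.020, 0.021, 0.019, 0.020, 0.022, 0.018]
print(is_outlier(daily_missing_rate, 0.050))  # True
print(is_outlier(daily_missing_rate, 0.021))  # False
```

Values flagged this way are candidates for review, and repeated exceptions of the same kind may suggest that a base rule needs updating.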

Your data strategy

Fundamentally, every example of ML is reliant on data that is fit for purpose. If it is not, that data and, as a consequence, the decisions made because of it can’t be trusted.

To avoid this, organisations need to ensure they have a robust data strategy. They should think about their reasons for embarking on ML: what are the explainable outcomes they want to achieve, and which do they want to avoid?

Then, by conducting an initial assessment of its data to sense-check the quality of what it already has, the organisation can take action and plan for what else it needs in order to improve the overall quality of its data.

Being able to identify and trace the decisions made via ML — and all automated decision-making processes — is vital if they’re to be adopted and implemented successfully.

Ongoing monitoring of data quality is also crucial. By doing this you’ll be able to identify quickly which areas need attention and be reassured that you’re in the best possible position with current and potential ML initiatives.

Then, organisations will be in a position for ML to enable them to manage their data quality more efficiently, making their decision-making processes faster and better.

Taking this to its logical conclusion, machine learning can help us identify those data concerns that remain hidden until they are a real problem. If we can train models to identify the key attributes that can influence a decision or process down the line, and then monitor for fluctuations or concerning patterns, we may even be able to predict the impact these data concerns could have on the business.

For example, suppose we know that the number of bedrooms in a property directly impacts decisions in our business, and we establish that a growing share of the data in this field is incomplete or approximated. Based on where we know the data is used, could we predict the effect on rental income estimates, mortgage valuations or heating consumption predictions?

The predicted impact of this growing data quality concern could help build the business case for correcting it now rather than when it has become a real problem.
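A first step towards that kind of prediction is simply projecting the trend in the affected field. The sketch below fits a straight line to a hypothetical series of monthly incomplete-data rates and extrapolates it; the linear assumption, the figures and the function names are illustrative only, and estimating the downstream effect on valuations or income would need a model of each decision that uses the field.

```python
def linear_trend(values):
    """Ordinary least-squares slope and intercept for equally spaced values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def project_rate(values, periods_ahead):
    """Extrapolate the fitted line, clamped to the valid range for a rate."""
    slope, intercept = linear_trend(values)
    x = len(values) - 1 + periods_ahead
    return min(1.0, max(0.0, intercept + slope * x))


# Hypothetical share of properties with a missing or approximated bedroom count.
monthly_incomplete = [0.10, 0.12, 0.14, 0.16]
print(round(project_rate(monthly_incomplete, 3), 2))  # 0.22
```

Multiplying the projected rate by the number of decisions that rely on the field gives a rough count of decisions at risk, which is the kind of figure a business case can be built on.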

Written by Clint Hook, director of data governance at Experian
