Machine learning demystified: the importance of data

Machine learning (ML) may sound like a daunting concept to anyone unfamiliar with it, and some may even associate it with outlandish ideas about machines poised to enslave mankind. Fortunately, this isn’t what ML is: it is essentially a major advancement in information technology (IT). For ML to benefit an organisation, that organisation first has to understand the benefits and limitations it offers.

While the principles of ML are fairly simple and intuitive to grasp, applying them requires specific statistical and IT skills that few people currently possess. To understand the idea, think of a common and rather mundane language translation service such as Google Translate, which helped me realise the transformative potential of ML.


Put simply, language translation software was long based on programmed dictionaries, grammatical rules and their numerous exceptions. This approach involves considerable effort.

From ‘rule-based’ to ‘data-driven’ processes

The new methodology stemmed from a simpler idea: don’t try to define rules and lexical tables from scratch; let the software discover them. How?

In three steps:

1. Millions of pages that have already been translated from one language to another are collected from international organisations. These include documentation available online from, for example, the UN or European institutions.

2. When a user submits text for translation, the software slices it into basic elements and then searches the corpus for similar ones in the same language.

3. The most likely translation is then extracted from the bilingual corpus and suggested to the user.

Relevant statistical patterns found in the data therefore replace translation rules. Instead of having to be painstakingly programmed, they are simply “learned” by the software. This approach is highly cost-efficient and the quality of the translation is often on par with that of a traditional approach.
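The three steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any real translation service works: the tiny English-to-French corpus, the function names `learn_translations` and `translate`, and the longest-match slicing strategy are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Step 1: a hypothetical, hand-made bilingual corpus (English -> French).
# A real system would rely on millions of already-translated pages.
ALIGNED_SEGMENTS = [
    ("good morning", "bonjour"),
    ("good morning", "bonjour"),
    ("good morning", "bon matin"),
    ("thank you", "merci"),
    ("thank you", "merci"),
    ("the committee", "le comité"),
]


def learn_translations(pairs):
    """Count how often each source segment was translated into each target segment."""
    table = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table


def translate(text, table, max_len=3):
    """Slice the text into known segments and pick each one's most frequent translation."""
    words = text.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Step 2: try the longest slice first, then shorter ones.
        for size in range(min(max_len, len(words) - i), 0, -1):
            segment = " ".join(words[i:i + size])
            if segment in table:
                # Step 3: suggest the translation seen most often in the corpus.
                output.append(table[segment].most_common(1)[0][0])
                i += size
                break
        else:
            # Nothing similar in the corpus: leave the word untranslated.
            output.append(words[i])
            i += 1
    return " ".join(output)


table = learn_translations(ALIGNED_SEGMENTS)
print(translate("Good morning the committee", table))  # bonjour le comité
```

No grammatical rule is written anywhere in this sketch: “good morning” becomes “bonjour” simply because that pairing is the most frequent one in the data.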

In areas less complex than translating human languages, the productivity gains are compounded by substantial quality improvements. Anyone who has worked on software knows how hard it can be to anticipate all the potential problems that arise once it has entered production.

Traditional software’s functional rules are based on assumptions drawn from a limited number of observations. Reality often proves to be far more complex than expected, meaning the automation ends up being suboptimal or the software requires expensive corrections.

Machine learning, on the other hand, absorbs and develops itself using all available data, regardless of volume. The risk of a pattern or use case being left out of the picture is therefore limited.

Humans must remain in charge

Limitations rear their head when machines operate without human intelligence and are restricted to imperfect selections of data.

A good example is the automated processing of loan requests received by banks. An algorithm parses the archives of previous requests, where each borrower’s key information is recorded (age, wealth, family status, etc.) along with repayment information (whether they paid back the bank or defaulted). It thereby highlights the likely relationship between a borrower profile and a default risk.

Applied to a new loan request, the algorithm will predict, with a level of accuracy considered sufficient, whether the borrower will pay back the loan. This reduces the risk of a bad decision and removes the impact of a bank operative’s mood.
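As a rough illustration of how an archive of past requests can yield a risk estimate, the sketch below simply measures the default rate observed for each borrower profile. The archive, the profile bands and the function names are invented for the example; a production system would typically use a proper statistical model and many more attributes.

```python
from collections import defaultdict

# Hypothetical archive of past loans: (age band, wealth band, defaulted?)
ARCHIVE = [
    ("under_30", "low", True),
    ("under_30", "low", True),
    ("under_30", "low", False),
    ("under_30", "high", False),
    ("30_plus", "low", False),
    ("30_plus", "low", True),
    ("30_plus", "high", False),
    ("30_plus", "high", False),
]


def learn_default_rates(archive):
    """Return the observed default rate for every profile found in the archive."""
    counts = defaultdict(lambda: [0, 0])  # profile -> [defaults, total loans]
    for age_band, wealth_band, defaulted in archive:
        entry = counts[(age_band, wealth_band)]
        entry[0] += int(defaulted)
        entry[1] += 1
    return {profile: defaults / total for profile, (defaults, total) in counts.items()}


def predict_default_risk(rates, age_band, wealth_band):
    """Estimate the default risk of a new request, or None if the profile was never seen."""
    return rates.get((age_band, wealth_band))


rates = learn_default_rates(ARCHIVE)
print(predict_default_risk(rates, "30_plus", "low"))      # 0.5
print(predict_default_risk(rates, "under_30", "medium"))  # None
```

The second call returns `None` because no comparable borrower exists in the archive, which is exactly the kind of case a person still has to assess.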

Nonetheless, it is crucial that humans remain the ultimate decision makers.

The software is not perfect: it is governed by settings made by humans. For instance, it may have been optimised to avoid false positives (cases where a loan is granted to a borrower who goes on to default) and so will lean towards rejecting certain applications. It may also discard observations that don’t fit with its criteria. Therefore, users must check that the system’s recommendations are legitimate and, if necessary, reject them.
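The effect of such a setting can be made concrete with a small sketch. The scored history and the `max_risk` parameter below are hypothetical; the point is only that tightening the threshold trades one kind of error for the other.

```python
# Hypothetical history: (predicted default risk, did the borrower actually default?)
SCORED_HISTORY = [
    (0.05, False), (0.10, False), (0.20, False), (0.30, True),
    (0.35, False), (0.50, False), (0.60, True), (0.80, True),
]


def evaluate_threshold(history, max_risk):
    """Grant every loan whose predicted risk is at or below max_risk, then count both error types."""
    bad_loans_granted = sum(
        1 for risk, defaulted in history if risk <= max_risk and defaulted
    )
    good_borrowers_rejected = sum(
        1 for risk, defaulted in history if risk > max_risk and not defaulted
    )
    return bad_loans_granted, good_borrowers_rejected


for max_risk in (0.25, 0.55):
    print(max_risk, evaluate_threshold(SCORED_HISTORY, max_risk))
# 0.25 (0, 2)
# 0.55 (1, 0)
```

With the strict setting no bad loan is granted, but two borrowers who would have repaid are turned away. Which balance is acceptable is a human decision, not something the algorithm settles on its own.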

If a loan is granted despite the system’s recommendation and the borrower eventually turns out to meet the payment schedule, the newly learned criteria will have to be introduced so that the algorithm accepts similar applicant profiles next time.
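One simple way to picture this feedback loop, assuming the system estimates risk from observed default rates, is to add the outcomes of human overrides to the history and recompute. The profiles and figures below are made up for illustration.

```python
from collections import defaultdict


def default_rate_by_profile(cases):
    """Compute the observed default rate for each borrower profile."""
    totals = defaultdict(lambda: [0, 0])  # profile -> [defaults, total loans]
    for profile, defaulted in cases:
        totals[profile][0] += int(defaulted)
        totals[profile][1] += 1
    return {profile: d / n for profile, (d, n) in totals.items()}


# Hypothetical history on which the system was originally trained.
history = [("self_employed", True), ("self_employed", True), ("salaried", False)]
before = default_rate_by_profile(history)

# Loans granted by a human against the recommendation, and repaid on schedule.
overrides = [("self_employed", False), ("self_employed", False)]
after = default_rate_by_profile(history + overrides)

print(before["self_employed"], after["self_employed"])  # 1.0 0.5
```

Once the repaid loans are fed back in, the estimated risk for that profile falls, so similar applicants are less likely to be rejected automatically.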

Another key reason is that humans should ensure ethical standards are met, especially where an individual’s rights are concerned. The law pertaining to the automated processing of non-anonymised data is likely to evolve further to protect citizens and consumers against the harmful effects of excessive statistical generalisation.

Data über alles

The performance of the automation will depend on meeting two imperatives:

1. Data quality – To reduce the number of erroneous observations, many cleansing and formatting activities are required. This task is huge compared to the effort needed to set up the model.

2. Training set representativeness – The automation is far more effective when ML is carried out on unbiased observations. These should be similar to the real-life cases the software will have to deal with. For instance, the range of wages at one company may be substantially wider than at another, so a model trained on the first may not suit the second.
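Both imperatives can be illustrated with a short sketch. The raw wage values, the cleaning rules and the range check are assumptions chosen for the example; real cleansing pipelines and representativeness checks are considerably more involved.

```python
# Hypothetical raw wage field as it might appear in an archive.
RAW_WAGES = ["32000", " 41000 ", "", "n/a", "-500", "38000", "41000"]


def clean_wages(raw_values):
    """Keep only values that parse as a positive number (data quality)."""
    cleaned = []
    for value in raw_values:
        try:
            wage = float(value.strip())
        except ValueError:
            continue  # empty or non-numeric entry
        if wage <= 0:
            continue  # implausible value
        cleaned.append(wage)
    return cleaned


def share_outside_training_range(training, live):
    """Return the fraction of live cases falling outside the range seen in training."""
    low, high = min(training), max(training)
    outside = [wage for wage in live if wage < low or wage > high]
    return len(outside) / len(live)


training_wages = clean_wages(RAW_WAGES)
live_wages = [35000.0, 39000.0, 90000.0, 120000.0]  # hypothetical new cases

print(training_wages)  # [32000.0, 41000.0, 38000.0, 41000.0]
print(share_outside_training_range(training_wages, live_wages))  # 0.5
```

Here half of the new cases have wages the model never saw during training, a sign that the training set is not representative of the cases the software now has to handle.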

Access to data is crucial to an ML project’s success: ultimately, no level of algorithmic sophistication will make up for a poor set of data. Machine learning has a tendency to dismiss arbitrary behaviours. It is up to us to make sure it does not replace these with inappropriate over-generalisations.

Sourced by Jean-Cyril Schütterlé, VP Product and Data Science, Sidetrade


Nick Ismail

Nick Ismail was an editor for Information Age (from 2018 to 2022) before moving on to become Global Head of Brand Journalism at HCLTech. He has a particular interest in smart technologies, AI and...
