Q&A: machine learning optimisation through speech recognition

John Hughes, accuracy team lead at AI speech recognition provider Speechmatics, spoke to Information Age about how the machine learning challenge of depicting numbers can be overcome to achieve optimisation.

One of the biggest challenges facing machine learning optimisation in speech recognition today is the insertion of numbers, such as dates and addresses. These kinds of data are crucial for sectors where numbers are critical, such as finance, software development and healthcare.

To combat this problem, AI speech recognition provider Speechmatics has released ‘Entity Formatting’ for its platform. This update uses Inverse Text Normalisation (ITN) to interpret more accurately how entities such as numbers, currencies, percentages, addresses, dates and times should appear in written form. This, in turn, makes transcripts more readable and reduces post-processing administrative duties.

In conversation with Information Age, Speechmatics accuracy team lead John Hughes discussed why numerical values are such an obstacle in this space, and how organisations can get the best out of this technology.

Why do machine learning models struggle with number formatting?

The formatting of entities like numbers, dates, times and addresses is acutely challenging for machine learning. Numbers come in so many different forms and are used in so many different contexts. Intuitively, we – as humans – know how a number is being used within almost any context. For machines, though, this is much harder. Until now, the problem for end users of speech recognition technology has been that they have to manually unscramble or correct the transcripts the software produces. It’s a time drain to proofread entities that appear as plain words. For example, the machine learning engine needs to understand that the spoken phrase ‘eighty three percent’ should appear as ‘83%’, and that ‘oh’ could be an exclamation but could also be the speaker referring to the number zero.

Indeed, this comes before we even begin to address the complexities of achieving this across multiple languages. Financial figures and currencies are another good example – you only have to imagine how £32,574.82 looks in text if written out in full. Accounting for these nuances and ambiguities in language is what makes this such a challenging area to tackle. Yet getting the numbers right is critical in particular industries, so customers need to be able to trust their technology. Our software is used by a lot of enterprise-level customers in finance, media and a range of other industries where numbers are used continuously in many contexts.
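
To make the formatting examples in this answer concrete, here is a minimal rule-based sketch of inverse text normalisation in Python. It is an illustration only and not Speechmatics' implementation: the function names (`format_entities`, `oh_means_zero`) and the rules are hypothetical, it covers only simple numbers below 100 followed by an optional "percent", and the "oh" check is a crude neighbour heuristic.

```python
# Toy inverse text normalisation (ITN): spoken-form words -> written form.
# Hypothetical rules for illustration; a production system would be far richer.

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def is_number_word(token):
    """True if the token is a spoken number word this sketch knows about."""
    return token.lower() in UNITS or token.lower() in TENS


def words_to_int(words):
    """Sum tens and units, e.g. ['eighty', 'three'] -> 83 (naive, 0-99 only)."""
    return sum(UNITS.get(w.lower(), 0) + TENS.get(w.lower(), 0) for w in words)


def format_entities(text):
    """Replace runs of number words with digits; attach '%' for 'percent'."""
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and is_number_word(tokens[j]):
            j += 1
        if j == i:                      # not a number word: copy it through
            out.append(tokens[i])
            i += 1
            continue
        written = str(words_to_int(tokens[i:j]))
        if j < len(tokens) and tokens[j].lower() == "percent":
            written += "%"              # fold the unit word into the number
            j += 1
        out.append(written)
        i = j
    return " ".join(out)


def oh_means_zero(tokens, i):
    """Guess that 'oh' at position i is a digit if a neighbour is a number word."""
    before = i > 0 and is_number_word(tokens[i - 1])
    after = i + 1 < len(tokens) and is_number_word(tokens[i + 1])
    return before or after


print(format_entities("growth was eighty three percent"))
print(oh_means_zero("room four oh two".split(), 2))
print(oh_means_zero("oh that is nice".split(), 0))
```

Even this toy version shows why context matters: it would wrongly turn "one of the biggest" into "1 of the biggest", which is the kind of ambiguity that fixed rules struggle with.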

How are the latest updates to Speechmatics’ Autonomous Speech Recognition going to help CTOs to drive value from machine learning?

As CTOs look to digital transformation and innovation adoption to streamline processes, machine learning has a powerful role to play. For technology leaders, the main benefit of this latest addition to the Speechmatics engine will be time. Typically, the post-processing work of transcripts is a time-consuming task, requiring someone to manually correct any mistakes the engine has made. This can be a frustrating process, particularly if you have incorporated speech recognition technology into your stack in order to create efficiencies. There are many industries where getting the numbers right for speech-to-text tasks is crucial, and customers operating in numerically intensive industries need to be confident that their tech is a help and not a hindrance.

How can organisations go about improving recognition of speech/audio within machine learning?

As with many of machine learning’s biggest challenges, it comes down to the data. The use of manually labelled data can be severely limiting in terms of giving machine learning models enough context to perform complex tasks. However, by using self-supervised learning models, we can massively increase the number of data sources available. This method takes vast amounts of unlabelled data and uses some property of the data itself to construct a supervised task, without the need for human intervention. This is how we’ve managed to take on this particular challenge: Speechmatics’ Autonomous Speech Recognition technology is now trained on 1.1 million hours of unlabelled audio and 30,000 hours of labelled audio (as was used before Autonomous Speech Recognition).
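
The idea of using "some property of the data itself to construct a supervised task" can be sketched in a few lines of Python. The example below uses masked prediction, one common self-supervised setup, purely as an illustration: the article does not say which objective Speechmatics uses, and the names (`make_masked_example`, `MASK`) and the toy frame values are hypothetical.

```python
import random

MASK = None  # hypothetical placeholder marking a hidden frame


def make_masked_example(frames, rng, n_masked=2):
    """Turn an unlabelled sequence into an (inputs, targets) training pair.

    Some frames are hidden in the inputs; the targets are the original
    values at those positions, so no human labelling is needed.
    """
    positions = set(rng.sample(range(len(frames)), n_masked))
    inputs = [MASK if i in positions else f for i, f in enumerate(frames)]
    targets = {i: frames[i] for i in positions}
    return inputs, targets


rng = random.Random(0)                          # seeded for repeatability
frames = [0.12, 0.40, 0.35, 0.90, 0.77, 0.05]   # stand-in for audio features
inputs, targets = make_masked_example(frames, rng)
print(inputs)
print(targets)
```

A model trained to fill in the hidden frames has to learn structure from the audio itself, which is why this style of training can draw on far more data than manual labelling allows.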

Are there any further contextual areas that the team at Speechmatics are looking at improving, such as accents and speech formatting?

At Speechmatics, we are always looking for ways to improve our engine and maintain our best-in-market position. Speech recognition is going to be increasingly threaded throughout our lives – for example, even now, 50 per cent of all searches are completed by voice and this shows no sign of slowing down. We ultimately want speech recognition to understand every voice. Perpetual innovation is at the core of what we do: our team of industry-leading experts are constantly looking for ways to achieve this across languages, accents, dialects, ethnicity, pitch, tone and more.




Aaron Hurst

Aaron Hurst is Information Age's senior reporter, providing news and features around the hottest trends across the tech industry.