Taking voice interaction to the next level

Google’s answer to the Amazon Echo, Google Home, is set to launch in the UK this summer. This follows a hugely successful launch of the Echo in the UK and Germany last year, which has pushed worldwide sales beyond the nine million mark.

A host of other products in the ‘smart assistant’ category are also vying for consumers’ attention, from Lenovo’s (literal) Echo clone to Mattel’s intriguing mash-up of Amazon and Microsoft AI in Aristotle, the digital assistant for kids.

The challenges of voice recognition

At the heart of this flurry of smart assisting, of course, lies voice interaction, which has been an R&D focal point for the big technology players for some time now.

Each of the technology giants is investing heavily in voice interaction, starting with the still thorny yet fundamental challenge of voice recognition, but also as a facet of their wider dedication to advancing the field of AI. Apple, in fact, can claim to be old stagers – The Economist recently reported that, around the world, Siri now handles two billion voice commands every week.


This is a complex area of new technology – both from the point of view of the technology itself, but, just as significantly, also with regard to the design of the interactions.

Designing for voice interactions successfully depends on appropriately addressing many different conditions including:

  • The physical environment
  • The primary physical state of the user (e.g. driving, walking, sitting on a sofa)
  • Connected (or in-range) devices
  • Enabled (or available) services
  • The number and type of device sensors
  • The network status
  • Noise levels and background interference
  • The emotional state of the user
  • The user’s tone of voice
  • The date and time of use
  • The user’s command recall
  • The user’s syntax and phrasing
  • The user’s regional accent and colloquialisms
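The conditions above could be modelled as an explicit context object that a voice interface consults before deciding how to interpret and answer a command. The sketch below is purely illustrative — the field names and the single heuristic are invented for this article, not drawn from any real assistant SDK:

```python
from dataclasses import dataclass, field

@dataclass
class InteractionContext:
    """Illustrative bundle of conditions a voice UI might weigh.

    All fields and thresholds here are hypothetical examples."""
    physical_state: str = "unknown"      # e.g. "driving", "walking", "sofa"
    noise_level_db: float = 0.0          # background interference
    network_online: bool = True
    connected_devices: list = field(default_factory=list)
    enabled_services: list = field(default_factory=list)
    hour_of_day: int = 12

    def prefers_short_replies(self) -> bool:
        # A driving user, or one in a noisy room, should get terse,
        # confirmatory audio rather than long spoken passages.
        return self.physical_state == "driving" or self.noise_level_db > 70

ctx = InteractionContext(physical_state="driving", noise_level_db=75)
print(ctx.prefers_short_replies())  # True
```

The point of the sketch is simply that these conditions are inputs to the design, not afterthoughts: a response that works on a sofa may be unusable in a car.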

The role of AI

Beyond all that, of course, primary significance sits with the AI engine that underpins the interaction. Roughly half of the considerations listed above will not be resolved satisfactorily until the AI can do the heavy lifting needed to, for example, autonomously resolve an ambiguity in the spoken input.

This is starting to happen. With Google Home (unlike the Echo), it is possible (subject to the usual tally of hit-and-miss attempts) to ask for something and then ask a contextual follow-up, in what can legitimately be labelled a (basic) conversational interaction.
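At its simplest, a contextual follow-up works because the assistant carries the subject of the previous turn forward. The toy sketch below illustrates only that mechanism — the fact store, class and values are all invented for illustration, and bear no relation to how Google Home actually works internally:

```python
# Toy knowledge base: (entity, attribute) -> answer. Invented data.
FACTS = {
    ("Barack Obama", "born"): "1961",
    ("Barack Obama", "height"): "1.85 m",
}

class Dialogue:
    """Minimal sketch of turn-to-turn context carry-over."""

    def __init__(self):
        self.last_entity = None  # the subject carried between turns

    def ask(self, entity, attribute):
        if entity is None:           # follow-up turn: reuse prior context
            entity = self.last_entity
        self.last_entity = entity    # remember the subject for next time
        return FACTS.get((entity, attribute), "I don't know.")

d = Dialogue()
print(d.ask("Barack Obama", "born"))  # first turn names the entity: 1961
print(d.ask(None, "height"))          # follow-up omits it: 1.85 m
```

Real assistants resolve context with far richer models, but the design question is the same: which parts of a previous turn should survive into the next one.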


The interesting thing for anyone observing the emergence of voice user interfaces in mass-market products, is the relationship between that form of interaction and the more conventional screen-based interactions.

Screen-based interfaces are an abstraction in a way that voice interaction arguably is not. Yet the nascent nature of voice interaction still necessitates a screen for effective ‘long-form’ interaction – by which we mean detailed immersion in complex content.

For now, voice augments, rather than displaces, the screen-based outcome: witness the regularity with which the Echo resolves a query by sending some links to the Alexa app, a mode of behaviour also more than familiar to anyone persisting with Siri.

Supplanting the screen

For voice to actually supplant the screen (even partially), advances will have to be made in five areas:

1. Intimacy – the biggest barrier to voice interaction gaining mainstream momentum is the awkward nature of publicly, audibly interacting with a device

2. Dependability – you have to be able to make it work every time, all the time, everywhere

3. Intelligence – it has to be able to actually offer you meaningful outcomes

4. Personality – it is striking how quickly you find yourself referring to an Amazon Echo as “she”; once the issues above have been satisfactorily addressed, designers will need to develop deliberate answers to the question of how a smart assistant should ‘feel’

5. Conversation – you need to be able to have an actual conversation with the assistant, at a level of nuance and fidelity that can effectively displace the current dependence on a screen.


Imagine two friends decide to order some takeout food – one friend is looking at their smartphone but the other is, for example, driving.

It’s the driver’s turn to choose today, but they’re not sure whether they want pizza or barbecue, so the passenger searches for options. Now imagine the variable, multi-threaded conversation the two of them have as they settle on a food type, find a local outlet, select specific menu items, change their minds a couple of times, schedule delivery and make the payment.

Now consider how all that works if the ‘friend’ is a smart assistant embedded in your car’s multimedia system and isn’t interacting with a series of smartphone apps and web pages, but merely vocalising the relevant data and choices that come into focus as your conversation progresses.

That’s a really tough proposition for an experience designer, to say nothing of the variety of software engineers that need to make the AI work. Yet these are the areas that designers will need to affect in order for this next frontier of interaction to really take off.

It is, in that context, perhaps not surprising that Amazon is rumoured to be working on a second-generation Echo featuring a touchscreen.

Real-life application

They are not the only ones. Baidu’s new TalkType app is a huge leap forward, enabling speech interaction with chat and social platforms, as well as with payment services and home automation devices. PayPal has also recently activated Siri for payments.

Online banking has also been an early adopter: in the UK, Barclays, HSBC and First Direct have already deployed voice recognition interfaces as a security measure, modelling a customer’s voice for use as a digital ID and obviating the need for lengthy security-question processes every time you call your telephone banking service.
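One common approach behind such voice-ID systems is to reduce each utterance to a fixed-length numerical “voiceprint” and compare it against the print captured at enrolment. The sketch below illustrates only the comparison step, using short made-up vectors in place of real acoustic embeddings; the threshold is likewise an arbitrary illustration, not a value any bank publishes:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled, attempt, threshold=0.85):
    # Accept the caller only if the new utterance's embedding is close
    # enough to the voiceprint captured at enrolment.
    return cosine_similarity(enrolled, attempt) >= threshold

enrolled_print = [0.9, 0.1, 0.4]      # stand-in for a real speaker embedding
same_speaker   = [0.85, 0.15, 0.42]   # slightly different utterance, same voice
impostor       = [0.1, 0.9, 0.2]

print(verify(enrolled_print, same_speaker))  # True
print(verify(enrolled_print, impostor))      # False
```

Production systems use trained speaker-embedding models and liveness checks on top of this, but the core decision is still a similarity comparison of this shape.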


An obvious further application is to perform basic customer services. Why spend endless minutes on a call queue, listening to Vivaldi’s Four Seasons and waiting for a person to help you change a direct debit, when you can converse with a bot using voice recognition interfaces and AI to perform that task? The organisation, meanwhile, benefits by being able to serve many other customers simultaneously.
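The first step in such a bot is routing a transcribed request to the right task. The sketch below shows the crudest possible version — keyword overlap scoring — with intent names and keywords invented for this article; a production system would use a trained natural-language understanding model rather than word counts:

```python
# Hypothetical intents for a banking bot; names and keywords are invented.
INTENTS = {
    "change_direct_debit": {"change", "direct", "debit"},
    "check_balance": {"balance", "account"},
}

def route(transcript):
    """Pick the intent whose keywords best overlap the transcribed words,
    falling back to a human agent when nothing matches at all."""
    words = set(transcript.lower().split())
    scores = {name: len(words & keywords) for name, keywords in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "handoff_to_human"

print(route("I'd like to change my direct debit please"))  # change_direct_debit
print(route("what's the weather like"))                    # handoff_to_human
```

The fallback branch matters as much as the matching: a bot that guesses wrongly costs more goodwill than one that hands over to a person.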

Revolutionising business communication

In the same way, voice interaction could revolutionise how businesses communicate and access data internally. Imagine an intranet that can be accessed by voice from anywhere within an organisation.

Relevant figures or content could be shared or referred to in real time during meetings, effectively turning an intranet into an additional – all-knowing – member of staff.

The opportunities for voice interaction are endless, but as yet the technology has not quite caught up with its potential.

However, businesses need to start thinking about how they can harness the power of voice sooner rather than later, as it is only a matter of time before voice interaction moves to the next level and starts to effectively perform roles that today still require a screen.


Sourced by Matt Clark, head of user experience at Amaze


Nick Ismail

Nick Ismail is a former editor for Information Age (from 2018 to 2022) before moving on to become Global Head of Brand Journalism at HCLTech. He has a particular interest in smart technologies, AI and...