Google has recently announced a number of updates to the technologies that underpin its Contact Center AI solutions, in particular Automatic Speech Recognition (ASR).
ASR goes under several names, including computer transcription, digital dictaphone and, quite wrongly, voice recognition. Speech recognition is all about what someone is saying, whereas voice recognition is about who is saying it. This has been further confused by the launch of products such as Apple’s Siri and Amazon’s Alexa, which are described as Voice Assistants. However, we shall stick to the term ASR.
What’s the point of ASR?
The applications for ASR are potentially massive – there are so many situations where it can provide either significant financial or ergonomic benefit. With computers costing far less than people, it’s easy to see why ASR might be a cost-effective way to input data. For a start, most people can speak three times faster than they can type, so in any job involving inputting text, the gains are mainly in saving the time of salaried employees. A lawyer might charge their time out at hundreds of pounds per hour, and a good proportion of billed time will be taken up writing documentation. Lawyers are rarely fast and accurate typists, so dictating to a computer which costs perhaps £2.00 per hour to run provides an obvious benefit. Increasingly, companies of all sizes are finding that ASR helps enormously with productivity.
But there are also situations where speech is the only available control medium. These range from systems that require input from people with disabilities to systems used by people who are already mentally overloaded, such as fighter pilots.
ASR isn’t easy
But recognising speech is not easy. As with many other Artificial Intelligence (AI) problems, some researchers have tackled ASR by looking at the way that humans speak and understand speech. Others have tried to solve the problem by feeding the computer with huge amounts of data, and getting the computer to work out what makes one word sound the same as another. Typical of the latter approach are Neural Networks, which may be loosely described as an attempt to emulate the way that the human brain works. This approach has the advantage that the ASR system designer need know nothing about human speech, linguistics, phonetics, acoustics or any of the myriad other disciplines involved in analysing what we say. However, its main disadvantage is that it requires large amounts of computer power and, when it works, nobody knows exactly how.
Transcription of telephone network conversations is generally more challenging than the transcription of speech made directly into a computer or cellphone. The quality of the acoustic signal is lower on telephone calls and, by definition, there will be (at least) two speakers rather than one – potentially confusing the ASR software.
It is now possible not only to digitally record calls but also to automatically transcribe them – and far more accurately than ever before. By utilising the context of a conversation, it becomes possible to compensate for some technical challenges of the telephone system, such as the quality of the audio signal and poor enunciation, that plague human listeners too. All this is important for businesses looking for services to provide ASR, as quality and accuracy have long been a concern.
For some tasks with a limited number of outcomes, good speech recognition is already perfectly adequate – for example, hands-free control of vehicles. However, where there are an infinite number of outcomes, the risks associated with confusion or rejection may be much greater – from losing customers to causing death, in applications ranging from on-line banking to medical screening. This is where speech understanding is essential.
When AI is not very intelligent
When Alan Turing first proposed a test for machine intelligence, he defined it in terms of what could fool a human, not what algorithms are used.
To quote Roger Schank from Northwestern University in Illinois:
“When people say AI, they don’t mean AI. What they mean is a lot of brute force computation.”
The history of ASR goes back more than 50 years to the early days of general purpose digital computers. In those days, vast amounts of computer power were simply not available. Even the largest computers available in the 70s and 80s would be dwarfed by a mobile phone of today. So rather than let the computer infer what the human brain was doing, there was no other option but to attempt to find out by experiment what the human was actually doing and then try to emulate it.
This did not stop at phonetics – which studies the basic units of human speech (phonemes are the building blocks of syllables, which are the building blocks of words, which are the building blocks of phrases, and so on) – but encompassed many of the things we learn as children to help us recognise speech – grammar, meaning, gestures, history, mood, etc. Indeed, we often recognise words we don’t actually hear. Even though some phrases are phonetically identical, it is only the context in which they are said that elicits their actual meaning.
Take for example, the two phrases
- A tax on merchant shipping
- Attacks on merchant shipping.
They are phonetically identical and even a human could not distinguish between their meaning if heard in isolation. Yet, for sure, they do mean something different. Knowing that one was uttered by an accountant and one by a naval commander would make the distinction so much more obvious. The person that utters the phrase forms part of the context, and there can be literally hundreds of contextual clues that a human uses that are not contained in the raw speech.
A computer which produced either of the above sentences could be said to have correctly recognised the speech that made them, yet only one would be a correct understanding of what was said.
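As a rough illustration of the distinction, the sketch below picks between the two phonetically identical transcripts using only the speaker's role. It is a toy under stated assumptions, not how any particular ASR product works: the roles, the word lists and their weights in `CONTEXT_WORDS` are invented for this example, as are the function names.

```python
from collections import Counter

# Hypothetical context vocabularies, invented for illustration. In practice
# these would be derived from each speaker's own documents and history.
CONTEXT_WORDS = {
    "accountant": Counter({"tax": 5, "revenue": 3, "duty": 2}),
    "naval commander": Counter({"attacks": 5, "convoy": 3, "escort": 2}),
}


def score(hypothesis: str, speaker_role: str) -> int:
    """Sum the context weights of the words in one candidate transcript."""
    weights = CONTEXT_WORDS.get(speaker_role, Counter())
    return sum(weights[word] for word in hypothesis.lower().split())


def pick_transcript(hypotheses: list[str], speaker_role: str) -> str:
    """Return the candidate best supported by the speaker's context.

    On a tie (for example an unknown role) the first candidate wins, so the
    recogniser's own ordering is preserved.
    """
    return max(hypotheses, key=lambda h: score(h, speaker_role))


# Two candidates that sound the same but mean different things.
hypotheses = ["a tax on merchant shipping", "attacks on merchant shipping"]

if __name__ == "__main__":
    for role in ("accountant", "naval commander"):
        print(role, "->", pick_transcript(hypotheses, role))
```

Both candidates are equally good recognitions of the sound; only the extra knowledge about who is speaking separates them.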
As computer power has become cheaper, the trend has been to let the computer get on and infer the relationship between acoustic and written speech. It may be that in so doing, a computer which is fed with many examples of acoustic phrases and a human transcription of the phrase ends up mimicking part of the human brain. But who cares how it does it? If it works, the end justifies the means.
But does it work? The answer is no, not always. Let us take, for example, the word “six”. You have a thousand people utter it and feed the recordings into a computer so that it learns what it thinks “six” sounds like. Then person 1001 comes along and, for whatever reason, the computer gets it wrong, so the person tries to articulate “six” much more clearly (maybe indicating their annoyance) by saying “sssssss-ic-ssssss”, i.e. over-stressing the “ssss” sound. The “ssss” sounds at the beginning and end of “six” will always be recognised by a human as the same sound no matter how long they last (i.e. how much they are stressed). Other sounds, such as the “i” sound, must be kept short, otherwise they turn into an “ee” sound.
This is basic phonetics and the computer would not necessarily be able to work this out from a great many samples. Because humans often articulate things similarly, basic statistics are not going to tell the computer that. However, a knowledge of phonetics would.
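To make the “six” example concrete, here is a minimal sketch of how such a phonetic rule could be written down by hand rather than learned. It assumes the audio has already been turned into one phone label per short time frame, which is a simplification, and the three-frame threshold in `LENGTH_SENSITIVE` is an arbitrary illustrative value, not a measured one.

```python
from itertools import groupby

# Assumed, illustrative rule table: phones whose identity changes when they are
# held too long, as (lengthened form, longest run of frames still heard as short).
# Any phone not listed here is treated as the same sound at any duration.
LENGTH_SENSITIVE = {"i": ("ee", 3)}


def collapse_frames(frames):
    """Turn a per-frame phone sequence into a phone sequence.

    Runs of a duration-invariant phone such as "s" collapse to a single phone
    however long they last; a run of "i" longer than the threshold becomes "ee".
    """
    phones = []
    for phone, run in groupby(frames):
        run_length = len(list(run))
        if phone in LENGTH_SENSITIVE:
            long_form, max_short = LENGTH_SENSITIVE[phone]
            phones.append(long_form if run_length > max_short else phone)
        else:
            phones.append(phone)
    return phones


normal_six = ["s", "s", "i", "i", "k", "s", "s"]
stressed_six = ["s"] * 7 + ["i", "i", "k"] + ["s"] * 6
long_vowel = ["s", "i", "i", "i", "i", "i", "k", "s"]

if __name__ == "__main__":
    print(collapse_frames(normal_six))
    print(collapse_frames(stressed_six))
    print(collapse_frames(long_vowel))
```

The normal and the over-stressed utterances reduce to the same phone sequence, while lengthening the vowel produces a different one. The rule is stated once from phonetic knowledge; it does not need person 1001 to have been in the training data.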
The importance of context
And so it is with context. Knowing who is speaking, their sex, where they come from, what job they do, what words they use, who they communicate with, for example, all helps us humans understand what they are saying. Cloud computing has increased the temptation for providers of ASR services to use algorithms that learn rather than are taught, simply because computing power is cheap. Why not use a sledgehammer to crack a nut?
The temptation arises because, where many users require simultaneous ASR services, it is far more efficient to pool the resources of all users in one highly powerful ASR engine than to spread them across many small individual ASR engines. It is more efficient because, most of the time, small dedicated ASR engines will be doing nothing, and when they are needed, they may take many times longer to process the speech. With one shared cloud computer, all this idle time can be mopped up and used to provide one fast service. This is why products such as Dragon running on a local computer cannot compete with a cloud-based ASR service. With so much applied resource, computer-hungry approaches become viable in a way they would not be for local computers.
However, as indicated above, no matter how much computer power is applied, there will be a limit to the ability of the service to understand what is being said. And by understand, I mean establish the meaning of the phrases and sentences, not simply identify the words.
In order to apply context, it is no good taking thousands of random utterances and smoothing them over. What is necessary is to look at specific utterances and establish the local context. This is an approach that has been used successfully in programs which search and retrieve voice and written data. The best of these can use any cloud-based ASR service, and apply the context found in emails and text messages to pre-condition the ASR service. So for example, if Bob and Alice exchange emails, they will be more likely to use certain words than might be the national average. The ASR can then be conditioned to favour those words.
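The Bob and Alice example can be sketched as a comparison between how often a word appears in their own messages and how often it appears in general text. The sketch below is a simplified illustration, not the method of any particular product: the function names, the `min_ratio` threshold, the smoothing and the sample texts are all assumptions made for this example. How the resulting word list is handed to an ASR service depends on the service and is not shown.

```python
import re
from collections import Counter


def word_counts(texts):
    """Count lower-cased words across a list of messages."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def boost_words(local_texts, baseline_texts, top_n=5, min_ratio=2.0):
    """Return words markedly more frequent locally than in the baseline.

    min_ratio is an assumed threshold: how many times more frequent a word
    must be in the local messages before it is worth favouring.
    """
    local = word_counts(local_texts)
    baseline = word_counts(baseline_texts)
    local_total = sum(local.values()) or 1
    baseline_total = sum(baseline.values())
    ratios = {}
    for word, count in local.items():
        local_rate = count / local_total
        # Add-one smoothing so words absent from the baseline do not divide by zero.
        baseline_rate = (baseline[word] + 1) / (baseline_total + 1)
        ratio = local_rate / baseline_rate
        if ratio >= min_ratio:
            ratios[word] = ratio
    # Highest ratio first; alphabetical order breaks ties deterministically.
    return sorted(ratios, key=lambda w: (-ratios[w], w))[:top_n]


# Made-up sample data standing in for Bob and Alice's emails and general text.
emails = ["The tanker convoy leaves Friday", "Is the convoy escort confirmed"]
general_text = ["the weather is fine", "the train leaves on friday", "is the shop open"]

if __name__ == "__main__":
    print(boost_words(emails, general_text))
```

Common words such as “the” score low because they are just as frequent in general text, so only vocabulary specific to this pair of correspondents is put forward for favouring.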
What’s interesting about the Google announcement is that Google seems finally to have accepted the importance of context in its cloud services.
However, Google’s focus of attention is contact centres where the need for ASR is primarily in forensics and damage limitation when things go wrong. The really big opportunities are small workgroups – where the need is to improve efficiency and the context is far more valuable. Even large corporations are built from many small workgroups.
Computing power is now cheap. But that doesn’t mean it can replicate human intelligence. If you want a piece of software to act like a human, it has to understand like a human.
We think it’s good news that Google has finally seen the light, and businesses using this technology should really start to benefit.