There was an interesting paper posted on the LinkedIn Automatic Speech Recognition (ASR) group a few days ago. The OUCH project (Outing Unfortunate Characteristics of HMMs) has been commissioned under a US government contract to look at where we are with ASR, and what can be done to make things better, if improvement is indeed what is needed.
In a very telling section, the authors say that the industry in all its forms (commercial, academic and the Fourth Estate) has misrepresented speech technology. Essentially, the perception has been created that ASR can do more than it can, more easily than is possible, so that when users get real-world experience, they are inevitably disappointed.
Anyone who has tried a desktop application of speech technology will be alive to this. You know that when you press a key, the same character will be printed on screen. It is frustrating when you dictate something and sometimes it gives you what you said, and sometimes it doesn’t.
I have had real-world experience of this myself. Because I like to be able to use my iPhone as I walk along the road, without running the risk of bumping into pedestrians, I have tried Siri, Apple’s voice-activated personal assistant. Sometimes, it is perfect, and I can send a very short text. On other occasions, it is useless. The title of this piece is a case in point.
Music is one thing that an iPhone is very good at. And I love my Spotify app, with all of my favourite songs and playlists accessible on the move. But no matter how many times I politely ask Siri to open the app for me, he stubbornly refuses to, insisting that I do not have an application called “Spotty Five”. I know I grew up on the London side of the Essex borders, but my accent isn’t that bad.
And there you have it. We’ve been “sold” an application that purports to give sophisticated voice support, and it can’t even work out the name of one of the internet’s more popular apps. Perhaps Apple is trying to spike Spotify, it being an iTunes rival? Who knows.
The fact is that voice recognition is based on flawed mathematics. We can vaguely approximate a model of some of the sounds a human makes, but the Hidden Markov Model (HMM) basis of speech recognition is still fragile. We are a step on from what we had before, which was a system based purely on signal processing, but little has changed in terms of the underlying understanding of how the brain interprets speech, and how that should be implemented in an ASR system.
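To make the HMM point concrete, here is a minimal sketch of the forward algorithm, the probabilistic core of classic HMM-based recognisers. The states, transition and emission probabilities are purely illustrative toy values, not drawn from any real acoustic model:

```python
# Toy discrete HMM: two hidden states (think of them as phoneme-like units)
# emitting symbols "a" and "b". All probabilities are invented for illustration.

states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}              # initial state probabilities
trans = {"s1": {"s1": 0.7, "s2": 0.3},      # state transition probabilities
         "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.5, "b": 0.5},         # emission probabilities
        "s2": {"a": 0.1, "b": 0.9}}

def forward(observations):
    """Return the total probability of the observation sequence under the HMM."""
    # Initialise with the first observation.
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    # Propagate forward one observation at a time.
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(forward(["a", "b"]))
```

A real recogniser works with thousands of context-dependent states and continuous acoustic features rather than two states and two symbols, but the fragility the paper describes lives in exactly this machinery: if the trained probabilities do not match the speaker, the most likely word sequence comes out wrong.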
Fundamentally, ASR systems require training. And that training is very dependent on a number of important factors, not least of which are accent and language. One key factor is distance from the microphone, the so-called “near-field” and “far-field” conditions, which has such a significant impact that each needs its own trained profile.
It is the failure to fully articulate the failings of the base systems that leads us, in effect, to mislead our potential users, thereby creating false expectations, which in turn lead people to turn away from what speech recognition has to offer. As the report says, we are in effect working with an emerging technology, but one which has been emerging for 30 years!
If we are to continue using fundamentally the same systems, we have to be clear about what can and cannot be achieved with the current state of the art. Closed-community systems, properly executed, can be very effective. When you have a limited number of potential users, and you can identify who they are, high accuracy rates can be achieved (and by high, I mean upwards of 90%). However, if you are dealing with “off-the-street” voices, as is the case with Siri, even if you limit yourself to a command-and-control-style application, it is inevitable that you are looking at a higher error rate.
Put simply, we are not yet at the point where you can walk up to any computer and have conversational input with it. Using well-thought-out techniques, you can reduce the potential for error, but if you need 100% accuracy for your application out of the box, we’re not there yet. Much more research needs to be done to understand how the brain really understands speech, not just from a signal processing point of view, but how it applies its own models based on the first few seconds of speech. We are all used to “tuning” our ear when we hear someone with an unusual accent for the first time, and these are the sorts of mechanisms that need to be better understood before our computers can better understand us.