Text-to-speech (TTS) is everywhere these days: the iPhone’s Siri and Amazon’s Echo are two examples. Yet many IVR (interactive voice response) systems still rely on a human voice talent to sit down in a studio and read prompts into a microphone. So why haven’t more enterprises dumped the human and gone full-on TTS in their call center implementations?
For a few reasons.
First, fidelity. Edward Tufte once said that the IVR phone interface is the lowest-resolution interface there is. (I don’t have a citation for this, because I heard him say it in person, but he did say it.) And he’s right. All we have is the spoken word, an interaction that by necessity unfolds over time. TTS has gotten pretty good, but it’s still pretty easy to tell it apart from a real human voice. On top of that, callers are already dealing with things like staticky cell phone connections. So why not take the time, especially for an important enterprise system, to produce the highest-fidelity audio possible?
Second, control. A high-quality voice interface needs natural-sounding intonation and pauses of the right length to support things like turn-taking, and a human voice gives you a lot of control over both. Professional voice talents are typically coached to get the desired effect, and prompts are often edited during production to insert precise pauses where needed. TTS can be controlled as well, depending on the product, but doing so can require additional markup, and even then it probably won’t match the range a human can supply.
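To give a feel for the kind of markup involved, here is a minimal sketch in Python that builds an SSML document (SSML is the W3C’s Speech Synthesis Markup Language) with a fixed pause between prompt segments. The function name and the 400 ms default are my own illustration, and how faithfully a given TTS engine honors the `<break>` element varies by product, so treat this as an assumption to check against your engine’s documentation.

```python
from xml.sax.saxutils import escape


def ssml_with_pauses(segments, pause_ms=400):
    """Join prompt segments into one SSML string with a pause between each.

    `segments` is a list of plain-text sentences; `pause_ms` is the pause
    length in milliseconds (a hypothetical default, tune it by ear).
    """
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so the prompt text cannot break the XML.
    body = pause.join(escape(segment.strip()) for segment in segments)
    return f"<speak>{body}</speak>"


print(ssml_with_pauses(["For billing, press 1.", "For support, press 2."]))
# <speak>For billing, press 1.<break time="400ms"/>For support, press 2.</speak>
```

Even in this tiny example, the pause is a number you have to guess at and then listen to, which is exactly the tuning a voice talent and an audio editor do by ear.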
Third, elision. When a human speaks, the voice naturally runs sounds together to smooth the transitions between words. It’s part of how we speak and hear our language. TTS does a certain amount of this, but it can only go so far. When you record things in larger chunks, the result just sounds better and is easier to understand. This is why airlines typically invest in an extended set of recordings to read back flight numbers. They definitely don’t use TTS, and they’re not just recording 1 through 10 and stringing those together. They’re recording “fifty three” and “seventy five,” with different intonation variants depending on whether the chunk falls at the start or the end of the number. They do this because the flight number is a critical piece of information for the caller, and it’s vital that it be as clear as possible.
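The flight-number approach can be sketched roughly as follows. This is only an illustration, not any airline’s actual system: the two-digit chunking rule, the `start`/`end` variant names, and the `.wav` file naming are all assumptions I made up for the example.

```python
def flight_number_prompts(flight_number):
    """Pick pre-recorded audio files for reading back a flight number.

    The number is split into chunks of up to two digits, working from the
    right, so 5375 becomes "53" + "75". The last chunk uses the recording
    with sentence-final intonation; earlier chunks use the "start" variant.
    File names are hypothetical.
    """
    digits = str(flight_number)
    if not (digits.isascii() and digits.isdigit() and 1 <= len(digits) <= 4):
        raise ValueError("expected a flight number of 1 to 4 digits")

    chunks = []
    while digits:
        chunks.insert(0, digits[-2:])  # take up to two digits from the right
        digits = digits[:-2]

    prompts = []
    for index, chunk in enumerate(chunks):
        variant = "end" if index == len(chunks) - 1 else "start"
        prompts.append(f"{chunk}_{variant}.wav")
    return prompts


print(flight_number_prompts(5375))  # ['53_start.wav', '75_end.wav']
print(flight_number_prompts(123))   # ['1_start.wav', '23_end.wav']
```

The point of the sketch is the size of the recording set: every chunk needs one recording per intonation variant, which is why this counts as an investment.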
TTS does have its place. Specifically, it’s fine for things that simply cannot be pre-recorded, such as addresses or other content that changes frequently. Even then, you’ve got to pay attention to the data you’re feeding it and make sure it’s clean enough to sound good when the TTS engine speaks it back. Abbreviations and company jargon will come out strangely and make it harder for callers to understand.
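As a rough sketch of that kind of cleanup, the Python below expands a few address abbreviations before the text is handed to a TTS engine. The abbreviation table is a hypothetical starting point, and a real one needs care: “St” can mean “Saint” as well as “Street,” for instance, so a plain lookup like this will sometimes guess wrong.

```python
import re

# Hypothetical table; extend it with your own abbreviations and jargon.
ABBREVIATIONS = {
    "st": "Street",
    "ave": "Avenue",
    "blvd": "Boulevard",
    "apt": "Apartment",
}

# A whole word, optionally followed by a period, that ends at whitespace,
# a comma, or the end of the text.
_TOKEN = re.compile(r"\b([A-Za-z]+)\.?(?=\s|,|$)")


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace known abbreviations with their spoken form; leave the rest."""
    def replace(match):
        word = match.group(1)
        # Unknown words are returned exactly as matched, period included.
        return table.get(word.lower(), match.group(0))

    return _TOKEN.sub(replace, text)


print(expand_abbreviations("123 Main St., Apt 4"))
# 123 Main Street, Apartment 4
```

Whatever the table looks like, it’s worth listening to a sample of real data played through the engine before going live.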
In short, TTS is great for some applications, but voice talents will continue to be important for enterprise applications. Now you’ll excuse me – I have an appointment in the recording studio!
Note: This article was originally published on the blog of my employer, Versay.com