VUI Design Topic: Prosody

First of all, what is prosody?

Prosody: the rhythm and pattern of sounds of poetry and language (Merriam-Webster)

So when we’re talking about prosody, we’re talking about:

  • How fast or slow the speech is

  • Pauses between words

  • Intonation (rising, falling, flat pitch of a sentence)

Since English is not a tonal language, we only have to contend with pitch as it conveys the overall feel of a sentence, not the meaning of individual words. For example, here are some standard prosodic habits that English speakers follow without even realizing it. (Try reading these out loud to really get the feel of it.)

  • Yes / no questions typically have a rising intonation at the end of the sentence:

  • “It’s a pretty day out, isn’t it?”

  • Statement sentences tend to end with a falling intonation:

  • “I’m going to the store.”

  • Mid-level or even rising intonation in the middle of a sentence indicates the speaker isn’t quite done yet, despite a pause. This may be used to ‘hold the floor’:

  • “I’m going to the store… and I think I’m going to take the car, do you mind?”

  • Fast pace may convey excitement or energy:

  • “I’m really in a hurry!”

  • Slower pace may convey a relaxed feel:

  • “It’s a be-you-ti-ful day.”

How do all of these considerations come into play when doing voice interaction design?

Intonation chiefly comes into play when putting pieces of prompts together. For example, let’s suppose we’re assembling a dynamic menu of choices. Depending on the back-end data, there could be anywhere from three to five options available on the menu. So, for example, the full-blown menu would be:

What kind of policy? Say just “life,” “auto,” “homeowners,” “renters,” or, “umbrella.”

In order to put that menu together dynamically it needs to be divided into the following pieces:

whatKind.wav What kind of policy? Say just …

life.wav “life,”

auto.wav “auto,”

homeowners.wav “homeowners,”

renters.wav “renters,”

or.wav or,

umbrella.wav “umbrella.”

However, if we’re taking intonation into account, we haven’t gone far enough. If “umbrella” is at the end, it needs a falling intonation, but what if the database only has three choices, and “homeowners” is at the end? Or what if it has four? In order to adequately support building a natural-sounding, prosodically correct prompt out of these smaller pieces, we need to have more recordings made. If we were to get all possible variations, our prompt list might look something like this:

whatKind.wav What kind of policy? Say just …

lifeMid.wav …life…

lifeEnd.wav …life.

autoMid.wav …auto…

autoEnd.wav …auto.

homeownersMid.wav …homeowners…

homeownersEnd.wav …homeowners.

rentersMid.wav …renters…

rentersEnd.wav …renters.

orMid.wav …or…

umbrellaMid.wav …umbrella…

umbrellaEnd.wav …umbrella.

Note that the prompt file names have changed. To support two distinct intonations of the same word, the voice talent records each one twice: once as if the item falls in the middle of the list, and once as if it falls at the end. The ellipses in the text given to the voice talent mark which is which. An ellipsis on both sides represents an item in the middle of a list, whereas a leading ellipsis followed by a period represents the item at the end of a list.
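To make the assembly step concrete, here is a minimal sketch in Python of how an application might pick the right recordings for whatever options the back end returns. The function name `build_menu_prompts` is hypothetical, and the sketch assumes the file naming shown in the list above (`<item>Mid.wav`, `<item>End.wav`, `orMid.wav`); a real platform would have its own way of queuing audio.

```python
def build_menu_prompts(options):
    """Return the ordered list of .wav files for a dynamic menu.

    Assumes each option has a "Mid" and an "End" recording, named as in
    the prompt list above. This is an illustrative sketch, not a
    platform API.
    """
    if not options:
        raise ValueError("need at least one menu option")

    files = ["whatKind.wav"]
    *middle, last = options

    # Every item except the last uses its mid-list intonation.
    files += [f"{name}Mid.wav" for name in middle]

    # "or" only makes sense when something precedes the final item.
    if middle:
        files.append("orMid.wav")

    # The final item gets the falling, end-of-list intonation.
    files.append(f"{last}End.wav")
    return files


print(build_menu_prompts(["life", "auto", "homeowners"]))
# ['whatKind.wav', 'lifeMid.wav', 'autoMid.wav', 'orMid.wav', 'homeownersEnd.wav']
```

With only three options, “homeowners” is played from its End recording, so the menu still closes with a falling intonation.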

This can be especially important when assembling things like account numbers or social security numbers for playback on the fly. In this case, it could go as far as three recordings of each digit: a mid, a rising, and an end one, to support a natural-sounding grouping. For example: 555 – 55 – 1234, which would sound good as mid-mid-rising, mid-rising, mid-mid-mid-end.
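The same idea can be sketched for digit playback. The Python below is only an illustration: the function name `digit_prompts` and file names such as `5Mid.wav`, `5Rising.wav`, and `4End.wav` are assumptions made for the example, and the groups are assumed to be separated by plain hyphens.

```python
def digit_prompts(number, sep="-"):
    """Return the .wav files for reading a grouped number aloud.

    Rule sketched here: the last digit of the final group gets "End",
    the last digit of any earlier group gets "Rising", and every other
    digit gets "Mid". File names are hypothetical.
    """
    groups = [group.strip() for group in number.split(sep)]
    files = []
    for g_index, group in enumerate(groups):
        is_last_group = g_index == len(groups) - 1
        for d_index, digit in enumerate(group):
            is_last_digit = d_index == len(group) - 1
            if not is_last_digit:
                variant = "Mid"
            elif is_last_group:
                variant = "End"
            else:
                variant = "Rising"
            files.append(f"{digit}{variant}.wav")
    return files


print(digit_prompts("555-55-1234"))
# ['5Mid.wav', '5Mid.wav', '5Rising.wav', '5Mid.wav', '5Rising.wav',
#  '1Mid.wav', '2Mid.wav', '3Mid.wav', '4End.wav']
```

The rising recording at the end of each early group signals that more digits are coming, while the End recording closes out the number.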

Next time on VUI Design Topics: More on Prosody!

This was originally published on the blog of my employer,
