Increasing the quality of speech recognition in your IVR and voice bot project

In the previous tutorial, we saw how FlexIO’s Visual Designer can be used to create simple yet convincing voice bots, using the built-in speech-to-text capability and some formatting tricks to accept a wider range of voice inputs from callers. In this tutorial, we will see how to increase the quality of speech recognition in your IVR and voice bot project, whether you use it through the Visual Designer or in your own applications.

Language selection

The very first check, however obvious it may sound, is to make sure that you are correctly specifying the language in which speech recognition is to be performed. It’s easy to overlook that the platform defaults to English when nothing is specified, and trying to recognize French or Dutch with a speech recognition engine expecting English is a guarantee of failure.

When working with the Visual Designer, make sure to select the required speech recognition language from the drop-down menu, just below the Speech input configuration:

Selecting the right language for speech recognition in the Visual Designer

When using RCML in your own code, specify the input language as follows: <Gather action="yourendpointUrl" method="GET" timeout="3" input="speech" language="fr-FR" hints="">
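If your application generates RCML dynamically, the same element can be produced programmatically rather than written by hand. The following is a minimal sketch in Python using only the standard library; the helper name `build_gather` and the example endpoint URL are hypothetical, and the sketch assumes only the attributes shown in the RCML example above.

```python
import xml.etree.ElementTree as ET


def build_gather(action_url, language="en-US", hints="", timeout=3):
    """Return a <Gather> element as a string (hypothetical helper).

    The attribute names mirror the RCML example in this tutorial;
    the function itself is an illustration, not part of FlexIO.
    """
    gather = ET.Element("Gather", {
        "action": action_url,
        "method": "GET",
        "timeout": str(timeout),
        "input": "speech",
        "language": language,  # BCP-47 tag, e.g. "fr-FR"
        "hints": hints,
    })
    # ElementTree escapes attribute values, so the result is well-formed XML.
    return ET.tostring(gather, encoding="unicode")


# Placeholder URL: replace with your own endpoint.
print(build_gather("https://example.com/handle-speech", language="fr-FR"))
```

Passing the language explicitly on every call makes it harder to fall back silently to the English default.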

For the supported BCP-47 language tags (e.g. “fr-FR”, “en-US”), see Google’s list of languages supported by Speech-to-Text.

Speech adaptation

Google trains its speech-to-text engine on very large data sets. However, performance may drop when the engine is asked to recognize very specialized words, such as those used in a specific domain (medicine, for example), or names of people.

FlexIO can leverage the speech adaptation feature of Google Speech-to-Text. As Google explains on this documentation page:

You can use the speech adaptation feature to help Speech-to-Text recognize specific words or phrases more frequently than other options that might otherwise be suggested. For example, suppose that your audio data often includes the word “weather”. When Speech-to-Text encounters the word “weather,” you want it to transcribe the word as “weather” more often than “whether.” In this case, you might use speech adaptation to bias Speech-to-Text toward recognizing “weather”. Speech adaptation is particularly helpful to the following use cases:

  • Improving the accuracy of words and phrases that occur frequently in your audio data. For example, you can alert the recognition model to voice commands that are typically spoken by your users.
  • Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using speech adaptation.
  • Improving the accuracy of speech transcription when the supplied audio contains noise or is not very clear.

In FlexIO, speech adaptation is designated as “hints” or “hot words”. When working with the Visual Designer, you define your hint phrases or hot words in the configuration of the speech recognition as a comma-separated list, as illustrated here:

Defining hot words or sentences to enforce speech adaptation

When working with RCML in your own code, make sure to include the hints attribute in the Gather verb: <Gather action="yourendpointUrl" method="GET" timeout="3" input="speech" language="en-US" hints="hydrotherapy, osteopathy, re-education, paraplegic">
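When the hint phrases come from a database or a configuration file, it can help to assemble the comma-separated value in code. The sketch below is one possible approach in Python; `format_hints` is a hypothetical helper, and it assumes that a space after each comma is accepted, as in the example above.

```python
def format_hints(phrases):
    """Join hint phrases into a comma-separated string.

    Blank entries and case-insensitive duplicates are dropped.
    A phrase containing a comma is rejected, because the comma
    is the separator of the list.
    """
    seen = set()
    cleaned = []
    for phrase in phrases:
        text = phrase.strip()
        if "," in text:
            raise ValueError(f"hint phrase must not contain a comma: {text!r}")
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return ", ".join(cleaned)


hints = format_hints(
    ["hydrotherapy", "osteopathy ", "", "re-education", "paraplegic", "Osteopathy"]
)
print(hints)  # hydrotherapy, osteopathy, re-education, paraplegic
```

The resulting string can be used directly as the value of the hints attribute.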

Class tokens

Besides helping the speech recognition engine with hint phrases or words, you may also be expecting a very specific type of input from your caller. Let’s imagine you need the caller to provide an identification number. Without any guidance, the speech recognition result could very well be “two seven five zero three one”, whereas you expected to receive “275031”.

By using class tokens, you can tell the speech recognition engine that you are expecting a digit sequence instead of the literal transcription. In our example, the class token for a digit sequence is $OOV_CLASS_DIGIT_SEQUENCE.
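To make the difference between the two transcriptions concrete, the sketch below shows what an application would otherwise have to do itself when it receives the literal form. This is only an illustrative fallback written in Python for English digit words; the function name and word table are assumptions made for this example and are not part of FlexIO or of Google’s API. Class tokens remain the preferred approach, since the engine then returns the digits directly.

```python
# Illustrative English-only mapping; "oh" is a common spoken form of zero.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def spoken_digits_to_number(transcript):
    """Convert 'two seven five' to '275'.

    Returns None if the transcript is empty or contains any word
    that is neither a digit word nor already a string of digits.
    """
    digits = []
    for word in transcript.lower().split():
        if word.isdigit():
            digits.append(word)
        elif word in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[word])
        else:
            return None
    return "".join(digits) or None


print(spoken_digits_to_number("two seven five zero three one"))  # 275031
```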

When using the Visual Designer, you would specify the class token at the same place you define hot words or hint sentences:

Defining class tokens to optimize speech recognition in the Visual Designer

When using RCML in your own code, you specify the required class token in the same “hints” attribute as above: <Gather action="yourendpointUrl" method="GET" timeout="3" input="speech" language="en-US" hints="$OOV_CLASS_DIGIT_SEQUENCE">.

You will find the complete list of available class tokens on Google’s documentation page.

Note that class tokens can be combined with hot words or hint sentences. A few examples:

  • “my address is $ADDRESSNUM”
  • “my reference code is $OOV_CLASS_ALPHANUMERIC_SEQUENCE”
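Because class tokens are long and easy to mistype, a small check before sending the hints can catch errors early. The Python sketch below is a hypothetical helper: its `KNOWN_CLASS_TOKENS` set contains only the three tokens mentioned in this tutorial and would need to be extended from Google’s complete list, and the pattern assumes that tokens consist of a dollar sign followed by uppercase letters and underscores.

```python
import re

# Only the tokens named in this tutorial; extend from Google's full list.
KNOWN_CLASS_TOKENS = {
    "$OOV_CLASS_DIGIT_SEQUENCE",
    "$OOV_CLASS_ALPHANUMERIC_SEQUENCE",
    "$ADDRESSNUM",
}

# Assumed token shape: "$" followed by uppercase letters and underscores.
TOKEN_PATTERN = re.compile(r"\$[A-Z_]+")


def unknown_class_tokens(hints):
    """Return class tokens found in the hints that are not in the known set."""
    unknown = []
    for hint in hints:
        for token in TOKEN_PATTERN.findall(hint):
            if token not in KNOWN_CLASS_TOKENS and token not in unknown:
                unknown.append(token)
    return unknown


hints = [
    "my address is $ADDRESSNUM",
    "my reference code is $OOV_CLASS_ALPHANUMERIC_SEQUENCE",
    "my number is $OOV_CLASS_DIGT_SEQUENCE",  # deliberate typo
]
print(unknown_class_tokens(hints))  # ['$OOV_CLASS_DIGT_SEQUENCE']
```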