Types of Speech Recognition

There are three types of speech recognition commonly used across different applications:

Grammar-Driven

Grammar-driven speech recognition works best when your application is looking for a specific piece of data, like a name, phone number, account number, dollar amount, or date. Think of this type of speech recognition as the spoken equivalent of the small edit fields you find in a web page or mobile app for short answers.

This is the only approach that works reliably, with minimal effort, for names, technical terms, and uncommon words with awkward spellings. The tradeoff for this accuracy is that your app needs to know ahead of time what a caller will say. The grammars in these apps can be complex definitions written in BNF or XML, but they can also be a simple list of phrases.
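
As a rough illustration of the simplest case (this is not SpeechBridge's grammar format; the phrase list and the match_against_grammar helper are invented for the example), a phrase-list grammar just constrains recognition to the utterances the app is prepared to hear at that point in the call:

    import difflib

    # The "grammar" in its simplest form: the fixed set of phrases the app
    # expects at this prompt. Illustrative only -- a real grammar-driven
    # recognizer constrains the search itself; this sketch just snaps a
    # hypothesis to the closest allowed phrase.
    ALLOWED_PHRASES = [
        "yes",
        "no",
        "transfer me to an operator",
        "repeat that please",
    ]

    def match_against_grammar(hypothesis, grammar, cutoff=0.6):
        """Return the closest allowed phrase, or None if nothing is close enough."""
        matches = difflib.get_close_matches(hypothesis.lower(), grammar, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    # A noisy hypothesis still resolves to the phrase the grammar expects.
    print(match_against_grammar("uh, transfer me to an operator please", ALLOWED_PHRASES))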

Use When:  Answers will be short, and recognition accuracy is critical
Applications: Virtual receptionist, payment processing, order status, hands-free email & calendar

Dictation

Also known as speech-to-text, dictation is used when you want a word-for-word transcription of what a user said. Dictation accuracy has improved significantly in recent years thanks to machine learning applied to massive repositories of recorded audio, but accuracy can still vary greatly depending on the quality of the audio and the environment it was recorded in.
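
For a rough sense of what dictation looks like in code, here is a sketch using the open-source SpeechRecognition Python package rather than SpeechBridge itself; the voicemail.wav file name is a placeholder:

    # Transcribe a recorded audio file to free-form text.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("voicemail.wav") as source:
        audio = recognizer.record(source)          # read the whole file

    try:
        transcript = recognizer.recognize_google(audio)   # word-for-word transcription
        print(transcript)
    except sr.UnknownValueError:
        print("Audio was unintelligible -- recording quality and environment matter.")
    except sr.RequestError as err:
        print(f"Recognition service unavailable: {err}")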

Use When: You need a transcription of what the user said, but it doesn't need to be word-perfect
Applications:  Message or note taking, voicemail to text

Natural Language Understanding (NLU)

NLU encompasses a range of technologies, but its most common uses are mapping the many ways humans speak to computer-identifiable objects and intents, and sentiment analysis (is the user happy or angry?). An NLU application can ask broad questions and will try to determine the caller's intent whether they speak in complete sentences or sentence fragments. This can reduce the number of steps, or turns, a user takes to reach their goal, and shorten call times.

The broader the application's context, the more background information it needs. For example, when an app for a movie theater asks, “How can I help you today?”, it most likely expects requests like “I'd like to buy two tickets to ‘Ferris Bueller's Day Off’” and “When is the Disney movie playing?”. NLU applications are generally more time-consuming to build, and they need rounds of tuning to evaluate and improve their performance.
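
As a toy sketch of the idea (production NLU engines use trained models; the intents and keyword patterns below are invented for the movie-theater example), intent mapping takes a free-form utterance and returns a computer-identifiable intent that the app can act on:

    import re

    # Many phrasings map to one intent. This keyword sketch only shows the
    # shape of the input and output, not how a real NLU engine works.
    INTENT_PATTERNS = {
        "buy_tickets": re.compile(r"\b(buy|purchase|get)\b.*\btickets?\b", re.I),
        "show_times":  re.compile(r"\b(when|what time|showtimes?)\b", re.I),
        "refund":      re.compile(r"\b(refund|money back|cancel)\b", re.I),
    }

    def detect_intent(utterance):
        for intent, pattern in INTENT_PATTERNS.items():
            if pattern.search(utterance):
                return intent
        return "unknown"   # hand off to disambiguation or a live agent

    print(detect_intent("I'd like to buy two tickets to 'Ferris Bueller's Day Off'"))  # buy_tickets
    print(detect_intent("When is the Disney movie playing?"))                          # show_times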

Use When: You want to ask an open-ended question and can map intent to action
Applications:  IVR starting point, virtual assistant


FAQ:

“Ok, so which do I use for my app?”

  • With SpeechBridge, you can use the mode that will work best for the task at hand and you can switch at any time during a session to give your users the best experience.

“How can we deliver great user experiences?”

  • For all types of speech recognition, gracefully handling the cases where the results are, shall we say, less than perfect is just as important as the initial question. Confidence scores can help you decide whether your app has a good result to work with, and disambiguation as a follow-on step is also effective; a minimal sketch of this appears after this list.
  • Those of us who have been in the speech industry for a while caught on early to the benefits of the iterative “ship early, ship often” process popularized by the Agile software movement. Successful speech interfaces need to be in the hands (so to speak) of users as early in the project timeline as possible. With our Menu Editor, you can rapidly prototype a custom speech app and build out complete IVR applications.
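
To make the confidence-score handling from the first bullet concrete, here is a minimal sketch; the RecognitionResult shape and the threshold values are hypothetical, not part of SpeechBridge's API:

    from dataclasses import dataclass

    @dataclass
    class RecognitionResult:
        text: str
        confidence: float   # 0.0 - 1.0, as reported by the recognizer

    ACCEPT_THRESHOLD = 0.80   # high confidence: act on the result directly
    CONFIRM_THRESHOLD = 0.50  # medium confidence: confirm or disambiguate

    def handle(result):
        if result.confidence >= ACCEPT_THRESHOLD:
            return f"ACT: {result.text}"
        if result.confidence >= CONFIRM_THRESHOLD:
            return f"CONFIRM: Did you say '{result.text}'?"
        return "REPROMPT: Sorry, I didn't catch that. Could you repeat it?"

    print(handle(RecognitionResult("pay my bill", 0.91)))
    print(handle(RecognitionResult("pay my bill", 0.62)))
    print(handle(RecognitionResult("pay my bill", 0.31)))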