The Speech APIs allow you to add advanced speech skills to your bot that leverage industry-leading algorithms for speech-to-text and text-to-speech conversion, as well as speaker recognition. The Speech APIs use built-in language and acoustic models that cover a wide range of scenarios with high accuracy. In addition, for applications that require further customization, you can use the Custom Recognition Intelligent Service (CRIS), which allows you to calibrate the language and acoustic models of the speech recognizer by tailoring them to the vocabulary of the application, or even to the speaking style of your users, thus achieving a higher degree of accuracy.
There are three Speech APIs available in Cognitive Services to process or synthesize speech:

- The Bing Speech API provides speech-to-text and text-to-speech conversion capabilities.
- The Custom Recognition Intelligent Service (CRIS) allows you to customize the recognizer's language and acoustic models for your application.
- The Speaker Recognition API identifies and verifies speakers based on their unique voiceprints.
The Speech APIs enable your bots to analyze audio and extract useful information from it. For example, a bot can detect the presence of certain words, or act on the transcribed text. In addition, on messaging channels that support voice input, bots can leverage the Speech APIs to recognize what users are saying, rather than relying on text messages. Finally, the Speaker Recognition API can identify or even authenticate users through their unique voiceprints.
Before you get started, you need to obtain your own subscription key from the Microsoft Cognitive Services site. Our Getting Started guide for the Speech API describes how to obtain the key and start making calls to the APIs. You can find detailed documentation about each API, including developer guides and API references, by navigating to the Cognitive Services documentation site and selecting the API you are interested in from the navigation bar on the left side of the screen.
Example: Speech-To-Text Bot
Let’s build a simple bot that leverages the Speech API to perform speech-to-text conversion. Our bot receives an audio file and either responds with the transcribed text or provides some interesting information about the audio it received, such as word, character, and vowel counts. We will use the Bot Application .NET template as our starting point. Note that this example requires the Newtonsoft.Json package, which can be obtained via NuGet.
After you create your project with the Bot Application template, add the Newtonsoft.Json package, and then open the MessagesController.cs file. Start by adding the following namespaces.
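The template already includes the common namespaces (System.Net.Http, Microsoft.Bot.Connector, and so on). A plausible set of additions, assuming the REST-based approach sketched below:

```csharp
using System.IO;
using System.Net.Http.Headers;
using Newtonsoft.Json;
```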
Next, you will add some necessary classes to handle authentication and the access token for the Speech API.
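A minimal sketch collapses this into a single helper class, assuming the standard Cognitive Services token endpoint; the names (Authentication, GetAccessTokenAsync) are illustrative, not part of any SDK:

```csharp
/// <summary>Exchanges a Speech API subscription key for a short-lived access token.</summary>
public class Authentication
{
    // The shared Cognitive Services token-issuing endpoint.
    private const string TokenUri = "https://api.cognitive.microsoft.com/sts/v1.0/issueToken";

    private readonly string subscriptionKey;

    public Authentication(string subscriptionKey)
    {
        this.subscriptionKey = subscriptionKey;
    }

    // Returns the access token (a JWT string) to use as a Bearer token.
    // Tokens expire after a few minutes; production code should cache and renew them.
    public async Task<string> GetAccessTokenAsync()
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", this.subscriptionKey);
            var response = await client.PostAsync(TokenUri, new StringContent(string.Empty));
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```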
You will now write the function that implements the speech-to-text conversion. Note that the function requires a working Speech API key, which can be obtained via your Cognitive Services subscription page.
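One possible implementation, sketched against the Bing Speech REST endpoint; the endpoint URL, query parameters, and response schema below reflect one version of the API and may differ from yours, so check the API reference:

```csharp
// A sketch of speech-to-text over the Bing Speech REST API.
// GetTextFromAudioAsync is an illustrative name; replace the key placeholder
// with your own Speech API key.
private static async Task<string> GetTextFromAudioAsync(Stream audioStream)
{
    // Assumption: the "interactive" recognition endpoint with the simple result format.
    var requestUri = "https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=en-US";

    using (var client = new HttpClient())
    {
        var auth = new Authentication("YOUR-SPEECH-API-KEY");
        var token = await auth.GetAccessTokenAsync();
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);

        var content = new StreamContent(audioStream);
        content.Headers.TryAddWithoutValidation("Content-Type", "audio/wav; codec=audio/pcm; samplerate=16000");

        var response = await client.PostAsync(requestUri, content);
        response.EnsureSuccessStatusCode();

        // In the simple result format, the transcription is in DisplayText.
        dynamic result = JsonConvert.DeserializeObject(await response.Content.ReadAsStringAsync());
        return (string)result.DisplayText;
    }
}
```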
Finally, replace the code in the Post task with the code below. It parses the voice attachment sent to the bot, calls the speech-to-text conversion function, and responds to the user with either the transcribed text or related metadata, such as character or word count, depending on the user’s request.
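A sketch of that handler, assuming the GetTextFromAudioAsync helper above; the audio content-type check and the keyword commands ("word", "character", "vowel") are illustrative choices, not fixed by the API:

```csharp
public async Task<HttpResponseMessage> Post([FromBody] Activity activity)
{
    if (activity.Type == ActivityTypes.Message)
    {
        var connector = new ConnectorClient(new Uri(activity.ServiceUrl));

        // Assumption: voice messages arrive as attachments with an audio/* content type.
        var audioAttachment = activity.Attachments?
            .FirstOrDefault(a => a.ContentType.StartsWith("audio"));

        string replyText = "Did you forget the audio file?";
        if (audioAttachment != null)
        {
            using (var client = new HttpClient())
            {
                var audioStream = await client.GetStreamAsync(audioAttachment.ContentUrl);
                var text = await GetTextFromAudioAsync(audioStream);

                // Reply with metadata or the raw transcription, depending on the request.
                switch ((activity.Text ?? string.Empty).Trim().ToLowerInvariant())
                {
                    case "word":
                        replyText = $"Word count: {text.Split(' ').Length}";
                        break;
                    case "character":
                        replyText = $"Character count: {text.Length}";
                        break;
                    case "vowel":
                        replyText = $"Vowel count: {text.Count(c => "aeiou".IndexOf(char.ToLowerInvariant(c)) >= 0)}";
                        break;
                    default:
                        replyText = $"You said: {text}";
                        break;
                }
            }
        }

        await connector.Conversations.ReplyToActivityAsync(activity.CreateReply(replyText));
    }

    return Request.CreateResponse(HttpStatusCode.OK);
}
```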
Example: Speaker Recognition Bot
For our second example, we will build a bot that leverages the Speaker Recognition API, which allows you to use voice for authentication scenarios. The bot receives an audio file, compares it against the sender’s voiceprint, and responds with an accept or reject decision, along with a confidence score. We will use the Bot Application .NET template as our starting point. Note that the example requires the Microsoft.ProjectOxford.SpeakerRecognition package, which can be obtained via NuGet.
Before you begin, you need to enroll your voice by saying one of the preselected passphrases. The Speaker Verification service requires at least three enrollments, so the bot will ask for three enrollment audio files in total and send a confirmation when enrollment is complete.
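As a preview of what the bot does for each enrollment clip (the setup steps follow below), here is a sketch of the enrollment call; EnrollSpeakerAsync is an illustrative name, and the profile is assumed to have been created beforehand (for example, with the SDK's CreateProfileAsync):

```csharp
// A sketch of one enrollment step with the Speaker Recognition SDK.
private static async Task<string> EnrollSpeakerAsync(Stream audioStream, Guid profileId)
{
    var client = new SpeakerVerificationServiceClient("YOUR-SPEAKER-RECOGNITION-API-KEY");

    // Each call submits one audio clip of the user saying a supported passphrase.
    var enrollment = await client.EnrollAsync(audioStream, profileId);

    return enrollment.RemainingEnrollments > 0
        ? $"Enrollment accepted; {enrollment.RemainingEnrollments} more recording(s) needed."
        : "Enrollment complete. You can now verify your voice.";
}
```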
After you create your project with the Bot Application template, add the Microsoft.ProjectOxford.SpeakerRecognition package, and open the MessagesController.cs file. Then, add the following namespaces.
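Assuming the SDK-based sketches in this example, the additions would look like this (the template already includes the rest):

```csharp
using System.IO;
using Microsoft.ProjectOxford.SpeakerRecognition;
using Microsoft.ProjectOxford.SpeakerRecognition.Contract.Verification;
```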
You will now write the function that implements the speaker verification logic. Note that the function requires a working Speaker Recognition API key, which can be obtained via your Cognitive Services subscription page.
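A sketch of that function using the SDK's SpeakerVerificationServiceClient; VerifySpeakerAsync and the hard-coded profile ID are illustrative, and a real bot would look up the profile ID associated with the sender:

```csharp
// A sketch of speaker verification with the Microsoft.ProjectOxford.SpeakerRecognition SDK.
// Assumption: the profile ID below belongs to a verification profile that has
// already completed its three enrollments.
private static async Task<string> VerifySpeakerAsync(Stream audioStream)
{
    var client = new SpeakerVerificationServiceClient("YOUR-SPEAKER-RECOGNITION-API-KEY");

    // Hypothetical placeholder: the sender's enrolled verification profile.
    var profileId = new Guid("YOUR-ENROLLED-PROFILE-ID");

    var verification = await client.VerifyAsync(audioStream, profileId);

    // Result is Accept or Reject; Confidence is Low, Normal, or High.
    return $"Verification result: {verification.Result} (confidence: {verification.Confidence}).";
}
```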
Replace the code in the Post task with the code below. It parses the voice attachment sent to the bot, calls the speaker verification service, and responds to the user with an accept or reject decision, along with the confidence score.
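A sketch of the handler, assuming the VerifySpeakerAsync helper above; enrollment tracking (collecting the three enrollment clips before switching to verification) is omitted for brevity:

```csharp
public async Task<HttpResponseMessage> Post([FromBody] Activity activity)
{
    if (activity.Type == ActivityTypes.Message)
    {
        var connector = new ConnectorClient(new Uri(activity.ServiceUrl));

        // Assumption: voice messages arrive as attachments with an audio/* content type.
        var audioAttachment = activity.Attachments?
            .FirstOrDefault(a => a.ContentType.StartsWith("audio"));

        string replyText = "Please send a voice message to verify your identity.";
        if (audioAttachment != null)
        {
            using (var client = new HttpClient())
            {
                var audioStream = await client.GetStreamAsync(audioAttachment.ContentUrl);
                replyText = await VerifySpeakerAsync(audioStream);
            }
        }

        await connector.Conversations.ReplyToActivityAsync(activity.CreateReply(replyText));
    }

    return Request.CreateResponse(HttpStatusCode.OK);
}
```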