Introduction
Developing a voice-driven user experience raises a design challenge. Most developers are familiar with UX (user experience) design, but voice user experience (VUX) is new territory, and with it come new terminology and new technology. Traditional applications have a user interface between the user and the application logic. With a voice-first application, a natural language processor sits between the user and the application logic, mapping the user's spoken request to an actionable request for the voice application. This article shows you how to design the voice interaction model for both Alexa and Google Assistant.
Alexa and Google Assistant
Alexa Skills are configured in the Alexa Skill Console on developer.amazon.com using an Amazon account. Google Actions are configured on console.actions.google.com and integrate with DialogFlow at console.dialogflow.com using a Google account. This article covers how the user's intention is communicated to Alexa Skills and Google Actions. The Alexa Skill Console and DialogFlow use much of the same terminology, and the concepts discussed in this article carry over to both environments. Differences are discussed where applicable.
Invocation Names
Amazon calls voice applications Alexa Skills; Google calls them Actions. Both use the term "invocation name" to refer to what the user says to launch voice applications.
Use a name that communicates the functionality of the application. For example, I created an Alexa Skill that queries the public clinicaltrials.gov website that hosts a directory of clinical trials. The invocation name is "clinical trial finder."
Avoid homophones in your invocation name. I released a skill with the invocation name "eye of the elder gods." It worked as expected during testing, but it did not launch after it was released to production. Alexa heard the user say "i of the elder gods" and did not launch the skill. I opened a support ticket directly with Amazon, and within a few weeks Amazon released an update that resolved the issue.
One way to test whether an invocation name, or any phrase, is translated to text as expected is to speak it to the device. For example, you could say, "Alexa, [test invocation name]" and check your interaction history. To check your history in Alexa, open the Amazon Alexa mobile app, select the icon in the upper-left corner to open the slide-out menu, then select Settings -> Alexa Account -> History. You'll see a record of every interaction with Alexa.
For Google, navigate to myactivity.google.com in a browser and log in with your Google account. Expand the plus icon under the search bar and select Assistant.
On a side note, when you delete your interaction history in Alexa, even Amazon tech support cannot retrieve it. I discovered that while testing the "eye of the elder gods" invocation name issue and working through the support ticket.
Amazon allows Skills with duplicate invocation names, but avoid using one: you have no control over which Skill Alexa starts when the user speaks a duplicated invocation name. The only certain way to launch such a skill is to navigate to its listing page on amazon.com and click the "Enable Skill" button. Google, on the other hand, does not allow duplicate invocation names.
It's also worth noting that the invocation name can be different from the display name on Alexa. I released Animal Farm P.I. on Alexa. The invocation name is "animal farm p. i." Note the space between the "p." and the "i." The first time I submitted the skill for publication, it was rejected because Alexa requires spaces between abbreviated initials. On Google Assistant, the invocation name is the display name, but you can specify a different pronunciation. When I put "Animal Farm P.I." as the invocation name for a Google Action, it pronounced the word "period" and said, "Animal Farm P period I period." I then tried "Animal Farm PI" and Google pronounced it as "Animal Farm Pie." Finally, I got the right pronunciation with a space between the P and the I: "Animal Farm P I".
Intents
Voice applications on Alexa and Google Assistant do not receive an audio file of the user's voice. A natural language processor converts the speech to text and then maps the resulting phrase to an intent.
The phrase could be as simple as a one-word response to a question (yes, no) or a full sentence. Let's say you are designing a maze game where the user can go left or right. When Alexa says, "Would you like to go left or right?" the user, most likely, will respond with a phrase indicating the direction. To go left, the user could say:
turn left
hang a left
walk left
make a left
go left
left
All of these phrases map to the same command. The Alexa Skill model refers to them as utterances.
DialogFlow refers to the spoken phrase as a training phrase.
When processing the intent, the voice application must maintain and apply context. Using the maze example, going left at one location may result in a dead end, while in another, it could lead to the exit. Knowing where the user is in the maze provides context to the direction and what directions are available.
If the context should live only for the duration of the user's session, then you can use the session attributes portion of the Alexa response. Session attributes are echoed back to the application in the subsequent Alexa request.
- "session": {
- "new": false,
- "sessionId": "SessionId.DFJHUE....",
- "application": {
- "applicationId": "amzn1.ask.skill.SUFDHG..."
- },
- "attributes": {
- "mazeLocation": "node5"
- },
- "user": {
- "userId": "amzn1.ask.account.edited"
- }
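To show how this plumbing fits together, here is a minimal Python sketch of a handler for the maze game that reads the echoed session attributes from the raw request and returns updated attributes in the response. The function name, node values, and speech text are my own; the request and response shapes follow the standard Alexa custom skill JSON.

def handle_go_left(event):
    # A sketch, not production code. Session attributes set in the previous
    # response are echoed back under session.attributes in the next request.
    attributes = dict(event.get("session", {}).get("attributes", {}))
    current_location = attributes.get("mazeLocation", "node1")

    # Hypothetical game logic: where "left" leads depends on the current node.
    attributes["mazeLocation"] = "node5" if current_location == "node1" else "node7"

    return {
        "version": "1.0",
        "sessionAttributes": attributes,  # echoed back in the next request
        "response": {
            "outputSpeech": {
                "type": "PlainText",
                "text": "You turned left. Would you like to go left or right?"
            },
            "shouldEndSession": False
        }
    }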
Up to 10k of data can be stored in a Google DialogFlow response in the userStorage attribute.
{
    "payload": {
        "google": {
            "expectUserResponse": true,
            "userStorage": "{\"mazeLocation\":\"node5\"}"
        }
    }
}
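A fulfillment webhook can build that payload directly. Here is a minimal Python sketch, assuming a DialogFlow v2 webhook response; the function name and speech text are my own, and the maze state is serialized into the userStorage field as a JSON string to match the payload above.

import json

def build_dialogflow_response(speech_text, maze_location):
    # A sketch: carries the maze state in the Actions on Google userStorage field.
    return {
        "fulfillmentText": speech_text,
        "payload": {
            "google": {
                "expectUserResponse": True,
                "userStorage": json.dumps({"mazeLocation": maze_location})
            }
        }
    }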
If the user's context needs to be persisted between sessions, then some form of server-side storage is necessary, like DynamoDB, S3, or an RDS database instance if you are hosting on Amazon, or CosmosDB or Blob storage if you're hosting on Azure.
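As an illustration, here is a hedged sketch using boto3 and DynamoDB, keyed on the userId that appears in the Alexa request; the table name, key schema, and helper functions are my own assumptions, not part of either platform.

import boto3

# Hypothetical table with a string partition key named "userId".
table = boto3.resource("dynamodb").Table("MazeGameState")

def save_location(user_id, maze_location):
    # Persist the user's position so it survives between sessions.
    table.put_item(Item={"userId": user_id, "mazeLocation": maze_location})

def load_location(user_id, default="node1"):
    # Return the saved position, or the default for a first-time user.
    item = table.get_item(Key={"userId": user_id}).get("Item")
    return item.get("mazeLocation", default) if item else default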
Handling Unexpected Utterances
The voice application's response should guide the user to the available choices (e.g. "Would you like to go left or right?") and anticipate the different responses a user might give. However, there will be cases where the user says something unexpected, and a robust voice application should be prepared to re-prompt the user with appropriate guidance. For example, if the user responds with, "go straight," the maze game should reply with an informative response, like:
That's not a valid direction. You can go left or right.
I'm sorry, I didn't get that. You can go left or right.
Sorry, I don't recognize that. You can go left or right.
Alexa has a built-in AMAZON.FallbackIntent, and DialogFlow includes a Default Fallback Intent. The response to the fallback should rely on the user's context to guide the user to supported utterances.
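For the maze game, a fallback handler might look like the sketch below. It assumes the ASK SDK for Python; the mazeLocation session attribute and the speech text are my own.

from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_intent_name

class FallbackIntentHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_intent_name("AMAZON.FallbackIntent")(handler_input)

    def handle(self, handler_input):
        # Use the stored maze position to keep the re-prompt contextual.
        attrs = handler_input.attributes_manager.session_attributes
        location = attrs.get("mazeLocation", "the entrance")
        speech = ("Sorry, I don't recognize that. From {}, "
                  "you can go left or right.".format(location))
        return handler_input.response_builder.speak(speech).ask(speech).response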
Required Intents
Before a voice application is released to production, it must pass a certification checklist. For Alexa, part of that checklist includes supporting required system intents. At a minimum, the skill should support the AMAZON.HelpIntent and, ideally, provide contextual guidance to the user. If your skill supports booking a hotel and reserving a flight, and the user asks for help in the middle of reserving a flight, the response should explain how to reserve a flight rather than how to book a hotel.
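A contextual help handler could branch on what the user is currently doing. The sketch below assumes the ASK SDK for Python and a hypothetical currentTask session attribute set elsewhere in the skill.

from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_intent_name

class HelpIntentHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_intent_name("AMAZON.HelpIntent")(handler_input)

    def handle(self, handler_input):
        # "currentTask" is a hypothetical attribute tracking the user's current flow.
        task = handler_input.attributes_manager.session_attributes.get("currentTask")
        if task == "flight":
            speech = "You're reserving a flight. Tell me the city you want to fly to."
        elif task == "hotel":
            speech = "You're booking a hotel. Tell me the city and your check-in date."
        else:
            speech = "You can book a hotel or reserve a flight. Which would you like?"
        return handler_input.response_builder.speak(speech).ask(speech).response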
Some intents are required for certain skill types. If you are building a skill that plays music, then the skill must support the AMAZON.PauseIntent and the AMAZON.ResumeIntent to handle the music playback.
These requirements do change, so check Amazon's and Google's documentation for the current list.
Built-In Intents
Built-in intents are system-defined intents with preconfigured utterances, managed by Amazon and Google, that cover common phrases. For example, Alexa includes the AMAZON.YesIntent and AMAZON.NoIntent. If you can use a built-in intent, you should do so. Utterances for these intents are handled by Amazon for all languages and locales supported by Alexa, so if you need to port your skill to another language, this is one less set of utterances to translate.
DialogFlow supports a different set of built-in intents, like actions.intent.PLAY_GAME and actions.intent.CHECK_AIR_QUALITY.
Accepting Variable Input
Not every user request can map to a fixed phrase. For example, a user could request a city when reserving a flight, and building out a set of utterances that covers every available city is not practical. On Alexa, Slot Types provide a solution for accepting variable input. AMAZON.US_CITY is a built-in slot type that, as the name implies, lists U.S. cities. The same best practices that apply to built-in intents also apply to built-in Slot Types: use them if they are available and save yourself the effort of translating the slot values to another language or locale.
The AMAZON.US_CITY Slot Type is a list type containing string values and synonyms. You can define custom slot types with your own values and synonyms. The Animal Farm P.I. skill includes a list of locations the player can visit.
- "name": "FarmLocations",
- "values": [
- {
- "name": {
- "value": "pond",
- "synonyms": [
- "duck pond"
- ]
- }
- },
- {
- "name": {
- "value": "kitchen"
- }
- },
The FarmLocations slot type is used in the GotoLocationIntent.
- "name": "GotoLocationIntent",
- "slots": [
- {
- "name": "location",
- "type": "FarmLocations"
- }
- ],
- "samples": [
- "go to {location}",
- "go back to the {location}",
- "go back to {location}",
This mapping lets the user say:
go to pond
go to duck pond
go back to pond
go back to duck pond
. . .
All of those phrases produce the same GotoLocationIntent with the location slot populated and resolving to the value pond.
- "intent": {
- "name": "GotoLocationIntent",
- "confirmationStatus": "NONE",
- "slots": {
- "location": {
- "name": "location",
- "value": "duck pond",
- "resolutions": {
- "resolutionsPerAuthority": [
- {
- "authority": "amzn1.er-authority.echo-sdk.amzn1.ask.skill.92304d4d-42a5-4371-9b13-97b4a79b9ad0.FarmLocations",
- "status": {
- "code": "ER_SUCCESS_MATCH"
- },
- "values": [
- {
- "value": {
- "name": "pond",
- "id": "d873154f067233434b64c0b8b2348cdb"
- }
- }
- ]
- }
- ]
- },
- "confirmationStatus": "NONE",
- "source": "USER"
- }
The Alexa Request above was generated using the Alexa test console.
It's worth noting that if the user speaks a phrase that matches the GotoLocationIntent but does not match a slot value, the slot value is still submitted to the skill. For example, if the user says, "go to hospital," and the value hospital is not included in the Slot Type, the request is still submitted with the location slot set to the value hospital.
Be careful with this approach since it relies on Alexa's ability to translate speech to text, and it might not always get it right. As a best practice, if you are doing something mission-critical, like reserving a flight, you should confirm the value with the user and respond with something like:
I heard you say you want to fly to Chicago, is that correct?
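One way to implement that check is to inspect the entity resolution status before acting on the slot. The sketch below works directly on the raw intent JSON shown earlier; the function names and the confirmation prompt are my own.

def resolved_slot_value(intent, slot_name):
    # Return the canonical slot value when entity resolution matched, else None.
    slot = intent.get("slots", {}).get(slot_name, {})
    authorities = slot.get("resolutions", {}).get("resolutionsPerAuthority", [])
    for authority in authorities:
        if authority.get("status", {}).get("code") == "ER_SUCCESS_MATCH":
            return authority["values"][0]["value"]["name"]
    return None

def confirmation_prompt(intent, slot_name):
    # Fall back to the raw transcription (e.g. "hospital") and confirm with the user.
    value = resolved_slot_value(intent, slot_name) or intent["slots"][slot_name].get("value")
    return "I heard you say you want to go to the {}, is that correct?".format(value)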
You can also accept numeric input, like telephone numbers, using built-in slot types on Alexa such as AMAZON.PhoneNumber.
DialogFlow has the same concept but refers to it as Entities. You can define list-type entities with synonyms just as you can define Slot Types in the Alexa console, and DialogFlow includes its own set of built-in system entities, like @sys.phone-number.
Summary
This is just a start: an introduction to modeling voice interaction for your voice application. Once you have the first version of the interaction model, you should test it with users who have no instructions on how to use your skill. This is a good test to determine whether your voice application's responses are natural and intuitive. You can also submit it for use by beta testers on both the Alexa console and the Google Action console.
The Alexa console does not report the text of a user's request. If you are in production and find that the AMAZON.FallbackIntent is being invoked often, there is no way to determine what users are saying. DialogFlow, on the other hand, does report requests that were mapped to the Default Fallback Intent.
At the time of this writing, Amazon has released the AWS Certified Alexa Skill Builder exam in beta. I took it on Jan. 7 and can confirm that you need to know how to configure invocation names, intents, and slot types.