The Design Process of AI-driven Voice User Interfaces (VUI)
By marketing | 18/07/2019

Products based on artificial intelligence (AI) must be researched, tested and validated long before a minimum viable product (MVP) that serves the target user group hits the market.

Spending time and energy on this pre-production process avoids long, expensive and often unnecessary programming time. Failure to do so usually leads to a very negative user experience.

What is AI?

Artificial intelligence (AI) refers to the simulation of cognitive functions by a machine. AI is strongly based upon learning and problem solving and is often called ‘non-biological intelligence’.

AI does not understand or learn in the same way we do. It carries out the tasks that a human has programmed it to do and gradually improves its performance by way of constantly shifting algorithms.


What is conversational AI?

You will already have heard of and probably use at least one of the best-known conversational AI systems: Google Home, Alexa and Siri. Conversational AIs are found in car navigation systems, home automation networks, hospitals and large corporate concerns, and fulfill a wide range of tasks.

An AI process can be integrated into a user interface (UI) or voice user interface (VUI) in the form of hardware or software. Conversational AI enables a UI to communicate with a user in natural language. To do this, the system requires a speech-to-text engine to transcribe spoken input and a text-to-speech (TTS) engine which can read digital text out loud. Conversational AI ‘understands’ spoken user commands and utterances, and various algorithms decipher this information.

Once deciphered, conversational AI interprets what has been said and either answers the question or performs the requested action. This all seems relatively simple; however, human voice and verbal expression, human culture and human values make this – at present – an extremely daunting task for those who wish to do it right.


The AI design process

When creating a VUI, one does not begin with an existing AI system. First, a strong foundation must be created in order to process programmed dialogues. Once this has been achieved, both VUI and user conversations must be tested time and time again.

The very first step is always the concept. For the VUI, this concept is made possible through the user’s voice. Validating such a broad concept requires a plethora of surveys and tests. Only once insights have been accumulated is it time to begin to define and design your VUI. Every step of the way, design must evolve through continuous testing until a validated functional concept emerges. Amendments are an integrated and essential step of AI design before, during and long after development. Only when a minimum viable product has been established based on a series of dialogues can you begin with big data collection gathered from user insights.


Stage One: generating insights through user surveys and tests

The following steps need to be taken well before producing the prototype of a design that responds to specific requests. Avoidance of any of the following ten steps can lead to problems further along the line – one of the reasons why quality VUI is an expensive choice to make. However, the lack of similar quality competition also makes these costs worthwhile for large target markets.

1. Design Thinking

One’s goal must match the needs of the target market. For example, a social robot for elderly populations to help relieve feelings of loneliness is merely a starting point. To validate this concept, one needs to implement the entire process of Design Thinking. This particular process concentrates on the needs of the customer. Upon completion of this process a concept can be considered to be validated. Only then is it time to begin the AI design process.


2. User conversation insights

When designing voice user interfaces it is essential to research how your target user communicates, as only then can a dialogue be developed. Conduct conversations with intended users and record sentences. Respond to the user in order to further understand his or her responses and speech behaviours. This research will allow you to develop mental model mapping of your target group.

In any field where speech is central, a user’s upbringing and cultural background will influence how that person communicates. While a group of newly retired individuals will largely understand and use modern terminology, today’s over-80s will have a completely different take. References to Betty Grable’s legs will ring a bell for elderly US audiences, but will fly well over the heads of most young Asians.

Without profiling this particular dimension, and without being empathic and adaptive to social and cultural norms, this step is near impossible. One cannot stress enough that the creation of a voice user interface is done purely for a better user experience. The joy of the human animal is that no two are alike; every individual has gone through different experiences which make him or her the complex personality they are today. These variations continue throughout life. If you want to design a VUI, you need to know exactly how your validated group communicates and carry on observing them as they evolve.


3. UI personality

User surveys glean insights into user speech behaviours. It is therefore important to provide a VUI that matches these behaviours. Picturing the interface as an actor portraying a character can be helpful when writing VUI dialogues. Actors use both verbal and non-verbal communication to present themselves in different lights. Each character they create must be distinct and recognisable, and is the result of careful study rather than an ad-libbed event. Speech design is no different. In a care setting, a robotic voice and lack of empathy will not fit the role. As a navigational aid, long conversation can be a dangerous distraction. Also, when defining character and personality, bear in mind that the character must be accepted by the user, and may not match your idea of the perfect representative of your company or brand.

Current research into VUI in retirement homes shows elderly populations often treat social robots as they would their grandchildren. In this particular case, design would therefore consider a more childlike voice. The compact form of such robots also mirrors that of a younger child. These robots are often given human names; “Google” may not be the best choice in these circumstances. For the home automation interface, one expects short user commands and a slightly servile response. Complications arise in all-in-one devices that are expected to act as dictionary, home automation control, social interaction and media centre.

Once the general character has been defined, further nuances exist. For a social robot in retirement home settings, the personality and speech patterns of a curious child will encourage recall and conversation, while a shy character will not; if speech does not play a part in fulfilling the concept, one may as well buy a budgie.


4. Conversation map

Insights gathered during user surveys should be used to create a conversation map. A conversation map gives you an initial overview of possible dialogues and is a timeline packed full of topics and actions that a person may encounter during the journey, activity or day. Different scenarios and possible dialogues should be thoroughly brainstormed, as this timeline is the foundation of all conversational AI design.


5. Conversation map dialogues

Once again, it is worth considering the role of the actor when writing conversation map dialogues – specifically the screenplay from which they are given their cues. Screenplay dialogues consist of signal words that lead to action and reaction.

The most important part of a conversation map dialogue is the correct selection and use of signal words or tones to which a response is given. Intonation is a good example: a questioning voice elicits an answer, while a flat tone sounds more like a statement and is not responded to.

With the timeline style of the conversation map, short dialogues can be added for each topic. At the same time, one must ensure that all VUI dialogue fits the character type as described in step three. A conversation map dialogue for the social robot might show that every Sunday morning at 9, Mr X turns on ABC Classic.

Fact: at 9 a.m., Mr X turns on ABC Classic

Signal: if (time = 9 a.m.)

Social Robot: Mr X, do you feel like a little music? I’ll turn on the radio. *Turns on radio*

In this case, the VUI is configured to respond to the habits of the user. To increase social interaction, telephone calls may be programmed in order to maintain contact with family and friends. Times can be configured to fit in with Mr X’s day – phoning a grandchild after an evening meal or reminding him of the neighbour’s birthday on the morning of the event, for example. Personalisation can range from preferred route choice to room temperature, meal times and even menus: “Ms X, is it time to defrost the chicken?”
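The habit rule above can be pictured as a simple time-based trigger. The sketch below is purely illustrative: the `HABITS` structure, field names and `turn_on_radio` action label are assumptions for this example, not part of any real robot API.

```python
from datetime import datetime, time

# Illustrative habit rules derived from observed user behaviour.
# Structure and names are assumptions for this sketch.
HABITS = [
    {
        "signal": time(9, 0),  # fact: at 9 a.m., Mr X turns on ABC Classic
        "prompt": "Mr X, do you feel like a little music? I'll turn on the radio.",
        "action": "turn_on_radio",
    },
]

def check_habits(now: datetime) -> list[tuple[str, str]]:
    """Return (prompt, action) pairs whose signal time matches the current time."""
    fired = []
    for habit in HABITS:
        signal = habit["signal"]
        if (now.hour, now.minute) == (signal.hour, signal.minute):
            fired.append((habit["prompt"], habit["action"]))
    return fired
```

At 9:00 the rule fires and the robot speaks before acting; at any other minute the list comes back empty and the robot stays silent.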


6. Wizard of Oz testing

Conversation map dialogues can be tested via computer according to the Wizard of Oz method, or Oz paradigm. This requires a human operator who is not seen by the user and who plays the part of the system (hence the Wizard of Oz). This method immediately shows how a user responds, without having to pay for expensive programming.

In Wizard of Oz tests, the dialogue of each script is followed to the letter and notes are taken. It is very easy to see where a conversation goes wrong and so correct these mistakes. It is similarly easy to see which conversations and actions run smoothly. Speech development can be further fine-tuned at this stage, too.


7. Guerrilla testing

An alternative to the Wizard of Oz method, but often used as an extra level of testing, guerrilla testing collects feedback from users. Random individuals similar to your target group use a website or respond to an interview consisting of the phrases implemented in your VUI dialogue. The results of these interactions are recorded. For groups without a customer base this is often the only conversation map testing alternative. It is also the cheaper option. That said, this pre-design step should always be allocated a healthy budget.


8. User-expression databank

Iterative testing goes hand in hand with VUI development. This includes the ongoing collection of statements to create a comprehensive database for conversational VUI. At a later phase, the VUI can implement machine learning. This can only happen later on, as machine learning requires datasets of sufficient size in order to examine, compare and discover common patterns and investigate nuances. One only needs to consider how many ways there are to say hello or goodbye to realise the importance of a broad user-expression databank. Without an initial database, AI will have very little to go on.
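A user-expression databank of the kind described above can start as something very simple: utterance variants grouped per intent, normalised so duplicates collapse. The class and the sample phrases below are illustrative assumptions, a minimal sketch rather than a production design.

```python
from collections import defaultdict

class ExpressionDatabank:
    """Minimal databank: utterance variants collected per intent."""

    def __init__(self) -> None:
        self._intents = defaultdict(set)

    def record(self, intent: str, utterance: str) -> None:
        # Normalise so 'Hello' and 'HELLO' count as one variant.
        self._intents[intent].add(utterance.strip().lower())

    def variants(self, intent: str) -> set[str]:
        return set(self._intents[intent])

db = ExpressionDatabank()
for phrase in ["Hello", "hi there", "good morning", "HELLO"]:
    db.record("greeting", phrase)
```

After normalisation the four recorded greetings collapse to three distinct variants; testing keeps feeding new variants into the same store, which later becomes the training data for machine learning.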


9. Define script blocks

When writing conversation dialogues, mark the most important signal and keywords and sketch out each conversation based on script blocks. Script blocks form the framework that determines how a conversation proceeds.

Beginning with the starting phrase that keeps a conversation active, a conversation mode follows in which signal and keywords can indicate specific responses. Small words such as ‘so’ and ‘good’ can change how a conversation proceeds. Central to a conversation – its heart – is the reason why this particular VUI has been designed. Script blocks should also include choices based upon positive and negative values, a response to a request for help however this is worded, and those dialogues that indicate a conversation is to end.

10. Define values

Every user’s language differs, as does the way in which they communicate, due to their own beliefs and emotions. This is not the same as the nuances listed under conversation insights; now is the moment to concentrate on individual values, needs, preferences and feelings, and on how these are expressed.

When determining values for empathic VUIs in social settings, non-lexical sounds also matter. ‘Hmm’, ‘ooh’, ‘okay’, and ‘uh-huh’ sound more human to the user and can give more of a sense of security and understanding. For home automation systems, non-lexical sounds should be absent or more structured, although this often detracts from the feeling of natural conversation. Finding the right balance between voice and functionality as a response to an individual’s values comes naturally to most people but is difficult for machines to mimic.
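That balance could be expressed as a per-character configuration: an empathic character prepends a non-lexical filler, a functional one answers plainly. The profile names and filler lists below are assumptions made up for this sketch.

```python
import random

# Per-character response styling. Profile names and filler words are
# illustrative assumptions, not a standard configuration.
PROFILES = {
    "social_robot":    {"fillers": ["Hmm", "Ooh", "Okay"], "use_fillers": True},
    "home_automation": {"fillers": [],                     "use_fillers": False},
}

def style_response(profile: str, text: str, rng: random.Random) -> str:
    """Prepend a non-lexical filler for empathic profiles; pass through otherwise."""
    config = PROFILES[profile]
    if config["use_fillers"] and config["fillers"]:
        return f"{rng.choice(config['fillers'])}, {text[0].lower()}{text[1:]}"
    return text
```

The same underlying answer is softened for the social robot and left terse for the home automation system, which is exactly the functional trade-off described above.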


Stage Two: Designing the right functionalities

The three-step process of stage two is all about creating a product that meets the absolute needs of your target user.

1. Design functionalities

Determining exactly what users want the VUI to do means yet more market research. This research should include interviews with experts and should look at the needs of the user and of those groups that are in some way associated with them, such as caregivers, other family members, and affiliates. However, one should not lose sight of the fact that this is a product designed for the user and not for your company.


2. How users give orders and express intentions

A VUI has various functions which can be activated by the user or system. If the user wants the VUI to do something, the VUI must be designed and programmed to recognise the commands that trigger certain actions.

A spoken command should ensure that an action is executed. Because not all people express their intentions in the same way, commands must be established by way of user tests. It is important that different ways of giving a command are collected in order to create a comprehensive conversation. For broad user groups, such as the large populations now installing home automation networks in their homes, a certain phrasing should be encouraged, but alternative expressions should also be recognised. This is a monumental task. One only needs to browse YouTube to see the many videos showing how a VUI got it completely wrong.


3. Create a comprehensive database with spoken commands, intentions and values

It is finally time to collect (and continue to collect) spoken commands, intentions and values in order to provide a comprehensive database for the conversational AI.

Often, single expressions used for a VUI are not as clear-cut as we might expect. It is surprising how many individuals, when asked how they would instruct an artificial intelligence to make a call, do not even use the term. ‘Phone’, ‘talk to’ and ‘ring’ are just some of the possibilities. The larger the target group, the greater the chance that one command option will be insufficient. In these cases it is sometimes worthwhile having the VUI ask a confirmation question using the term you wish to make primary: “Do you want me to call …?”

The more complex your VUI, the more challenging the act of conversation. The larger your target user group, the greater the language gap. It is therefore important to add fall-back features which bring the interface back to an existing functionality.
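The synonym, confirmation and fall-back behaviour described above can be sketched in a few lines. The synonym set and the response strings are assumptions chosen for this example only.

```python
# Command matching with synonyms, a confirmation question for secondary
# phrasings, and a fall-back that returns the user to a known functionality.
# The vocabulary and responses are illustrative assumptions.
CALL_SYNONYMS = {"call", "phone", "ring", "talk"}

def interpret(utterance: str) -> str:
    words = set(utterance.lower().replace("?", "").split())
    matched = words & CALL_SYNONYMS
    if "call" in matched:
        return "ACTION: start call"       # primary term: act directly
    if matched:
        # A secondary synonym was used: confirm with the primary term.
        return "Do you want me to call someone?"
    # Fall-back: bring the interface back to an existing functionality.
    return "Sorry, I didn't catch that. You can ask me to call someone."
```

Routing every unrecognised utterance to a fall-back sentence is what keeps a large-vocabulary user group from hitting a dead end.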

Stage Three: develop and continue iterative testing

There is no end to AI design.

Developers must continue to collect answers, signal words and user expressions. They must add to the database of fall-back sentences. They must integrate new routes, equipment or people (voices). This process should continue until conversation seamlessly connects to function.

The most obvious way to speed up this process is to collect personal facts in advance and programme these into the VUI. In terms of the social robot, it can be made aware of family members, places visited, previous addresses and so on. For all-in-one home systems, each family member has their own profile. Naturally, data protection and transparency in such circumstances are essential; remaining open about collected and stored data with the target user but protecting their information from others can sometimes be a fine line to tread.

Personalisation goes much further, however. An older person or someone who is hard of hearing may need longer to respond. Short response windows can lead to interruption and confusion for these groups, but cause frustration and impatience in others. Visual signals, such as a coloured light when the VUI is speaking and a different colour when it is waiting for a response, can help and improve signal interpretation. Finally, designing with emotion in mind turns a robotic, soulless VUI into something approaching a human. Facial expressions on certain robot types can significantly improve user experience. Voice tone and personality choice can turn a potentially unfriendly character into a pleasant one with which the user enjoys interacting.


Ready to design your VUI?

Only through repeated testing can one create an MVP with the personality, speech behaviours and functions that meet the needs of the user. This process must always precede AI programming itself.

Designing and providing a voice user interface is an expensive undertaking, and high programming fees caused by skipping preparatory data collection further increase the bill. While the majority of businesses are not yet ready to justify these costs to produce an on-premise version for staff, the future of such technology inside the office is on the rise. Outside the office, VUIs are becoming ever more popular. Whether this trend continues depends on future developers’ willingness to study and learn from the highly complex topic of human communication.
