Guide

How to create voice-activated and conversational UI prototypes using specialized platforms

Published December 14, 2025

Nobody clicks a voice interface. You speak to it, and it either understands or it does not. There is no hover state to guide you, no button to discover by exploring. This changes everything about how you prototype, test, and iterate on voice experiences.

Last month I tested a voice prototype where the demo script worked perfectly, but any deviation from the exact wording caused complete failure. "Show my calendar" worked. "What's on my calendar" did not. "Display my schedule" triggered an error. The prototype was technically functional but practically useless. Real users do not follow scripts. They speak naturally, and voice interfaces must understand natural speech. (Is it still "functional" if it only follows the script? No, not in the way users actually use it.)

Here is the thesis: voice and conversational UI prototypes require different tools, different testing methods, and different success metrics than visual prototypes. Teams that apply visual prototyping habits to voice interfaces build demos that work in presentations and fail in the real world. (Are you optimizing for the presentation? If so, you will see this gap fast.)

Why Voice Prototyping Is Fundamentally Different

Visual interfaces have affordances. Buttons look clickable. Links change color on hover. Users discover features by exploring the screen, clicking around, and seeing what happens. The interface teaches itself through visual cues. If you do not understand something, you can look at it, study it, and figure it out.

Voice interfaces have no visual affordances. Users must guess what commands work. If they guess wrong, the experience breaks. There is no explore mode where you can wander around testing possibilities. You either know the right incantation or you get an error. This means voice prototypes must account for the enormous variation in how humans actually speak. (Do you know the "right incantation" your users will try? Not unless you capture how they actually phrase it.)

This is what I mean by utterance diversity. The basic gist is this: ten users asking for the same thing will phrase it ten different ways. Your prototype must handle this variation, or your testing will only validate the happy path. The demo might look great, but the real-world experience will frustrate users who do not happen to use your exact phrasing. (Do you have an utterance library yet? If not, this is where it starts.)

Consider a simple request: "Set an alarm for 7 AM tomorrow." Users might say:

  • "Wake me up at 7 tomorrow morning"
  • "Alarm for 7 AM"
  • "I need to get up at 7 tomorrow"
  • "Can you set a 7 o'clock alarm for tomorrow?"
  • "Seven AM alarm please"

Each phrasing expresses the same intent. A voice interface that only recognizes one of them fails four of these five users. Voice prototypes must grapple with this diversity from day one. (How many variations do you test right now? If the answer is one, that is the gap.)
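
To make the gap concrete, here is a minimal Python sketch using the alarm intent above with hypothetical training data. It only shows how little a single hard-coded phrasing covers compared to even a small utterance list:

```python
# Minimal illustration: one intent, several phrasings.
# The intent name and utterances are hypothetical example data, not a real product's.

SET_ALARM_UTTERANCES = [
    "set an alarm for 7 am tomorrow",
    "wake me up at 7 tomorrow morning",
    "alarm for 7 am",
    "i need to get up at 7 tomorrow",
    "can you set a 7 oclock alarm for tomorrow",
    "seven am alarm please",
]

def matches_intent(utterance: str, known_phrasings: list[str]) -> bool:
    """Naive matcher: exact match after lowercasing and stripping punctuation.

    Real NLU generalizes beyond its training phrases; this only shows why a
    single hard-coded phrase covers so little of what users actually say.
    """
    normalized = "".join(
        ch for ch in utterance.lower() if ch.isalnum() or ch.isspace()
    ).strip()
    return normalized in known_phrasings

# A prototype that only knows the first phrasing fails the other users:
print(matches_intent("Set an alarm for 7 AM tomorrow.", SET_ALARM_UTTERANCES[:1]))   # True
print(matches_intent("Wake me up at 7 tomorrow morning", SET_ALARM_UTTERANCES[:1]))  # False
print(matches_intent("Seven AM alarm, please!", SET_ALARM_UTTERANCES))               # True
```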

flowchart TD
    A[User Intent] --> B[Spoken Utterance]
    B --> C{Voice Recognition}
    C --> D[Intent Classification]
    D --> E{Match Found?}
    E -->|Yes| F[Execute Action]
    E -->|No| G[Fallback Response]
    G --> H[User Rephrase]
    H --> B
    F --> I[Voice Response]
    I --> J[Conversation Continues]
    J --> K{User Satisfied?}
    K -->|Yes| L[Success]
    K -->|No| M[Clarification Loop]
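
The loop in that flow can be sketched in a few lines of Python. This is a toy harness with made-up intent names, not any platform's runtime; it only shows where the fallback branch lives and how a rephrase re-enters the loop:

```python
# Toy dialogue loop mirroring the flowchart: classify each utterance,
# execute on a match, and fall back with guidance when nothing matches.
# classify() is a stand-in for a real NLU call; intent names are hypothetical.

def classify(utterance: str) -> str | None:
    keywords = {"alarm": "set_alarm", "calendar": "show_calendar", "schedule": "show_calendar"}
    for word, intent in keywords.items():
        if word in utterance.lower():
            return intent
    return None

RESPONSES = {
    "set_alarm": "Alarm set for 7 AM tomorrow.",
    "show_calendar": "You have two meetings tomorrow.",
}

def conversation(turns: list[str]) -> None:
    for utterance in turns:
        intent = classify(utterance)
        if intent is None:
            # Fallback branch: guide the user toward a rephrase instead of dead-ending.
            print("Sorry, I can help with alarms and your calendar. Could you rephrase?")
            continue
        print(RESPONSES[intent])

conversation(["Display my plans", "What's on my schedule tomorrow?"])
```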


Specialized Platforms for Voice Prototyping

The tooling landscape for voice prototyping has matured significantly. Each platform has different strengths depending on your deployment target and technical requirements. (Do you need voice, chat, or both? Decide that first, because it drives everything that follows.)

Voiceflow is one of the most widely used platforms for conversational design. It supports both voice (Alexa, Google Assistant) and chat interfaces with a unified design approach. The visual flow builder lets designers create conversation trees without code. You drag and drop conversation blocks, define intents and responses, and test directly in the platform. The learning curve is gentle, which matters for design teams without deep technical expertise.

Botpress focuses on text-based conversational UI with strong natural language understanding. It handles complex dialogues with branching logic and maintains context across long conversations. If your primary interface is chat rather than voice, Botpress offers more depth in text handling.

Dialogflow from Google provides enterprise-grade intent recognition. It powers many production voice assistants and offers robust testing tools. The machine learning behind intent matching is sophisticated, and you benefit from Google's scale in natural language processing. The tradeoff is complexity: Dialogflow requires more technical setup than simpler tools.
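
If Dialogflow is your platform, a prototype harness can call its detect-intent API directly and log what the agent matched. The sketch below follows the pattern of the google-cloud-dialogflow Python client's quickstart; the project id and session id are placeholders, and the exact client surface may vary by library version, so treat this as a starting point rather than a reference:

```python
# Sketch: query a Dialogflow ES agent from a test harness and log what it matched.
# Requires the google-cloud-dialogflow package and Google Cloud credentials;
# "my-voice-prototype" and "test-session-001" are placeholder values.
from google.cloud import dialogflow

def detect_intent(project_id: str, session_id: str, text: str) -> None:
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(project_id, session_id)

    text_input = dialogflow.TextInput(text=text, language_code="en-US")
    query_input = dialogflow.QueryInput(text=text_input)
    response = session_client.detect_intent(
        request={"session": session, "query_input": query_input}
    )

    result = response.query_result
    # Logging misses and low-confidence matches is how the utterance library grows.
    print(f"utterance: {text!r}")
    print(f"matched intent: {result.intent.display_name} "
          f"(confidence {result.intent_detection_confidence:.2f})")
    print(f"agent reply: {result.fulfillment_text}")

detect_intent("my-voice-prototype", "test-session-001", "What's on my calendar tomorrow?")
```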

Amazon Lex is built on the same conversational AI technology that powers Alexa. If you are building in Amazon's ecosystem, Lex provides the closest prototype-to-production path: what you build in Lex can become your production system with minimal changes. The integration with other AWS services makes it attractive for teams already invested in that infrastructure.
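
That prototype-to-production claim is easiest to see through the runtime API: the bot your prototype queries during testing is the same bot that can serve production traffic. A rough sketch with boto3 and the Lex V2 runtime follows; the bot id, alias, and locale are placeholders, and it is worth checking the current boto3 documentation before depending on the exact call:

```python
# Sketch: send a text utterance to an Amazon Lex V2 bot from a test script.
# Bot id, alias id, and locale are placeholder values for illustration.
import boto3

lex = boto3.client("lexv2-runtime")

def ask_bot(text: str, session_id: str = "prototype-test-1") -> None:
    response = lex.recognize_text(
        botId="EXAMPLEBOTID",
        botAliasId="EXAMPLEALIAS",
        localeId="en_US",
        sessionId=session_id,
        text=text,
    )
    intent = response["sessionState"]["intent"]["name"]
    messages = [m.get("content", "") for m in response.get("messages", [])]
    print(f"{text!r} -> intent {intent}: {messages}")

ask_bot("Wake me up at 7 tomorrow morning")
```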

For simpler prototypes, Botsociety offers quick mockups of conversational interfaces without deep NLU integration. It is good for stakeholder presentations and early concept validation when you do not yet need sophisticated intent recognition.

How do you choose between these? Consider your deployment target. Building for Alexa? Use Voiceflow or Lex. Building a website chatbot? Botpress or Dialogflow. Need quick stakeholder demos? Botsociety for speed. The closer your prototype tool matches your production platform, the less translation work later. (Do you want the closest prototype-to-production path? Then match the platform early.)

Designing Effective Voice Conversation Flows

Start with intent mapping. What are users trying to accomplish? List every intent your voice interface should handle, then list ten ways users might express each intent. This utterance library becomes your training data and your test cases. (Want a quick sanity check? Try writing ten ways yourself, then compare to real users.)

Be exhaustive in this phase. Interview real users about how they would phrase requests. Run surveys asking people to complete the sentence "I want to..." without giving them options. The diversity you discover will surprise you. Users do not speak in the structured language of your product team.

Build explicit confirmation patterns. Voice has no undo button. If the system misunderstands, users need clear recovery paths. "I heard you want to cancel your order. Is that right?" Confirmation prevents costly errors and builds user trust. The brief friction of confirmation is far less costly than executing the wrong action. (Is the friction worth it? Yes, when the alternative is the wrong action.)
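
A confirmation gate is easy to express in a prototype as a check before any destructive action. A minimal sketch, with hypothetical intent names:

```python
# Confirm-before-execute: destructive intents require an explicit yes first.
# Intent names and replies are hypothetical.

DESTRUCTIVE_INTENTS = {"cancel_order", "delete_reminder"}

def handle(intent: str, confirmed: bool = False) -> str:
    if intent in DESTRUCTIVE_INTENTS and not confirmed:
        # Read the interpretation back so the user can catch a misrecognition.
        return f"I heard you want to {intent.replace('_', ' ')}. Is that right?"
    return f"Okay, {intent.replace('_', ' ')} done."

print(handle("cancel_order"))                  # asks for confirmation first
print(handle("cancel_order", confirmed=True))  # executes after an explicit yes
```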

Design for interruption. Users do not wait for systems to finish speaking. Your prototype should handle interruptions gracefully. What happens when a user speaks before the system finishes its response? What happens when they say "stop" or "cancel" mid-flow? Interruption handling is often the difference between a usable interface and a frustrating one. (Should you treat "stop" as a first-class path? Yes, because users will try it.)
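
One way to prototype that is to treat words like "stop" and "cancel" as global intents that preempt whatever flow is active, rather than leaving them to each individual dialogue. A small routing sketch, with hypothetical flow names:

```python
# Global interrupt routing: "stop" and "cancel" always win over the active flow.
# Flow names and replies are hypothetical.

GLOBAL_INTERRUPTS = {"stop", "cancel", "never mind"}

def route(utterance: str, active_flow: str | None) -> str:
    cleaned = utterance.lower().strip(" .!?")
    if cleaned in GLOBAL_INTERRUPTS:
        # Abandon the current flow cleanly instead of forcing the user through it.
        return f"Okay, stopping {active_flow or 'that'}. What would you like to do?"
    if active_flow:
        return f"(continue {active_flow} with {utterance!r})"
    return "(start a new flow)"

print(route("Stop.", active_flow="order_pizza"))          # interrupt wins mid-flow
print(route("Large, please", active_flow="order_pizza"))  # normal turn continues the flow
```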

Create contextual awareness. "Play that song again" only works if the system remembers which song played last. "What's the weather tomorrow" makes sense after "What's the weather today" but requires context. Conversational UI requires session memory that persists across turns. Your prototype should demonstrate how context flows through the conversation. (Does your prototype remember anything across turns? If not, it is not showing the full interaction.)
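
Session memory can be prototyped as a small context object carried across turns. The sketch below, with hypothetical slot names, shows how "play that song again" only resolves because an earlier turn filled the slot:

```python
# Minimal session context carried across turns; slot names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class SessionContext:
    last_song: str | None = None
    history: list[str] = field(default_factory=list)

def handle_turn(utterance: str, ctx: SessionContext) -> str:
    ctx.history.append(utterance)
    text = utterance.lower()
    if "play" in text and "again" in text:
        # "that song" only resolves if a previous turn filled last_song.
        return f"Playing {ctx.last_song} again." if ctx.last_song else "Which song do you mean?"
    if text.startswith("play "):
        ctx.last_song = utterance[5:].strip()
        return f"Playing {ctx.last_song}."
    return "Sorry, I didn't catch that."

ctx = SessionContext()
print(handle_turn("Play Blue in Green", ctx))    # fills the last_song slot
print(handle_turn("Play that song again", ctx))  # resolves "that song" from context
```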

Testing Voice Prototypes Properly

Voice prototype testing differs from visual prototype testing in fundamental ways. You cannot watch someone's eyes to see where they look. You cannot observe their mouse cursor hovering over options. You must listen to what they say and watch their frustration when the system misunderstands. (What should you watch instead of the cursor? The words they choose, and the rephrases they attempt.)

Wizard of Oz testing works well for early validation. A human operator handles responses behind the scenes while users interact naturally. The user thinks they are talking to a system, but a human is interpreting their requests and generating responses. This reveals whether your conversation design works before you invest in NLU training. You discover which intents users expect that you did not anticipate. (Should you do this before heavy intent work? Often, yes, because it exposes expectations early.)

Diverse speaker testing is essential. Accents, speech patterns, and vocabulary vary widely. A prototype trained on your team's voices will fail with users who speak differently. Test with people who have different accents, different ages, different levels of comfort with voice technology. The elderly user who says "telephone my daughter" instead of "call my daughter" is just as valid as the tech-savvy user who knows the expected commands.

Edge case probing matters more here than it does for visual prototypes. What happens when users say "um" repeatedly? What happens when background noise interferes? What happens when users go silent mid-conversation? What happens when children speak to a device intended for adults? These edge cases are common in real environments.

Test in realistic environments. Voice interfaces that work in quiet offices fail in noisy kitchens. If your product will be used while cooking, test while something is frying. If it will be used while driving, test with road noise. Environmental factors dramatically affect voice recognition accuracy.

Common Voice Prototyping Mistakes

The first mistake is script-dependent demos. If your prototype only works when users follow a specific script, it will fail in production. Real users do not read scripts. They speak their minds in their own words. Build flexibility from the start, not as an afterthought.

The second mistake is ignoring error states. Voice interfaces fail frequently. Recognition is imperfect. Background noise interferes. Users ask for things you do not support. Your prototype should demonstrate graceful failure, not just successful paths. What does the system say when it does not understand? How does it guide users toward successful interactions? (Do you have a real fallback response, or just an error? Show the guidance.)

The third mistake is over-promising capability. Early prototypes should set realistic expectations. Users who expect human-level understanding will be disappointed by intent-matching systems. Design your prototype's persona to suggest its capabilities accurately. A prototype that says "I can help you with scheduling and reminders" sets better expectations than one that says "I can help you with anything."

The fourth mistake is forgetting the visual component. Many voice interfaces include screens (Echo Show, Google Nest Hub, car displays). Your prototype should include multimodal design where voice and visual work together. What appears on screen while the user speaks? What visual feedback confirms that the system is listening? (Is the screen doing real work, or just decorating the demo? Make the handoff explicit.)

The fifth mistake is not testing silence. What happens when users do not respond? How long does the system wait? What does it say to prompt continued conversation? Silence handling is awkward in demos but critical in production.
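
Silence handling can be prototyped as a simple reprompt ladder: wait, reprompt with more guidance, then end the session gracefully. A sketch with illustrative timings and copy:

```python
# Reprompt ladder for silence; the timeout, wording, and listen() stub are illustrative.

REPROMPTS = [
    "Are you still there?",
    "You can say things like 'set an alarm' or 'show my calendar'.",
]

def wait_for_speech(listen, timeout_seconds: float = 6.0) -> str | None:
    """listen(timeout) stands in for speech capture; returns text, or None on silence."""
    for reprompt in REPROMPTS:
        utterance = listen(timeout_seconds)
        if utterance:
            return utterance
        print(f"assistant: {reprompt}")
    # One last chance after the final reprompt, then end without nagging.
    utterance = listen(timeout_seconds)
    if utterance:
        return utterance
    print("assistant: Okay, I'll be here if you need me.")
    return None

wait_for_speech(lambda timeout: None)  # simulates a user who stays silent
```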

Connecting Voice Prototypes to Visual Product Design

Voice interfaces often complement visual applications. A user might speak a command that triggers a visual response on their phone or smart display. This means your voice prototype should connect to your product's visual design system.

Tools like Figr help here by generating visual interfaces that match your product language. When the voice command triggers a screen, that screen should look like your application, not a generic mockup. If a user says "show me my recent orders" and a screen appears, it should match the visual style of your e-commerce app. (Do you want the screen to feel native? Yes, because the voice command is only half the experience.)

Design the multimodal experience holistically. When should responses be spoken versus displayed? Complex information (lists, tables) often works better visually. Simple confirmations often work better spoken. Your prototype should demonstrate these handoffs. (Should every response be spoken? No, not when a list or table is the point.)
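
The speak-versus-display decision can be made an explicit rule in the prototype instead of an implicit choice in the demo script. A rough sketch, with made-up thresholds:

```python
# Rough modality rule: short confirmations are spoken; lists go to the screen
# with a brief spoken summary. The threshold and wording are made up.

def choose_modality(response_items: list[str]) -> dict:
    if len(response_items) <= 1:
        spoken = response_items[0] if response_items else "Done."
        return {"speak": spoken, "display": None}
    # Lists and tables read badly aloud; summarize by voice, show detail on screen.
    return {
        "speak": f"I found {len(response_items)} results. Here they are on screen.",
        "display": response_items,
    }

print(choose_modality(["Your order has been cancelled."]))
print(choose_modality(["Order #1001", "Order #1002", "Order #1003"]))
```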

Consider accessibility throughout. Voice interfaces are essential for users who cannot use visual interfaces. But voice-only users need different conversation design than users who also see a screen. Your prototype might need multiple variants for different contexts.

Building Utterance Libraries

The quality of your voice prototype depends on the breadth of your utterance training data. Build systematic processes for collecting and expanding utterances.

Start with brainstorming sessions. Gather your team and generate as many phrasings as possible for each intent. Set quotas: nobody leaves until you have twenty variations of each core intent.

Expand with user research. Interview target users about how they would ask for things. Watch recordings of people using competitor products. Mine support tickets for natural language describing user goals.

Use AI to expand systematically. Tools like ChatGPT can generate utterance variations given a seed intent. "Give me 30 ways someone might ask to reschedule a meeting" produces diverse options quickly. (Want the fastest first pass? Use a seed intent, then compare to real user phrasing.)
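
As one concrete way to do this, a short script can generate a first pass of candidates and leave the filtering to a human. The sketch below uses the OpenAI Python SDK; the model name and prompt are placeholders, and the generated phrasings are candidates to check against real user language, not ground truth:

```python
# Generate candidate utterances for human review; model name and prompt are placeholders.
# Requires the openai package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def expand_utterances(intent_description: str, count: int = 30) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Give me {count} ways someone might say: {intent_description}. "
                       "One per line, no numbering.",
        }],
    )
    text = response.choices[0].message.content or ""
    return [line.strip() for line in text.splitlines() if line.strip()]

candidates = expand_utterances("reschedule a meeting")
print(len(candidates), "candidate phrasings to review")
```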

Test coverage by trying to break your prototype. Approach it as a hostile user who deliberately uses unusual phrasings. Whatever breaks the prototype becomes new training data.

Document your utterance library formally. It is an asset you will build on over time. When you add new intents, you can look at how similar intents were phrased for inspiration.

Measuring Voice Prototype Success

Define success metrics before testing. Task completion rate measures whether users accomplish their goals. Time to completion measures efficiency. Error rate measures how often the system misunderstands. Recovery rate measures how often users successfully correct errors. (Which metric is your leading indicator? Pick one before you run sessions.)
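
These metrics fall out of session logs with very little code. A minimal sketch over a hypothetical log format:

```python
# Computing the four metrics from test-session logs; the log format is hypothetical.
sessions = [
    {"completed": True,  "seconds": 18, "misunderstandings": 0, "recovered": 0},
    {"completed": True,  "seconds": 41, "misunderstandings": 2, "recovered": 2},
    {"completed": False, "seconds": 65, "misunderstandings": 3, "recovered": 1},
]

task_completion_rate = sum(s["completed"] for s in sessions) / len(sessions)
avg_time_to_completion = (
    sum(s["seconds"] for s in sessions if s["completed"])
    / max(1, sum(s["completed"] for s in sessions))
)
total_errors = sum(s["misunderstandings"] for s in sessions)
error_rate = total_errors / len(sessions)                        # misunderstandings per session
recovery_rate = sum(s["recovered"] for s in sessions) / max(1, total_errors)

print(f"completion {task_completion_rate:.0%}, avg time {avg_time_to_completion:.0f}s, "
      f"errors/session {error_rate:.1f}, recovery {recovery_rate:.0%}")
```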

User satisfaction surveys after testing reveal subjective experience. Was the conversation natural? Did the system understand them? Would they use this again?

Compare metrics across user segments. Do native speakers perform better than non-native speakers? Do tech-savvy users perform better than tech-hesitant users? Segment analysis reveals where your prototype needs improvement.

Track longitudinal improvement. As you refine utterance libraries and conversation flows, metrics should improve. If they do not, you are optimizing the wrong things.

The Takeaway

Voice and conversational UI prototyping demands different thinking than visual design. Invest in platforms that handle utterance variation, test with diverse speakers in realistic environments, and design for graceful failure. Build comprehensive utterance libraries and test edge cases aggressively. Connect voice prototypes to visual design systems for multimodal experiences. The goal is not a perfect demo; it is a system that handles real human speech in all its messy variety.