“Your customer has five senses and a small universe of devices. Why aren’t you designing for all of them?”
(Cheryl Platz)
Interfaces are quietly moving beyond the glass. Text, images, voice, video: AI doesn’t care which you use first. The most natural products will blend modalities natively, adapting to context the way people do. This isn’t sci-fi; it’s the practical next step for teams who want fewer clicks, faster outcomes, and happier customers. (multimodal AI use cases)
So, what actually changes for your product when screens become scenes? Short answer: orchestration becomes the job.
The shift: from screens to scenes
For years, we designed for screens; now we’re designing for scenes (moments where intent, environment, and available modalities shift second by second). Multimodal design adds orchestration on top of voice-only or GUI-only thinking: which mode leads, which supports, and how they trade off without friction. (demystifying multimodal design talk)
Cheryl Platz argues that most designers already do multimodal work (even if unintentionally) and that the real job is managing transitions across inputs and outputs and devices. Those transitions (hands-free, hands-on, public, private, phone, desktop) make or break the experience. (why transitions make or break the experience)
The orchestration mindset
Don’t ask “voice or touch?” Ask “Which sequence of modalities best fits this scene, and how do we signal and hand off between them cleanly?” (orchestration over modality choice)
What leads when hands are busy? Let voice take point, then land in a quick visual check so users can confirm or fix.
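To make that question concrete, here is a minimal TypeScript sketch of a scene-to-orchestration decision. The Scene flags, Modality names, and orchestrate function are illustrative assumptions, not a prescribed API; the point is that the decision is explicit and testable rather than buried inside individual components.

```typescript
// Illustrative sketch: which modality leads and which confirms, given the scene.
type Modality = "voice" | "touch" | "vision" | "text";

interface Scene {
  handsBusy: boolean;    // driving, cooking, field work
  eyesBusy: boolean;     // walking, operating equipment
  noisy: boolean;        // open office, street
  isPublic: boolean;     // shared versus private space
  lowBandwidth: boolean; // degraded connectivity
}

interface Orchestration {
  lead: Modality;           // fastest way to capture intent in this scene
  confirm: Modality;        // clearest way to review and correct
  onDeviceCapture: boolean; // prefer local capture for latency and privacy
}

function orchestrate(scene: Scene): Orchestration {
  // Voice takes point when hands or eyes are busy and the environment allows it.
  if ((scene.handsBusy || scene.eyesBusy) && !scene.noisy && !scene.isPublic) {
    return { lead: "voice", confirm: scene.eyesBusy ? "voice" : "touch", onDeviceCapture: scene.lowBandwidth };
  }
  // Public or noisy scenes fall back to typed input with a visual check.
  if (scene.noisy || scene.isPublic) {
    return { lead: "text", confirm: "touch", onDeviceCapture: scene.lowBandwidth };
  }
  // Default: touch-led capture with visual confirmation.
  return { lead: "touch", confirm: "touch", onDeviceCapture: scene.lowBandwidth };
}
```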
Why now: AI made multimodal usable
Modern models interpret text, images, audio, and video within one reasoning loop, so products can pair what you say with what they see: summarize a meeting, answer questions about a dashboard screenshot, or describe a chart while you point. Healthcare, for example, can fuse notes, scans, and vitals for better decisions.
And the demand side is real: by 2025, about 20.5% of people worldwide use voice search, and there are roughly 8.4 billion voice assistants in use. That’s not niche behavior; it’s a mainstream expectation for conversational, hands-free moments.
So, where do you start if your product is screen-first? Begin with one flow where hands or eyes are often busy, then layer a voice lead with a visual finish.
What “multimodal by design” looks like
A quick map
The building blocks (keep them short)

| Modality | Leads when | Supports with | Watch out for |
| --- | --- | --- | --- |
| Voice | Hands or eyes are busy and intent needs capturing fast | Spoken confirmations, short answers | Noise, public spaces, precise edits |
| GUI / touch | Reviewing, refining, confirming | Pre-filled screens, undo | Demands hands and eyes |
| Vision / camera | Showing beats telling (scans, hardware, charts) | Grounding answers in what the user sees | Privacy and consent |
| Text / keyboard | Public or noisy scenes, exact input | Transcripts, skimmable cards | Slower rough-in |
| Haptics / ambient | Quiet status and gentle nudges | Low-distraction escalation cues | Easy to miss, low detail |

(Yes, there’s nuance; this table is a cheat sheet, not dogma.)
Will this slow teams down? Not if you let the fastest mode set intent and the clearest mode handle edits.
Patterns that win (and where they fail)
- Voice to rough-in, GUI to refine. Use speech to set intent (“Create Q3 revenue report”), then land the user on a pre-filled screen for quick confirmation and edits. This pairs speed with precision and mirrors how teams already work; a minimal sketch of this handoff follows the list. (multimodal design best practices)
- Show-and-tell troubleshooting. Let users show the problem (photo or video) while describing symptoms. The system grounds its answer in both. Great for support, field ops, and onboarding. (examples across support and ops)
- Ambient status with explicit control. Keep proactive nudges quiet (haptics or a subtle banner), but make escalation obvious and reversible. Physical remotes remain useful precisely because they are low friction and low distraction. (make escalation obvious and reversible)
- Right-time, not real-time. Time prompts and outputs to when the modality fits: don’t read a long summary aloud during a meeting when a crisp, skimmable card will do. (time outputs to the right moment)
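As a concrete example of the first pattern, here is a minimal TypeScript sketch of a “voice to rough-in, GUI to refine” handoff. The DraftIntent shape, state names, and helpers are assumptions for illustration; the invariant that matters is that voice never auto-commits, and edits stay cheap and reversible.

```typescript
// Illustrative sketch of the voice-to-review handoff; names are hypothetical.
type HandoffState = "capturing" | "reviewing" | "committed" | "cancelled";

interface DraftIntent {
  utterance: string;              // raw transcript, e.g. "Create Q3 revenue report"
  fields: Record<string, string>; // structured slots parsed from the utterance
  confidence: number;             // 0..1 from the speech/NLU layer
}

interface VoiceToReviewHandoff {
  state: HandoffState;
  draft?: DraftIntent;
}

// Voice sets intent; the GUI receives a pre-filled, editable draft to confirm.
// Low-confidence captures still land on the review screen and never auto-commit.
function toReview(draft: DraftIntent): VoiceToReviewHandoff {
  return { state: "reviewing", draft };
}

// Edits on the review screen overwrite individual slots and leave the rest intact.
function applyEdit(h: VoiceToReviewHandoff, field: string, value: string): VoiceToReviewHandoff {
  if (h.state !== "reviewing" || !h.draft) return h;
  return { ...h, draft: { ...h.draft, fields: { ...h.draft.fields, [field]: value } } };
}

// Commit is explicit; cancel is always available and returns the user to capture.
function commit(h: VoiceToReviewHandoff): VoiceToReviewHandoff {
  return h.state === "reviewing" ? { ...h, state: "committed" } : h;
}

function cancel(_h: VoiceToReviewHandoff): VoiceToReviewHandoff {
  return { state: "cancelled" };
}
```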
Where will this break first? In noisy spaces, shared environments, and low-bandwidth conditions. Plan fallbacks and let users switch modes without penalty.
Stats to watch (signal, not gospel)
(Don’t over-index on any one number; triangulate across your own telemetry.)
What is the one metric to watch early? Track time to intent by mode, then watch correction rate after capture.
Guardrails: beyond “No UI”
Golden Krishna’s provocation, “the best interface is no interface,” is a useful north star: if the product can solve the problem without demanding attention, do that. But “invisible” has limits; hiding systems can also hide complexity and power, reducing user agency. Multimodal design embraces both truths: minimize attention, and make control and legibility easy when needed.
“NoUI” is an aspiration to reduce needless friction, not a mandate to remove affordances. Orchestrate, don’t disappear.
(Noah Fang)
How do we avoid creepiness? Prefer on-device capture, explain why you need sensors, and make off switches obvious.
Design system implications
Tokens & components that flex by scene
- Stateful prompts: Same intent field, different defaults and error handling for voice versus typing.
- Explainers everywhere: One component renders as a caption (audio), a popover (GUI), or a transcript (text).
- Transition micro-patterns: Standard ways to hand off (for example, voice to review screen) with consistent focus, selection, and undo.
- Context gates: Policies that switch the leading modality (driving to voice; open office to typed; low bandwidth to on-device); see the sketch after this list. (codify context gates in your DS)
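Here is a minimal sketch of what context gates could look like as declarative design-system policy, assuming illustrative SceneFlags and gate names. The value is that the leading modality becomes data your components read, not logic scattered across them.

```typescript
// Context gates as declarative policy; names and rules are illustrative only.
type Modality = "voice" | "touch" | "text" | "vision";

interface SceneFlags {
  driving: boolean;
  openOffice: boolean;
  lowBandwidth: boolean;
}

interface ContextGate {
  name: string;
  applies: (s: SceneFlags) => boolean; // does this gate match the current scene?
  lead: Modality;                      // modality that takes point when it matches
  captureOnDevice: boolean;            // keep capture local for latency and privacy
}

const gates: ContextGate[] = [
  { name: "driving",       applies: s => s.driving,      lead: "voice", captureOnDevice: true },
  { name: "open-office",   applies: s => s.openOffice,   lead: "text",  captureOnDevice: false },
  { name: "low-bandwidth", applies: s => s.lowBandwidth, lead: "touch", captureOnDevice: true },
];

const defaultGate: ContextGate = { name: "default", applies: () => true, lead: "touch", captureOnDevice: false };

// First matching gate wins; components read `lead` to set defaults and error handling.
function resolveGate(scene: SceneFlags): ContextGate {
  return gates.find(g => g.applies(scene)) ?? defaultGate;
}
```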
Do we need new components? Mostly no. You need flexible tokens and a small set of transition patterns you can reuse.
Research & measurement
- Scene sampling: Observe modality switching in the wild (hands or eyes busy, noise, privacy).
- Task time by mode: Where does voice crush setup but fail on edits? (A telemetry sketch follows this list.)
- Interruption fitness: How often do proactive nudges land at the wrong moment?
- Trust telemetry: Track opt-ins for camera or voice and reasons for opt-out (privacy, reliability). (research prompts and measures)
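To make the measurement side tangible, here is a minimal TypeScript sketch of two early signals, time to intent by mode and correction rate after capture, assuming a hypothetical CaptureEvent log shape; the event fields are illustrative, not a fixed schema.

```typescript
// Illustrative telemetry sketch; the CaptureEvent shape is an assumption.
type Mode = "voice" | "touch" | "text";

interface CaptureEvent {
  mode: Mode;
  startedAt: number;   // ms timestamp when capture began
  intentAt: number;    // ms timestamp when a usable intent was recognized
  corrections: number; // edits the user made after capture, before commit
}

// Time to intent: how quickly each mode turns a scene into a usable intent (average ms).
function timeToIntentByMode(events: CaptureEvent[]): Record<Mode, number> {
  const totals: Record<Mode, { sum: number; n: number }> = {
    voice: { sum: 0, n: 0 }, touch: { sum: 0, n: 0 }, text: { sum: 0, n: 0 },
  };
  for (const e of events) {
    totals[e.mode].sum += e.intentAt - e.startedAt;
    totals[e.mode].n += 1;
  }
  return {
    voice: totals.voice.n ? totals.voice.sum / totals.voice.n : 0,
    touch: totals.touch.n ? totals.touch.sum / totals.touch.n : 0,
    text: totals.text.n ? totals.text.sum / totals.text.n : 0,
  };
}

// Correction rate: share of captures in a mode that needed at least one edit afterwards.
function correctionRate(events: CaptureEvent[], mode: Mode): number {
  const forMode = events.filter(e => e.mode === mode);
  if (forMode.length === 0) return 0;
  return forMode.filter(e => e.corrections > 0).length / forMode.length;
}
```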
Who should own orchestration? Design system owners and platform teams together, with shared metrics and patterns.
Real-world momentum
- Healthcare: Multimodal AI drafts summaries and recommendations by fusing notes, images, and signals, speeding clinical decisions.
- Ops and field: Techs narrate findings while video captures context; AI produces checklists and parts orders. (field workflows grounded by video and voice)
- Smart spaces: Leaders like Josh.ai advocate “voice when it helps, alternatives when it doesn’t,” reinforcing the core idea: multiple natural ways in, one coherent way out. (Alex Capecelatro on seamless smart homes)
Will this require a platform rewrite? No. Start with orchestration for a few high-value scenes, then expand component by component.
Strategy: product bets for the next 12 months
- Adopt an orchestration layer. Treat modalities as interchangeable ports on a single intent pipeline; log which port wins by scene and why (see the sketch after this list). (treat modalities as ports on one pipeline)
- Make handoffs a first-class artifact. Add transition patterns to your design system (for example, Voice to Review, Vision to Explain), with success metrics. (document and measure handoffs)
- Bias to on-device for capture. Vision and voice capture on device reduces latency and privacy risk; sync summaries later. Performance and trust improve. (why on-device capture helps)
- Narrate your GUI. Wherever there is dense UI, offer one-tap “Explain this” and “Read this” to boost comprehension and accessibility. (add explain and read affordances)
- Right-time prompts. Use calendars, location, and activity signals to schedule or suppress interruptions. Respect beats recall. (time prompts to the scene)
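A minimal sketch of the first bet: modalities as interchangeable ports on one intent pipeline, with a log of which port won by scene. The ModalityPort interface and log shape are assumptions for illustration, not a finished architecture.

```typescript
// Illustrative orchestration-layer sketch; interfaces are hypothetical.
interface Intent { action: string; params: Record<string, string>; }

interface ModalityPort {
  name: "voice" | "touch" | "text" | "vision";
  available: (scene: string) => boolean; // can this port run in the current scene?
  capture: () => Promise<Intent | null>; // try to capture an intent through this port
}

interface PortWinLog { scene: string; port: string; reason: string; at: number; }

const winLog: PortWinLog[] = [];

// Try ports in priority order for the scene; record which one won and why.
async function captureIntent(scene: string, ports: ModalityPort[]): Promise<Intent | null> {
  for (const port of ports) {
    if (!port.available(scene)) continue;
    const intent = await port.capture();
    if (intent) {
      winLog.push({ scene, port: port.name, reason: "first available port produced an intent", at: Date.now() });
      return intent;
    }
  }
  return null; // no port produced an intent; fall back or ask the user to switch modes
}
```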
What is the smallest viable bet? Ship one handoff pattern end to end, measure it, then templatize it.
A note on craft and culture
John Maeda frames our AI era as a choice: compete with, protest, or collaborate with AI. The pragmatic path for design teams is collaboration, with taste. That means using AI to listen across modalities, then editing with human judgment so outcomes stay legible, inclusive, and kind. (Design in Tech 2024 framing)
So, how do you keep the soul of the product? Set a bar for clarity and kindness, then let AI accelerate the boring parts.
FAQ
Isn’t voice still unreliable?
It’s improving, but don’t make voice the only way. Pair it with quick visual confirmations and easy fallback to touch or keyboard. This combo scores well in both speed and trust. (optimize for reliability and confirmation)
When should camera or vision lead?
When seeing is faster than telling: scanning documents, identifying hardware, or disambiguating which “Settings” you mean. Always give a privacy-first path. (vision-led flows that reduce friction)
How do we keep it accessible?
Design modal parity: offer text equivalents for audio and captions for video; ensure high contrast, focus order, and keyboard paths for any GUI refinement. Multimodality expands access if you plan for it. (access checks for multimodal UI)
Does “No UI” mean kill the app?
No, treat it as an aspiration to remove needless steps. Keep affordances that restore control and understanding when the user wants them. (No UI as north star, not dogma)
What should we measure?
Time to intent (voice versus touch), correction rate after voice capture, transition success (voice to edit without rework), and satisfaction by scene (private versus public). (measure transitions and correction rate)
Closing thought
Multimodal is not bells and whistles; it’s basic manners for software in a human world. The interface of the future isn’t a single screen; it’s an ongoing conversation across senses and surfaces, with AI quietly stitching it together. Design for scenes, orchestrate your modes, and let users move the way they already do.
