Deep: New Multimodal AI Features Explored
30+ real examples from Lyft, Notion, Google, Shopify, Revolut and more. Plus a framework for choosing the right modality for your product.
🔒DoP Deep goes deeper into the concepts and ideas covered in the Weekly Briefing to help you learn lessons from the experiences of top tech companies. If you’d like to receive these exclusive, in-depth pieces of analysis, you can upgrade below. New reports are added every month.
Pinterest’s product teams were recently reported to be in a battle with their CEO over the future of the company’s AI assistant – and at the heart of it was a simple question: what modality should the AI use?
Pinterest’s CEO was reportedly keen to lean heavily into voice, arguing that Gen Z’s expectations were changing and that conversational interfaces would make shopping feel like “talking to a friend.” But the company’s designers and product leaders pushed back, arguing that forcing a voice‑first experience onto a product built around quiet, visual discovery risked destroying its core value proposition.
It’s a small internal dispute, set against a backdrop of broader tension over AI‑driven layoffs and an aggressive pivot to “AI‑forward” products, but it opens up a much bigger discussion about the tough choices faced by product teams in 2026. Before AI, the interface layer was largely fixed: you designed screens, flows, and components, and the input was almost always a keyboard, a click, or a tap. The modality question barely existed as a product design consideration, because the answer was nearly always the same.
In 2026, that’s changed. The interface layer is now a distinct design decision. A product team can choose how a user interacts with their product in a way they simply couldn’t before – across text, voice, image, video, documents, or some combination of each. Just this past week, for example, Google’s updated Stitch agent went viral in design circles for letting people talk to an infinite canvas, mix voice, text, and images in one space, and have an AI agent understand the entire context of a project. As the Pinterest story shows, getting that choice wrong can erode the value proposition that made your product worth using in the first place.
In this Deep Dive, we’re going to take a closer look at 30+ new examples of multimodal AI features and capabilities recently shipped by some of the world’s top companies - including Google, Anthropic, DoorDash, Lyft, Headspace, Zendesk and more. And alongside the analysis, we’ll share a practical, hands‑on framework you can use to structure your own thinking when it comes to introducing new modalities into your product.
Coming up:
30+ real-world examples of new multimodal AI features from companies including Google, Lyft, Headspace, Shopify, Revolut, Zendesk and more
A hands-on framework for deciding which modality is right for your product - built around five tests that walk you through the exact questions product teams at the world’s best companies are asking right now
Why combining modalities is where the real power lies - and how Google Stitch and Replit Agent 4 are showing what’s possible when voice, image and text work together in a single session
Why Spotify’s engineers now start every day with voice task delegation on their phones - and what that signals about where voice is actually heading
Ideas on how multimodal AI experiences could reshape entirely different industries - from insurance to healthcare to real estate
How this analysis is structured
This analysis includes 30+ examples of new AI features across multiple modalities. Here’s a snapshot of some of the new multimodal features included in this Deep Dive:
The different modalities explained
There are seven modalities included in this analysis, with a strong emphasis on the six below:
Voice - spoken, conversational interaction. This differs from Audio in that it implies a two-way, dialogue-style interface: the user speaks, the model responds. Features like Headspace’s Ebb and Otter.ai’s meeting agents are Voice, whereas a podcast narration tool is not.
Audio - non-conversational sound processing or generation. This covers things like Grok’s article narration, HubSpot generating audio clips for marketing content, or Zoom’s live voice translation, where audio is a medium being processed or produced, rather than an interactive interface.
Text - the most universal modality. The model reads, generates, or reasons over written language. Present in almost every feature as either an input or output, even when other modalities are doing the heavy lifting.
Image - the model can see and/or generate still images. This covers everything from Photoshop’s object removal via natural language to Pinterest’s visual search to product image generation in Shopify. Includes both understanding images and creating them.
Video - the model can analyse, generate, or reason over moving image sequences.
Documents - structured file understanding: PDFs, spreadsheets, contracts, presentations. Distinct from plain Text because the model needs to parse formatting, layout, tables, or multi-page structure. Salesforce’s Agentforce reading CRM records, or DocuSign’s AI summarising agreement terms, are Document-modal features.
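To make the taxonomy concrete, here’s a minimal, hypothetical sketch of how a single request might combine several of these modalities at once, in the spirit of a Stitch-style session. This is an illustration, not any vendor’s actual API - every name in it (Modality, InputPart, MultimodalRequest) is an assumption invented for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set, Union


class Modality(Enum):
    """The six emphasised modalities from this analysis's taxonomy."""
    VOICE = "voice"        # two-way spoken dialogue
    AUDIO = "audio"        # non-conversational sound, processed or produced
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    DOCUMENT = "document"  # structured files: PDFs, spreadsheets, contracts


@dataclass
class InputPart:
    """One piece of user input, tagged with its modality."""
    modality: Modality
    payload: Union[bytes, str]  # raw media bytes, or plain text


@dataclass
class MultimodalRequest:
    """A single request that mixes several input modalities."""
    parts: List[InputPart] = field(default_factory=list)

    def add(self, modality: Modality, payload: Union[bytes, str]) -> "MultimodalRequest":
        self.parts.append(InputPart(modality, payload))
        return self  # return self to allow chaining

    def modalities(self) -> Set[Modality]:
        return {part.modality for part in self.parts}


# Example: the user narrates a change, attaches a screenshot, and types a note -
# three modalities arriving together in one request.
request = (
    MultimodalRequest()
    .add(Modality.VOICE, b"<speech audio bytes>")
    .add(Modality.IMAGE, b"<screenshot bytes>")
    .add(Modality.TEXT, "Make the header match the attached mockup.")
)
print(request.modalities())  # {VOICE, IMAGE, TEXT}
```

The point of the sketch is the shape, not the code: once inputs are tagged by modality, a product team can reason explicitly about which combinations its experience should accept - the decision at the heart of this Deep Dive.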
A closer look inside the analysis
Now let’s take a closer look at each of the new multimodal AI features included in this Deep Dive, starting with the modality that sparked the internal battle at Pinterest: Voice.


