Multimodal

A multimodal model handles more than one modality of data. Modern frontier models accept text and images as input, and increasingly audio and video too. Output is often still primarily text, though image and audio generation are improving rapidly.

For operators, the practical implication is in product design. Multimodal input opens up use cases such as extracting structured data from screenshots and PDFs, describing images for accessibility, analyzing UI mockups, accepting voice input directly, processing audio meeting recordings without a separate transcription step, and reasoning about diagrams and charts.

The gotcha is reliability. Text reasoning is still the most heavily tested capability. Multimodal capabilities are improving fast, but operator evals should specifically test the modality you depend on. Don't assume that a model that is great at text reasoning is equally great at chart interpretation.
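
One way to act on this is to report eval accuracy per modality instead of as a single aggregate, so a weak modality cannot hide behind strong text results. The following is a minimal sketch, not a reference implementation: `EvalCase`, `accuracy_by_modality`, the modality labels, and the `predict` callable are all hypothetical names chosen for illustration, and exact-match scoring is a simplifying assumption that a real eval would likely replace with a task-specific grader.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class EvalCase(NamedTuple):
    """One eval example (hypothetical structure, for illustration only)."""
    modality: str    # e.g. "text", "chart", "audio"; labels are illustrative
    prompt: str      # question posed to the model
    attachment: str  # path or ID of the input file; "" for text-only cases
    expected: str    # reference answer


def accuracy_by_modality(
    cases: Iterable[EvalCase],
    predict: Callable[[EvalCase], str],
) -> Dict[str, float]:
    """Return exact-match accuracy for each modality present in `cases`.

    `predict` stands in for whatever calls your model; it is an assumption
    of this sketch, not a real library API.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.modality] += 1
        answer = predict(case)
        # Case-insensitive exact match; swap in a task-specific grader here.
        if answer.strip().lower() == case.expected.strip().lower():
            correct[case.modality] += 1
    return {modality: correct[modality] / totals[modality] for modality in totals}


if __name__ == "__main__":
    # Made-up cases and a stub model, purely to show the shape of the report.
    demo_cases = [
        EvalCase("text", "What is 2 + 2?", "", "4"),
        EvalCase("chart", "Which quarter has the tallest bar?", "sales.png", "Q3"),
    ]

    def stub_predict(case: EvalCase) -> str:
        # Pretend the model answers text correctly and misreads the chart.
        return case.expected if case.modality == "text" else "Q1"

    for modality, score in accuracy_by_modality(demo_cases, stub_predict).items():
        print(f"{modality}: {score:.0%}")
```

Reading the per-modality numbers side by side makes the gap visible: a model can score well overall while failing most of the chart cases your product actually depends on.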
