Thalamus

From the brain to the algorithm. The thalamus is the brain's central switchboard: every sense (except smell) passes through it before reaching the cortex. The AI counterpart is native multimodal foundation models: one network that handles text, image, audio and video as a single stream. The capability is in production at every frontier lab.

What the biology does

The thalamus sits at the centre of the brain, just above the brainstem and between the two cerebral hemispheres. It is the gateway for almost every sensory modality: visual signals relay through the lateral geniculate nucleus, auditory signals through the medial geniculate nucleus, and somatosensory signals through the ventral posterior nuclei. It gates attention, suppresses irrelevant input, and binds modalities into a single coherent stream. Patients with thalamic lesions can suffer sensory neglect and lose the ability to combine sight, sound and touch into a unified percept: they hear and see, but each channel is an island.

What we have built

Multimodality used to mean bolting a vision encoder onto a language model. After 2024 it means one model that tokenises everything in the same space.

September 2022: OpenAI Whisper. Robust open-source speech recognition lands; the bottleneck shifts from "can the machine hear?" to "what does the machine do with what it heard?".
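March 2023: GPT-4. OpenAI announces GPT-4 with image input alongside text in the same context window, among the first frontier models to do so; the vision capability (GPT-4V) reaches ChatGPT users in September 2023.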
February 2024: Gemini 1.5. Google ships 1M-token context with native multimodality: text, image, audio and video in the same call.
May 2024: GPT-4o. Native voice-to-voice with an average response latency of ~320 ms, comparable to human response time in conversation.
July 2024: Advanced Voice Mode. Alpha rollout begins for a small group of ChatGPT Plus users; broader availability follows in the autumn.
October 2024: OpenAI Realtime API. Low-latency streaming speech-to-speech for developers, launched in public beta.
January 2025: Open-weights multimodal catches up. Alibaba's Qwen2.5-VL-72B reaches 70.2 on MMMU and 88.6 on MMBench-EN, matching GPT-4o-class understanding; InternVL3.5 follows later in 2025. Native multimodality is no longer a US-frontier-only capability.
March 2025: Native image generation in GPT-4o. 4o image gen ships; the Studio Ghibli style transfer wave overwhelms OpenAI's GPUs within hours.
May 2025: Google Veo 3. Audio-synced 4K text-to-video with native sound effects.
June 2025: Meta V-JEPA 2. V-JEPA 2 trains a world model on over a million hours of web video; plans robot actions ~15× faster than NVIDIA Cosmos.
November 2025: Google Gemini 3. Top of multimodal understanding benchmarks.
January 2026: Veo 3.1. True 4K, native vertical video for TikTok/Shorts, Ingredients-to-Video for multi-shot character consistency.
February 2026: Gemini 3.1 Pro. 1M context across modalities, 65K-token output, top of twelve of eighteen benchmarks.
May 2026: Google Gemini Omni Flash. A single model that takes text, images, audio and video as input and generates video as output, with conversational editing: any modality in, video out, in one call.

The architectural shift was anticipated in DeepMind's Flamingo paper two years before it became commodity:

"We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. […] A single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples." Alayrac et al., 2022 (arXiv:2204.14198)

What Flamingo proposed as a bridging architecture became, two model generations later, the default. The Stanford AI Index 2026 records the saturation curve: by 2025, multimodal benchmarks were retiring faster than the labs could publish.
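
One way to picture "tokenising everything in the same space" is a single vocabulary in which each modality owns its own range of token IDs, so that text, image patches and audio frames can be interleaved in one sequence and fed to one network. The toy sketch below illustrates only that idea; it is not how any model named above is implemented. The vocabulary sizes, the Segment type and the function names are hypothetical, and real systems learn their image and audio codebooks rather than taking codes by hand.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: each modality owns a disjoint ID range inside one
# shared vocabulary. The codebook sizes are illustrative assumptions.
VOCAB_SIZES: Dict[str, int] = {"text": 256, "image": 1024, "audio": 512}

# Compute where each modality's range starts in the shared ID space.
OFFSETS: Dict[str, int] = {}
_next = 0
for _name, _size in VOCAB_SIZES.items():
    OFFSETS[_name] = _next
    _next += _size
TOTAL_VOCAB = _next  # size of the single embedding table a model would need


@dataclass(frozen=True)
class Segment:
    """A run of modality-local codes, e.g. bytes or codebook indices."""
    modality: str
    codes: Tuple[int, ...]


def text_segment(s: str) -> Segment:
    """Byte-level text tokenisation: one code per UTF-8 byte."""
    return Segment("text", tuple(s.encode("utf-8")))


def to_shared_ids(segments: Sequence[Segment]) -> List[int]:
    """Flatten interleaved segments into one sequence of shared-vocabulary IDs."""
    ids: List[int] = []
    for seg in segments:
        size = VOCAB_SIZES[seg.modality]
        offset = OFFSETS[seg.modality]
        for code in seg.codes:
            if not 0 <= code < size:
                raise ValueError(f"{seg.modality} code {code} out of range")
            ids.append(offset + code)
    return ids


def modality_of(token_id: int) -> str:
    """Recover which modality a shared-vocabulary ID belongs to."""
    for name, offset in OFFSETS.items():
        if offset <= token_id < offset + VOCAB_SIZES[name]:
            return name
    raise ValueError(f"token id {token_id} outside shared vocabulary")


# Interleave text, two image-patch codes and one audio-frame code.
stream = to_shared_ids([
    text_segment("Hi"),
    Segment("image", (0, 5)),
    Segment("audio", (3,)),
])
print(stream)                            # [72, 105, 256, 261, 1283]
print([modality_of(t) for t in stream])  # text, text, image, image, audio
```

In a bridging design of the kind Flamingo described, the vision encoder and the language model keep separate representations and are connected by extra layers; in the sketch above there is only one sequence and one ID space, which is the sense in which the later models are called native.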

What is still missing

Native multimodality works. The remaining gaps are subtle, but they are still gaps.

Cross-modal grounding hallucinates. Adversarial audio + video prompts can desynchronise model output: the system describes what it sees, not what is actually playing.
Long video is hard. Most production multimodal models still degrade on hour-long video understanding, a constraint biology does not share.
Modality bias persists. Training data is overwhelmingly text. Pure audio reasoning, music understanding and proprioceptive integration lag behind vision-language.
Sub-second generation is one-sided. Voice replies arrive in ~300 ms; sub-second generation of high-quality video remains aspirational.

How we read the verdict

We rate the AI counterpart Mature. The thalamus is one of the regions where AI has caught up most credibly: a single network in production at every frontier lab can see, hear, speak and predict the physical world. The biological version still wins on edge cases and on the seamlessness of binding, but the existence of the capability is no longer in question.
