72/100 · Mature · Diencephalon · deep central

Thalamus

AI maturity: Mature (72/100)

From the brain to the algorithm. The thalamus is the brain's central switchboard — every sense (except smell) passes through it before reaching the cortex. The AI counterpart is native multimodal foundation models: one network that handles text, image, audio and video as a single stream. The capability is in production at every frontier lab.

What the biology does

The thalamus sits at the centre of the brain, just above the brainstem and between the two cerebral hemispheres. It is the gateway for almost every sensory modality: visual signals relay through the lateral geniculate nucleus, auditory signals through the medial geniculate nucleus, and somatosensory signals through the ventral posterior nuclei. It gates attention, suppresses irrelevant input, and helps bind modalities into a single coherent stream. Patients with thalamic lesions can suffer sensory loss or neglect and may lose the ability to combine sight, sound and touch into a unified percept: they hear and see, but each channel is an island.

What we have built

Multimodality used to mean bolting a vision encoder onto a language model. After 2024 it means one model that tokenises everything in the same space.

  • September 2022 — OpenAI Whisper. Robust open-source speech recognition lands; the bottleneck shifts from "can the machine hear?" to "what does the machine do with what it heard?".
  • March 2023 — GPT-4V. First frontier model to handle vision and text in the same context window; image input is announced with GPT-4 and reaches ChatGPT users in September 2023.
  • February 2024 — Gemini 1.5. Google ships 1M-token context with native multimodality — text, image, audio and video in the same call.
  • May 2024 — GPT-4o. Demonstrates native voice-to-voice with an average latency of ~320 ms, comparable to human response time in conversation.
  • July 2024 — Advanced Voice Mode. Alpha rollout begins for a small group of ChatGPT Plus users; broader availability follows in the autumn.
  • October 2024 — OpenAI Realtime API. Production-grade streaming speech-to-speech for developers.
  • March 2025 — Native image generation in GPT-4o. 4o image gen ships; the Studio Ghibli style transfer wave overwhelms OpenAI's GPUs within hours.
  • May 2025 — Google Veo 3. Audio-synced 4K text-to-video with native sound effects.
  • June 2025 — Meta V-JEPA 2. A world model trained on over a million hours of web video; plans robot actions ~15× faster than NVIDIA Cosmos.
  • November 2025 — Google Gemini 3. Top of multimodal understanding benchmarks.
  • January 2026 — Veo 3.1. True 4K, native vertical video for TikTok/Shorts, Ingredients-to-Video for multi-shot character consistency.
  • February 2026 — Gemini 3.1 Pro. 1M context across modalities, 65K-token output, top of twelve of eighteen benchmarks.

The architectural shift was anticipated in DeepMind's Flamingo paper two years before it became commodity:

"We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. […] A single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples." — Alayrac et al., 2022 (arXiv:2204.14198)

What Flamingo proposed as a bridging architecture became, two model generations later, the default. The Stanford AI Index 2026 records the saturation curve: by 2025, multimodal benchmarks were retiring faster than the labs could publish.
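
The "one model that tokenises everything in the same space" idea can be illustrated with a toy sketch. It is not Flamingo's mechanism (Flamingo bridges frozen vision and language models with cross-attention) and it is not any lab's actual tokenizer. The `Token` class, the `VOCAB_RANGES` table and its sizes, and both function names are hypothetical, chosen only to show how segments from different modalities could end up in one interleaved sequence over one shared vocabulary.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    modality: str  # "text", "image" or "audio"
    token_id: int  # index into the single shared vocabulary


# Hypothetical layout: each modality owns a contiguous id range in one
# shared vocabulary. The sizes are illustrative, not taken from a real model.
VOCAB_RANGES = {
    "text": (0, 50_000),
    "image": (50_000, 58_192),
    "audio": (58_192, 60_240),
}


def to_shared_ids(modality: str, local_ids: Sequence[int]) -> List[Token]:
    """Map modality-local codes into the shared id space."""
    start, end = VOCAB_RANGES[modality]
    tokens = []
    for local in local_ids:
        shared = start + local
        if not start <= shared < end:
            raise ValueError(f"{modality} code {local} is outside its range")
        tokens.append(Token(modality, shared))
    return tokens


def interleave(segments: Sequence[Tuple[str, Sequence[int]]]) -> List[Token]:
    """Concatenate segments in arrival order into one token sequence."""
    stream: List[Token] = []
    for modality, local_ids in segments:
        stream.extend(to_shared_ids(modality, local_ids))
    return stream
```

Calling `interleave([("text", [1, 2]), ("image", [0, 5]), ("text", [3])])` yields a single list in which image codes sit between text codes, which is the property that lets one sequence model attend across modalities without a separate bolted-on encoder path.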

What is still missing

Native multimodality works. The remaining gaps are subtle, but they are still gaps.

  1. Cross-modal grounding is brittle. Adversarial audio + video prompts can desynchronise model output: the system describes what it sees rather than what is actually playing on the audio track.
  2. Long video is hard. Most production multimodal models still degrade on hour-long video understanding — a constraint biology does not share.
  3. Modality bias persists. Training data is overwhelmingly text. Pure audio reasoning, music understanding and proprioceptive integration lag behind vision-language.
  4. Sub-second generation is one-sided. Voice replies arrive in ~300 ms; high-quality video generation at sub-second latency remains aspirational.
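
A back-of-envelope token budget gives one possible reading of why hour-long video (gap 2) strains even a 1M-token context. The per-frame and per-second costs below are assumptions picked for illustration, not a specification of any model, and the function names are hypothetical.

```python
# Assumed costs, for illustration only. Real models sample and tokenise
# video differently.
TOKENS_PER_FRAME = 258        # assumed tokens spent on one sampled frame
FRAMES_PER_SECOND = 1         # assumed frame sampling rate
AUDIO_TOKENS_PER_SECOND = 32  # assumed tokens spent on one second of audio


def video_token_cost(seconds: int) -> int:
    """Tokens consumed by `seconds` of video with its audio track."""
    per_second = FRAMES_PER_SECOND * TOKENS_PER_FRAME + AUDIO_TOKENS_PER_SECOND
    return seconds * per_second


def max_video_seconds(context_tokens: int) -> int:
    """Longest clip, in whole seconds, that fits in the context window."""
    per_second = FRAMES_PER_SECOND * TOKENS_PER_FRAME + AUDIO_TOKENS_PER_SECOND
    return context_tokens // per_second


one_hour = video_token_cost(3600)       # 3600 * 290 = 1,044,000 tokens
fits = max_video_seconds(1_000_000)     # 3448 seconds, about 57 minutes
```

Under these assumptions a single hour of video already exceeds a 1M-token window before any prompt or answer tokens are counted, and that is at only one frame per second. Fitting in the window is a separate question from understanding the content well, so this sketch covers only the capacity side of the gap.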

How we read the verdict

We rate the AI counterpart Mature. The thalamus is one of the regions where AI has caught up most credibly: a single network in production at every frontier lab can see, hear, speak and predict the physical world. The biological version still wins on edge cases and on the seamlessness of binding — but the existence of the capability is no longer in question.

Concrete examples

  • GPT-4o native multimodal: single model for text, image and audio, with ~320 ms average voice response, comparable to human response time in conversation.
  • Google Veo 3.1: audio-synced 4K video from text with vertical-format support; sound effects are baked in, not post-mixed.
  • Meta V-JEPA 2: predicts physical-world dynamics from video; plans robot actions ~15× faster than NVIDIA Cosmos.

Milestones

  • Sep 2022: OpenAI Whisper, robust open-source speech recognition
  • Mar 2023: GPT-4V, first frontier model to handle vision and text in the same context
  • Feb 2024: Gemini 1.5, 1M context with native multimodal (text/image/audio/video)
  • May 2024: GPT-4o, native voice-to-voice with 320 ms latency; full multimodal in one model
  • Jul 2024: GPT-4o Advanced Voice Mode rolls out to ChatGPT Plus
  • Oct 2024: OpenAI Realtime API, production-grade streaming speech-to-speech
  • Mar 2025: GPT-4o native image generation, the Studio Ghibli moment
  • May 2025: Google Veo 3, audio-synced text-to-video at 4K with native sound effects
  • Jun 2025: Meta V-JEPA 2, video world model trained on 1M+ hours of web video
  • Nov 2025: Google Gemini 3, top of multimodal understanding benchmarks
  • Jan 2026: Google Veo 3.1, 4K vertical video with full audio + lip-synced dialogue
  • Feb 2026: Gemini 3.1 Pro, 1M context across modalities, 65K token output
