Visual Cortex

From the brain to the algorithm. The visual cortex turns photons into objects. The AI counterpart (deep convolutional and transformer-based vision models) does the same thing on different hardware, and on most narrow benchmarks it now does it better.

What the biology does

The primary visual cortex (V1) sits at the back of the head, in the occipital lobe. It receives retinal signals relayed through the lateral geniculate nucleus, and its neurons, organised into orientation columns, respond to edges at particular angles. V1 then hands the result up a hierarchy (V2 → V4 → IT) that goes from local edges to object identity in roughly a hundred milliseconds, inside a brain that runs on roughly twenty watts. The brain is not just a classifier: foveal-peripheral attention loops decide where to look next, and the dorsal "where" stream tells the motor cortex how to act on what the ventral "what" stream just identified.
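
Orientation selectivity of the kind described above is often modelled with Gabor filters. The sketch below is a textbook simplification, not a model of real V1 circuitry: the helper names (gabor_kernel, grating, response) and all parameter values are illustrative assumptions. It builds a small bank of oriented filters and checks which one responds most strongly to a striped stimulus.

```python
import math

def gabor_kernel(size, theta, wavelength=4.0, sigma=2.0):
    """Gabor kernel whose carrier wave varies along direction theta (radians)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Project the pixel position onto the direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            # Gaussian envelope keeps the filter spatially local.
            envelope = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def grating(size, theta, wavelength=4.0):
    """Sinusoidal grating that varies along direction theta."""
    half = size // 2
    return [
        [math.cos(2 * math.pi * (x * math.cos(theta) + y * math.sin(theta)) / wavelength)
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]

def response(kernel, patch):
    """Dot product of filter and image patch: a crude 'simple cell' response."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))

# Four filters, 45 degrees apart, standing in for neighbouring orientation columns.
orientations = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]
bank = [gabor_kernel(9, theta) for theta in orientations]

# Stimulus varies along the vertical axis, so it shows horizontal stripes.
stimulus = grating(9, math.pi / 2)
responses = [response(kernel, stimulus) for kernel in bank]
preferred = max(range(len(bank)), key=lambda i: abs(responses[i]))
print("strongest filter index:", preferred)
```

The filter whose orientation matches the stimulus wins by a wide margin, while the orthogonal filter responds close to zero. That is the basic tuning behaviour the V1 description refers to, reduced to a single dot product per filter.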

What we have built

Modern computer vision is the AI subfield that crossed the human threshold first. The arc spans fourteen years and three architectural revolutions.

2012: AlexNet. Krizhevsky, Sutskever and Hinton beat the best classical pipelines on ImageNet by ten percentage points, kicking off the deep-learning era.
2015: ResNet-152. Deep residual networks push top-5 error on ImageNet below the commonly cited human estimate of about 5%.
2020: Vision Transformer. Dosovitskiy et al. show that a Transformer applied directly to sequences of image patches matches, and with enough pre-training data beats, CNNs.
August 2022: Stable Diffusion 1.4. Open-source text-to-image ships alongside DALL·E 2 and Midjourney; the era of "type a sentence, get a picture" begins.
2023: Multimodal frontier models. GPT-4V and Gemini, followed by Claude 3 in March 2024, bring general visual understanding into production.
July 2024: Meta SAM 2. Universal segmentation extended to video, with real-time mask tracking across frames. SAM 2 generalises "point at a thing, get the thing" from images to motion.
March 2025: Studio Ghibli moment. GPT-4o ships native image generation; OpenAI's GPUs spend a week underwater.
May 2025: Google Veo 3. Audio-synced text-to-video at 4K, with native sound effects and dialogue.
June 2025: Meta V-JEPA 2. An open-source video world model trained on over a million hours of web video. Meta reports that it plans robot actions about fifteen times faster than NVIDIA Cosmos.
September 2025: OpenAI Sora 2. Crosses the realism threshold for short-form video, including synchronised dialogue.
January 2026: Google Veo 3.1. True 4K, native 9:16 vertical output and Ingredients-to-Video character consistency across shots.
April 2026: Sora wind-down. OpenAI deprecates the standalone Sora app and folds the model into ChatGPT; the brand ends but the capability survives.

The architectural shift in 2020 is the one with the longest tail. The Vision Transformer paper (pre-trained on JFT-300M and transferred down) was explicit about what it had proven:

"When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train." (Dosovitskiy et al., 2020, arXiv:2010.11929)
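
The input step that lets a Transformer read an image is simple: cut the image into fixed-size patches and flatten each one into a token vector. The sketch below shows only that step, using a hypothetical patchify helper on a toy image. The learned linear projection, position embeddings and class token of the real model are deliberately left out, and the 224-pixel input size is an assumed, commonly used setting rather than something stated in this article.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])  # append every channel of this pixel
            tokens.append(token)
    return tokens

# Toy 224 x 224 image with 3 channels; each pixel stores (row, column, 0)
# so the origin of every patch is easy to inspect.
image = [[[y, x, 0] for x in range(224)] for y in range(224)]

tokens = patchify(image, 16)
num_tokens = len(tokens)     # (224 / 16) ** 2 patches
token_dim = len(tokens[0])   # 16 * 16 * 3 values per patch
print(num_tokens, token_dim)
```

With 16-pixel patches, a 224 × 224 RGB image becomes a sequence of 196 tokens of 768 values each, which the Transformer then treats much like a sentence of 196 words.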

The Stanford AI Index 2026 records the consequence: by 2024 vision benchmarks were saturating faster than new ones could be designed.

What is still missing

Four things keep computer vision short of biological vision in 2026.

Adversarial robustness. Single-pixel attacks and ~1% Gaussian noise still flip predictions in production classifiers. Humans are not perfect either, but the failure modes differ qualitatively.
Embodied perception. Vision in a body (coupled to action, expectation and physical interaction) remains far from biology. A toddler tracking a ball with their eyes uses prediction loops that current vision-language-action models only approximate. Even Meta's V-JEPA 2 is a planning prior, not a perceptual organ.
Sample efficiency. A child can learn to recognise a giraffe from a handful of views. Vision models still need thousands of labelled examples, or rely on multi-billion-image pre-training as a substitute for biological priors.
Persistent identity in video. Diffusion-based generators drift on character continuity across long clips even with explicit conditioning, which is why Veo 3.1 shipped Ingredients-to-Video as a primitive in the first place.
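
One common explanation for the adversarial fragility listed above is that many tiny per-pixel changes add up in high dimensions. The sketch below applies a single step in the style of the fast gradient sign method to a toy linear classifier. The model, its weights and the 1% step size are assumptions chosen for illustration, not a description of any production system.

```python
def score(weights, bias, pixels):
    """Linear classifier score: positive means class A, negative means class B."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias

def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def fgsm_step(weights, pixels, epsilon):
    """Shift every pixel by epsilon against the score gradient.

    For a linear model the gradient of the score with respect to the
    pixels is simply the weight vector, so its sign is sign(weights).
    """
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]

# Assumed toy setup: 10,000 pixels in [0, 1], small alternating weights.
num_pixels = 10_000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(num_pixels)]
bias = 0.5
pixels = [0.5] * num_pixels
epsilon = 0.01  # 1% of the pixel range

clean_score = score(weights, bias, pixels)
adversarial = fgsm_step(weights, pixels, epsilon)
adversarial_score = score(weights, bias, adversarial)
largest_change = max(abs(a - p) for a, p in zip(adversarial, pixels))

print("clean score:", clean_score)
print("adversarial score:", adversarial_score)
print("largest per-pixel change:", largest_change)
```

No pixel moves by more than 1% of its range, yet the score drops by epsilon times the sum of the absolute weights (0.01 × 100 = 1.0 in this setup), which is enough to flip the predicted class. Deep networks are not linear, so this is only an intuition for why small perturbations can matter.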

How we read the verdict

We rate the AI counterpart Mature. For static, single-image perception the gap is now narrow; for embodied, robust, sample-efficient visual cognition the gap is still wide. The direction of progress is unambiguous: the gap keeps closing.
