58/100 · Developing · Deep cerebral nuclei (caudate, putamen, pallidum)

Basal Ganglia


From the brain to the algorithm. The basal ganglia choose which action to execute and consolidate skills from explicit practice into automatic routines. The AI counterpart is reinforcement learning across the whole training pipeline — from AlphaGo to RLHF to RLVR. 2024–2025 closed the gap between "RL works on games" and "RL is the engine of frontier reasoning".

What the biology does

The basal ganglia are a cluster of deep cerebral nuclei (caudate, putamen, globus pallidus, substantia nigra, subthalamic nucleus) that sit at the interface between cortex and motor system. They handle action selection (which of the many possible next moves should fire?), habit formation (consolidating explicit choices into automatic routines) and reward-based learning. Dopamine neurons in the substantia nigra and the neighbouring ventral tegmental area signal reward-prediction error, the difference between the reward received and the reward expected, which is the same quantity that temporal-difference RL algorithms compute. Damage to this circuit produces the canonical disorders of action: Parkinson's tremor and rigidity, Huntington's chorea, obsessive-compulsive loops.
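The reward-prediction error mentioned above can be written in a few lines. The following is a minimal sketch of a tabular TD(0) update, not a model of the biology: the state names, learning rate `alpha`, discount `gamma` and the two-step "cue then reward" episode are all illustrative assumptions.

```python
def td_error(reward, value_s, value_next, gamma=0.9):
    """Reward-prediction error: what was observed minus what was expected."""
    return reward + gamma * value_next - value_s


def td0_update(values, state, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one TD(0) update in place and return the error that drove it."""
    delta = td_error(reward, values[state], values[next_state], gamma)
    values[state] += alpha * delta
    return delta


# Hypothetical episode: a cue is followed by a reward of 1.0, then the end.
values = {"cue": 0.0, "reward_delivered": 0.0, "end": 0.0}
first_delta = None
for _ in range(50):
    td0_update(values, "cue", 0.0, "reward_delivered")
    delta = td0_update(values, "reward_delivered", 1.0, "end")
    if first_delta is None:
        first_delta = delta  # error on the very first, unexpected reward
last_delta = delta  # error once the reward is fully predicted
```

In this toy run the error at reward time starts at 1.0 and shrinks towards zero as the reward becomes predicted, while the value of the cue rises towards `gamma` times the reward.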

What we have built

The history of reinforcement learning is the history of teaching machines to choose well. Thirteen milestones, from Atari pixels to the post-training stack that drives every frontier model in 2026.

  • December 2013 — DeepMind DQN. Deep RL plays Atari from pixels (NIPS 2013).
  • March 2016 — AlphaGo beats Lee Sedol. First superhuman Go via deep RL + Monte-Carlo tree search.
  • June 2017 — RLHF foundations. Christiano et al. — Deep Reinforcement Learning from Human Preferences (NIPS 2017).
  • December 2017 — AlphaZero. A single algorithm masters chess, shogi and Go from self-play alone.
  • January 2022 — InstructGPT. RLHF productionises preference-tuned LLMs.
  • December 2022 — Constitutional AI / RLAIF. Anthropic replaces human preference labels with model-generated ones — see the original paper.
  • May 2023 — Direct Preference Optimization. Rafailov et al. show you can do preference learning without an explicit reward model.
  • July 2024 — AlphaProof. RL on Lean proofs, combined with AlphaGeometry 2, reaches IMO silver-medal level: the first AI system to reach that standard, one point short of the gold-medal threshold; see the DeepMind write-up.
  • October 2024 — Physical Intelligence π0. First open-source generalist robot foundation model.
  • January 2025 — DeepSeek-R1. Open-source reasoning via Group Relative Policy Optimization (GRPO) and Reinforcement Learning with Verifiable Rewards (RLVR) — DeepSeek-R1 on GitHub is the most-downloaded open reasoning model of 2025.
  • April 2025 — π0.5. Physical Intelligence π0.5 ships open-world generalisation — the robot cleans kitchens it has never trained on.
  • July 2025 — Gemini Deep Think. Wins IMO 2025 gold (35/42) using parallel-thinking RL.
  • 2026 — RL is the dominant post-training recipe across reasoning (o-series, Claude Extended Thinking, Deep Think), coding (SWE-bench gains) and computer use (Operator → ChatGPT agent).

The simplification that made the post-2023 explosion possible came from the DPO paper:

"In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning." — Rafailov et al., 2023 (arXiv:2305.18290)

DPO, GRPO and RLVR are the three names that show up everywhere in 2025–2026 model cards. The Stanford AI Index 2026 records the curve.
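To make the "simple classification loss" in the DPO abstract concrete, here is a minimal sketch of the per-pair loss, assuming the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are already available. The function name, argument names, the example numbers and the default `beta` are illustrative assumptions, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits) for stability.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities for one (chosen, rejected) pair.
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)   # policy favours chosen
```

When the policy equals the reference the loss is log 2, the value of a coin-flip classifier; it falls as the policy shifts probability towards the chosen response relative to the reference. No reward model and no sampling from the policy appear anywhere in the computation, which is the simplification the abstract describes.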

What is still missing

The selection circuitry is still fragile in ways the brain is not.

  1. Reward hacking on subjective tasks. Verifiable rewards work where ground truth exists (math, code, formal proof). They do not work for creative writing, brand voice or nuanced argumentation — human preference data remains superior.
  2. Long-horizon credit assignment. When the reward signal is sparse across thousands of steps, agents fail. Multi-hour autonomous tasks still loop, time out, or hallucinate progress.
  3. Skill transfer is poor. A policy that learned task A rarely transfers cleanly to task B without expensive re-training.
  4. No procedural consolidation. Neural networks lack the clear "skill-becomes-automatic" transition that mammals show. A model that has solved a problem a thousand times spends as many tokens solving it the thousand-and-first time.
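The credit-assignment problem in item 2 can be illustrated with a small calculation. This is a sketch under simplifying assumptions: a single reward arrives only at the final step, returns are plain Monte-Carlo discounted sums, and the function name, the 1000-step horizon and the two `gamma` values are hypothetical.

```python
def returns_from_terminal_reward(num_steps, reward=1.0, gamma=0.99):
    """Discounted return seen at each step when only the last step is rewarded."""
    return [reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


# With discounting, the earliest steps receive almost no learning signal.
discounted = returns_from_terminal_reward(1000, gamma=0.99)

# Without discounting, every step receives identical credit, so the return
# alone cannot say which of the 1000 actions actually mattered.
undiscounted = returns_from_terminal_reward(1000, gamma=1.0)
```

Both regimes are unhelpful in different ways: with `gamma=0.99` the first step sees a return below 1e-4, and with `gamma=1.0` all steps look equally responsible for the outcome.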

How we read the verdict

We rate the AI counterpart Developing. The post-training stack is mature and shipping everywhere; the agentic loop — pick the right action, repeat, build a skill — is closing fast but still fails on long horizons and unfamiliar environments. Among the three V1.2 deep-brain regions, this is the one with the steepest improvement curve right now.

Concrete examples

  • DeepSeek-R1 + GRPO: Open-source reasoning trained with group-relative policy optimisation on verifiable rewards (math, code).
  • Physical Intelligence π0.5: Vision-language-action model; training on diverse robot data lets it clean unseen kitchens.
  • AlphaProof at IMO 2024: AlphaZero-style RL trained on Lean proofs; together with AlphaGeometry 2 it solved 4/6 problems, a silver-medal score 1 point off gold.
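The group-relative step named in the DeepSeek-R1 entry above can be sketched as follows. This is an illustration under stated assumptions, not DeepSeek's training code: the exact-match reward rule, the function names, the use of the population standard deviation and the sample answers are all hypothetical.

```python
import statistics


def verifiable_reward(answer, ground_truth):
    """Rule-based reward: 1.0 for an exact match, else 0.0 (hypothetical rule)."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled answers to the prompt "What is 6 * 7?".
samples = ["42", "41", " 42 ", "forty-two"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Each answer is scored relative to its siblings for the same prompt, so no learned value function is needed as a baseline. If every answer in a group earns the same reward, all advantages are zero and that prompt contributes no gradient.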

Milestones

  • Dec 2013: DeepMind DQN — deep RL plays Atari from pixels (NIPS 2013)
  • Mar 2016: AlphaGo beats Lee Sedol — first superhuman Go via deep RL + Monte-Carlo tree search
  • Jun 2017: Christiano et al. — RLHF foundations (NIPS 2017)
  • Dec 2017: AlphaZero — single algorithm masters chess, shogi and Go from self-play
  • Jan 2022: InstructGPT — RLHF productionises preference-tuned LLMs
  • Dec 2022: Constitutional AI / RLAIF — AI-generated preference labels (Anthropic)
  • May 2023: DPO — preference learning without an explicit reward model (Rafailov et al.)
  • Jul 2024: AlphaProof — RL on Lean proofs reaches IMO silver-medal level
  • Oct 2024: Physical Intelligence π0 — first open-source generalist robot foundation model
  • Jan 2025: DeepSeek-R1 — GRPO + RLVR with rule-based rewards open-sources frontier reasoning
  • Apr 2025: Physical Intelligence π0.5 — open-world generalisation in robotic manipulation
  • Jul 2025: Gemini Deep Think — IMO 2025 gold via parallel-thinking RL
  • 2026: RL becomes the dominant post-training recipe across reasoning, coding and computer-use
