Basal Ganglia
From the brain to the algorithm. The basal ganglia choose which action to execute and consolidate skills from explicit practice into automatic routines. The AI counterpart is reinforcement learning across the whole training pipeline — from AlphaGo to RLHF to RLVR. 2024–2025 closed the gap between "RL works on games" and "RL is the engine of frontier reasoning".
What the biology does
The basal ganglia are a cluster of subcortical nuclei (caudate, putamen, globus pallidus, substantia nigra, subthalamic nucleus) that sit at the interface between cortex and the motor system. They handle action selection (which of the many possible next moves should fire?), habit formation (consolidating explicit choices into automatic routines) and reward-based learning. Dopamine neurons in the substantia nigra and the neighbouring ventral tegmental area signal reward-prediction error, the same quantity that temporal-difference RL algorithms compute. Damage to this circuit produces the canonical disorders of action: Parkinson's tremor and rigidity, Huntington's chorea, obsessive-compulsive loops.
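The reward-prediction error mentioned above can be made concrete with a minimal sketch of a tabular TD(0) update. The function names, the two-state toy task, and the learning rate and discount values are all assumptions chosen for illustration; they are not taken from any particular model of dopamine.

```python
def td_error(reward, value_s, value_next, gamma=0.99):
    """Reward-prediction error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_s


def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update in place and return the prediction error."""
    delta = td_error(reward, values[state], values[next_state], gamma)
    values[state] += alpha * delta
    return delta


# Hypothetical toy task: a cue is always followed by a reward of 1.0,
# after which the episode ends (the "end" state keeps a value of 0).
values = {"cue": 0.0, "end": 0.0}
deltas = [td0_update(values, "cue", 1.0, "end") for _ in range(200)]
```

In this toy run the first error is large because the reward is unexpected, and it shrinks towards zero as the value of the cue comes to predict the reward, which is the qualitative pattern the reward-prediction-error account describes.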
What we have built
The history of reinforcement learning is the history of teaching machines to choose well. Thirteen milestones, from Atari pixels to the post-training stack that drives every frontier model in 2026.
- December 2013 — DeepMind DQN. Deep RL plays Atari from pixels (NIPS 2013).
- March 2016 — AlphaGo beats Lee Sedol. First superhuman Go via deep RL + Monte-Carlo tree search.
- June 2017 — RLHF foundations. Christiano et al. — Deep Reinforcement Learning from Human Preferences (NIPS 2017).
- December 2017 — AlphaZero. A single algorithm masters chess, shogi and Go from self-play alone.
- January 2022 — InstructGPT. RLHF productionises preference-tuned LLMs.
- December 2022 — Constitutional AI / RLAIF. Anthropic replaces human preference labels with model-generated ones — see the original paper.
- May 2023 — Direct Preference Optimization. Rafailov et al. show you can do preference learning without an explicit reward model.
- July 2024 — AlphaProof. RL on Lean proofs, paired with AlphaGeometry 2, reaches IMO silver-medal level: 28 of 42 points, the first AI system to come within one point of the gold-medal threshold; see the DeepMind write-up.
- October 2024 — Physical Intelligence π0. A generalist robot foundation model (a vision-language-action policy); its code and weights were open-sourced in early 2025.
- January 2025 — DeepSeek-R1. Open-source reasoning trained with Group Relative Policy Optimization (GRPO) and rule-based rewards that can be checked automatically, the recipe now usually called Reinforcement Learning with Verifiable Rewards (RLVR). DeepSeek-R1 on GitHub became one of the most widely used open reasoning models of 2025.
- April 2025 — π0.5. Physical Intelligence π0.5 ships open-world generalisation: the robot cleans kitchens in homes it was never trained in.
- July 2025 — Gemini Deep Think. Reaches IMO 2025 gold-medal standard (35/42) using parallel thinking trained with RL.
- 2026 — RL is the dominant post-training recipe. It now drives reasoning (o-series, Claude Extended Thinking, Deep Think), coding (SWE-bench gains) and computer use (Operator → ChatGPT agent).
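The DeepSeek-R1 entry in the timeline names GRPO and RLVR. A minimal sketch of the two ideas together: a reward that is checked against ground truth, and advantages computed relative to a group of samples for the same prompt. The exact-match checker, the sample answers, the use of the population standard deviation and the `eps` stabiliser are all assumptions made for illustration, and the policy-gradient step that would consume these advantages is omitted.

```python
import statistics


def verifiable_reward(answer, ground_truth):
    """Hypothetical checker: 1.0 on exact match after trimming, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps is an assumed stabiliser for groups where every reward is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt whose known answer is "42".
samples = ["42", "41", " 42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples receive positive advantages and incorrect ones negative, without a separate learned value network. When every sample in a group scores the same, all advantages are zero and that prompt contributes no learning signal.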
The simplification that made the post-2023 explosion possible came from the DPO paper:
"In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning." — Rafailov et al., 2023 (arXiv:2305.18290)
DPO, GRPO and RLVR are the three names that show up everywhere in 2025–2026 model cards. The Stanford AI Index 2026 records the curve.
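The "simple classification loss" in the DPO abstract quoted above can be sketched for a single preference pair as follows. The function and argument names are hypothetical, the inputs are assumed to be summed log-probabilities of each full response under the trained policy and a frozen reference model, and `beta=0.1` is only an illustrative default.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen gain - rejected gain)))."""
    # How far the policy has moved from the reference on each response.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference sits at the chance-level loss log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No reward model and no sampling from the language model appear in the loop: the loss needs only log-probabilities of responses that are already in the preference dataset.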
What is still missing
The selection circuitry is still fragile in ways the brain is not.
- Reward hacking on subjective tasks. Verifiable rewards work where ground truth exists (math, code, formal proof). They are not available for creative writing, brand voice or nuanced argumentation, where a learned reward model can be gamed and human preference data remains the better signal.
- Long-horizon credit assignment. When the reward arrives only rarely across thousands of steps, agents struggle to tell which actions earned it. Multi-hour autonomous tasks still loop, time out, or hallucinate progress.
- Skill transfer is poor. A policy that learned task A rarely transfers cleanly to task B without expensive re-training.
- No procedural consolidation. Neural networks do not have the obvious "skill-becomes-automatic" transition that mammals do. A model that has solved a problem a thousand times spends as many tokens solving it the thousand-and-first time.
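One way to see the long-horizon credit assignment problem from the list above is to compute discounted returns for an episode whose only reward arrives at the final step. This is a sketch of a single contributing mechanism, not a full account of why agents fail; the 2,000-step horizon and the discount of 0.99 are assumed values, and many LLM training setups use no discount at all, in which case every step receives the same undifferentiated credit instead.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: 2,000 steps, with a single reward of 1.0 at the end.
horizon = 2000
rewards = [0.0] * (horizon - 1) + [1.0]
returns = discounted_returns(rewards)
first_step_signal = returns[0]   # equals gamma ** (horizon - 1)
last_step_signal = returns[-1]   # equals the raw reward
```

Under these assumptions the return seen by the first action is many orders of magnitude smaller than the one seen by the last, so early decisions receive almost no usable signal from a single sparse reward.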
How we read the verdict
We rate the AI counterpart Developing. The post-training stack is mature and shipping everywhere; the agentic loop — pick the right action, repeat, build a skill — is closing fast but still fails on long horizons and unfamiliar environments. Among the three V1.2 deep-brain regions, this is the one with the steepest improvement curve right now.