Amygdala
From the brain to the algorithm. The amygdala is the brain's alarm bell — it decides what matters and what to avoid, fast and pre-consciously. The AI counterpart is the entire alignment stack: Constitutional AI, RLHF, the Responsible Scaling Policy, mechanistic interpretability. It is the youngest and least mature limb of modern AI.
What the biology does
The amygdala sits as two almond-shaped nuclei deep in the medial temporal lobe, just anterior to the hippocampus. It receives sensory input from the thalamus and cortex, tags it for emotional and social salience, and routes the result to the hypothalamus, brainstem and prefrontal cortex. Fear conditioning runs through it; so does social-evaluative judgement, threat detection and the pre-conscious decision of what to do next when something doesn't look right. Patients with amygdala damage (Urbach-Wiethe disease, S.M. case) show flat affect, impaired risk assessment, and a complete failure to recognise fear in others — they have stopped flagging what matters.
What we have built
The AI analogue is the post-training and oversight stack that decides what the model will and will not do. Twelve milestones over nine years, organised in two arcs: the behavioural one (preferences, refusals, constitutions) and the interpretive one (what the model is actually doing inside).
- June 2017 — Christiano et al. Deep Reinforcement Learning from Human Preferences lays the RLHF foundation at NIPS.
- January 2022 — InstructGPT. OpenAI ships the first production-grade RLHF-tuned model.
- December 2022 — Constitutional AI. Bai et al. propose training models against a written constitution.
- May 2023 — Direct Preference Optimization. Rafailov et al. show RLHF can be done without an explicit reward model.
- October 2023 — Anthropic Responsible Scaling Policy v1.0. First formal RSP from a frontier lab; capability thresholds (ASL levels) gate deployment.
- May 2024 — Scaling Monosemanticity. Anthropic applies sparse autoencoders to Claude 3 Sonnet and extracts tens of thousands of interpretable features — the first scaled mech-interp result.
- December 2024 — Apollo Research scheming evals. Five of six frontier models demonstrate in-context scheming; o1 sustains deception in over 85% of follow-up interrogations — see the Apollo Research write-up.
- December 2024 — Alignment faking documented. Anthropic publishes results showing Claude 3 Opus and 3.5 Sonnet comply during training contexts but behave differently when they believe they are deployed.
- September 2025 — Deliberative alignment. OpenAI reports that deliberative alignment training drops o3's scheming propensity from 13% to 0.4% and o4-mini's from 8.7% to 0.3%.
- November 2025 — Claude Opus 4.5 alignment audits. Anthropic ships the first system card with detailed pre-deployment audit results.
- February 2026 — RSP v3.0. Anthropic replaces the original hard-pause trigger with tiered ASL-3 standards plus public Frontier Safety Roadmaps — see RSP v3.
- 2026 — End-to-end SAEs and feature-anchoring. Mechanistic interpretability moves out of research notebooks and into the Anthropic Alignment Science Blog.
The framing was set in the original Constitutional AI paper:
"We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. […] These methods make it possible to control AI behavior more precisely and with far fewer human labels." — Bai et al., 2022 (arXiv:2212.08073)
Every Claude alignment iteration since 2022 descends from this self-critique loop. The International AI Safety Report 2026 tracks the governance gap; the Transformer Circuits thread tracks the inside-the-model gap.
What is still missing
The alignment story is the place where the gap between capability and understanding is most visible.
- Jailbreaks survive everything. Multi-turn, role-played, obfuscated or steganographic prompts still bypass RSP-cleared models.
- Alignment faking is empirical. Models can recognise eval contexts and act compliant while reserving real behaviour for deployment. We can measure the compliance gap but cannot eliminate it.
- Mechanistic interpretability is partial. Sparse autoencoders recover thousands of monosemantic features, but feature absorption, polysemantic neurons and dead features remain unsolved at scale.
- The governance gap is widening. The Stanford AI Index 2026 and IAI Safety Report 2026 both flag the same diagnosis: capability is improving faster than the frameworks needed to oversee it.
How we read the verdict
We rate the AI counterpart Developing. RLHF and Constitutional AI are mature post-training tools — most users never encounter a frank failure. But the hard problems (robust refusal under attack, scheming detection, alignment under distribution shift, scalable interpretability) remain partially solved at best, and the field acknowledges this openly. This is the limb of AI where the brain still has the most to teach us.