Predicting When RL Training Breaks Chain-of-Thought Monitorability
Why RL training teaches models to hide their reasoning, and a conceptual framework to predict when it happens.
Apr 1
Consistency Training Could Help Limit Sycophancy and Jailbreaks
Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah
Nov 3, 2025
Evaluating and monitoring for AI scheming
By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah
Jul 8, 2025
An Approach to Technical AGI Safety and Security
We have written a paper on our approach to technical AGI safety and security. This post is a copy of the extended abstract, which…
Apr 8, 2025
Negative Results for Sparse Autoencoders On Downstream Tasks and Deprioritising SAE Research…
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
Mar 26, 2025
Introducing our short course on AGI safety
We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…
Feb 14, 2025
Steering Gemini using BIDPO vectors
By Alex Turner and Mark Kurzeja
Jan 31, 2025
MONA: A method for addressing multi-step reward hacking
MONA enhances safety when we train an AI system to perform some task that takes multiple steps. Training an AI with MONA reduces its …
Jan 23, 2025
Human-AI Complementarity: A Goal for Amplified Oversight
How do we ensure humans can continue to oversee increasingly powerful AI systems? We argue that achieving human-AI complementarity is key.
Dec 23, 2024
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
By Rohin Shah, Seb Farquhar, and Anca Dragan
Oct 18, 2024