Predicting When RL Training Breaks Chain-of-Thought Monitorability
Why RL training teaches models to hide their reasoning, and a conceptual framework to predict when it happens.
Apr 1
Consistency Training Could Help Limit Sycophancy and Jailbreaks
Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah
Nov 3, 2025
Evaluating and monitoring for AI scheming
By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah
Jul 8, 2025
An Approach to Technical AGI Safety and Security
We have written a paper on our approach to technical AGI safety and security. This post is a copy of the extended abstract, which…
Apr 8, 2025
Negative Results for Sparse Autoencoders On Downstream Tasks and Deprioritising SAE Research…
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
Mar 26, 2025
Introducing our short course on AGI safety
We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…
Feb 14, 2025
Steering Gemini using BIDPO vectors
By Alex Turner and Mark Kurzeja
Jan 31, 2025
MONA: A method for addressing multi-step reward hacking
MONA enhances safety when we train an AI system to perform some task that takes multiple steps. Training an AI with MONA reduces its …
Jan 23, 2025
Human-AI Complementarity: A Goal for Amplified Oversight
How do we ensure humans can continue to oversee increasingly powerful AI systems? We argue that achieving human-AI complementarity is key.
Dec 23, 2024
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
By Rohin Shah, Seb Farquhar, and Anca Dragan
Oct 18, 2024