Teaching Teachers How to Teach
Or: What can we learn from LLMs?
One of the big questions is whether AIs will end up capable of generating truly new knowledge. In 2025, co-scientists are developing experimental plans, and autonomous research systems are writing workshop papers. It’s not hard to imagine a future where we can automate scientific discovery. But, until robots run the world, we still need people to act on those insights. So how can we explain to humans the information and algorithms that models have learned? What makes a useful explanation in the first place?
In late 2023, a paper used superhuman chess engines to teach professional human players novel concepts about the game. This worked super well; one of the subjects even went on to win the world championship! However, if you look at the methods they used, it’s fairly difficult to see how they would generalize to other domains.
LLMs, though, hold unique promise as a tool for this: they already reason in natural language, so why can’t they generate explanations of the concepts they’ve learned? I recently trained an LLM to think about chess in natural language. Can we adapt the techniques used to train reasoning models to instead train models to explain themselves?
Our approach is to use multi-agent reinforcement learning to directly train teachers to generate precise and efficient explanations of knowledge stored within their weights. To do this, we need to quantify what it means for an explanation to be high quality. One useful formalism (used in Schut et al.) is teachability: use a student network to learn from the teacher and measure how much it improves. In that work, the student is trained on a dataset generated by the teacher; here, instead, the teacher generates explanations that the student uses as in-context prompts.
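To make this concrete, here is a minimal sketch of a teachability score. The student and grader interfaces below are assumptions (passed in as callables; the post doesn't fix a specific implementation): the score is simply the student's average graded accuracy on questions about K when a candidate explanation is placed in its context.

```python
from typing import Callable, List, Tuple

def teachability(
    explanation: str,
    questions: List[Tuple[str, str]],       # (question, reference answer) pairs about K
    student_answer: Callable[[str], str],   # student LLM: prompt -> answer (assumed interface)
    grade: Callable[[str, str], float],     # grader: (answer, reference) -> score in [0, 1]
) -> float:
    """Score an explanation by how well a student answers questions about K
    when the explanation is prepended to its prompt."""
    scores = []
    for question, reference in questions:
        prompt = f"{explanation}\n\nQuestion: {question}"
        answer = student_answer(prompt)
        scores.append(grade(answer, reference))
    return sum(scores) / len(scores)
```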
Given some knowledge K, we consider a teacher T (who knows K) and a student S (who doesn’t). We prompt the teacher to generate a long-form explanation of K and sample several candidate explanations. For each candidate, we place it in the student’s context and use a grader to evaluate the student’s answers to questions about K. We can then fine-tune the teacher using RL to generate explanations that maximize the student’s scores.
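Here is one way the outer loop might look. This is a sketch under assumptions: the post doesn't commit to a particular RL algorithm, so the group-relative baseline below (in the spirit of GRPO-style reasoning-model training) and the teacher sampling interface are both hypothetical. The idea is to sample several candidate explanations, score each with the teachability metric above, and normalize the scores into advantages for a policy-gradient update on the teacher.

```python
import statistics
from typing import Callable, List, Tuple

def explanation_advantages(
    prompt: str,
    sample_explanations: Callable[[str, int], List[str]],  # teacher sampler: (prompt, n) -> n candidates (assumed interface)
    score: Callable[[str], float],                          # e.g. the teachability() sketch above
    num_candidates: int = 8,
) -> List[Tuple[str, float]]:
    """Sample candidate explanations and compute group-relative advantages.

    Each (explanation, advantage) pair would then feed a policy-gradient
    update on the teacher (the update itself, e.g. via an RL fine-tuning
    framework, is not shown here).
    """
    candidates = sample_explanations(prompt, num_candidates)
    rewards = [score(c) for c in candidates]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(c, (r - mean) / std) for c, r in zip(candidates, rewards)]
```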
This type of setup isn’t entirely novel; previous work has used similar approaches to automate the elicitation of model behavior and to improve math performance. Both of those approaches, however, focus on eliciting capabilities that already exist within the student. We instead focus on distilling knowledge within the teacher into explanations, using the student as a way to measure explanation quality.
I’m excited to first develop this approach in constrained setups, like generating explanations of optimal strategy in a simple game. I believe this technique will then scale to tackle problems such as model diffing (what prompt makes the student act like the teacher?) and knowledge extraction (what does the teacher know?).