3 Comments
Francesco Rocchi

I had read about this RCT in your previous post and was intrigued. A few remarks came to mind. In general terms, I think what Google tested wasn't a model for AI tutoring, but for human, AI-assisted tutoring.

1) While the error rate was very low, 23.6% of the LLM's answers were changed by the tutors before being submitted to students. Those answers weren't wrong outright, but the tutors intervened to tweak them in some way. Essentially, responses from the AI tutors benefited from a revision the human tutors didn't receive. The same happened when human tutors were assisted by Tutor CoPilot, only with the roles reversed. I think Google should have taken this into account when calculating the effectiveness of the AI tutors compared to the human ones.

2) Since every response from the AI tutors was revised by human tutors, the typical LLM drift effect was eliminated. Errors and subpar answers can't stack up thanks to the oversight, while an unsupervised AI tutor might eventually go off track. On the other hand, if sessions with AI tutors are kept brief enough to prevent errors from accumulating, the problem might be less significant, or it may resolve itself whenever the AI is reset.

I'm happy to see that the classroom setting remains central to learning. The social side of learning is still important, as you noted when discussing the mistakes human teachers occasionally make in class.

Furthermore, recent research seems to show that making mistakes can be so beneficial that one should consider making them deliberately. Without going that far, being in class and seeing someone else make a mistake can be quite useful (as can, more broadly, hearing the different perspectives of other students).

Olivier Chabot

Hi Daisy,

Here's an interesting article about the implementation of AI oral exams and grading. Maybe their process could be improved with comparative judgement.

https://www.behind-the-enemy-lines.com/2025/12/fighting-fire-with-fire-scalable-oral.html?m=1
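To make the suggestion concrete, here is a minimal sketch of how comparative judgement scoring could work: judges simply pick the better of two exam responses, and a Bradley-Terry model turns those pairwise wins into a quality scale. All names and data below are hypothetical, and this shows one standard fitting method (Hunter's MM updates), not the process from the linked article.

```python
# Hypothetical pairwise judgements: (winner, loser) from repeated
# "which of these two responses is better?" decisions by judges.
from collections import defaultdict
import math

judgements = [
    ("Ana", "Ben"), ("Ana", "Cal"), ("Ben", "Cal"),
    ("Ana", "Ben"), ("Cal", "Ben"), ("Ana", "Cal"),
]

def bradley_terry(pairs, iterations=200):
    """Fit Bradley-Terry strengths from pairwise wins via MM updates."""
    items = sorted({i for pair in pairs for i in pair})
    wins = defaultdict(int)   # total wins per item
    n = defaultdict(int)      # comparison count per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1
        n[frozenset((winner, loser))] += 1
    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        new = {}
        for i in items:
            denom = sum(
                n[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and n[frozenset((i, j))]
            )
            new[i] = wins[i] / denom if denom else strength[i]
        mean = sum(new.values()) / len(items)
        strength = {i: s / mean for i, s in new.items()}  # keep scale stable
    return strength

for item, s in sorted(bradley_terry(judgements).items(), key=lambda kv: -kv[1]):
    print(f"{item}: {math.log(s):+.2f}")  # log-strength, a common CJ scale
```

The appeal is that "which of these two is better?" tends to be an easier and more reliable judgement than assigning an absolute mark to a single response.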

Dan Collison

When learning, it’s better to explore than to explain. To probe and test rather than rush to a premature conclusion.

So, tailored questions that explore.

After all, information is “a difference that makes a difference”—signal, rather than noise. Learning is integrating information and preserving the differences that make a difference.

The very etymology of “explain” reveals its limitation: it means to flatten out. When we explain, we compress reality in a lossy way, sacrificing vital nuance by smoothing away some of the still meaningful wrinkles in the pursuit of brevity.

Even when we must reach conclusions, we should flag them as provisional. After all, to conclude literally means to shut something completely. A concluded mind closes itself to new information, ending the very exploration that generates understanding.

PS:

For communicating complex ideas without losing their texture, we use higher-fidelity compression algorithms. E.g., instead of explanations, we use metaphors, poetry, allegories, or stories. These can encode patterns of reality in great richness while remaining compact. These methods don’t flatten truth as much: they preserve the wrinkles that standard explanations smooth away, allowing us to transmit greater complexity. Good writers do this.

PPS: “probing” provides rich semantic insights regarding learning.

In German, we say, “Ich probier’ mal” - “I’ll try it”.

Although German and English are cousin Germanic languages, English has lost a bit of that semantic insight of “probe” meaning try. But we can approach that sense of “try” through probe’s cognates “proof” and “prove.” A “waterproof” jacket is one that has been tested with water, and none gets through. “That’s a proven strategy” means it’s been tested.

How might you program “probing questions” in teaching and learning? Perhaps like an octopus’s arms, which individually test or probe the ADJACENT environment, even without being directed to do so by the brain. The arms then report their “probings” to the brain, which synthesizes.

The arms don’t explain; they merely probe and report.
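For fun, here is a toy Python sketch of that architecture (all names, probes, and answers are hypothetical, purely to illustrate the metaphor): each "arm" runs one local test of an adjacent concept and reports only whether the answer held, never an explanation; a central "brain" synthesizes the reports.

```python
from dataclasses import dataclass

@dataclass
class Report:
    concept: str   # the adjacent patch of ground this arm probed
    probe: str     # the question asked
    held: bool     # did the learner's answer survive the test?

def arm_probe(concept: str, probe: str, answer: str, expected: str) -> Report:
    """An arm tests one adjacent concept and reports; it never explains."""
    return Report(concept, probe,
                  held=(answer.strip().lower() == expected.strip().lower()))

def brain_synthesize(reports: list[Report]) -> str:
    """The brain integrates the probings and names where meaning has leaked."""
    leaks = sorted({r.concept for r in reports if not r.held})
    if not leaks:
        return "All probes held: nothing leaked."
    return "Leaks detected at: " + ", ".join(leaks)

# Three arms probe adjacent corners of the same topic, then report back.
reports = [
    arm_probe("place value", "What is 10 x 23?", "230", "230"),
    arm_probe("place value", "What is 0.1 x 23?", "23", "2.3"),
    arm_probe("estimation", "Is 9 x 21 closer to 180 or 210?", "180", "180"),
]
print(brain_synthesize(reports))   # -> Leaks detected at: place value
```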

In one of your examples, the children in the classroom report back to the teacher: “you made a mistake, your tentacle has missed the mark.” The children do NOT explain to the teacher what has happened; rather, they just report that what has been said has not been “proven,” that the sense or meaning has been lost; it has “leaked.”
