5 Comments
Francesco Rocchi

I had read about this RCT in your previous post and was intrigued. A few remarks came to mind. In general terms, I think what Google tested wasn't a model for AI tutoring, but for human, AI-assisted tutoring.

1) While the error rate was very low, 23.6% of the LLM's answers were changed by the human tutors before being submitted to students. Those answers weren't wrong outright, but the tutor intervened to tweak them somehow. Basically, the AI tutors' responses benefited from a revision the human tutors' responses didn't receive. The same happened when human tutors were assisted by Tutor CoPilot, only with the roles reversed. I think Google should have taken this into account when calculating the effectiveness of the AI tutors compared to the human ones.

2) Since every response from the AI tutors was revised by a human tutor, the typical LLM drift effect was eliminated. Errors and subpar answers can't stack up thanks to the oversight, whereas an unsupervised AI tutor might eventually go off track. On the other hand, if sessions with AI tutors are brief enough to prevent errors from accumulating, the problem might be less significant, or it may resolve itself whenever the AI is restarted.

I'm happy to see that the classroom setting remains central to learning. The social side of learning is still important, as you noted when talking about the mistakes human teachers occasionally make in class.

Furthermore, recent research seems to show that making mistakes can be so beneficial that one should consider making them deliberately. Without going that far, being in class and seeing someone else make a mistake can be quite useful (as can, more broadly, hearing the different perspectives of other students).

Federico

I read your November 2025 post, so seeing such a quick turnaround on the possibilities of LLMs as tutors is amazing. It seems Google is approaching this from multiple directions; it's also raising the bar on textbooks through Learn Your Way, which personalizes them instantly for every student using principles from the science of learning like dual coding. (There are real risks with that approach too, of course.)

Now, the approach you were attempting, generating a personalized report that diagnoses and exemplifies the problem before showing it in the student's own text, is a great one. LLMs are pretty bad at plain-old proofreading. Apparently, this has to do with the fact that their training data includes far more polished, publication-ready text than drafts-in-progress, so the model never really learns what "in between" looks like. No More Marking can probably train it by showing large volumes of drafts and final versions, which is what you may be doing already.

That said, LLMs are strong at rewriting drafts into polished versions that are generally error-free. If I had a sufficiently large budget, I'd probably exploit that for training: ask an LLM (specifically, Claude) to rewrite a student text respectfully, then have a second LLM compare the draft and the rewrite, identify what changed, and use that comparison to show the specific corrections that were needed. It's a roundabout route to the goal, but it plays to LLMs' actual strengths.
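(A minimal sketch of that rewrite-then-compare pipeline, assuming the Anthropic Python SDK; the model name, prompts, and helper names are illustrative placeholders, not anything Federico or No More Marking actually use.)

```python
# Sketch of the rewrite-then-compare idea (illustrative only).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-latest"  # placeholder; use whichever model you have access to


def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the text of the reply."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def corrections_for(draft: str) -> str:
    # Step 1: have the LLM rewrite the student draft into a polished version.
    rewrite = ask(
        "Rewrite the following student draft so it is clear and error-free, "
        "changing as little as possible and keeping the student's voice:\n\n" + draft
    )
    # Step 2: have a second call compare draft and rewrite and name the corrections.
    return ask(
        "Compare the DRAFT and the REWRITE below. List each specific correction "
        "that was made and the error it fixes, as feedback for the student.\n\n"
        f"DRAFT:\n{draft}\n\nREWRITE:\n{rewrite}"
    )


print(corrections_for("Yesterday me and my freind goes to the park, it were fun."))
```

Running a plain text diff between the two versions first would make the comparison cheaper, but asking the model to describe the changes keeps the feedback in ordinary language for the student.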

Neural Foundry

That 0.14% error rate is wild considering how LLMs usually drift. The bigger point about favoring targeted questions over open-ended explanations really lands, though. I remember spinning my wheels way more often when teachers just talked at me versus when they actually checked understanding with specific prompts.

Olivier Chabot

Hi Daisy,

Here's an interesting article about the implementation of AI oral exams and grading. Maybe their process could be improved with comparative judgement.

https://www.behind-the-enemy-lines.com/2025/12/fighting-fire-with-fire-scalable-oral.html?m=1

Dan Collison

When learning, it’s better to explore than to explain. To probe and test rather than rush to a premature conclusion.

So, tailored questions that explore.

After all, information is “a difference that makes a difference”—signal, rather than noise. Learning is integrating information and preserving the differences that make a difference.

The very etymology of “explain” reveals its limitation: it means to flatten out. When we explain, we compress reality in a lossy way, sacrificing vital nuance by smoothing away some of the still meaningful wrinkles in the pursuit of brevity.

Even when we must reach conclusions, we should flag them as provisional. After all, to conclude literally means to shut something completely. A concluded mind closes itself to new information, ending the very exploration that generates understanding.

PS:

For communicating complex ideas without losing their texture, we use higher-fidelity compression algorithms. E.g., instead of explanations, we use metaphors, poetry, allegories, or stories. These can encode patterns of reality in great richness, yet in a compact format. These methods don’t flatten truth as much. They preserve the wrinkles that standard explanations smooth away, allowing us to transmit greater complexity. Good writers do this.

PPS: “probing” provides rich semantic insights regarding learning.

In German, we say, “Ich probier’ mal” - “I’ll try it”.

Although German and English are cousin Germanic languages, English has lost a bit of that semantic insight of “probe” meaning try. But we can approach that insight of “try” through probe’s cognates “proof” and “prove.” A “waterproof” jacket is one that has been tested with water, and none gets through. “That’s a proven strategy” means it’s been tested.

How might you program “probing questions” in teaching and learning? You might model them on an octopus’s arms, which individually test or probe the ADJACENT environment, even without being directed to do so by the brain. The arms then report their “probings” to the brain, which synthesizes.

The arms don’t explain, they merely probe and report.
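(A toy sketch of that probe-and-report pattern in plain Python; the individual probes and the sample answer are purely illustrative assumptions, not a real tutoring system.)

```python
# Toy sketch: independent "arms" each probe one adjacent aspect of a student's
# answer and report back; a central "brain" only gathers and synthesizes the reports.

def probe_key_term(answer: str) -> str:
    # Probe one narrow thing: does the expected key term appear?
    return "mentions 'photosynthesis'" if "photosynthesis" in answer.lower() else "key term missing"

def probe_length(answer: str) -> str:
    # Probe another narrow thing: is the answer developed at all?
    return "developed answer" if len(answer.split()) >= 20 else "answer is very short"

def probe_reason_given(answer: str) -> str:
    # Probe whether any reasoning is offered, without judging its quality.
    return "gives a reason" if "because" in answer.lower() else "no reasoning word found"

ARMS = [probe_key_term, probe_length, probe_reason_given]

def brain(answer: str) -> list[str]:
    # The arms don't explain; they just report. The brain collects the reports.
    return [arm(answer) for arm in ARMS]

print(brain("Plants make food because of photosynthesis."))
```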

In one of your examples, the children in the classroom report back to the teacher, “you made a mistake, your tentacle has missed the mark.” The children do NOT explain to the teacher what has happened; rather, they just report that what has been said has not been “proven,” that the sense or meaning has been lost; it’s “leaked.”