Discussion about this post

Francesco Rocchi

I had read about this RCT in your previous post and was intrigued. A few remarks came to mind. In general terms, I think that what Google tested wasn't a model of AI tutoring, but of human, AI-assisted tutoring.

1) While the error rate was very low, 23.6% of the LLM's answers were changed by the tutors before being submitted to students. Those answers weren't outright wrong, but the tutors intervened to tweak them in some way. In effect, the AI tutors' responses benefited from a revision the human tutors' responses didn't receive. The same happened when human tutors were assisted by Tutor CoPilot, only with the roles reversed. I think Google should have taken this into account when calculating the effectiveness of the AI tutors compared to the human ones.

2) Since every response from the AI tutors was reviewed by human tutors, the typical LLM drift effect was eliminated. Errors and subpar answers can't stack up thanks to the oversight, while an unsupervised AI tutor might eventually go off track. On the other hand, if sessions with AI tutors are brief enough that errors don't have time to accumulate, the problem might be less significant, or it may resolve itself whenever the conversation is restarted.

I'm happy to see that the classroom setting remains central to learning. The social side of learning is still important, as you noted when discussing the mistakes human teachers occasionally make in class.

Furthermore, recent research seems to show that making mistakes can be so beneficial that one should consider making them deliberately. Without going that far, being in class and seeing someone else make a mistake can be quite useful (as can, more broadly, hearing the different perspectives of other students).

Federico

I read your November 2025 post, so seeing such a quick turnaround on the possibilities of LLMs as tutors is amazing. It seems Google is approaching this from multiple directions; it's also raising the bar on textbooks through Learn Your Way, which personalizes them instantly for every student using principles from the science of learning like dual coding. (There are real risks with that approach too, of course.)

Now, your approach of generating a personalized report that diagnoses and exemplifies the problem before pointing to it in the student's own text is a great one. LLMs are pretty bad at plain old proofreading. Apparently this has to do with the fact that their training data includes far more polished, publication-ready text than drafts in progress, so the model never really learns what "in between" looks like. No More Marking could probably train one by showing it large volumes of paired drafts and final versions, which may be what you're doing already.

That said, LLMs are strong at rewriting drafts into polished versions that are generally error-free. If I had a sufficiently large budget, I'd probably exploit that for training: ask an LLM (specifically, Claude) to rewrite a student text respectfully, then have a second LLM compare the draft and the rewrite, identify what changed, and use that comparison to show the specific corrections that were needed. It's a roundabout route to the goal, but it plays to LLMs' actual strengths.
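Just to make the idea concrete, here is a minimal sketch of that pipeline in Python. Everything in it is illustrative: the function names and the worked example are made up, the LLM call is left as a placeholder, and a plain word-level diff (difflib) stands in for the "second LLM" that would describe the changes. It is only meant to show the draft-versus-rewrite comparison step, not a real implementation.

```python
import difflib

def rewrite_with_llm(draft: str) -> str:
    # Placeholder for the first step: call an LLM (the comment suggests Claude)
    # with a prompt asking it to rewrite the student's draft respectfully into
    # a polished, error-free version.
    raise NotImplementedError

def extract_corrections(draft: str, polished: str) -> list[str]:
    # Second step: compare the draft and the rewrite word by word; every
    # non-equal span is a correction the student would have needed to make.
    # A plain textual diff stands in here for the second LLM.
    draft_words, polished_words = draft.split(), polished.split()
    matcher = difflib.SequenceMatcher(a=draft_words, b=polished_words)
    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        before = " ".join(draft_words[i1:i2]) or "(nothing)"
        after = " ".join(polished_words[j1:j2]) or "(removed)"
        corrections.append(f"{tag}: '{before}' -> '{after}'")
    return corrections

if __name__ == "__main__":
    draft = "The dog run fastly across the feild."
    polished = "The dog ran quickly across the field."  # stand-in for LLM output
    for change in extract_corrections(draft, polished):
        print(change)
```

On the toy example this prints the two changed spans ("run fastly" -> "ran quickly" and "feild." -> "field."), which is the kind of correction list the comparison step would feed back to the student.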
