Discussion about this post

Olivier Chabot

Hi Daisy,

Here's an interesting article about implementing AI oral exams and grading. Maybe their process could be improved with comparative judgement.

https://www.behind-the-enemy-lines.com/2025/12/fighting-fire-with-fire-scalable-oral.html?m=1

Francesco Rocchi

I had read about this RCT in your previous post and was intrigued. A few remarks came to mind. In general terms, I think that what Google tested wasn't a model for AI tutoring, but for human, AI-assisted tutoring.

1) While the error rate was very low, 23.6% of the LLM's answers were changed by the tutors before being sent to students. Those answers weren't outright wrong, but the tutors intervened to tweak them in some way. In effect, the AI tutors' responses benefited from a revision that the human tutors' responses didn't receive. The same happened when human tutors were assisted by Tutor Co-Pilot, only with the roles reversed. I think Google should have taken this into account when calculating the effectiveness of the AI tutors compared to the human ones.

2) Since every response from the AI tutors was reviewed by human tutors, the typical LLM drift effect was eliminated. Thanks to that oversight, errors and subpar answers can't stack up, whereas an unsupervised AI tutor might eventually go off track. On the other hand, if sessions with AI tutors are brief enough to prevent errors from accumulating, the problem might be less significant, or it may resolve itself whenever the AI is rebooted.

I'm happy to see that the classroom setting remains central to learning. The social side of learning is still important, as you noted when discussing the mistakes human teachers occasionally make in class.

Furthermore, recent research seems to show that making mistakes can be so beneficial that one should consider making them deliberately. Without going that far, being in class and seeing someone else make a mistake can be quite useful (as can, more broadly, hearing the different perspectives of other students).
