Is AI feedback true but useless?

What happened when 9,000 primary pupils received personalised feedback on their writing

May 30, 2026

Last term we completed a large-scale writing redraft project. Nearly 9,000 Year 6 students (10-11 year olds) from 211 schools took part in a two-stage writing assessment.

First, they wrote a creative response to an image prompt in independent conditions.
Second, a couple of weeks later, they received a mix of human and AI feedback on their writing and redrafted their response.

It is now relatively easy to use AI to produce reams of polished and personalised written feedback for a student. In some ways this is incredible: as I have written before, if I had had this technology when I started teaching, it would have been transformative.

But there is still a big research question about the effectiveness of written feedback. Do students read it? If they read it, do they understand it? If they understand it, do they act on it? And if they act on it, does it permanently change their mental models?

Dylan Wiliam has a great anecdote where he sees that a student has received the feedback “you need to make your scientific investigations more systematic.” He asks the student what he is going to do differently as a result. The student replies “I don’t know. If I knew how to be more systematic, I would have been so the first time.” This is the problem with all written feedback, whoever writes it. It risks being TBU - true but useless.

Our AI-enhanced Comparative Judgement assessments now provide schools with lots of automatic feedback. As well as the traditional written comment, we also provide schools with an examiner’s report for the teacher, and with a set of personalised multiple-choice questions for each student. Our hunch is these will be more useful than the comments, but we need to do more work to test this.

We have the headline results from the assessment, and they show that the cohort improved by 8 scaled score points - an effect size of about 0.2. This is a significant and meaningful improvement. However, not all students improved! Only 64% got a better score on the second assessment. The rest went backwards.
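As a rough sanity check on those headline figures, here is a minimal Python sketch of the arithmetic. It assumes the effect size is a standardised mean difference (mean gain divided by a standard deviation), which is not spelled out above, and it treats "nearly 9,000" as exactly 9,000 purely for illustration. The function names are hypothetical.

```python
def implied_sd(mean_gain: float, effect_size: float) -> float:
    """Standard deviation implied by a standardised mean difference.

    Assumption: effect_size = mean_gain / sd, so sd = mean_gain / effect_size.
    """
    return mean_gain / effect_size


def split_cohort(cohort_size: int, share_improved: float) -> tuple[int, int]:
    """Approximate counts of pupils who improved and who did not."""
    improved = round(cohort_size * share_improved)
    return improved, cohort_size - improved


# Figures quoted in the post: 8 scaled score points, effect size about 0.2.
sd = implied_sd(8.0, 0.2)

# Assumption: cohort rounded to 9,000; 64% scored better on the redraft.
improved, not_improved = split_cohort(9000, 0.64)

print(f"Implied SD: about {sd:.0f} scaled score points")
print(f"Improved: about {improved}; did not improve: about {not_improved}")
```

Under those assumptions, the spread of scores works out at roughly 40 scaled score points, and about 5,760 pupils improved while about 3,240 did not.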

Here is a “before and after” snippet from one of the most improved students. This student received a set of questions on sentence structure, and I would argue that these extracts show how improving sentence structure does have a material impact on writing quality and the reader’s understanding.

Over the next couple of weeks we are going to feature some more case studies about how teachers used this feedback in class, and what makes the difference between the students who improved and those who got worse.

This particular project was large-scale and nationally standardised. However, it is possible for individual schools or teachers to use our system to run small-scale projects like this in their own contexts. If you would like to learn more, you can register for a webinar here or book a demo with me here.

TobyM

May 30

Really enjoyed this, thank you. It makes me think about some advice I was given about formative feedback: focus on the HOW to do better, not the WHAT to do. As the excellent 'systematic' anecdote said, 'if I knew how I'd have done it the first time'. My worry with automated feedback is that the practitioner may not have enough of their own cognitive skin in the game.

Reading automated feedback is not the same as unpicking the text for why something didn't work. If the practitioner hasn't had to think about it, they are unlikely to fully understand it. If memory, after all, is the residue of thought, then that applies to staff as well as children.

My ongoing worry with AI in schools is that we are pursuing efficiency at the expense of efficacy.

Very off the cuff and I may well have missed the point entirely but very much enjoyed it so thank you for sharing.

Michael Tidd

Jun 1

I would hope you would expect me to respond :)

My first question, having tried this out this year, is about the "control": while re-drafting with the AI feedback has produced an improvement, what would be the equivalent outcome for re-drafting without specific feedback, with either more generic 'whole class' pointers or even no feedback at all? I have a suspicion that a clear focus on fewer points of generic feedback might be more effective for those who most need it.
