- Teachers spend a lot of time marking students’ work and writing feedback on it.
- New large language models (LLMs) like ChatGPT are really good at writing.
- So it seems like providing written feedback on student work might be an ideal use case for LLMs.
The above three bullet points all seem really obvious, and it’s for this reason that we and others have done a lot of experimenting with AI-generated student feedback. However, if you want to design a good AI feedback system, you first have to go back a step and think about what we are trying to achieve with all of these written comments.
Why do teachers give feedback? What is the point?
The classic metaphor here is of a thermometer and a thermostat. A thermometer measures the temperature. A thermostat uses the measurement to change the temperature. The fundamental purpose of feedback in any system is to change something.
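To make the metaphor concrete, here is a minimal toy sketch of the difference in code (purely illustrative, not a real heating system): the thermometer only reads a value, while the thermostat compares that reading to a goal and acts on the gap.

```python
def thermometer(room):
    # A thermometer only measures: it reports the state and changes nothing.
    return room["temperature"]

def thermostat(room, setpoint=20.0):
    # A thermostat uses the measurement to act: it switches the heating
    # on or off to close the gap between current and desired state.
    reading = thermometer(room)
    room["heating_on"] = reading < setpoint
    return room["heating_on"]

room = {"temperature": 17.5, "heating_on": False}
print(thermometer(room))   # 17.5 -- accurate, but nothing changes
print(thermostat(room))    # True -- the measurement provokes an action
```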
Ideally, the feedback from a teacher to a student will change the way a student thinks. Let’s suppose a student submits an essay where they have persistently misused full stops. The feedback from a teacher needs to change their understanding of sentences so that in the next piece of work they don’t make the same mistake.
This is obviously incredibly hard to do with a written comment. Dylan Wiliam makes this point very well in Embedded Formative Assessment.
“I remember talking to a middle school student who was looking at the feedback his teacher had given him on a science assignment. The teacher had written, “You need to be more systematic in planning your scientific inquiries.” I asked the student what that meant to him, and he said, “I don’t know. If I knew how to be more systematic, I would have been more systematic the first time.” This kind of feedback is accurate—it is describing what needs to happen—but it is not helpful because the learner does not know how to use the feedback to improve. It is rather like telling an unsuccessful comedian to be funnier—accurate, but not particularly helpful, advice.”
Wiliam goes on to say that what students actually need to improve is a “recipe for future action”, and a “series of activities that will move students from their current state to the goal state.”
Written comments are not good at doing this. They are much more like a thermometer than a thermostat. They might provide an accurate summation of the strengths and weaknesses of a piece of work, or of the strengths and weaknesses of a student’s mental model. But they are not very effective at provoking action or change.
So can AI help?
All this poses a challenge for AI-generated feedback. Currently, a great deal of energy is going into making AI-generated feedback more accurate. This is important, but even if we overcome that problem, we face another one: the “accurate but useless” problem.
To extend the metaphor, using AI to improve written comments is like spending all your time improving the precision of your thermometer – and expecting that will magically lead to the heating clicking on or off.
I also think this is a problem with the evaluation of AI more generally. We evaluate it by seeing if it can reproduce something that professionals currently spend a lot of time on. We don’t ask whether the thing the professionals are spending time on is actually valuable.[1]
So what should we do?
We have to find a way of providing feedback that is accurate and useful. If AI can produce written comments efficiently, then we might decide to keep them – but they cannot be the whole answer.
In future posts, we will explore some different answers to this question, including some of the new AI tools we are developing.
[1] This, for me, is the flaw in one of the seminal studies showing that AI can help professional workers. This study showed that LLMs helped management consultants do parts of their job more quickly, and human graders judged many of the outputs to be of higher quality too. But the quality of a consultant’s work is not determined by speed or the judgement of a human grader – it’s about whether their advice improves a company’s performance. That is a much harder question to answer, whether the outputs are AI-produced or not. What you really want is hard data on the subsequent stock market performance of the companies that received the advice.
If "what students actually need to improve is a “recipe for future action”, and a “series of activities that will move students from their current state to the goal state” "
... then instruct the AI to include in the feedback an explanation of exactly what the student needs to do to improve their work, and a “recipe for future action”, along with a “series of short activities that will move students from their current state of performance to the next, improved level” ...
Usually, the more detailed the instructions given to AI systems, the better the outcomes.
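As a rough illustration, a prompt along those lines might look like the following minimal sketch, assuming the OpenAI Python client. The prompt wording, model name, and essay placeholder are all illustrative assumptions, not a tested feedback system.

```python
# A sketch of a feedback prompt that asks for a "recipe for future action"
# rather than a bare evaluation. Prompt wording and model are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are a teacher giving feedback on a student's essay.
For each weakness you identify:
1. Explain exactly what the student needs to do to improve their work.
2. Give a 'recipe for future action' the student can follow next time.
3. Suggest two or three short practice activities that move the student
   from their current level of performance to the next, improved level."""

student_essay = "..."  # the student's submitted work goes here

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": student_essay},
    ],
)
print(response.choices[0].message.content)
```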
Encouraging to see Dylan Wiliam quoted. Undoubtedly most teachers spend too much time writing comments on pupils’/students’ work, and those comments make little difference even if they get read. There used to be a fashion for teachers to correct spellings in work. This is probably more linked to marking than assessment, although those teachers would then be able to assess which kids could spell particular words more accurately. It certainly didn’t make anyone a better speller, even if you were required to write the correct spelling out ten times at the end of your work. Ah, the joys of 20th century schools. I think it depends hugely on whether your feedback is to do with things like grammar and punctuation, or with style perhaps. As the student says, “If I knew how to be more systematic, I would have been more systematic the first time”. The same applies to so many improvements that teachers might suggest.
I was observing once in an FS2/Yr1 class and many examples of the children’s written work had complex comments that patently few of the class could have read or understood. The teacher explained that it was school policy to show that the work had been read and assessed. Accountability to Ofsted too. And the parents liked it.