In our early trials with AI marking, we started with a very simple approach: we asked a Large Language Model to give a numerical mark to a piece of writing.
We then compared its mark with the average mark awarded by a large group of teachers.
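To make that concrete, here is a minimal sketch of the direct-marking setup. The ask_llm helper, the prompt wording and the mark scale out of 40 are illustrative assumptions, not the exact prompts or scale we used.

```python
import re
import statistics

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to whichever LLM is being trialled
    and return its text reply. The real model call would go here."""
    raise NotImplementedError

def direct_mark(script_text: str, max_mark: int = 40) -> int:
    """Ask the model for a single numerical mark for one piece of writing."""
    prompt = (
        f"Mark the following piece of writing out of {max_mark}. "
        f"Reply with a number only.\n\n{script_text}"
    )
    reply = ask_llm(prompt)
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"Could not find a mark in the reply: {reply!r}")
    return int(match.group())

def gap_from_teachers(script_text: str, teacher_marks: list[int]) -> float:
    """How far the model's mark sits from the teachers' average mark."""
    return direct_mark(script_text) - statistics.mean(teacher_marks)
```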
In quite a lot of cases the AI agreed with the teachers. But in a significant minority of cases it did not, and in some of those cases its decision was absolutely baffling and – we felt – decisively wrong. Here are two examples. The human markers gave the piece on the right a higher grade, and every human reviewer I’ve shown these to since has agreed. The AI, however, gave the piece on the left a higher grade.
The research literature on assessment contains highly technical and philosophical discussions of marker disagreement: the different constructs markers might use to make their decisions, the extent to which markers hold different conceptions of quality, and whether it is even possible to know the true mark a script deserves.
All this is very interesting and we have contributed to some of it ourselves. However, while these big philosophical discussions matter, it is also important to acknowledge that sometimes you can just get big errors. Some decisions are plain wrong. They are what you might call Maradona errors – the assessment equivalent of a referee failing to spot a blatant handball.
Of course, humans make errors too, and individual humans are certainly capable of making errors like this one – after all, the referee who missed the Maradona handball was a human. But that’s not really a defence of AI. If we are going to introduce AI into our assessments, we want it to offer some improvement, not the same old problems!
AI plus Comparative Judgement
We have now moved on to a different model where we don’t directly ask the LLM to mark a script. Instead, we ask it to make Comparative Judgements: that is, we present the LLM with two scripts and ask it which is better. We ask it to make thousands of judgements like this and combine them to create a measurement scale. This process of combining judgements is one we have used successfully for nearly a decade with human Comparative Judgement, and it has been independently evaluated as more reliable and quicker than traditional marking.
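For the technically curious: Comparative Judgement scales are typically built with a Bradley-Terry-style model, in which each script gets an ability estimate chosen so that the observed pairwise decisions are as likely as possible, and the abilities are then rescaled for reporting. Here is a minimal sketch of that idea; the gradient-ascent fitting, the function names and the rescaling to a mean of 500 and standard deviation of 40 are illustrative assumptions, not our production code.

```python
import math
from collections import defaultdict

def fit_bradley_terry(judgements, n_iter=2000, lr=0.01):
    """Fit Bradley-Terry ability estimates from pairwise judgements.

    judgements: list of (winner_id, loser_id) tuples, one per decision.
    Returns a dict mapping script id -> ability on a logit scale, centred on zero.
    """
    theta = defaultdict(float)              # start every script at ability 0
    for _ in range(n_iter):
        grad = defaultdict(float)
        for winner, loser in judgements:
            # Probability the winner beats the loser under the current abilities
            p = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p         # push the winner up by the surprise
            grad[loser] -= 1.0 - p          # and the loser down by the same amount
        for script, g in grad.items():
            theta[script] += lr * g
    mean = sum(theta.values()) / len(theta)
    return {s: t - mean for s, t in theta.items()}

def to_scaled_scores(theta, mean=500.0, sd=40.0):
    """Rescale abilities to a reporting scale (illustrative: mean 500, sd 40)."""
    vals = list(theta.values())
    mu = sum(vals) / len(vals)
    sigma = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
    return {s: mean + sd * (v - mu) / sigma for s, v in theta.items()}
```

Called on a list of (winner, loser) pairs, to_scaled_scores(fit_bradley_terry(judgements)) returns one scaled score per script.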
Once we’ve created the AI measurement scale, we then ask our human judges to make some Comparative Judgements, and we can check whether the scaled scores awarded by the AI model agree with each human decision.
This process gives us some nuance, as it means we are not just looking at human-AI disagreements, but at the size of disagreements. For example, here’s a human-AI disagreement that is very small. The AI model gave the piece on the left a scaled score of 643 and the one on the right a score of 644. Our scale runs from approximately 300 to 700 with a standard deviation of about 40, so a difference of one point is negligible.
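Here is a sketch of how that check can work: each human judgement either agrees with the ordering implied by the AI scaled scores or disagrees by some number of points, and the disagreements can be sorted largest first. The function name and the way ties are counted are illustrative assumptions.

```python
def agreement_report(ai_scores, human_judgements):
    """Check human pairwise decisions against an AI-built scale.

    ai_scores: dict of script id -> AI scaled score.
    human_judgements: list of (winner_id, loser_id) pairs from human judges.
    Returns the agreement rate and the disagreements, largest gap first.
    """
    agreed, disagreements = 0, []
    for winner, loser in human_judgements:
        gap = ai_scores[winner] - ai_scores[loser]
        if gap >= 0:
            agreed += 1                      # the AI scale ranks the pair the same way
        else:
            disagreements.append((abs(gap), winner, loser))
    disagreements.sort(reverse=True)         # biggest disagreements first
    return agreed / len(human_judgements), disagreements

# A one-point gap like 643 vs 644 sits at the bottom of the list; a gap of
# several standard deviations (tens of points on this scale) rises to the top.
```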
What about the bigger disagreements? So far, we have carried out four large-scale national assessments using this approach. We’ve run two trials: one with Year 7 students and one with Year 6. We’ve also introduced optional AI judges into our standard summer assessments: one for Years 4 & 5 combined, and one for Years 7, 8 & 9 combined. Across all of these projects, the human-AI agreement has been between 80 and 85%. Some of the disagreements are very small, like the one above, but others are much bigger and need to be explained.
We have reviewed the big disagreements ourselves, and we have also followed up with our participating schools and asked them to review them too.
So far, in our review of all of the big disagreements, we have not found one AI howler. All the big disagreements are human howlers.
This is a hugely significant finding with enormous implications. It suggests that using AI in this way can potentially eliminate basic human errors due to fatigue or carelessness, without introducing any new errors.
Here is an example of a big AI-human disagreement that we think is human error. In the first image you can see the handwritten scripts. The second image is a transcription.
Here’s another one.
When you look at these, you might get disillusioned with human performance! However, you have to remember that these large disagreements are a tiny fraction of the overall number of judgements – around one in a thousand. Any process that uses human decisions will inevitably involve some human error, and an error rate of around 0.1% is pretty good going.
We are still awaiting the first AI howler using this new method. Our projects provide schools with an easy-to-read report listing all the AI-human disagreements in order of size. Take part in one of our projects and see if you can find an AI error!
So, human Comparative Judgement is designed to access the unvoiced expertise of the teacher. Does the AI work in the same way, forming its own internal sense of quality to compare each pair of pieces, or is it working from explicit criteria? That is a fascinating question.