We are running a free AI assessment trial for secondary schools in England who would like to try out this approach. You can read more about the project here and sign up for a short intro webinar here on Monday 10 February at 4pm.
If you’ve followed this Substack over the past two years, you will know we have been experimenting with using AI to improve assessment. You will also know that we have had a lot of ups and downs along the way! In January 2023, we were at the top of the hype cycle, thinking that Large Language Models (LLMs) were going to transform the world. By January 2024, we’d almost given up in despair. Now, at the end of January 2025, we have a powerful LLM-powered application that we think could change the way teachers assess. Here is the story of our journey and where we are now.
January 2023: Our prediction for LLMs and assessment
Our prediction at this point was as follows.
Within 6 months, a teacher will be able to log in to a website that will allow them to upload a set of typed or handwritten essays and which will instantly and automatically provide them with the following.
1) An accurate, moderated and nationally standardised grade / level / score for every student.
2) A personalised, useful and specific written comment about the strengths and weaknesses of each essay and how it could be improved.
3) A summary of the strengths and weaknesses of the set of essays and suggested next steps for the teacher.
And that was not just our prediction – it was what we were aiming to build ourselves.
January 2024: the problems with LLMs
We now know that we were not able to build this – and nobody else has either.
Why?
There are quite a few reasons, but one major issue is that LLMs hallucinate. The wish list we outlined above requires reliability and stability. LLMs still make stuff up and change their minds, and this really matters.
Hallucinations are a persistent problem. Whenever we discuss this problem in public, people tell us that we're doing something wrong and that it should be possible to prompt or fine-tune in such a way as to avoid it. But for a sign of just how stubborn this problem is for everyone, look at Apple's struggles to get their text summariser to work accurately and not hallucinate. Apple clearly have some of the most talented engineers in the world – and even they cannot get a relatively simple text summariser to work accurately and automatically.
We don’t know how long it will take to completely solve hallucinations. We often hear that they will be fixed “soon”. What do you mean by soon? 3 months, 3 years, 30 years? The time it takes to solve them will make a big difference to your decision making in the present moment. Our current feeling is that whilst they might be reduced and mitigated in different ways, fixing them completely is going to be very hard as in many ways they are a feature, not a bug, of the way LLMs work.
Hallucinations really matter. Another thing we often hear is that LLM errors don’t matter because humans make errors too – and LLMs are quicker than humans. So if humans and LLMs have the same error rate and LLMs are quicker, clearly they are better. If you’d made this argument to us a few years ago, we would have agreed with it. A lot of our previous research has focussed on how algorithms are often superior to human judgement. However, LLMs complicate this picture. What matters is not just how many errors you make, but the type of error, and LLMs often make errors that humans would not make. This matters because many of the systems we use every day have tried and tested methods that catch and reduce human error, meaning that the overall system error can be much lower than the errors of the individual humans within it. Many of these methods involve some form of individual accountability: eg a marker who consistently gets stuff wrong or a driver who consistently breaks speed limits will face sanctions. Textbooks and Wikipedia have relatively few errors, despite being written by humans, because they have multiple humans proofing and fact-checking. These systems are designed to catch human errors. LLM errors are often different, and so the methods don’t work so well with them.
Because of all the above problems, we concluded at this point that it would not be possible to use LLMs like traditional software, where you automate the process and then forget about it. Instead, we would have to design a different kind of system that includes humans at various vital points.
January 2025: the human in the loop
The “human in the loop” is a phrase you hear a lot at the moment as lots of people recognise the flaws with LLMs and seek to mitigate them. What is the best way of combining human and LLM expertise?
One platitude that gets repeated is that AI will do all the routine work and free up humans to focus on creative insights. But how true is that? Moravec's paradox from the 1980s reached a very different conclusion: AI would be exceptionally good at difficult tasks like playing grandmaster-level chess, but completely awful at mundane everyday tasks like loading a dishwasher. The new class of LLMs complicates the picture even further. They do not have the traditional AI strengths of reproducibility and consistency, and in that sense are more like humans. They can also be uncannily good at what we would typically think of as creative tasks like writing poetry or brainstorming business ideas. But they are also capable of making very weird errors that humans would not make. We think you have to look at things on a case-by-case basis and work out the best combination for each use case.
If you go back to our wish list at the start, we essentially wanted two things – accurate essay marking and useful feedback. Here’s how we have combined human and AI expertise for both.
Useful feedback
We experimented with getting the LLMs to produce feedback directly. They were very good at providing bland generic comments that looked good but wouldn’t really help a student to improve. When we asked them to be more specific and give specific examples of strengths and weaknesses, they were less good – not terrible, but error-prone.
The simplest way of adding a human in the loop here is to make the AI feedback editable and get teachers to review it and edit the errors.
We tried this, and the problem was that it took teachers more time to read all the essays, read the AI feedback and edit the errors than it would have taken them to write the feedback themselves in the first place.
There is an analogy here with self-driving cars. Many self-driving models are good at 90-95% of driving. But they struggle a lot with the final 5-10%, the so-called “edge cases”, where they can make unfathomable and dangerous errors. One suggested solution is to let the car drive itself 90-95% of the time and then, when it meets a problem it can't solve, to notify the human driver, who can take over the controls. The problem with this approach, as a number of people have noticed, is that human attention does not work in this way. If you are snoozing or texting or reading a book or staring vacantly out of the window and the car suddenly demands you take over and deal with a looming threat on the road, you will not be able to switch on quickly enough.
There is also a paradox here that LLMs that are 80% accurate might end up leading to better overall system performance than LLMs that are 98% accurate. That’s because at 80% accuracy, the human in the loop knows they have to stay alert as the errors are coming along fairly frequently. If the errors come along less frequently but are still present, they are more likely to have switched off. This is kind of the opposite of the Boy who Cried Wolf problem: with the boy who cried wolf you stop paying attention because there are too many false alarms; with this problem you stop paying attention because there are so few alarms.
In both these cases, a relatively simple and naïve form of “human in the loop” is not good enough. It’s not enough to just have a human overseeing or checking.
What we’ve tried instead, which has been more successful, is to rework the process entirely.
Teachers read and judge their students’ writing, as they would normally.
They leave an audio comment on each piece of writing.
The AI does two things: it transcribes the audio, and it combines the transcribed comments from different teachers into one final polished comment for the student and a feedback report for the teacher.
So here we have a human in the loop, but in a slightly more sophisticated way that leads to better outcomes. The AI is doing a lot of the legwork and human time and effort are reduced, while the problem of hallucinations is mitigated because the substance of every comment comes from a teacher.
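For the technically minded, here is a minimal sketch of what a pipeline like this could look like. The post does not say which provider or models sit behind our application, so the OpenAI SDK, the whisper-1 transcription model and the gpt-4o chat model below are assumptions for illustration, not a description of our actual stack.

```python
# Illustrative sketch only: the provider, models and prompt are assumptions
# for the purpose of the example, not a description of the production setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def transcribe(audio_path: str) -> str:
    """Transcribe a teacher's audio comment to text."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return result.text


def polish_comments(student_name: str, transcripts: list[str]) -> str:
    """Combine one or more teachers' transcribed comments into a single
    polished written comment addressed to the student."""
    joined = "\n\n".join(transcripts)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You combine teachers' spoken feedback into one clear, "
                    "encouraging written comment for the student. Use only "
                    "points that appear in the transcripts; do not invent "
                    "strengths or weaknesses."
                ),
            },
            {
                "role": "user",
                "content": f"Student: {student_name}\n\nTranscripts:\n{joined}",
            },
        ],
    )
    return response.choices[0].message.content
```

The important design choice is that the content of the feedback comes from the teacher; the LLM only transcribes and restructures it, which narrows the space in which it can hallucinate.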
Essay grading
Initially, we tried getting LLMs to grade / score essays directly. This was underwhelming. The overall correlations with human grades often looked relatively impressive – but when we dug into the data, we found that LLMs made weird mistakes and didn’t use the full range of available marks, which inflated their success rate.
So we moved instead to asking LLMs to comparatively judge essays. We know that humans are much better at Comparative Judgement (CJ) than they are at absolute judgement – and it turns out LLMs prefer CJ too! They do still make unusual errors that humans don’t make, but we can use human judgements and our statistical model to identify and eliminate such errors.
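To make this concrete, here is a minimal sketch of a single left-or-right judgement made by an LLM. The model name, prompt and temperature setting are illustrative assumptions rather than our production configuration.

```python
# Illustrative sketch: asking an LLM for one comparative judgement.
from openai import OpenAI

client = OpenAI()


def judge_pair(essay_left: str, essay_right: str) -> str:
    """Ask the model which of two essays is the better piece of writing.
    Returns 'LEFT' or 'RIGHT'."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are comparing two student essays. Decide which is "
                    "the better piece of writing overall and answer with a "
                    "single word: LEFT or RIGHT."
                ),
            },
            {
                "role": "user",
                "content": f"LEFT ESSAY:\n{essay_left}\n\nRIGHT ESSAY:\n{essay_right}",
            },
        ],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return "LEFT" if answer.startswith("LEFT") else "RIGHT"
```

Decisions like these then feed into the same statistical model as human judgements, which is how the unusual LLM errors can be identified and removed.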
You might think that LLMs would make CJ redundant, but we think they are a perfect match. First, LLMs themselves are trained using CJ – that is how they learn human preferences. Second, the simplicity of a CJ decision, left or right, makes it much easier for us to see what is going wrong with individual decisions and improve our prompts. Third, over the last decade or so we have built a powerful Rasch equating model on top of CJ, which allows us to place all schools and all sessions on the same scale. This lets us standardise scores across schools in a way that LLMs on their own never could.
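To illustrate the scaling step, here is a toy Bradley-Terry style fit, which belongs to the same family as the Rasch model used in CJ. The judgement data below is invented and the fit is deliberately simple; our actual equating model does far more work, including anchoring across schools and sessions.

```python
# Toy Bradley-Terry / Rasch-style fit from pairwise judgements.
# This only shows how left/right decisions become a measurement scale.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

# Each judgement is (winner_index, loser_index) over a small set of essays.
judgements = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (3, 2), (1, 2)]
n_essays = 4


def neg_log_likelihood(theta: np.ndarray) -> float:
    """Negative log-likelihood of the observed judgements, with a small
    penalty to pin down the arbitrary origin of the scale."""
    nll = 0.0
    for winner, loser in judgements:
        p_win = expit(theta[winner] - theta[loser])
        nll -= np.log(p_win)
    return nll + 0.01 * np.sum(theta ** 2)


result = minimize(neg_log_likelihood, x0=np.zeros(n_essays), method="BFGS")
abilities = result.x  # one 'writing quality' parameter per essay
print("Estimated abilities:", np.round(abilities, 2))
print("Ranking (best first):", np.argsort(-abilities))
```

In a full system, parameters like these are what get equated onto a common scale across schools and sessions.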
Given that CJ is the obvious choice for a marking model with LLMs, why has it taken us so long to build?
We have been on a very long journey here. In short: models, prompts, scalability, bias, injection attacks, reproducibility and generalisability. However, we are finally in a place where we believe, with a human in the loop, we can use LLMs to significantly reduce the time it takes for humans to assess student writing.
Where will we be in January 2026?
Let’s go back to our original prediction.
Within 6 months, a teacher will be able to log in to a website that will allow them to upload a set of typed or handwritten essays and which will instantly and automatically provide them with the following.
1) An accurate, moderated and nationally standardised grade / level / score for every student.
2) A personalised, useful and specific written comment about the strengths and weaknesses of each essay and how it could be improved.
3) A summary of the strengths and weaknesses of the set of essays and suggested next steps for the teacher.
We do not think we will be able to get these outputs “instantly and automatically”. However, we are working towards a goal where a class teacher might be able to get all of these outputs with just 15-30 Comparative Judgements and audio comments. This would be a dramatic time saving compared to current practice.
We are running a free trial for secondary schools in England who would like to try out this approach. You can read more about the project here and sign up for a short intro webinar here on Monday 10 February at 4pm.