At the end of January we launched our new AI assessment project, CJ Lightning.
We now have results. Our headline finding is that AI is very good at judging student writing and is a viable and time-saving alternative for many forms of school assessment.
Here are the details.
The task
This project assessed the writing of 5,251 Year 7 students from 44 secondary schools. The students wrote a non-fiction response to a short text prompt about improving the environment. Teachers uploaded their students’ writing to our website, and then used Comparative Judgement to assess it. With Comparative Judgement, you are presented with two pieces of writing and have to decide which is the better piece. You can read more about the non-AI aspects of the task here.
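For readers who want a concrete picture of the mechanics, here is a minimal sketch of what a single Comparative Judgement decision looks like as data, and how a language model might be asked to make one. The prompt wording, the `llm_call` hook and the field names are illustrative assumptions, not the actual system we run.

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    """One Comparative Judgement decision: which of two scripts is better."""
    left_id: str      # anonymised ID of the first piece of writing
    right_id: str     # anonymised ID of the second piece of writing
    winner_id: str    # ID of the piece judged to be the better writing
    judge: str        # "human" or "ai"

def ask_ai_to_judge(left_text: str, right_text: str, llm_call) -> str:
    """Ask a language model to make a comparative (not absolute) judgement.

    `llm_call` is any function that takes a prompt string and returns a
    string; the prompt below is purely illustrative.
    """
    prompt = (
        "You will see two pieces of student writing, A and B.\n"
        "Decide which is the better piece of writing overall.\n"
        "Answer with a single letter, A or B.\n\n"
        f"Piece A:\n{left_text}\n\nPiece B:\n{right_text}\n"
    )
    answer = llm_call(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"
```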
We have been running similar tasks since 2017 for students at primary and secondary, and have assessed nearly 3 million pieces of writing using human Comparative Judgement. The process typically delivers very high levels of inter-rater reliability, and is the gold standard of human judgement.
The crucial difference with this assessment is that we did not just have human teachers making the judgements. We got an AI to make judgements too.
Involving humans alongside the AI allowed us to compare the judgements made by the humans and the AI to see if they agreed.
Does the AI agree with the humans?
In total, our human judges made 3,640 decisions, and our AI agreed with 81% of them. On our most recent previous human-judged Year 7 assessment, the human judges agreed with each other 87% of the time, which is fairly typical.
What type of disagreements are there?
However, the headline level of agreement is not conclusive on its own. The type of error matters. Overall agreement can look good, but if the remaining 20% or so of disagreements are full of absolute howlers, that is still a huge problem.
So how big were the disagreements? The figure below shows how often the humans and the AI agreed and disagreed, broken down by the scaled score difference between the two pieces of writing being compared. Reassuringly, disagreements peak where the scaled score difference is small.
We feel that the above graph is the most interesting and significant finding of this project - more significant than the headline levels of agreement.
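For anyone who wants to produce a similar figure from their own judging data, the recipe is simple: take each pair a human judged, look up the scaled-score gap between the two scripts, and count how often the AI's decision on that pair differed. A rough sketch, with illustrative data structures rather than our actual export format:

```python
from collections import Counter

def disagreement_profile(human_decisions, ai_winner, scaled_score, bin_width=5):
    """Count AI-human disagreements by the scaled-score gap of each pair.

    human_decisions: list of (left_id, right_id, human_winner_id)
    ai_winner:       dict mapping (left_id, right_id) -> the AI's winner_id
    scaled_score:    dict mapping script_id -> scaled score
    """
    disagreements = Counter()
    for left, right, human_choice in human_decisions:
        gap = abs(scaled_score[left] - scaled_score[right])
        bucket = int(gap // bin_width) * bin_width
        if ai_winner[(left, right)] != human_choice:
            disagreements[bucket] += 1
    return dict(sorted(disagreements.items()))
```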
AI essay markers have been around since the 1960s, and they have been able to generate impressively high surface correlations with humans for just as long. (Daisy wrote a whole chapter about them in her 2020 book Teachers vs Tech.)
But their major weakness - and a major weakness of AI applications more generally - is that they could be “right 80% of the time, and wrong when it matters most”. It's possible for an AI system to get an impressive-looking 80% agreement with humans, but for it still to be making its decisions using very different values than humans, which then leads to a) big howlers in the 20% of disagreements and b) easy ways to game the system.
The famous example of this was a paper showing you could cheat an AI essay marker by writing the same paragraph 37 times. The cheating was rewarded because the AI essay marker was generating those impressive correlations by largely judging on length.
Superficial features are not being rewarded here. The AI is not making howlers and is, on the contrary, highlighting human error.
We have scrutinised a sample of the biggest disagreements in detail, and talked to some of the teachers who made the decisions. They are not cases where the AI is wrong and the human is right. In fact, some of the biggest disagreements involved teachers being biased by handwriting, and accepting on review that the AI was probably right and they were wrong. Here is one such example.
Other examples involved teachers making a manual error and clicking the wrong button.
Our qualitative review of the ten biggest human-AI disagreements is that they were all some form of human error, not AI error.
Concurrent validity
Ideally, to produce a validity measure, we would like to predict something important such as future GCSE results. However, at this point we have to make do with parallel measures of achievement. In our case, 2,297 pupils had already taken part in a similar assessment in September last year. The correlation of scores between the two sessions is 0.65; last year we saw a correlation of 0.58 between the September and May test sessions. The high correlation reassures us that the AI is not judging on some strange dimension of writing ability, but is measuring a dimension similar to the one we value.
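For anyone reproducing this kind of check on their own data, the concurrent-validity figure is just a Pearson correlation between each pupil's scaled scores in the two sessions. A short sketch, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical exports: one row per pupil with a scaled score from each session.
september = pd.read_csv("september_scores.csv")   # columns: pupil_id, score
january = pd.read_csv("january_scores.csv")       # columns: pupil_id, score

# Keep only pupils who took part in both sessions (2,297 in this project).
both = september.merge(january, on="pupil_id", suffixes=("_sep", "_jan"))

# Pearson correlation between the two sets of scaled scores.
r = both["score_sep"].corr(both["score_jan"])
print(f"Concurrent validity (Pearson r): {r:.2f}")
```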
Maintenance of standards
One of the huge advantages of using Comparative Judgement is that we are able to directly assess work against other pieces of graded work in order to align standards. We have worked with Ofqual on this process to develop its use in informing the maintenance of standards in national qualifications, and published our work in academic journals.
Reassuringly, the AI gave us very stable results: the cohort mean shifted by just 0.03 standard deviations from the September assessment. With a cohort of over 2,000 students, any large swing in the measured standard would have been a cause for concern.
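The stability check itself is simple: the change in the cohort mean between sessions, expressed in units of the earlier session's standard deviation. A minimal sketch:

```python
import statistics

def standardised_shift(scores_before, scores_after):
    """Change in the cohort mean, in units of the earlier cohort's SD.

    A value near zero suggests the standard has been maintained between
    the two assessment sessions.
    """
    mean_before = statistics.fmean(scores_before)
    mean_after = statistics.fmean(scores_after)
    sd_before = statistics.stdev(scores_before)
    return (mean_after - mean_before) / sd_before

# Example usage with two lists of scaled scores (hypothetical variable names):
# shift = standardised_shift(september_scores, january_scores)  # e.g. 0.03
```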
A powerful statistical model with psychometric validity
Anyone can upload a set of essays to a chatbot and ask the chatbot to give the essays a mark or a grade. The problem you have then is - how do you know those marks or grades are right? How do you identify the ones that might be wrong without going through and remarking everything yourself?
Our approach to AI assessment is very different to the “ask an AI for a mark” approach, and offers far more assurance that you are getting the right grade. Here are the key differences.
First, we are getting the AI to make Comparative Judgements, not absolute ones - because the AI, like humans, is better at Comparative Judgement than absolute judgement.
Second, we have been working with schools to ensure that the prompts we use reward the aspects of writing that teachers value, and that we have established as critical over the last 10 years of assessing writing.
Third, we validate the AI judgements against our human judgements and can easily flag up and review the small percentage of big disagreements.
Fourth, we integrate the AI judgements into our powerful Comparative Judgement statistical model which allows us to minimise hallucinations, minimise bias, and provide meaningful scores and grades that are linked to live national standards.
Our AI-enabled CJ approach has a high level of psychometric validity which has been shown to maintain standards over time.
What we love about using CJ with AI is that we can meaningfully involve teachers in the process of producing marks. We can combine human and AI decisions in various ratios, depending on how much we trust the AI! We can then go further and improve the model by informing the AI of where we do actually disagree. As with driverless cars, we still want a human behind the steering wheel. By applying our powerful statistical models that aggregate decisions, we can ensure that the results we produce are reliable and valid.
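To illustrate how human and AI decisions can be pooled, and weighted according to how much we trust the AI, here is a toy Bradley-Terry style fit. It is a deliberately simplified gradient-ascent sketch, not our production Comparative Judgement model:

```python
import math
from collections import defaultdict

def fit_bradley_terry(judgements, ai_weight=1.0, lr=0.05, epochs=200):
    """Fit a simple Bradley-Terry model to mixed human/AI judgements.

    judgements: list of (winner_id, loser_id, judge), judge is "human" or "ai".
    ai_weight:  relative trust placed in AI decisions (1.0 = equal to human).
    Returns a dict of script_id -> estimated writing ability (logit scale).
    """
    theta = defaultdict(float)
    for _ in range(epochs):
        for winner, loser, judge in judgements:
            w = ai_weight if judge == "ai" else 1.0
            # Probability the model currently assigns to the observed outcome.
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            # Gradient step towards making the observed winner more likely.
            theta[winner] += lr * w * (1.0 - p_win)
            theta[loser] -= lr * w * (1.0 - p_win)
    return dict(theta)
```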
How much time does this take? Can you do it with no human judging?
Yes, we think that you could run a 100% AI-judged assessment with no human judging. However, we would not recommend doing this routinely. You would always want to run some human-AI hybrids to a) keep validating the AI model and b) make sure that teachers are engaging with student writing.
If you run a hybrid assessment, you can blend the human and AI judgements in different ways. For this assessment, we recommended a split of 10% human judgements and 90% AI judgements. Given our model of assessment, that meant every student response was seen on average 18 times by an AI and twice by a human teacher.
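The arithmetic behind those numbers is straightforward: each judgement involves two responses, so the average number of views per response is twice the number of judgements divided by the number of responses. A back-of-the-envelope sketch, assuming roughly 20 total views per response:

```python
# Illustrative arithmetic for the recommended 10% human / 90% AI split.
views_per_response = 20          # assumed total views of each piece of writing
human_share = 0.10

human_views = views_per_response * human_share       # 2.0 views by a human
ai_views = views_per_response * (1 - human_share)    # 18.0 views by an AI

# Each judgement compares two responses, so for N responses:
n_responses = 5_251
total_judgements = n_responses * views_per_response / 2
print(human_views, ai_views, total_judgements)
```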
At one participating school with 269 Year 7 students, the head of department chose to make all the human judgements herself. She completed her quota in 1 hour 12 minutes. That was enough to validate all the AI decisions and provide robust and meaningful scores for every student. In other schools, the judging was shared out amongst many teachers, resulting in 5-10 minutes of judging per teacher.
What about feedback?
As well as the AI judging, this project provided a mix of direct AI feedback and AI-enhanced human feedback.
So are you now on the AI hype train?
We’ve written before about our rollercoaster journey with new AI technologies. We still think they have flaws and are prone to hallucinations, but we think the process we’ve developed here has the potential to revolutionise assessment and decimate workload (quite literally decimate if you follow our recommended 10% human judging approach!).
What next?
We have free projects in the summer term for any UK primaries or secondaries who would like to trial this approach. The primary project is for Year 6 and the secondary project is for Year 9. Our training webinars provide more information and are listed here.
We will have a more comprehensive plan available in academic year 2025-26.