Solving marking at scale!
The AI and assessment state of play, February 2026
In March last year, we presented a major breakthrough in our AI assessment model. We were able to use a blend of human and AI judgement to reliably and efficiently assess student writing.
Nearly a year on, where are we? What else have we learned and what’s next?
What we’ve done so far
Standardised writing assessment at scale: The model we developed in early 2025 has proven itself at scale. We’ve now used it to assess nearly half a million pieces of student writing from students aged 5 to 16. Most of these are from schools in England, some are from the US, and this month we’re running our first AI-enhanced assessment in Australia and New Zealand. We’ve also been able to validate the model in several different ways.
Assessing other subjects at scale using AI: We’ve run our first national history assessment, which posed different challenges from writing but still worked well.
Improving AI rubrics: The way you prompt the AI is probably not as important as everyone thinks, but we have nevertheless gained a lot of insight into what makes the best rubric to give the AI.
Bespoke AI tasks for individual schools: As well as our big nationally standardised assessments, schools can use all the AI judging and feedback features for their own assessments on their own timeline. These are not standardised, but in a big school you can use a mix of statistics and human judgements to set your own grade boundaries. There’s a case study here.
Better feedback: We have an audio feedback system that lets teachers record audio comments on every piece of writing, which are then transcribed and polished by the AI. We really like this system, but so far our schools seem to prefer the direct AI feedback, which is generated automatically. As well as a written comment, the AI can now also generate personalised quizzes.
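The grade-boundary idea mentioned above, using a school's own score distribution plus human checks, could be sketched roughly like this. Everything here is illustrative, not our production code: the function name, the equal-sized bands, and the grade labels are all assumptions for the sake of the example.

```python
# Hypothetical sketch: provisional school-level grade boundaries from a
# score distribution. A teacher would then review scripts near each
# cut-off and nudge the boundaries by human judgement.
from statistics import quantiles

def provisional_boundaries(scores, grades=("A", "B", "C", "D")):
    """Split a school's scores into equal-sized grade bands.

    Returns {grade: minimum score}, highest grade first.
    """
    # quantiles(n=k) returns the k-1 cut points between equal bands;
    # prepend the minimum so the lowest grade has a floor too.
    cuts = [min(scores)] + quantiles(scores, n=len(grades))
    # Highest grade takes the top band, so walk cut points in reverse.
    return dict(zip(grades, reversed(cuts)))
```

The point of the sketch is the division of labour: the statistics propose the cut-offs, and humans moderate the scripts that sit close to them.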
What we’re doing next
Improving handwriting recognition: We have written a lot about how uncannily accurate the AI judging is. We have not yet encountered a major human-AI disagreement where we think the AI made the wrong decision. Where issues remain is the step before the judging, where the AI transcribes the student’s handwriting. The AI sometimes “improves” the writing as it transcribes, imposing sense and meaning that are not in the original piece. This can then lead to the wrong judgement being made, but the source of the problem is the AI transcription, not the AI judging. We have a system to catch and correct these errors, and we are working on developing better open-source handwriting models.
Using AI to create rubrics, not just use them: Typically, we give the AI criteria and it uses them to judge. We are working on an RE project where we will get the AI to create criteria after it judges - e.g., to tell us what the typical features of the strongest and weakest writing are.
GCSE / multi-question assessment: So far, most of our big projects have involved assessing a single piece of writing. GCSEs pose a harder logistical challenge, because you have to combine several questions and several marks. The AI is still good at judging these; we just need simpler ways to pull together the marks from all the questions and apply a grade.
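The multi-question logistics described above are conceptually simple, which is worth making concrete. A minimal sketch, assuming per-question marks and grade boundaries are already known (the weighting scheme and grade labels are our illustration, not any exam board's):

```python
# Hypothetical sketch: combine per-question marks into a total,
# then map the total onto grade boundaries.

def total_mark(marks, weights=None):
    """Sum per-question marks, with optional per-question weights."""
    weights = weights or {}
    return sum(mark * weights.get(q, 1) for q, mark in marks.items())

def apply_grade(total, boundaries):
    """Map a total onto {grade: minimum total}, highest grade first."""
    for grade, minimum in boundaries.items():
        if total >= minimum:
            return grade
    return "U"  # unclassified

# Example: three questions, GCSE-style numeric grades (illustrative values)
total = total_mark({"q1": 12, "q2": 18, "q3": 25})
grade = apply_grade(total, {"9": 70, "8": 60, "7": 50})
```

The hard part in practice is not this arithmetic but the plumbing: collecting every mark for every question for every student reliably, which is the workflow challenge the paragraph above refers to.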
Is AI going to change the world?
As we have been working hard on all of the above, a wider debate has been going on about the extent to which AI is going to change, or destroy, the wider economy.
We set up this Substack to detail our journey through AI, and if you go back to 2023-24 you can see that we were much more sceptical than we are now.
For us, the biggest difference has come not from the improvement in the quality of cutting-edge models, but from the dramatic reduction in the cost of standard models. This has allowed us to “over-assess” - to send writing off to be judged multiple times, which helps us weed out inconsistent and biased judgements.
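To make "over-assessing" concrete, here is one way the idea could look in code. This is a generic robust-aggregation sketch under our own assumptions (the median rule, the tolerance, the function name), not a description of the production system:

```python
# Hypothetical sketch of "over-assessing": judge each script several
# times with a cheap model, then combine the passes while discarding
# judgements that disagree strongly with the rest.
from statistics import median

def robust_score(judgements, tolerance=10):
    """Combine repeated AI judgements of one script into one score.

    judgements: numeric scores from independent judging passes.
    Scores further than `tolerance` from the median are treated as
    inconsistent and dropped before averaging the rest.
    """
    m = median(judgements)
    kept = [j for j in judgements if abs(j - m) <= tolerance]
    if not kept:  # pathological case: nothing near the median
        return m
    return sum(kept) / len(kept)
```

The design point is that cheap models make the repeated passes affordable: an occasional inconsistent or biased judgement is outvoted rather than having to be perfect first time.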
A lot of our theoretical concerns about LLMs still exist - the hallucinations, the probabilistic decision-making, the challenges with getting them to work reliably at scale. But we have found ways around most of these problems, such that in practice, our model is very useful!
Even now, it is easy for us to get bogged down in the details of the fractions of a percent that aren’t working right - and that is the right thing for us to do, because a fraction of a percent at scale is still a big number.
But it is also important to sometimes step back and take a look at where we are. And when I do that, I keep coming back to the same thought: if this system had arrived on my desk halfway through my teacher training year in 2007-08, I would have thought it was unbelievably brilliant and it would have dramatically changed my life - and my students’ - for the better.
Make your own mind up
Our next intro webinar is in April. These webinars are very popular - we show you how the system works and at the end we give 30 free credits to all attendees, so you can try it yourself on a class set of essays in any subject.
If you work in a school, you can also book a 30-minute call with me here, and I can get you set up on our system with 30 free credits.

