AI assessments for KS3 English Literature: a case study
Making English essay marking quicker than maths – and maybe just as reliable!
Recently, I spoke with Phill Chater from Landau Forte Academy about how he is using our AI-enhanced Comparative Judgement system to assess his school’s Key Stage 3 English Literature essays.
Here’s a summary of his approach, organised by the three features we use to evaluate all our assessments: reliability, efficiency and validity.
Reliability
Phill set up his English Literature assessments as custom tasks that are bespoke to his school, which means they are not nationally standardised. Our AI-enhanced Comparative Judgement system gives you a scaled score, and then you can apply the grade boundaries on top. To gather evidence about reliability, he got the AI to judge each year group twice, to see whether it came up with the same results each time.
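To make the “grade boundaries on top of a scaled score” step concrete, here is a minimal sketch in Python. The grade labels, boundary values and scores are invented for illustration – they are not the system’s actual output format or Phill’s boundaries.
```python
# Minimal illustrative sketch: layering grade boundaries on top of scaled scores.
# All labels, boundaries and scores below are invented, not real system output.

def apply_grade_boundaries(scaled_score, boundaries):
    """Return the highest grade whose boundary the scaled score meets."""
    for grade, minimum in sorted(boundaries.items(), key=lambda kv: kv[1], reverse=True):
        if scaled_score >= minimum:
            return grade
    return "Below boundaries"

boundaries = {"Greater depth": 70.0, "Expected standard": 50.0, "Working towards": 30.0}
scores = {"Student A": 74.2, "Student B": 55.1, "Student C": 41.8}

for student, score in scores.items():
    print(student, "->", apply_grade_boundaries(score, boundaries))
```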
Judging each year group twice is a very sensible check to run. If the AI came back with a completely different rank order of students each time, you would have very little faith in its outputs. Interestingly, you can run similar checks on human marking, and the results are often quite underwhelming. Ofqual’s marking reliability studies in 2017 found that English Literature had the worst marking reliability of any subject, with candidates likely to receive their true grade only 52% of the time, compared with Maths, where the probability is 96%.1 If we built an AI marking model with such poor consistency, we would not let it see the light of day!
However, when you are assessing essays, you also probably don’t want there to be perfect consistency between one iteration of marking and another. This would suggest that your markers – whether AI or human – are too deterministic, and are judging on surface features that make the assessment easy to game. For example, some of the very earliest AI marking models delivered incredibly good agreement between iterations, but on closer investigation this was because they were largely judging on surface features like length.
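To make that failure mode concrete, here is a deliberately crude sketch: a “marker” that only counts words. It agrees with itself perfectly from one run to the next, yet it would rank a padded, superficial essay above a concise, insightful one. The essays are invented placeholders.
```python
# Illustrative sketch of the surface-feature failure mode described above.
# A word-count "marker" is perfectly consistent between runs, but trivial to game.

essays = {
    "concise, insightful essay": "Dickens frames Scrooge's isolation through images of cold and closed doors.",
    "padded, superficial essay": "This essay is about the book and the characters in it. " * 30,
}

def length_only_score(text):
    # Purely deterministic: same input, same score, every single run.
    return len(text.split())

run_1 = {name: length_only_score(text) for name, text in essays.items()}
run_2 = {name: length_only_score(text) for name, text in essays.items()}

print(run_1 == run_2)             # True: perfect agreement between iterations
print(max(run_1, key=run_1.get))  # ...but the padded essay comes out on top
```
Perfect agreement, wrong answer: consistency alone tells you nothing about whether the right things are being rewarded.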
One of the advantages that LLMs have over these older models – and why it is worth persisting with them despite their other flaws – is that they are not making completely deterministic judgements. A major focus of our research has been building an LLM-powered model that gets this balance right, and validating its outputs. You can read the extensive work we’ve done on this here, here, here and here.
We were extremely encouraged by the results of Phill’s assessment: the correlations between iterations of the AI judging ranged from 0.91 to 0.96, which feels about right. Too low would suggest issues with transcriptions, hallucinations, order bias and the world of other woes we see with LLMs; too high and we’ve built a model that is likely over-deterministic and consistently wrong! There is always further work you can do on validation, and we will update this Substack with more data when we have it.
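If you want to run the same sanity check on your own assessment, the calculation itself is simple once you have exported the scaled scores from two judging runs. A minimal sketch with invented scores:
```python
# Minimal sketch of the consistency check: correlate scaled scores from two
# independent AI judging runs. The numbers below are invented placeholders;
# in practice you would export the real scores from the assessment platform.
from statistics import correlation  # Pearson's r, Python 3.10+

run_1 = [61.2, 48.7, 72.5, 55.0, 39.8, 66.1, 58.4]
run_2 = [59.8, 50.1, 73.0, 53.4, 41.2, 64.9, 60.0]

r = correlation(run_1, run_2)
print(f"Inter-run correlation: {r:.2f}")  # Phill's cohorts came back at 0.91-0.96
```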
The other reassuring aspect of this assessment is that it was on literature. Most of our national AI assessments so far have been assessments of writing. AI can be quite brittle, so there is no guarantee that if it works for writing it will work for literature. Assessments that require specific content knowledge pose an extra challenge, so it was good for us to learn that the scores were sensible and in line with expectations. Phill created a fairly holistic mark scheme to guide the AI, broadly in line with the advice we give here about not being too prescriptive.
Efficiency
This is obviously one of the major benefits of adding AI judges. It can whiz through assessments very quickly, and it did so in this case. Phill made a particularly telling point about the speed of assessment. Previously, English teachers would typically be marking right up to the deadline for big assessments, while the maths department would often finish earlier in the window. This time – for the first time Phill could recall – the English teachers finished before the maths teachers, which in my experience is unheard of.
But obviously, every solution contains within it the seeds of a new problem. One of the fears people have about English teachers spending less time marking is that they will not understand their students as well. But Phill’s comparison with maths assessment is instructive. Maths teachers spend less time marking, yet they understand their students just as well; the nature of maths marking is such that it delivers equivalent understanding in less time. It should be possible to imagine English working the same way: teachers spend less time on marking than they currently do, but come away with an equivalent understanding of their students.
For me, there is definitely value in teachers reading students’ writing, but there is less value in the time spent painstakingly writing out comments by hand. Our feedback systems are designed to maintain the high-value thought processes and eliminate or reduce the lower-value ones. However, we are also aware that teachers and students will use our feedback in different ways, and we want to learn more about what is most effective. Which brings us to the final section…
Validity & Feedback
Efficiency can enable better feedback in two ways. First, Phill said the faster turnaround time enabled them to dedicate an entire lesson to feedback. Second, the quicker you get the feedback, the more relevant it is.
The first part of the feedback lesson had students working with a model essay that had been selected previously, pre-AI, so this portion wasn’t dependent on the AI at all.
In the second part of the lesson, teachers gave students the direct AI feedback and asked them to summarise their areas for improvement based on the AI and the model essay.
Phill felt the feedback was “uncannily accurate”, but he did identify a couple of areas where we could improve, and we have some ideas about how to address them. The direct AI feedback is currently a bit too verbose and perhaps a little too harsh. We are working on making it nicer!
The other challenge is making the feedback actionable. This is tricky because the more specific you get, the greater the risk of AI hallucinations and errors. We’re currently developing an approach that Phill hasn’t been able to use yet, but which we hope to roll out to everyone soon: getting the AI to create personalised multiple-choice questions for each student.
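To give a flavour of the idea – and this is only a rough sketch, not the approach Phill will actually receive – here is how you might prompt an LLM to turn a piece of feedback into multiple-choice questions. It assumes an OpenAI-style chat API; the model name, prompt wording and feedback text are all placeholders.
```python
# Rough sketch only: turning existing feedback into personalised MCQs.
# Assumes an OpenAI-style chat API; model name, prompts and the feedback
# string are placeholders, not a production implementation.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

student_feedback = (
    "Analysis of Priestley's stagecraft is strong, but quotations are often "
    "left unexplained and topic sentences drift away from the question."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You write short multiple-choice questions for KS3 English "
                "Literature students. Base every question only on the feedback "
                "provided; do not invent quotations, scenes or plot details."
            ),
        },
        {
            "role": "user",
            "content": (
                "Write three multiple-choice questions, each with four options "
                "and the correct answer marked, targeting this student's areas "
                "for improvement:\n\n" + student_feedback
            ),
        },
    ],
)

print(response.choices[0].message.content)
```
Constraining the model to feedback it has already produced is one way of keeping the questions specific without opening the door to hallucinated quotations.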
Conclusion
This is all new territory. Phill is a pioneer who is applying new technology to existing practice in exciting ways, and both his practice and ours will continue to evolve. But there’s enough here to suggest that significant reductions in workload and improvements in feedback are possible.
If you’d like to try out an AI-enhanced Comparative Judgement assessment, join our webinar on Wednesday 25th February where we will give all attendees 30 free AI assessment credits.
If you have an idea for a case study, let us know here.
1. The true grade is a theoretical concept: an estimate of what a candidate would achieve if they took an assessment an infinite number of times. Of course, it doesn’t tell you whether the assessment or the marking is aligned with the curriculum or mark scheme, which is why we tend to favour reporting broader measures of validity, such as correlations between assessments over time. Nonetheless, without high reliability there can be no validity.

