No More Marking

Motorways with no speed limits and car-free town centres

Daisy Christodoulou — Sun, 24 May 2026 08:41:00 GMT

Germany has motorways with no speed limits. It also has medieval town centres where cars are banned.

So is Germany a pro-car or an anti-car society? Do people love cars or hate them? Is Germany’s attitude to cars libertarian or authoritarian?

Obviously, the context matters. In some contexts, cars are the solution to a problem. In other cases, cars are the problem.

In my 2020 book Teachers vs Tech, I made a similar argument about education technology. I thought that ed tech was indispensable in some places, but a terrible idea in others.

Since then, we’ve had an explosion of interest in how AI can help students learn. You have boosters arguing that kids can learn anything they need to on screen, but you also have sceptics saying that AI taints everything it touches. I disagree with both these positions.

In this post, I will update my 2020 argument for why we need both more and less technology, and also put forward three approaches that embody my car-free town centre / limit-free motorway concept.[1]

Thanks for reading No More Marking! This post is public so feel free to share it.

The limit-free motorway: why we should persist with ed tech

For me, there is one fundamental reason why we should persist with ed tech: scale.

Imagine the best possible human-led classroom lesson, the one where every student is engaged and enthused and learning at every point. Whatever your vision looks like, one thing is true: it will be incredibly hard to scale up. There are massive variations in quality between classrooms, and not just between schools: OECD data shows that variations within schools are greater than variations between them. This suggests that more learning can happen in the best classroom of a struggling school than in the weakest classroom of a high-performing one.

On its own, money does not solve this problem. Expensive private schools have not solved the variation problem. If you reduce class sizes you often end up exacerbating the problem, because you then have to recruit and train thousands more teachers who are unlikely to immediately (or, perhaps, ever) be as good as the existing teachers in the system.

It is this problem which is, for me, the biggest and most legitimate justification for education technology. Technology can be a great equaliser. In its ability to deliver consistent quality at scale, it is perhaps the most democratic force in existence. The printing press made books available for the masses. Recorded music allows everyone to listen to world-class performances. The internet makes information freely accessible. These technologies have already helped democratise access to education, both inside and outside of traditional educational institutions.

But given that the variation problem remains, these technologies have not done enough. For many, it can seem as though the next step is a fully personalised screen-based education that cuts out the human teacher. Instead of learning in a traditional classroom with human teachers and paper-based materials, perhaps in the future students will learn using a combination of personalised devices, AI chatbots, eye-tracking technology and adaptive algorithms.

This is where I depart from the pro-technology line. I think we have to find ways of using technology to get scale, but for the majority of school education I don’t think that will involve kids sitting at screens.

The car-free town centre: learning is social & embodied

Right now, education still does mostly take place in human-scale, human-led physical environments. No-one has yet found a way of successfully educating students in large numbers with a human-light and tech-heavy alternative that involves students learning from screens most of the time.

We also recently had the natural experiment of Covid, the type of external shock which often does accelerate systemic technological change. And in many areas, that is exactly what happened: Covid massively accelerated remote working, telemedicine, online shopping and cashless payments. Six years on, these trends are all permanently higher than they were before Covid. But in education, this did not happen. Yes, there were plenty of attempts to learn online and plenty of interesting innovations. But most parents were desperate to get kids back in school and most of the data (ours included) showed that kids did not learn much at home. So for all of the pro-tech arguments about scale, what Covid showed is that even the inconsistent in-person school is still better than learning remotely on screen.

Why is this? In 2020, I argued that there were subtle ways in which learning was social and physically embodied, and that these mechanisms were particularly important for younger students. Post-Covid, I feel even more confident that this is the case.

What do I mean by “learning is social” and “learning is embodied”?

For very young children (ie toddlers and infants), there is a lot of evidence that learning relies on eye contact, shared attention, imitation, emotional connection, and interactive back-and-forth conversation. Researchers have been trying for decades to get young children to develop first or second language skills by video, and they have found there is a “video deficit” which makes it very hard. Children pay more attention to words spoken by a physical human being in their presence than to the same human saying the same words on a screen. They copy human adults. They care about what their teachers and their peers think is important, and are more likely to pay attention to that. Even when they are watching a video or listening to a recording on their own, often their motivation or interest derives from knowing that other humans are interested in it too.

These mechanisms don’t disappear for older students. “Social contingency” is the term used in the research literature to describe interactions that are dynamic, where your response or behaviour changes the other person’s. Research on instructional design suggests that this kind of responsive back-and-forth interaction remains important for learning well beyond early childhood. One of the reasons direct instruction programmes are so successful is that they are highly socially contingent, with multiple opportunities for the student response to change the teacher’s behaviour. The best screen-based programmes attempt to recreate this dynamism too, but doing so reliably and at scale is difficult. Intelligent tutoring systems tend to use banks of pre-written statements and questions, which aren’t as dynamic as humans. Even the tiny delays on video calls can subtly disrupt turn-taking.

In terms of learning being physically embodied, I’ve frequently summarised the research showing the difference between learning on screen and learning on paper. We skim and scan more when we read on screen, we think differently when we take notes on laptops, and there are powerful “mode effects” that lead to students doing worse when they take tests on-screen than on paper.

Note here that the anti-tech aspect of my argument is different from a lot of other anti-tech arguments. For example, I hear a lot of tech sceptics say that learning on screen may well be efficient, but efficiency isn’t everything. That is not my argument! I care a lot about efficiency! I think efficient learning is exceptionally important and too often it gets downgraded and neglected. When I hear someone say “efficiency isn’t everything”, I get nervous because I think that is too often used to justify some quite vague and nebulous classroom activities that add very little social or academic value.

My argument is that screens are currently not an efficient way of learning for under-11s, and that they may never be, in the same way that cars are not an efficient way of getting around a medieval town centre. I think the most efficient way of educating children under 11 involves whole-class explicit instruction, paper-based materials and teacher read-alouds.

Isn’t this all quite waffly?

How important are these effects? Are they really crucial, or are they just nice to have? I am sure there were travel agents in the 1990s who said “no one will want to book a holiday on a screen! They will want to come into the social and embodied space of the warm and inviting travel agent on the high street and book it with a real human being instead!” Now maybe there was some truth to that. Probably a lot of customers would have even said something like that if you’d asked them. But ultimately, whilst they might have valued the high street travel agent, they didn’t value it that much. They preferred getting 50 quid off their next package holiday more than they preferred booking it in a shop with a human agent.

Am I at risk of making the same mistake? I am talking up the social and the embodied aspects of learning, and there is a lot of evidence that they are real. But technology has benefits that humans can’t provide, and it is always possible that it is worth losing the social and embodied in return for the benefits of scale and consistency.

Let’s think this through using two different scenarios.

Let’s imagine you are aged 25 and you need to take a course on a specific type of database in order to get a promotion at work. Option A is to do the course in person. The next course starts in six months, costs £5000 and involves a 2-hour round commute. Option B is to do the course online. The next course starts when you want it to, and costs £500. Given this choice, I would pick Option B – the online screen-based no-human option.

Here is another scenario. You have a child aged 6 who is learning to read. Option A is to attend the median primary school in England which will teach them to read in a class of about 25 using one of the standard phonics programmes in England – paper-based, led by a human teacher who has been trained in the use of that particular programme. Option B is to attend a school where the student learns to read by sitting at a screen with headphones on following an app. A human teacher is around to supervise and motivate, but they do not know anything about what the student is learning. Given this choice, I would pick Option A – the in-person paper-based no-tech option.

I think the major factor here is the age of the student. The physical and social aspects of learning matter a lot more for younger students than for adults. With younger students, it is not worth trading off the physical and social. As students get older, the trade-offs start to make more sense.

There is also the extra wrinkle that LLMs have made cheating far easier for everyone, which makes a lot of unsupervised screen-based tasks untenable even for older students. There is a lot of interest at the moment in “online proctoring” where you use screen monitoring software and cameras to prevent a student from cheating. However, in many contexts I think we might conclude that human-supervised paper-based tasks provide an easier and more effective solution.

Ultimately, I am making an empirical argument. If it turns out I am wrong, then, like the travel agents of the 1990s, we will soon see it show up in practice. Schools will start closing as parents choose to educate their children at home on screens, or schools will restructure themselves around individual computer pods.

There are a lot of experiments like this happening at the moment. Most famously, Alpha School in the US have gained a lot of attention over the last couple of months for their particular approach to technology and education. I haven’t visited an Alpha School, although I have used one of the apps they’ve developed – Math Academy, which I thought was very good. I remain sceptical that students can learn all the academic content they need from apps like this, however.

So what do I suggest instead?

Here are three specific approaches that I think embody my “limit-free motorway and car-free town centre” concept.

Paper-based AI assessment. Our approach at No More Marking embodies these principles. We use AI-enhanced comparative judgement to assess student writing, but our assessments are paper-based, so students never need to see a screen. Handwriting recognition technology has improved dramatically in the past couple of years, which is a great example of how new technological advances can improve the efficiency of older media.
Professional development at scale. If it’s easier for adults to learn on-screen than students, maybe the solution is to create high-quality online professional development resources for teachers. Over the last decade or so, England has had a lot of success scaling up quality instruction of phonics. There are several government-approved programmes that consist of detailed student resources and teacher training that explains how the resources are used. I think we could copy this approach for other concepts and subjects, but use technology to make the training easier for all teachers to access.
LLM tutors. LLMs are making screen-based instruction harder because, as noted above, they make cheating easier. However, I think LLM tutors do offer some promise because they are more responsive than traditional learning apps. In older year groups where a little bit of screen time is more acceptable, I think there is a role for them. The best model I have come across is Google and Eedi’s collaboration. A human teacher teaches a maths concept to a class, and then each student uses an LLM tutor for individual practice of the concept.

The basic challenge here is common to a lot of other fields: how do you maintain the human essence of a particular service but also scale it up so that it is available for millions of people? This is not easy, and it requires careful thought and design. But it is also unavoidable. Children can only learn with substantial human input, but there are eight million children in England alone, and we have to find ways of providing high-quality instruction to all of them.

We need to be more sceptical about screens in the classroom, at the same time as being more ambitious about the power of technology to raise standards.

[1] I am not pretending to be an expert on German car policies, and given that, this metaphor could backfire. Perhaps there are huge controversies about these laws, and perhaps even now the Bundestag are debating proposals to change them. However, I think the central point stands, regardless of what happens in Germany. In my own context of London, it is possible to be in favour of Covent Garden being pedestrianised, and also in favour of more rapid construction of the Lower Thames Crossing.

The black box of AI assessment

Daisy Christodoulou — Sat, 25 Apr 2026 07:45:47 GMT

We have added a lot of new subscribers in the past few weeks - welcome! Our Substack has a mix of big-picture articles about education, assessment and technology - eg, Are we living in a stupidogenic society?; Why education can never be fun - and detailed research from our AI-enhanced Comparative Judgement assessment projects - eg, What is Comparative Judgement and why does it work? and So, can AI assess writing? This is one of the latter, co-written by our Director of Education, Daisy Christodoulou, and our CEO Dr Chris Wheadon. We have future articles coming up on the value of Classics, insights from our new Year 6 redraft project, and whether you can assess creativity or not. Please share with anyone who you think might be interested!

In the world of assessment, we lean on two pillars: reliability and validity.

Reliability is essentially a measure of precision. It asks: If we administer this test multiple times, will we get the same result?
Validity, however, is a much broader, more elusive concept. It asks: Is this result actually meaningful? Does the assessment measure what we intend it to measure?

Imagine a simple spelling test of ten words. It’s likely this kind of test would have high levels of marker reliability, in that different markers would all agree on the score a student should get. However, such a test would not provide you with valid insights about the totality of a student’s writing ability, because it is measuring only one small aspect of writing.

As we hit the one-year mark of introducing AI judges into our assessment processes, it is worth reflecting on the evidence we’ve collected regarding the reliability and the validity of our AI judges.

The reproducible judge

Reliability is relatively easy to quantify: we can give the same set of assessments to different sets of AI judges and measure their agreement. So far, the data is compelling.

Higher agreement: AI judges tend to agree with each other more than human judges do.
Reduced noise: We’ve seen distributions narrow and correlations between assessments increase over time.
Rasch Separation Ratios: Typically, our AI-driven assessments yield higher separation ratios (a statistical measure of internal reliability) than those involving human markers.
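For readers who want the arithmetic behind a separation ratio, here is a minimal Python sketch using the standard Rasch definitions: the separation ratio G is the "true" spread of scores divided by the average measurement error, and the matching reliability coefficient is G² / (1 + G²). The function name and the example numbers are invented for illustration, and this is a sketch of the general formula rather than a description of the No More Marking software.

```python
from math import sqrt

def separation_and_reliability(observed_sd, rmse):
    """Return (separation ratio G, reliability R) for a set of measures.

    observed_sd: standard deviation of the estimated scores
    rmse:        root mean square of the scores' standard errors
    """
    # "True" variance is what remains once error variance is removed.
    true_variance = observed_sd ** 2 - rmse ** 2
    if rmse <= 0 or true_variance <= 0:
        raise ValueError("need rmse > 0 and observed_sd > rmse")
    g = sqrt(true_variance) / rmse      # separation ratio
    r = g ** 2 / (1 + g ** 2)           # reliability, between 0 and 1
    return g, r

# Hypothetical assessment: scores spread with SD 5, average error 3.
g, r = separation_and_reliability(observed_sd=5.0, rmse=3.0)
print(f"separation ratio: {g:.2f}, reliability: {r:.2f}")
```

A higher separation ratio means the spread between students is large relative to the noise in each student's score, which is why it works as a measure of internal reliability.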

Reliability is an important part of the puzzle, and it is a prerequisite for making valid inferences. But, as we have noted before, it isn’t everything. You could have a perfectly reliable system that is telling you something completely meaningless. In an extreme case, an AI could simply be measuring essay length - which is highly reproducible - while a human is looking for a much broader concept of writing quality.

Validity also includes second-order effects. An assessment can work well to begin with, but degrade over time as it influences the way a subject is taught. If an AI only rewards long-windedness, teachers will eventually teach pupils to be long-winded.

But is it meaningful?

So, how does AI hold up on validity? We looked at three key areas:

1. Human-AI Agreement

If humans and AI agree, it suggests they are valuing the same features. In creative writing - an open-ended domain - two humans generally agree on a comparative judgement (which essay is better) about 85% of the time. Our AI judges agree with the human judges about 82–83% of the time. This suggests that, broadly speaking, the AI is valuing the same things in children’s writing that we are.
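An agreement figure like this is a simple proportion: the share of essay pairs on which two judges pick the same winner. A minimal Python sketch, assuming each judge's decisions are stored as a list of "A" or "B" choices over the same pairs (the function name and the toy data are hypothetical, not the production pipeline):

```python
def agreement_rate(judge_a, judge_b):
    """Share of pairs on which two judges preferred the same essay."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must rate the same pairs")
    if not judge_a:
        raise ValueError("need at least one pair")
    same = sum(a == b for a, b in zip(judge_a, judge_b))
    return same / len(judge_a)

# Toy data: each entry is the essay ("A" or "B") a judge preferred in one pair.
human = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]
ai    = ["A", "B", "A", "B", "B", "B", "A", "B", "A", "B"]

print(f"human-AI agreement: {agreement_rate(human, ai):.0%}")
```

In this toy example the two judges disagree on two of the ten pairs, so the agreement rate is 80%.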

2. Session-on-Session Correlations

We follow children over time as their writing develops. We expect to see two things: writing should improve with age, and there should be a strong correlation between one session and the next. Our data shows that AI identifies this improvement clearly. In a natural experiment where we split schools between AI and human judges, the correlations were remarkably similar.
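A session-on-session correlation can be sketched with a plain Pearson coefficient over the same pupils' scores in two sessions. The scores below are invented for illustration, and whether the real analysis uses Pearson or another coefficient is an assumption on our part.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary in both sessions")
    return cov / sqrt(var_x * var_y)

# Hypothetical scaled scores for the same five pupils in two sessions.
autumn = [480, 510, 530, 555, 600]
spring = [500, 520, 560, 570, 630]

print(f"session-on-session correlation: {pearson(autumn, spring):.2f}")
```

A value close to 1 means pupils keep roughly the same rank order from one session to the next, which is what you would expect if both sessions are measuring the same underlying thing.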

3. Theoretical Alignment

We know from years of research that children’s writing improves rapidly at first and then slows down, creating a characteristic growth curve. In a recent session with Australian schools (Years 2–6), our AI judges produced a developmental picture that mapped almost perfectly onto this established curve.

[Figure: relationship between achievement on a writing test and age. Year 1 was judged by teachers; the other year groups were judged by AI judges.]

The psychometric perspective from Chris

From the beginning, we’ve faced a theoretical hurdle: How can an AI - sometimes described as a “stochastic parrot” predicting the next word in a sequence - produce meaningful decisions on human creativity?

I’ve moved through stages of scepticism and disbelief. However, after building these models myself, I’ve gained a new perspective. AI models employ vast matrices of probabilities. To a psychometrician, these probabilities are actually quite reassuring; our statistical models are already built on deriving measurements from “stochastic instruments.”

While we may never fully understand the exact “thought process” of an AI as it navigates millions of dimensions of probability, the evidence suggests that it is successfully reproducing human values.

The pedagogical perspective from Daisy

I first wrote extensively about the theory of AI marking in my 2020 book Teachers vs Tech. I discussed the idea that AI markers were unaccountable “black boxes” that can’t explain their decisions, and also considered the possibility that human markers were just as unaccountable and “black box” as the AI!

However, ultimately I concluded with the following: “I still think that we are better off having human judgement involved in marking essays because ultimately, we want our students to learn to write in ways that other humans appreciate.”

I stand by that conclusion, and it’s important to us that our AI-enhanced model includes human judgements. But the reason why I am supportive of using AI judges is that they really do seem to be rewarding the same qualities as our human judges. And, weirdly, there are important ways in which our AI is better at rewarding what humans want than humans are. After every assessment I do a qualitative review of the top essays, the weakest essays and the big human-AI disagreements. Most of the big disagreements do not involve profound philosophical disagreements between human and machine intelligence. They involve humans accidentally pressing the wrong button, or getting misled by bad handwriting.

So far, I simply cannot see any systematic way that you can game the AI judges. In my head, I keep trying to think like a hacker or an adversarial attacker - to find ways that you could get a top mark from the AI judges that you didn’t really deserve. I can’t really find any method that would work!

We will always need to keep monitoring this and ensuring the AI is aligned with humans, and that is why our 90% AI - 10% human model is so important, as it makes sure that every piece of writing is still seen twice by a human, and will immediately alert us to any rogue AI problems.

The patchwork of validity: our next steps

Validity is rarely conclusive. It is a patchwork built over time. We still have work to do, such as:

Addressing outliers: AI still struggles with responses that fall outside its training data.
External variables: We need to see how AI writing scores relate to external tests in reading or mathematics.
Preventing “gaming”: Ensuring the system remains robust against those trying to trick the algorithm. Facial recognition systems are tested using “adversarial attacks” designed to expose their weaknesses. We have a few ideas for something similar!
Addressing “washback”: As teachers and their pupils understand they are being assessed by AI, will teaching and learning change?

We are no longer just theorising about AI in education; we are measuring its impact. And so far, the “strange process” of the AI judge is proving to be a remarkably human-like one.

Do knowledge-rich curriculums cause mental health problems?

Daisy Christodoulou — Sun, 19 Apr 2026 07:52:11 GMT

Between 2010 and 2024, England carried out a significant programme of education reform. It now has a more knowledge-rich national curriculum, more terminal exams and more policies recommending or even mandating approaches like phonics teaching. There is also now a growing body of international evidence that these reforms have had a positive impact on attainment.

However, over a similar time period the mental health of England’s children seems to have deteriorated. Data from England’s National Health Service shows increases in teens with mental health problems, and various international surveys show similar trends.

Over the last year or so, a lot of policymakers and think tanks have been queuing up to link these two trends. What if England’s tough new education policies have succeeded in raising attainment but at the cost of making kids miserable?

The Social Market Foundation explicitly blames Michael Gove, the Secretary of State for Education from 2010-2014, for creating a system “which encouraged rote learning, discouraged school investment in extra- and co-curricular learning, and contributed to high-stress academic environments.” Just a few days ago the former Children’s Commissioner, Anne Longfield, blamed rising absenteeism on “a rigid curriculum that hasn’t allowed a lot of the more varied aspects and creative aspects of education”. The National Education Union says that “a top-down ‘exam factory’ culture and a stifling curriculum have, up to now, resulted in high rates of mental ill-health among young people.”

The timelines certainly seem to match up. England made big reforms to its education system, and mental health and social problems spiked soon after. But of course, correlation isn’t causation. Can we be sure the education reforms caused these problems? What if something else is the culprit?

A natural experiment

At the same time as England was implementing the above curriculum reforms, two other parts of the UK designed very different education policies. In 2010, Scotland introduced a skills-based curriculum with significant non-examined elements - the Curriculum for Excellence. Wales chose not to copy England’s reforms in 2010, and instead recruited one of the designers of Scotland’s curriculum for their own reforms. Northern Ireland are currently carrying out their own education review which is much more influenced by England’s approach.

The international attainment data shows that scores in Scotland and Wales have declined and those in England have improved. But there are also large international health and wellbeing datasets. We can carry out similar analyses on this data to see if there are relative changes between the nations.

One good study to look at is the World Health Organisation’s Health Behaviour in School-Aged Children (HBSC), which has a battery of questions on mental health and wellbeing. England, Scotland and Wales are reported separately. Northern Ireland are not included. Here is a chart showing the mean life satisfaction of 15-year-olds in England, Scotland & Wales over a 20-year period which covers the timescale of these reforms.

You can see that at the start of this period, all three countries had very similar scores, with Scotland slightly ahead. By the end, all three countries also had very similar scores, with Scotland slightly ahead by an even smaller amount, and all three below the Western Europe average. The major standout trend over this time is not anything to do with the relative position of each country, but the big fall in each country’s scores from 2018 to 2022. This fall was seen across Western Europe and in most of the other regions too - it’s a global fall, and one that it seems plausible to attribute to the Covid-19 pandemic.

HBSC asks other questions we might be interested in. For example, 15-year-olds were asked how much they like school. The vast majority of English teens do not like school a lot. But the vast majority of Welsh and Scottish teens don’t either.

Most strikingly, there is barely anything to pick between the three countries when it comes to feeling pressured by schoolwork. I found this remarkable - England really does have a lot more exams and less coursework than Scotland & Wales, and I would have predicted this would have led to English teens feeling much greater pressure. But it hasn’t.

The striking feature of all of this data is the similarity of the three countries’ data, not the differences. Their education policies and attainment scores have definitely diverged. But happiness and wellbeing have not. It is very hard to look at this data and argue that England’s knowledge-based curriculum has made students unhappy, or that Scotland’s skills-based curriculum has made students love school.
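The logic of this comparison is a simple difference of differences: work out how much each nation's score changed, then compare England's change with the others. Here is a minimal Python sketch. The numbers are made up purely to show the arithmetic and are NOT the HBSC figures, and the function names are hypothetical.

```python
# Illustrative numbers only: these are NOT the HBSC figures.
mean_life_satisfaction = {
    "England":  {"before": 7.4, "after": 6.9},
    "Scotland": {"before": 7.6, "after": 7.0},
    "Wales":    {"before": 7.3, "after": 6.9},
}

def change(nation):
    """Change in a nation's mean score between the two survey rounds."""
    scores = mean_life_satisfaction[nation]
    return scores["after"] - scores["before"]

def relative_change(nation, comparison):
    """How much more (or less) one nation changed than another."""
    return change(nation) - change(comparison)

for other in ("Scotland", "Wales"):
    gap = relative_change("England", other)
    print(f"England's change relative to {other}: {gap:+.2f}")
```

If England's reforms had damaged wellbeing, England's change would be clearly more negative than the others. A relative change close to zero means the nations moved together, which points to a shared cause rather than a policy that only applied in one of them.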

What about other studies? The most recent “World Happiness Report” found that the trend of declining teen happiness was particularly stark in Anglophone countries. Under-25s in America, Australia, the UK, Canada, Ireland and New Zealand all had dramatic declines in life satisfaction, more so than in other countries.

What is going on here? People have put forward plenty of possible culprits: smartphones, economic insecurity, the Covid pandemic, housing, family disintegration, and so on and so forth. All of these seem like plausible possible candidates which are at least worthy of discussion.

What is not plausible is to pin these dramatic UK, Anglosphere and global trends on a specific set of education policy decisions that only affected England. Teen life satisfaction is declining in London, Edinburgh, Cardiff, Belfast, Dublin, Melbourne, Auckland, New York and Toronto. You can’t blame this on Michael Gove removing Of Mice and Men from the GCSE syllabus.

What do they know of England who only England know?

When devolution happened in the late 1990s, one of the more wonkish claims for it was that it would provide useful policy laboratories, where one nation could try something different and the rest could learn from a culturally similar region. For example, Scotland introduced a smoking ban in 2006, it went well, and England followed suit in 2007.

For this to work, journalists and policy makers in Westminster would actually have to pay attention to what is happening in Scotland, Wales and Northern Ireland - which, by and large, they don’t. Time and again I read articles in prestigious publications that lazily conflate England and the UK, implicitly assume that the Westminster Secretary of State has the power to reform education across the UK, and totally ignore the very significant Welsh and Scottish education reforms.[1] Just this week, a “Big Read” in the Financial Times made all these errors.

Journalists generally prefer to visit Helsinki and Tallinn rather than Cardiff; they prefer to talk about how there are no league tables in Finland rather than about how Wales abolished them in 2001 and it had an immediate and negative impact on school performance. More positively, Wales has had notable successes with its SEND policies, but in the acres of newsprint devoted to English SEND reforms you rarely hear this mentioned. Scotland has no university tuition fees for Scottish students at Scottish universities, but there is very little analysis of what difference this has made to subject uptake compared with England.

In Wales, the journalist Rhys Williams has done incredible work on how reading is taught in Welsh schools. In Scotland, Lindsay Paterson’s research on the curriculum demonstrates the power of content knowledge. As mentioned earlier, Northern Ireland are carrying out some far-reaching reforms to their curriculum and assessment system.

At No More Marking, we work with schools across the UK and in the USA, Australia, and New Zealand. Half the subscribers to this Substack are based outside the UK. I’m always fascinated by how the surface features of each system are very different and take time to understand, but often the deeper forces at play are very similar. Sometimes the problems we think are unique to our own system are not, and studying how other systems work can stop us leaping to parochial conclusions about our own.

[1] In some ways it is not unreasonable to conflate UK and England statistics: England makes up 85% of the population of the UK, so if you see a UK population average it is fair to assume that the England average will be fairly similar. That is obviously not true of the other three countries. And in some ways the Westminster government does have power over all UK schools: a recent high-profile policy involved the UK government removing VAT exemption for independent schools. Tax policy is not devolved, so this policy does affect independent schools in Wales, Scotland and Northern Ireland.

Why grades are misleading

Daisy Christodoulou — Sat, 11 Apr 2026 08:04:32 GMT

Grades are an established feature of most assessment systems, and are taken for granted as a sensible way of reporting attainment data.

But do they deserve that status? They create a lot of distortions, they don’t mean what people think they do, and there are better alternatives available.

In this post, we’ll explain what the problems with grades are, how we do things differently, and what our new “grade probabilities” report looks like.

What people think student attainment looks like

Many people have a mental model of a grade as a discrete category that is separate and distinct from other grades. They think students in one grade are qualitatively different from students in another grade, as shown in the following image.

A number of aspects of our current grading system reinforce this idea. For example, we give grades labels like “at the expected standard”, and we have marking rubrics that suggest there are discrete breaks in performance between one grade and the next. In the chart above I have used the grades from England’s primary system, but almost every jurisdiction we work in has something similar. A lot of teacher-created grading systems have the same problem. “Red, amber, green” is a grading system. So is “emerging, expected, exceeding”.

However, this is not how attainment works, and thinking it does causes a lot of problems.

What student attainment actually looks like

Student attainment follows a continuous distribution. The image below gives a much better representation of how it works.

Why this is a problem

Grades are just lines drawn on an underlying distribution. They don’t correspond to sudden leaps in student attainment. When you treat them like they are discrete categories, it causes big distortions, as you can see in the image below.

Paul and George both have the same grade. But Paul has more in common with John, in the grade below. And George has more in common with Ringo, in the grade above.
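
To make this concrete, here is a minimal sketch of the same idea in code. The scaled scores and grade boundaries are invented purely for illustration (they are not real data or real national cut-offs), and the names `BOUNDARIES`, `scores` and `grade` are hypothetical.

```python
# Hypothetical scaled scores and cut-offs, chosen only to illustrate the point.
# Each boundary is (minimum score needed, grade label), in ascending order.
BOUNDARIES = [(0, "lower"), (500, "middle"), (600, "upper")]

scores = {"John": 495, "Paul": 505, "George": 595, "Ringo": 605}


def grade(score):
    """Return the label of the highest boundary the score reaches."""
    label = BOUNDARIES[0][1]
    for minimum, name in BOUNDARIES:
        if score >= minimum:
            label = name
    return label


for student, score in scores.items():
    print(f"{student}: {score} -> {grade(score)}")

# Gaps on the underlying continuous scale.
print("Paul vs John (different grades):", abs(scores["Paul"] - scores["John"]))
print("Paul vs George (same grade):", abs(scores["Paul"] - scores["George"]))
```

With these made-up numbers, Paul and George share a grade but sit far apart on the scale, while Paul and John sit close together on either side of a boundary.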

A three-part grading system is the worst kind of grading system and causes all kinds of problems. The categories are too big to be useful. They incentivise tiny progress at the grade boundaries, and don’t reward big progress elsewhere. They result in very volatile accountability measures. And perhaps most damagingly, they are particularly bad for students at the bottom of the middle category – that is, in the chart above, Paul. Paul is told everything is OK and he is doing fine but in reality he is struggling as much as John is.

As well as these very practical and immediate problems, there is a deeper conceptual problem with thinking that student attainment is discrete. Three-part grading systems encourage the flawed idea that skills are discrete and that you can “level up” by teaching a new skill and jumping to the next grade. If, on the other hand, you accept that skills are composed of sub-skills and knowledge, you will recognise that students improve on a slow and steady incline, not in sudden jagged steps. I’ve written more about this link between assessment and the knowledge-skills debate here.

Improving reporting with scaled scores and writing ages

The ideal improvement would be to report scaled scores, not grades, and that’s what we do with all of our writing assessments. A criticism of this approach is that people don’t know what a scaled score means. One way we have tried to fix this in the past is by converting all our scaled scores to a writing age. We are quite proud of this and think that it is the first writing age anywhere in the world (although it follows very similar principles to reading ages, which are very popular). The basic principle is that we are trying to address the misconception about grades being discrete by using a comparison with an everyday metric – age – which everybody intuitively understands is continuous.

However, we still operate within a national system that uses a three-part grading system, and the clash between the two systems causes problems. We report the writing age alongside the scaled score and the national Working Towards, Expected Standard, Greater Depth indicator. This means that it is possible for a student to get the Expected Standard label and still get a writing age that is lower than their chronological age. For example, a Year 6 student who is aged 11 could get a writing age of 9 years and 6 months, and still get the Expected Standard. We get so many questions from schools asking us how this is possible, and of course it is very confusing.

But it is the result of the government setting the Expected Standard at the 28th percentile. Expected Standard does not mean, as many people assume, that you are working at the average standard for your age. It includes students who are about 18-24 months below the average. This is true for reading and maths too. Our writing age hasn’t created this problem; it has just revealed it.

Our latest innovation: grade probabilities

In our upcoming set of Year 6 writing results, we’re going to introduce a new report: grade probabilities. This will tell you the percentage chance that a student is at a certain grade.

Here’s an anonymised example of a student with a similar profile to Paul. He has a 47.5% chance of getting the lower grade, and a 51.5% chance of getting the middle grade.

The metric here is measuring something different from the writing age. The writing age is a measure of attainment. It takes a given scaled score and just converts it into a typical age.

The grade probability is a measurement of certainty: how sure can we be that this student is above a certain threshold?

However, what both metrics have in common is that they replace a crude and distorting threshold system with a smooth and continuous metric.
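
For readers who like to see the arithmetic, here is one way a grade probability could be calculated. This is an illustrative sketch, not a description of our production method: it assumes the uncertainty around a scaled score is normally distributed with a known standard error, and the thresholds, standard error and function names are all hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Probability that a normal(mean, sd) variable falls below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def grade_probabilities(score, standard_error, thresholds, grades):
    """Chance that the student's true score lies in each grade band.

    thresholds: ascending cut-offs between grades (one fewer than grades).
    Assumption: measurement error is normal around the observed score.
    """
    edges = [-math.inf] + list(thresholds) + [math.inf]
    probabilities = {}
    for i, name in enumerate(grades):
        below_upper = normal_cdf(edges[i + 1], score, standard_error)
        below_lower = normal_cdf(edges[i], score, standard_error)
        probabilities[name] = below_upper - below_lower
    return probabilities


# A hypothetical student sitting just above the lower/middle boundary.
result = grade_probabilities(
    score=501,
    standard_error=20,
    thresholds=[500, 600],
    grades=["lower", "middle", "upper"],
)
for name, probability in result.items():
    print(f"{name}: {probability:.1%}")
```

Under these assumptions, a student just above a boundary has close to an even chance of belonging in either adjacent grade, which is the kind of profile the report is designed to surface.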

We hope this will help schools when it comes to making decisions about Year 6 writing moderation. If it works well and schools like it, we can introduce it for more year groups and jurisdictions. If you’d like to learn more about our assessments, we have an intro webinar coming up later this month.

Education technology is never neutral

Daisy Christodoulou — Sat, 28 Mar 2026 08:42:42 GMT

“Any technology can be used well or badly.”
“Technology is just a tool - what matters is what you do with it.”
“Kids can use a tablet to study or to play games - the issue isn’t the tablet, it’s what they are doing on the tablet.”

I hear this argument all the time: that when technology gives you a bad outcome, the problem is not the technology but the way teachers or kids are using it.

For example, last week, Matt Yglesias wrote an article called “Ed tech is not the answer or the problem”. Referring to a specific app that has come in for a lot of criticism, he said that it was probably being used well in some effective schools, but poorly in some ineffective ones. The issue was not the app, but how it was being used.

But asking whether ed tech is “good” or “bad” is like asking whether schools should have desks or whether teachers should use erasers. In both cases, they almost certainly should!
But the presence or absence of erasers is not what’s making the difference between effective and ineffective schools. If you had a building full of good teachers who were using a good curriculum and had adequate support from administrators and other stakeholders but for some reason they weren’t allowed to use erasers, they would find that annoying, but I’m sure they’d figure it out.

This is a really popular and persuasive argument, and there is a bit of truth to it, because high-functioning and well-managed organisations can make the best of a bad situation. But ultimately, I think it’s misleading. Truly high-functioning organisations do not deliberately choose tools that create bad situations. They choose the tools that are right for the job. And they do so because they understand that tools are vitally important. They are not neutral and interchangeable widgets, and they are capable of having a profound impact on the way we think and behave.

Tools change our behaviour

Tools make some behaviours more likely and others less so.

Take Yglesias’s own example of an eraser. It’s a very simple tool, but it still changes behaviour. It makes some behaviours less likely, and other behaviours more likely. In a classroom where every pupil has an eraser, the attitudes to error will be different from those in a classroom where no one has an eraser. A teacher could try to create the same culture and norms in each classroom, but the presence or absence of a specific tool will make it easier or harder.

Recently, Adam Kucharski wrote about coding using Mathematica, which only allows one “undo”. When he codes using that app, he is far more careful and cautious than if he had unlimited “undos”. The “undo” tool - basically a digital eraser - shaped the way he thought.

The argument I am making here is an extension of Marshall McLuhan’s “the medium is the message” argument. I think Neil Postman has given the best concrete example of this: if your major medium of communication is smoke signals, then your messages are unlikely to include philosophical tracts. The form of smoke signals precludes certain content and types of thought.

Screens make certain behaviours more likely

Laptops, tablets and phones are far more powerful than an eraser, and have a much more powerful effect. They often replace a textbook or an exercise book, but compared to those paper technologies they make task-switching much more likely.

You could be the best teacher in the world, and be completely committed to getting students to concentrate deeply and read difficult texts. But if you are in a classroom where every pupil accesses the content via a screen, I think you will be less likely to achieve your aims than a weaker teacher in a classroom with no screens at all.

Not only that, but there are big differences between different screen types. They are all optimised for different functions, and make those different functions more likely.

Desktops and laptops have physical keyboards, and are optimised for long-form writing, and not for messaging on the move. Mobile phones are optimised for scrolling, swiping, and short messages. You don’t see people walking down the street texting on their laptop. And people tend not to write novels on their phones. Tablets are different again. I think they are optimised for passive consumption of media, as opposed to creation of it.

The mode effects research: yes, ed tech is a problem

I don’t know enough about the specific app Yglesias refers to in his article. But I do think that regardless of the quality of the app or the content on it, there is a difference between learning on screen and learning on paper.

There is a large research literature on “mode effects” - essentially what happens when you change the medium of an assessment but keep the content the same.

One of the best and most rigorous recent studies analysed the results of more than 3,000 students in Germany, Ireland and Sweden, who had taken the 2015 Programme for International Student Assessment tests in reading, maths and science. The students were randomised into two groups. One group took the test on paper; the other took it on a computer. The paper-based group scored a full 20 scaled-score points higher than the computer-based group. That is the equivalent of about six months of additional schooling - a huge difference.

I spoke to the author of the paper, John Jerrim, for an article I wrote about this research for the TES, and he told me that he was really surprised by the magnitude of the effect. If an educational intervention caused that kind of improvement we would be rushing to scale it up!

AI-enhanced Comparative Judgement: can we make ed tech part of the solution?

At No More Marking, this is something we think about constantly. What do our tools and technologies make more likely? What do they make less likely?

We’ve been running Comparative Judgement assessments for nearly a decade, and have put significant effort into creating paper-based assessments that can be assessed digitally. Our system allows you to assess writing in an incredibly technologically sophisticated way - without a pupil ever seeing a screen. We’ve assessed about 3 million pieces of writing using this process.

We have now added AI judges to our assessments, which changes the dynamics yet again. What will AI assessments make more and less likely? Well, AI assessment is faster and easier than human assessment. If you make something quicker and easier, it tends to happen more often. So schools may run more writing assessments.

That could be good. It could mean better validation of interventions, reduced teacher workload, more opportunities for pupils to receive feedback — even, potentially, daily practice in the run-up to big national exams.

But it could also be bad. In younger years, for example, an increase in extended writing assessments may not be desirable. Shorter, different kinds of assessment may be more appropriate. We have some of these already, but maybe we’ll need to beef them up and make them more prominent.

We also provide a wider range of feedback, some of it directly created by AI. What will the effect of this be? We recently wrote about some focus groups we’ve been doing asking students about what kinds of feedback they prefer. We also have a project running right now where we measure how much improvement students make when they redraft their writing in response to AI feedback.

Our aim is to create tools that make good outcomes more likely and bad outcomes less likely. This will not happen by accident!

Am I denying human agency?

The attraction of the “technology is neutral” argument is that it makes us feel like we are in control. As I say, there is a grain of truth to this: there will be a range of ways you can deploy a technology.1 My argument is that the range is limited. The technology sets the floor and ceiling you operate within. The more powerful the tool, the narrower the range you can operate within. An eraser is a relatively weak tool which still gives the teacher and students a wide range to operate in. A tablet is a much more powerful tool, and its power constrains the behaviour of teachers and students.

Your true agency is not in how you use the tool. By that point, the constraints are already in place. Your true agency lies in how you select the tool, and in the input you have on its design. As Winston Churchill (almost) put it: we shape our tools, and thereafter our tools shape us.

Help us shape AI feedback & assessment!

And that is why we are constantly talking to schools about what they want in an assessment system, and reviewing the data to see what impact it’s having. If you would like to be a part of these design efforts, you can!

My colleague Chris has created a user group for secondary schools in England who want to use our AI system to mark GCSE mocks. If you’d like to learn more about this, contact us.
I am leading our efforts to optimise the AI feedback in different subjects. If you’d like to try out our system with 30 free credits, you can book a call with me here.
If you would just like to learn more, sign up for our next intro webinar on Mon 27 April.

Matt Yglesias presents some data to show that different schools using the same app can still get different outcomes. Yes, of course that will happen. I would not expect every school with omnipresent tablets to get exactly the same outcomes. I would not expect every school with phone bans to get exactly the same outcomes. Still, it is suggestive that in England, the highest performing schools tend to have very sparse use of screens in the classroom. (“Highest performing” as measured by the very sophisticated Progress 8 measure which measures how much value every secondary school adds across 5 years of education.)

The ethics of AI assessment: five big issues

Daisy Christodoulou — Sat, 21 Mar 2026 08:30:56 GMT

Over the past couple of weeks, my colleague Chris and I have been out and about talking to teachers, students and parents about what they do - and do not - want from an AI assessment system.

Here are five big issues that keep recurring.

Speed & efficiency matter

Students care about getting feedback quickly, teachers care about excessive workload and senior teams care about students getting enough exam practice.

There is nothing wrong with any of this. As I’ve written about in a completely different context, speed matters. Speed is not opposed to quality; it is an aspect of quality. If you get feedback on your essay within a couple of hours of completing it, you will be much more likely to understand it and act on it.

And yet it is obviously wildly unrealistic to expect human teachers to routinely provide feedback on essays within a couple of hours. In fact, one of my biggest bugbears as a teacher was when a student would hand an essay in a week late, and then turn up at the staff room door at the end of the day asking if I’d marked it!

If AI can deliver quicker feedback, that’s definitely a good thing.

Accuracy

Everyone worries about AI errors, and about what the process is for dealing with them. Again, this is perfectly legitimate: one of the things that we’ve written a lot about is that traditional exam systems have well-established processes for dealing with human errors which don’t work with AI (we are building the processes for AI - see here).

But the flip side is that everyone understands that humans make errors too. I have spoken to groups of students and teachers who were genuinely shocked to learn that currently, with human marking, in GCSE English Literature you only have a 52% chance of getting your true grade.

Human contact

Students care what their teachers think about them and they want their teachers to read what they write.

However, there were some ways in which they were interested in the concept of AI feedback for its own sake - not just because it would be quicker.

For example, one student told us they liked the idea of AI feedback because it might offer a different perspective from their teacher, and pick up things their teacher had not thought of.

De-skilling

Another concern we hear - particularly from senior teams - is about de-skilling: what if teachers lose the capacity to mark essays and give feedback on student writing?

In some areas, I don’t care about de-skilling. For example, I have seen so many examples of teachers staying late formatting PowerPoint slides and trying to find just the right image for their worksheet, and I have never been convinced it’s a good use of their time. Andrew Old has written about this recently and I largely agree with him. If a teacher never designed another resource again I would not be that bothered.

However, when it comes to assessment, I am much more concerned about the possibility of teachers losing important skills. If a teacher never read another student essay again I would be very concerned.

We have to design systems that reduce workload and speed things up, but that keep teachers engaging with student writing.

The environment

Recently, we have been hearing more concerns about the environment, particularly about how much water AI uses, and therefore how much water it would take to mark an essay.

We are not unconditional AI boosters, and we are always willing to consider the downsides of the technology. But on water use specifically, I think the concerns have been overblown. Andy Masley has done some excellent analyses of this, including showing that one of the most famous analyses of AI water use confused cubic metres and litres and was out by a factor of 4500!

Find out more

As ever, if you want to find out more about what we do, you can join one of our intro webinars. The next one is on Monday April 27. These webinars are very popular - we show you how the system works and at the end we give 30 free credits to all attendees, so you can try it yourself on a class set of essays in any subject.

If you work in a school, you can also book a 30-minute call with me here where I can get you set up on our system with 30 free credits.

The democratisation of cheating

Daisy Christodoulou — Sat, 14 Mar 2026 08:45:14 GMT

A couple of years ago, a grocery delivery company came up with a catchy slogan. Their service, they said, was about the “democratisation of laziness.”

It is a memorable phrase, and also slightly unsettling. It’s true that it is easier to be lazy if you are wealthy and privileged, but it’s also true that we don’t think of laziness as an absolute good to be maximised. Rather than giving everyone the chance to be lazy, maybe we should think about finding ways to make everyone less lazy?

Of course, laziness has its upsides as well as downsides, so maybe democratising it is not so bad. I’ve written about this dilemma in a piece on the stupidogenic society.

But there are some things that are more unambiguously bad where we should definitely try to remove elite privileges rather than spread those privileges to everyone. Cheating is one of them. Wealthy students have always been able to pay top dollar to have bespoke essays written for them. But until recently, this kind of unidentifiable cheating was only available to the very wealthy. Large Language Models have changed all that. They have democratised cheating, and made it available for the masses.

This is a problem for schools, but it’s a much bigger problem for universities, who rely a lot more than schools on unsupervised written assessments. A few years ago, when it became clear that LLMs were so good at writing essays, I naively thought that universities would just have to hold all their assessments in person. That has not happened. Social media is full of academics lamenting the AI slop they have to mark. There is a lot of lamentation, but a lot less action.

I have a longer article in Engelsberg Ideas this week where I draw a historical parallel with the medieval sale of indulgences - another case where technology dramatically expanded access to a controversial shortcut.

In Germany and in England, one of the first uses for the new printing press was to create pro forma indulgence certificates that could be filled in with the purchaser’s name. The first item printed in England, by William Caxton, was one of these certificates. In Germany entire batches of them were printed.

In the short term, this made the church a lot of money. In the medium term, it caused them a lot of problems. Before long, Martin Luther started using the printing press in a different way, to spread his criticisms of the sale of indulgences. (Interestingly, the printer of his 95 Theses also printed books of indulgence certificates.)

Maybe in the short term it is easier for universities to turn a blind eye to the obvious cheating that is going on. I can see how students and professors might grumble if their traditional assessment system was changed, and perhaps students would be less likely to attend universities that had cheat-proof in-person assessments, which in turn would affect those universities’ bottom line.

But the medium- to long-term consequences of letting the AI slop become normalised are terrible. Maybe not “Thirty Years’ War” terrible, but arguably “Dissolution of the Universities” terrible. Students are not stupid! They know that if they are putting everything through AI, so are all their classmates! At a time in the UK when people are starting to question the monetary value of a degree, and to wonder whether some university expansion is justified, the inability of universities to respond to technological change is storing up massive problems.

The AI Death Zone: a cautionary tale

Chris Wheadon — Sat, 07 Mar 2026 08:30:25 GMT

This week’s Substack is a bit of an “inside baseball” report on how we are - and are not - using AI tools to develop our website. It’s by our CEO, Chris, with reference to our CTO, Brian. The same AI promises and pitfalls we’ve found when marking writing are also present when developing websites. For more on our AI marking platform sign up to our next webinar on Mon 27 April.

Brian and I have been coding together for over 20 years now. I met him at the CEM Centre at Durham University in the early 2000s, and he taught me how to code. In 2013, we founded No More Marking together and have scaled it to process millions of writing scripts every year. We’ve been through every technological fad there is.

We started by trying to deliver Python code on the web using Django, which taught us that the web and Python were not easy bedfellows. I lost weekends configuring Linux boxes with endless scrolling text that would end in baffling errors. Thank goodness for Stack Overflow. We then moved to Meteor, which offered real-time updating of information for users that seemed magical but didn’t scale. Finally, we moved on to a proper serverless stack when serverless was just becoming a thing. At every stage, we’ve been ready to take on the latest innovation to learn how to deliver using the best of new technologies.

We’re certainly far from the stereotype of the programmers who learned to code at university using PHP and never really wanted to move beyond it. We’ve been through Ruby on Rails and Scala, and fulfilled our Bayesian dreams with OpenBUGS. Recently, however, we’ve been faced with a new challenge: the AI coding agent.

For the first time in over 20 years, Brian and I no longer see eye to eye. Brian has fallen prey to AI addiction in a way I fear is irredeemable. It started fairly innocently; I noticed Brian was making function calls that simply didn’t exist in the libraries he was using. When I asked him if he’d actually read the documentation, he would say we no longer need to—the AI would do it for us. He was at a loss to explain why the AI was inventing methods and properties that the library simply didn’t support.

That’s where it started. It has since got a lot worse.

The Mirage of “Vibe Coding”

Brian has now developed what I would call a severe case of AI dependency: he requires a regular hit, a new model, a new release from Anthropic or OpenAI to keep him going. A small dose used to be enough, but he’s now taking stronger and stronger hits, always craving the latest designer drug.

In the early days of No More Marking, Brian’s nickname was “imaginary Brian”. As I was out on the road talking to users, I would always refer to “Brian” who would sort things out back at the ranch. As no one had ever seen Brian, there was a rumour he didn’t actually exist. Now I fear that rumour has become reality, and I wish we could get the real Brian back.

Most recently, he convinced me that we could “vibe code” an entire application. I read up on it; I read all the blogs from the DeepMind crew and suspended disbelief while I read about spec-driven AI development. As a huge fan of Test-Driven Development, where you write a test first and then write the code, this seemed like something I could get on board with. We fired up the planner from GitHub and spent hours crafting a specification for a fairly simple component of the app. We did everything we thought was expected for vibe coding success. After 10 minutes of watching GitHub whirring away, producing the most insane set of documents, specifications, features, plans and architectures I’ve ever seen, we stopped it.

But that wasn’t the end. We had to switch tools—there’s always a better tool, a different approach, a better model. We switched platform. We spent 70% of our budgeted time planning, making sure the architecture we were dictating was sensible. It was at a level where a team of human coders could have created the app feature for us. I wouldn’t let Brian press the “start coding” button until I was sure we had everything in place. As the planning stage got longer, his finger got more and more twitchy. Eventually, I let him press the button for the AI to start coding.

The Syntax vs. The System

He assured me we could go and have lunch, and when we came back, the code would be ready. It sounded fabulous. The promise was that we had done the hard work—the human thinking—and we’d leave the grunt work—the actual writing of the syntax—to the AI. Surely nobody wants to be writing syntax when they could just be thinking.

Two hours later, a rather crestfallen Brian called me. “I don’t think we got the specification quite right,” he said. I asked to see what it had done. He said, “Well, we’ve got issues with things like data typing and interface issues, but let me just see what happens if I click this button here.”

I asked, “Have you read the code? Do you know what that button is going to do? If you attach a PDF and click that button, do you know where that PDF is going to go?”

Of course he hadn’t read the code, but thankfully when he clicked the button, nothing at all happened. At this point I knew we were about to enter the Death Zone. The Death Zone for mountaineers is where, starved of oxygen, progress slows to a slow-motion crawl. Programmers are not starved of oxygen, they are starved of understanding. They are looking at code they haven’t written, assumptions they never would have made. We only seem to hear these days from those who have summited, but I suspect the AI death zone is piled high with bodies, all with the words “just one more prompt” frozen on their lips.

The Art of the Language

Brian and I are experienced coders. We’ve delivered web apps at scale with millions of concurrent hits and performed sophisticated statistical analysis. I, for one, have been published in the Journal of Statistical Software—a career highlight. No doubt some will think we just used the wrong model or the wrong tool. Maybe. But while Brian gets a hit of adrenaline with every new model, I have that familiar sinking feeling.

From a personal point of view, I would be very happy if vibe coding turned out to be a mirage. We and others have written about how writing is thinking1. Surely coding is thinking? For the last 10 years, I’ve worked with the R language and seen how it has developed. That evolutionary process has been vital to it becoming the world’s most popular statistical language.

The ecosystem in R called the tidyverse2 is a work of great beauty. Those of us who learned R before the tidyverse remember just how difficult some things were to achieve in an elegant fashion. The tidyverse evolved on top of the base R language through the hard work and dedication of creative, unpaid individuals, and it opened up new possibilities. The tidyverse became a vibrant ecosystem which makes data science accessible and fun. No one got rich from creating the tidyverse, but the world got to be a safer, more creative and more beautiful place.

I simply cannot understand how a Large Language Model that is trained to reproduce patterns could ever produce genuine evolutions in the coding languages we use. The tidyverse was carved out by people who understood the base language so deeply they knew exactly how to improve it. They didn’t just “vibe” with the syntax; they mastered it and improved it. The tidyverse manifesto quotes Hal Abelson: “Programs must be written for people to read, and only incidentally for machines to execute.” Are we entering a world now where only machines read programs?

It’s probably time to get back to Brian now and see if I can rescue him from the death zone. He’s soon off for a trip to Shanghai where he’s looking forward to being driven around in a driverless car. When he ends up in the Yangtze he’ll still be wondering if he got the prompt wrong.

https://tidyverse.tidyverse.org/articles/manifesto.html

Solving marking at scale!

Daisy Christodoulou — Sat, 28 Feb 2026 08:45:44 GMT

In March last year, we presented a major breakthrough in our AI assessment model. We were able to use a blend of human and AI judgement to reliably and efficiently assess student writing.

Nearly a year on, where are we? What else have we learned and what’s next?

What we’ve done so far

Standardised writing assessment at scale: The model we developed in early 2025 has proven itself at scale. We’ve now used it to assess nearly half a million pieces of student writing, from students aged 5 to 16. Most of these are from schools in England, some are from the US, and this month we’re running our first AI-enhanced assessment in Australia & New Zealand. We’ve been able to validate our model in various different ways too.
Assessing other subjects at scale using AI: we’ve run our first national history assessment, which had different challenges to writing but still worked well.
Improving AI rubrics: the way you prompt the AI is probably not as important as everyone thinks - but nevertheless, we have gained a lot of insights about what makes for the best rubric to give the AI.
Bespoke AI tasks for individual schools: as well as our big nationally-standardised assessments, schools can also use all the AI judging and feedback features for their own assessments on their own timeline. These will not be standardised, but in a big school you can use a mix of statistics and human judgements to set your own grade boundaries. There’s a case study here.
Better feedback: We have an audio feedback system that allows teachers to provide audio comments on every piece of writing which are then transcribed and polished by the AI. We really like this system - but so far it seems our schools prefer the direct AI feedback which is generated automatically. As well as a written comment, the AI can now also generate personalised quizzes.

What we’re doing next

Improving handwriting recognition: we have written a lot about how uncannily accurate the AI judging is. We have not yet encountered a major human-AI disagreement where we think the AI has made the wrong decision. However, there are still issues with the step before the judging, where the AI transcribes the student handwriting. The AI does sometimes improve the writing when it transcribes, imposing sense and meaning that are not in the original piece. This can then lead to the wrong judgement being made, but the source of the problem is the AI transcription, not the AI judging. We have a system to catch and correct these errors, and we are working on developing better open-sourced handwriting models.
Using AI to create rubrics, not just use them: typically, we give the AI criteria and it uses them to judge. We are working on an RE project where we will get the AI to create criteria after it judges - e.g., to tell us what the typical features of the best and weakest writing are.
GCSE / multi-question assessment: so far, most of our big projects have involved assessments of just one piece of writing. GCSEs pose a more difficult logistical challenge, because you have to combine several questions and several marks. The AI is still good at judging these; we just have to find ways to make it simpler to pull together all the marks from all the questions and apply a grade.

Is AI going to change the world?

As we have been working hard on all of the above, there has been a wider debate going on about the extent to which AI is going to change / destroy the wider economy.

We set up this Substack to detail our journey through AI, and if you go back to 2023-24 you can see that we were much more sceptical than we are now.

For us, the thing that has made the biggest difference is not necessarily the improvement in quality of the cutting-edge models - but the dramatic reduction in cost of the standard models. This has allowed us to “over-assess” - to send writing off to be judged multiple times, which helps us weed out inconsistent and biased judgements.
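The idea of “over-assessing” can be sketched in a few lines of code. This is only an illustration of the general principle, not a description of our production system: the function name, the script labels and the 0.75 agreement threshold are all assumptions made for the example. It takes the repeated judgements for one pair of scripts, picks the majority winner, and flags the pair for review if the repeats disagree too often.

```python
from collections import Counter


def aggregate_judgements(judgements, min_agreement=0.75):
    """Combine repeated judgements of the same pair of scripts.

    judgements: list of winners, one per repeat, e.g. ["A", "A", "B", "A"].
    min_agreement: hypothetical threshold below which we flag the pair.
    Returns (majority_winner, agreement, needs_review).
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    counts = Counter(judgements)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(judgements)
    return winner, agreement, agreement < min_agreement


# Illustrative data only: two pairs of scripts, each judged four times.
repeats = {
    ("script_1", "script_2"): ["script_1"] * 4,
    ("script_3", "script_4"): ["script_3", "script_4", "script_4", "script_3"],
}
for pair, winners in repeats.items():
    print(pair, aggregate_judgements(winners))
```

When each judgement is expensive, you can only afford one per pair and an inconsistent judgement goes unnoticed. When judgements are cheap, repeats like these make the inconsistency visible.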

A lot of our theoretical concerns about LLMs still exist - the hallucinations, the probabilistic decision-making, the challenges with getting them to work reliably at scale. But we have found ways around most of these problems, such that in practice, our model is very useful!

Even now, it is easy for us to get bogged down in the details of the fractions of a percent that aren’t working right - and that is the right thing for us to do, because a fraction of one percent at scale is still a big number.

But it is also important to sometimes step back and take a look at where we are. And when I do that, I keep coming back to the same thought: if this system had arrived on my desk halfway through my teacher training year in 2007-08, I would have thought it was unbelievably brilliant and it would have dramatically changed my life - and my students’ - for the better.

Make your own mind up

Our next intro webinar is in April. These webinars are very popular - we show you how the system works and at the end we give 30 free credits to all attendees, so you can try it yourself on a class set of essays in any subject.

If you work in a school, you can also book a 30-minute call with me here where I can get you set up on our system with 30 free credits.

Is it possible to develop a tutor-proof test?

Daisy Christodoulou — Sat, 21 Feb 2026 08:45:43 GMT

At No More Marking, most of the assessments we provide are fairly low-stakes. However, we do have experience with high-stakes tests, and we know how challenging they are to design.

If you are using a test as a selection mechanism for a prestigious institution, you will have armies of very smart parents and well-paid tutors trying to crack the code of the test.

Over the past decade or so, a couple of phrases have cropped up to describe the way these selection tests should work. First, people argue that we should have “tutor-proof tests” that cannot be cracked by the parents and tutors. Second, we should have “tests worth teaching to”, so that if students are being prepped for the test, the prep is worthwhile.

Do these two concepts hold water? In this post, we’ll examine the idea of tutor-proof tests.

Some historical background

Historically, many famous English public schools selected pupils at age 13 using the Common Entrance exam.

Common Entrance exams are linked to a defined curriculum. The advantage of this is great coherence and clarity for students and teachers at the prep schools and public schools. The disadvantage is it probably restricts the pool of students who can apply to the public schools.

Not all independent schools operated on this model. I went to a selective secondary school, City of London School for Girls, which used a more curriculum-neutral test consisting of a reading comprehension, writing task and maths paper. I hadn’t attended a private prep school or had a private tutor, but the test resembled a lot of what I had done at my state primary school, so I was not at a massive disadvantage compared to others. Had CLSG run Common Entrance, it’s unlikely I would have even applied, let alone got in.

However, whilst the test I sat was more curriculum-neutral than Common Entrance, it was not completely curriculum-neutral, and nor was it immune to tutoring and preparation. In the last decade or so, even this kind of maths, reading and writing assessment has been criticised for excluding talented but disadvantaged students who don’t have access to good schools and tutors.

The tutor-proof test

Is it possible to design a test so content-free that it captures something like raw potential, or the underlying ability to flourish in an academic environment? Verbal reasoning tests reward vocabulary knowledge, which can be taught. Numerical reasoning tests reward maths knowledge, which can also be taught. But what about non-verbal reasoning tests? These are the kinds of tests where you are given four shapes and then asked: which shape continues the sequence?

You can see how these tests are less tied to curriculum knowledge, and there is serious research in this area suggesting that they might therefore be useful for identifying talented but disadvantaged students. David Card is a Nobel laureate who has done research showing that a non-verbal test administered at second grade in a district in Florida “led to large increases in the fractions of economically disadvantaged and minority students placed in gifted programs.” Jonathan Wai is another researcher who has done a lot of interesting work on these types of questions, and who has also been involved with talent identification programmes.

In large-scale government-run school systems with lots of disadvantaged students, non-verbal assessments can help identify students who are able but poorly served by their schooling.

But there are big differences between low-stakes talent-identification across a government school system and high-stakes entry to prestigious selective schools. When an expensive tutor hears the phrase “tutor-proof test”, he interprets it not as a warning but as a challenge.

Practice effects

There is a huge literature on “practice effects”, which essentially shows that if you practice a specific skill, you will get better at that specific skill. If you practice touch typing every day, you will get better at it. If you practice your multiplication tables every day, you’ll get better at them. If you practice tying your shoelaces every day, you’ll get better at it.

The practice effect is one of the most robust findings in cognitive psychology, and poses an enormous challenge to the idea of the tutor-proof test.

The response of test developers to this challenge is to say that they can create enough new question types that practice on past question types won’t deliver huge gains.

That is, they’ll say that you can practice tying your shoelaces, but then the test will be on a different kind of knot, so you won’t have any advantage. From a cognitive science point of view, this is a tricky one. It is true that the practice effect holds for practice of a specific skill. It is also true that transfer to different contexts is hard, and that so-called “far transfer” is exceptionally difficult. So yes, the test developers are right to say that the more novel the question type, the less valuable the practice of old question types is.

But “less valuable” is not the same as “not valuable at all”. And whilst far transfer is extremely difficult, near transfer is more possible. Even if practice of old question types gives you quite small gains, in a high-stakes environment those small gains can be the difference between success and failure.

Also, to make this system work, you require test developers to constantly create new types of question that are as different as possible from what has gone before. This poses a number of difficult technical challenges.

First, there are obvious constraints to just how many new types of short non-verbal test questions it is possible to create. If you are running three test sessions a year, after ten years you will need to have come up with thirty different types of question. There are limits to how many ways you can vary the essential concept of looking at a 2D shape and moving it around in some way.

Second, if you really are creating very new questions for each round of tests, then you need to run a new validation process each time. Good validation processes take time: ideally you want to wait a few years and gather information on whether the students who passed that test are thriving at their new school. But if you are constantly having to create new question types, you don’t have the time for that.

Third, even if your system works for the first few years, there is no guarantee it will keep working over time as tutors learn more about it and optimise their teaching. This is a classic Goodhart’s Law problem: when a measure becomes a target, it loses value as a measure.

We see numerous examples of this in our work and research. A really famous one is that early AI essay markers delivered pretty good levels of agreement with human markers, and seemed to have solved the problem of AI marking. However, on closer investigation it turned out that they were largely just rewarding the length of the essay. In a low-stakes environment, it is possible that this wouldn’t cause too many problems. But in a high-stakes assessment where students, teachers and parents are all striving to do as well as they can, the system will break down, because students will realise that the way to succeed is to write the same sentence a couple of hundred times.

Likewise, it is possible that tutors find ways of teaching tips and tricks that help students answer the non-verbal questions, but that systematically break the link between the question and what it is supposed to be measuring.

What is the impact on students?

The extensive literature on the practice effect shows it delivers substantial gains. But there is a chance that even the substantial gains reported in the literature underestimate its effect, because most of the research is lab-based, and may not properly account for the scale and effect of real-world intensive practice in some environments. Tutoring for entrance exams is taken very seriously by a lot of very smart people, and it is big business.

Many students will be preparing for their entrance exam 18 months or 2 years in advance, and will be doing several hours of practice every week. The question is, would you rather that prep is on shape rotation? Or would you rather students were reading interesting books and doing maths problems?

It’s also worth remembering that the original impulse for introducing tests like this was the social justice aspect – that schools wanted to find a way of identifying talented but disadvantaged students. But once a non-verbal test becomes a target, it is going to discriminate against those students too, as you are much less likely to get any practice of those tests in a typical state school – whereas you will be taught reading, writing and maths. The worst-case outcome is that the non-verbal test is as socially exclusionary as Common Entrance, just with none of its educational benefits.

When you stop and think about it, the concept of the tutor-proof test does not really hold water. Of course you get better at something if you practice it. That is a good thing, and that is why education works! The whole point of education is to practice valuable things and get better at the valuable things. A good assessment should promote practice of the valuable things. It shouldn’t remove the valuable things and replace them with less valuable things, on the grounds that some students will get more practice of the valuable things.

Which brings us to another popular concept: we should create “tests that are worth teaching to”. Is this a better guide to assessment design? We’ll discuss that in a future post.

Does a university education help you earn more?

Daisy Christodoulou — Sun, 15 Feb 2026 08:45:52 GMT

Over the last couple of weeks in the UK, there’s been a lot of controversy about student loans. Graduates have been posting their student loan balance statements on social media showing some punitive interest rates.

That has spilled over into a wider debate about the economic value of university itself. If it is so expensive, is it worth it economically?

What is the graduate premium?

The “graduate premium” refers to the fact that university graduates earn more than non-graduates, and is often used as a reason why a) students should pay for the costs of their university education and b) we should try to get as many school-leavers as possible to go to university.

A popular critique of this argument is that whilst graduates earn more on average, the average is misleading: a small number of degrees command large premiums, whereas some degrees offer no premium at all. This critique is summed up well by Fraser Nelson, who argues that as a result the government should release national data on earnings by subject and institution so school-leavers can make more informed choices about where to study. For example, Glasgow history grads earn only 30k five years after graduating, whereas LSE history grads earn 50k. LSE therefore offers “absurdly good value” and should probably put its fees up.

However, even this more nuanced take on the graduate premium misses something crucial. Discussion about the graduate premium assumes that it is caused by going to university. The implicit reasoning is that you go to university, you acquire knowledge and skills you wouldn’t have got otherwise, and these make you a more productive worker who can therefore command a higher salary.

But graduate income data does not prove this causal chain. Yes, it is true that people who go to university earn more, on average, than people who don’t. But people who wear Rolex watches earn more, on average, than people who don’t. No-one is proposing a “Rolex Premium” whereby every school-leaver is encouraged to take out an expensive loan to fund the purchase of a Rolex watch, on the grounds that it will lead to them having much greater lifetime earnings.

Likewise, the way to critique that argument is not to say “Well yes, the Rolex Glasgow model doesn’t help you earn much, but the Rolex LSE model leads to really high future earnings, so we should encourage school-leavers to buy the Rolex LSE - and in fact, Rolex should charge even more for it because it will pay for itself over time!”

To prove the graduate premium is more than just a Rolex premium, we need some causal evidence that it is caused by skills acquired at university.
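The Rolex point can be illustrated with a toy simulation. Everything in it is made up for the sake of the argument: the ability scale, the salary figures, the selection rule and the function name are assumptions, not real data. It builds a world where earnings depend only on prior ability and universities simply admit the more able half, then measures the resulting “graduate premium”.

```python
import random
import statistics


def simulate_premium(n=20000, causal_boost=0.0, seed=0):
    """Return mean graduate earnings minus mean non-graduate earnings.

    causal_boost is the amount university itself adds to earnings.
    With causal_boost=0, any premium comes purely from selection.
    All numbers here are hypothetical.
    """
    rng = random.Random(seed)
    grads, non_grads = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        # Earnings driven by prior ability plus noise, not by university.
        earnings = 30000 + 5000 * ability + rng.gauss(0, 2000)
        if ability > 0:  # universities select on prior attainment
            grads.append(earnings + causal_boost)
        else:
            non_grads.append(earnings)
    return statistics.mean(grads) - statistics.mean(non_grads)


print("Premium with no causal effect:", round(simulate_premium()))
print("Premium with a 3,000 causal effect:", round(simulate_premium(causal_boost=3000)))
```

In this toy world a sizeable premium shows up even when university adds nothing at all, and the raw earnings gap on its own cannot tell you how much of it is causal. That is why the earnings data needs to be supplemented with causal evidence.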

Human capital vs signalling

There is an extensive academic debate about this: the human capital vs signalling debate.

The human capital side says that the graduate premium is caused by the knowledge and skills universities impart.
The signalling side says that the graduate premium is caused by the signal that is sent by a degree. Universities select their students based on prior attainment. Employers use degrees as a cheap (for them!) way to select employees who are already smart. They are not that bothered about what the student learns at university.

This debate obviously has enormous implications for public policy.

If it turns out the returns to a degree are mostly due to human capital, then we should definitely be aiming to get 50% of school-leavers to university, and arguably we should be aiming for an even bigger proportion. If university really does reliably impart skills and knowledge that reliably increase your lifetime earnings, then expanding access is economically rational.
If it turns out the returns to a degree are mostly due to signalling, then we are wasting gigantic sums of private and public money. We could essentially replace degrees with some kind of basic test taken at age 18 and that would provide employers with what they are currently getting with hugely expensive three-year degree courses.

So if there is a huge debate, what is the consensus about which side is right? What does the data say?

It’s a difficult question to answer because degrees are used as a filter for a lot of well-paid jobs. One way you could research this question is to compare two cohorts of school-leavers with exactly the same A-levels and prior attainment. One cohort goes to university; one doesn’t. If the university grads do better in the job market, that suggests the university imparts valuable skills and knowledge, and therefore is evidence for the human capital theory.

But of course, that won’t work, because lots of jobs are restricted to graduates. Maybe the non-grads would have done perfectly well at them, but they never get a chance.

Another obvious way you could measure the impact of university is to directly measure the skills and knowledge it imparts by assessing students and seeing what they have learned. This is what happens at school, and this is one of the reasons why we have good evidence that schools do succeed at teaching skills that are valuable in the job market, and that some schools are more successful than others. But most academic degree courses don’t feature any kind of nationally-standardised assessment that could be used for this purpose.

As a result, a lot of the research in this area is hugely complex and ultimately quite inconclusive.

That in itself is quite striking though – given how embedded the human capital theory is, and how much it governs the public debate and policymaking in this area, you would expect there to be some quite solid evidence in its favour!

Assessment data is limited, but it is less limited than earnings data!

A lot of criticisms are made of educational assessment – some of them justified. Can assessment truly capture everything we value about education? Does it distort the thing it is trying to measure? Does it lead to the things that can’t be assessed being neglected? We write about a lot of these themes on this Substack!

Still, even if you are sceptical of assessment, you have surely got to admit that even a basic standardised assessment is a better way of measuring the impact of universities than later earnings.

Earnings data doesn’t capture the value of learning for its own sake. It undervalues low-paid but socially vital jobs. It is often not a measure of productivity because a lot of public sector salaries are set by government. Similarly, it doesn’t account for regional differences in pay (this is probably a significant factor in why Glasgow history grads earn less than LSE ones). It might also lead universities to make bad decisions about which courses to offer or which type of students to recruit.

There is an analogy with health targets, which have their flaws but still tell you something useful. How would you rather measure the success of a hospital – by its infection rate, by the proportion of patients treated at A & E within four hours, by the numbers of beds in corridors? Or by how much its patients earned five years after being treated there?

Standardised assessments at university

Andreas Schleicher is the Director for Education and Skills at the OECD, who run PISA, the international school assessment. He has pointed out that “there was a time when people looked to universities to judge the quality of education. Today, it is the other way around: the public want better information on the quality of universities.”

Given the scale of public subsidy and private debt involved, why not make some form of standardised assessment compulsory for all degrees? This could take a variety of different formats: maybe certain subjects all have to have a couple of shared “national” exam modules covering the content that every university will teach. Or maybe students in every essay-based subject have to take a compulsory writing module and exam – which might also help assuage concerns about the impact of AI on the traditional take-home essay.

You could even assess it with Comparative Judgement…

AI assessments for KS3 English Literature: a case study

Daisy Christodoulou — Sat, 07 Feb 2026 08:45:13 GMT

Recently, I spoke with Phill Chater from Landau Forte Academy about how he is using our AI-enhanced Comparative Judgement system to assess his school’s Key Stage 3 English Literature essays.

Here’s a summary of his approach, organised by the three features we use to evaluate all our assessments: reliability, efficiency and validity.

Reliability

Phill set up his English Literature assessments as custom tasks that are bespoke to his school, which means they are not nationally standardised. Our AI-enhanced Comparative Judgement system gives you a scaled score, and then you can apply the grade boundaries on top. Before relying on those scores, Phill wanted evidence that they were consistent, so he got the AI to judge each year group twice, to see if it came up with the same results each time.

This is a very sensible approach to take. If the AI came back with a completely different rank order of students each time, you would have very little faith in its outputs. Interestingly, you can do similar checks on human marking, and the results are often quite underwhelming. Ofqual’s marking reliability studies in 2017 found English Literature had the worst marking reliability of any subject, with candidates only likely to get their true grade 52% of the time compared to Maths where the probability is 96%. [1] If we built an AI marking model with such poor consistency, we would not let it see the light of day!

However, when you are assessing essays, you also probably don’t want there to be perfect consistency between one iteration of marking and another. This would suggest that your markers – whether AI or human – are too deterministic, and are judging on surface features which make the model easy to game. For example, some of the very earliest AI marking models delivered incredibly good agreement between iterations, but on closer investigation this was because they were largely judging on surface features like length.

One of the advantages that LLMs have over these older models – and why it is worth persisting with them despite their other flaws – is that they are not making completely deterministic judgements. A major focus of our research has been building an LLM-powered model that gets this balance right, and validating its outputs. You can read the extensive work we’ve done on this here, here, here and here.

We were extremely encouraged by the results of Phill’s assessment: the correlations between each iteration of the AI judging ranged from 0.91 to 0.96, which feels about right. Too low would suggest issues with the transcriptions, hallucinations, order bias and the world of other woes we see with LLMs. Too high and we’ve built a model that is likely over-deterministic and consistently wrong! There is always further work you can do on validation, and we will update this Substack with more data when we have it.
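For anyone who wants to run the same kind of check on their own data, here is a minimal sketch of the calculation. It assumes you have the scaled scores from two judging runs, listed in the same student order. The function names, the example scores and the 0.90 and 0.99 cut-offs are illustrative assumptions, not official thresholds of ours.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def consistency_verdict(r, low=0.90, high=0.99):
    """Label a run-to-run correlation; the cut-offs are assumptions."""
    if r < low:
        return "too low: check transcription, hallucinations, order bias"
    if r > high:
        return "suspiciously high: may be judging surface features"
    return "plausible"


# Made-up scaled scores for six students from two separate AI runs.
run_1 = [412, 455, 478, 501, 530, 566]
run_2 = [420, 449, 485, 495, 541, 560]
r = pearson(run_1, run_2)
print(round(r, 3), consistency_verdict(r))
```

Pearson correlation compares the scaled scores themselves. If you care more about whether the rank order of students is preserved, a rank correlation would be a reasonable alternative.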

The other reassuring aspect of this assessment is that it was on literature. Most of our national AI assessments so far have been assessments of writing. AI can be quite brittle, so there is no guarantee that if it works for writing it will work for literature. Assessments that require specific content knowledge pose an extra challenge, so it was good for us to learn that the scores were sensible and in line with expectations. Phill created a fairly holistic mark scheme to guide the AI, broadly in line with the advice we give here about not being too prescriptive.

Efficiency

This is obviously one of the major benefits of adding AI judges. It can whiz through assessments very quickly, and it did so in this case. Phill made a particularly telling point about the speed of assessment. Previously, English teachers would typically be marking right up to the deadline for big assessments, while the maths department would often finish earlier in the window. This time, for the first time Phill could recollect, the English teachers finished before the maths teachers—something that, in my experience, is unheard of.

But obviously, every solution contains within it the seeds of a new problem. One of the fears people have about English teachers spending less time marking is that they will not understand their students as well. But Phill’s comparison with maths assessment is instructive. Maths teachers spend less time marking, but they understand their students just as well. It’s just that the nature of maths marking is such that it can deliver equivalent levels of understanding in less time. There has to be a way that we could imagine this working for English: teachers will spend less time than they do currently on marking, but will get equivalent levels of understanding.

For me, there is definitely value in teachers reading students’ writing, but there is less value in the time spent painstakingly writing out comments by hand. Our feedback systems are designed to maintain the high-value thought processes and eliminate or reduce the lower-value ones. However, we are also aware that teachers and students will use our feedback in different ways, and we want to learn more about what is most effective. Which brings us to the final section…

Validity & Feedback

Efficiency can enable better feedback in two ways. First, Phill said the faster turnaround time enabled them to dedicate an entire lesson to feedback. Second, the quicker you get the feedback the more relevant it is.

The first part of the feedback lesson had students working with a model essay that had been selected previously, pre-AI, so this portion wasn’t dependent on the AI at all.

In the second part of the lesson, teachers gave students the direct AI feedback and asked them to summarise their areas for improvement based on the AI and the model essay.

Phill felt the feedback was “uncannily accurate”, but he did identify a couple of areas where we could improve our feedback, and we have some ideas about how to address them. The direct AI feedback is currently a bit too verbose and maybe a bit too harsh too. We are working on making it nicer!

The other challenge is making the feedback actionable. This is tricky because the more specific you get, the more risk there is of AI hallucination and errors. We’re currently developing an approach that Phill hasn’t been able to use yet, but that we hope to roll out for everyone soon: getting the AI to create personalised multiple-choice questions for each student.

Conclusion

This is all new territory. Phill is a pioneer who is applying new technology to existing practice in exciting ways, and both his practice and ours will continue to evolve. But there’s enough here to suggest that significant reductions in workload and improvements in feedback are possible.

If you’d like to try out an AI-enhanced Comparative Judgement assessment, join our webinar on Wednesday 25th February where we will give all attendees 30 free AI assessment credits.
If you have an idea for a case study, let us know here.

[1] The true grade is a theoretical concept that estimates what a candidate would achieve if they took an assessment an infinite number of times. Of course it doesn’t measure if the assessment or the marking are aligned with the curriculum or mark scheme, which is why we tend to favour reporting broader measures of validity such as correlations between assessments over time. Nonetheless, without high reliability there can be no validity.

What would Mr Toad make of school phone bans?

Daisy Christodoulou — Sun, 01 Feb 2026 08:45:20 GMT

In the last 12 months or so, there has been a rapid, almost palpable, change in attitudes to children and technology. A number of anti-phone pressure groups like Smartphone Free Childhood have sprung up, while many countries are starting to legislate for various kinds of phone bans: Australia banned under-16s from social media in December 2025, France is moving to ban social media for under-15s, Denmark announced plans in November 2025 to ban under-15s, Norway raised its age limit from 13 to 15, and Malaysia announced a ban for under-16s that is due to come in July 2026. In the UK, the House of Lords voted in January 2026 for an amendment to ban under-16s from social media, and the government has launched a consultation on the issue.

I am supportive of these moves, but I have also been somewhat surprised by the speed of change. I’ve been consistently anti-phones in the classroom for well over a decade now, and I’ve become used to having polite disagreements with people on the other side of the debate—which, until recently, was most people.

Over the last few months, I’ve visited quite a few schools and have been astonished to find that there was no argument to be had. I would say the thing I have said for ten years, brace myself for the usual objections, and instead I’d hear “yes you’re totally right. We’ve had a phone ban for x months and I can’t believe how well it’s going.”

How social change happens

In the past, I have compared attitudes to mobile phones in the classroom to attitudes to cigarettes.

I am constantly drawn to this analogy because the shift in attitudes to smoking occurred as I was in my late teens and early twenties and was probably the first time I realised that the social norms of my childhood were not permanent fixtures.

In the mid-90s, smoking in public places was normal and commonplace. I remember the first time someone suggested you might ban smoking in pubs, and it felt as crazy as suggesting you might ban drinking in pubs. People went to pubs to smoke! That was the point! But within a decade, a public smoking ban was in place and we were all wondering how we put up with smoky clothes for so long.

But phones aren’t like cigarettes

However, smoking is not the ideal analogy here, for a couple of reasons. Cigarettes have very few upsides, whereas mobile phones have lots. Cigarettes are also not that central to society, whereas if you got rid of all mobile phones in the world tomorrow, society and the economy would grind to a halt.

A better analogy—but an older one, which no one today has a memory of—is the invention of the automobile. Like phones, cars very quickly established themselves as vital and irreplaceable. They also had terrible downsides, and precipitated a culture war which in some ways is still smouldering today. If we go back to that moment in time, we can learn a lot about the way forward for the use of technology in education.

A literary-historical interlude: cars in the early 20th century imagination

The light fiction of the early 20th century is littered with the impact of the automobile.

One of the funniest and most famous is Mr Toad, from Kenneth Grahame’s The Wind in the Willows (1908). After encountering his first car on a sleepy country lane, he is entranced and can think of nothing else.

They found him in a sort of a trance, a happy smile on his face, his eyes still fixed on the dusty wake of their destroyer. At intervals he was still heard to murmur “Poop-poop!”

“Glorious, stirring sight!” murmured Toad. “The poetry of motion! The real way to travel! The only way to travel! Here to-day—in next week to-morrow! Villages skipped, towns and cities jumped—always somebody else’s horizon! O bliss! O poop-poop! O my! O my!”

Before long, he steals a car and goes on a joyride.

He increased his pace, and as the car devoured the street and leapt forth on the high road through the open country, he was only conscious that he was Toad once more, Toad at his best and highest, Toad the terror, the traffic-queller, the Lord of the lone trail, before whom all must give way or be smitten into nothingness and everlasting night. He chanted as he flew, and the car responded with sonorous drone; the miles were eaten up under him as he sped he knew not whither, fulfilling his instincts, living his hour, reckless of what might come to him.

If Mr Toad were alive today, he’d be running the MrToadLambo YouTube account, full of viral livestreams of him in car chases on the M25. He’d have a memecoin called PoopPoop and on X, he’d complain about the “legacy mindset” of speed limits.

The Mr Toad-style roadhog was not an isolated figure. Thirty-one years later, Agatha Christie wrote one of the best-selling books of all time: And Then There Were None. The premise of the book is that some people have committed acts which are not legally crimes, but which are morally criminal and therefore deserve punishment. One of them is a young man, Anthony Marston, who has killed two young siblings while driving recklessly:

‘Of course it was a pure accident. They rushed out of some cottage or other. I had my licence suspended for a year. Beastly nuisance.’

Dr Armstrong said warmly: ‘This speeding’s all wrong—all wrong! Young men like you are a danger to the community.’

Anthony shrugged his shoulders. He said: ‘Speed’s come to stay. English roads are hopeless, of course. Can’t get up a decent pace on them.’

The book was published in the very early months of World War II, and I think there is an obvious political undercurrent. The Nazis were obsessed with youth, speed, and technological progress, and Hitler had made new roads and new cars symbols of his regime.

You can also see clear parallels with debates about social media and mobile phones today. The pro-car lobby, which disproportionately consisted of young men, felt that their opponents were creating a moral panic that turned commonplace everyday accidents into existential threats. The anti-car lobby was more middle-aged and female, and they thought their opponents were proto-fascists intent on destroying the lives of poor children.

In Christie’s autobiography, she wrote about her own experience of car ownership. She bought her first car at a time when there was no driving test. She could barely drive when, in 1926, she had to drive her husband to work because of the General Strike. She made it back from Hounslow to Sunningdale (about 15 miles!!) just about in one piece, but a neighbour who saw her parking up said ‘I saw the first floor driving back this morning. I don’t think she has ever driven a car before. She drove into that garage absolutely shaking and as white as a sheet. I thought she was going to ram the wall, but she just didn’t!’

But Christie also goes on to say that once she had learnt to drive, the experience gave her enormous pleasure:

Oh the joy that car was to me! I don’t suppose anyone nowadays could believe the difference it made to one’s life. To be able to go anywhere you chose; to places beyond the reach of your legs—it widened your whole horizon.1

We are now fully in the doomscrolling era of the internet, but it is worth remembering that it was just as horizon-expanding and liberating in its early days as the car. And, similarly, as much as traffic and ragebait might annoy me, I do not want to live in a world without cars or mobile phones. They are both vital parts of the modern world. The task is to make them work to serve our aims.

With that in mind, here are five lessons we can learn from the early automobile debate.

1. Social norms matter just as much as legislation

One of the fascinating things about And Then There Were None is its focus on acts that were legal but frowned upon, acts where the social norm was in the process of shifting. It is astonishing for us to read it now and realise that the punishment for killing two children while speeding was just a year-long driving ban. But it is very hard for governments to legislate when the social norm is against it. Laws cannot get too far in front of, or too far behind, public opinion. If you had tried to implement a smoking ban in the 1950s, you would probably have had mass civil disobedience. Likewise, whilst I’ve been in favour of school-level phone bans for a while, I’ve recognised that until recently it would have been exceptionally hard for a government to legislate for one, because not enough parents, students and teachers thought it was necessary, and you’d have had mass evasion of the law.

I think the time is right now for legislation. And the reason why we need a ban, and we can’t just depend on social norms changing, is that this is a clear example of a collective action problem: teenagers and their parents tell us that they would like to use their phones less, or give up social media, but they don’t want to be the only ones!

These kinds of co-ordination problems are the places where there is a strong case for government intervention, and where their intervention adds value over and above self-regulation.

2. Some kinds of regulation are fundamental and inevitable

Driving tests were one of the least controversial aspects of early automobile regulation. It’s arguable that in the modern state, the state monopoly of driving testing & licensing is one of its most fundamental functions (which is just one of the reasons why the breakdown of the UK government’s testing system is a really big problem).

Similarly, another pretty fundamental and uncontroversial aspect of the modern state is its enforcement of age norms. You can argue about what age they should kick in, and where they should apply, but pretty much every state in the world provides children with special protections and restrictions. We frequently restrict childhood liberty, to the extent that most serious classical liberal and libertarian philosophers spend a lot of time thinking about why this is. JS Mill’s On Liberty is a great example - a book about liberty that spends large chunks discussing the education of the young.

3. Early regulation can get it wrong

Not all regulations are good regulations. The Red Flag Act of 1865 required early automobiles to be preceded by a man on foot carrying a red flag.

I can think of a lot of current internet regulations that are not working brilliantly. Cookie consent warnings seem to be security theatre that causes a lot of hassle but doesn’t really address the big problems.

The Online Safety Act is a major piece of UK legislation that aims to protect children from the downsides of the internet. Critics say it is poorly drafted and will have a lot of negative unintended consequences. We will soon see who is right.

4. You can be pro and anti technology

Modern Germany has an extensive motorway network with no speed limits. It also has medieval town centres that are car-free. These are not contradictory. Likewise, it is possible to believe that schools should make use of a lot more technology in a lot of ways, whilst remaining largely screen-free for students.

5. Technology can mitigate technology

New technology will always cause problems. Most of the time, instead of getting rid of the technology, we prefer to use more and different technology to mitigate the problem. Seatbelts, airbags, anti-lock brakes and sat-nav are all examples of technology that’s designed to mitigate the negative impacts of cars.

I think this approach is the right one for education too. One technology we’re excited about at No More Marking is handwriting recognition. Before LLMs came along, handwriting recognition was a stubborn and seemingly intractable problem. LLMs have largely - although not completely - solved it. It is now possible to get instant and mostly accurate transcriptions of student writing, which in turn makes screen-free classrooms much more viable. Off-the-shelf LLMs are still not perfect though, and there is room for improvement, which is why we are working on optimising an open-source LLM to recognise handwriting with an even higher degree of accuracy.

And Then There Were Norms

The other big lesson from the car debate is that some of these debates never go away and are never truly resolved. Cars and phones are fundamental to modern society, and anything so fundamental will inevitably provoke conflict. The norms might change, but the arguments will remain.

Christie was not the only mega bestselling author of the 20th century who had trouble with cars. JRR Tolkien bought a car at around the same time as Christie and also seems to have struggled to learn to drive. He also wrote a book inspired by his misadventures, called Mr Bliss, but although it was written in 1932, it wasn’t published until 1982. Unlike Christie, he gave up driving at the start of the war and deplored the impact automobiles had on the Oxfordshire countryside.

How to write a good rubric (for humans and AI)

Daisy Christodoulou — Sat, 24 Jan 2026 08:45:23 GMT

Last term, we ran a national history assessment on the topic of the Battle of Hastings. The essays were judged by a mix of human and AI judges, and we saw pretty good levels of agreement between the humans and AI, and barely any glaring AI errors.

Still, that does not stop our teachers - and us - asking a set of questions about how the AI makes its decisions. What does it value? Does it value historical accuracy? Does it notice when claims are false? Does it base its judgements on fluency of writing or quality of historical analysis? What weight does it give to the various aspects of a good essay? And - a question we are getting more and more - what kind of rubric or guidance should we give the AI to help it make the best decisions?

To explore this, we ran a small experiment.

We selected a sample of the c. 4,000 essays and got an AI to isolate every truth claim in every essay and then to assess whether each claim was true or not.

This is not as simple as it sounds, and an essay about the Battle of Hastings is probably not the best test case for this approach, because there is a lot about it that is genuinely uncertain: did Harold really die after being hit in the eye by an arrow? Exactly how long did it take Harold and his men to march from Stamford Bridge to Hastings?

Still, there are plenty of known facts about the Battle, and generally speaking the AI was good at spotting these and assigning a truth score for each essay. We reviewed these “truth scores” and felt they were broadly correct.

Next, we looked at whether these truth scores correlated with the scaled score given to each essay. We expected a weak positive correlation. What we actually found was a negative correlation.

In other words, essays that contained more false statements were, on average, getting better scores!
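To make the check concrete, here is a minimal sketch of how a per-essay truth score could be correlated with scaled scores. It assumes the truth score is simply the fraction of an essay's checked claims that were judged true, which is an illustrative definition rather than a description of our pipeline. The essays, verdicts and scaled scores are invented toy values, not results from the assessment, and the function names are hypothetical.

```python
from math import sqrt


def truth_score(claim_verdicts):
    """Fraction of checked claims judged true (assumed definition)."""
    return sum(claim_verdicts) / len(claim_verdicts)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data: longer, more ambitious essays make more claims,
# so they have more chances to get one wrong.
essays = [
    {"verdicts": [True, True, True], "scaled_score": 480},
    {"verdicts": [True, True, True, False], "scaled_score": 520},
    {"verdicts": [True, True, False, True, True, False], "scaled_score": 560},
    {"verdicts": [True] * 7 + [False] * 3, "scaled_score": 600},
]

truth_scores = [truth_score(e["verdicts"]) for e in essays]
scaled_scores = [e["scaled_score"] for e in essays]
r = pearson(truth_scores, scaled_scores)
print(f"correlation between truth score and scaled score: {r:.2f}")
```

In this toy set the short essay with no errors has the highest truth score and the lowest scaled score, so the coefficient comes out negative. That is the same shape of result as the one described above, produced here purely by construction.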

What on earth is going on?!

This is not about AI

We don’t think this problem is an AI problem. We’ve seen something similar in the past with writing rubrics long before we used AI in our assessments. (In fact, I wrote an article about a similar issue with writing assessments almost ten years ago, before I started working at No More Marking, and before chatbots existed!)

Essentially, when you give pupils an extended writing task, the more they write and the more ambitious they are, the more chances there are for them to make errors. The very best and most creative responses can therefore have more errors than weaker responses.

Here is a great example from the history assessment.

Script A is the first part of one of the highest-scoring essays. Script B is the entirety of one of the lowest-scoring essays.

Script A has a straightforward factual error in its first sentence: the Battle of Hastings was not fought on the 14th February 1066 but the 14th October. The second paragraph also has some factual issues: it says “Harold Godwinson’s army was mostly dead, wounded or incredibly tired from the battle and the journey, making their army much smaller and easier to defeat”. This is all a bit more arguable, but “mostly” is probably too strong here, and “much smaller” depends on what your reference point is: it was probably smaller than it would have been if there had been no Battle of Stamford Bridge, but it also probably wasn’t much smaller than William’s army. These were all flagged up by our AI truth checker.

Script B has no factual errors at all. Our AI truth checker judged it all to be correct, apart from the final unfinished sentence which was “unverifiable”.

However, when the scripts were assessed as part of our national history project, both the teachers and the AI thought that Script A was better than Script B. I agree and I cannot believe that any history teacher would seriously argue that B is better than A.

Cardinal Richelieu and Jose Mourinho

There’s a famous line alleged to have been said by Cardinal Richelieu: “Give me six lines written by the hand of the most honest of men, and I will find something in them to hang him.”

Is that the message we want to send to our children?

The football manager José Mourinho has a touch of the Cardinal Richelieus about him. According to a biographer, in his later managerial career he developed an uncompromisingly cynical approach to football tactics, which included the following principles.

Whoever has the ball is more likely to make a mistake.
Whoever renounces possession reduces the possibility of making a mistake.
Whoever has the ball has fear.
Whoever does not have it is thereby stronger.

Interestingly, in Mourinho’s case, this strategy was not hugely successful. His biggest successes came before he developed this approach, because in football, success is not measured by who makes the fewest mistakes, but by who scores the most goals.

If you push these strategies to their ultimate limit they become entirely self-defeating. You end up with football teams trying not to play football and writing lessons that are about avoiding writing. Ultimately, if you want to win football games you have to try and play some football. If you want to be a good writer you have to write something. Neither writing nor football are exercises in trying not to make mistakes.

So does this mean factual accuracy doesn’t matter?

I think people are really surprised when I make this argument, because I have basically made a career out of saying factual accuracy is important.

And I haven’t changed my mind. I still think factual accuracy is supremely important, I still think that historical understanding is built on accuracy, and I still think we should teach students facts and get them to memorise them.

My objection is not to teaching & assessing facts. My objection is to using essays to assess facts. That’s for two reasons.

Essays are not designed to test factual accuracy

An essay is an open-ended task, which means that students have some freedom in how they respond to it. This means that students will essentially set themselves different tasks. Some students will choose to mention the date of the Battle of Hastings, and some won’t. If you have a very strict rubric that insists on factual accuracy, then the student who chooses to mention the date and gets it wrong is penalised. The student who chooses not to mention it is not penalised, even though they may not know the date either!

So the essay is basically a terrible way of telling if a student knows when the Battle of Hastings happened. The right way to assess this is with a simple short answer or multiple-choice question, where every student is given the same question and there is one clear right answer.

Short answer questions and multiple choice questions are often seen as being too simplistic or basic, but they are really powerful tools. Essays and MCQs are complementary - like the two wing mirrors on a car. They give you different views of the same reality. Don’t make your essay responsible for incentivising and measuring factual recall. Set an accompanying quiz, and let that do the job instead. (We have a nice system that will do this for you!)

If you do that, I think then you would see a strong correlation between scores on the quiz and scores on the essay. In fact, when we have tried this with writing, we do see strong correlations between simple quizzes on spelling & grammatical features, and overall writing quality.

I can’t prove it, but I suspect if we gave student A and B ten questions on the facts about Hastings, student A would do better than student B.

If you use essays to assess factual accuracy by creating a strict rubric, you will create terrible incentives

One of the things we saw - and still see - with very prescriptive writing rubrics is that you get awful second-order effects. Your rubric does not end up incentivising factual accuracy. It incentivises short and basic pieces of writing. Once teachers and students know that factual and grammatical errors will be penalised heavily, they take the Richelieu / Mourinho approach and become very negative and defensive.

What does this mean for rubric design?

Whether you are using human or AI markers, you have to allow your markers some discretion.

Open-ended tasks give pupils discretion. If pupils have discretion, then markers must have discretion too. Otherwise, you create distortions.

So a principle for both humans and machines is as follows: Do not use prescriptive criteria to judge extended writing.

We’ve found that humans can judge accurately and consistently using just one incredibly holistic criterion: which is the better response?

We think the AI does need a bit more guidance than this, but it should still have latitude to make holistic judgements. We have a section on our website where you can paste in your criteria, and some advice on setting holistic criteria here. We will update this with more examples and advice as we trial assessments in different subjects.

You can trial different criteria and assessments yourself. You can purchase AI custom task credits on our website, and we are also giving away 30 free AI credits to everyone who attends our next introductory webinar on 25 February.

If you are already using custom tasks, let us know in the comments what criteria you’ve used.

How do you know your feedback is working?

Daisy Christodoulou — Sat, 17 Jan 2026 09:45:25 GMT

One of the major problems with a lot of classic education research papers is that they are based on very small numbers of students. This means that if the paper does show a certain intervention is effective, it is entirely possible that it is the result of chance and not the intervention.

This problem is compounded when the interventions involve writing assessments, because traditional writing assessments are quite unreliable. Again, this adds yet more noise to the results.

We have a new assessment model which addresses both of these problems and makes it easy, quick, and reliable to evaluate the impact that feedback has on writing.

We trialled the new approach last year, and are running a bigger project in March this year for Year 6 students.

Here is how it works.

Students take part in our established Year 6 writing assessment in March. We expect about 30,000 students will take part in this.
Schools will receive extensive feedback reports, with a mix of AI & human feedback. They will share the reports with their students and can provide their own feedback too.
Students will then redraft their original piece of writing.
Schools can then submit this redrafted piece of work to be assessed again as part of a national assessment window. The scores of both the original and redrafted pieces of work will be on the same scale, allowing us to measure the impact of the feedback.
Both the original and redrafted writing will be assessed using our Comparative Judgement plus AI model. This is highly reliable and dramatically reduces the teacher workload.
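Because both pieces sit on the same scale, the impact of the feedback can be read off as a simple difference. The sketch below is one way to do that, assuming each student has an identifier and a scaled score for each draft. The identifiers, the scores and the `feedback_gains` helper are hypothetical, and only students who submitted both drafts are compared.

```python
from statistics import mean


def feedback_gains(original, redraft):
    """Per-student gain (redraft minus original) for students with both scores."""
    return {
        student: redraft[student] - original[student]
        for student in original
        if student in redraft
    }


# Invented scaled scores for illustration only.
original = {"s1": 500, "s2": 530, "s3": 470, "s4": 555}
redraft = {"s1": 512, "s2": 528, "s3": 490}  # s4 did not submit a redraft

gains = feedback_gains(original, redraft)
mean_gain = mean(gains.values())
print(f"{len(gains)} matched students, mean gain {mean_gain} scaled-score points")
```

Restricting the comparison to matched students matters: averaging all originals against all redrafts would mix the effect of feedback with the effect of who chose to take part in the second window.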

We ran a project like this last year, but gave schools very short notice about the redraft. This meant that whilst approximately 36,000 students from 900 schools took part in the original assessment, only 3,851 from 73 schools took part in the follow up. This year we have given schools more notice, so we hope that we’ll get more taking part in the redraft.

The project is not a gold-standard randomised controlled trial, but it will still provide schools with rapid and useful information about how students respond to feedback. It would also be possible to use the same Comparative Judgement plus AI write-feedback-redraft model as part of an RCT.

Improving the feedback

We’re also planning a couple of changes to the feedback that students get.

Last year, we gave every student a set of five multiple-choice questions that were created by us - not AI. We created three sets of questions, and then split students into three groups based on their scaled score. Students in the lowest-scoring group got a set of questions on capital letters, students in the middle group got questions on run-on sentences, and students in the top-scoring group got questions on vocabulary.

This year, we will continue to allocate question sets by scaled score, but we are going to introduce a little bit of AI into the mix.

Students in the lowest-scoring group will continue to receive a set of questions on capital letters. These questions will be created by us, but we will use AI to customise them slightly. We will make the content used in the questions match the content used in each individual student’s story. E.g. if the student has written about two children called Ilsa and Bob, the questions will mention Ilsa and Bob.
We’ll do something similar for the middle third of students. They’ll get a set of questions on run-on sentences, created by us but tweaked by AI to include the content of their story.
For students in the top third, we will be making a more substantial change. These students will get a set of questions entirely designed by AI. The questions will focus on more creative aspects of writing.
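The allocation described above can be sketched as a rank-based split into thirds. This is an illustration under assumptions: the exact cut-offs between groups are not specified here, so the code simply ranks students by scaled score and divides the ranking into three equal bands. The function name, topic labels and scores are hypothetical.

```python
def allocate_question_sets(scaled_scores):
    """Map each student to a question-set topic by thirds of the score ranking."""
    topics = [
        "capital letters",              # lowest-scoring third
        "run-on sentences",             # middle third
        "creative aspects of writing",  # top third
    ]
    ranked = sorted(scaled_scores, key=scaled_scores.get)
    n = len(ranked)
    allocation = {}
    for rank, student in enumerate(ranked):
        band = rank * 3 // n  # 0, 1 or 2, because rank < n
        allocation[student] = topics[band]
    return allocation


# Invented scaled scores for six students.
scores = {"a": 450, "b": 470, "c": 510, "d": 525, "e": 580, "f": 600}
allocation = allocate_question_sets(scores)
for student, topic in allocation.items():
    print(student, "->", topic)
```

A rank-based split keeps the three groups the same size whatever the score distribution looks like. Fixed score thresholds would be the alternative if the aim were to match topics to absolute levels of writing rather than to relative position.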

We’re currently developing and trialling these new question types, and will shortly be emailing our participating schools to get their opinion on them.

If you are not currently a participating school but would like to be, you can join us! Read more about the project and how to take part here.

Could this model work at a smaller scale?

One of the big advantages of this model is the scale - thousands of participating students. However, we have had a lot of requests from schools who would like to try it out at a smaller scale, in their own school or class. Obviously you would not be able to generalise as much from a smaller scale, but we agree that it would be incredibly valuable for an individual school or class teacher to be able to get such rapid feedback on their interventions. We can also place these bespoke individual assessments onto our national scale by including anchor scripts from previous assessments, which means that even small assessments can have some of the benefits of scale. We are looking at ways that we can make this write-feedback-redraft cycle easy for an individual school or teacher to implement. Get in touch if this interests you.

Further reading and information

We published a series of posts about last year’s project: the original intro post, our trial school results, the full set of results, a qualitative analysis of one school’s results.
How Comparative Judgement plus AI works.
This year’s calendar and help page.
A guide to all of our feedback reports
Our events page - we have two online introductory webinars scheduled in the next six weeks.

Maybe LLM tutors might be able to work...

Daisy Christodoulou — Sun, 11 Jan 2026 10:20:21 GMT

In a post last year, I looked at some of the barriers to creating a good LLM tutor. In summary, here were the four challenges that LLM tutors need to overcome.

Questions not explanations. LLMs are very good at explanations, but explanations are over-rated as a means of learning. Sets of really good questions are better.
Reducing hallucinations. Good questions have to be precise and accurate, and LLMs are not so great at precision and accuracy, because they hallucinate.
Improving on pre-LLM technologies. An LLM tutor not only has to prove it is better than a human tutor, but also that it is better than pre-LLM technologies like textbooks and intelligent tutoring systems (ITS). These have zero or close to zero error rates.
Providing structure and discipline. An LLM tutor has to find some way of replicating the structure and discipline of an in-person classroom, because students can’t learn everything from sitting on their own at a screen.

Late last year, a new paper was published with the best answer I have seen to all four challenges.

The paper is the result of a collaboration between Google LearnLM and Eedi. The link is here and you can read a summary of it here by Craig Barton, Eedi’s head of education. I’ve known Craig for a long time (you can hear me on his podcast here) and have always admired his work but I have no affiliation with Eedi or Google, so this is just my view as I see it from the outside.

Here is a brief summary of the study, as I understand it.

The study took place in 5 UK secondary schools whose students were used to using the Eedi online learning platform as part of their maths lessons.
The students started an Eedi unit online in class as normal. When they got a question wrong, they were then able to start an online chat with a tutor. There were three conditions: (1) chat with a human tutor, (2) chat with an LLM tutor (whose messages were supervised by a human), (3) receive a pre-written static hint that was the same for everyone who got that question wrong.
The effectiveness of each approach was measured in three ways. (1) The students were given the exact same question they got wrong at the start. (2) If they still got it wrong, they got to have a follow up chat and then two attempts at a new question on the exact same topic. (3) The students moved on to the next unit in the sequence and the study measured their success on the first question of that sequence.
Essentially, the students got fewest questions right when taught with the “static hint” approach. There wasn’t much difference between the human tutor and the LLM tutor. The humans who supervised the LLM didn’t have to make that many edits and were themselves impressed by the LLM’s responses. Crucially, the LLM made very few errors, and the paper lists them all in an appendix.

So how does this study address my four challenges?

Questions not explanations. The student-LLM discussions were focussed on questions and answers. The LLM wasn’t just explaining a concept and assuming the student got it. It asked questions to check for understanding, and then, when the understanding wasn’t there, it was capable of recognising that and following up with other questions until the student did understand. And then of course the success of the intervention was immediately measured with the original question and a subsequent question.
Reducing hallucinations. The most striking part of this study is that the LLM error rate was reduced to just 0.14% - just over one error every thousand messages. This is extremely impressive. The study didn’t report what the human tutor error rate was, and more broadly we don’t really have reliable data on how often teachers make basic errors in class, but even highly skilled teachers will make errors from time to time. Does the average human teacher in a traditional class make one error for every thousand “messages” they speak? It’s not insane to think they might.
Improving on pre-LLM technologies. A 0.14% error rate is good for an LLM or a human, but probably not as good as a textbook or an intelligent tutoring system (ITS) which are capable of close to zero errors, especially once they get into a second edition or version. However, this study specifically compares the LLM tutor performance with pre-written static hints, which in some ways are analogous to textbooks or ITSs, and the LLM tutor outperformed the static hint. I like the concept of pre-written static hints, and I think they are under-rated, but clearly they have their flaws. They are kind of similar to the customer service chat bots that give you a pre-loaded menu of options to choose from. A lot of the time, the pre-loaded options don’t address your question, and you want to find a way to talk to a human instead.
Providing structure and discipline. The study involved students in a typical classroom. They weren’t sitting in a lab or at home. The structure and discipline of an in-person class with an in-person human teacher are present. As a result, it is much easier to see how this study - which was quite small - could scale up to larger numbers. (I still retain concerns about moving all learning on-screen, and even when screens are used I think we need to do more on optimising them for learning, blocking distractions, etc.)

There will be plenty of people who think this study isn’t ambitious enough. The things that I think are strengths – the focus on correct questions and answers, the way it is embedded in a typical classroom – they will see as weaknesses. Why isn’t it tearing down the traditional classroom and re-imagining education for the fifth industrial revolution? Those people will have to look elsewhere. For those of us in the evidence-based community, this is a significant breakthrough.

What’s also interesting is that, for perhaps the first time, a major technology company is listening to the evidence about education. In my 2020 book, Teachers vs Tech, I lamented the fact that most of the big technology companies were spending their education budgets on things like “demonstrate how to solve equations with iMovie videos in the style of a cooking show” (Apple) or “different students within a single class could be completing different projects about the topic, each tailored to their learning style” (Summit Learning, funded by Chan Zuckerberg).

If we are at a point where a major technology company is committing significant resources and talent to evidence-backed principles, then there is the potential for big breakthroughs.

The error rate is low, but it still matters

Although the low error rate is impressive and far better than I thought was possible, I still think it’s high enough to worry about. In this study, the human supervisor edited out the errors, so the results reported don’t include their impact. There are so few errors that you might argue they wouldn’t have changed the results, but at a larger scale with no human supervision, we just don’t know how these errors would propagate and affect a student’s understanding.

Also, whilst one in a thousand errors sounds low, it’s possible that one lesson’s worth of conversation with the chatbot might include 50 or so messages, which would effectively 50x the error rate. Over the course of one year of using the chatbot in every maths lesson, a student might encounter 12 errors (50 messages a lesson, 5 lessons a week, 35 weeks a year, 0.14% error rate). That feels significant enough to worry about, and significant enough that students would start to doubt the chatbot even when it was right. Obviously what would be great is if we could have a chatbot with zero errors, but I think we are in real “March of the Nines” territory here - it is often as difficult to get from 99% accuracy to 99.9% as it is to get from 0% to 90%.
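
The arithmetic behind that estimate can be checked with a short sketch. The inputs are the figures quoted above. The per-lesson probability is an extra illustration, and it rests on the assumption (not made in the study) that errors occur independently from one message to the next.

```python
# Illustrative arithmetic only; the inputs are the figures quoted in the text.
ERROR_RATE = 0.0014        # 0.14% of messages contain an error
MESSAGES_PER_LESSON = 50
LESSONS_PER_WEEK = 5
WEEKS_PER_YEAR = 35

messages_per_year = MESSAGES_PER_LESSON * LESSONS_PER_WEEK * WEEKS_PER_YEAR
expected_errors_per_year = messages_per_year * ERROR_RATE

# Chance that one 50-message lesson contains at least one error.
# Assumption: errors are independent across messages.
p_error_in_lesson = 1 - (1 - ERROR_RATE) ** MESSAGES_PER_LESSON

print(f"Messages per year: {messages_per_year}")
print(f"Expected errors per year: {expected_errors_per_year:.2f}")
print(f"Chance of an error in any one lesson: {p_error_in_lesson:.1%}")
```

On these assumptions a student sees about 12 errors a year, and roughly one lesson in fifteen contains at least one.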

Instead, I think we need to focus more on the social norms around errors. If a human teacher makes a mistake, they often know about it because two or three students look puzzled and raise their hands and say “Miss, you’ve made a mistake”. (This is one of the advantages of a large and non-personalised classroom - the teacher gets feedback from multiple sources). What should a student do if they think a chatbot has made a mistake? What process should we put in place to deal with those errors?

This is an issue for all uses of AI. In many cases it already makes fewer errors than humans, but it makes different kinds of errors in different ways. We have established and often centuries-old systems for catching and mitigating human errors. A lot of these just don’t work with AI, so we need to build new error-mitigation systems.

What are the implications for other subjects?

This paper solely looks at maths. What about those of us involved with teaching and assessing other subjects? At No More Marking, we focus on writing, and for the last couple of years we have been looking at ways of getting LLMs to provide useful feedback on student writing. You can read a summary of our journey here.

What we would really like to do is provide students with specific questions on specific aspects of their work that they can answer and that will improve their work. Last year, we ran a project called CJ Dynamo which tested the effectiveness of the various types of feedback we are able to produce.

We’d like to improve our feedback further. Here’s a fairly simple example of what we’d like to do.

Here’s an extract from a piece of work by a student.

What we’d like is for the AI to automatically produce something like the following.

We are not able to get LLMs to do this reliably enough. Instead, we’ve settled on a different approach.

The AI produces some personalised but less specific advice about the content of the writing, where it is less likely to go wrong.
We create sets of multiple-choice questions about the technical aspects of writing, which we allocate to students based on scaled score – not on whether they have made that specific error or not.
We also have personalised AI-transcribed teacher feedback based on audio teacher comments.

Our current multiple-choice questions are more like the “static hint” approach in the Google paper. This is better than nothing, and our CJ Dynamo project shows it is having a positive impact. However, it would be better to have something more personalised and dynamic, and the way to do so is probably by fine-tuning an open-source LLM. This is possible and we are working on it – but it is hard and expensive, which is probably why Google are leading the way in this area currently!

Bloom's famous 2 sigma tutoring paper is incredibly misleading

Daisy Christodoulou — Sat, 03 Jan 2026 08:45:26 GMT

In 1984, Benjamin Bloom published a famous paper: The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring.

The paper claims that one-to-one tuition produces 2 sigma improvements when compared to traditional whole-class teaching. This is a massive deal: it means that one-to-one tuition can raise the test scores of an average student to those of a student in the top two percent.

Imagine a year group of a couple of hundred students. Imagine the average students in that year group. Imagine an intervention that could move them all to the standard of the very best students in that year group - and that would simultaneously improve the scores of all the other students by an equivalent amount too. That is what 2 sigma means.
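
To see where the “top two percent” figure comes from, here is a minimal sketch. It assumes test scores are normally distributed, which is the assumption behind the percentile conversion rather than something the paper's data demonstrates. The function name is just for illustration.

```python
from statistics import NormalDist

# Assumption: scores follow a normal distribution, so an average student
# starts at the 50th percentile.
standard_normal = NormalDist(mu=0, sigma=1)


def percentile_after_gain(sigma_gain: float) -> float:
    """Percentile reached by a previously average student after a gain
    of `sigma_gain` standard deviations."""
    return 100 * standard_normal.cdf(sigma_gain)


print(round(percentile_after_gain(0), 1))  # 50.0
print(round(percentile_after_gain(2), 1))  # 97.7
```

A 2 sigma gain moves the average student to roughly the 98th percentile, which is the jump shown in Bloom's famous graph.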

Although the paper was not about education technology, it has had enormous influence and impact in the ed tech world. The logic runs like this: Bloom has shown that one-to-one tuition is the best form of instruction; human one-to-one tuition is impossible at scale; technology could provide one-to-one tuition for everyone and provide 2 sigma gains for everyone.

Sal Khan of Khan Academy has based a theory of AI tutoring around the paper, the Chan Zuckerberg Initiative refers to it, and World Bank researchers love it.

The only problem is that the paper cannot bear anything like the weight of these conclusions. Here is why.

(I wrote a briefer critique of the paper in my 2020 book Teachers vs Tech, which you can purchase here.)

Six major problems with Bloom’s 2 sigma tutoring claim

The study taught and assessed students on narrow domains of cartography and probability. Most education studies measure performance on much broader domains – typically literacy or numeracy. The smaller the domain, the more sensitive it is to instruction, meaning that outsize gains are more likely. As well as the tests being closely linked to the study content, they were also designed by the researchers and were not standardised.
The participating students were novices. The topics were completely new to them. This matters because beginners tend to make rapid progress at first, and that progress slows over time. Again, this affects the statistics, making dramatic outlying gains more likely and less meaningful. We have found this with our assessments of writing. You have to be careful in interpreting and comparing results in the first few months of instruction with those from later in instruction.
It was a short-term intervention with short-term metrics. Students received 11 40-minute lessons over 3 weeks and then had a test on the content straight away. We don’t know if those gains were a) maintained or b) sustainable – eg, if you came back after a year, would the students have maintained that standard? Would they have continued to improve at the rate of 2 sigma every 11 lessons? Learning is a change in long-term memory, and this study tells you nothing about the long term.

These three deliberate design choices all make big effects more likely and less meaningful. They don’t invalidate the results, but they do severely limit the conclusions you can draw. You can’t conclude from a 3-week intervention into a small, new domain that you can turn a median student into a Rhodes scholar.

I think the root cause of all three of these problems is a conflation of formative and summative assessment. Bloom’s assessments are optimised to provide short-term formative feedback, which is fine, but you cannot then use that same information to provide summative insights. This is something I write about at much greater length in my 2017 book Making Good Progress, and you can see schools in the UK and America making the same error. Like Bloom, they will give students tests on small, recently-studied domains which all the students will ace. This is totally fine if you want to check students have understood what you have just taught. However, they will want to claim much more than that, and will say that high performance on this test is predictive of getting the top grade on national exams. This is not a valid inference!

There are then 3 further methodological problems with the Bloom paper which are worth mentioning.

Bloom didn’t actually carry out any of the studies in question. He was reporting data from two PhD students. One of those dissertations is available online - the other isn’t. My analysis is based on the one that is. In Bloom’s paper he has a famous graph showing students jumping from the 50th to the 98th percentile. This isn’t based on the underlying data: it’s just a stylised representation of what that type of progress looks like.
The studies divided students into 3 groups: traditional whole-class instruction, mastery whole-class instruction and one-to-one tuition. The one-to-one groups got extra input: they were given more feedback and corrective tests than the other two groups.
The numbers involved in each study were very small - just a couple of hundred students in total. We have no idea whether these effects would hold if tuition were scaled up. This is a major problem with all educational interventions, particularly those which involve reducing class sizes – and one-to-one human tuition is basically just the most extreme version of reducing class sizes. The literature on reducing class sizes shows that it can be effective at a small scale, but it is hard to scale up – because to reduce class sizes at scale, you have to recruit a lot of new teachers, and often the new teachers you recruit are not as good as the existing teachers in the system. Interestingly, one of the Bloom studies had this exact recruitment problem. They used undergraduate students as tutors, and in two of the grades being studied, they couldn’t recruit enough – so they increased the tutor groups from one to three. This suggests that at scale and in real-world contexts, the gains from reducing class sizes may not be as great as the gains from improving whole-class instruction - which is the exact opposite of the message conveyed by the paper.

A sporting diversion: can we use the standard deviation to find the best sportsperson ever?

These students really did make 2 sigma improvements. But they did it in such a narrow domain, in such an early part of their training, and over such a short period of time that it provides us with very few generalisable insights.

To see why, here’s an extended sporting analogy.

Don Bradman is widely regarded as the best cricketer ever. He has a batting average of 99.94. This is crazily exceptional, and one way of explaining to a non-cricket fan why this is such a big deal is to use the standard deviation.

Cricket batsmen average about 40 runs per innings, with a standard deviation of about 9. Bradman is therefore over 6 standard deviations better than the average batsman. This is the equivalent of meeting a man with a height of over 7 feet 6 inches. It’s insane!!

You can use the standard deviation to measure exceptional performance in other sports, and it’s very rare to see anyone being more than 2 or 3 SD away from the mean. So does this mean Bradman is not just the greatest cricketer of all time, but the greatest sportsperson of all time - the GOAT to end all GOATs?

Maybe.

The power of the standard deviation is abstraction. What the standard deviation lets you do is take cricket runs, football goals, 100-metre sprint times, and gymnastics scores, and essentially put them all onto the same scale. It means you are no longer comparing apples with oranges, but apples with apples.
The limitation of the SD is also abstraction. It takes away a lot of the underlying domain specific detail of different sets of numbers and enables a comparison that may not really be legitimate. There is a risk that you are still comparing apples with oranges, but you’re just pretending that you’ve turned some oranges into apples.

The case against Bradman being the greatest sportsperson ever is that 1930s cricket was not as professional or as global a sport as modern football, sprinting and gymnastics. The talent pool Bradman was competing against simply wasn’t as competitive, and that has the potential to skew his stats.1

Basically, in order to see whether it is legitimate to compare the standard deviations of Bradman to Messi or Federer or Bolt or Biles, you need some domain-specific understanding of each sport and its historic context.

In this particular case, I think the standard deviation is useful and appropriate, but not conclusive. However, there are ways in which you can use the standard deviation which are obviously just absurdly inappropriate.

Imagine a group of 8-year-old footballers who get some extra instruction on doing keepy-uppies. One kid gets some extra one-to-one coaching from his dad. A week later, his dad devises a keepy-uppy tournament for all the kids. His son wins! He completes 20 keepy-uppies when the tournament average is 8 and the standard deviation is 2.

If you then said, “This kid is 6 standard deviations above the mean, therefore he is a better footballer than Lionel Messi”, that would obviously be absurd.
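
The Bradman figure and the keepy-uppy figure both come from the same calculation, which is exactly why the number on its own tells you so little. A minimal sketch, using the figures quoted above (the batting mean and standard deviation are approximate):

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Number of standard deviations that `value` lies above the mean."""
    return (value - mean) / sd


# Figures as quoted in the text.
bradman = z_score(99.94, mean=40, sd=9)
keepy_uppy_kid = z_score(20, mean=8, sd=2)

print(round(bradman, 2))         # 6.66
print(round(keepy_uppy_kid, 2))  # 6.0
```

The formula knows nothing about the size of the domain, the quality of the competition or who designed the test. That context has to come from outside the statistics.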

That is what I think happens with the Bloom 2 sigma study. Novice students make rapid progress on a new, small domain over a short period of time when given extra coaching and assessed with a non-standardised test. We then fall over ourselves not just to declare that the students are better than Messi – but that their coach is the next Alex Ferguson or Pep Guardiola and we should all be copying their methods.

Are outsize gains like Bloom reports really possible?

At this point it is customary to say that Bloom sets our expectations too high. I don’t think this is the case. I think education has for a long time been in a pre-scientific phase, and that if we could better align it with science, then big 2 sigma style gains are possible. My issue with the Bloom paper is not that it sets unrealistic expectations, but that it won’t help us achieve any kind of expectations.

Does any of this matter? Surely we know that one-to-one tuition is better than whole-class teaching?

You might say OK, who cares, maybe the study is slightly ropey but we all know that one-to-one tuition is better than whole-class instruction, so is it really that misleading?

Yes. As we have seen, human one-to-one tuition is extraordinarily expensive and hard to scale. Bloom and his grad students acknowledged this, and the point of their research was to try to find whole-group methods that were as effective as one-to-one tuition.

However, by emphasising the impact of one-to-one tuition so much, the effect has been to make human one-to-one tuition seem like the gold standard to which we should all be aspiring. Post-Covid, many governments spent huge sums of money on catch-up human tuition, often implicitly or explicitly justified by Bloom’s research. The programmes ran into predictable problems of recruitment and training and had underwhelming results - nothing like 2 sigma every 3 weeks.

Similarly, the impact on ed tech has been to encourage learning platforms to mimic one-to-one tuition and to focus on personalising instruction for the individual student.

But what if this is the wrong way round? What if the gold standard of effective human pedagogy at scale is actually whole-class instruction, and ed tech platforms should take that as a basis to learn from? Interestingly, a recent study from Google DeepMind embedded LLM tutors within a typical whole-class environment, and showed some impressive results.

We also have better and more robust data about what works in whole-class instruction - including, in England, some much better uses of standard deviations.

What 2 sigma progress really looks like

Every secondary school in England gets a Progress 8 score, measuring how much progress students make across eight subjects from age 11 to 16. A Progress 8 score of 0 means that, on average, pupils at the school made the same amount of progress as pupils nationally with similar starting points.

The mean is always close to 0, and most schools cluster around the mean, with over half of schools getting a score between -0.25 and +0.25.

However, there are a handful of outliers scoring above 1.5. These schools are achieving something close to a 2 sigma improvement.

Now of course, this is a school-level measure, not a pupil-level intervention like in Bloom. But it can still give us some useful insights. And if we run through all the flaws of Bloom again, Progress 8 avoids them.

It measures progress on 8 big subjects – not one sub-topic!
The tests at the end are standardised and not designed by the teachers.
It measures gains over 5 years, not 3 weeks.
It includes the performance of about 3,500 schools and 600,000 students – a big sample.
Most of the schools in the sample have broadly equivalent resources.

Obviously no metric is perfect and Progress 8 has its flaws too. But it is far less flawed than Bloom’s study, and a far better guide to what 2 sigma improvement in education actually looks like.

Stephen Jay Gould discusses this problem in his book Full House: The Spread of Excellence from Plato to Darwin. He argues that the greater the standard deviation in a sports league, the lower quality it is, and the greater chance there will be of exceptional players registering exceptional scores. In a higher quality league, we will see narrower standard deviations and it will be harder for exceptional players to register exceptional scores.

Using AI to judge the best Christmas film quote

Daisy Christodoulou — Wed, 17 Dec 2025 22:46:36 GMT

Every year at No More Marking we run a fun and festive Comparative Judgement Christmas task where we get people to judge their favourite Christmas song, chocolate, film, etc.

This year we are adding AI into the mix, so you can judge your favourite Christmas film quote and see if our new robot overlords agree with you!

It’s also a nice way of seeing how our new AI features work.

AI bless us, everyone!

Here is how it works.

Click on this link to register as a judge.
You’ll be presented with a pair of quotes from a famous Christmas film (you’ll also get the name of the film). The interface will look like this.

Read both quotes and have a think about what the best one is!
You can click on the button with the snail in the bottom right hand corner to see which quote our AI picks. You will also get to see a sentence where the AI explains its decision!
You can then make your own decision by clicking on the button that says “left” or “right”. You’ll then be taken to a new decision.

We will share the overall results in a week or so’s time on our social media feeds.

Now we have AI judges. Ho ho ho!

In the big writing projects that we run, we typically get about 80-85% agreement between humans and AI.

In these big writing projects, we don’t include the button on the bottom right that gives you real-time information from the AI. Instead, we get teachers to complete their judgements independently, with no influence from the AI. You can then download the results of the AI judging and the AI feedback separately - and see the percentage agreement between your teachers and the AI.

Faith is believing in things when AI tells you not to

We don’t necessarily expect human & AI choices on Christmas film quotes to align! In fact, we don’t necessarily expect human choices on Christmas film quotes to align!

For our serious assessment projects we spend a great deal of time validating and fine-tuning to ensure that the AI decisions align with the decisions of teachers.

If you’d like to try out a writing assessment, register for one of our introductory webinars here.

Merry Christmas!

Updates from our AI assessment projects

Daisy Christodoulou — Sun, 23 Nov 2025 08:35:32 GMT

This term, we’ve been busy turning out results and analysis for all of our big Comparative Judgement national assessment projects.

The Comparative Judgement plus AI model, which we developed earlier this year and trialled in March, is now available as standard for all of our national & bespoke assessments. We have now assessed over 200,000 pieces of writing using this model and we have just completed our first national history assessment.

Here are three new things we’ve learned from this term’s assessments.

AI Comparative Judgement delivers very similar results to human Comparative Judgement - it’s just quicker!

We’ve written extensively about the high agreement rates we see between our human & AI judges.

We now have some different data points showing something similar.

For the Year 3 writing assessment that we ran this term, almost half of our schools chose to use AI judges, and the rest chose not to. This means we can compare the results of each sub-group and see if there are any discrepancies.

What we found was that the two groups were very similar. The overall means of each group were exactly the same: 493. The standard deviation for the AI-judged group was slightly smaller - 39 compared to 46. This means there were fewer very high and very low scores in the AI-judged group. We are not totally sure why this is, but it is not a huge difference.

AI is better at Comparative Judgement than absolute judgement

There are a lot of organisations out there doing AI marking. Most of them get the AI to do traditional marking, which is a form of absolute judgement. You are asking the AI to look at one piece of writing and place it on an absolute scale.

We trialled this approach in the past and moved away from it for several reasons, the most important of which was that the AI just wasn’t very good at it.1

One way we can validate the scores from any assessment is to see if they help you predict the scores the same students got on other assessments of the same construct. We have used this method to validate our human Comparative Judgement assessments over the last few years and we routinely see 0.7+ correlation between student scores on one assessment and the next. Using the AI to make absolute judgements, we saw only a 0.5 correlation.

However, now that we are using Comparative Judgement, we are seeing much higher correlations. Approximately 23,000 students who took part in this term’s Year 3 assessment also took part in a similar Year 2 assessment in February which was entirely judged by humans. We could therefore measure the correlation between the Feb Y2 assessment and the Oct Y3 assessment. We found that the October Y3 human and AI judges both achieved high correlations with the Feb Y2 assessment. (This of course is another data point showing that the AI is as good as humans.)
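
For readers who want to see what this kind of validation involves, here is a minimal sketch of a Pearson correlation between two sets of scaled scores for the same students. The scores are invented purely for illustration and are not our data. The function and variable names are hypothetical, and this is not a description of our actual analysis pipeline.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical scaled scores for the same eight students on two assessments.
feb_y2_scores = [455, 470, 488, 493, 502, 515, 530, 548]
oct_y3_scores = [462, 481, 475, 500, 497, 528, 519, 551]

r = pearson(feb_y2_scores, oct_y3_scores)
print(f"Correlation between the two assessments: {r:.2f}")
```

The closer the result is to 1, the better one assessment predicts the other. Running the same calculation separately for human-judged and AI-judged scores gives a like-for-like comparison.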

AI can judge subjects other than English Language

We have just completed our first nationally standardised history assessment. 25 schools and just over 4,000 students took part. In the past we have had lots of schools use our platform for history assessment, but we’ve never run a nationally standardised project, partly because there aren’t as many history teachers as English teachers and this makes judging quite time-consuming.

Adding in AI judges dramatically reduced the time it took to judge. On average, each teacher in the project judged for just under 20 minutes - which is what we predicted. In return, they got 7 PDFs with incredibly detailed data and feedback.

Was the AI good at judging more complex essays where the focus is not just on writing but on subject content too? The AI agreed with the human decisions 77% of the time. This is slightly lower than the 85% we typically get for writing assessments, but it’s still not bad. Our initial feedback from schools is that the results made sense.

We also have hundreds of schools using our platform to run custom AI assessments in a whole range of subjects. Custom assessments use all of our AI features, but they are customised to an individual school’s curriculum & calendar and aren’t nationally standardised. It is early days, but so far the approach seems to be working well for all these other subjects too.

If you would like to learn more, our next introduction webinar is in January.

Even if the AI does get better at absolute judgement, there are still problems. It’s hard to use human oversight to validate this approach, and there is no statistical model underneath it - which is a problem given that most grades involve a significant statistical element (eg in England about 2.5% of students get a grade 9 in GCSE English Language).

Can LLMs be personal tutors?

Daisy Christodoulou — Sun, 16 Nov 2025 16:45:47 GMT

Think how amazing it would be to have a personal tutor who is an expert in every subject under the sun and available on-demand 24/7.

That is the incredibly exciting promise of Large Language Models - that they will be able to teach you anything you want, whenever you want.

However, I think the barriers to getting there are more significant than we imagine.

Here are four challenges that LLM tutors have to overcome.

1. LLMs are good at providing explanations, but explanations are over-rated

LLMs are good at providing explanations. The problem is that pedagogically, explanations are over-rated.

Thomas Kuhn, the famous philosopher of science, once asked why it was that a group of students could all read the same chapter of a physics textbook and say they had understood it – but then get the questions at the end of the chapter totally wrong.

Kuhn concluded that what these students really needed was not explanations but lots and lots of examples and questions.

He was right. Questions are important for two reasons: they force the student into mental activity, which is necessary for learning. And they tell the student and the teacher if the student actually has understood what has been taught.

The research also shows that students often don’t like this. They would rather read, reread and highlight explanations than answer questions. That’s probably because rereading an explanation is easy, but answering questions is hard. It’s also because reading an explanation feels like you have understood something. It gives you the illusion of understanding, whereas answering a set of questions exposes the reality that you don’t.

What we need are not LLMs that answer questions from students. We need LLMs that ask students questions.

But the problem with that is…

2. LLMs are not as good at creating precise questions

LLMs still have problems with hallucinations, and this is a real problem when you want to create banks of questions and answers where precision and accuracy really matter.

We have experience of this with the feedback we provide on our writing assessments. We provide LLM-generated written feedback for students and teachers. At that level of generality, the LLM does a good job.

But we also wanted something more precise – so we asked the LLM to generate a series of multiple-choice questions based on each student’s piece of writing. It found that task much harder, and a number of errors crept in. Errors like these can cause enormous confusion for novices. [We ended up creating our own and allocating them based on the students’ scaled score.]

When I talk about the error rate of LLMs, the inevitable response I get is “yes but humans aren’t perfect either”. That is absolutely true. In the great “algorithms vs humans” debate, here at No More Marking we are mostly on the side of the algorithms, because we know that humans make so many mistakes.

However, in this particular case – the creation of personalised questions – the correct comparison is not between error-prone LLMs and error-prone humans. The correct comparison is between error-prone LLMs and older technologies which have largely eliminated errors. Which brings me to my third point.

3. Pre-LLM technologies are very good at creating error-free, scalable and personalised resources

The original technology for creating error-free and scalable educational resources is about half a millennium old – it’s the printing press. Once you have a really good set of questions (or indeed an explanation) you can proofread it and get it checked over by multiple other humans and then get it printed as many times as you need.1

Of course, printed textbooks aren’t personalised or interactive. But personalised and interactive resources do exist already too – not for as long as the printing press, of course, but for several decades.

Many online learning platforms consist of enormous banks of accurate questions. Students can proceed through them at their own pace and receive personalised feedback and next steps based on their pattern of correct and incorrect questions. There are many platforms like this. They obviously vary in style and quality, but the best of them have decent track records.

So, one major question for me is this: how are LLMs going to improve on these pre-existing technologies? What can they offer that is better?

And this also brings me to my fourth point. These very effective pre-LLM digital tutors have been around for decades, and they have not made the human teacher or the physical classroom obsolete. Why?

4. There is a limit to what students will learn on their own and on a screen

The Covid pandemic provided us with a natural experiment in the effectiveness of online learning. Did everybody say at the end of it, fantastic, actually, it turns out that we don’t really need physical schools and human teachers any more?

No. Everybody said: we need to get the kids back into school. The global data shows that students learnt less when schools were closed, not more, even in countries where they had access to the internet and many brilliant online learning tools. And even before Covid, we knew that online learning courses had very high drop-out rates.

The structure and discipline of in-person classrooms are important, and online platforms lack this structure. So even if they are full of brilliant content and sound pedagogical principles, they may not be as effective as in-person teaching.

For LLM tutors to succeed where other online learning platforms have not, they have to overcome this problem. Either they have to find ways of incorporating the structure and discipline of an in-person class, or they have to be so much more engaging and compelling than existing online learning platforms that they will eliminate the need for structure and discipline as students will prefer using them to doing anything else online.

The latter is going to be very hard and is largely beyond the control of any online learning platform, as it is competing against entertainment platforms that aren’t constrained by learning. Optimising for one parameter is easier than optimising for two.

Questions about questions

So, to sum up, here are the four questions you need to ask of any LLM tutor.

Does it rely solely on explanations?
If it does use questions, how does it ensure they are accurate?
In what ways is it better than pre-existing online learning systems that don’t use LLMs?
Is it integrated with a traditional classroom, or is it designed for students to use on their own? If the latter, how will it get high completion rates?

Some systems are engaging seriously with these questions and coming up with good answers, and I will profile a few in a future post. But many are not, and the risk is that LLMs just get added to the long line of technological innovations that promised and failed to improve education.

Some of the earliest printed books do have quite a few errors, and completely eliminating all errors in any format is not easy. Andrej Karpathy’s “march of nines” is as true of Gutenberg’s books as of Waymo’s self-driving cars. But a modern textbook that is in its 2nd edition is likely to have vanishingly few errors. E.g. this textbook is the one I know best, and neither I nor several colleagues / students have spotted any errors in it.