Can AI assist the Comparative Judgement of primary writing?
And can students respond well to AI feedback?
Last half term we ran a brand new assessment, CJ Dynamo, which did two new things: AI-assisted assessment, and assessment of Year 6 students’ redrafting skills. You can read more about the project design and the research that informed it here.
In this post, we will share the results and explain a) how well the AI judged and b) how well the students redrafted their work.
Can the AI judge?
This assessment used a similar structure to our earlier CJ Lightning Y7 AI assessment, and we have found similarly positive results.
In total, 3,851 Year 6 students from 73 primary schools took part in the assessment. This task was judged using a combination of human and AI Comparative Judgement. 449 teachers provided 3,825 decisions, and the AI provided 46,844 decisions.
Of those 3,825 human decisions, our AI agreed with 3,118 and disagreed with 707. That’s an 82% agreement rate, which is similar to the typical human-human agreement across our projects.
What about the 18% of judgements where they disagreed? This is really important: it’s easier than you might think to get 82% human-AI agreement. What really matters is what they are disagreeing about! Reassuringly, most of the human-AI disagreements were small. Of the 707 judgements where the human and AI disagreed, 50% were under 21 points, 90% were under 60 points, and 97% were under 80 points. (Our scale is fine-grained, and runs from about 300 to 700.)
The remaining 3% of disagreements - 21 in total - were 80 points or more. That is 3% of the total disagreements, but just 0.5% of the total number of human judgements.
Some element of disagreement is always going to exist with assessments of extended writing, whoever is judging it. This is a very low rate of serious disagreement, and one that we think is acceptable.
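For readers who like to see how this kind of analysis hangs together, here is a minimal sketch of how an agreement rate and the disagreement-size breakdown above could be computed from a set of paired judgements. The data structure and field names are our illustration only, not the actual CJ Dynamo pipeline.

```python
# Minimal sketch: computing human-AI agreement and the size of disagreements.
# The input format (a list of dicts with hypothetical field names) is
# illustrative only, not the actual judging pipeline.

def analyse_judgements(judgements):
    """Each judgement pairs a human decision with the AI's decision on the
    same pair of scripts, plus the gap (in scale points, on the ~300-700
    scale) between the two scripts' final scores."""
    agreements = [j for j in judgements if j["human_winner"] == j["ai_winner"]]
    disagreements = [j for j in judgements if j["human_winner"] != j["ai_winner"]]

    agreement_rate = len(agreements) / len(judgements)

    # For disagreements, look at how far apart the two scripts ended up on the
    # final scale: small gaps mean the judges disagreed about near-equal work.
    gaps = sorted(j["score_gap"] for j in disagreements)

    def percentile(p):
        return gaps[int(p / 100 * (len(gaps) - 1))]

    return {
        "agreement_rate": agreement_rate,           # e.g. 3118 / 3825 ≈ 0.82
        "median_disagreement_gap": percentile(50),  # 50% of gaps fall below this
        "p90_disagreement_gap": percentile(90),
        "p97_disagreement_gap": percentile(97),
    }
```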
What about those bigger disagreements?
We have reviewed those bigger disagreements and feel that they are the result of human error, which all forms of human judgement are subject to. Here is the biggest human-AI disagreement.
We have highlighted this big disagreement, but it is literally a one-in-a-thousand occurrence. These kinds of big disagreements are a tiny fraction of the overall judgements. Overall, we are very happy with the performance of both the AI and the human judges, and have faith in the results they have produced.
Did students improve their writing?
Having established the reliability of the judging, we can now turn to what the results tell us about the students’ writing. This task was our first ever assessment of redrafting skills. Of our participating schools, 56 (with 2,651 students) had also taken part in the original assessment back in March. Those students were given a mix of human and AI feedback, and then redrafted their work in response.
Overall, their average score improved from 545 on the first assessment to 557 on this redrafting assessment.
How big a deal is a 12-point improvement? It’s an effect size of 0.3, which is generally seen as meaningful. On our writing age conversion, it translates to an 11-month improvement, and only 2 months elapsed between the assessments.
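As a rough back-of-the-envelope check of that effect size: an effect size is the score change divided by the spread of scores. The standard deviation of roughly 40 scale points used below is our assumption for illustration; it is not reported in this post.

```python
# Back-of-the-envelope effect size check. The standard deviation of ~40 scale
# points is an assumption for illustration; it is not reported in the post.
first_draft_mean = 545
redraft_mean = 557
assumed_sd = 40  # hypothetical spread of scores on the ~300-700 scale

improvement = redraft_mean - first_draft_mean  # 12 points
effect_size = improvement / assumed_sd         # 12 / 40 = 0.3
print(f"Improvement: {improvement} points, effect size of about {effect_size:.1f}")
```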
In future posts we will share examples of what this improvement looks like. We have already written a qualitative review of two students’ work here.
For now, there are two big caveats.
One: that 12-point improvement was not evenly distributed amongst the students. 62% of the students improved their score - but 38% regressed! Clearly some students were able to respond well to feedback - but others were not.
Two: will students sustain this improvement into their next assessment? This assessment has definitely shown that on average, students can improve their writing in response to feedback. But has their thinking improved? Has their writing skill improved such that in their next writing assessment they will sustain this improvement? This is something we would like to monitor going forward, but the Year 6 students who took part in this project will be moving to secondary school soon, which makes it tricky. We will look to follow up in other ways with future assessments.
If you would like to learn more about these results, we have a webinar on Wednesday 4th June going through them in more detail. If you’d like to learn more about our future CJ-AI assessments, we have an intro webinar on Monday 23rd June. You can register for them here.