How reliably can we judge writing?

Predictive values suggest teachers are able to judge writing with a high level of precision

Apr 20, 2024

Since 2019 we have run large-scale assessments of writing involving thousands of pupils in the UK and abroad. All the pupils write to the same prompts under controlled conditions. The writing is then judged by their teachers using comparative judgement to create scales of achievement that can be tracked over time. We now have a longitudinal set of data that allows us to examine how well this system works.

Typically the quality of an assessment is measured by its reliability, but measures of reliability cannot be readily compared between assessments as they depend on the scales used. A good alternative is to use the predictive value of scores. We can consider how well the scores pupils achieve in one assessment predict their achievement in the next assessment.
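As a minimal sketch of the idea, the snippet below treats the predictive value as a Pearson correlation between the scores the same pupils achieve in two consecutive assessments. That reading is an assumption on our part for illustration, and the function name pearson and the scores are hypothetical, made up for the example rather than taken from the assessment data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scaled scores for the same six pupils in consecutive years
# (illustrative numbers only, not real assessment results)
earlier_scores = [412, 455, 430, 478, 401, 466]
later_scores = [448, 470, 462, 505, 451, 480]

print(round(pearson(earlier_scores, later_scores), 2))
```

A value near 1 would mean pupils keep much the same rank order from one assessment to the next, while a value near 0 would mean the earlier scores say little about the later ones.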

Since 2019 we have found the predictive value of our assessments to stay fairly stable, with higher values for the younger pupils. The highest value, 0.76, is for predicting year 3 scores (pupils aged 7-8) from year 2 scores (pupils aged 6-7). The lowest value, 0.63, is for predicting year 6 scores (pupils aged 10-11) from year 5 scores (pupils aged 9-10).

Predictive values of writing assessments

Some care must be taken with directly comparing these values, however, as the time difference between the assessments varies.

Average predictive value and time difference

Year 3 and year 5 have the same time difference, with the higher correlation for the younger pair, while year 1 has a longer time difference than year 4, but a higher correlation. It would seem reasonable to conclude, therefore, that the predictive value is higher for younger pupils than older pupils.

Why does predictive value fall as the pupils get older?

Returning to our original question of how well the assessments are functioning, we can infer from the predictive values that the assessment is more precise for younger year groups than for older ones. As pupils get older and their writing becomes more sophisticated, more factors come into play in any decision on the quality of the writing. We know teachers take longer to judge the older pupils’ writing, which suggests that there are more factors to consider in their judgements, and this leads to less consensus.

How good are these predictive values?

For single snapshots of writing taken on a single day in response to a specific prompt, these predictive values are high, suggesting that teachers using comparative judgement are able to discern qualities in the writing of even very young pupils that predict future achievement well.

No More Marking
