If you use Comparative Judgement for an assessment, it will provide you with a very reliable and robust measurement scale. Here is an example: if you comparatively judge 100 pieces of student writing, you will get a scaled score for each one.
At this point, the most common question people have is this: how does that scaled score match up to a national standard? If a student has a scaled score of 520, what does that mean?
We work in lots of different countries, and people typically want some idea of how the scaled score matches up to whatever the national grades are in that country or jurisdiction. Teachers in England often want to know how the scaled score matches up to a GCSE grade, teachers in Australia want to know how it matches to a NAPLAN band, teachers in different American states want to know what “proficient” and “developing” are.
There are two ways we can set a standard. However we choose to set the standard, it’s then fairly straightforward to maintain it.
Option one for setting the standard: use national statistics
Our national assessment projects often have thousands of students taking part from hundreds of schools. With numbers this large, we can use national statistics to set the standard.
Here’s a simplified explanation of how it works.
First, we check to see if our participating schools are nationally representative by looking at their performance on national government tests. (In our big projects, our participating schools nearly always are representative.)
Second, if they are representative, we award the same proportion of grades to our students as on the national test.
Here’s a simple worked example: if 2.5% of students nationally get Grade 9 at GCSE, we give the top 2.5% of students in our assessments Grade 9.
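The proportion-matching step can be sketched in a few lines of Python. The scores and percentages below are illustrative, not real assessment data:

```python
def grade_cutoff(scores, top_fraction):
    """Return the lowest scaled score that still falls inside the top
    `top_fraction` of the distribution (e.g. 0.025 for the top 2.5%)."""
    ranked = sorted(scores, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return ranked[n_top - 1]

# Illustrative cohort: 200 evenly spaced scaled scores from 400 to 599.
scores = list(range(400, 600))
cutoff = grade_cutoff(scores, 0.025)  # top 2.5% of 200 students = top 5
```

Every student scoring at or above the returned cutoff would receive the top grade, so the proportion awarded always matches the national figure.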
We use this approach for most of our big assessments. It means that our grading is never harsher or more generous than the national system.
Option two for setting the standard: use human judgement
What if you don’t have good national statistics, or if you’re setting an assessment for the first time and want to set a new standard?
The other option is to use human judgement to draw a line on the distribution. Ask your teachers: where do you think the standard should go? They won’t all agree, but you can take an average.
We used this approach in an assessment we ran in March with some schools in New Zealand. We needed to set two standards: level 2 and working towards level 3. Here’s what our teachers said.
Based on this, we set level 2 at a scaled score of 503 and working towards level 3 at a scaled score of 592.
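Combining the teachers' suggestions by taking the mean is the simplest version of this. A minimal sketch, with made-up numbers rather than the actual New Zealand responses:

```python
from statistics import mean

def set_cut_score(suggestions):
    """Combine teachers' proposed cut scores by averaging them."""
    return mean(suggestions)

# Illustrative teacher suggestions for a level 2 cut score.
level_2 = set_cut_score([500, 510, 499])
```

In practice you might prefer the median, which is less sensitive to one teacher with an unusually harsh or generous view.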
Maintaining the standard over time
Which approach is better? Actually, it doesn’t matter too much how you set the standard. What really matters is how you maintain the standard going forward. How do you make sure that the standard chosen in that first year is held constant for subsequent years? Neither statistics nor human judgement is great for this. If you use statistics, you risk ending up with grade quotas that don’t recognise genuine improvement (or decline!) across a cohort. If you use human judgement, you risk very inconsistent results and wild swings from one year to the next.
But this is where Comparative Judgement can be very useful. With Comparative Judgement, you can easily link assessments over time by pulling forward a sample of responses from one assessment to the next. This allows us to place all our assessments on the same consistent scale, and therefore to hold the standard constant too. Indeed, our research has led to pilots of using Comparative Judgement to maintain standards in national examinations in England. Cambridge Assessment describe here how Comparative Judgement was used live in examinations for the first time in autumn 2020 to maintain standards.
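One standard way to use those pulled-forward anchor responses is mean/sigma linear linking: rescale the new assessment so the anchors have the same mean and spread they had originally. This is a hypothetical sketch of that general technique, not necessarily the exact linking procedure used in these projects:

```python
from statistics import mean, stdev

def link_scales(anchor_old, anchor_new):
    """Mean/sigma linear linking. Given the anchor responses' scaled scores
    from the original assessment and from the new one, return a function
    that maps any new-assessment score onto the original scale."""
    slope = stdev(anchor_old) / stdev(anchor_new)
    intercept = mean(anchor_old) - slope * mean(anchor_new)
    return lambda score: slope * score + intercept

# Illustrative anchors: the new assessment happened to score everything
# 10 points lower, so the transform shifts scores up by 10.
to_old_scale = link_scales([500, 520, 540], [490, 510, 530])
```

Once every new score is mapped onto the original scale, the original cut scores can be applied unchanged.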
In October, we will be running a follow-up assessment with our New Zealand schools. 503 will remain the level 2 standard. 592 will remain the working towards level 3 standard. And we will be able to tell how many more students reached each standard compared with the March assessment.