Can criterion-referencing work?

Or is it reindeer bladders all the way down?

Aug 24, 2023

It’s exam season, and at this time of year people often start to debate the best way of assessing exams. One thing you’ll often hear is that criterion-referencing is the best and fairest way of assessing national exams.

So what is criterion-referencing?

Criterion-referencing

This is when you establish a standard - the criterion - and everyone meeting that standard can get a certain grade or pass mark. Here's a simple example: a bunch of students take a test on fractions. The criterion to achieve the top grade is 'Can compare two fractions to identify which is larger'. If you meet that criterion, you get the top grade.

Criterion-referencing is often maintained to be the best method as it allows for exam performance to be explicitly linked to curriculum content and standards. It rewards the performance of the individual regardless of the performance of their peers.

Many alternative methods of assessment rely on some kind of norm: the performance of the current or a previous group of students. This seems less fair as it means an individual’s performance is defined in terms of what others can do, not in terms of what they can do.
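To make the contrast concrete, here is a minimal sketch of the two approaches. The names, marks, pass mark of 60 and pass share of 50% are all invented for illustration; they do not come from any real exam board.

```python
def criterion_grades(scores, pass_mark):
    """Pass everyone at or above a fixed mark, whatever the rest of the cohort did."""
    return {name: score >= pass_mark for name, score in scores.items()}


def norm_grades(scores, pass_fraction):
    """Pass a fixed share of the cohort, ranked by score (ties keep input order)."""
    n_pass = round(len(scores) * pass_fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    passed = set(ranked[:n_pass])
    return {name: name in passed for name in scores}


# Hypothetical cohorts: one strong year group, one weak year group.
strong_cohort = {"Asha": 82, "Ben": 75, "Cara": 71, "Dev": 64}
weak_cohort = {"Eli": 58, "Fay": 52, "Gus": 47, "Hana": 40}

for cohort in (strong_cohort, weak_cohort):
    print("criterion:", criterion_grades(cohort, pass_mark=60))
    print("norm:     ", norm_grades(cohort, pass_fraction=0.5))
```

With the fixed pass mark, everyone in the strong cohort passes and no one in the weak cohort does. With the fixed share, half of each cohort passes, so a mark of 71 fails in one year while a mark of 52 passes in the other.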

However, criterion-referencing does have serious practical and theoretical problems which make it very hard to implement.

Defining the criteria

The important thing with criterion-referencing is to define the criterion. One common way of defining the criteria is with a list of statements against which a marker assesses the students' work.

For example, take the criterion above: 'can compare two fractions to see which is larger'. An examiner can set a bunch of questions which target this criterion.

This all seems very sensible, but the only problem is that it just doesn't work. Even tightly defined criteria can be interpreted in wildly different ways, as this shows.

These questions are clearly all targeting the same criterion, but they are also quite clearly not of equivalent difficulty.

It feels like this should be an easy problem to solve, but it is not. Examiners basically find it impossible to predict in advance how difficult a particular question will be, and similarly find it impossible to predict in advance what mark on a test should constitute a pass grade.

And remember, 'can compare two fractions to see which is larger' is a relatively precisely defined criterion. How much more of a problem will this be when you are trying to mark extended writing using criteria like 'can use vocabulary with originality and flair'?

The practical history of this kind of criterion-referencing has shown all these problems. When New Zealand adopted a strong criterion-referenced system about 20 years ago, it led to a dramatic fall in pass rates from one year to the next.

Is criterion-referencing even possible?

There are legitimate philosophical debates about whether pure absolute measures of attainment - or anything - are ever really possible. One big cheese in assessment theory, William Angoff, has argued that norms lurk behind most criteria.

There is an analogy with measurement theory in other walks of life. In medieval times, weights and measures were determined by easily available everyday norms - a foot, or a grain, or a stone’s throw, or, amongst the Saami people, the poronkusema - the distance a reindeer can walk before urinating (about 6 miles).

The history of modern measurement is the attempt to ground measurements in something more absolute and less arbitrary. The problem is that every new method you come up with arguably just depends on a new kind of norm. Weights and measures used to be normed against standards kept in special conditions to stop deterioration and degradation.

Nowadays, most measurements are ultimately tied to the second, which is defined by the frequency of radiation from a specific transition in caesium-133 atoms. This is certainly more precise and stable than reindeer bladders, but you can still have a philosophical debate about whether it is just another norm lurking beneath the criterion.

1 Comment

Grant Hillebrand

Sep 27, 2023

Hi Daisy,

In terms of the fraction example, is there not possibly a dimension missing? For want of a better term possibly, I tend to use "cognitive level". Whilst they all do cover the same _topic_, they are different difficulty levels.

The first is easy - level 1 - because the denominators are the same.

The second is a little trickier - level 2 - because, whilst the denominators are different, they are both one away from unity, and thus quite easy to intuit.

The third is hard - level 4 in my semi-arbitrary scale - because you now have to actually normalise the fractions to some common denominator to properly compare them. That is a two step process, and much harder than the previous questions.

In my experience of the UK exams vs our local SA exams, the notion of cognitive level never seems to appear. I think that it helps a lot in setting a fair assessment, particularly if you want to avoid the whole "grading to a curve" idea, and recognise what students actually know.

Grant
