It’s exam season, and at this time of year people often start to debate the best way of assessing exams. One thing you’ll often hear is that criterion-referencing is the best and fairest way of assessing national exams.

So what is criterion-referencing?

**Criterion-referencing**

This is when you establish a standard - the criterion - and everyone meeting that standard can get a certain grade or pass mark. Here's a simple example: a bunch of students take a test on fractions. The criterion to achieve the top grade is 'Can compare two fractions to identify which is larger'. If you meet that criterion, you get the top grade.

Criterion-referencing is often held up as the best method because it allows exam performance to be linked explicitly to curriculum content and standards. It rewards the performance of the individual regardless of the performance of their peers.

Many alternative methods of assessment rely on some kind of norm: the performance of the current or a previous group of students. This seems less fair, as it means an individual’s performance is defined in terms of what others can do, not in terms of what that individual can do.

However, criterion-referencing does have serious practical and theoretical problems which make it very hard to implement.

**Defining the criteria**

The crucial step in criterion-referencing is defining the criterion. One common way of doing this is with a list of statements against which a marker assesses the students' work.

For example, take the criterion above: 'can compare two fractions to identify which is larger'. An examiner can set a bunch of questions which target this criterion.

This all seems very sensible. The only problem is that it just doesn't work. Even tightly defined criteria can be interpreted in wildly different ways, as this shows.

These questions are clearly all targeting the same criterion, but they are also quite clearly not of equivalent difficulty.
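To make the point concrete, here is a minimal sketch in Python. The three items are hypothetical, invented purely for illustration, and are not the questions from the original example; the helper names `larger` and `item_type` are likewise made up. Every item fits the wording of the criterion, yet each calls on a different kind of reasoning.

```python
from fractions import Fraction

# Hypothetical items, invented for illustration only.
# Each fraction is written as (numerator, denominator).
ITEMS = [
    ((3, 7), (5, 7)),  # same denominator
    ((5, 7), (5, 9)),  # same numerator
    ((2, 3), (3, 5)),  # numerators and denominators both differ
]


def larger(a, b):
    """Return whichever of the two fractions is larger."""
    return a if Fraction(*a) > Fraction(*b) else b


def item_type(a, b):
    """Describe the structure of a comparison item."""
    if a[1] == b[1]:
        return "same denominator"
    if a[0] == b[0]:
        return "same numerator"
    return "unlike fractions"


for a, b in ITEMS:
    answer = larger(a, b)
    print(f"{a[0]}/{a[1]} vs {b[0]}/{b[1]}: "
          f"{item_type(a, b)}, larger is {answer[0]}/{answer[1]}")
```

All three items 'compare two fractions to identify which is larger', so the criterion cannot tell them apart. How hard each one is for real students is an empirical question that the wording of the criterion does not answer.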

It feels like this *should* be an easy problem to solve, but it is not. Examiners basically find it impossible to predict in advance how difficult a particular question will be, and similarly find it impossible to predict in advance what mark on a test should constitute a pass grade.

And remember, 'can compare two fractions to identify which is larger' is a relatively precisely defined criterion. How much more of a problem will this be when you are trying to mark extended writing using criteria like 'can use vocabulary with originality and flair'?

The practical history of this kind of criterion-referencing has shown all these problems. When New Zealand adopted a strong criterion-referenced system about 20 years ago, it led to a dramatic fall in pass rates from one year to the next.

**Is criterion-referencing even possible?**

There are legitimate philosophical debates about whether pure absolute measures of attainment - or of anything - are ever really possible. One big cheese in assessment theory, William Angoff, argued that norms lurk behind most criteria.

There is an analogy with measurement theory in other walks of life. In medieval times, weights and measures were determined by easily available everyday norms - a foot, or a grain, or a stone’s throw, or, amongst the Saami people, the *poronkusema* - the distance a reindeer can walk before urinating (about 6 miles).

The history of modern measurement is the attempt to ground measurements in something more absolute and less arbitrary. The problem is that every new method you come up with arguably just depends on a new kind of norm. Weights and measures used to be normed against standards kept in special conditions to stop deterioration and degradation.

Nowadays, most modern measurements are ultimately tied to the second, which is defined by the frequency of a particular transition in caesium-133 atoms. This is certainly more precise and stable than reindeer bladders, but you can still have a philosophical debate about whether it is just another norm lurking beneath the criterion.