Regular readers of this Substack may know that I have recently written a book about VAR – the video assistant referee system used in football.
One of the arguments I make is that there are quite a few overlaps between the kind of decision-making needed in education and assessment, and that required by football referees.
One important concept that’s relevant to both is the idea of trade-offs. For years, before VAR was introduced, we would hear football fans and players complain about bad refereeing decisions. The phrase you’d hear a lot was “we just want more right decisions”, and that was the justification for introducing technological support to referees.
Except, of course, within just a few games of VAR’s introduction, we suddenly realised that whilst we wanted “more right decisions”, we wanted a lot of other things too, some of which we maybe hadn’t even realised were important. Rio Ferdinand put this well back in 2019, when he argued that VAR’s quest for more right decisions was compromising another crucially important part of football – its simplicity.
If you add too much to this game you are going to detach it from the everyday man or woman coming in to the game. Look at American sports - a lot of people outside America wouldn't watch it because they think there's too much going on and too many rules. That's the beauty of our game - it's simple, everyone can play it. That's starting to change now and I hope it doesn't go too far.
The longest VAR check for offside took nearly six minutes, and the natural flow of football has been interrupted by all of the checks.
Trade-offs and the worst of both worlds
These trade-offs are everywhere in life, not just in football – to the extent that one economist has argued that there are no solutions, only trade-offs.
When you are thinking about trade-offs, you need to be honest about what you are willing to give up in order to get something new. How much time are you willing to spend to get an accurate offside decision? Fans who care a lot about accuracy but not so much about the flow of the game might turn up the accuracy dial and say they don’t mind waiting for ten minutes. Others will say ten seconds is too much. But everyone will have a limit. I doubt there is any football fan who thinks that ten hours, for example, is an acceptable amount of time to spend on making an offside decision.
You also want to avoid the situation where you end up giving up something you value and gain nothing in return. This is the “worst of all worlds” scenario that many people would argue we have ended up in with VAR. We have definitely given up a lot of the game’s simplicity and spontaneity, but how much accuracy has been gained in return? VAR has led to a spate of very marginal handball and offside decisions that would never have been given in the pre-VAR game. We have traded off simplicity and spontaneity, and have gained… lots of baffling decisions!
So what does this mean for assessment?
We have been posting a series of articles on designing the perfect assessment system, but if you accept the argument above, then perhaps there is no perfect assessment system. There are a number of different end goals we want to achieve which are all in tension. We have to try and find some kind of balance between all these competing goals. Here are two important assessment trade-offs that I think we have to take seriously.
The reliability-authenticity trade-off
Reliable assessments are consistent. They are like a functioning set of kitchen scales: if you weigh a bag of flour ten times, a reliable set of scales will return the same measurement each time. Closed questions with just one right or wrong answer are typically very reliable assessments. They are easy for humans or even machines to mark consistently.
However, some of the things we want to assess don’t easily lend themselves to closed questions. Writing is an obvious example. We want to be able to set extended written tasks for students, but these are much harder to mark reliably and typically teachers will not reach perfect agreement about the quality of a piece of writing.
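As a rough illustration of what reliability looks like in numbers (this example is mine, with invented marks, not something from an exam board): one simple measure is how often two markers award exactly the same mark to the same scripts.

```python
# Illustrative only: exact-agreement rate for two hypothetical markers
# on a closed question versus an extended written task.

closed_q = {  # marks out of 1 for ten scripts
    "marker_a": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "marker_b": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],  # closed question: the markers agree
}

essay_q = {  # marks out of 10 for the same ten scripts
    "marker_a": [7, 5, 8, 6, 4, 9, 6, 5, 7, 8],
    "marker_b": [6, 5, 9, 5, 4, 7, 6, 6, 8, 8],  # open task: judgements drift apart
}

def exact_agreement(marks):
    pairs = list(zip(marks["marker_a"], marks["marker_b"]))
    return sum(a == b for a, b in pairs) / len(pairs)

print(f"Closed question agreement: {exact_agreement(closed_q):.0%}")   # 100%
print(f"Extended writing agreement: {exact_agreement(essay_q):.0%}")   # 40%
```

Exam boards use more sophisticated statistics than a raw agreement rate, but the underlying idea is the same: the more open the task, the more the markers’ judgements diverge.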
It is very easy to end up with assessment questions that are the worst of both worlds: questions that give up reliability to achieve authenticity, and end up with neither.
I think the last round of assessment reforms introduced quite a few of these styles of question into the Maths GCSE. Here’s an example of what I mean. This question requires a short written response for just 2 marks.
I have seen maths teachers spend ages debating whether different responses to this question deserve 0, 1 or 2 marks. I have also seen teachers and revision guides getting students to memorise the “perfect” form of words to answer this question.
I think these kinds of short written responses are the worst of both worlds. You give up the reliability of a proper closed question, because teachers don’t agree on what mark the responses should get. But you haven’t gained the authenticity of a genuine open-ended written response, because a student is taught to memorise a form of words that ticks the mark scheme.
I’d manage the reliability-authenticity trade-off in a very different way.
I’d have exams consisting of two very different styles of question. I’d have a lot of multiple-choice and short answer questions, which are designed to maximise reliability, curriculum coverage, and speed of assessment.
I’d then have a couple of completely open extended written questions, marked using Comparative Judgement, which would let you dispense with the kind of detailed rubric that encourages stereotyped responses.
The combination of these two styles of assessment would be like the way wing mirrors operate on a car. Each wing mirror gives you a different view, and combining the two together gives you the best view of reality.
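For readers curious about the mechanics of Comparative Judgement: the usual idea is that markers make lots of quick “which of these two pieces is better?” decisions, and a statistical model (commonly Bradley-Terry) turns those pairwise judgements into a single ranked scale. The sketch below is my own toy illustration with invented scripts and judgements, not the software an exam board or judging platform would actually use.

```python
# Toy Bradley-Terry fit: turns pairwise "A was judged better than B"
# decisions into a strength score for each script. Data is invented.

from collections import defaultdict

judgements = [  # (winner, loser) pairs from repeated paired comparisons
    ("script_A", "script_B"), ("script_A", "script_C"), ("script_B", "script_C"),
    ("script_A", "script_D"), ("script_C", "script_D"), ("script_B", "script_D"),
    ("script_B", "script_A"), ("script_C", "script_B"), ("script_D", "script_C"),
]

scripts = sorted({s for pair in judgements for s in pair})
wins = defaultdict(int)      # total comparisons won by each script
matches = defaultdict(int)   # number of head-to-head comparisons per pair
for winner, loser in judgements:
    wins[winner] += 1
    matches[frozenset((winner, loser))] += 1

# Iterative update for Bradley-Terry strengths: each script's strength is
# its wins divided by a weighted count of the comparisons it took part in.
strength = {s: 1.0 for s in scripts}
for _ in range(100):
    new = {}
    for s in scripts:
        denom = sum(
            matches[frozenset((s, t))] / (strength[s] + strength[t])
            for t in scripts if t != s and matches[frozenset((s, t))]
        )
        new[s] = wins[s] / denom
    total = sum(new.values())
    strength = {s: v * len(scripts) / total for s, v in new.items()}  # normalise

# Higher strength = judged better overall; this ordering is the rank order.
for s in sorted(scripts, key=strength.get, reverse=True):
    print(f"{s}: {strength[s]:.2f}")
```

The point of the sketch is simply that the final scores come from many quick holistic comparisons rather than from a detailed mark scheme.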
The simplicity-perfection trade-off
Suppose I got my way, and the upcoming curriculum review decided that we should revamp every GCSE exam along the lines outlined above. Would that be a good thing?
If you were just focussing on building the perfect assessment system, then yes, I think it would.
But the assessment system exists within a wider educational system, and within wider society. Currently, in England, we have severe problems with recruitment and retention of teachers, and with rising numbers of students being diagnosed with special educational needs (SEN).
Again, we have a trade-off. Designing a perfect system will require a lot of time and effort. If the system doesn’t have the capacity for that time and effort, the perfect system may not be possible to implement. We might be better off with a simpler and less complicated reform that requires less effort.
That’s why I think we probably need to tread cautiously and focus our reform efforts in the places where the current system is causing the most problems. In future posts we will outline a couple of case studies where we think a minimum of reform could lead to maximum benefits.
I love the observation about the 2-mark explanation question. It seems like whenever we're trying to do too many things at once (in this case, marking reliability and authenticity of the responses) we often end up with the worst of both worlds.
The same thing can be said about subject knowledge and inquiry/problem-solving/communication questions. Perhaps they should be evaluated separately, so that a student's lack of communication skill doesn't prevent them from getting full marks on the subject knowledge part? But then we have the trade-off between the length of exams and the validity of the results.
What do you think, Daisy?
I often struggled with this when I had to mark exams or write questions (thankfully, I generally don't anymore - though marking theses is no easier!). It seems hard to develop an objective and fair marking schedule for questions, even when they're not very open-ended (for example, writing out a small snippet of code or several sentences of explanation). Inevitably, when marking, I'd change how I marked something as I progressed through the answers.
It really seems like no one talks about it either, especially in higher education. It's accepted that you make a very approximate effort and offer no feedback - just as long as your grades are approximately OK and most people pass!