Are exams getting easier? Are grades being inflated? Are the kids of today doing better or worse than ever before?
Too often this debate turns into a slanging match between the generations.
Is there any data we can bring to bear on these questions?
Why monitoring standards over time is hard
Working out the relative difficulty of different exam papers from different eras is actually surprisingly hard to do, for a variety of reasons.
It's difficult to tell just by looking at a question or an exam paper exactly how hard it is.
Even if you do look at two exams and decide that one is much harder than the other, that gut-instinct judgement doesn't tell you much, because it may well be that students did very badly on the hard exam and much better on the easier one.
So it's not enough to just look at the exam paper in isolation: you need data on how students responded to the paper.
And when you are comparing exams over time, the content and syllabus often change - sometimes in ways that make it extremely hard to make a fair comparison.
How Comparative Judgement can help
Is there any way around these problems? In 2016, my colleague Chris Wheadon and three co-authors published a paper that addressed exactly this issue. It uses some very clever techniques to solve these problems, and won the 2017 British Educational Research Journal’s Editor’s Choice award.
They analysed an exam where the content had not changed too dramatically - pure mathematics A-level papers. They used 66 archive responses to exams from 1964, 1968, 1996 and 2012. These archive responses were ones that had been graded at A, B or E. They split each archive response into individual question responses, giving a total of 546 question responses.
They then used Comparative Judgement to assess the quality of each question response. Comparative Judgement is an innovative assessment technique which relies on the fact that humans are much better at making comparative judgements than absolute ones. So instead of looking at individual questions and deciding how hard they were, judges were given pairs of responses and had to say “which student you think is the better mathematician”. Here’s an example of a comparison.
The judges were all PhD maths students who had passed a maths test in order to be able to judge. The overall reliability of their judging was acceptably high.
At the end of this process, all 546 question responses were given a score, and then all of the original exam papers could be assigned an overall score by aggregating each individual response score.
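The paper’s statistical machinery isn’t reproduced here, but Comparative Judgement studies typically turn pairwise decisions into a scale by fitting something like the Bradley-Terry model. Here is a minimal sketch of that idea, using a handful of invented judgements rather than the study’s data (the function name, the iteration count and the example data are all just illustrative choices):

```python
import numpy as np

def bradley_terry(n_items, judgements, n_iters=200):
    """Turn pairwise 'winner beats loser' judgements into a score per item."""
    wins = np.zeros(n_items)                     # total wins per item
    met = np.zeros((n_items, n_items))           # how often each pair was compared
    for winner, loser in judgements:
        wins[winner] += 1
        met[winner, loser] += 1
        met[loser, winner] += 1

    strengths = np.ones(n_items)
    for _ in range(n_iters):                     # standard MM (Zermelo) iteration
        denom = np.zeros(n_items)
        for i in range(n_items):
            for j in range(n_items):
                if met[i, j] > 0:
                    denom[i] += met[i, j] / (strengths[i] + strengths[j])
        strengths = wins / denom
        strengths /= strengths.sum()             # fix the overall scale
    scores = np.log(strengths)
    return scores - scores.mean()                # centre the scale on zero

# Invented judgements over 4 question responses (indices 0-3):
# each tuple means "the first response was judged better than the second".
judgements = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 1), (2, 0)]
print(bradley_terry(4, judgements))
```

Each response ends up with a score on a common scale, and the 546 question-response scores can then be aggregated into paper-level scores in the way the authors describe.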
The results
We are now able to compare the standards from different years. Here are the key findings.
Standards did not change much between 1964 and 1968.
Standards declined between 1968 and 1996. The quality of work that would have produced an E in 1968 would have got a B in 1996.
Standards did not change much between 1996 and 2012.
So what is happening?
Chris and his co-authors put forward some suggestions as to why this might have happened. I’d like to suggest a reason they don’t consider: the expansion in numbers of students taking maths A-level. A-levels in general were taken by about 5-10% of students in the 1960s. Today, about half the cohort take them, and a lot of that increase happened between 1968 and 1996. Keeping the gradeset at the same standard after such an expansion in numbers would mean a large chunk of students failing the exam.
So does this prove kids of today just aren’t doing as well as kids in the past?
At the start of this article, I posed three questions.
Are exams getting easier? Are grades being inflated? Are the kids of today doing better or worse than ever before?
These questions are all related, but they are not the same. I think this paper shows that grade inflation, for maths A-level at least, is real. The standard represented by a B grade in maths in 1996 was not as demanding as the standard represented by a B grade 30 years earlier. The currency represented by each grade has been devalued.
In the long run, are we all doing better?
However, this does not necessarily mean that the kids of today are doing better or worse than in the past. To see why, think of an analogy with the economy. The pound is now not worth nearly as much as it was 100 years ago. But the UK economy has grown enormously in that time and we are much wealthier on average.
Here’s a worked example of how this might work for exams.
Imagine an exam that in Year 1 has a very tough gradeset - only 15% pass.
Imagine that in Year 2 there is no change to underlying attainment, but the markers decide to change the gradeset and make it less tough. Now all the pupils pass, and the standard that would only have got you a C in Year 1 will now get you an A! Textbook grade inflation.
But now look at Year 3. In Year 3, underlying attainment has improved significantly - by an entire standard deviation. We retain the inflated gradeset of Year 2.
Thus, when we compare Year 3 with Year 1, the following two things can be true: the cohort are much better at maths in Year 3 compared to Year 1, and the grades represent a lower standard in Year 3 compared to Year 1.
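To make that concrete, here is a tiny simulation of the scenario (all the numbers - the cohort size, the attainment scale, the pass marks - are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Underlying attainment for the Year 1 / Year 2 cohort: purely invented numbers.
cohort = rng.normal(loc=50, scale=10, size=10_000)

# Year 1: a very tough gradeset - the pass mark is set so only ~15% pass.
year1_pass_mark = np.percentile(cohort, 85)

# Year 2: no change to attainment, but a far more lenient gradeset - nearly everyone passes.
year2_pass_mark = np.percentile(cohort, 1)

# Year 3: attainment improves by a full standard deviation; Year 2's gradeset is kept.
year3_cohort = cohort + 10
year3_pass_mark = year2_pass_mark

print(f"Year 1 pass rate: {np.mean(cohort >= year1_pass_mark):.0%}")
print(f"Year 2 pass rate: {np.mean(cohort >= year2_pass_mark):.0%}")
print(f"Year 3 pass rate: {np.mean(year3_cohort >= year3_pass_mark):.0%}")
print(f"Mean attainment, Year 3 vs Year 1: {year3_cohort.mean():.1f} vs {cohort.mean():.1f}")
print(f"Standard a pass represents, Year 3 vs Year 1: {year3_pass_mark:.1f} vs {year1_pass_mark:.1f}")
```

The pass rate and the cohort’s average attainment both rise between Year 1 and Year 3, while the standard the pass mark represents falls.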
The above is a hypothetical worked example. I don’t think it is what actually happened. I think it’s quite hard to establish what has happened to underlying attainment in the past half century or so, but my best guess is it has probably stayed about the same.
There are other analogies here with economics. Keynesians and Hayekians love to argue about whether inflation boosts or hinders the economy, and you could imagine a similar argument about whether grade inflation motivates or demotivates students.
But I don’t want to weigh in on that here. What I want to show with the above is that grades can be very misleading. They are essentially layered on top of underlying attainment scales, and yet very often we take them as direct measures of underlying attainment. We’ve written in more depth about the distortions this can cause here.
I would like to see some of the bigger national exams reported on a consistent underlying scale, which would make it easier to see what was happening to standards year on year. Comparative Judgement could play a part in making that happen.
I think I understand this. Probably. Comparing grades longitudinally is not the ‘real measure’ of actual attainment over time. Comparative Judgement is a much more precise tool. Thank you.
Great post Daisy, now let me play Devil’s Advocate for a moment. It might be tricky to compare and draw conclusions as you’ve done here. But can we conclude that student achievement has declined or improved based on standardized assessments like PISA? Can we also go out on a limb and suggest that at least 40-60 years ago, the content on exams was very different because a larger cohort quit school at a younger age to work and provide for their families? In Canada this was very prevalent. My grandparents and father-in-law all quit school at 12 or 14 because their families needed the money. Yet when it came to maths, nobody was smarter than my Grandfather.