Many new Large Language Models (LLMs) are very good at taking exams. Over the past few months, there's been a steady stream of reports about LLMs acing all kinds of assessments, from open-ended writing tasks to tough post-grad exams. We ourselves have contributed to this literature, showing that ChatGPT outstripped the writing of tens of thousands of eight-year-olds.
It can apparently also pass legal exams, medical exams, business school exams and even exams for sommeliers!
Some of these exams are tough creations that most humans fail. And many of them are gatekeepers to professions: when humans do pass them, they get to become lawyers / doctors / sommeliers. So surely if a computer model is passing them, that means it too is capable of becoming a lawyer / doctor / sommelier?
Not so fast.
Test scores don't matter
In actual fact, when humans take exams, we place certain restrictions on the conditions under which they take them.
If a human cheats on the exam by bringing in the answers, viewing the paper the night before, or paying someone to take the test in their name, then it doesn't matter what score they get on the test. They will be disqualified.
This might seem so obvious it is barely worth mentioning, but it gets to the heart of one of the most fundamental principles of assessment - which is that the test score by itself is totally meaningless. What matters are the inferences that you can make from this test score.
It doesn't matter if you get every question right. It doesn't matter if you get every question wrong. What matters are the inferences that an end-user can draw from those results. Here’s the Harvard professor of educational assessment, Daniel Koretz, making this point in his book Measuring Up.
Test scores reflect a small sample of behavior and are valuable only insofar as they support conclusions about the larger domains of interest. This is perhaps the most fundamental principle of achievement testing.
There are a number of things which can obviously compromise those inferences, and the kind of cheating I have outlined above is the most obvious example.
But there are a bunch of more subtle issues which test designers obsess about too. If a student does poorly on a maths exam with lots of word questions, can you infer they are bad at maths, or only that they are bad at language? If they do well on a maths exam with no word questions, can you infer that they are good at applying maths to real-world problems? If they do well on a maths test after receiving intensive test prep and sitting a dozen very similar past papers, can you infer that they will do better on an advanced maths course than a pupil who got a lower score but did no prep?
These are all difficult questions, but they are ones that good test designers try to find answers for.
So what has all this got to do with AI?
If we apply the same standard to AI, then the fact that it is acing exams is essentially meaningless. In order to validate its results, we need to know more about how they are achieved, and therefore which inferences the results can and cannot support.
So how do LLMs pass these tests? Ultimately, we just don’t know the full details, and that in itself means it is hard to support any inferences at all. There are persistent suggestions that LLMs ace these tests because they are, essentially, cheating - some of the tests they are passing are widely available, together with their answer sets, in the data sets they are trained on. But even if that is not the case, until we know more about how LLMs pass these tests, the fact they can do so doesn't tell us anything useful. It certainly doesn't mean they can work unaided as lawyers or doctors.
Koretz, who I quoted above, goes on to say that this fundamental principle of inference is perhaps one of the most poorly understood aspects of assessment.
A failure to grasp this principle is at the root of widespread misunderstandings of test scores. It has often led policymakers astray in their efforts to design productive testing and accountability systems. And it has also resulted in uncountable instances of bad test preparation by teachers and others, in which instruction is focused on the small sample actually tested rather than the broader set of skills the mastery of which the test is supposed to signal.
The failure to grasp this principle extends beyond education. AI researchers have rushed to write peer-reviewed articles about how LLMs passing exams is a watershed moment of huge significance - without even bothering to do the most basic research on what those exams are and what they can and cannot tell you.
This principle is also a reason why we should be wary about humans using LLMs in assessments. We’ve written about the impact of LLMs on human assessment here and will return to the issue again in future articles.
So have we moved the goalposts?
There's a big debate between AI boosters and sceptics about the right way to assess AI performance. AI boosters will often claim that AI sceptics 'move the goalposts'. That is, the sceptics set up task X as a benchmark and claim that AI will never be able to achieve it. Then, when AI does achieve it, they change their mind and say 'oh well, task X is really trivial, of course I never thought it was that important. But you will never find an AI that can achieve task Y!' And then of course it turns out AI can solve task Y, and we begin all over again.
I have some sympathy with this argument. I have written in the past about how we have to be careful about not defining creativity as 'something humans can do and computers can't'. I've also written about the flaws and biases in human judgement, and how simple statistical algorithms can outperform even expert humans.
But I hope you can see that in this case, I am absolutely not moving the goalposts. The goalpost in question here is a principle that forms the fundamental basis of all assessment, and has done for decades. And it's a goalpost that I would apply equally to a human test-taker.
What would be a better benchmark?
When we want to find out if someone is suited for a job, then a well-designed and well-validated test can be very useful. But no test will perfectly predict job performance, and so another very useful method is a probationary period. Get someone to do the job for a few months to see if they are any good at it! This is a more direct measure that sidesteps many of the complex questions about validation.
Personally, I would be much more in favour of evaluating LLMs in this way. What would happen if we did that?
Plenty of types of AI do well on this metric. For example, at No More Marking we’ve integrated an AI tool into our systems that automatically spots when a teacher has accidentally uploaded a blank sheet of paper, instead of a student response. AI is really good at this job - much quicker and more accurate than humans.
But other types of AI don’t do so well. We’ve shown that ChatGPT is not as good as humans at assessing writing. Attempts to actually use LLMs to do legal work - as opposed to passing legal exams - have not ended well.