What prompts did you give ChatGPT to mark with? The prompt you use is fundamental to getting GPT to do anything for you, especially something as complex as marking. I'm a little surprised you saw worse performance with GPT-4, given that every piece of research I've seen indicates it performs better than 3.5 in pretty much every task.
Yes, a good question. We have tried a great variety of essays, prompts and mark schemes and have actually found very little substantial difference in the marks produced when you modify the prompt. You get more variation when you modify the mark scheme, but we still haven't found any combination of mark schemes and prompts that works. A simple 'What would you mark this essay and why?', alongside a levels-of-response mark scheme, works pretty much as well as any other prompt. You can add in some context on the question and the level of the pupils, but it has little impact. I think we are up against a fundamental theoretical failing of LLMs: they are essentially good at reproducing patterns, but have no analytical power. Marking requires a complex set of analyses and weighted decisions that no form of AI has yet been able to reproduce in a reliable and valid manner. I'm not surprised that GPT-4 performed no better than GPT-3.5, as OpenAI report no improvement over GPT-3.5 on the AP English examinations: https://openai.com/research/gpt-4
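For anyone who wants to try this kind of test themselves, a minimal sketch of that sort of prompt, sent to the chat completions endpoint, looks something like the following. The mark scheme, essay and model name here are illustrative placeholders, not the exact materials we used.

```python
# Minimal sketch of a 'mark this essay' prompt sent to the OpenAI chat
# completions endpoint. The mark scheme, essay and model name below are
# illustrative placeholders, not the exact materials used in our trials.
import os
import requests

mark_scheme = """Level 3 (7-9 marks): sustained, well-structured argument ...
Level 2 (4-6 marks): some relevant points, unevenly developed ...
Level 1 (1-3 marks): limited, fragmentary response ..."""

essay = "The pupil's essay text goes here."

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4",  # or "gpt-3.5-turbo"
        "temperature": 0,  # reduces, but does not eliminate, run-to-run variation
        "messages": [
            {"role": "system", "content": "You are an experienced English examiner."},
            {"role": "user", "content": (
                "Here is a levels-of-response mark scheme:\n" + mark_scheme
                + "\n\nWhat would you mark this essay and why?\n\n" + essay
            )},
        ],
    },
    timeout=120,
)

print(response.json()["choices"][0]["message"]["content"])
```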
You're right that the initial publication says that, but that research was done before a number of best-practice prompting techniques like chain-of-thought, Reflexion and DERA were published. Are you familiar with them / did you use any of them in your prompting? I don't have the large datasets to test on, as you do, but in my limited testing GPT-4 is so much more capable than it appears on the surface.
Chain-of-thought reasoning: https://arxiv.org/abs/2305.02897
Reflexion: https://arxiv.org/abs/2303.11366
DERA: https://arxiv.org/abs/2303.17071
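To give a flavour of what I mean, a Reflexion-style marking loop might look something like the sketch below. This is purely my own illustration, with placeholder prompts and a stubbed ask_gpt helper, not a tested recipe.

```python
# Illustrative sketch only: a Reflexion-style marking loop.
# `ask_gpt` is a hypothetical helper wrapping a chat-completion call;
# the prompts and number of rounds are placeholders, not a tested recipe.

def ask_gpt(prompt: str) -> str:
    """Send a single prompt to the model and return its reply (stubbed here)."""
    raise NotImplementedError("wire this up to your chat-completion client")

def mark_with_reflexion(essay: str, mark_scheme: str, rounds: int = 2) -> str:
    # First pass: chain-of-thought style, working through the scheme criterion by criterion.
    answer = ask_gpt(
        "Work through this mark scheme one criterion at a time, "
        "quoting evidence from the essay, then give a final mark.\n\n"
        f"Mark scheme:\n{mark_scheme}\n\nEssay:\n{essay}"
    )
    for _ in range(rounds):
        # Reflexion step: ask the model to critique its own marking...
        critique = ask_gpt(
            "Here is a marking judgement. List any criteria that were "
            f"misapplied or not supported by evidence:\n\n{answer}"
        )
        # ...then revise the mark in light of that critique.
        answer = ask_gpt(
            "Revise the marking below to address this critique.\n\n"
            f"Marking:\n{answer}\n\nCritique:\n{critique}"
        )
    return answer
```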
This paper shows a degradation in performance from GPT-3.5 to GPT-4 on a simple maths reasoning task using chain-of-thought techniques: https://arxiv.org/pdf/2307.09009.pdf The issue with LLMs, and with deep learning more generally, is that when you don't get the correct answer you are never sure whether the problem lies in the program, the method or the environment.
Agreed on that: it's hard to rely on it at the moment, especially with the opaque updates. I think you might have linked me the wrong paper, though? This one shows GPT-4 outperforming GPT-3.5 back in March, but then suffering major degradation, to a state worse than GPT-3.5 by June? It's very bizarre. What on Earth must they be doing behind the scenes?
That's the right paper, and yes, it does depend on the time point! With the number of models and potential configurations it feels like a time sink with no guarantee of success. I prefer to build on some form of theory rather than fitting curves and hoping one may eventually fit.
What's really needed is a fine-tuned model. I think this is where OpenAI are going to fall behind. Many of the other providers are already rolling out the ability to fine-tune their baseline models with your own datasets. That'll be the real difference maker: specific examples of marking for the LLM to emulate.
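By "your own datasets" I mean something as simple as pairs of scripts and the marks human markers agreed on. The exact file format differs between providers, but a sketch of the general shape might be:

```python
# Sketch of building a supervised fine-tuning dataset for marking:
# pairs of (mark-scheme + essay) prompts and human-agreed marks.
# Field names and file layout vary by provider; this JSONL form is illustrative.
import json

examples = [
    {"essay": "First pupil essay ...", "agreed_mark": 23},
    {"essay": "Second pupil essay ...", "agreed_mark": 31},
]

mark_scheme = "Levels-of-response mark scheme text ..."

with open("marking_finetune.jsonl", "w") as f:
    for ex in examples:
        record = {
            "prompt": f"Mark scheme:\n{mark_scheme}\n\nEssay:\n{ex['essay']}\n\nMark:",
            "completion": f" {ex['agreed_mark']}",
        }
        f.write(json.dumps(record) + "\n")
```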
Great, insightful article. I wonder whether you've looked at, or have plans to look at, using GPT-4 to mark STEM subjects?
AQA are researching whether it is possible to mark short-answer STEM questions with AI, but last I heard, without much success.
I just want to say thanks for this valuable research. Love you real-pupil, real-classroom guys.
Have I misunderstood the test? Wouldn’t a better test of ChatGPT v No More Marking be to make it do comparative judgement with all the scripts, and rank them accordingly, rather than use a mark scheme? Isn’t the point of No More Marking that mark schemes are ambiguous?
Comparative Judgement relies on teachers agreeing on the relative ranking of the scripts, even if they are unable to place the scripts on a common scale of shared standards. The ranking is fairly robust to ambiguous mark schemes. What we see with GPT-3 and GPT-4 is that they are unable to produce a stable and reproducible ranking.
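To make "stable and reproducible" concrete: one simple check is to ask for the same set of scripts to be ranked twice and compute a rank correlation between the two runs. A minimal sketch, with made-up rankings standing in for real output:

```python
# Sketch: checking whether two repeat rankings of the same scripts agree.
# The rankings below are made-up placeholders; in practice they would come
# from asking the model (or judges) to order the same scripts twice.
from scipy.stats import kendalltau

scripts = ["A", "B", "C", "D", "E"]
run_1 = ["C", "A", "E", "B", "D"]   # ranking from the first run
run_2 = ["A", "C", "B", "E", "D"]   # ranking from a repeat run

# Convert each ranking into the position assigned to each script.
pos_1 = [run_1.index(s) for s in scripts]
pos_2 = [run_2.index(s) for s in scripts]

tau, p_value = kendalltau(pos_1, pos_2)
print(f"Kendall's tau between runs: {tau:.2f}")  # 1.0 = identical order, 0 = no agreement
```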
Thanks Chris, this is very helpful. So ChatGPT didn't apply a mark scheme; it ranked using comparative judgement. But when you asked ChatGPT to rank the same responses again, it came up with a different ranking.
My own experience using ChatGPT to do comparative judgement, and to mark against a set of criteria (both criteria it generates itself and AQA's), is that it gives me a different answer each time it sees the same script/answer.
That said, Ofqual's research into examiner marking shows that English grades are wrong nearly 50% of the time (when examiners are compared to senior examiner gradings). On that basis, ChatGPT is probably not that much worse.
I strongly suspect that if I combined the marks it gives me under both approaches and took the mean, this would be better than an examiner: it would give me the 'right' grade more than 50% of the time.
For example, if I give it two answers side by side and ask ChatGPT for a numerical score for each, based on the AQA mark scheme, it does so. If I then repeat the experiment, asking ChatGPT to give me an individual mark for one response at a time, I get a different and usually higher score. The former is normally below the 'right' mark, and the latter is usually above it. I should say that to test the accuracy, I've only fed in responses which AQA have already published with a senior examiner's score, so I am measuring accuracy against that standard.
This takes longer than actually marking it through comparative judgement, so I won't bother, even if the premise is correct. But, in principle, it is an interesting hypothesis.
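For what it's worth, the comparison I have in mind is just something like the sketch below, with made-up placeholder marks rather than my actual AQA responses:

```python
# Sketch of the comparison described above, with made-up placeholder marks.
# 'paired' = mark given when two answers are scored side by side,
# 'single' = mark given when the same answer is scored on its own,
# 'senior' = the published senior-examiner mark used as the reference.
results = [
    {"paired": 24, "single": 30, "senior": 27},
    {"paired": 22, "single": 28, "senior": 26},
    {"paired": 25, "single": 29, "senior": 28},
]

def mean_abs_error(mark_of):
    """Mean absolute error of a chosen mark against the senior-examiner mark."""
    return sum(abs(mark_of(r) - r["senior"]) for r in results) / len(results)

print("MAE, paired marks:   ", mean_abs_error(lambda r: r["paired"]))
print("MAE, single marks:   ", mean_abs_error(lambda r: r["single"]))
print("MAE, mean of the two:", mean_abs_error(lambda r: (r["paired"] + r["single"]) / 2))
```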
However, if I ask ChatGPT to give me 3-5 ways the essay/response could be improved, it gives me a very specific set of steps, most of which are entirely relevant.
You have to be very careful with headline statements such as 'Ofqual's research into examiner marking shows that English grades are wrong nearly 50% of the time'. It is more useful to look at how far apart marks tend to be when multiple marking is used, and at the size of the disparities between them. In all my time at AQA and Ofqual I never saw a 40-mark essay given a mark of 20 by one examiner and then marks ranging from 1 to 40 by other examiners. That is essentially what we are seeing with GPT. Aggregating noise with a signal won't improve the signal.

Interestingly, the latest research from AQA on using AI to mark short answers suggests they are far from a reliable method even for fairly short, seemingly objective answers. We are a long way from an AI essay marker at present.

On feedback, the steps are interesting, as you say, so we have built these into our site, but they are still prone to hallucinations and meaningless pastiche.
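Coming back to the aggregation point: a back-of-the-envelope way to see why averaging doesn't rescue it is that the standard error of a mean of n independent marks is the single-mark spread divided by √n, so with a spread anywhere near what we see from GPT, the mean of two attempts is still uncertain by several marks. A sketch with assumed spreads:

```python
# Back-of-the-envelope sketch: the standard error of a mean of n independent
# marks is sd / sqrt(n). The standard deviations below are assumed figures
# for illustration, not measured values from our data.
import math

def standard_error(sd_single_mark: float, n_attempts: int) -> float:
    """Standard error of the mean of n independent marks."""
    return sd_single_mark / math.sqrt(n_attempts)

for label, sd in [("examiner-like spread, sd ~ 3 marks", 3.0),
                  ("GPT-like spread, sd ~ 10 marks", 10.0)]:
    print(f"{label}: mean of 2 attempts has standard error ~ "
          f"{standard_error(sd, 2):.1f} marks")
```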
I appreciate your thoughtfulness, Chris. I didn't realise you did time at AQA and Ofqual! I hope I am being careful and accurate. I'm not sure it is a headline; it is the only way to interpret their published data.
Regardless of that, by 50% I'm not referring to crazy differences in marking like those you cite with ChatGPT, with swings of 20 marks on a 40-mark question.
I haven't seen it be wildly out like that. On a 30-mark question it might vary between 22 and 30 for the same answer. But AQA is happy for a senior examiner to give it 27, and other examiners to give it 24 or 30, and all three marks will stand, even on appeal, even though they represent different grades. Taking the mean of ChatGPT's two attempts, 26, gets us much closer to the 'true' mark of 27. Obviously, I have a tiny sample size to base this on, but it would be fascinating if replicated at scale.
Yes, you can get good results with small samples, but as soon as you try a larger sample you run into issues. It's also a nightmare to reproduce any results. LLMs are good at predicting words but really poor with numbers, with well-documented mathematical failures. Marking is all about the numbers!