17 Comments

What prompts did you give ChatGPT to mark with? Evaluating the prompt you used is fundamental to getting GPT to do anything for you, especially something as complex as marking. I'm a little surprised you saw worse performance with GPT-4, given that every piece of research I've seen indicates it performs better than 3.5 in pretty much every task.

author

Yes, a good question. We have tried a great variety of essays, prompts and mark schemes, and have actually found very little substantial difference in the marks produced when you modify the prompt. You get more variation when you modify the mark scheme, but still we haven't found any combination of mark schemes and prompts that works. A simple 'What mark would you give this essay and why?', alongside a levels-of-response mark scheme, works pretty much as well as any other prompt. You can add in some context on the question and the level of the pupils, but it has little impact. I think we are up against a fundamental theoretical failing of LLMs: they are essentially good at reproducing patterns, but have no analytical power. Marking requires a complex set of analyses and weighted decisions that no form of AI has yet been able to reproduce in a reliable and valid manner. I'm not surprised that GPT-4 performed no better than GPT-3.5, as they report no improvement in performance in most AP examinations: https://openai.com/research/gpt-4
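For concreteness, this is roughly the shape of the simple prompt described above, as a minimal sketch using the OpenAI Python client (openai>=1.0). The mark scheme text and model name are placeholders, not No More Marking's actual setup:

```python
# Minimal sketch of the simple marking prompt described above, using the
# OpenAI Python client (openai>=1.0). The mark scheme and essay text are
# placeholders; any levels-of-response scheme can be pasted in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MARK_SCHEME = """Level 3 (17-24 marks): a well-developed, coherent argument...
Level 2 (9-16 marks): some development and organisation...
Level 1 (1-8 marks): simple, undeveloped ideas..."""  # placeholder scheme

def mark_essay(essay: str, model: str = "gpt-4") -> str:
    """Ask the model for a mark and a justification against the scheme."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduces, but does not eliminate, run-to-run variation
        messages=[
            {"role": "system", "content": f"Mark scheme:\n{MARK_SCHEME}"},
            {"role": "user",
             "content": f"What mark would you give this essay and why?\n\n{essay}"},
        ],
    )
    return response.choices[0].message.content
```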

You're right that the initial publication says that, but that research was done prior to the discovery of a number of best-practice prompting techniques like chain-of-thought, Reflexion and DERA (links below, followed by a sketch of what a chain-of-thought marking prompt might look like). Are you familiar with them / did you use any of them in your prompting? I don't have the large datasets to test on, as you do, but in my limited testing GPT-4 is so much more than it appears on the surface.

Chain-of-thought reasoning: https://arxiv.org/abs/2305.02897

Reflexion: https://arxiv.org/abs/2303.11366

DERA: https://arxiv.org/abs/2303.17071
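
For readers who haven't met these techniques: chain-of-thought prompting simply asks the model to reason through intermediate steps before answering. A hypothetical chain-of-thought variant of a marking prompt (an illustration, not anything from the post) might look like:

```python
# Hypothetical chain-of-thought variant of a marking prompt: the model is
# asked to walk through the mark scheme criterion by criterion before
# committing to a final mark, rather than stating a mark immediately.
COT_MARKING_PROMPT = """Mark the essay below against the mark scheme.
Work step by step:
1. For each level descriptor, quote the evidence from the essay for and against it.
2. Decide which level the balance of evidence supports, and explain why.
3. Pick a mark within that level, then state it as 'Final mark: N/40'.

Essay:
{essay}"""
```

Reflexion and DERA layer further steps on top: Reflexion feeds the model's own critique of a previous attempt back in, and DERA has two agents discuss the decision before committing.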

author

This paper shows a degradation in performance of GPT-4 relative to GPT-3.5 on a simple maths reasoning task using chain-of-thought techniques: https://arxiv.org/pdf/2307.09009.pdf The issue with LLMs, and deep learning more generally, is that when you don't get the correct answer you are never sure whether the problem lies in the program, the method or the environment.

Agree on that, it's hard to rely on it at the moment, especially with the opaque updates. I think you might have linked me the wrong paper, though? This one shows GPT-4 outperforming GPT-3.5 back in March, but having suffered major degradation since then, to a state worse than 3.5 by June? It's very bizarre. What on Earth must they be doing behind the scenes.

author

That's the right paper, and yes, it does depend on the time point! With the number of models and potential configurations it feels like a time sink with no guarantee of success. I prefer to build on some form of theory rather than fitting curves and hoping one may eventually fit.

What's really needed is a more finely tuned model. I think this is where OpenAI are going to fall behind. Many of the other providers are already rolling out the ability to fine-tune their baseline models with your own datasets. That'll be the real difference-maker: specific modelling for the LLM to emulate.
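As a concrete illustration of what that would involve (hypothetical data, though the JSONL chat format is the one OpenAI's fine-tuning endpoint accepts), a training set pairing essays with the marks human examiners actually gave might be prepared like this:

```python
# Hypothetical sketch of a fine-tuning dataset of already-marked essays, in
# the JSONL chat format accepted by OpenAI's fine-tuning API. Each line pairs
# an answer with the mark a human examiner actually awarded.
import json

marked_essays = [
    ("Priestley presents Mr Birling as a man whose confidence...", 24),
    ("The theme of power in Macbeth is introduced through...", 17),
]  # placeholder data; a real set would need thousands of marked scripts

with open("marking_train.jsonl", "w") as f:
    for essay, mark in marked_essays:
        f.write(json.dumps({"messages": [
            {"role": "system", "content": "You are an examiner. Reply with a mark out of 30."},
            {"role": "user", "content": essay},
            {"role": "assistant", "content": f"Mark: {mark}/30"},
        ]}) + "\n")

# The file is then uploaded and a job started, e.g.:
#   file = client.files.create(file=open("marking_train.jsonl", "rb"), purpose="fine-tune")
#   client.fine_tuning.jobs.create(training_file=file.id, model="gpt-3.5-turbo")
```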

Great, insightful article. I wonder whether you've looked at, or have plans to look at, using GPT-4 to mark STEM subjects?

author

AQA are researching whether it is possible to mark short answer STEM questions with AI, but last I heard without much success.

I just want to say thanks for this valuable research. Love you real-pupil, real-classroom guys.

Have I misunderstood the test? Wouldn't a better test of ChatGPT vs No More Marking be to make it do comparative judgement with all the scripts, and rank them accordingly, rather than use a mark scheme? Isn't the point of No More Marking that mark schemes are ambiguous?

author

Comparative Judgement relies on teachers agreeing on the relative ranking of the scripts, even if they are unable to place the scripts on a common scale of shared standards. The ranking is fairly robust to ambiguous mark schemes. What we see with GPT-3 and GPT-4 is that they are unable to create a stable and reproducible ranking.
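For readers new to comparative judgement: the ranking is recovered from many pairwise 'which script is better?' decisions, typically by fitting something like a Bradley-Terry model. A simplified sketch of that fitting step (an illustration, not No More Marking's actual pipeline):

```python
# Simplified Bradley-Terry fit over pairwise judgements, to illustrate how
# comparative judgement turns "script A beat script B" decisions into a
# ranking. Not No More Marking's actual implementation.
from collections import defaultdict

def bradley_terry(judgements, n_iter=200):
    """judgements: list of (winner, loser) script ids -> {id: strength}."""
    scripts = sorted({s for pair in judgements for s in pair})
    strength = {s: 1.0 for s in scripts}
    wins = defaultdict(int)
    for winner, _ in judgements:
        wins[winner] += 1
    for _ in range(n_iter):  # standard minorise-maximise update
        new = {}
        for s in scripts:
            # every comparison s took part in contributes 1/(p_s + p_opponent)
            denom = sum(1.0 / (strength[s] + strength[l if w == s else w])
                        for (w, l) in judgements if s in (w, l))
            new[s] = wins[s] / denom if denom else strength[s]
        total = sum(new.values())
        strength = {s: v * len(scripts) / total for s, v in new.items()}
    return strength

strengths = bradley_terry([("A", "B"), ("A", "C"), ("B", "C")])
ranking = sorted(strengths, key=strengths.get, reverse=True)  # ['A', 'B', 'C']
```

The instability described in the post shows up one level earlier: if the model's pairwise decisions change between runs, no amount of fitting on top will produce a stable ranking.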

Thanks Chris, this is very helpful. So, ChatGPT didn't apply a mark scheme; it ranked using comparative judgement. But when you asked ChatGPT to rank the same responses again, it came up with a different ranking.

My own experience of using ChatGPT to do comparative judgement, and to mark against a set of criteria (whether ones it generates itself or AQA's), is that it gives me a different answer each time it sees the same script/answer.

That said, Ofqual's research into examiner marking shows that English grades are wrong nearly 50% of the time (when examiners are compared to senior examiner gradings). On that basis, ChatGPT is probably not that much worse.

I strongly suspect that if I combined the marks it gives me for both and took the mean, the result would be better than an examiner: it would give me the 'right' grade more than 50% of the time.

For example, if I give it two answers side by side and ask ChatGPT for a numerical score for both, based on the AQA mark scheme, it does so. If I then repeat the experiment, asking ChatGPT for an individual mark, one response at a time, I get a different and usually higher score. The former is normally below the 'right' mark, and the latter is usually above it. I should say that, to test the accuracy, I've only fed in responses which AQA have already published with a senior examiner score, so I am measuring accuracy against that standard.
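
That protocol, as a sketch (ask_for_mark here is a hypothetical helper standing in for a ChatGPT call that returns an integer mark out of 30):

```python
# Sketch of the two-pass protocol described above: score the answer once in a
# side-by-side prompt and once on its own, then take the mean. ask_for_mark is
# a hypothetical callable wrapping a ChatGPT call that returns a mark out of 30.
from typing import Callable

def combined_mark(answer: str, rival: str,
                  ask_for_mark: Callable[[str], int]) -> float:
    side_by_side = ask_for_mark(          # usually lands below the 'right' mark
        f"Using the AQA mark scheme, score Answer A out of 30, "
        f"with Answer B alongside for comparison.\n\n"
        f"Answer A:\n{answer}\n\nAnswer B:\n{rival}")
    individual = ask_for_mark(            # usually lands above it
        f"Using the AQA mark scheme, score this answer out of 30.\n\n{answer}")
    return (side_by_side + individual) / 2  # e.g. (22 + 30) / 2 = 26, vs a 'true' 27
```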

This takes longer than actually marking it through comparative judgement, so I won't bother, even if the premise is correct. But, in principle, it is an interesting hypothesis.

However, if I ask ChatGPT to give me 3-5 ways the essay/response could be improved it gives me a very specific set of steps, most of which are entirely relevant.

author

You have to be very careful with headline statements such as 'Ofqual's research into examiner marking shows that English grades are wrong nearly 50% of the time'. It is more useful to look at how far apart marks tend to be when multiple marking is used, and the disparity between those marks. In all my time at AQA and Ofqual I never saw a 40-mark essay given a mark of 20 by one examiner and then marks from 1 to 40 by other examiners. That is essentially what we are seeing with GPT. Aggregating noise with a signal won't improve the signal.

Interestingly, the latest research from AQA on using AI to mark short answers suggests that they are far from getting a reliable method even for fairly short, seemingly objective answers. We are a long way from an AI essay marker at present.

On feedback, the steps are interesting, as you say, so we have built these into our site, but they are still prone to hallucinations and meaningless pastiche.
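The point about aggregating noise can be illustrated with a toy simulation (an illustration, not NMM or AQA data): if a model's marks for a script scatter right across the scale, as in the 1-to-40 example above, the average of many runs does settle down, but to roughly the same value for every script, so it carries no information about any particular script:

```python
# Toy simulation of the aggregation point above. Human marks are modelled as
# noisy but centred on the true mark; the unstable model's marks scatter
# across the whole 1-40 scale, as in the example above. Averaging helps the
# first and not the second.
import random

random.seed(0)

def examiner_mark(true_mark):      # noisy, but centred on the truth
    return true_mark + random.gauss(0, 2)

def unstable_llm_mark(true_mark):  # scattered across the scale, regardless of quality
    return random.uniform(1, 40)

for true_mark in (12, 20, 31):
    human_avg = sum(examiner_mark(true_mark) for _ in range(100)) / 100
    llm_avg = sum(unstable_llm_mark(true_mark) for _ in range(100)) / 100
    print(f"true {true_mark}: mean of 100 human marks = {human_avg:.1f}, "
          f"mean of 100 model marks = {llm_avg:.1f}")
# The human averages track 12, 20 and 31; the model averages all sit near 20.5.
```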

I appreciate your thoughtfulness, Chris. I didn't realise you'd spent time at AQA and Ofqual! I hope I am being careful and accurate. I'm not sure it is a headline; it is the only way to interpret their published data.

Regardless of that, by 50% I'm not referring to crazy differences in marking like those you cite with ChatGPT, with swings of 20 marks on a 40-mark question.

I haven't seen it be wildly out like that. On a 30-mark question it might vary between 22 and 30 for the same answer. But AQA is happy for a senior examiner to give it 27, and other examiners to give it 24 or 30, and all three marks will stand, even on appeal, even though they represent different grades. Taking the mean of ChatGPT's two attempts, 26, gets us much closer to the 'true' mark of 27. Obviously I have a tiny sample size to base this on, but it would be fascinating if replicated at scale.

author

Yes, you can get good results with small samples, but as soon as you try a larger sample you run into issues. It's also a nightmare to reproduce any results. LLMs are good at predicting words but really poor with numbers, with well-documented mathematical failures. Marking is all about the numbers!
