I appreciate your thoughtfulness, Chris. I didn't realise you did time at AQA and Ofqual! I hope I am being careful and accurate. I'm not sure it is a headline; it is the only way to interpret their published data.
Regardless of that, by 50% I'm not referring to crazy differences in marking, as you cite with ChatGPT, with swings of 20 marks in a 40-mark question.
I haven't seen it be wildly out like that. In a 30-mark question it might vary between 22 and 30 for the same answer. But AQA is happy for a senior examiner to give it 27 and other examiners to give it 24 or 30, and all three marks will stand, even on appeal, even though they represent different grades. Taking the mean of ChatGPT's two attempts, 26, gets us much closer to the 'true' mark of 27. Obviously I have a tiny sample size to base this on, but it would be fascinating to see it replicated at scale.
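The averaging idea above can be sketched in a few lines of Python. The marks (22 and 30 from ChatGPT's two attempts, 27 from the senior examiner) are the ones quoted in this thread; the sketch only illustrates the claim that the mean of the two attempts lands nearer the senior examiner's mark than either attempt alone does.

```python
# Sketch: does the mean of repeated LLM marking attempts land closer
# to a reference mark than any single attempt? Marks are the ones
# quoted in the discussion above, not real study data.
from statistics import mean

attempts = [22, 30]      # two ChatGPT marks for the same 30-mark answer
reference = 27           # senior examiner's mark

avg = mean(attempts)     # 26.0
single_errors = [abs(a - reference) for a in attempts]  # 5 and 3 marks out
avg_error = abs(avg - reference)                        # 1 mark out

print(f"mean of attempts: {avg}, error vs senior examiner: {avg_error}")
print(f"individual errors: {single_errors}")
```

Whether this holds up with more attempts and more answers is exactly the open question; with two data points it is suggestive, not evidence.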
Yes, you can get good results with small samples, but as soon as you try a larger sample you run into issues. It's also a nightmare to reproduce any results. LLMs are good at predicting words but really poor at numbers, with well-documented mathematical failures. Marking is all about the numbers!