How good is ChatGPT at writing essays? Some data!
How did ChatGPT-produced writing do when compared to writing by real 9-year-olds?
Since ChatGPT was launched last year, there have been a number of claims about how good it is at writing essays. Can it produce top Grade 9 GCSE essays? A* A-level essays? Grade 5 or Grade 6 GCSE essays? My own claim is that for pure writing assessments (the ‘Question 5’ type ones on the AQA English Language GCSE paper), it is very good. However, these are all just claims. Is there any way we can get some data on this question?
We were already running a large-scale Year 4 writing assessment of about 50,000 pupils from about 1,100 primary schools. The Year 4s had to write about whether they thought mobile phones should be banned for under-13s. We created an extra dummy ‘ChatGPT’ school featuring 8 essays that were all written by ChatGPT.
We did not design them all to be perfect. We varied the prompt for some of them, and we included a couple that were deliberately humorous!
In our previous blog, we showed that teachers could not reliably tell the difference between the ChatGPT scripts and the human ones. In this blog, we’ll share the actual results of the 8 ChatGPT scripts.
In all of our assessments, we select a set of moderation scripts: roughly 20–30% of pupil scripts, which are judged by teachers in other schools, not just by teachers in their own school. For this assessment, the moderation sample consisted of 14,083 scripts. How did the ChatGPT scripts do compared to this sample? Here is a table showing their results.
So, we can see that for the four essays where we kept the prompt the same as the one the pupils got, the scores were all in the top 7%. The top two essays scored in the top percentile and were within the margin of error of the top script. The only three to finish outside the top 7% were ones where we had changed the prompt to be unusual or to include errors. Interestingly, the essay where we asked ChatGPT to write the article in the form of a song also scored very highly! You may remember from our previous blog that the essay in the style of Harry Potter was the only one our teachers flagged as ChatGPT-authored.
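For readers who like to see the arithmetic, here is a minimal Python sketch of how a ‘top X%’ figure can be read off a moderation sample. The scores below are invented for illustration, and `percentile_rank` is a hypothetical helper, not part of our actual scoring pipeline.

```python
from bisect import bisect_left

def percentile_rank(score: float, sample: list[float]) -> float:
    """Percentage of sample scores strictly below `score`."""
    ordered = sorted(sample)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

# Invented scaled scores for a handful of moderation scripts;
# the real moderation sample contained 14,083 scripts.
moderation = [312.0, 377.9, 401.2, 433.6, 458.5, 490.1]
chatgpt_script = 455.0
rank = percentile_rank(chatgpt_script, moderation)
print(f"Beat {rank:.0f}% of the sample, i.e. finished in the top {100 - rank:.0f}%")
```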
These pieces of writing were assessed with Comparative Judgement, an assessment technique in which judges repeatedly compare pairs of scripts and decide which is the better piece of writing; the many pairwise decisions are then combined statistically into a single scale. This rewards the overall holistic quality of a piece of writing, not just its technical accuracy. For ChatGPT to do well on an assessment like this, being accurate is not enough. It has to have some element of style and originality too.
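Comparative Judgement data of this kind is typically modelled with something like the Bradley-Terry model: each script has a latent quality, and the probability that one script wins a pairwise judgement is a logistic function of the quality difference. Here is a minimal, self-contained Python sketch of that idea, fitted by gradient ascent. It illustrates the general technique, not the actual model or code behind this assessment.

```python
import math

def fit_bradley_terry(judgements, n_scripts, lr=0.1, epochs=200):
    """Estimate latent script qualities from (winner, loser) pairs by
    gradient ascent on the Bradley-Terry log-likelihood."""
    quality = [0.0] * n_scripts
    for _ in range(epochs):
        grad = [0.0] * n_scripts
        for winner, loser in judgements:
            # P(winner beats loser) given the current quality estimates
            p = 1.0 / (1.0 + math.exp(quality[loser] - quality[winner]))
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        quality = [q + lr * g for q, g in zip(quality, grad)]
        mean = sum(quality) / n_scripts
        quality = [q - mean for q in quality]  # anchor the scale at mean 0
    return quality

# Toy data: script 0 usually beats script 1, and both beat script 2.
judgements = [(0, 1), (0, 2), (1, 2), (1, 0), (0, 1)]
print(fit_bradley_terry(judgements, n_scripts=3))
```

With enough judgements per script, the fitted qualities form the scale on which all scripts, human or ChatGPT, can be ranked and given percentile positions.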
So what does a top-percentile impossible-to-recognise robot essay look like? Here you are!
Essentially, ChatGPT reached the ceiling of scores on this task, so we have not assessed its limits. To do so, we’d have to carry out a similar project with the essays of older pupils. Stay tuned!