Can teachers spot a ChatGPT essay?
ChatGPT is capable of writing very realistic and original essays.
Can teachers spot them? There have been a lot of claims and counter-claims about this, so we decided to put it to the test.
We were already running a large-scale Year 4 writing assessment of about 50,000 pupils from about 1,100 primary schools. We created an extra dummy ‘ChatGPT’ school featuring 8 essays that were all written by ChatGPT.
The Year 4s had to write about whether they thought mobile phones should be banned for under-13s.
We did not design them all to be perfect: we varied the prompt a couple of times, and a couple of the essays were deliberately humorous!
None of them were edited after being produced by ChatGPT.
Once ChatGPT had written them, we got some friends and friends’ children to copy them out in their handwriting.
We told our teachers about these 8 scripts, and offered a prize if they could successfully identify a ChatGPT essay when they were judging this task. They could flag up a suspected ChatGPT essay very simply, by clicking on our standard button for recording a comment.
The results
12,595 teachers took part in the judging.
In total, the 8 ChatGPT essays were judged 112 times by 112 different teachers.
Only two of those 112 judgements flagged the script as having been written by ChatGPT. Both of these flags were on the same script, one of the ones with a more humorous prompt.
Across the whole assessment, teachers raised 107 ChatGPT flags in total. Only those two were correct, which means 105 of the flags were wrong.
Two scripts which were not written by ChatGPT were each flagged four times, more than any individual ChatGPT script received.
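For anyone who wants the arithmetic spelled out, here is a quick sketch using only the counts reported above; the percentages are our own working from those counts, not an extra analysis from the study.

```python
# Quick check of the figures above (counts taken directly from the results).
judgements_on_chatgpt_scripts = 112  # judgements that involved a ChatGPT essay
correct_flags = 2                    # flags that landed on an actual ChatGPT essay
total_flags = 107                    # all 'suspected ChatGPT' flags raised

wrong_flags = total_flags - correct_flags
print(f"Wrong flags: {wrong_flags}")  # 105
print(f"Share of ChatGPT judgements flagged: {correct_flags / judgements_on_chatgpt_scripts:.1%}")  # ~1.8%
print(f"Share of flags that were right: {correct_flags / total_flags:.1%}")  # ~1.9%
```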
We feel it is fair to conclude that teachers cannot reliably spot a ChatGPT script, and that ChatGPT does a very good job of producing realistically human writing!
The next question is of course how well the ChatGPT scripts scored on this assessment. We will have more on that shortly.
Limitations of this design
All the writing in our Comparative Judgement tasks is anonymised, and the ChatGPT essays were set up to appear in the moderation sample: a 20% sample of essays that are judged by teachers from outside the school where those essays were written. In our Comparative Judgement assessments, every fifth judgement presents a pair of scripts from the moderation sample, and teachers know that those two scripts are not from their own school (there is a toy sketch of this pairing pattern below). So we were asking them to spot plagiarism in this unfamiliar cohort of students, not in the cohort of students they know.
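To make the set-up concrete, here is a toy sketch of the every-fifth-judgement pairing pattern described above. It is an illustration only, not No More Marking's actual code; the pool sizes and the next_pair function are hypothetical.

```python
import random

# Toy illustration of the pairing pattern described above; not the
# platform's real implementation. Pool sizes and names are made up.

def next_pair(judgement_number, own_school_scripts, moderation_scripts):
    """Return two script IDs for a teacher's nth judgement."""
    if judgement_number % 5 == 0:
        # Every fifth judgement: a pair drawn from the moderation sample
        # (scripts from other schools, where the ChatGPT essays sat).
        pool = moderation_scripts
    else:
        # Otherwise: a pair from the judge's own school.
        pool = own_school_scripts
    return random.sample(pool, 2)

own_school = [f"own_{i}" for i in range(30)]
moderation = [f"mod_{i}" for i in range(20)]  # roughly the 20% moderation sample

for n in range(1, 11):
    print(n, next_pair(n, own_school, moderation))
```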
This is an obvious limitation of this task design. A lot of the discussion about ChatGPT plagiarism has been about whether teachers can spot plagiarism carried out by their own pupils. If a teacher knows a student and knows how they work in class, they may well be more likely to spot if that student’s writing style or standard dramatically changes.
This study does not shed any light on that question. We were testing if teachers could recognise plagiarism amongst students they did not know.
However, we still think this study is useful, as a lot of assessment does involve markers assessing writing from students they don't know. It has also helped us to see whether there is something so weird, unusual or inhuman about ChatGPT essays that it makes them very obvious to identify.