This is a great post. There is so much noise in AI at the moment, but your experiences match mine: we are yet to see AI mark work with any integrity. Thanks for sharing and bringing a sensible voice to the discussion.
Here is my experience with ChatGPT. I have not used it to submit work for grading, only to generate text from a prompt.
1. Every paragraph contains a sentence or two that is long-winded.
2. Excessive use of certain words: "not only" and "transformative" are two examples.
3. In write-ups of 500+ words, some sentences repeat with slightly different wording.
4. The output always needs editing to match your needs and your writing style, however good your prompt is, because it largely feels emotionless. I expect we will all sound similar in the future if we all use the same models.
For now, I believe it can be used to give a starting point but cannot be used for a submission without significant updates.
The poem below, written by Shel Silverstein in 1981, sums it up better than I could. I am not saying ChatGPT is as bad as the machine in the poem:
The Homework Machine
The Homework Machine,
Oh, the Homework Machine,
Most perfect contraption that's ever been seen.
Just put in your homework, then drop in a dime,
Snap on the switch, and in ten seconds' time,
Your homework comes out, quick and clean as can be.
Here it is— 'nine plus four?' and the answer is 'three.'
Three?
Oh me . . .
I guess it's not as perfect
As I thought it would be.
Thanks for an insightful article as always, Chris. However, I must point out that there is some work to be done on your methods from a technical standpoint before dismissing AI marking completely. For instance, you are right that LLMs cannot count. However, there are ways to format their outputs so that they can be counted accurately. Consider a hypothetical teacher who can assess a piece of written work but cannot count: rather than dismissing their ability to mark completely, you would provide them with appropriate support, say a calculator or an assistant. Likewise, in our own internal studies (for science, though, I must confess) we have found that such a system of 'chaining together' LLMs with mathematically proficient non-AI systems enables us to mark with significantly higher accuracy when measured against teachers. I wonder whether such a system might also improve the low kappa found in your studies.
I must admit my remarks are intended to apply only to the marking of open-ended work; we have not run any experiments on the more constrained answers you mention, so I apologise if I did not make this clear. In our case you could obviously take the number of errors as a formatted output and then process it to match a mark scheme, though it is clear that the errors GPT finds in the text are often not errors.
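To make the 'chaining' idea concrete, here is a minimal sketch of one way it could work: the LLM is prompted to return its findings as structured JSON, and ordinary deterministic code, which can count reliably, does the arithmetic and maps the error count onto a mark scheme. Everything here is hypothetical: the function name, the JSON shape, the sample response, and the mark-scheme bands are all invented for illustration and are not anyone's actual system.

```python
import json

def mark_from_llm_output(llm_json: str, bands: list[tuple[int, int]]) -> int:
    """Count errors in the LLM's structured output and convert the count
    to a mark using (max_errors, mark) bands, checked in order.

    The LLM is assumed (hypothetically) to have been prompted to reply
    with JSON of the form {"errors": [...]}; counting and the mapping to
    a mark happen entirely outside the LLM.
    """
    errors = json.loads(llm_json)["errors"]  # parse the structured reply
    n = len(errors)                          # counting done by plain code
    for max_errors, mark in bands:
        if n <= max_errors:
            return mark
    return 0

# Made-up LLM response and a made-up three-band mark scheme.
response = ('{"errors": [{"span": "teh", "type": "spelling"},'
            ' {"span": "their is", "type": "grammar"}]}')
scheme = [(1, 5), (3, 3), (6, 1)]  # <=1 error: 5 marks; <=3: 3; <=6: 1
print(mark_from_llm_output(response, scheme))  # → 3
```

The point of the split is that each part does what it is good at: the language model judges whether something is an error, while the surrounding code handles the counting and the mark-scheme arithmetic it cannot be trusted with.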