What do you prefer – human error or machine error?
The human in the loop and the ghost in the machine
After we posted our last article about AI marking, we had an interesting response on social media.
We all know that humans make mistakes, and technology promises to reduce those mistakes. But humans and AI make different kinds of mistakes, and maybe you don’t want AI mistakes of any kind involved in assessing students’ writing.
What are the trade-offs between human & machine errors?
The strengths of machines
The great advantage that machines have over humans is their speed and scalability. They don’t get tired or need breaks, and once you’ve trained one machine, it’s easy to train lots more. In fields with huge workload pressures and vast piles of work to get through, this is very compelling. Look at something like credit applications. In the UK alone, millions of people make credit applications every year. Expecting every single application to be reviewed by a human would be incredibly expensive, would require a huge human workforce - and still wouldn’t get you 100% accurate decisions. Or think about facial recognition technology. When you stand in an airport queue in the early hours of the morning waiting for a lone human worker to check everyone’s passport picture, you can see the appeal of automated facial recognition.
It can feel very reductive and utilitarian to focus on the speed of machine decision-making without thinking about the quality, but with humans, speed and quality are linked. The quality of human decision-making gets worse as you increase the speed and scale. If you ask an individual human to make more decisions, or to make them quicker, they will start making mistakes. If you try to recruit more humans to make the extra decisions, you will have difficulties recruiting and training enough people. By contrast, it’s easier for machines to scale up without losing quality.
This is why, in fields that are being overwhelmed with decisions, technology is so appealing. And this is why it isn’t fair to compare exceptional or even average human performance with average machine performance. You can easily scale up average machine performance. You can’t easily scale up average human performance (let alone exceptional human performance).
There are clearly huge workload pressures in schools at the moment, huge amounts of marking to get through, and huge difficulties recruiting trained teachers. It’s a perfect example of a situation where technology could really help.
The weaknesses of machines
However, the great weakness of machine intelligence is that it doesn’t make its decisions the way a human would.
This is a fundamental issue that goes to the heart of all modern developments in AI. Sometimes we do know how the AI model makes its decisions, and that can make it easy to break the model. Sometimes, with complex deep learning models, we genuinely do not know how the model is making its decisions, but that also leaves the model open to being gamed and broken.
A funny example from finance involves fennel purchases. A big insurance company cross-referenced their data with a supermarket loyalty scheme and found that purchases of fresh fennel were correlated with lower home insurance claims. You can speculate as to why this is, but it’s unlikely that purchasing fresh fennel causes people to claim less on insurance. Imagine if you set up a home insurance system that gave lower premiums to people who bought fresh fennel, and everyone knew this was the case. The system might work to begin with, but it would very quickly get gamed.
The corresponding issue with older AI marking models is that they are very biased towards length. Again, we can see why this is: length does correlate with quality, in the same way that buying fennel correlates with low insurance claims. But length doesn’t cause quality, and once people know that length is what is being rewarded, they have an incentive to write reams of gobbledegook.
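To make that incentive concrete, here is a minimal sketch - not any real marking model, just an illustration of the idea - of a scorer that uses word count as a proxy for quality, and how cheaply it can be gamed:

```python
# Toy illustration only: word count used as a proxy for quality.
def naive_score(essay: str) -> float:
    """Score out of 100, rising with word count and capped at 500 words."""
    return min(len(essay.split()), 500) / 5

concise = ("The Treaty of Versailles punished Germany harshly, "
           "which helped destabilise the Weimar Republic.")
padded = "because therefore however moreover " * 150  # 600 words of filler

print(naive_score(concise))  # 2.6 - a sound point, scored poorly
print(naive_score(padded))   # 100.0 - full marks for gobbledegook
```

Real length-biased models are less crude than this, but the incentive they create is the same: once writers know what the proxy is, the proxy stops working.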
With some of the more complex deep learning models, we genuinely do not know how they make their decisions, and this poses related but more complex issues. For example, researchers have managed to fool facial recognition technologies by changing just one or two pixels in an image. These changes would make no difference at all to a human assessor, but they can cause the AI to change its mind about what’s in the image. A similar risk exists if you use newer and more complex AI models to assess student essays. They might work to begin with, but if you don’t know how the model makes its decisions, you incentivise gaming and risk the model degrading over time.
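You can sketch the underlying mechanism with a toy model. The snippet below is not a real facial recognition system - just a simple linear classifier - but it shows why this happens: when a decision sits close to the model’s boundary, a tiny change to a single input value, invisible to a human, can flip the output.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)       # weights of a toy linear classifier
image = rng.normal(size=100)   # stand-in for a flattened image

# Place the image just on the "match" side of the decision boundary.
image -= w * (image @ w - 1e-3) / (w @ w)
print("original:", image @ w > 0)     # True - classified as a match

# Nudge one "pixel" slightly, in the direction the model is most sensitive to.
attacked = image.copy()
i = int(np.argmax(np.abs(w)))
attacked[i] -= np.sign(w[i]) * 0.1
print("attacked:", attacked @ w > 0)  # False - the decision flips
```

Deep networks are far more complicated than this toy, but the same sensitivity to tiny, carefully chosen changes is what the pixel-level attacks on facial recognition exploit.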
The power of validation
The solution to all these problems is validation. You need some way of using expert human judgement to validate your machine model. And this can’t be a one-off process: you have to continually validate your model to guard it against gaming. That is what we are doing with our AI marking experiments: taking every human judgement and seeing if the AI model agrees with it.
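In practice the check is simple to describe. The sketch below uses invented data and a placeholder ai_prefers function, not our actual model, but the idea is just this: compare every human judgement with the AI’s preference on the same pair of scripts, and watch the agreement rate.

```python
# Invented example data: each tuple is (script_a, script_b, the human's chosen winner).
human_judgements = [
    ("S1", "S2", "S1"),
    ("S3", "S4", "S4"),
    ("S5", "S6", "S5"),
    ("S7", "S8", "S8"),
]

def ai_prefers(script_a: str, script_b: str) -> str:
    """Placeholder for the AI model's decision on the same pair."""
    return script_a  # stands in for a real model call

agreements = sum(1 for a, b, human_winner in human_judgements
                 if ai_prefers(a, b) == human_winner)
print(f"Human-AI agreement: {agreements / len(human_judgements):.0%}")
```

Because the human judgements keep arriving, the check keeps running, so gaming or drift shows up as a falling agreement rate rather than going unnoticed.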
Our recommended model is for human teachers to do 10% of the judging in our projects and for the AI to do the other 90%. This saves enormous amounts of time - 90% of it, in fact! It also means that every piece of writing still gets seen twice by humans - double the number of times it is seen with traditional non-AI marking, where a single marker reads it once.
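As a rough illustration of that arithmetic - assuming, purely for the sake of the example, that each piece of writing appears in about 20 pairwise comparisons during a judging session:

```python
judgements_per_script = 20   # assumed for illustration only
human_share = 0.10           # teachers do 10% of the judging
ai_share = 1 - human_share   # the AI does the other 90%

human_views = judgements_per_script * human_share
print(human_views)                            # 2.0 - each script is still seen twice by a human
print(f"judging time saved: {ai_share:.0%}")  # 90%
```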
The 10/90 split also means we can use the human judgements to validate the AI model. We have free projects in the summer term for Year 6 and Year 9 where teachers can see if they agree with the machine. We also have a webinar on Wednesday 23 April where you can try out some judging and see how the AI works.
What do you prefer - absolute error or comparative error? Some of the resistance that Comparative Judgement faces more generally seems to be based on a similar instinct: resistance to types of error that people are less familiar with.
On human error vs machine error, what does recourse to review look like in a machine-error world? People seem to have an instinct that some recourse to a second authority is part of what it means for something to be fair, and the concept of remarking is certainly highly embedded in our high-stakes exam system, but what would remarking even mean once you have already scaled the machine to do the initial marking? I think tennis is an interesting example here: the technology has existed for a while such that all line calls could be automated, but stakeholders seem more comfortable with a system of human error with (some) potential for review, at least at the highest-stakes tournaments.
Does length correlate with quality? I'm not sure it does. I think lack of length correlates with lack of quality, which isn't the same thing. Lots of concise writing covers everything expected in a piece of work; it's when writing falls outside "the Goldilocks zone" that it will either lack breadth or depth or, as suggested, turn into gobbledygook.