Building trust in AI
What the housing disasters of the 60s & 70s tell us about current progress in AI
Last week I went to a conference at Cambridge about some of the challenges facing modern British society & the economy.
Two talks on housebuilding and artificial intelligence stayed with me and have more in common than you might first think.
In the talk on planning and housebuilding, the panel agreed that it was important for the UK to build more houses and that planning regulations would need to be changed to make this happen.
However, Nicholas Boys Smith put forward an important reason why people are wary of new housebuilding - because so much modern housebuilding has been shoddy.
I agree with this. I'd argue we went through a period in the 60s and 70s when a lot of substandard housing was built, and this has made a lot of people averse to new housebuilding in general.
The bad houses of the 60s and 70s didn't just affect the lives of the people who lived in them and near them. They destroyed the trust people had in the ability of our society to build nice new houses, and have therefore contributed to the lack of houses we have generations later.
So what does this all have to do with artificial intelligence?
The final panel of the conference was about artificial intelligence. Most members of the panel were convinced that Large Language Models (LLMs) like ChatGPT will be a total gamechanger, revolutionising society and the economy (mostly) for the better.
This narrative is one that we at No More Marking have a hard time accepting. We have spent a large part of the last 6 months seeing if we can get LLMs to a) mark students' writing and b) respond to customer queries on our helpdesk.
Our findings, many of which we have published, show that LLMs are not very good at these tasks. They make a lot of errors, which is a problem in itself. But they are also extremely unreliable, meaning that even when they give you the right answer, you cannot be sure they will give you the same answer the next time you ask.
When I speak to LLM boosters about this, they tend to a) not be that interested in the problems we are trying to solve, b) say that we are using the wrong prompts, or c) tell us to wait for the next model.
In response to this, I reply that a) we are trying to solve quintessential language problems with huge real-world applications, b) we have tried a lot of different prompts and prompting techniques, as have many other similar organisations, and c) we have found performance has got worse from GPT-3 to GPT-4.
I also frequently ask for examples of where LLMs are able to do the following: a) solve language-based problems, b) reliably, c) at reasonable scale, d) in a reasonably high-stakes environment, and e) with limited human oversight.
For LLMs to have a big positive impact on our society and economy, I think they have to meet these five criteria. I have not yet been given any good example of where they are able to do this.
So where is the link with housebuilding? It is this: if what we are currently hearing about LLMs does turn out to be overhype, then many people who invest significant time, energy and money into them are going to feel burnt. Wider society is going to feel lied to. And the long-term impact of that is probably that people are going to be more sceptical about all forms of technological innovation. AI will get a bad reputation, and the more effective narrow AI applications which really do work will get tarred by association.
This is not pure speculation. The history of AI is full of these cycles of hype and disillusion - the so-called AI winters. In my 2020 book, Teachers vs Tech, I looked at the way education has been particularly susceptible to hype cycles, with the result that many teachers quite understandably want nothing to do with technology at all.
In the last few months, I have spoken to several groups of teachers who have been astonished by my explanations of the errors of LLMs. One of them said to me that whilst they had heard researchers say that LLMs made errors, they'd assumed that these errors were quite trivial ones, like minor typos in a Wikipedia article. Or they'd thought that the errors they made were on extremely controversial issues where opinions vary. They were astonished to learn that they make basic maths errors!
This worries me because it suggests that the current narrative around LLMs is not setting teachers up for a realistic understanding of what they can and cannot do. And it is also therefore setting teachers up for yet more wasted energy, disillusion and disappointment.
Of course, it is entirely possible that GPT-5 comes out, blows everything out of the water, and solves all these problems effortlessly.
If that is the case, no-one will be happier than us, as we love seeing technology solve complex problems and make people's lives easier.
But even if that does happen, that will not mean our scepticism at this stage was wrong. Our scepticism is a response to the state of the evidence as it is at the moment. We're willing to change our mind when we see the evidence change.
We are not willing to burn people's trust on the promise of progress that might never happen. Hype has consequences.
Dear D-AI-sy, this blog is indeed very thoughtful. LLMs are not appropriate for use at scale without significant model enhancements (and I don’t just mean prompt engineering). They have many valuable use cases, as you know. One of them is content ideation and generation. I too am concerned by the amount of hype. In particular, a lot of time seems to be wasted by people creating classroom lesson plans that have weak or no curricular alignment, or even grounding. Everyone has been given the ability to print a personalized textbook, but without the pedagogy and performance. This is a recipe for inequality. It’s also not really saving anyone very much time, given all the crafting that goes into each LLM encounter. But it certainly is fun!
I do think there are many good short-term use cases. Marking essays is not one of them. Editing helper is one. In fact, the best LLM cases are on a one-to-one basis where the user has some degree of English reading and writing proficiency and background knowledge on the subject, as well as intrinsic motivation. This alone whittles down the addressable market of beneficiaries to the top 20% of learners (and not necessarily the ones we need to help most).
For teachers, for LLMs to be effective and really save time, they need to be adapted or enhanced for classroom use. LLMs have a material alignment problem. They don’t teach to a developmentally appropriate level. But they do serve as a powerful research assistant (for content that then needs to be carefully checked and aligned) and a writing assistant (for people who already have strong foundational skills and are using it to save time).
They will get better but not all in the same way. Some will be better at x and some at y. We are already seeing this. Specialised LLMs and associated technologies have tremendous potential value.
Thanks for your tireless work,
Sof