Really insightful for my research. Up to now there is little evidence that LLMs can reliably grade structured response questions in A level chemistry exams. Human-in-the-loop examiners could comparatively judge thousands of scripts, but one of my bugbears as a teacher is deciphering handwriting. Has anyone got a tool that can do this?
Thank you, that was informative. But I don't understand why AI avoids some of the hallucination errors under CJ prompts/tasks. Is ChatGPT ordinarily doing some kind of CJ process on a massive scale, or does it behave differently within a CJ process? And if so, how does it "know" to behave differently?
LLMs continue to hallucinate, but the powerful statistical model built into CJ allows us to isolate the hallucinations and minimise their impact. In this respect LLMs are no different from humans: neither is perfectly reliable. The existence of LLMs doesn't remove the need for old-fashioned statistics to generate reliable, reproducible measurement scales.
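For readers curious what that statistical model looks like: CJ systems typically fit a Bradley–Terry model, which estimates a quality score for each script from many pairwise "which is better?" judgements. The sketch below is illustrative only (the item counts, error rate, and fitting loop are my assumptions, not details from this thread); it shows how the aggregate model can recover a sensible ranking even when individual judgements are wrong some of the time, which is the sense in which hallucinated judgements get "isolated".

```python
import math
import random

def fit_bradley_terry(comparisons, n_items, iters=2000, lr=0.5):
    """Fit Bradley-Terry log-strength scores by gradient ascent.

    comparisons: list of (winner, loser) index pairs.
    Returns mean-centred scores; higher means judged better overall.
    """
    theta = [0.0] * n_items
    for _ in range(iters):
        grad = [0.0] * n_items
        for w, l in comparisons:
            # P(w beats l) under the current scores (logistic in the score gap)
            p = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        # Average the gradient over comparisons for a stable step size
        theta = [t + lr * g / len(comparisons) for t, g in zip(theta, grad)]
        mean = sum(theta) / n_items
        theta = [t - mean for t in theta]  # anchor the scale
    return theta

# Hypothetical simulation: 5 scripts of true quality 0..4, judged by an
# unreliable judge that picks the genuinely better script only 80% of
# the time (the other 20% stands in for hallucinated judgements).
random.seed(0)
true_quality = [0, 1, 2, 3, 4]
comparisons = []
for _ in range(400):
    a, b = random.sample(range(5), 2)
    better, worse = (a, b) if true_quality[a] > true_quality[b] else (b, a)
    if random.random() < 0.8:
        comparisons.append((better, worse))
    else:
        comparisons.append((worse, better))  # a judging error

scores = fit_bradley_terry(comparisons, 5)
ranking = sorted(range(5), key=lambda i: scores[i])
print(ranking)  # weakest-to-strongest order implied by the fitted scores
```

Because each script takes part in many comparisons, the occasional wrong judgement is outvoted by the rest, so the fitted scores should still separate the strong scripts from the weak ones.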
I am continuously looking for examples of edge-case scenarios, and only by solving them will we eventually decide whether an activity can be fully automated and replace humans, or whether it will remain a human-in-the-loop activity for now.
Thanks, Chris, is that statistical model described somewhere?
Excellent post! Thank you for sharing.