<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[No More Marking]]></title><description><![CDATA[Education, assessment and technology by Daisy Christodoulou & Dr Chris Wheadon]]></description><link>https://substack.nomoremarking.com</link><image><url>https://substackcdn.com/image/fetch/$s_!g-Kw!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef3cf81-2f9b-4576-8d8b-92dcad390e4f_256x256.png</url><title>No More Marking</title><link>https://substack.nomoremarking.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 04:20:54 GMT</lastBuildDate><atom:link href="https://substack.nomoremarking.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[No More Marking Ltd]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[daisychristodoulou@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[daisychristodoulou@substack.com]]></itunes:email><itunes:name><![CDATA[Daisy Christodoulou]]></itunes:name></itunes:owner><itunes:author><![CDATA[Daisy Christodoulou]]></itunes:author><googleplay:owner><![CDATA[daisychristodoulou@substack.com]]></googleplay:owner><googleplay:email><![CDATA[daisychristodoulou@substack.com]]></googleplay:email><googleplay:author><![CDATA[Daisy Christodoulou]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why grades are misleading]]></title><description><![CDATA[But grade probabilities are better!]]></description><link>https://substack.nomoremarking.com/p/why-grades-are-misleading</link><guid 
isPermaLink="false">https://substack.nomoremarking.com/p/why-grades-are-misleading</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 11 Apr 2026 08:04:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QVVM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Grades are an established feature of most assessment systems, and are taken for granted as a sensible way of reporting attainment data.</p><p>But do they deserve that status? They create a lot of distortions, they don&#8217;t mean what people think they do, and there are better alternatives available.</p><p>In this post, we&#8217;ll explain what the problems with grades are, how we do things differently, and what our new &#8220;grade probabilities&#8221; report looks like.</p>
<p><strong>What people think student attainment looks like</strong></p><p>Many people have a mental model of a grade as a discrete category that is separate and distinct from other grades. They think students in one grade are qualitatively different from students in another grade, as shown in the following image.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QVVM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QVVM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QVVM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QVVM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!QVVM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QVVM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg" width="1456" height="1029" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:159883,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/193791930?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QVVM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QVVM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!QVVM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!QVVM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc90dd9a2-cbe0-454b-a97f-83e58d5c182b_1684x1190.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>A number of aspects of our current grading system reinforce this idea. 
For example, we give grades labels like &#8220;at the expected standard&#8221;, and we have marking rubrics that suggest there are discrete breaks in performance between one grade and the next. In the chart above I have used the grades from England&#8217;s primary system, but almost every jurisdiction we work in has something similar. A lot of teacher-created grading systems have the same problem. &#8220;Red, amber, green&#8221; is a grading system. So is &#8220;emerging, expected, exceeding&#8221;.</p><p>However, this is not how attainment works, and thinking it does causes a lot of problems.</p><p><strong>What student attainment actually looks like</strong></p><p>Student attainment follows a continuous distribution. The image below gives a much better representation of how it works.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cO8w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cO8w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cO8w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cO8w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!cO8w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cO8w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg" width="1456" height="1029" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136247,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/193791930?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cO8w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cO8w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!cO8w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cO8w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e2d0f8-120d-47f6-8c89-9f531deec453_1684x1190.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Why this is a problem</strong></p><p>Grades are just lines drawn on an underlying distribution. 
They don&#8217;t correspond to sudden leaps in student attainment. When you treat them like they are discrete categories, it causes big distortions, as you can see in the image below.</p><p>Paul and George both have the same grade. But Paul has more in common with John, in the grade below. And George has more in common with Ringo, in the grade above.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mxhn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mxhn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Mxhn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Mxhn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Mxhn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mxhn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg" width="1456" height="1029" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:166022,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/193791930?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mxhn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Mxhn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Mxhn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Mxhn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdafe4033-22d6-49f9-887b-4166bb16a893_1684x1190.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A three-part grading system is the worst kind of grading system and causes all kinds of problems. The categories are too big to be useful. They incentivise tiny progress at the grade boundaries, and don&#8217;t reward big progress elsewhere. They result in very volatile accountability measures. And perhaps most damagingly, they are particularly bad for students at the bottom of the middle category &#8211; that is, in the chart above, Paul. Paul is told everything is OK and he is doing fine but in reality he is struggling as much as John is.</p><p>As well as these very practical and immediate problems, there is a deeper conceptual problem with thinking that student attainment is discrete. 
Three-part grading systems encourage the flawed idea that skills are discrete and that you can &#8220;level up&#8221; by teaching a new skill and jumping to the next grade. If, on the other hand, you accept that skills are composed of sub-skills and knowledge, you will recognise that students improve on a slow and steady incline, not in sudden jagged steps. I&#8217;ve written more about this link between assessment and the knowledge-skills debate <a href="https://substack.nomoremarking.com/p/skills-vs-knowledge-13-years-on">here</a>.</p><p><strong>Improving reporting with scaled scores and writing ages</strong></p><p>The ideal improvement would be to report scaled scores, not grades, and that&#8217;s what we do with all of our writing assessments. A criticism of this approach is that people don&#8217;t know what a scaled score means. One way we have tried to fix this in the past is by converting all our scaled scores to a writing age. We are quite proud of this and think that it is the first writing age anywhere in the world (although it follows very similar principles to reading ages, which are very popular). The basic principle is that we are trying to address the misconception about grades being discrete by using a comparison with an everyday metric &#8211; age &#8211; which everybody intuitively understands is continuous.</p><p>However, we still operate within a national framework that uses a three-part grading system, and the clash between the two systems causes problems. We report the writing age alongside the scaled score and the national Working Towards, Expected Standard, Greater Depth indicator. This means that it is possible for a student to get the Expected Standard label and still get a writing age that is lower than their chronological age. For example, a Year 6 student who is aged 11 could get a writing age of 9 years and 6 months, and still get the Expected Standard. 
We get so many questions from schools asking us how this is possible, and of course it is very confusing.</p><p>But it is the result of the government setting the Expected Standard at the 28th percentile. Expected Standard does not mean, as many people assume, that you are working at the average standard for your age. It includes students who are about 18-24 months below the average. This is true for reading and maths too. Our writing age hasn&#8217;t created this problem; it has just revealed it. </p><p><strong>Our latest innovation: grade probabilities</strong></p><p>In our upcoming set of Year 6 writing results, we&#8217;re going to introduce a new report: grade probabilities. This will tell you the percentage chance that a student is at a certain grade.</p><p>Here&#8217;s an anonymised example of a student with a similar profile to Paul. He has a 47.5% chance of getting the lower grade, and a 51.5% chance of getting the middle grade.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xKOt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xKOt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 424w, https://substackcdn.com/image/fetch/$s_!xKOt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 848w, 
https://substackcdn.com/image/fetch/$s_!xKOt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 1272w, https://substackcdn.com/image/fetch/$s_!xKOt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xKOt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png" width="1112" height="688" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:688,&quot;width&quot;:1112,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xKOt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 424w, https://substackcdn.com/image/fetch/$s_!xKOt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 848w, 
https://substackcdn.com/image/fetch/$s_!xKOt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 1272w, https://substackcdn.com/image/fetch/$s_!xKOt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ebc92ef-4bc7-4370-99c9-9d2ee81b822f_1112x688.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The metric here is measuring something different from the writing age. The writing age is a measure of attainment. 
It takes a given scaled score and just converts it into a typical age.</p><p>The grade probability is a measurement of certainty: how sure can we be that this student is above a certain threshold?</p><p>However, what both metrics have in common is that they replace a crude and distorting threshold system with a smooth and continuous metric.</p><p>We hope this will help schools when it comes to making decisions about Year 6 writing moderation. If it works well and schools like it, we can introduce it for more year groups and jurisdictions. If you&#8217;d like to learn more about our assessments, we have an <a href="https://www.nomoremarking.com/events">intro webinar</a> coming up later this month.</p>]]></content:encoded></item><item><title><![CDATA[Education technology is never neutral]]></title><description><![CDATA[It is both a problem and a solution]]></description><link>https://substack.nomoremarking.com/p/education-technology-is-never-neutral</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/education-technology-is-never-neutral</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 28 Mar 2026 08:42:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4deb79e8-8991-4f3b-ba52-68c782550a8c_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<ul><li><p>&#8220;Any technology can be used well or badly.&#8221;</p></li><li><p>&#8220;Technology is just a tool - what matters is what you do with it.&#8221;</p></li><li><p>&#8220;Kids can use a tablet to study or to play games - the issue isn&#8217;t the tablet, it&#8217;s what they are doing on the tablet.&#8221;</p></li></ul><p>I hear this argument all the time: that when technology gives you a bad outcome, the problem is not the technology but the way teachers or kids are using it.</p><p>For example, last week, Matt Yglesias wrote an article called &#8220;<a href="https://www.slowboring.com/p/ed-tech-is-not-the-answer-or-the">Ed tech is not the answer or the problem</a>&#8221;. Referring to a specific app that has come in for a lot of criticism, he said that it was probably being used well in some effective schools, but poorly in some ineffective ones. 
The issue was not the app, but how it was being used.</p><blockquote><p>But asking whether ed tech is &#8220;good&#8221; or &#8220;bad&#8221; is like asking whether schools should have desks or whether teachers should use erasers. In both cases, they almost certainly should!</p><p>But the presence or absence of erasers is not what&#8217;s making the difference between effective and ineffective schools. If you had a building full of good teachers who were using a good curriculum and had adequate support from administrators and other stakeholders but for some reason they weren&#8217;t allowed to use erasers, they would find that annoying, but I&#8217;m sure they&#8217;d figure it out.</p></blockquote><p>This is a really popular and persuasive argument, and there is a bit of truth to it, because high-functioning and well-managed organisations can make the best of a bad situation. But ultimately, I think it&#8217;s misleading. Truly high-functioning organisations do not deliberately choose tools that create bad situations. They choose the tools that are right for the job. And they do so because they understand that tools are vitally important. They are not neutral and interchangeable widgets, and they are capable of having a profound impact on the way we think and behave.</p>
<p><strong>Tools change our behaviour</strong></p><p>Tools make some behaviours more likely and others less so.</p><p>Take Yglesias&#8217;s own example of an eraser. It&#8217;s a very simple tool, but it still changes behaviour. It makes some behaviours less likely, and other behaviours more likely. In a classroom where every pupil has an eraser, the attitudes to error will be different from those in a classroom where no one has an eraser. A teacher could try to create the same culture and norms in each classroom, but the presence or absence of a specific tool will make it easier or harder.</p><p>Recently, <a href="https://kucharski.substack.com/p/not-going-to-throw-away-my-shot">Adam Kucharski wrote about coding using Mathematica, which only allows one &#8220;undo&#8221;.</a> When he codes using that app, he is far more careful and cautious than if he had unlimited &#8220;undos&#8221;. The &#8220;undo&#8221; tool - basically a digital eraser - shaped the way he thought.</p><p>The argument I am making here is an extension of Marshall McLuhan&#8217;s &#8220;the medium is the message&#8221; argument. I think <a href="https://en.wikipedia.org/wiki/Amusing_Ourselves_to_Death">Neil Postman</a> has given the best concrete example of this: if your major medium of communication is smoke signals, then your messages are unlikely to include philosophical tracts. 
The form of smoke signals precludes certain content and types of thought.</p><p><strong>Screens make certain behaviours more likely</strong></p><p>Laptops, tablets and phones are far more powerful than an eraser, and have a much more powerful effect. They often replace a textbook or an exercise book, but compared to those paper technologies they make task-switching much more likely.</p><p>You could be the best teacher in the world, and be completely committed to getting students to concentrate deeply and read difficult texts. But if you are in a classroom where every pupil accesses the content via a screen, I think you will be less likely to achieve your aims than a weaker teacher in a classroom with no screens at all.</p><p>Not only that, but there are big differences between different screen types. They are all optimised for different functions, and make those different functions more likely. </p><p>Desktops and laptops have physical keyboards, and are optimised for long-form writing, and not for messaging on the move. Mobile phones are optimised for scrolling, swiping, and short messages. You don&#8217;t see people walking down the street texting on their laptop. And people tend not to write novels on their phones. Tablets are different again. I think they are optimised for passive consumption of media, as opposed to creation of it. </p><p><strong>The mode effects research: yes, ed tech is a problem</strong></p><p>I don&#8217;t know enough about the specific app Yglesias refers to in his article. But I do think that regardless of the quality of the app or the content on it, there is a difference between learning on screen and learning on paper. 
</p><p>There is a large research literature on &#8220;mode effects&#8221; - essentially what happens when you change the medium of an assessment but keep the content the same.</p><p>One of the <a href="https://www.tandfonline.com/doi/abs/10.1080/03054985.2018.1430025">best and most rigorous recent studies</a> analysed the results of more than 3,000 students in Germany, Ireland and Sweden, who had taken the 2015 Programme for International Student Assessment tests in reading, maths and science. The students were randomised into two groups. One group took the test on paper; the other took it on a computer. The paper-based group scored a full 20 scaled-score points higher than the computer-based group. That is the equivalent of about six months of additional schooling - a huge difference. </p><p>I spoke to the author of the paper, John Jerrim, about this research for an <a href="https://www.tes.com/magazine/teaching-learning/secondary/future-of-assessment-onscreen-exams-no-grades-ai">article I wrote for the TES</a>, and he told me that he was really surprised by the magnitude of the effect. If an educational intervention caused that kind of improvement, we would be rushing to scale it up!</p><p><strong>AI-enhanced Comparative Judgement: can we make ed tech part of the solution?</strong></p><p>At No More Marking, this is something we think about constantly. What do our tools and technologies make <em><strong>more</strong></em> likely? What do they make <em><strong>less</strong></em> likely?</p><p>We&#8217;ve been running Comparative Judgement assessments for nearly a decade, and have put significant effort into creating <a href="https://substack.nomoremarking.com/p/paper-and-on-screen-assessments">paper-based assessments </a>that can be assessed digitally. Our system allows you to assess writing in an incredibly technologically sophisticated way - without a pupil ever seeing a screen. We&#8217;ve assessed about 3 million pieces of writing using this process. 
</p><p>We have now <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">added AI judges to our assessments</a>, which changes the dynamics yet again. What will AI assessments make more and less likely? Well, AI assessment is faster and easier than human assessment. If you make something quicker and easier, it tends to happen more often. So schools may run more writing assessments.</p><p>That could be good. It could mean better validation of interventions, reduced teacher workload, more opportunities for pupils to receive feedback &#8212; even, potentially, daily practice in the run-up to big national exams.</p><p>But it could also be bad. In younger years, for example, an increase in extended writing assessments may not be desirable. Shorter, different kinds of assessment may be more appropriate. We have <a href="https://substack.nomoremarking.com/p/the-no-more-marking-writing-progression">some of these already</a>, but maybe we&#8217;ll need to beef them up and make them more prominent. </p><p>We also provide <a href="https://substack.nomoremarking.com/p/bringing-our-feedback-philosophy">a wider range of feedback</a>, some of it directly created by AI. What will the effect of this be?<a href="https://substack.nomoremarking.com/p/the-ethics-of-ai-assessment-five"> We recently wrote about</a> some focus groups we&#8217;ve been doing asking students about what kinds of feedback they prefer. We also have <a href="https://substack.nomoremarking.com/p/how-do-you-know-your-feedback-is">a project running right now </a>where we measure how much improvement students make when they redraft their writing in response to AI feedback.</p><p>Our aim is to create tools that make good outcomes more likely and bad outcomes less likely. This will not happen by accident!</p><p><strong>Am I denying human agency?</strong></p><p>The attraction of the &#8220;technology is neutral&#8221; argument is that it makes us feel like we are in control. 
As I say, there is a grain of truth to this: there will be a range of ways you can deploy a technology.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>  My argument is that the range is limited. The technology sets the floor and ceiling you operate within. The more powerful the tool, the narrower the range you can operate within. An eraser is a relatively weak tool which still gives the teacher and students a wide range to operate in. A tablet is a much more powerful tool, and its power constrains the behaviour of teachers and students. </p><p>Your true agency is <strong>not</strong> in how you use the tool. By that point, the constraints are already in place. Your true agency involves how you select the tool, and in the input you have on its design. As Winston Churchill (<a href="https://quoteinvestigator.com/2016/06/26/shape/">almost</a>) put it: we shape our tools, and thereafter our tools shape us.</p><p><strong>Help us shape AI feedback &amp; assessment!</strong></p><p>And that is why we are constantly talking to schools about what they want in an assessment system, and reviewing the data to see what impact it&#8217;s having. If you would like to be a part of these design efforts, you can! </p><ul><li><p>My colleague Chris has created a user group for secondary schools in England who want to use our AI system to mark GCSE mocks. If you&#8217;d like to learn more about this, <a href="https://bit.ly/TalkToNMM">contact us</a>.</p></li><li><p>I am leading our efforts to optimise the AI feedback in different subjects. 
If you&#8217;d like to try out our system with 30 free credits, you can book a call with me <a href="https://calendar.app.google/4zj6oi3gbTY4AQkY6">here</a>.</p></li><li><p>If you would just like to learn more, <a href="https://www.nomoremarking.com/events">sign up for our next intro webinar</a> on Mon 27 April.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Matt Yglesias presents some data to show that different schools using the same app can still get different outcomes. Yes, of course that will happen. I would not expect every school with omnipresent tablets to get exactly the same outcomes. I would not expect every school with phone bans to get exactly the same outcomes. Still, it is suggestive that in England, the highest performing schools tend to have very sparse use of screens in the classroom. (&#8220;Highest performing&#8221; as measured by the very sophisticated Progress 8 measure which measures how much value every secondary school adds across 5 years of education.) 
</p></div></div>]]></content:encoded></item><item><title><![CDATA[The ethics of AI assessment: five big issues]]></title><description><![CDATA[What do teachers, students and parents think about AI marking essays?]]></description><link>https://substack.nomoremarking.com/p/the-ethics-of-ai-assessment-five</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/the-ethics-of-ai-assessment-five</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 21 Mar 2026 08:30:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/52c1341d-b5cf-4ba5-980e-af05d7c9f611_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past couple of weeks, my colleague Chris and I have been out and about talking to teachers, students and parents about what they do - and do not -  want from an AI assessment system. </p><p>Here are five big issues that keep recurring.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/the-ethics-of-ai-assessment-five?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/the-ethics-of-ai-assessment-five?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/the-ethics-of-ai-assessment-five?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><ol><li><p><strong>Speed &amp; efficiency matter</strong></p></li></ol><p>Students care about getting feedback quickly, teachers care about excessive workload and senior teams care about students getting enough exam practice. </p><p>There is nothing wrong with any of this. As I&#8217;ve written about in<a href="https://amzn.eu/d/02DMykGW"> a completely different context</a>, speed matters. Speed is not opposed to quality; it is an aspect of quality. If you get feedback on your essays within a couple of hours of completing the essay, you will be much more likely to understand it and act on it. </p><p>And yet it is obviously wildly unrealistic to expect human teachers to routinely provide feedback on essays within a couple of hours. In fact, one of my biggest bugbears as a teacher was when a student would hand an essay in a week late, and then turn up at the staff room door at the end of the day asking if I&#8217;d marked it!</p><p>If AI can deliver quicker feedback, that&#8217;s definitely a good thing. </p><ol start="2"><li><p><strong>Accuracy</strong></p></li></ol><p>Everyone worries about AI errors, and about what the process is for dealing with them. 
Again, this is perfectly legitimate: one of the things that we&#8217;ve written a lot about is that traditional exam systems have <a href="https://substack.nomoremarking.com/p/what-do-you-prefer-human-error-or?utm_source=publication-search">well-established processes for dealing with human errors</a> which don&#8217;t work with AI (we are building the processes for AI- see <a href="https://help.nomoremarking.com/en/article/what-to-do-if-you-spot-an-ai-error-m0fggq/">here</a>).</p><p>But the flip side is that everyone understands that humans make errors too. I have spoken to groups of students and teachers who were genuinely shocked to learn that currently, with human marking, in GCSE English Literature you only have <a href="https://assets.publishing.service.gov.uk/media/5bfbfd70e5274a0fb775cca3/Marking_consistency_metrics_-_an_update_-_FINAL64492.pdf">a 52% chance of getting your true grade</a>. </p><ol start="3"><li><p><strong>Human contact</strong></p></li></ol><p>Students care what their teachers think about them and they want their teachers to read what they write. </p><p>However, there were some ways in which they were interested in the concept of AI feedback for its own sake - not just because it would be quicker.</p><p>For example, one student told us they liked the idea of AI feedback because it might offer a different perspective from their teacher, and pick up things their teacher had not thought of.</p><ol start="4"><li><p><strong>De-skilling</strong></p></li></ol><p>Another concern we hear - particularly from senior teams - is about de-skilling: what if teachers lose the capacity to mark essays and give feedback on student writing?</p><p>In some areas, I don&#8217;t care about de-skilling. For example, I have seen so many examples of teachers staying late formatting PowerPoint slides and trying to find just the right image for their worksheet, and I have never been convinced it&#8217;s a good use of their time. 
Andrew Old has <a href="https://andrewold.substack.com/p/deskilling-teachers-part-1">written about this recently</a> and I largely agree with him. If a teacher never designed another resource again I would not be that bothered.</p><p>However, when it comes to assessment, I am much more concerned about the possibility of teachers losing important skills. If a teacher never read another student essay again I would be very concerned. </p><p>We have to design systems that reduce workload and speed things up, but that preserve teachers engaging with student writing.  </p><ol start="5"><li><p><strong>The environment</strong></p></li></ol><p>Recently, we have been hearing more concerns about the environment, particularly about how much water AI uses, and therefore how much water it would take to mark an essay. </p><p>We are not unconditional AI boosters, and we are always willing to consider the downsides of the technology. But on water use specifically, I think the concerns have been overblown.  Andy Masley has done some <a href="https://substack.com/@andymasley/p-175834975">excellent analyses</a> of this, including showing that <a href="https://andymasley.com/writing/empire-of-ai-is-wildly-misleading/">one of the most famous analyses of AI water use confused cubic metres and litres and was out by a factor of 4500!</a>  </p><p><strong>Find out more</strong></p><p>As ever, if you want to find out more about what we do, you can join one of our <a href="https://www.nomoremarking.com/events">intro webinars.</a> The next one is on Monday April 27. 
These webinars are very popular - we show you how the system works and at the end we give 30 free credits to all attendees, so you can try it yourself on a class set of essays in any subject.</p><p>If you work in a school, you can also book a 30-minute call with me <a href="https://calendar.app.google/YknS6isPuH3vn4u49">here</a> where I can get you set up on our system with 30 free credits.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[The democratisation of cheating]]></title><description><![CDATA[When everybody knows that everybody cheats]]></description><link>https://substack.nomoremarking.com/p/the-democratisation-of-cheating</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/the-democratisation-of-cheating</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 14 Mar 2026 08:45:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92e73c4b-cab8-40ff-9d13-a9c0f3d6a6c5_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A couple of years ago, a grocery delivery company came up with a catchy slogan. 
Their service, they said, was about the &#8220;democratisation of laziness.&#8221;</p><p>It is a memorable phrase, and also slightly unsettling. It&#8217;s true that it is easier to be lazy if you are wealthy and privileged, but it&#8217;s also true that we don&#8217;t think of laziness as an absolute good to be maximised. Rather than giving everyone the chance to be lazy, maybe we should think about finding ways to make everyone less lazy?</p><p>Of course, laziness has its upsides as well as downsides, so maybe democratising it is not so bad. I&#8217;ve written about this dilemma in a <a href="https://substack.nomoremarking.com/p/are-we-living-in-a-stupidogenic-society">piece on the stupidogenic society.</a></p><p>But there are some things that are more unambiguously bad where we should definitely try to remove elite privileges rather than spread those privileges to everyone. Cheating is one of them. Wealthy students have always been able to pay top dollar to have bespoke essays written for them. But until recently, this kind of unidentifiable cheating was only available to the very wealthy. Large Language Models have changed all that. They have democratised cheating, and made it available for the masses.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/the-democratisation-of-cheating?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/the-democratisation-of-cheating?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/the-democratisation-of-cheating?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>This is a problem for schools, but it&#8217;s a much bigger problem for universities, who rely a lot more than schools on unsupervised written assessments. A few years ago, when it became clear that LLMs were so good at writing essays, I naively thought that universities would just have to take all their assessments in-person. That has not happened. Social media is full of academics lamenting the AI slop they have to mark. There is a lot of lamentation, but a lot less action.</p><p>I have <a href="https://engelsbergideas.com/notebook/indulgences-llms-and-the-crisis-of-the-university/">a longer article in Engelsberg Ideas this week</a> where I draw a historical parallel with the medieval sale of indulgences - another case where technology dramatically expanded access to a controversial shortcut.</p><p>In Germany and in England, one of the first uses for the new printing press was to create pro forma indulgence certificates that could be filled in with the purchaser&#8217;s name. The <a href="https://www.nationalarchives.gov.uk/education/resources/significant-people-collection/william-caxton/#:~:text=View%20full%20image,Return%20to%20Significant%20People">first item printed in England</a>, by William Caxton, was one of these certificates. In Germany entire batches of them were printed. </p><p>In the short-term, this made the church a lot of money. In the medium-term, it caused them a lot of problems. 
Before long, Martin Luther started using the printing press in a different way, to spread his criticisms of the sale of indulgences.  (Interestingly, the printer of his 95 Theses <a href="https://dia.pitts.emory.edu/collections/digitalcollections/mss085.cfm#:~:text=The%20printing%20of%20this%20indulgence,indulgences%2C%20issued%20in%2015%20editions">also printed books of indulgence certificates</a>.)</p><p>Maybe in the short-term it is easier for universities to turn a blind eye to the obvious cheating that is going on. I can see how students and professors might grumble if their traditional assessment system was changed, and perhaps students would be less likely to attend universities that had cheat-proof in-person assessments, which in turn would affect their bottom line. </p><p>But the medium- to long-term consequences of letting the AI slop become normalised are terrible. Maybe not &#8220;Thirty Years&#8217; War&#8221; terrible, but arguably &#8220;Dissolution of the Universities&#8221; terrible. Students are not stupid! They know that if they are putting everything through AI, so are all their classmates! At a time in the UK when people are starting to question <a href="https://substack.nomoremarking.com/p/does-a-university-education-help">the monetary value of a degree</a>, and to wonder whether some university expansion is justified, the inability of universities to respond to technological change is storing up massive problems. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The AI Death Zone: a cautionary tale]]></title><description><![CDATA[Vibe-coding off a cliff edge]]></description><link>https://substack.nomoremarking.com/p/the-ai-death-zone-a-cautionary-tale</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/the-ai-death-zone-a-cautionary-tale</guid><dc:creator><![CDATA[Chris Wheadon]]></dc:creator><pubDate>Sat, 07 Mar 2026 08:30:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6d2c0bc4-0b36-46fa-9a3c-cd6c38aa95a7_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This week&#8217;s Substack is a bit of an &#8220;inside baseball&#8221; report on how we are - and are not - using AI tools to develop our website. It&#8217;s by our CEO, Chris, with reference to our CTO, Brian. The same AI promises and pitfalls we&#8217;ve found when marking writing are also present when developing websites. For more on our AI marking platform sign up to <a href="https://us02web.zoom.us/webinar/register/WN_z3bLNQZsTMqcNXNyDEkH4w">our next webinar on Mon 27 April.</a></em></p><p>Brian and I have been coding together for over 20 years now. I met him at the CEM Centre at Durham University in the early 2000s, and he taught me how to code. In 2013, we founded <strong>No More Marking</strong> together and have scaled it to process millions of writing scripts every year. 
We&#8217;ve been through every technological fad there is.</p><p>We started by trying to deliver Python code on the web with <strong>Django</strong>, which taught us that the web and Python were not easy bedfellows. I lost weekends configuring Linux boxes with endless scrolling text that would end in baffling errors. Thank goodness for <strong>Stack Overflow</strong>. We then moved to <strong>Meteor</strong>, which offered real-time updating of information for users that seemed magical but didn&#8217;t scale. Finally, we moved on to a proper <strong>serverless</strong> stack when serverless was just becoming a thing. At every stage, we&#8217;ve been ready to take on the latest innovation to learn how to deliver using the best of new technologies.</p><p>We&#8217;re certainly far from the stereotype of the programmers who learned to code at university using PHP and never really wanted to move beyond it. We&#8217;ve been through Ruby on Rails and Scala, and fulfilled our Bayesian dreams with OpenBUGS. Recently, however, we&#8217;ve been faced with a new challenge: the <strong>AI coding agent</strong>. </p><p>For the first time in over 20 years, Brian and I no longer see eye to eye. Brian has fallen prey to <strong>AI addiction </strong>in a way I fear is irredeemable. It started fairly innocently; I noticed Brian was making function calls that simply didn&#8217;t exist in the libraries he was using. When I asked him if he&#8217;d actually read the documentation, he would say we no longer need to&#8212;the AI would do it for us. He was at a loss to explain why the AI was inventing methods and properties that the library simply didn&#8217;t support.</p><p>That&#8217;s where it started. 
It has since got a lot worse.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/the-ai-death-zone-a-cautionary-tale?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/the-ai-death-zone-a-cautionary-tale?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/the-ai-death-zone-a-cautionary-tale?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h3>The Mirage of &#8220;Vibe Coding&#8221;</h3><p>Brian has now developed what I would call a severe case of AI dependency: he requires a regular hit, a new model, a new release from <strong>Anthropic or OpenAI </strong>to keep him going. A small dose used to be enough, but he&#8217;s now taking stronger and stronger hits, always craving the latest designer drug.</p><p>In the early days of No More Marking, Brian&#8217;s nickname was &#8220;imaginary Brian&#8221;. As I was out on the road talking to users, I would always refer to &#8220;Brian&#8221; who would sort things out back at the ranch. As no one had ever seen Brian, there was a rumour he didn&#8217;t actually exist. Now I fear that rumour has become reality, and I wish we could get the real Brian back.</p><p>Most recently, he convinced me that we could <strong>&#8220;vibe code&#8221;</strong> an entire application. I read up on it; I read all the blogs from the DeepMind crew and suspended disbelief while I read about spec-driven AI development. 
As a huge fan of Test-Driven Development, where you write a test first and then write the code, I thought this seemed like something I could get on board with. We fired up the planner from GitHub and spent hours crafting a specification for a fairly simple component of the app. We did everything we thought was expected for vibe coding success. After 10 minutes of watching GitHub whirring away, producing the most insane set of documents, specifications, features, plans and architectures I&#8217;ve ever seen, we stopped it.</p><p>But that wasn&#8217;t the end. We had to switch tools&#8212;there&#8217;s always a better tool, a different approach, a better model. We switched platforms. We spent 70% of our budgeted time planning, making sure the architecture we were dictating was sensible. It was at a level where a team of human coders could have created the app feature for us. I wouldn&#8217;t let Brian press the &#8220;start coding&#8221; button until I was sure we had everything in place. As the planning stage got longer, his finger got more and more twitchy. Eventually, I let him press the button for the AI to start coding.</p><h3>The Syntax vs. The System</h3><p>He assured me we could go and have lunch, and when we came back, the code would be ready. It sounded fabulous. The promise was that we had done the hard work&#8212;the human thinking&#8212;and we&#8217;d leave the grunt work&#8212;the actual writing of the syntax&#8212;to the AI. Surely nobody wants to be writing syntax when they could just be thinking.</p><p>Two hours later, a rather crestfallen Brian called me. &#8220;I don&#8217;t think we got the specification quite right,&#8221; he said. I asked to see what it had done. He said, &#8220;Well, we&#8217;ve got issues with things like data typing and interfaces, but let me just see what happens if I click this button here.&#8221;</p><p>I asked, &#8220;Have you read the code? Do you know what that button is going to do? 
If you attach a PDF and click that button, do you know where that PDF is going to go?&#8221;</p><p>Of course he hadn&#8217;t read the code, but thankfully when he clicked the button, nothing at all happened. At this point I knew we were about to enter the Death Zone. The Death Zone for mountaineers is where, starved of oxygen, progress slows to a slow-motion crawl. Programmers are not starved of oxygen, they are starved of understanding. They are looking at code they haven&#8217;t written, assumptions they never would have made. We only seem to hear these days from those who have summited, but I suspect the AI death zone is piled high with bodies, all with the words &#8220;just one more prompt&#8221; frozen on their lips. </p><h3>The Art of the Language</h3><p>Brian and I are experienced coders. We&#8217;ve delivered web apps at scale with millions of concurrent hits and performed sophisticated statistical analysis. I, for one, have been published in the <em>Journal of Statistical Software</em>&#8212;a career highlight. No doubt some will think we just used the wrong model or the wrong tool. Maybe. But while Brian gets a hit of adrenaline with every new model, I have that familiar sinking feeling.</p><p>From a personal point of view, I would be very happy if vibe coding turned out to be a mirage. We and others have written about how writing is thinking<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Surely coding is thinking? For the last 10 years, I&#8217;ve worked with the <strong>R language</strong> and seen how it has developed. That evolutionary process has been vital to it becoming the world&#8217;s most popular statistical language.</p><p>The ecosystem in R called the <strong>tidyverse</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> is a work of great beauty. 
Those of us who learned R before the <strong>tidyverse</strong> remember just how difficult some things were to achieve in an elegant fashion. The <strong>tidyverse</strong> evolved on top of the base R language through the hard work and dedication of creative and unpaid individuals, and opened up new possibilities. The <strong>tidyverse</strong> became a vibrant ecosystem which makes data science accessible and fun. No one got rich from creating the <strong>tidyverse</strong>, but the world got to be a safer, more creative and more beautiful place.</p><p>I simply cannot understand how a Large Language Model that is trained to reproduce patterns could ever produce genuine evolutions in the coding languages we use. The <strong>tidyverse</strong> was carved out by people who understood the base language so deeply they knew exactly how to improve it. They didn&#8217;t just &#8220;vibe&#8221; with the syntax; they mastered it and improved it. The <strong>tidyverse</strong> manifesto quotes Hal Abelson: &#8220;Programs must be written for people to read, and only incidentally for machines to execute.&#8221; Are we entering a world now where only machines read programs?</p><p>It&#8217;s probably time to get back to Brian now and see if I can rescue him from the death zone. He&#8217;s soon off for a trip to Shanghai where he&#8217;s looking forward to being driven around in a driverless car. When he ends up in the Yangtze he&#8217;ll still be wondering if he got the prompt wrong.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;879261b5-e376-408a-8725-75b7758c0c77&quot;,&quot;caption&quot;:&quot;In the last few months, I&#8217;ve read and heard so many stories of teachers and university professors getting frustrated with students handing in written assignments that have completed by generative AI.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;What is the point of learning to write in a world with AI?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:4339905,&quot;name&quot;:&quot;Daisy Christodoulou&quot;,&quot;bio&quot;:&quot;Director of Education at No More 
Marking&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/af9e996a-8b0d-463b-914b-78c16231b1a6_500x500.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-07-18T08:01:22.956Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f0765ab-5627-4622-adf0-7e3dd23b7c68_502x314.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://substack.nomoremarking.com/p/what-is-the-point-of-learning-to&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:168155746,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:111,&quot;comment_count&quot;:25,&quot;publication_id&quot;:1499167,&quot;publication_name&quot;:&quot;No More Marking&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!g-Kw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef3cf81-2f9b-4576-8d8b-92dcad390e4f_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>https://tidyverse.tidyverse.org/articles/manifesto.html</p><p></p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Solving marking at scale! 
]]></title><description><![CDATA[The AI and assessment state of play, February 2026]]></description><link>https://substack.nomoremarking.com/p/solving-marking-at-scale</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/solving-marking-at-scale</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 28 Feb 2026 08:45:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9678592d-fc5b-4851-addc-5173dda85f02_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a href="https://substack.nomoremarking.com/p/so-can-ai-assess-writing">March last year</a>, we presented a major breakthrough in our AI assessment model. We were able to use <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">a blend of human and AI judgement</a> to reliably and efficiently assess student writing.</p><p>Nearly a year on, where are we? What else have we learned and what&#8217;s next?</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/solving-marking-at-scale?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/solving-marking-at-scale?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/solving-marking-at-scale?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p><strong>What we&#8217;ve done so far</strong></p><ul><li><p><strong>Standardised writing assessment at scale:</strong> The model we developed in early 2025 has proven itself at scale. We&#8217;ve now used it to assess nearly half a million pieces of student writing, from students aged 5 to 16. Most of these are from schools in England, some are from the US, and this month we&#8217;re running our first AI-enhanced assessment in Australia &amp; New Zealand. We&#8217;ve been able to <a href="https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects">validate</a> our model in various ways too.</p></li><li><p><strong>Assessing other subjects at scale using AI:</strong> we&#8217;ve run our <a href="https://blog.nomoremarking.com/cj-history-results-646a7edf1ea7">first national history assessment</a>, which had different challenges to writing but still worked well.</p></li><li><p><strong>Improving AI rubrics:</strong> the way you prompt the AI is probably not as important as everyone thinks - but nevertheless, we have gained <a href="https://substack.nomoremarking.com/p/how-to-write-a-good-rubric-for-humans">a lot of insights</a> about what makes for the best rubric to give the AI.</p></li><li><p><strong>Bespoke AI tasks for individual schools:</strong> as well as our big nationally-standardised assessments, schools can also use all the AI judging and feedback features for their own assessments on their own timeline. 
These will not be standardised, but in a big school you can use a mix of statistics and human judgements to set your own grade boundaries. There&#8217;s a case study <a href="https://substack.nomoremarking.com/p/ai-assessments-for-ks3-english-literature">here</a>. </p></li><li><p><strong>Better feedback</strong>: We have an audio feedback system that allows teachers to provide audio comments on every piece of writing which are then transcribed and polished by the AI. We really like this system - but so far it seems our schools prefer the direct AI feedback which is generated automatically. As well as a written comment, the AI can now also generate <a href="https://substack.nomoremarking.com/p/how-do-you-know-your-feedback-is">personalised quizzes</a>. </p><p></p></li></ul><p><strong>What we&#8217;re doing next</strong></p><ul><li><p><strong>Improving handwriting recognition:</strong> we have written a lot about how uncannily accurate the AI judging is. We still have not yet encountered a major human-AI disagreement where we think the AI has made the wrong decision. However, there are still issues with the step <em><strong>before</strong></em> the judging, where the AI transcribes the student handwriting. The AI does sometimes improve the writing when it transcribes, imposing sense and meaning that are not in the original piece. This can then lead to the wrong judgement being made, but the source of the problem here is not the AI judging going wrong, but the AI transcription. We have a system to catch and correct these errors, and we are working on developing better open-sourced handwriting models.</p></li><li><p><strong>Using AI to create rubrics, not just use them</strong>: typically, we give the AI criteria and it uses that to judge. 
We are working on an RE project where we will get the AI to create criteria <strong>after</strong> it judges - e.g. to tell us what the typical features of the best and weakest writing are.</p></li><li><p><strong>GCSE / multi-question assessment:</strong> so far, most of our big projects have involved assessments of just one piece of writing. GCSEs pose a more difficult logistical challenge, because you have to combine several questions and several marks. The AI is still good at judging these; we just have to find ways to make it simpler to pull together all the marks from all the questions and apply a grade. </p></li></ul><p><strong>Is AI going to change the world?</strong></p><p>As we have been working hard on all of the above, there has been a wider debate going on about the extent to which AI is going to change / destroy the wider economy.</p><p>We set up this Substack to detail our journey through AI, and if you go <a href="https://substack.nomoremarking.com/p/ai-powered-essay-marking">back to 2023-24 you can see</a> that we were much more sceptical than we are now. </p><p>For us, the thing that has made the biggest difference is not necessarily the improvement in quality of the cutting-edge models - but the dramatic reduction in cost of the standard models. This has allowed us to &#8220;over-assess&#8221; - to send writing off to be judged multiple times, which helps us weed out inconsistent and biased judgements.</p><p>A lot of our theoretical concerns about LLMs still exist - the hallucinations, the probabilistic decision-making, the challenges with getting them to work reliably at scale. 
But we have found ways around most of these problems, such that in practice, our model is very useful!</p><p>Even now, it is easy for us to get bogged down in the details of the fractions of a percent that aren&#8217;t working right - and that is the right thing for us to do, because a fraction of one percent at scale is still a big number.</p><p>But it is also important to sometimes step back and take a look at where we are. And when I do that, I keep coming back to the same thought: if this system had arrived on my desk halfway through my teacher training year in 2007-08, I would have thought it was unbelievably brilliant and it would have dramatically changed my life - and my students&#8217; - for the better.</p><p><strong>Make your own mind up</strong></p><p>Our next <a href="https://www.nomoremarking.com/events">intro webinar is in April</a>. These webinars are very popular - we show you how the system works and at the end we give 30 free credits to all attendees, so you can try it yourself on a class set of essays in any subject.</p><p>If you work in a school, you can also book a 30 minute call with me <a href="https://calendar.app.google/YknS6isPuH3vn4u49">here</a> where I can get you set up on our system with 30 free credits. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Is it possible to develop a tutor-proof test?]]></title><description><![CDATA[Or should we focus on tests worth teaching to instead?]]></description><link>https://substack.nomoremarking.com/p/is-it-possible-to-develop-a-tutor</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/is-it-possible-to-develop-a-tutor</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 21 Feb 2026 08:45:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c8830952-8f3c-46e6-a0dc-cd2bcf355e99_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At No More Marking, most of the assessments we provide are fairly low-stakes. However, we do have experience with high-stakes tests, and we know how challenging they are to design.</p><p>If you are using a test as a selection mechanism for a prestigious institution, you will have armies of very smart parents and well-paid tutors trying to crack the code of the test.</p><p>Over the past decade or so, a couple of phrases have cropped up to describe the way these selection tests should work. First, people argue that we should have &#8220;tutor-proof tests&#8221; that cannot be cracked by the parents and tutors. Second, we should have &#8220;tests worth teaching to&#8221;, so that if students are being prepped for the test, the prep is worthwhile.</p><p>Do these two concepts hold water? 
In this post, we&#8217;ll examine the idea of tutor-proof tests.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/is-it-possible-to-develop-a-tutor?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/is-it-possible-to-develop-a-tutor?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/is-it-possible-to-develop-a-tutor?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h2>Some historical background</h2><p>Historically, many famous English public schools selected pupils at age 13 using the Common Entrance exam.</p><p>Common Entrance exams are linked to a defined curriculum. The advantage of this is great coherence and clarity for students and teachers at the prep schools and public schools. The disadvantage is it probably restricts the pool of students who can apply to the public schools.</p><p>Not all independent schools operated on this model. I went to a selective secondary school, City of London School for Girls, which used a more curriculum-neutral test consisting of a reading comprehension, writing task and maths paper. I hadn&#8217;t attended a private prep school or had a private tutor, but the test resembled a lot of what I had done at my state primary school, so I was not at a massive disadvantage compared to others. 
Had CLSG run Common Entrance, it&#8217;s unlikely I would have even applied, let alone got in.</p><p>However, whilst the test I sat was <em>more</em> curriculum-neutral than Common Entrance, it was not <em>completely</em> curriculum-neutral, and nor was it immune to tutoring and preparation. In the last decade or so, even this kind of maths, reading and writing assessment has been criticised for excluding talented but disadvantaged students who don&#8217;t have access to good schools and tutors.</p><h2>The tutor-proof test</h2><p>Is it possible to design a test so content-free that it captures something like raw potential, or the underlying ability to flourish in an academic environment? Verbal reasoning tests reward vocabulary knowledge, which can be taught. Numerical reasoning tests reward maths knowledge, which can also be taught. But what about non-verbal reasoning tests? These are the kinds of tests where you are given four shapes and then asked: which shape continues the sequence?</p><p>You can see how these tests are less tied to curriculum knowledge, and there is serious research in this area suggesting that they might therefore be useful for identifying talented but disadvantaged students. 
David Card is a Nobel laureate who has done research showing that a non-verbal test administered at second grade in a district in Florida &#8220;led to large increases in the fractions of economically disadvantaged and minority students placed in gifted programs.&#8221; Jonathan Wai is another researcher who has done a lot of interesting work on these types of questions, and who has also been involved with talent identification programmes.</p><p>In large-scale government-run school systems with lots of disadvantaged students, non-verbal assessments can help identify students who are able but poorly served by their schooling.</p><p>But there are big differences between low-stakes talent-identification across a government school system and high-stakes entry to prestigious selective schools. When an expensive tutor hears the phrase &#8220;tutor-proof test&#8221;, he interprets that not as a warning but as a challenge.</p><h2>Practice effects</h2><p>There is a huge literature on &#8220;practice effects&#8221;, which essentially shows that if you practice a specific skill, you will get better at that specific skill. If you practice touch typing every day, you will get better at it. If you practice your multiplication tables every day, you&#8217;ll get better at them. If you practice tying your shoelaces every day, you&#8217;ll get better at it.</p><p>The practice effect is one of the most robust findings in cognitive psychology, and poses an enormous challenge to the idea of the tutor-proof test.</p><p>The response of test developers to this challenge is to say that they can create enough <strong>new</strong> question types that practice on <strong>past</strong> question types won&#8217;t deliver huge gains.</p><p>That is, they&#8217;ll say that you can practice tying your shoelaces, but then the test will be on a different kind of knot, so you won&#8217;t have any advantage. From a cognitive science point of view, this is a tricky one. 
It is true that the practice effect holds for practice of a specific skill. It is also true that transfer to different contexts is hard, and that so-called &#8220;far transfer&#8221; is exceptionally difficult. So yes, the test developers are right to say that the more novel the question type, the less valuable the practice of old question types is. </p><p>But &#8220;less valuable&#8221; is not the same as &#8220;not valuable at all&#8221;. And whilst far transfer is extremely difficult, near transfer is more possible. Even if practice of old question types gives you quite small gains, in a high-stakes environment those small gains can be the difference between success and failure.</p><p>Also, to make this system work, you require test developers to constantly create new types of question that are as different as possible from what has gone before. This poses a number of difficult technical challenges.</p><p>First, there are obvious constraints to just how many new types of short non-verbal test questions it is possible to create. If you are running 3 test sessions a year, after ten years you will need to come up with thirty different types of question. There are limits to how many ways you can vary the essential concept of looking at a 2D shape and moving it around in some way.</p><p>Second, if you really are creating very new questions for each round of tests, then you need to run a new validation process each time. Good validation processes take time: ideally you want to wait a few years and gather information on whether the students who passed that test are thriving at their new school. But if you are constantly having to create new question types, you don&#8217;t have the time for that.</p><p>Third, even if your system works for the first few years, there is no guarantee it will keep working over time as tutors learn more about it and optimise their teaching. 
This is a classic Goodhart&#8217;s Law problem: when a measure becomes a target, it loses value as a measure.</p><p>We see numerous examples of this in our work and research. A really famous one is that early AI essay markers delivered pretty good levels of agreement with human markers, and seemed to have solved the problem of AI marking. However, on closer investigation it turned out that they were largely just rewarding the length of the essay. In a low-stakes environment, it is possible that this wouldn&#8217;t cause too many problems. But in a high-stakes assessment where students, teachers and parents are all striving to do as well as they can, the system will break down, because students will realise that the way to succeed is to <a href="https://substack.nomoremarking.com/p/can-chatgpt-mark-writing-c98ff1f1a89">write the same sentence a couple of hundred times</a>.</p><p>Likewise, it is possible that tutors find ways of teaching tips and tricks that help students answer the non-verbal questions, but that systematically break the link between the question and what it is supposed to be measuring.</p><h2>What is the impact on students?</h2><p>The extensive literature on the practice effect shows it delivers substantial gains.  But there is a chance that even the substantial gains reported in the literature underestimate its effect, because most of the research is lab-based, and may not properly account for the scale and effect of real-world intensive practice in some environments. Tutoring for entrance exams is taken very seriously by a lot of very smart people, and it is big business. </p><p>Many students will be preparing for their entrance exam 18 months or 2 years in advance, and will be doing several hours of practice every week. The question is, would you rather that prep is on shape rotation? 
Or would you rather students were reading interesting books and doing maths problems?</p><p>It&#8217;s also worth remembering that the original impulse for introducing tests like this was the social justice aspect &#8211; that schools wanted to find a way of identifying talented but disadvantaged students. But once a non-verbal test becomes a target, it is going to discriminate against those students too, as you are much less likely to get any practice of those tests in a typical state school &#8211; whereas you will be taught reading, writing and maths. The worst-case outcome is that the non-verbal test is as socially exclusionary as Common Entrance, just with none of its educational benefits.</p><p>When you stop and think about it, the concept of the tutor-proof test does not really hold water. Of course you get better at something if you practice it. That is a good thing, and that is why education works! The whole point of education is to practice valuable things and get better at the valuable things. A good assessment should promote practice of the valuable things. It shouldn&#8217;t remove the valuable things and replace them with less valuable things, on the grounds that some students will get more practice of the valuable things. </p><p>Which brings us to another popular concept: we should create &#8220;tests that are worth teaching to&#8221;. Is this a better guide to assessment design? We&#8217;ll discuss that in a future post. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Does a university education help you earn more?]]></title><description><![CDATA[Maybe, but not in the way you think]]></description><link>https://substack.nomoremarking.com/p/does-a-university-education-help</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/does-a-university-education-help</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sun, 15 Feb 2026 08:45:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7671339f-dfcb-40a7-a0da-d0477101a276_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the last couple of weeks in the UK, there&#8217;s been a lot of controversy about student loans. Graduates have been posting their student loan balance statements on social media showing some <a href="https://substack.com/home/post/p-186676857">punitive interest rates</a>.</p><p>That has spilled over into a wider debate about the economic value of university itself. If it is so expensive, is it worth it economically? </p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/does-a-university-education-help?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/does-a-university-education-help?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/does-a-university-education-help?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p><strong>What is the graduate premium?</strong></p><p>The &#8220;graduate premium&#8221; refers to the fact that university graduates earn more than non-graduates, and is often used as a reason why a) students should pay for the costs of their university education and b) we should try to get as many school-leavers as possible to go to university.</p><p>A popular critique of this argument is that whilst graduates earn more on average, the average is misleading: a small number of degrees command large premiums, whereas some degrees offer no premium at all. This critique is <a href="https://frasernelson.substack.com/p/how-much-do-graduates-really-earn">summed up well by Fraser Nelson</a>, who argues that as a result the government should release national data on earnings by subject and institution so school-leavers can make more informed choices about where to study. For example, Glasgow history grads earn only 30k five years after graduating, whereas LSE history grads earn 50k. LSE therefore offers &#8220;absurdly good value&#8221; and should probably put its fees up.</p><p>However, even this more nuanced take on the graduate premium misses something crucial. Discussion about the graduate premium assumes that it is caused by going to university. 
The implicit reasoning is that you go to university, you acquire knowledge and skills you wouldn&#8217;t have got otherwise, and these make you a more productive worker who can therefore command a higher salary. </p><p>But graduate income data does not prove this causal chain. Yes, it is true that people who go to university earn more, on average, than people who don&#8217;t. But people who wear Rolex watches earn more, on average, than people who don&#8217;t. No-one is proposing a &#8220;Rolex Premium&#8221; whereby every school-leaver is encouraged to take out an expensive loan to fund the purchase of a Rolex watch, on the grounds that it will lead to them having much greater lifetime earnings. </p><p>Likewise, the way to critique that argument is not to say &#8220;Well yes, the Rolex Glasgow model doesn&#8217;t help you earn much, but the Rolex LSE model leads to really high future earnings, so we should encourage school-leavers to buy the Rolex LSE - and in fact, Rolex should charge even more for it because it will pay for itself over time!&#8221;</p><p>To prove the graduate premium is more than just a Rolex premium, we need some causal evidence that it is caused by skills acquired at university.</p><p><strong>Human capital vs signalling</strong></p><p>There is an extensive academic debate about this: <a href="https://www.aeaweb.org/articles?id=10.1257/jep.9.4.133">the human capital vs signalling debate.</a></p><ul><li><p>The human capital side says that the graduate premium is caused by the knowledge and skills universities impart.</p></li><li><p>The signalling side says that the graduate premium is caused by the signal that is sent by a degree.  Universities select their students based on prior attainment. Employers use degrees as a cheap (for them!) way to select employees who are already smart. They are not that bothered about what the student learns at university. 
</p></li></ul><p>This debate obviously has enormous implications for public policy.</p><ul><li><p>If it turns out the returns to a degree are mostly due to human capital, then we should definitely be aiming to get 50% of school-leavers to university, and arguably we should be aiming for an even bigger proportion.  If university really does reliably impart skills and knowledge that reliably increase your lifetime earnings, then expanding access is economically rational.</p></li><li><p>If it turns out the returns to a degree are mostly due to signalling, then we are wasting gigantic sums of private and public money. We could essentially replace degrees with some kind of basic test taken at age 18 and that would provide employers with what they are currently getting with hugely expensive three-year degree courses.  </p></li></ul><p>So if there is a huge debate, what is the consensus about which side is right? What does the data say? </p><p>It&#8217;s a difficult question to answer because degrees are used as a filter for a lot of well-paid jobs. One way you could research this question is to compare two cohorts of school-leavers with exactly the same A-levels and prior attainment. One cohort goes to university; one doesn&#8217;t. If the university grads do better in the job market, that suggests the university imparts valuable skills and knowledge, and therefore is evidence for the human capital theory. </p><p>But of course, that won&#8217;t work, because lots of jobs are restricted to graduates. Maybe the non-grads would have done perfectly well at them, but they never get a chance.</p><p>Another obvious way you could measure the impact of university is to directly measure the skills and knowledge it imparts by assessing students and seeing what they have learned. 
This is what happens at school, and this is one of the reasons why we have good evidence that <a href="https://www.proquest.com/openview/239877c0f1c4563fc61e5fc39b94840f/1?pq-origsite=gscholar&amp;cbl=54479">schools do succeed</a> at <a href="https://d1wqtxts1xzle7.cloudfront.net/86280672/1468-0335.0027320220522-1-1j00x79-libre.pdf?1653211521=&amp;response-content-disposition=inline%3B+filename%3DThe_Return_on_Post_Compulsory_School_Mat.pdf&amp;Expires=1771023744&amp;Signature=XXAsV60gsL1oIUVMcCrtQj3WOtewiWQd4bwuUtOhGeb12gB~dXVa8gpRf18fI63ofFGwkSZU9WTBxknhexn4vN8pyA9rODXxdPM4PPb-M6T5e~t5lxIkL~Jo15m5bnuq-lgwRd7QMJRkczgwcuJozGbHT3enq-~V~pFIgn8x8MxhT3xRk31rA1EI0dME-VECWcpGjTM9JiF5IrnRwFbT6A3BVrSmyJVAnbFRZWoAm16~qHPFB-MPkIlkNSoysSQM6cHBXh1cM-l-WJLVgn8J9hgw6ccQUqtlDBjiFgAAxA4yQvjDSWSMj1cDfg7weAWsjKcQ~UTPpd3~f85Gu5GOeA__&amp;Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA">teaching skills</a> that are <a href="https://files.eric.ed.gov/fulltext/ED406585.pdf">valuable in the job market</a>, and that some schools are more successful than others. But most academic degree courses don&#8217;t feature any kind of nationally-standardised assessment that could be used for this purpose. </p><p>As a result, a lot of the research in this area is hugely complex and ultimately quite inconclusive. </p><p>That in itself is quite striking though &#8211; given how embedded the human capital theory is, and how much it governs the public debate and policymaking in this area, you would expect there to be some quite solid evidence in its favour!</p><p> <strong>Assessment data is limited, but it is less limited than earnings data!</strong></p><p>A lot of criticisms are made of educational assessment &#8211; some of them justified. Can assessment truly capture everything we value about education? Does it distort the thing it is trying to measure? Does it lead to the things that can&#8217;t be assessed being neglected?  
We write about a lot of these themes on this Substack!</p><p>Still, even if you are sceptical of assessment, you have surely got to admit that even a basic standardised assessment is a better way of measuring the impact of universities than later earnings.</p><p>Earnings data doesn&#8217;t capture the value of learning for its own sake. It undervalues low-paid but socially vital jobs. It is often not a measure of productivity because a lot of public sector salaries are set by government. Similarly, it doesn&#8217;t account for regional differences in pay (this is probably a significant factor in why Glasgow history grads earn less than LSE ones). It might also lead universities to make bad decisions about which courses to offer or which type of students to recruit. </p><p>There is an analogy with health targets, which have their flaws but still tell you something useful. How would you rather measure the success of a hospital &#8211; by its infection rate, by the proportion of patients treated at A &amp; E within four hours, by the numbers of beds in corridors? Or by how much its patients earned five years after being treated there?</p><p><strong>Standardised assessments at university</strong></p><p>Andreas Schleicher is the Director for Education and Skills at the OECD, who run PISA, the international school assessment. He has <a href="https://www.hepi.ac.uk/wp-content/uploads/2016/01/Andreas-Schleicher-lecture.pdf">pointed out that</a> &#8220;there was a time when people looked to universities to judge the quality of education. Today, it is the other way around: the public want better information on the quality of universities.&#8221;</p><p>Given the scale of public subsidy and private debt involved, why not make some form of standardised assessment compulsory for all degrees?  
This could take a variety of different formats: maybe certain subjects all have to have a couple of shared &#8220;national&#8221; exam modules covering the content that every university will teach. Or maybe students in every essay-based subject have to take a compulsory writing module and exam &#8211; which might also help assuage concerns about the impact of AI on the traditional take-home essay. </p><p>You could even assess it with <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">Comparative Judgement</a>&#8230;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI assessments for KS3 English Literature: a case study]]></title><description><![CDATA[Making English marking quicker than maths...]]></description><link>https://substack.nomoremarking.com/p/ai-assessments-for-ks3-english-literature</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/ai-assessments-for-ks3-english-literature</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 07 Feb 2026 08:45:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92eaeb1d-7b03-4fcc-b3df-23d0d9a9c13c_1536x1024.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, I spoke with Phill Chater from Landau Forte Academy about how he is using our <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">AI-enhanced Comparative Judgement system</a> to assess his school&#8217;s Key Stage 3 English Literature essays.</p><p>Here&#8217;s a summary of his approach, organised by the three features we use to evaluate all our assessments: reliability, efficiency and validity.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/ai-assessments-for-ks3-english-literature?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/ai-assessments-for-ks3-english-literature?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/ai-assessments-for-ks3-english-literature?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p><strong>Reliability</strong></p><p>Phill set up his English Literature assessments as custom tasks that are bespoke to his school, which means they are not nationally standardised. Our AI-enhanced Comparative Judgement system gives you a scaled score, and then you can apply the grade boundaries on top. To gather evidence of how reliable the AI judging was, he got the AI to judge each year group twice, to see if it came up with the same results each time.</p><p>This is a very sensible approach to take. 
If the AI came back with a completely different rank order of students each time, you would have very little faith in its outputs. Interestingly, you can do similar checks on human marking, and the results are often quite underwhelming. Ofqual&#8217;s <a href="https://assets.publishing.service.gov.uk/media/5bfbfd70e5274a0fb775cca3/Marking_consistency_metrics_-_an_update_-_FINAL64492.pdf">marking reliability studies in 2017</a> found English Literature had the worst marking reliability of any subject, with candidates only likely to get their true grade 52% of the time compared to Maths where the probability is 96%.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> If we built an AI marking model with such poor consistency, we would not let it see the light of day!</p><p>However, when you are assessing essays, you also probably don&#8217;t want there to be perfect consistency between one iteration of marking and another. This would suggest that your markers &#8211; whether AI or human &#8211; are too deterministic, and are judging on surface features which make the model easy to game. For example, some of the very earliest AI marking models delivered incredibly good agreement between iterations, but on closer investigation this was because they were largely judging on surface features like length.</p><p>One of the advantages that LLMs have over these older models &#8211; and why it is worth persisting with them despite their other flaws &#8211; is that they are not making completely deterministic judgements. A major focus of our research has been building an LLM-powered model that gets this balance right, and validating its outputs. 
You can read the extensive work we&#8217;ve done on this <a href="https://substack.nomoremarking.com/p/superintelligent-judges?utm_source=publication-search">here</a>, <a href="https://substack.nomoremarking.com/p/the-human-in-the-loop">here</a>, <a href="https://substack.nomoremarking.com/p/so-can-ai-assess-writing">here</a> and <a href="https://substack.nomoremarking.com/p/what-do-you-prefer-human-error-or">here</a>.</p><p>We were extremely encouraged by the results of Phill&#8217;s assessment: the correlations between each iteration of the AI judging ranged from 0.91 to 0.96, which feels about right. Too low would suggest issues with the transcriptions, hallucinations, order bias and the world of other woes we see with LLMs. Too high and we&#8217;ve built a model that is likely over-deterministic and consistently wrong! There is always further work you can do on validation, and we will update this Substack with more data when we have it.</p><p>The other reassuring aspect of this assessment is that it was on literature. Most of our national AI assessments so far have been assessments of writing. AI can be quite brittle, so there is no guarantee that if it works for writing it will work for literature. Assessments that require specific content knowledge pose an extra challenge, so it was good for us to learn that the scores were sensible and in line with expectations. Phill created a fairly holistic mark scheme to guide the AI, broadly in line with the advice we give <a href="https://substack.nomoremarking.com/p/how-to-write-a-good-rubric-for-humans">here</a> about not being too prescriptive.</p><p><strong>Efficiency</strong></p><p>This is obviously one of the major benefits of adding AI judges. They can whiz through assessments <a href="https://help.nomoremarking.com/en/article/how-long-does-it-take-to-assess-one-classs-essays-using-comparative-judgement-lrxp0m/">very quickly</a>, and they did so in this case. 
Phill made a particularly telling point about the speed of assessment. Previously, English teachers would typically be marking right up to the deadline for big assessments, while the maths department would often finish earlier in the window. This time, for the first time Phill could recollect, the English teachers finished before the maths teachers&#8212;something that, in my experience, is unheard of.</p><p>But obviously, every solution contains within it the seeds of a new problem. One of the fears people have about English teachers spending less time marking is that they will not understand their students as well. But Phill&#8217;s comparison with maths assessment is instructive. Maths teachers spend less time marking, but they understand their students just as well. It&#8217;s just that the nature of maths marking is such that it can deliver equivalent levels of understanding in less time. There has to be a way that we could imagine this working for English: teachers will spend less time than they do currently on marking, but will get equivalent levels of understanding.</p><p>For me, there is definitely value in teachers reading students&#8217; writing, but there is less value in the time spent painstakingly writing out comments by hand. <a href="https://substack.nomoremarking.com/p/bringing-our-feedback-philosophy">Our feedback systems</a> are designed to maintain the high-value thought processes and eliminate or reduce the lower-value ones. However, we are also aware that teachers and students will use our feedback in different ways, and we want to learn more about what is most effective. Which brings us to the final section&#8230;.</p><p><strong>Validity &amp; Feedback</strong></p><p>Efficiency can enable better feedback in two ways. First, Phill said the faster turnaround time enabled them to dedicate an entire lesson to feedback. 
Second, the quicker you get the feedback the more relevant it is.</p><p>The first part of the feedback lesson had students working with a model essay that had been selected previously, pre-AI, so this portion wasn&#8217;t dependent on the AI at all.</p><p>In the second part of the lesson, teachers gave students the direct AI feedback and asked them to summarise their areas for improvement based on the AI and the model essay. </p><p>Phill felt the feedback was &#8220;uncannily accurate&#8221;, but he did identify a couple of areas where we could improve our feedback, and we have some ideas about how to address them. The direct AI feedback is currently a bit too verbose and maybe a bit too harsh too. We are working on making it nicer!</p><p>The other challenge is making the feedback actionable. This is tricky because the more specific you get, the more risk there is of AI hallucination and errors. We&#8217;re currently developing an approach that Phill hasn&#8217;t been able to use yet, but we hope to roll out for everyone soon: <a href="https://substack.nomoremarking.com/p/how-do-you-know-your-feedback-is">getting the AI to create personalised multiple-choice questions for each student</a>.</p><p><strong>Conclusion</strong></p><p>This is all new territory. Phill is a pioneer who is applying new technology to existing practice in exciting ways, and both his practice and ours will continue to evolve. 
But there&#8217;s enough here to suggest that significant reductions in workload <strong>and</strong> improvements in feedback are possible.</p><ul><li><p><em>If you&#8217;d like to try out an AI-enhanced Comparative Judgement assessment, join our <a href="https://www.nomoremarking.com/events">webinar</a> on Wednesday 25th February where we will give all attendees 30 free AI assessment credits.</em></p></li><li><p><em>If you have an idea for a case study, let us know <a href="https://go.crisp.chat/chat/embed/?website_id=c8a23a97-02b4-4c59-8012-a7acfc05d267">here</a>.</em> </p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The true grade is a theoretical concept that estimates what a candidate would achieve if they took an assessment an infinite number of times. 
Of course it doesn&#8217;t measure if the assessment or the marking are aligned with the curriculum or mark scheme, which is why we tend to favour reporting <a href="https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects">broader measures of validity such as correlations between assessments over time</a>. Nonetheless, without high reliability there can be no validity.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[What would Mr Toad make of school phone bans?]]></title><description><![CDATA[Why phones are more like cars than cigarettes]]></description><link>https://substack.nomoremarking.com/p/what-would-mr-toad-make-of-school</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/what-would-mr-toad-make-of-school</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sun, 01 Feb 2026 08:45:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e416ba03-409f-4ad4-ac40-040c8f729f57_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the last 12 months or so, there has been a rapid, almost palpable, change in attitudes to children and technology. 
A number of anti-phone pressure groups like Smartphone Free Childhood have sprung up, while many countries are starting to legislate for various kinds of phone bans: <a href="https://en.wikipedia.org/wiki/Online_Safety_Amendment">Australia banned under-16s from social media in December 2025</a>, <a href="https://www.connexionfrance.com/news/french-mps-back-plan-to-ban-under-15s-from-social-media/766072">France is moving to ban social media for under-15s</a>, <a href="https://en.wikipedia.org/wiki/Social_media_age_verification_laws_by_country">Denmark announced plans to ban under-15s in November 2025</a>, <a href="https://studyinternational.com/news/countries-social-media-ban-children/">Norway raised its age limit from 13 to 15</a>, and <a href="https://www.wionews.com/trending/no-social-media-for-children-france-passes-bill-to-restrict-under-15s-know-the-countries-that-have-banned-platforms-for-kids-1769518456388">Malaysia announced a ban for under-16s coming in July 2026</a>. In the UK, <a href="https://www.itv.com/news/2026-01-19/government-to-hold-consultation-on-social-media-ban-for-under-16s">the House of Lords voted in January 2026 for an amendment</a> to ban under-16s from social media, and the government has launched a <a href="https://www.gov.uk/government/news/government-to-drive-action-to-improve-childrens-relationship-with-mobile-phones-and-social-media">consultation on the issue</a>.</p><p>I am supportive of these moves, but I have also been somewhat surprised by the speed of change. 
I&#8217;ve been consistently anti-phones in the classroom for well over a decade now, and I&#8217;ve become used to having polite disagreements with people on the other side of the debate&#8212;which, until recently, was most people.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/what-would-mr-toad-make-of-school?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/what-would-mr-toad-make-of-school?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/what-would-mr-toad-make-of-school?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Over the last few months, I&#8217;ve visited quite a few schools and have been astonished to find that there was no argument to be had. I would say the thing I have said for ten years, brace myself for the usual objections, and instead I&#8217;d hear &#8220;yes you&#8217;re totally right. 
We&#8217;ve had a phone ban for x months and I can&#8217;t believe how well it&#8217;s going.&#8221;</p><p><strong>How social change happens</strong></p><p>In the past, I have compared <a href="https://substack.nomoremarking.com/p/mobile-phones-and-the-right-side">attitudes to mobile phones in the classroom to attitudes to cigarettes.</a></p><p>I am constantly drawn to this analogy because the shift in attitudes to smoking occurred as I was in my late teens and early twenties and was probably the first time I realised that the social norms of my childhood were not permanent fixtures.</p><p>In the mid-90s, smoking in public places was normal and commonplace. I remember the first time someone suggested you might ban smoking in pubs, and it felt as crazy as suggesting you might ban drinking in pubs. People went to pubs to smoke! That was the point! But within a decade, a public smoking ban was in place and we were all wondering how we put up with smoky clothes for so long.</p><p><strong>But phones aren&#8217;t like cigarettes</strong></p><p>However, smoking is not the ideal analogy here, for a couple of reasons. Cigarettes have very few upsides, whereas mobile phones have lots. Cigarettes are also not that central to society, whereas if you got rid of all mobile phones in the world tomorrow, society and the economy would grind to a halt.</p><p>A better analogy&#8212;but an older one, which no one today has a memory of&#8212;is the invention of the automobile. Like phones, cars very quickly established themselves as vital and irreplaceable. They also had terrible downsides, and precipitated a culture war which in some ways is still smouldering today. 
If we go back to that moment in time, we can learn a lot about the way forward for the use of technology in education.</p><p><strong>A literary-historical interlude: cars in the early 20<sup>th</sup> century imagination</strong></p><p>The light fiction of the early 20th century is littered with the impact of the automobile.</p><p>One of the funniest and most famous is Mr Toad, from Kenneth Grahame&#8217;s <em>The Wind in the Willows</em> (1908). After encountering his first car on a sleepy country lane, he is entranced and can think of nothing else.</p><p><em>They found him in a sort of a trance, a happy smile on his face, his eyes still fixed on the dusty wake of their destroyer. At intervals he was still heard to murmur &#8220;Poop-poop!&#8221;</em></p><p><em>&#8220;Glorious, stirring sight!&#8221; murmured Toad. &#8220;The poetry of motion! The real way to travel! The only way to travel! Here to-day&#8212;in next week to-morrow! Villages skipped, towns and cities jumped&#8212;always somebody else&#8217;s horizon! O bliss! O poop-poop! O my! O my!&#8221;</em></p><p>Before long, he steals a car and goes on a joyride.</p><p><em>He increased his pace, and as the car devoured the street and leapt forth on the high road through the open country, he was only conscious that he was Toad once more, Toad at his best and highest, Toad the terror, the traffic-queller, the Lord of the lone trail, before whom all must give way or be smitten into nothingness and everlasting night. He chanted as he flew, and the car responded with sonorous drone; the miles were eaten up under him as he sped he knew not whither, fulfilling his instincts, living his hour, reckless of what might come to him.</em></p><p>If Mr Toad were alive today, he&#8217;d be running the <strong>MrToadLambo</strong> Youtube account, full of viral livestreams of him in car chases on the M25. 
He&#8217;d have a memecoin called PoopPoop and on X, he&#8217;d complain about the &#8220;legacy mindset&#8221; of speed limits.</p><p>The Mr Toad-style roadhog was not an isolated figure. Thirty-one years later, Agatha Christie wrote one of the best-selling books of all time: <em>And Then There Were None</em>. The premise of the book is that some people have committed acts which are not legally crimes, but which are morally criminal and therefore deserve punishment. One of them is a young man, Anthony Marston, who has killed two young siblings while driving recklessly:</p><p><em>&#8216;Of course it was a pure accident. They rushed out of some cottage or other. I had my licence suspended for a year. Beastly nuisance.&#8217;</em></p><p><em>Dr Armstrong said warmly: &#8216;This speeding&#8217;s all wrong&#8212;all wrong! Young men like you are a danger to the community.&#8217;</em></p><p><em>Anthony shrugged his shoulders. He said: &#8216;Speed&#8217;s come to stay. English roads are hopeless, of course. Can&#8217;t get up a decent pace on them.&#8217;</em></p><p>The book was published in the very early months of World War II, and I think there is an obvious political undercurrent. The Nazis were obsessed with youth, speed, and technological progress, and Hitler had made new roads and new cars symbols of his regime.</p><p>You can also see clear parallels with debates about social media and mobile phones today. The pro-car lobby, which disproportionately consisted of young men, felt that their opponents were creating a moral panic that turned commonplace everyday accidents into existential threats. The anti-car lobby was more middle-aged and female, and they thought their opponents were proto-fascists intent on destroying the lives of poor children.</p><p>In Christie&#8217;s autobiography, she wrote about her own experience of car ownership. She bought her first car at a time when there was no driving test. 
She could barely drive when, in 1926, she had to drive her husband to work because of the General Strike. She made it back from Hounslow to Sunningdale (about 15 miles!!) just about in one piece, but a neighbour who saw her parking up said &#8216;I saw the first floor driving back this morning. I don&#8217;t think she has ever driven a car before. She drove into that garage absolutely shaking and as white as a sheet. I thought she was going to ram the wall, but she just didn&#8217;t!&#8217;</p><p>But Christie also goes on to say that once she had learnt to drive, the experience gave her enormous pleasure:</p><p><em>Oh the joy that car was to me! I don&#8217;t suppose anyone nowadays could believe the difference it made to one&#8217;s life. To be able to go anywhere you chose; to places beyond the reach of your legs&#8212;it widened your whole horizon.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></em></p><p>We are now fully in the doomscrolling era of the internet, but it is worth remembering that it was just as horizon-expanding and liberating in its early days as the car. And, similarly, as much as traffic and ragebait might annoy me, I do not want to live in a world without cars or mobile phones. They are both vital parts of the modern world. The task is to make them work to serve our aims.</p><p>With that in mind, here are five lessons we can learn from the early automobile debate.</p><ol><li><p><strong>Social norms matter just as much as legislation</strong></p></li></ol><p>One of the fascinating things about <em>And Then There Were None</em> is its focus on acts that were legal but frowned upon, acts where the social norm was in the process of shifting. It is astonishing for us to read it now and realise that the punishment for killing two children while speeding was just a year-long driving ban. 
But it is very hard for governments to legislate when the social norm is against it. Laws cannot get that far in front of or behind public opinion. If you had tried to implement a smoking ban in the 1950s, you would probably have had mass civil disobedience. Likewise, whilst I&#8217;ve been in favour of school-level phone bans for a while, I&#8217;ve recognised that until recently it would have been exceptionally hard for a government to legislate for one, because not enough parents, students and teachers thought it was necessary, and you&#8217;d have had mass evasion of the law.</p><p>I think the time is now right for legislation. And the reason why we need a ban, and we can&#8217;t just depend on social norms changing, is that this is a clear example of a collective action problem: teenagers and their parents tell us that they would like to use their phones less, or give up social media, but they don&#8217;t want to be the only ones!</p><p>These kinds of co-ordination problems are the places where there is a strong case for government intervention, and where that intervention adds value over and above self-regulation.</p><p><strong>2. Some kinds of regulation are fundamental and inevitable</strong></p><p>Driving tests were one of the least controversial aspects of early automobile regulation. It&#8217;s arguable that in the modern state, the state monopoly of driving testing &amp; licensing is one of its most fundamental functions (which is just one of the reasons why the breakdown of the UK government&#8217;s testing system is a really big problem).</p><p>Similarly, another pretty fundamental and uncontroversial aspect of the modern state is its enforcement of age norms. You can argue about what age they should kick in, and where they should apply, but pretty much every state in the world provides children with special protections and restrictions. 
We frequently restrict children&#8217;s liberty, to the extent that most serious classical liberal and libertarian philosophers spend a lot of time thinking about why this is. JS Mill&#8217;s <em>On Liberty</em> is a great example - a book about liberty that spends large chunks discussing the education of the young. </p><p><strong>3. Early regulation can get it wrong</strong></p><p>Not all regulations are good regulations. The Red Flag Act of 1865 required early automobiles to be preceded by a man on foot carrying a red flag. </p><p>I can think of a lot of current internet regulations that are not working brilliantly. Cookie consent warnings seem to be security theatre that causes a lot of hassle but doesn&#8217;t really address the big problems. </p><p>The Online Safety Act is a major piece of UK legislation that aims to protect children from the downsides of the internet. Critics say it is poorly drafted and will have a lot of negative unintended consequences. We will soon see who is right. </p><p><strong>4. You can be pro- and anti-technology</strong></p><p>Modern Germany has an extensive motorway network, much of it with no speed limit. It also has medieval town centres that are car-free. These are not contradictory. Likewise, it is possible to believe that schools should make use of a lot more technology in a lot of ways, whilst remaining largely screen-free for students.</p><p><strong>5. Technology can mitigate technology</strong></p><p>New technology will always cause problems. Most of the time, instead of getting rid of the technology, we prefer to use more and different technology to mitigate the problem. Seatbelts, airbags, anti-lock brakes and sat-nav are all examples of technology that&#8217;s designed to mitigate the negative impacts of cars.</p><p>I think this approach is the right one for education too. One technology we&#8217;re excited about at No More Marking is <a href="https://substack.nomoremarking.com/p/can-ai-solve-handwriting-bias">handwriting recognition</a>. 
Before LLMs came along, handwriting recognition was a stubborn and seemingly intractable problem. LLMs have largely - although not completely - solved it. It is now possible to get instant and mostly accurate transcriptions of student writing, which in turn makes <a href="https://substack.nomoremarking.com/p/paper-and-on-screen-assessments">screen-free classrooms much more viable</a>.  Off-the-shelf LLMs are still not perfect though, and there is room for improvement, which is why we are working on optimising an open-source LLM to recognise handwriting with an even higher degree of accuracy.</p><p><strong>And Then There Were Norms</strong></p><p>The other big lesson from the car debate is that some of these debates never go away and are never truly resolved. Cars and phones are fundamental to modern society, and anything so fundamental will inevitably provoke conflict. The norms might change, but the arguments will remain. </p>
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Christie was not the only mega bestselling author of the 20th century who had trouble with cars. JRR Tolkien bought a car at around the same time as Christie and also seems to have struggled to learn to drive. He also wrote a book inspired by his misadventures, called Mr Bliss, but although it was written in 1932, it wasn&#8217;t published until 1982. Unlike Christie, he gave up driving at the start of the war and deplored the impact automobiles had on the Oxfordshire countryside.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How to write a good rubric (for humans and AI)]]></title><description><![CDATA[Don't be like Jose Mourinho]]></description><link>https://substack.nomoremarking.com/p/how-to-write-a-good-rubric-for-humans</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/how-to-write-a-good-rubric-for-humans</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 24 Jan 2026 08:45:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3fc904b8-0ba0-4fa4-bf26-780dec000acc_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last term, we ran <a href="https://blog.nomoremarking.com/cj-history-results-646a7edf1ea7">a national history assessment</a> on the topic of the Battle of Hastings. 
The essays were judged by a mix of human and AI judges, and we saw pretty good levels of agreement between the humans and AI, and barely any glaring AI errors. </p><p>Still, that does not stop our teachers - and us - asking a set of questions about how  the AI makes its decisions. What does it value? Does it value historical accuracy? Does it notice when claims are false? Does it base its judgements on fluency of writing or quality of historical analysis? What weight does it give to the various aspects of a good essay? And - a question we are getting more and more - what kind of rubric or guidance should we give the AI to help it make the best decisions?</p><p>To explore this, we ran a small experiment.</p><p>We selected a sample of the c. 4,000 essays and got an AI to isolate every truth claim in every essay and then to assess whether each claim was true or not. </p>
<p>This is not as simple as it sounds, and an essay about the Battle of Hastings is probably not the best test case for this approach, because there is a lot about it that is genuinely uncertain: did Harold really die after being hit in the eye by an arrow? Exactly how long did it take Harold and his men to march from Stamford Bridge to Hastings? </p><p>Still, there are plenty of known facts about the Battle, and generally speaking the AI was good at spotting these and assigning a truth score for each essay. We reviewed these &#8220;truth scores&#8221; and felt they were broadly correct. </p><p>Next, we looked at whether these truth scores correlated with the scaled score given to each essay. We expected a weak positive correlation. What we actually found was a negative correlation.</p><p><em>In other words, essays that contained more false statements were, on average, getting better scores!</em></p><p>What on earth is going on?!</p><p><strong>This is not about AI</strong></p><p>We don&#8217;t think this problem is an AI problem. We&#8217;ve seen something similar in the past with writing rubrics long before we used AI in our assessments. 
(In fact, <a href="https://daisychristodoulou.com/2016/05/best-fit-is-not-the-problem/">I wrote an article about a similar issue with writing assessments</a> almost ten years ago, before I started working at No More Marking, and before chatbots existed!)</p><p>Essentially, when you give pupils an extended writing task, the more they write and the more ambitious they are, the more chances there are for them to make errors. The very best and most creative responses can therefore have more errors than weaker responses. </p><p>Here is a great example from the history assessment. </p><p>Script A is the first part of one of the highest-scoring essays. Script B is the entirety of one of the lowest-scoring essays. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pa-d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pa-d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 424w, https://substackcdn.com/image/fetch/$s_!Pa-d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 848w, https://substackcdn.com/image/fetch/$s_!Pa-d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Pa-d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pa-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3601627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/185430249?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pa-d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 424w, https://substackcdn.com/image/fetch/$s_!Pa-d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 848w, 
https://substackcdn.com/image/fetch/$s_!Pa-d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 1272w, https://substackcdn.com/image/fetch/$s_!Pa-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9f5379-72ed-44d1-b0ef-a8219ca4a1a5_2864x1608.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Script A has a straightforward factual error in its first sentence: the Battle of Hastings was not fought on the 14th February 1066 
but the 14th October.  The second paragraph also has some factual issues: it says &#8220;Harold Godwinson&#8217;s army was mostly dead, wounded or incredibly tired from the battle and the journey, making their army much smaller and easier to defeat&#8221;. This is all a bit more arguable, but &#8220;mostly&#8221; is probably too strong here, and &#8220;much smaller&#8221; depends on what your reference point is: it was probably smaller than it would have been if there had been no Battle of Stamford Bridge, but it also probably wasn&#8217;t much smaller than William&#8217;s army. These were all flagged up by our AI truth checker.</p><p>Script B has no factual errors at all. Our AI truth checker judged it all to be correct, apart from the final unfinished sentence which was &#8220;unverifiable&#8221;. </p><p>However, when the scripts were assessed as part of our national history project, both the teachers and the AI thought that Script A was better than Script B. I agree and I cannot believe that any history teacher would seriously argue that B is better than A.</p><p><strong>Cardinal Richelieu and Jose Mourinho</strong></p><p>There&#8217;s a famous line alleged to have been said by Cardinal Richelieu: &#8220;Give me six lines written by the hand of the most honest of men, and I will find something in them to hang him.&#8221;</p><p>Is that the message we want to send to our children?</p><p>The football manager Jos&#233; Mourinho has a touch of the Cardinal Richelieus about him. 
<a href="https://www.theguardian.com/football/blog/2015/apr/23/jose-mourinho-the-anti-barcelona-chelsea-pep-guardiola">According to a biographer</a>, in his later managerial career he developed an uncompromisingly cynical approach to football tactics, which included the following principles.</p><ul><li><p>Whoever has the ball is more likely to make a mistake.</p></li><li><p>Whoever renounces possession reduces the possibility of making a mistake.</p></li><li><p>Whoever has the ball has fear.</p></li><li><p>Whoever does not have it is thereby stronger.</p></li></ul><p>Interestingly, in Mourinho&#8217;s case, this strategy was not hugely successful. His biggest successes came before he developed this approach, because in football, success is not measured by who makes the fewest mistakes, but by who scores the most goals. </p><p>If you push these strategies to their ultimate limit they become entirely self-defeating. You end up with football teams trying not to play football and writing lessons that are about avoiding writing. Ultimately, if you want to win football games you have to try and play some football. If you want to be a good writer you have to write something. Neither writing nor football is an exercise in trying not to make mistakes. </p><p><strong>So does this mean factual accuracy doesn&#8217;t matter?</strong></p><p>I think people are really surprised when I make this argument, because I have basically <a href="https://substack.nomoremarking.com/p/skills-vs-knowledge-13-years-on">made a career</a> out of saying factual accuracy is important. </p><p>And I haven&#8217;t changed my mind. I still think factual accuracy is supremely important, I still think that historical understanding is built on accuracy, and I still think we should teach students facts and get them to memorise them.</p><p>My objection is not to teaching &amp; assessing facts. My objection is to <em><strong>using essays to assess facts.</strong></em> That&#8217;s for two reasons. 
</p><ol><li><p><strong>Essays are not designed to test factual accuracy</strong></p></li></ol><p>An essay is an open-ended task, which means that students have some freedom in how they respond to it. This means that students will essentially set themselves different tasks. Some students will choose to mention the date of the Battle of Hastings, and some won&#8217;t. If you have a very strict rubric that insists on factual accuracy, then the student who chooses to mention the date and gets it wrong is penalised. The student who chooses not to mention it is not penalised, even though they may not know the date either!</p><p>So the essay is basically a terrible way of telling if a student knows when the Battle of Hastings happened. The right way to assess this is with a simple short answer or multiple-choice question, where every student is given the same question and there is one clear right answer. </p><p>Short answer questions and multiple choice questions are often seen as being too simplistic or basic, but they are really powerful tools. Essays and MCQs are complementary - like the two wing mirrors on a car. They give you different views of the same reality. Don&#8217;t make your essay responsible for incentivising and measuring factual recall. Set an accompanying quiz, and let that do the job instead.  (We have <a href="https://help.nomoremarking.com/en/article/what-is-automark-h3lw10/">a nice system</a> that will do this for you!)</p><p>If you do that, I think then you <em><strong>would</strong></em> see a strong correlation between scores on the quiz and scores on the essay. 
In fact, when we have tried this with writing, we do see <a href="https://aus.nomoremarking.com/does-spelling-matter-23e051c4cf4">strong correlations between simple quizzes on spelling &amp; grammatical features</a>, and overall writing quality.</p><p>I can&#8217;t prove it, but I suspect if we gave students A and B ten questions on the facts about Hastings, student A would do better than student B.  </p><ol start="2"><li><p><strong>If you use essays to assess factual accuracy by creating a strict rubric, you will create </strong><em><strong>terrible</strong></em><strong> incentives</strong></p></li></ol><p>One of the things we saw - and still see - with very prescriptive writing rubrics is that you get awful second-order effects. Your rubric does not end up incentivising factual accuracy. It incentivises short and basic pieces of writing.  Once teachers and students know that factual and grammatical errors will be penalised heavily, they take the Richelieu / Mourinho approach and become very negative and defensive.</p><p><strong>What does this mean for rubric design?</strong></p><p>Whether you are using human or AI markers, you have to allow your markers some discretion. </p><p>Open-ended tasks give pupils discretion. If pupils have discretion, then markers must have discretion too. Otherwise, you create distortions.</p><p>So a principle for both humans and machines is as follows: Do not use prescriptive criteria to judge extended writing.</p><p>We&#8217;ve found that humans can judge accurately and consistently using just one incredibly holistic criterion: which is the better response?</p><p>We think the AI does need a bit more guidance than this, but it should still have latitude to make holistic judgements. We have a section on our website where you can paste in your criteria, and some advice on setting holistic criteria <a href="https://help.nomoremarking.com/en/article/ai-enhanced-custom-tasks-how-to-set-criteria-1e6sg9h/">here</a>. 
We will update this with more examples and advice as we trial assessments in different subjects. </p><p>You can trial different criteria and assessments yourself. You can purchase <a href="https://help.nomoremarking.com/en/article/ai-custom-tasks-our-newest-prodct-zwam10/">AI custom task credits on our website</a>, and we are also giving away 30 free AI credits to everyone who attends <a href="https://www.nomoremarking.com/events">our next introductory webinar on 25 February.</a></p><p>If you are already using custom tasks, let us know in the comments what criteria you&#8217;ve used. </p>]]></content:encoded></item><item><title><![CDATA[How do you know your feedback is working?]]></title><description><![CDATA[Rapid and large-scale evaluation of writing feedback]]></description><link>https://substack.nomoremarking.com/p/how-do-you-know-your-feedback-is</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/how-do-you-know-your-feedback-is</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 17 Jan 2026 09:45:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8903b825-29f2-4f5f-897a-efb5782bd5a9_1536x1024.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>One of the major <a href="https://substack.nomoremarking.com/p/blooms-famous-2-sigma-tutoring-paper">problems with a lot of classic education research papers</a> is that they are based on very small numbers of students. This means that if the paper does show a certain intervention is effective, it is entirely possible that it is the result of chance and not the intervention. </p><p>This problem is compounded when the interventions involve writing assessments, because traditional writing assessments are quite unreliable. Again, this adds yet more noise to the results.</p><p>We have a new assessment model which addresses both of these problems and makes it easy, quick, and reliable to evaluate the impact that feedback has on writing. </p><p>We trialled the new approach last year, and are running a bigger project in March this year for Year 6 students. </p><p><strong>Here is how it works.</strong></p><ul><li><p>Students take part in our established Year 6 writing assessment in March. 
We expect about 30,000 students will take part in this.</p></li><li><p>Schools will receive extensive feedback reports, with a mix of AI &amp; human feedback. They will share the reports with their students and can provide their own feedback too.</p></li><li><p>Students will then redraft their original piece of writing.</p></li><li><p>Schools can then submit this redrafted piece of work to be assessed again as part of a national assessment window. The scores of both the original and redrafted pieces of work will be on the same scale, allowing us to measure the impact of the feedback. </p></li><li><p>Both the original and redrafted writing will be assessed using our Comparative Judgement plus AI model. This is highly reliable and <a href="https://help.nomoremarking.com/en/article/how-long-does-it-take-to-assess-one-classs-essays-using-comparative-judgement-lrxp0m/">dramatically reduces the teacher workload</a>. </p></li></ul><p>We ran a project like this last year, but gave schools very short notice about the redraft. This meant that whilst approximately 36,000 students from 900 schools took part in the original assessment, only 3,851 from 73 schools took part in the follow up. This year we have given schools more notice, so we hope that we&#8217;ll get more taking part in the redraft. </p><p>The project is <em><strong>not</strong></em> a gold-standard randomised controlled trial, but it will still provide schools with rapid and useful information about how students respond to feedback. It would also be possible to use the same Comparative Judgement plus AI write-feedback-redraft model as part of an RCT. </p><p><strong>Improving the feedback</strong></p><p>We&#8217;re also planning a couple of changes to the feedback that students get. </p><p>Last year, we gave every student a set of five multiple-choice questions that were created by us - not AI. We created three sets of questions, and then split students into three groups based on their scaled score. 
Students in the lowest-scoring group got a set of questions on capital letters, students in the middle group got questions on run-on sentences, and students in the top-scoring group got questions on vocabulary. </p><p>This year, we will continue to allocate question sets by scaled score, but we are going to introduce a little bit of AI into the mix. </p><ul><li><p>Students in the lowest-scoring group will continue to receive a set of questions on capital letters. These questions will be created by us, but we will use AI to customise them slightly. We will make the content used in the questions match the content used in each individual student&#8217;s story. E.g. if the student has written about two children called Ilsa and Bob, the questions will mention Ilsa and Bob. </p></li><li><p>We&#8217;ll do something similar for the middle third of students. They&#8217;ll get a set of questions on run-on sentences, created by us but tweaked by AI to include the content of their story.</p></li><li><p>For students in the top third, we will be making a more substantial change. These students will get a set of questions entirely designed by AI. The questions will focus on more creative aspects of writing. </p></li></ul><p>We&#8217;re currently developing and trialling these new question types, and will shortly be emailing our participating schools to get their opinion on them. </p><p>If you are not currently a participating school but would like to be, you can join us! Read more about the project and how to take part <a href="https://help.nomoremarking.com/en/article/assessing-primary-writing-year-6-redraft-2026-1h8wbmz/">here</a>.</p><p><strong>Could this model work at a smaller scale?</strong></p><p>One of the big advantages of this model is the scale - thousands of participating students. However, we have had a lot of requests from schools who would like to try it out at a smaller scale, in their own school or class. 
Obviously you would not be able to generalise as much from a smaller scale, but we agree that it would be incredibly valuable for an individual school or class teacher to be able to get such rapid feedback on their interventions. We can also place these bespoke individual assessments onto our national scale by including anchor scripts from previous assessments, which means that even small assessments can have some of the benefits of scale. We are looking at ways that we can make this write-feedback-redraft cycle easy for an individual school or teacher to implement. Get in touch if this interests you.</p><p><strong>Further reading and information</strong></p><ul><li><p>We published a series of posts about last year&#8217;s project: the original <a href="https://substack.nomoremarking.com/p/dynamic-assessment-of-hard-to-measure">intro</a> post, our <a href="https://substack.nomoremarking.com/p/dynamic-writing-assessment-with-ai">trial school</a> results, the full set of <a href="https://medium.com/blog-nomoremarking-com/cj-dynamo-2024-25-year-6-results-b6ed2d3ad667">results</a>, a <a href="https://substack.nomoremarking.com/p/how-do-students-redraft-their-writing">qualitative analysis</a> of one school&#8217;s results.</p></li><li><p><a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">How Comparative Judgement plus AI works</a>.</p></li><li><p>This year&#8217;s <a href="https://www.nomoremarking.com/calendars">calendar</a> and <a href="https://help.nomoremarking.com/en/article/assessing-primary-writing-year-6-redraft-2026-1h8wbmz/">help page</a>.</p></li><li><p>A guide to all of our <a href="https://help.nomoremarking.com/en/article/new-feedback-reports-interactive-dashboard-9gi1t6/">feedback reports</a></p></li><li><p>Our <a href="https://www.nomoremarking.com/events">events</a> page - we have two online introductory webinars scheduled in the next six weeks. 
</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Maybe LLM tutors might be able to work...]]></title><description><![CDATA[The best study I have seen so far]]></description><link>https://substack.nomoremarking.com/p/maybe-llm-tutors-might-be-able-to</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/maybe-llm-tutors-might-be-able-to</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sun, 11 Jan 2026 10:20:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/725888ed-7660-4433-9868-896fab554a19_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In a post <a href="https://substack.nomoremarking.com/p/llm-tutors-are-they-any-good">last year</a>, I looked at some of the barriers to creating a good LLM tutor. In summary, here were the four challenges that LLM tutors need to overcome.</p><ol><li><p><strong>Questions not explanations.</strong> LLMs are very good at explanations, but explanations are over-rated as a means of learning. 
Sets of really good questions are better.</p></li><li><p><strong>Reducing hallucinations.</strong> Good questions have to be precise and accurate, and LLMs are not so great at precision and accuracy, because they hallucinate. </p></li><li><p><strong>Improving on pre-LLM technologies.</strong> An LLM tutor not only has to prove it is better than a human tutor, but also that it is better than pre-LLM technologies like textbooks and intelligent tutoring systems (ITS). These have zero or close to zero error rates.</p></li><li><p><strong>Providing structure and discipline.</strong> An LLM tutor has to find some way of replicating the structure and discipline of an in-person classroom, because students can&#8217;t learn everything from sitting on their own at a screen. </p></li></ol><p>Late last year, a new paper was published with the best answer I have seen to all four challenges. </p><p>The paper is the result of a collaboration between Google LearnLM and Eedi. 
The link is <a href="https://storage.googleapis.com/deepmind-media/LearnLM/learnLM_nov25.pdf">here</a> and you can read a summary of it <a href="https://eedi.substack.com/p/ai-empowers-the-teacher-so-they-can">here</a> by Craig Barton, Eedi&#8217;s head of education. I&#8217;ve known Craig for a long time (you can hear me on his podcast <a href="https://tipsforteachers.co.uk/daisy-christodoulou/">here</a>) and have always admired his work but I have no affiliation with Eedi or Google, so this is just my view as I see it from the outside.</p><p>Here is a brief summary of the study, as I understand it.</p><ul><li><p>The study took place in 5 UK secondary schools whose students were used to using the Eedi online learning platform as part of their maths lessons.</p></li><li><p>The students started an Eedi unit online in class as normal. When they got a question wrong, they were then able to start an online chat with a tutor. There were three conditions: (1) chat with a human tutor, (2) chat with an LLM tutor (whose messages were supervised by a human), (3) receive a pre-written static hint that was the same for everyone who got that question wrong.  </p></li><li><p>The effectiveness of each approach was measured in three ways. (1) The students were given the exact same question they got wrong at the start. (2) If they still got it wrong, they got to have a follow up chat and then two attempts at a new question on the exact same topic. (3) The students moved on to the next unit in the sequence and the study measured their success on the first question of that sequence.</p></li><li><p>Essentially, the students got fewest questions right when taught with the &#8220;static hint&#8221; approach. There wasn&#8217;t much difference between the human tutor and the LLM tutor. The humans who supervised the LLM didn&#8217;t have to make that many edits and were themselves impressed by the LLM&#8217;s responses. 
Crucially, the LLM made very few errors, and the paper lists them all in an appendix.</p></li></ul><p>So how does this study address my four challenges?</p><ul><li><p><strong>Questions not explanations.</strong> The student-LLM discussions were focussed on questions and answers. The LLM wasn&#8217;t just explaining a concept and assuming the student got it. It asked questions to check for understanding, and then, when the understanding wasn&#8217;t there, it was capable of recognising that and following up with other questions until the student did understand. And then of course the success of the intervention was immediately measured with the original question and a subsequent question.</p></li><li><p><strong>Reducing hallucinations.</strong> The most striking part of this study is that the LLM error rate was reduced to just 0.14% - just over one error every thousand messages. This is extremely impressive. The paper didn&#8217;t report what the human tutor error rate was, and more broadly we don&#8217;t really have reliable data on how often teachers make basic errors in class, but even highly skilled teachers will make errors from time to time. Does the average human teacher in a traditional class make one error for every thousand &#8220;messages&#8221; they speak? It&#8217;s not insane to think they might. </p></li><li><p><strong>Improving on pre-LLM technologies.</strong> A 0.14% error rate is good for an LLM or a human, but probably not as good as a textbook or an intelligent tutoring system (ITS), which are capable of close to zero errors, especially once they get into a second edition or version. However, this study specifically compares the LLM tutor performance with pre-written static hints, which in some ways are analogous to textbooks or ITSs, and the LLM tutor outperformed the static hint. I like the concept of pre-written static hints, and I think they are under-rated, but clearly they have their flaws. 
They are kind of similar to the customer service chat bots that give you a pre-loaded menu of options to choose from. A lot of the time, the pre-loaded options don&#8217;t address your question, and you want to find a way to talk to a human instead. </p></li><li><p><strong>Providing structure and discipline.</strong> The study involved students in a typical classroom. They weren&#8217;t sitting in a lab or at home. The structure and discipline of an in-person class with an in-person human teacher were present. As a result, it is much easier to see how this study - which was quite small - could scale up to larger numbers. (I still retain concerns about <a href="https://substack.nomoremarking.com/p/paper-and-on-screen-assessments">moving all learning on-screen</a>, and even when screens are used I think we need to do more on optimising them for learning, blocking distractions, etc.) </p></li></ul><p>There will be plenty of people who think this study isn&#8217;t ambitious enough. The things that I think are strengths &#8211; the focus on correct questions and answers, the way it is embedded in a typical classroom &#8211; they will see as weaknesses. Why isn&#8217;t it tearing down the traditional classroom and re-imagining education for the fifth industrial revolution? Those people will have to look elsewhere. For those of us in the evidence-based community, this is a significant breakthrough. </p><p>What&#8217;s also interesting is that for perhaps the first time, a major technology company is listening to the evidence about education. 
In my 2020 book, <a href="https://amzn.eu/d/ekdvhnc">Teachers vs Tech</a>, I lamented the fact that most of the big technology companies were spending their education budgets on things like &#8220;demonstrate how to solve equations with iMovie videos in the style of a cooking show&#8221; (<a href="https://education.apple.com/learning-center/T020408A">Apple</a>) or &#8220;different students within a single class could be completing different projects about the topic, each tailored to their learning style.&#8221; (Summit Learning, funded by Chan Zuckerberg). </p><p>If we are at a point where a major technology company is committing significant resources and talent to evidence-backed principles, then there is the potential for big breakthroughs.</p><p><strong>The error rate is low, but it still matters</strong></p><p>Although the low error rate is impressive and far better than I thought was possible, I still think it&#8217;s high enough to worry about. In this study, the human supervisor edited out the errors, so the results reported don&#8217;t include their impact. There are so few errors that you might argue they wouldn&#8217;t have changed the results, but at a larger scale with no human supervision, we just don&#8217;t know how these errors would propagate and affect a student&#8217;s understanding. </p><p>Also, whilst roughly one error in a thousand messages sounds low, it&#8217;s possible that one lesson&#8217;s worth of conversation with the chatbot might include 50 or so messages, which would put the chance of meeting at least one error in a given lesson at roughly 7%. Over the course of one year of using the chatbot in every maths lesson, a student might encounter around 12 errors (50 messages a lesson &#215; 5 lessons a week &#215; 35 weeks a year = 8,750 messages, at a 0.14% error rate). That feels significant enough to worry about, and significant enough that students would start to doubt the chatbot even when it was right. 
Obviously what would be great is if we could have a chatbot with zero errors, but I think we are in real &#8220;<a href="https://www.dwarkesh.com/p/andrej-karpathy">March of the Nines</a>&#8221; territory here - it is often as difficult to get from 99% accuracy to 99.9% as it is to get from 0% to 90%.</p><p>Instead, I think we need to focus more on the social norms around errors.  If a human teacher makes a mistake, they often know about it because two or three students look puzzled and raise their hands and say &#8220;Miss, you&#8217;ve made a mistake&#8221;. (This is one of the advantages of a large and non-personalised classroom - the teacher gets feedback from multiple sources). What should a student do if they think a chatbot has made a mistake? What process should we put in place to deal with those errors?</p><p>This is an issue for all uses of AI. In many cases it already makes fewer errors than humans, but it makes different kinds of errors in different ways. We have established and often centuries-old systems for catching and mitigating human errors. A lot of these just don&#8217;t work with AI, so we need to build new error-mitigation systems. </p><p><strong>What are the implications for other subjects?</strong></p><p>This paper solely looks at maths. What about those of us involved with teaching and assessing other subjects? At No More Marking, we focus on writing, and for the last couple of years we have been looking at ways of getting LLMs to provide useful feedback on student writing. You can read a summary of our journey <a href="https://substack.nomoremarking.com/p/bringing-our-feedback-philosophy">here</a>. </p><p>What we would really like to do is provide students with specific questions on specific aspects of their work that they can answer and that will improve their work. 
Last year, we ran a project called <a href="https://substack.nomoremarking.com/p/dynamic-writing-assessment-with-ai">CJ Dynamo </a>which tested the effectiveness of the various types of feedback we are able to produce. </p><p>We&#8217;d like to improve our feedback further.  Here&#8217;s a fairly simple example of what we&#8217;d like to do. </p><p>Here&#8217;s an extract from a piece of work by a student.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZcWb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZcWb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 424w, https://substackcdn.com/image/fetch/$s_!ZcWb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 848w, https://substackcdn.com/image/fetch/$s_!ZcWb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 1272w, https://substackcdn.com/image/fetch/$s_!ZcWb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZcWb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png" width="1456" 
height="242" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:242,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176358,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/183951019?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZcWb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 424w, https://substackcdn.com/image/fetch/$s_!ZcWb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 848w, https://substackcdn.com/image/fetch/$s_!ZcWb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 1272w, https://substackcdn.com/image/fetch/$s_!ZcWb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9a6339-b6a5-436a-8e47-e90687965a89_1636x272.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>What we&#8217;d like is for the AI to automatically produce something like the following.</p><div class="captioned-image-container"><figure><a class="image-link 
image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9LaT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9LaT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 424w, https://substackcdn.com/image/fetch/$s_!9LaT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 848w, https://substackcdn.com/image/fetch/$s_!9LaT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!9LaT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9LaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png" width="1456" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152524,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/183951019?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9LaT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 424w, https://substackcdn.com/image/fetch/$s_!9LaT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 848w, https://substackcdn.com/image/fetch/$s_!9LaT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!9LaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4234929b-8f96-47cc-b910-6ff0dc043d6a_1836x1004.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We are not able to get LLMs to do this reliably enough. Instead, we&#8217;ve settled on a <a href="https://help.nomoremarking.com/en/article/new-feedback-reports-student-report-1invdkw/">different approach</a>. </p><ul><li><p>The AI produces some personalised but less specific advice about the content of the writing, where it is less likely to go wrong. 
</p></li><li><p>We create sets of multiple-choice questions about the technical aspects of writing, which we allocate to students based on scaled score &#8211; not on whether they have made that specific error or not.</p></li><li><p>We also have personalised <a href="https://help.nomoremarking.com/en/article/new-feedback-reports-audio-feedback-1xv2znq/">AI-transcribed teacher feedback</a> based on audio teacher comments.</p></li></ul><p>Our current multiple-choice questions are more like the &#8220;static hint&#8221; approach in the Google paper. This is better than nothing, and our CJ Dynamo project shows <a href="https://substack.nomoremarking.com/p/how-do-students-redraft-their-writing">it is having a positive impact</a>. However, it would be better to have something more personalised and dynamic, and the way to do so is probably by fine-tuning an open-source LLM. This is possible and we are working on it &#8211; but it is hard and expensive, which is probably why Google are leading the way in this area currently!</p>]]></content:encoded></item><item><title><![CDATA[Bloom's famous 2 sigma tutoring paper is incredibly misleading]]></title><description><![CDATA[One-to-one tuition is not what it's cracked up to be]]></description><link>https://substack.nomoremarking.com/p/blooms-famous-2-sigma-tutoring-paper</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/blooms-famous-2-sigma-tutoring-paper</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 03 Jan 2026 08:45:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/80a6b4d0-04cf-4712-9f7e-ea07773d9c09_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 1984, Benjamin Bloom published <a href="https://web.mit.edu/5.95/www/readings/bloom-two-sigma.pdf">a famous paper</a>: <em>The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring.</em></p><p>The paper claims that one-to-one tuition produces 2 sigma improvements when compared to traditional whole-class teaching. This is a massive deal: it means that one-to-one tuition can raise the test scores of an average student to those of a student in the top two percent.</p><p>Imagine a year group of a couple of hundred students. Imagine the average students in that year group. Imagine an intervention that could move them all to the standard of the very best students in that year group - and that would simultaneously improve the scores of all the other students by an equivalent amount too. That is what 2 sigma means. 
</p><p>Although the paper was not about education technology, it has had enormous influence and impact in the ed tech world. The logic runs like this: Bloom has shown that one-to-one tuition is the best form of instruction; human one-to-one tuition is impossible at scale; technology could provide one-to-one tuition for everyone and provide 2 sigma gains for everyone. </p><p>Sal Khan of Khan Academy has based a <a href="https://blog.khanacademy.org/sal-khans-2023-ted-talk-ai-in-the-classroom-can-transform-education/">theory of AI tutoring</a> around the paper, the <a href="https://www.chalkbeat.org/2018/1/29/21104250/why-personalized-learning-advocates-like-mark-zuckerberg-keep-citing-a-1984-study-and-why-it-might-n/">Chan Zuckerberg Initiative</a> refer to it, <a href="https://www.gettingsmart.com/2025/11/25/can-tutoring-and-technology-finally-solve-blooms-two-sigma-problem/">World Bank researchers</a> love it.</p><p>The only problem is that the paper cannot bear anything like the weight of these conclusions. Here is why. </p><p><em>(I wrote a briefer critique of the paper in my 2020 book Teachers vs Tech, which you can purchase <a href="https://www.amazon.co.uk/Teachers-Tech-case-tech-revolution/dp/1382004125/ref=sr_1_1?crid=32V2YPZMR8JKU&amp;dib=eyJ2IjoiMSJ9.IIIvOLipXG-cP2HA1OUatQ.X68uHQtVKhxkeaeVFsRKm_oECRAGykW4qj4TRoDOwgw&amp;dib_tag=se&amp;keywords=teachers+vs+tech&amp;qid=1767030748&amp;sprefix=teachers+vs+tech%2Caps%2C100&amp;sr=8-1">here</a>.)</em></p><p><strong>Six major problems with Bloom&#8217;s 2 sigma tutoring claim</strong></p><ol><li><p>The study taught and assessed students on narrow domains of cartography and probability. Most education studies measure performance on much broader domains &#8211; typically literacy or numeracy. The smaller the domain, the more sensitive it is to instruction, meaning that outsize gains are more likely. As well as the tests being closely linked to the study content, they were also designed by the researchers and were not standardised.</p></li><li><p>The participating students were novices. The topics were completely new to them. This matters because beginners tend to make <a href="https://assets.publishing.service.gov.uk/media/5a7da3bced915d2ac884ca86/RR344_-_Performance_Indicators_in_Primary_Schools.pdf">rapid progress at first</a>, and that progress slows over time. Again, this affects the statistics, making dramatic outlying gains more likely and less meaningful. We have found this with our <a href="https://blog.nomoremarking.com/when-do-pupils-make-progress-at-primary-6fbcde3b5319">assessments of writing</a>. You have to be careful in interpreting and comparing results in the first few months of instruction with those from later in instruction.</p></li><li><p>It was a short-term intervention with short-term metrics. Students received 11 40-minute lessons over 3 weeks and then had a test on the content straight away. 
We don&#8217;t know if those gains were a) maintained or b) sustainable &#8211; eg, if you came back after a year, would the students have maintained that standard? Would they have continued to improve at the rate of 2 sigma every 11 lessons? <a href="https://www.tandfonline.com/doi/abs/10.1207/s15326985ep4102_1">Learning is a change in long-term memory</a>, and this study tells you nothing about the long term. </p></li></ol><p>These three deliberate design choices all make big effects more likely and less meaningful. They don&#8217;t invalidate the results, but they do severely limit the conclusions you can draw. You can&#8217;t conclude from a 3 week intervention into a small, new domain that you can turn a median student into a Rhodes scholar. </p><p>I think the root cause of all these three problems is a conflating of formative and summative assessment. Bloom&#8217;s assessments are optimised to provide short-term formative feedback, which is fine, but you cannot then use that same information to provide summative insights. This is something I write about at much greater length in my 2017 book <em><a href="https://amzn.eu/d/aWYr97a">Making Good Progress</a></em>, and you can see schools in the UK and America making the same error. Like Bloom, they will give students tests on small, recently-studied domains which all the students will ace. This is totally fine if you want to check students have understood what you have just taught. However, they will want to claim much more than that, and will say that high performance on this test is predictive of getting the top grade on national exams. 
<a href="https://daisychristodoulou.com/2019/05/what-is-mastery-the-good-the-bad-the-ugly/">This is </a><em><strong><a href="https://daisychristodoulou.com/2019/05/what-is-mastery-the-good-the-bad-the-ugly/">not</a></strong></em><a href="https://daisychristodoulou.com/2019/05/what-is-mastery-the-good-the-bad-the-ugly/"> a valid inference!</a></p><p>There are then 3 further methodological problems with the Bloom paper which are worth mentioning.</p><ol><li><p>Bloom didn&#8217;t actually carry out any of the studies in question.<a href="https://www.educationnext.org/two-sigma-tutoring-separating-science-fiction-from-science-fact/"> He was reporting data from two PhD students. </a>One of those dissertations is available online - the other isn&#8217;t. My analysis is based on <a href="https://gwern.net/doc/psychology/1983-anania.pdf">the one that is</a>. In Bloom&#8217;s paper he has a famous graph showing students jumping from the 50th to the 98th percentile. This isn&#8217;t based on the underlying data: it&#8217;s just a stylised representation of what that type of progress looks like.</p></li><li><p>The studies divided students into 3 groups: traditional whole-class instruction, mastery whole-class instruction and one-to-one tuition. The one-to-one groups got extra input: they were given more feedback and corrective tests than the other two groups.</p></li><li><p>The numbers involved in each study were very small - just a couple of hundred students in total. We have no idea whether these effects would hold if tuition were scaled up. This is a major problem with all educational interventions, particularly those which involve reducing class sizes &#8211; and one-to-one human tuition is basically just the most extreme version of reducing class sizes. 
The <a href="https://classsizematters.org/wp-content/uploads/2024/04/Jepsen-ClassSizeReduction-2009.pdf">literature</a> on reducing class sizes shows that it can be effective at a small scale, but it is hard to scale up &#8211; because to reduce class sizes at scale, you have to recruit a lot of new teachers, and often the new teachers you recruit are not as good as the existing teachers in the system. Interestingly, one of the Bloom studies had this exact recruitment problem. They used undergraduate students as tutors, and in two of the grades being studied, they couldn&#8217;t recruit enough &#8211; so they increased the tutor groups from one to three. This suggests that at scale and in real-world contexts, the gains from reducing class sizes may not be as great as the gains from improving whole-class instruction - which is the exact opposite of the message conveyed by the paper.</p></li></ol><p><strong>A sporting diversion: can we use the standard deviation to find the best sportsperson ever?</strong></p><p>These students really did make 2 sigma improvements. But they did it in such a narrow domain, in such an early part of their training, and over such a short period of time that it provides us with very few generalisable insights.</p><p>To see why, here&#8217;s an extended sporting analogy.</p><p>Don Bradman is widely regarded as the best cricketer ever. He has a batting average of 99.94. This is crazily exceptional, and one way of explaining to a non-cricket fan why this is such a big deal is to use the standard deviation.</p><p>Cricket batsmen average about 40 runs per innings, with a standard deviation of about 9. <a href="https://www.espncricinfo.com/story/the-gap-between-bradman-and-the-next-best-using-z-scores-anantha-narayanan-1432647">Bradman is therefore over 6 standard deviations better</a> than the average batsman. This is the equivalent of meeting a man with a height of over 7 feet 6 inches. 
It&#8217;s insane!!</p><p>You can use the standard deviation to measure <a href="https://significancemagazine.com/did-don-bradman-s-cricketing-genius-make-him-a-statistical-outlier/#:~:text=This%20distribution%20is%20reasonably%20symmetric,Players%20with%202%20or%20more">exceptional performance in other sports</a>, and it&#8217;s very rare to see anyone being more than 2 or 3 SD away from the mean. So does this mean Bradman is not just the greatest cricketer of all time, but the greatest sportsperson of all time - the GOAT to end all GOATs?</p><p>Maybe.</p><ul><li><p>The power of the standard deviation is abstraction. What the standard deviation lets you do is take cricket runs, football goals, 100-metre sprint times, and gymnastics scores, and essentially put them all onto the same scale. It means you are no longer comparing apples with oranges, but apples with apples.</p></li><li><p>The limitation of the SD is also abstraction. It takes away a lot of the underlying domain specific detail of different sets of numbers and enables a comparison that may not really be legitimate. There is a risk that you are still comparing apples with oranges, but you&#8217;re just pretending that you&#8217;ve turned some oranges into apples.</p></li></ul><p>The case <em><strong>against</strong></em> Bradman being the greatest sportsperson ever is that 1930s cricket was not as professional or as global a sport as modern football, sprinting and gymnastics. The talent pool Bradman was competing against simply wasn&#8217;t as competitive, and that has the potential to skew his stats.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> </p><p>Basically, in order to see whether it is legitimate to compare the standard deviations of Bradman to Messi or Federer or Bolt or Biles, you need some domain-specific understanding of each sport and its historic context. 
</p><p>In this particular case, I think the standard deviation is useful and appropriate, but not conclusive. However, there are ways in which you can use the standard deviation which are obviously just absurdly inappropriate.</p><p>Imagine a group of 8-year-old footballers who get some extra instruction on doing keepy-uppies. One kid gets some extra one to one coaching from his dad. A week later, his dad devises a keepy-uppy tournament for all the kids. His son wins! He completes 20 keepy-uppies when the tournament average is 8 and the standard deviation is 2.</p><p>If you then said, &#8220;This kid is 6 standard deviations above the mean, therefore he is a better footballer than Lionel Messi&#8221;, that would obviously be absurd.</p><p>That is what I think happens with the Bloom 2 sigma study. Novice students make rapid progress on a new, small domain over a short period of time when given extra coaching and assessed with a non-standardised test. We then fall over ourselves not just to declare that the students are better than Messi &#8211; but that their coach is the next Alex Ferguson or Pep Guardiola and we should all be copying their methods.</p><p><strong>Are outsize gains like Bloom reports really possible?</strong></p><p>At this point it is customary to say that Bloom sets our expectations too high. I don&#8217;t think this is the case. I think education has for a long time been in a pre-scientific phase, and that if we could better align it with science, then big 2 sigma style gains <em><strong>are</strong></em> possible. My issue with the Bloom paper is not that it sets unrealistic expectations, but that it won&#8217;t help us achieve any kind of expectations. </p><p><strong>Does any of this matter? 
Surely we know that one-to-one tuition is better than whole-class teaching?</strong></p><p>You might say OK, who cares, maybe the study is slightly ropey but we all know that one-to-one tuition is better than whole-class instruction, so is it really that misleading?</p><p>Yes. As we have seen, human one-to-one tuition is extraordinarily expensive and hard to scale. Bloom and his grad students acknowledged this, and the point of their research was to try to find whole-group methods that were as effective as one-to-one tuition.</p><p>However, the heavy emphasis on the impact of one-to-one tuition has made human one-to-one tuition seem like the gold standard to which we should all be aspiring. Post-Covid, many governments spent huge sums of money on catch-up human tuition, often implicitly or explicitly justified by Bloom&#8217;s research. The programmes ran into predictable problems of recruitment and training and had <a href="https://www.nfer.ac.uk/publications/independent-evaluation-of-the-national-tutoring-programme-year-2-impact-evaluation/">underwhelming results</a> - nothing like 2 sigma every 3 weeks.</p><p>Similarly, the impact on ed tech has been to encourage learning platforms to mimic one-to-one tuition and to focus on personalising instruction for the individual student.</p><p>But what if this is the wrong way round? What if the gold standard of effective human pedagogy at scale is actually whole-class instruction, and ed tech platforms should take that as a basis to learn from? Interestingly, <a href="https://storage.googleapis.com/deepmind-media/LearnLM/learnLM_nov25.pdf">a recent study</a> from Google DeepMind embedded LLM tutors within a typical whole-class environment and showed some impressive results.</p><p>We also have better and more robust data about what works in whole-class instruction - including, in England, some much better uses of standard deviations. 
</p><p><strong>What 2 sigma progress really looks like</strong></p><p>Every secondary school in England gets a <a href="https://en.wikipedia.org/wiki/Progress_8_benchmark">Progress 8 score</a>, measuring how much progress students make across eight subjects from age 11 to 16. A Progress 8 score of 0 means that, on average, pupils at the school made the same amount of progress as pupils nationally with similar starting points.</p><p>The mean is always close to 0, and most schools cluster around the mean, with over half of schools getting a score between -0.25 and +0.25.</p><p>However, there are a <a href="https://www.gov.uk/government/statistics/secondary-school-performance-data-in-england-2023-to-2024">handful of outliers scoring above 1.5</a>. These schools are achieving something close to a 2 sigma improvement.</p><p>Now of course, this is a school-level measure, not a pupil-level intervention like in Bloom. But it can still give us some useful insights. And if we run through all the flaws of Bloom again, Progress 8 avoids them.</p><ul><li><p>It measures progress on 8 big subjects &#8211; not one sub-topic!</p></li><li><p>The tests at the end are standardised and not designed by the teachers.</p></li><li><p>It measures gains over 5 years, not 11 weeks.</p></li><li><p>It includes the performance of about 3,500 schools and 600,000 students &#8211; a big sample.</p></li><li><p>Most of the schools in the sample have broadly equivalent resources.</p></li></ul><p>Obviously no metric is perfect and Progress 8 has its flaws too. 
But it is far less flawed than Bloom&#8217;s study, and a far better guide to what 2 sigma improvement in education actually looks like.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Stephen Jay Gould discusses this problem in his book <em>Full House: The Spread of Excellence from Plato to Darwin</em>. He argues that the greater the standard deviation in a sports league, the lower quality it is, and the greater chance there will be of exceptional players registering exceptional scores. 
In a higher quality league, we will see narrower standard deviations and it will be harder for exceptional players to register exceptional scores.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Using AI to judge the best Christmas film quote]]></title><description><![CDATA[Have yourself a very merry Christmas Judgement Day]]></description><link>https://substack.nomoremarking.com/p/using-ai-to-judge-the-best-christmas</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/using-ai-to-judge-the-best-christmas</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Wed, 17 Dec 2025 22:46:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!q4Jr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every year at No More Marking we run a fun and festive Comparative Judgement Christmas task where we get people to judge their favourite Christmas song, chocolate, film, etc.</p><p>This year we are adding AI into the mix, so you can judge your favourite Christmas film quote <strong>and</strong> see if our new robot overlords agree with you!</p><p>It&#8217;s also a nice way of seeing how our new AI features work.</p><p><strong>AI bless us, everyone!</strong></p><p>Here is how it works.</p><ul><li><p>Click on <a href="https://au.nomoremarking.com/judging/signup/cb59bad9-8d61-4d78-ae1a-7cf7c97a2012">this link </a>to register as a judge.</p></li><li><p>You&#8217;ll be presented with a pair of quotes from a famous Christmas film (you&#8217;ll also get the name of the film.) 
The interface will look like this.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q4Jr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q4Jr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 424w, https://substackcdn.com/image/fetch/$s_!q4Jr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 848w, https://substackcdn.com/image/fetch/$s_!q4Jr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!q4Jr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q4Jr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png" width="1456" height="784" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:784,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1386838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/181876996?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q4Jr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 424w, https://substackcdn.com/image/fetch/$s_!q4Jr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 848w, https://substackcdn.com/image/fetch/$s_!q4Jr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!q4Jr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F100c0d21-fc26-44ad-88a4-01fee1068f3b_2862x1542.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Read both quotes and have a think about which one is best!</p></li><li><p>You can click on the button with the snail in the bottom right-hand corner to see which quote our AI picks. You will also get to see a sentence where the AI explains its decision!</p></li><li><p>You can then make your own decision by clicking on the button that says &#8220;left&#8221; or &#8220;right&#8221;. You&#8217;ll then be taken to a new decision. 
</p></li></ul><p>We will share the overall results in a week or so&#8217;s time on our social media feeds.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/using-ai-to-judge-the-best-christmas?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/using-ai-to-judge-the-best-christmas?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/using-ai-to-judge-the-best-christmas?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p><strong>Now we have AI judges. Ho ho ho!</strong></p><p>In <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">the big writing projects that we run</a>, we typically get about 80-85% agreement between humans and AI. </p><p>In these big writing projects, we don&#8217;t include the button on the bottom right that gives you real-time information from the AI. Instead, we get teachers to complete their judgements independently, with no influence from the AI. You can then download the results of the AI judging and the <a href="https://substack.nomoremarking.com/p/bringing-our-feedback-philosophy">AI feedback</a> separately - and see the percentage agreement between your teachers and the AI. </p><p><strong>Faith is believing in things when AI tells you not to</strong></p><p>We don&#8217;t necessarily expect human &amp; AI choices on Christmas film quotes to align! 
In fact, we don&#8217;t necessarily expect human choices on Christmas film quotes to align! </p><p>For our serious assessment projects we spend a great deal of time <a href="https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects">validating</a> and fine-tuning to ensure that the AI decisions align with the decisions of teachers. </p><p>If you&#8217;d like to try out a writing assessment, register for one of our introductory webinars <a href="https://www.nomoremarking.com/events">here</a>.</p><p>Merry Christmas!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Updates from our AI assessment projects]]></title><description><![CDATA[Three new findings]]></description><link>https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sun, 23 Nov 2025 08:35:32 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!g-Kw!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef3cf81-2f9b-4576-8d8b-92dcad390e4f_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This term, we&#8217;ve been busy turning out results and analysis for all of our big <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">Comparative Judgement national assessment projects.</a></p><p>The Comparative Judgement plus AI model, which <a href="https://substack.nomoremarking.com/p/the-human-in-the-loop">we developed earlier this year</a> and <a href="https://substack.nomoremarking.com/p/so-can-ai-assess-writing">trialled in March</a>, is now available as standard for all of our national &amp; bespoke assessments. We have now assessed over 200,000 pieces of writing using this model and we have just completed <a href="https://substack.nomoremarking.com/p/what-makes-a-good-history-essay">our first national history assessment.</a></p><p>Here are three new things we&#8217;ve learned from this term&#8217;s assessments.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/updates-from-our-ai-assessment-projects?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p><strong>AI Comparative Judgement delivers very similar results to human Comparative Judgement - it&#8217;s just quicker!</strong></p><p>We&#8217;ve written extensively about <a href="https://substack.nomoremarking.com/p/ai-is-uncannily-good-at-judging-writing">the high agreement rates</a> we see between our human &amp; AI judges.</p><p>We now have some different data points showing something similar.</p><p>For the Year 3 writing assessment that we ran this term, almost half of our schools chose to use AI judges, and the rest chose not to. This means we can compare the results of each sub-group and see if there are any discrepancies.</p><p>What we found was that the two groups were very similar. The overall mean of each group was exactly the same: 493. The standard deviation for the AI-judged group was slightly smaller - 39 compared to 46. This means there were fewer very high and very low scores in the AI-judged group. We are not totally sure why this is, but it is not a huge difference.</p><p><strong>AI is better at Comparative Judgement than absolute judgement</strong></p><p>There are a lot of organisations out there doing AI marking. Most of them get the AI to do traditional marking, which is a form of absolute judgement. 
You are asking the AI to look at one piece of writing and place it on an absolute scale.</p><p>We trialled this approach in the past and moved away from it for several reasons, the most important of which was that the AI just wasn&#8217;t very good at it.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>One way we can validate the scores from any assessment is to see if they help us predict the scores the same students got on other assessments of the same construct. We have used this method to validate our human Comparative Judgement assessments over the last few years and we routinely see correlations of 0.7 or higher between student scores on one assessment and the next. Using the AI to make absolute judgements, <a href="https://substack.nomoremarking.com/p/would-you-ask-gpt-4-to-mark-your">we saw</a> only <a href="https://substack.nomoremarking.com/p/more-gpt-marking-data-is-it-better">a 0.5 correlation</a>.</p><p>However, now that we are using Comparative Judgement, we are seeing much higher correlations. Approximately 23,000 students who took part in this term&#8217;s Year 3 assessment also took part in a similar Year 2 assessment in February which was entirely judged by humans. We could therefore measure the correlation between the February Year 2 assessment and the October Year 3 assessment. We found that the October Year 3 human and AI judges <strong>both</strong> achieved high correlations with the February Year 2 assessment. 
(This of course is another data point showing that the AI is as good as humans).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JaKE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JaKE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 424w, https://substackcdn.com/image/fetch/$s_!JaKE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 848w, https://substackcdn.com/image/fetch/$s_!JaKE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 1272w, https://substackcdn.com/image/fetch/$s_!JaKE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JaKE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png" width="1308" height="278" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:278,&quot;width&quot;:1308,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45855,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/179556820?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JaKE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 424w, https://substackcdn.com/image/fetch/$s_!JaKE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 848w, https://substackcdn.com/image/fetch/$s_!JaKE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 1272w, https://substackcdn.com/image/fetch/$s_!JaKE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b114e4-20d4-4a01-b24c-d48773cfdf2b_1308x278.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p><strong>AI can judge subjects other than English Language</strong></p><p>We have just completed our first nationally standardised history assessment. 
Twenty-five schools and just over 4,000 students took part. In the past we have had lots of schools use our platform for history assessment, but we&#8217;ve never run a nationally standardised project, partly because there aren&#8217;t as many history teachers as English teachers and this makes judging quite time-consuming.</p><p>Adding in AI judges dramatically reduced the time it took to judge. On average, each teacher in the project judged for just under 20 minutes - <a href="https://help.nomoremarking.com/en/article/how-long-does-it-take-to-assess-one-classs-essays-using-comparative-judgement-lrxp0m/">which is what we predicted</a>. In return, they got <a href="https://blog.nomoremarking.com/cj-history-what-reports-are-available-cfc87fc3f8d7">7 PDFs with incredibly detailed data and feedback</a>.</p><p>Was the AI good at judging more complex essays where the focus is not just on writing but on subject content too? <a href="https://blog.nomoremarking.com/cj-history-how-did-the-ai-judges-do-688176be649e">The AI agreed with the human decisions 77% of the time</a>. This is slightly lower than the 85% we typically get for writing assessments, but it&#8217;s still not bad. Our initial feedback from schools is that the results made sense.</p><p>We also have hundreds of schools using our platform to run custom AI assessments in a whole range of subjects. Custom assessments use all of our AI features, but they are customised to an individual school&#8217;s curriculum &amp; calendar and aren&#8217;t nationally standardised. It is early days, but so far the approach seems to be working well for all these other subjects too. 
</p><p>If you would like to learn more, our next <a href="https://us02web.zoom.us/webinar/register/WN_RmM4ooBLTD6HO3AnTmzl2Q">introduction webinar</a> is in January.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Even if the AI does get better at absolute judgement, there are still problems. 
It&#8217;s hard to use human oversight to validate this approach, and there is no statistical model underneath it - which is a problem given that most grades involve a significant statistical element (eg in England about 2.5% of students get a grade 9 in GCSE English Language).</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Can LLMs be personal tutors?]]></title><description><![CDATA[Four big challenges]]></description><link>https://substack.nomoremarking.com/p/llm-tutors-are-they-any-good</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/llm-tutors-are-they-any-good</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sun, 16 Nov 2025 16:45:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4dd6eeb8-5285-4f10-bed5-4894c9351722_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Think how amazing it would be to have a personal tutor who is an expert in every subject under the sun and available on-demand 24/7.</p><p>That is the incredibly exciting promise of Large Language Models - that they will be able to teach you anything you want, whenever you want. </p><p>However, I think the barriers to getting there are more significant than we imagine.</p><p>Here are four challenges that LLM tutors have to overcome. </p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/llm-tutors-are-they-any-good?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading No More Marking! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.nomoremarking.com/p/llm-tutors-are-they-any-good?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.nomoremarking.com/p/llm-tutors-are-they-any-good?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p><strong>1. LLMs are good at providing explanations, but explanations are over-rated</strong></p><p>LLMs are good at providing explanations. The problem is that pedagogically, explanations are over-rated.</p><p>Thomas Kuhn, the famous philosopher of science, <a href="https://depts.washington.edu/lsearlec/205/Texts/2nd-thoughts.pdf">once asked why it was </a>that a group of students could all read the same chapter of a physics textbook and say they had understood it &#8211; but then get the questions at the end of the chapter totally wrong.</p><p>Kuhn concluded that what these students really needed was not explanations but lots and lots of examples and questions. </p><p>He was right. Questions are important for two reasons: they force the student into <a href="https://applied-science-cisdev.sites.olt.ubc.ca/files/2018/06/retrieval-and-spacing.pdf">mental activity</a>, which is necessary for learning. And they tell the student and the teacher if the student actually has understood what has been taught. </p><p>The research also shows that students often don&#8217;t like this.<a href="https://www.tandfonline.com/doi/abs/10.1080/09658210802647009"> They prefer to read, reread and highlight explanations than to answer questions</a>. That&#8217;s probably because rereading an explanation is easy, but answering questions is hard. It&#8217;s also because reading an explanation feels like you have understood something. 
It gives you the <a href="https://en.wikipedia.org/wiki/Illusion_of_explanatory_depth">illusion of understanding</a>, whereas answering a set of questions exposes the reality that you don&#8217;t.</p><p>What we need are not LLMs that answer questions from students. We need LLMs that ask students questions. </p><p>But the problem with that is&#8230;</p><p><strong>2. LLMs are not as good at creating precise questions</strong></p><p>LLMs still have problems with hallucinations, and this is a real problem when you want to create banks of questions and answers where precision and accuracy really matter. </p><p>We have experience of this with <a href="https://substack.nomoremarking.com/p/bringing-our-feedback-philosophy">the feedback we provide on our writing assessments</a>. We provide LLM-generated written feedback for students and teachers. At that level of generality, the LLM does a good job. </p><p>But we also wanted something more precise &#8211; so we asked the LLM to generate a series of multiple-choice questions based on each student&#8217;s piece of writing. It found that task much harder, and a number of errors crept in. Errors like these can cause enormous confusion for novices. [We ended up creating our own and allocating them based on the students&#8217; scaled score.]</p><p>When I talk about the error rate of LLMs, the inevitable response I get is &#8220;yes but humans aren&#8217;t perfect either&#8221;. That is absolutely true. In the great &#8220;algorithms vs humans&#8221; debate, here at No More Marking we are <a href="https://substack.nomoremarking.com/p/what-do-you-prefer-human-error-or">mostly on the side of the algorithms</a>, because we know that humans make so many mistakes.</p><p>However, in this particular case &#8211; the creation of personalised questions &#8211; the correct comparison is not between error-prone LLMs and error-prone humans. 
The correct comparison is between error-prone LLMs and older technologies which have largely eliminated errors. Which brings me to my third point.</p><p><strong>3. Pre-LLM technologies are very good at creating error-free, scalable and personalised resources</strong></p><p>The original technology for creating error-free and scalable educational resources is about half a millennium old &#8211; it&#8217;s the printing press. Once you have a really good set of questions (or indeed an explanation) you can proofread it and get it checked over by multiple other humans and then get it printed as many times as you need.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> </p><p>Of course, printed textbooks aren&#8217;t personalised or interactive. But personalised and interactive resources <strong>do</strong> exist already too &#8211; not for as long as the printing press, of course, but for several decades. </p><p>Many online learning platforms consist of enormous banks of accurate questions. Students can proceed through them at their own pace and receive personalised feedback and next steps based on their pattern of correct and incorrect questions. There are <a href="https://en.wikipedia.org/wiki/Intelligent_tutoring_system">many</a> platforms like this. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12078640/">They obviously vary in style and quality</a>, but the best of them have decent track records.</p><p>So, one major question for me is this: how are LLMs going to improve on these pre-existing technologies?  What can they offer that is better?</p><p>And this also brings me to my fourth point. These very effective pre-LLM digital tutors have been around for decades, and they have not made the human teacher or the physical classroom obsolete. Why?</p><p><strong>4. 
There is a limit to what students will learn on their own and on a screen</strong></p><p>The Covid pandemic provided us with a natural experiment in the effectiveness of online learning. Did everybody say at the end of it, fantastic, actually, it turns out that we don&#8217;t really need physical schools and human teachers any more? </p><p>No. Everybody said: we need to get the kids back into school. The <a href="https://www.sciencedirect.com/science/article/pii/S0959475225000350?via%3Dihub">global</a> <a href="https://openknowledge.worldbank.org/entities/publication/ce21738f-72d5-55ac-9876-23ac39efffea">data</a> shows that students learnt less when schools were closed, not more, even in countries where they had access to the internet and many brilliant online learning tools. And even before Covid, we knew that online learning courses had <a href="https://www.erudit.org/en/journals/irrodl/2015-v16-n3-irrodl04980/1065985ar.pdf">very high drop out rates</a>. </p><p>The structure and discipline of in-person classrooms are important, and online platforms lack this structure. So even if they are full of brilliant content and sound pedagogical principles, they may not be as effective as in-person teaching.</p><p>For LLM tutors to succeed where other online learning platforms have not, they have to overcome this problem. <strong>Either</strong> they have to find ways of incorporating the structure and discipline of an in-person class, <strong>or</strong> they have to be so much more engaging and compelling than existing online learning platforms that they will eliminate the need for structure and discipline as students will prefer using them to doing anything else online. </p><p>The latter is going to be very hard and is largely beyond the control of any online learning platform, as it is competing against entertainment platforms that aren&#8217;t constrained by learning. 
<a href="https://substack.nomoremarking.com/p/why-education-can-never-be-fun">Optimising for one parameter is easier than optimising for two. </a></p><p><strong>Questions about questions</strong></p><p>So, to sum up, here are the four questions you need to ask of any LLM tutor.</p><ul><li><p>Does it rely solely on explanations?</p></li><li><p>If it does use questions, how does it ensure they are accurate?</p></li><li><p>In what ways is it better than pre-existing online learning systems that don&#8217;t use LLMs?</p></li><li><p>Is it integrated with a traditional classroom, or is it designed for students to use on their own? If the latter, how will it get high completion rates?</p></li></ul><p>Some systems are engaging seriously with these questions and coming up with good answers, and I will profile a few in a future post. But many are not, and the risk is that LLMs just get added to the long line of technological innovations that <a href="https://www.amazon.co.uk/Teachers-Tech-case-tech-revolution/dp/1382004125/ref=sr_1_1?crid=261WKS4T6SL8A&amp;dib=eyJ2IjoiMSJ9.1D0AEyrYQ5lrrRieVpqEM_0Jy0EDDNZRbu31xD5qY9-j8v11TCYjK5hr2CfpsaZw3cWk7U6Goid8xkz8vnaWUMw3zPiDl9ZWVPuREcCcSb4.o6tEqBibgHRnzNApwscdKz6SBXRwUK37jx6DZo-CSj8&amp;dib_tag=se&amp;keywords=teachers+vs+tech&amp;qid=1763303858&amp;sprefix=teachers+vs+te%2Caps%2C278&amp;sr=8-1">promised and failed to improve education.</a></p>
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Some of the earliest printed books do have quite a few errors, and completely eliminating all errors in any format is not easy. Andrej Karpathy&#8217;s &#8220;<a href="https://www.dwarkesh.com/p/andrej-karpathy">march of nines</a>&#8221; is as true of Gutenberg&#8217;s books as of Waymo&#8217;s self-driving cars. But a modern textbook that is in its 2nd edition is likely to have vanishingly few errors. 
For example, <a href="https://www.amazon.co.uk/Expressive-Writing-Level-Workbook-Bk/dp/0076035891/ref=asc_df_0076035891?mcid=24aec99cd6013590b6072ec692cf01ee&amp;th=1&amp;psc=1&amp;tag=googshopuk-21&amp;linkCode=df0&amp;hvadid=697219656396&amp;hvpos=&amp;hvnetw=g&amp;hvrand=12021444788765579376&amp;hvpone=&amp;hvptwo=&amp;hvqmt=&amp;hvdev=c&amp;hvdvcmdl=&amp;hvlocint=&amp;hvlocphy=9044967&amp;hvtargid=pla-761467384512&amp;psc=1&amp;hvocijid=12021444788765579376-0076035891-&amp;hvexpln=0&amp;gad_source=1">this</a> textbook is the one I know best, and neither I nor several colleagues and students have spotted any errors in it.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Can we teach students to spot misinformation?]]></title><description><![CDATA[Does the Pacific Northwest Tree Octopus exist?]]></description><link>https://substack.nomoremarking.com/p/can-we-teach-students-to-spot-misinformation</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/can-we-teach-students-to-spot-misinformation</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sun, 09 Nov 2025 10:05:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/736767ee-3978-4ea4-acd3-f1d5a74e630c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week, the UK government released its long-awaited <a href="https://assets.publishing.service.gov.uk/media/690b96bbc22e4ed8b051854d/Curriculum_and_Assessment_Review_final_report_-_Building_a_world-class_curriculum_for_all.pdf">Curriculum &amp; Assessment Review</a>. 
One recommendation in particular has been getting a lot of attention: that the government should &#8220;strengthen the role of media literacy&#8221; with a particular focus on &#8220;understanding how to identify and protect against misinformation and disinformation.&#8221;</p><p>Much of the subsequent discussion about this proposal has focused on the challenges of defining misinformation in an era of political polarisation. However, I want to focus on something even more basic: even when falsehoods are obvious and undeniable, many students struggle to spot them.</p><p><strong>The Pacific Northwest Tree Octopus</strong></p><p>The Pacific Northwest Tree Octopus is a deliberate and humorous <a href="https://chadoh.com/tree-octopus/index.html">hoax website</a> about a fabricated species of octopus that supposedly lives in the forests of the Pacific Northwest. 
The website has all the features of a serious conservation website, but it&#8217;s completely made up.</p><p>A couple of research studies have used this website to see how good students are at evaluating the reliability of online sources. In a <a href="http://advance.uconn.edu/2006/061113/06111308.htm">2007 study in America</a>, barely any 7th grade students identified it as a hoax, and a more recent <a href="https://dspace.library.uu.nl/bitstream/handle/1874/421595/10_1108_ILS_04_2018_0031.pdf?sequence=1">Dutch study</a> found something similar with 11 &amp; 12 year olds.  In the US study, students still insisted the octopus was real even after being told it was a fake.</p><p><strong>How could we get students to spot this hoax?</strong></p><p>A common response to this problem is to say that we should teach students to be digitally literate, or to teach them some kind of checklist that they can use to evaluate websites. For example, there is the <a href="https://en.wikipedia.org/wiki/CRAAP_test">CRAAP</a> checklist and the <a href="https://guides.lib.uchicago.edu/c.php?g=1241077&amp;p=9082322">SIFT</a> checklist, and <a href="https://www.twinkl.co.uk/resource/t2-e-3741-ks2-fake-news-checklist">other resources </a>designed for younger students.</p><p>The problem with all of these checklists is that they&#8217;re a bit like telling a student to look up words they don&#8217;t know in a dictionary. This only works if the student has a big enough vocabulary to know what the words in the definition mean. If they don&#8217;t, they&#8217;re caught in an infinite loop of looking up the words they don&#8217;t know, only to find more words they don&#8217;t know.</p><p>Most of these checklists recommend that students should check the source of the information to see if it is trustworthy and reliable. But how do you know if a source is trustworthy or reliable? 
The Pacific Northwest Tree Octopus website is associated with the &#8220;Kelvinic University branch of the Wild Haggis Conservation Society.&#8221; If you&#8217;re an adult, that just sounds a bit off. But lots of students think it sounds like a great endorsement.</p><p>Of course, all the checklists recommend doing further research and online searches to verify information. What happens if you do a Google search on the Kelvinic University? Currently, the first result you get tells you &#8220;Kelvinic University is a fully accredited, independent institute of higher learning that offers Bachelor&#8217;s, Master&#8217;s, and PhD programs.&#8221; Well, that sounds OK. I guess the Tree Octopus is real!</p><p><strong>It&#8217;s the same for critical thinking</strong></p><p>In a <a href="https://www.aft.org/sites/default/files/media/2014/Crit_Thinking.pdf">2007 article</a>, the cognitive scientist Dan Willingham noted that over the last 20 years, programmes designed to teach critical thinking had become very popular, but  they were not very effective. He concludes with the following: </p><blockquote><p>&#8220;Can critical thinking actually be taught? Decades of cognitive research point to a disappointing answer: not really.&#8221;</p></blockquote><p>He makes the same point about the limitations of teaching maxims. </p><blockquote><p>&#8220;If you remind a student to &#8216;look at an issue from multiple perspectives&#8217; often enough, he will learn that he ought to do so, but if he doesn&#8217;t know much about an issue, he can&#8217;t think about it from multiple perspectives.&#8221;</p></blockquote><p>He gives another example: suppose you want to investigate why one car gets better gas mileage than another. How will you devise your research hypothesis and which factors will you choose to investigate?  Your decision about what to investigate depends on specific knowledge. 
</p><blockquote><p>You won&#8217;t choose to investigate a difference between cars A and B that you think is unlikely to contribute to gas mileage (e.g., paint color), but if someone provides a reason to make this factor more plausible (e.g., the way your teenage son&#8217;s driving habits changed after he painted his car red), you are more likely to say that this now-plausible factor should be investigated. One&#8217;s judgment about the plausibility of a factor being important is based on one&#8217;s knowledge of the domain.</p></blockquote><p>Your ability to apply maxims like &#8220;devise a research hypothesis&#8221; and &#8220;control the variables&#8221; depends on very specific contextual knowledge.</p><p><strong>What is truth, said jesting Pilate, and would not stay for an answer</strong></p><p>In order to make a judgement about a truth claim, you have to know something about the claim itself. Here is the philosopher Dan Williams making this point.</p><blockquote><p>The fundamental problem is that there are no intrinsic differences between true and false claims. That is, whether a claim is right or wrong&#8212;or informative or misleading&#8212;depends not on characteristics of the claim itself but on whether it accurately represents how things are.</p></blockquote><p>If someone tells you that there are tree octopuses living in the forests of the Pacific Northwest, &#8220;you cannot simply examine the statement&#8212;or even its surrounding rhetorical context&#8212;to figure out whether it is true or false; its truth or falsity depends on the world.&#8221;</p><p>Williams is not claiming that we have to independently verify every single fact before we can trust it. That&#8217;s impossible. 
He is just pointing out that statements and their rhetorical contexts are imperfect guides to truth, and telling students they are is setting them up to fail.</p><p>I think history teachers in the UK have similar experiences of the challenges of trying to teach students to evaluate truth claims with general principles. Are sources written by individuals more or less reliable than ones written by governments? Are sources written by eyewitnesses more or less reliable than ones written centuries after the event? Are sources written for publication more or less reliable than private diaries or letters?</p><p>The answer in every case is &#8220;it depends&#8221;. If only we could say: this source has feature x, therefore it is definitely true. It would be wonderful &#8211; we wouldn&#8217;t have to teach any history at all! But we do have to teach history, and science, and geography, and it&#8217;s good teaching of these kinds of traditional school subjects that will, in the long-term, provide students with the best possible defence against hoaxes and misinformation.</p><p>The Curriculum and Assessment Review quite rightly recognises a lot of what I&#8217;ve said above. It makes it very clear that background knowledge is necessary to evaluate truth claims, and that &#8220;having secure knowledge is essential to discerning truth from falsehoods and is one of the many reasons why a knowledge-rich curriculum is more, not less important in the modern world.&#8221;</p><p>It is now up to the government to implement their recommendations about improving media literacy and helping students spot misinformation. So what should they do?</p><p><strong>So how can we improve media literacy?</strong></p><p>One of the major themes of this Substack is that assessment is where the rubber hits the road, or, in Dylan Wiliam&#8217;s terms, assessment operationalises curriculum. 
</p><p>If we&#8217;re interested in strengthening media literacy, we need to design assessments that will help us a) work out exactly what it is we are trying to improve and b) whether the interventions we&#8217;re proposing work or not. </p><p>Here is a suggestion about how this could work. We could create a pre-test consisting of four websites. Three are accurate, and one is a hoax, like the Pacific Northwest Tree Octopus website. We could then create a post-test of four new websites, again made up of three accurate ones and one hoax. In each case, we ask the students to identify the hoax website and explain why. (You could use <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">Comparative Judgement</a> to evaluate their explanations!)</p><p>In between the pre and post test, we can deliver our intervention and see if it leads to improvements on the post-test.</p><p>I would suggest that any new media literacy curriculum should be piloted and evaluated in this way before wider implementation. I think it&#8217;s unlikely that any generic checklist approach will be successful - but I might be wrong, and either way, we will be adding to the sum of human knowledge and discovering more about what does and doesn&#8217;t work.</p>
]]></content:encoded></item><item><title><![CDATA[AI is uncannily good at judging writing]]></title><description><![CDATA[Results from our latest assessments]]></description><link>https://substack.nomoremarking.com/p/ai-is-uncannily-good-at-judging-writing</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/ai-is-uncannily-good-at-judging-writing</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sat, 25 Oct 2025 07:24:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rfOG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>We&#8217;ve had a lot of new subscribers over the past few weeks - welcome! 
Our Substack has a mix of big-picture articles about the impact of technology on education - eg, <a href="https://substack.nomoremarking.com/p/are-we-living-in-a-stupidogenic-society">Are we living in a stupidogenic society?</a>; <a href="https://substack.nomoremarking.com/p/why-education-can-never-be-fun">Why education can never be fun</a> - and detailed research from our AI-enhanced Comparative Judgement assessment projects - eg, <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">What is Comparative Judgement and why does it work?</a> and <a href="https://substack.nomoremarking.com/p/so-can-ai-assess-writing">So, can AI assess writing?</a> Enjoy!</em></p><p>Over the last couple of years, we&#8217;ve carried out <a href="https://substack.nomoremarking.com/p/the-human-in-the-loop">a lot of research</a> into how Large Language Models can be used to assess student writing.</p><p>We started out by asking the LLM to assign a mark to each individual piece of writing. However, we found that this approach didn&#8217;t work that well. The LLM would make baffling errors and frequently disagreed with the human consensus.</p><p>So we tried a different tack - we asked the LLMs to make <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">Comparative Judgements </a>instead. They have to read two pieces of writing and choose which is better, and we can then combine together all of these decisions to create a very sophisticated measurement scale for every piece of writing. This approach also makes it easy to add in human judgements which can then be used to validate the AI.  </p><p>This approach is much more effective, and results in very high levels of agreement between our AI and human judges. We ran a number of trials at the end of last academic year, and for this academic year, we have integrated AI judges into all of our national projects. Schools can choose what ratio of AI judges they want. 
<a href="https://help.nomoremarking.com/en/article/how-long-does-it-take-to-assess-one-classs-essays-using-comparative-judgement-lrxp0m/">Our recommendation is 90% AI, 10% human.</a> This will obviously reduce the time it takes humans to judge by 90%!</p><p><strong>Our latest assessment</strong></p><p>Our latest assessment involved approximately 70,000 pieces of writing completed by Year 7, 8 &amp; 9 students from 177 UK secondary schools. </p><p>Most of the schools followed our recommendation to do 90% AI judgements. </p><p>In total, our human teachers<strong> </strong>made 133,983 decisions. The AI judges agreed with 83% of them, which is similar to the typical human-human agreement across our projects.</p><p>Of the 22,913 judgements where the human and AI disagreed, 50% were 15 points or under, 90% were 45 points or under, and 97% were 67 points or under. (Our scale is fine-grained, and runs from about 300 - 700).</p><p>1.4% of the decisions - 324 in total - were above 80 points. 
That is 1.4% of the total disagreements, but just 0.24% of the total number of human judgements.</p><p>Some element of disagreement is always going to exist with assessments of extended writing, whoever is judging it. This is a very low rate of serious disagreement, and one that we think is acceptable.</p><p><strong>What about the bigger disagreements?</strong></p><p>Here is the biggest disagreement. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rfOG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rfOG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 424w, https://substackcdn.com/image/fetch/$s_!rfOG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 848w, https://substackcdn.com/image/fetch/$s_!rfOG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 1272w, https://substackcdn.com/image/fetch/$s_!rfOG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rfOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png" 
 width="1106" height="692" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:692,&quot;width&quot;:1106,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:597092,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://substack.nomoremarking.com/i/176722336?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rfOG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 424w, https://substackcdn.com/image/fetch/$s_!rfOG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 848w, https://substackcdn.com/image/fetch/$s_!rfOG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 1272w, https://substackcdn.com/image/fetch/$s_!rfOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153b35c-c6a1-4780-b137-f93595c42d5f_1106x692.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The piece on the right is hard to read, but the AI was able to make an accurate typed transcription of it which reveals it is a good piece of writing. We feel this is an unambiguous example of a human error. So far, we have assessed nearly 200,000 pieces of writing using this method and all of the really big disagreements are the result of human, not AI error. </p><p>That&#8217;s pretty remarkable! We&#8217;re very familiar with the problem of LLMs hallucinating, but they do just seem much better at Comparative Judgement than at many other tasks.</p><p>We have found a few smaller disagreements where we think the AI has erred, but we also think these can be fixed with some tweaks to the judging prompt. 
We will share more about this, and more statistics on the predictive validity of the AI, in future posts.</p><p><strong>What&#8217;s next</strong></p><p>As well as integrating AI judges into our national projects, we have also made AI judges available for any <a href="https://help.nomoremarking.com/en/article/ai-enhanced-custom-tasks-overview-c03pfu/">custom assessment</a> that an individual school might want to run. Schools can choose their own criteria for these assessments.</p><p>We&#8217;ll continue to report on our research on this Substack. If you would like to learn more about how our Comparative Judgement + AI approach works, you can take part in one of our intro webinars <a href="https://www.nomoremarking.com/events">here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[What kind of student feedback do teachers think is best?]]></title><description><![CDATA[The tension between personalised and in-person education]]></description><link>https://substack.nomoremarking.com/p/what-kind-of-student-feedback-do</link><guid isPermaLink="false">https://substack.nomoremarking.com/p/what-kind-of-student-feedback-do</guid><dc:creator><![CDATA[Daisy Christodoulou]]></dc:creator><pubDate>Sun, 19 Oct 2025 08:31:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rA7o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the last year, we have added <a href="https://substack.nomoremarking.com/p/bringing-our-feedback-philosophy">many different types of AI-enhanced student feedback</a> into our <a href="https://substack.nomoremarking.com/p/what-is-comparative-judgement-and">Comparative Judgement writing assessments</a>. 
AI has made it incredibly quick and easy to create feedback that would typically have taken hours of teacher time.</p><p>However, whether the feedback is generated by AI or by humans, we need to be sure it is useful, and there are <a href="https://substack.nomoremarking.com/p/ai-feedback-a-thermometer-or-a-thermostat">long-standing concerns about whether </a><em><strong><a href="https://substack.nomoremarking.com/p/ai-feedback-a-thermometer-or-a-thermostat">any</a></strong></em><a href="https://substack.nomoremarking.com/p/ai-feedback-a-thermometer-or-a-thermostat"> form of written feedback is useful</a>. </p><p>(This is a problem with the evaluation of AI more generally. We evaluate AI by seeing if it can reproduce something that professionals currently spend a lot of time on. We don&#8217;t ask whether the thing the professionals are spending time on is actually valuable.)</p><p>Here are the questions we need to ask to decide whether any form of writing feedback is useful.</p><ul><li><p>Do students understand what this feedback means and what action they need to take to improve?</p></li><li><p>If students follow the advice given in this feedback, will it make them better writers?</p></li><li><p>Is the feedback helping to improve their overall writing skills, or just improving the specific piece of work?</p></li></ul><p>Fortunately, AI can help us assess these questions more speedily too. We have now integrated AI judges into our Comparative Judgement assessment process, making it quicker and less burdensome to run follow-up assessments measuring student progress. 
Our first attempt at this, <a href="https://substack.nomoremarking.com/p/dynamic-writing-assessment-with-ai">CJ Dynamo</a>, showed that on average students made good progress in response to feedback, but a significant minority went backwards.</p><p><strong>What do teachers think of it all?</strong></p><p>As well as hard assessment data on student improvement, we also want to know what teachers and students think about <a href="https://help.nomoremarking.com/en/category/feedback-reports-7nh9ew/">our new reports</a>.</p><p>So far, our teachers have told us that the report they find the most useful is the <a href="https://help.nomoremarking.com/en/article/new-feedback-reports-teacher-report-11vo08/">teacher report</a>, which gives teachers personalised information on every student. There are three elements in the report: data, AI feedback and the student writing. 
They prefer this to the student report, which is similar but doesn&#8217;t have data and has simplified AI feedback.</p><p>The most-requested feature from teachers has been an AI-generated year group summary of all this personalised feedback. We&#8217;ve developed that, and you can see an example of what it looks like <a href="https://help.nomoremarking.com/en/article/new-feedback-reports-teacher-report-11vo08/">here</a>. It&#8217;s kind of like an examiner&#8217;s report, but just for your students.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rA7o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rA7o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 424w, https://substackcdn.com/image/fetch/$s_!rA7o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 848w, https://substackcdn.com/image/fetch/$s_!rA7o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 1272w, https://substackcdn.com/image/fetch/$s_!rA7o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rA7o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png" width="1456" height="609" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rA7o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 424w, https://substackcdn.com/image/fetch/$s_!rA7o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 848w, https://substackcdn.com/image/fetch/$s_!rA7o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 1272w, https://substackcdn.com/image/fetch/$s_!rA7o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb969f4e1-8515-41fd-a0fc-42b1dff371b6_1600x669.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Why do teachers want this feedback?</strong></p><p>So, teachers seem to prefer our teacher report to our <a href="https://help.nomoremarking.com/en/article/new-feedback-reports-student-report-1invdkw/">student report</a>, and their most-requested feature is a class / year group summary. That suggests to us that they don&#8217;t want to give feedback on writing directly to students. Instead, what they want is to mediate the feedback via whole-class instruction.</p><p>This has been a trend in English schools since before the development of Large Language Models. 
We have also written extensively about <a href="https://substack.nomoremarking.com/p/but-my-students-like-written-comments">the value of whole-class feedback compared to traditional written comments</a>, so to that extent we are pleased to see teachers recognising this. </p><p><strong>Would students be better off with personalised feedback?</strong></p><p>One of the long-standing criticisms of whole-class feedback is that it <em><strong>isn&#8217;t</strong></em> personalised. Generally speaking, a teacher will design whole-class feedback to focus on the most common errors they see in the class, and AI-generated whole-class feedback does something similar. Students who make rare errors, or no errors, or more errors than average, will not be getting ideal feedback. There are some ways to mitigate this problem, but it would be unrealistic to pretend it can ever be completely solved.</p><p>However, it&#8217;s also fair to accept that this is not just a limitation of whole-class feedback. It is a fundamental limitation of the traditional human classroom itself. One teacher cannot realistically personalise instruction for 20 or 30 students. </p><p>One of the major arguments in favour of education technology has been that it can solve this problem and provide tailored instruction for every student. The development of Large Language Models has only increased the momentum for personalised learning. </p><p>So you might be thinking that it is a bit backward of us, and our teachers, to be using LLMs to provide something that <em><strong>isn&#8217;t</strong></em> personalised - something that is just a summary of where a class is. Surely they would be better off with the personalised LLM feedback?</p><p>Not necessarily. The reality is that our current education system is based around in-person physical schools and classrooms, led by human teachers. 
There is a good reason for that: we saw in the pandemic that however amazing your online learning platform, it cannot provide the structure, routine, discipline and community of an in-person classroom. </p><p>And those in-person virtues create constraints. A teacher generally does want their students to be working on the same concepts and moving at roughly the same pace. If they are all doing their own thing at their own pace, it is hard to keep those valuable structures and routines together. </p><p>The great challenge for modern education technology is to find ways of integrating technology and the classroom: to resolve the tensions between the personalised and the in-person. Technology&#8217;s ability to personalise instruction is genuinely amazing, and we need it. But the human-scale screen-free community of the traditional classroom is also amazing, and we need that too. There are limits to how much students will learn on a screen and on their own. </p><p>It&#8217;s counter-intuitive, but using LLMs to generate <em><strong>non</strong></em>-personalised feedback might be more effective than you first think.</p>]]></content:encoded></item></channel></rss>