Primary teachers in England are complaining about the Sats reading paper taken by 11-year-olds. The test was too hard, it left children in tears, and the reading texts featured unusual, antiquated topics about dodos and lost queens. The DfE responded by saying that the point of the test was to be difficult and to differentiate between pupils at different standards.
This news story is not from last week, but from May 2016. As a result of all these complaints, Ofqual carried out an investigation into this particular test, which you can read here. So how, seven years later, have we ended up in a situation where teachers are making exactly the same complaints and the DfE is making the same defence?
Here's what I think the three main problems are, both in 2016 and today.
The Sats reading test attempts to assess a very wide range of attainment
Before 2016, there was a separate, harder test for pupils at level 6 and above. Since 2016, there has been only one test for all pupils, but it is still expected to provide information across the full range of attainment. This is quite hard to do.
If you create a test that is too hard, then the test does not give you useful information about low-attaining students and can also leave them stressed and unhappy. If you create a test that is too easy, then the test does not give you useful information about high-attaining students and can leave them bored and demotivated.
There are also other, subtler, problems with tests that are not well-suited to particular students. For example, tough tests often have quite low average scores. A struggling student can take a tough test, make a few lucky guesses and end up with an average score. Similarly, a stronger student can take an easy test, make some careless mistakes on very simple questions, and end up with a below average score.
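To put rough numbers on this, here is a quick sketch of both effects. The figures and the multiple-choice format are invented for illustration, not taken from any actual Sats paper:

```python
from math import comb

def p_at_least(k, n, p):
    """P(score >= k) on n independent items, each answered correctly with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N = 30  # a hypothetical 30-item multiple-choice paper

# Tough paper: suppose the cohort facility is ~40%, so the average score is ~12/30.
# A pupil guessing blindly at four options (25% per item) sometimes reaches that average:
print(f"Guesser matching the average on a tough paper: {p_at_least(12, N, 0.25):.1%}")

# Easy paper: suppose the cohort facility is ~90%, so the average score is ~27/30.
# A strong pupil who slips to 80% per item falls below that average most of the time:
print(f"Strong pupil below average on an easy paper: {1 - p_at_least(27, N, 0.80):.1%}")
```

In neither case does the score tell you much about what the pupil can actually do.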
The dream is that we create tests with super low floors and super high ceilings, but in practice that is very hard to do and to a certain extent the floor and ceiling are in tension.
The floor-ceiling trade-off is always with us, but it will be a bigger problem when there is a bigger attainment gap. Our data at No More Marking suggest that the attainment gap has increased quite significantly at primary in the last year or so, which may be a factor in some students’ struggles with this reading test.
The obvious reform here is to move back to a two-test structure, or to introduce on-screen adaptive testing which tailors the content of the test to each student. These reforms have their challenges and drawbacks too, but we might decide those challenges are preferable to the ones we are facing at the moment.
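Real adaptive tests use item response theory, but the underlying idea is simple enough to sketch. Everything below (the item bank, the step size, the stand-in pupil) is invented for illustration:

```python
# A toy adaptive test: pick each item to match the current estimate of the pupil,
# move the estimate up after a right answer and down after a wrong one.

items = list(range(1, 21))            # 20 items with difficulties 1 (easy) to 20 (hard)
ability = 10                          # start every pupil at a middling estimate
answered = []

def pupil_answers_correctly(difficulty, true_ability=14):
    """Stand-in for the pupil: gets an item right if it is at or below their level."""
    return difficulty <= true_ability

for _ in range(8):                    # an eight-item adaptive test
    # choose the unanswered item whose difficulty is closest to the current estimate
    item = min((d for d in items if d not in answered), key=lambda d: abs(d - ability))
    answered.append(item)
    ability += 2 if pupil_answers_correctly(item) else -2
    print(f"item difficulty {item:>2}, new ability estimate {ability}")
```

Each pupil spends most of the test on questions pitched roughly at their level, which is exactly what a single fixed paper cannot do.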
You might ask why the problems we’ve seen with the reading test have not been as much of an issue with maths, which also moved from a two-test structure to a one-test structure in 2016. I think this may be because of differences in the structure of each test. The maths test has dozens of questions, so you can mix easy and hard questions and ramp up gently across the test. The reading test has a similar number of questions, but only three reading texts. If just one of them is harder than expected, it will have an outsize impact on the entire test. If the text is difficult, then even the easy questions on that text are not actually that easy.
Reading tests depend on background knowledge
Another issue specific to reading is that all reading tests are essentially tests of background knowledge and vocabulary. If a student takes an unseen reading comprehension test and happens to know a lot about the topic it’s on, they can often end up doing much better than if they sat a test of equivalent difficulty on a topic they knew little about.
This feature of reading tests makes it quite hard to provide sensible test prep for them. I think it can also make the test experience less pleasant for students in a way that is hard to identify in the data.
In Ofqual's review of the 2016 Dodo paper, they point out that statistically the test worked exactly as expected - the mean facility index was 51%, and the median index was 54%, which is about what you would expect of a well-designed test. And yet teachers reported that their students found the reading texts particularly difficult and hard to understand. What is happening here?
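The facility index is worth a quick aside: it is usually defined as the average mark achieved on a question as a percentage of the marks available, so for a one-mark question it is simply the proportion of pupils who got it right. With some invented figures:

```python
from statistics import mean, median

# (average mark achieved, marks available) for a handful of hypothetical questions
questions = [(0.90, 1), (0.45, 1), (1.30, 2), (0.60, 1), (2.10, 3)]

facilities = [100 * achieved / available for achieved, available in questions]
print([round(f) for f in facilities])                        # [90, 45, 65, 60, 70]
print(round(mean(facilities)), round(median(facilities)))    # 66 65
```

A mean facility in the low 50s just means that, on average, pupils picked up about half the available marks per question. It says nothing about how the experience felt to the pupils sitting the test.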
My personal experience of this, which I have written about before, is of giving two classes of year 11s a GCSE past paper which had an unseen text from an Arthur C Clarke short story. The story was set in a futuristic ice age where London was being threatened by a glacier. Not one of the students knew what a glacier was. (The examiners' report noted that this was the case nationally too.) I remember the students' heads slumping on to the desk, their looks of despair, and their admission at the end that even though they had done their best with all the questions, they had absolutely no idea what was going on in the story.
And yet this test - both in my class and at the national level - was still able to provide a distribution of scores. Some pupils got higher grades than others, even though most of them had no understanding of the central concept of the story.
This is by definition hard to measure, but I think it is possible for a reading test to perform well statistically but for students to have a bad experience with it.
One possible solution to this background knowledge issue, which has been repeatedly advocated by E.D. Hirsch, is to specify the broad content areas that the reading test will be drawn from. For example, tell schools that the reading test topic will be taken from the history, science or geography curriculum.
Theoretically, the Sats reading paper does this already. The problem is that our national curriculum is not really a national curriculum at all. There is a vast amount of content on it, too much to be taught in detail, and schools have a lot of latitude in how they interpret it. Different schools choose different content to focus on and teach it in different orders.
For this reform to work, we would need a national curriculum that is a) slimmer and b) more specific about the detail and order of the topics to be taught. This would undoubtedly be controversial, but it might also lead to some of the impressive reading score gains seen in areas that have adopted such an approach.
The impact of threshold scores
This is a long-standing bugbear of mine which regular readers will be sick to death of hearing about. The primary assessment system uses threshold scores to decide whether students are at a certain standard or not. The problem is that these threshold scores are simply lines drawn on a continuous distribution. Students do not come neatly pre-packaged into two or three categories, and there are no discrete breaks between those categories. When we pretend that students are in neat categories, it leads to distortions, which you can see in the graph below showing the problems with the primary writing thresholds.
In reality, students like John and George are working at about the same standard. Because of the distorted reporting system, students like George get intervention and support, but students like John are lumped in with Paul. If Paul moves up to Ringo’s standard, it’s assumed he’s made some enormous leap, even though he has not significantly improved at all. If John moves to Paul’s standard, it’s assumed he’s ‘stayed the same’ - even though in reality he has made an enormous improvement.
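You can see the distortion with a few lines of code. The scores below are invented to mimic the pattern in the graph, and I have used the reading test's scaled-score cut-offs rather than the writing judgements (100 for the expected standard, 110 for the 'high score' used in performance tables), but the logic is identical:

```python
def label(scaled_score):
    if scaled_score >= 110:
        return "higher standard"
    if scaled_score >= 100:
        return "expected standard"
    return "working towards"

# Hypothetical scaled scores chosen to sit just either side of the cut-offs
pupils = {"George": 99, "John": 100, "Paul": 109, "Ringo": 110}

for name, score in pupils.items():
    print(f"{name}: scaled score {score} -> {label(score)}")

# George (99) and John (100) are one point apart but get different labels.
# John (100) and Paul (109) are nine points apart but get the same label.
# Paul gaining one point 'jumps' a category; John gaining nine points does not.
```

The labels manufacture the appearance of big differences where there are none, and hide big differences where they exist.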
I think it is also possible that the demand for thresholds is distorting the design of the reading test. The Ofqual 2016 review noted that the Sats reading test was ‘designed to accommodate a higher “expected standard” threshold’, and that the test had ‘to target its cohort differently as there would no longer be a separate test for pupils with higher levels of attainment.’
There are some unavoidable situations in life where we have to draw lines on a distribution - to decide at what age you can drink alcohol or learn to drive, or whether you can study at a certain university or not - but the end of year 6 is not one of those situations. Why not abolish the labels and standards and just report the underlying scaled score? The typical reason given against this is, again, public understanding. People don’t really understand what a scaled score is, so they need some kind of easier label to make sense of it. But there are other intuitive ways of reporting assessment information which are not as distorting and misleading.
There are no solutions, only trade-offs
All of the above reforms have their drawbacks. But the current system has its problems too. At what point will they become pressing enough that we decide it’s better to make a change?