Is it possible to develop a tutor-proof test?
Or should we focus on tests worth teaching to instead?
At No More Marking, most of the assessments we provide are fairly low-stakes. However, we do have experience with high-stakes tests, and we know how challenging they are to design.
If you are using a test as a selection mechanism for a prestigious institution, you will have armies of very smart parents and well-paid tutors trying to crack the code of the test.
Over the past decade or so, a couple of phrases have cropped up to describe the way these selection tests should work. First, people argue that we should have “tutor-proof tests” that cannot be cracked by the parents and tutors. Second, they argue that we should have “tests worth teaching to”, so that if students are being prepped for the test, the prep is worthwhile.
Do these two concepts hold water? In this post, we’ll examine the idea of tutor-proof tests.
Some historical background
Historically, many famous English public schools selected pupils at age 13 using the Common Entrance exam.
Common Entrance exams are linked to a defined curriculum. The advantage of this is great coherence and clarity for students and teachers at the prep schools and public schools. The disadvantage is that it probably restricts the pool of students who can apply to the public schools.
Not all independent schools operated on this model. I went to a selective secondary school, City of London School for Girls, which used a more curriculum-neutral test consisting of a reading comprehension, writing task and maths paper. I hadn’t attended a private prep school or had a private tutor, but the test resembled a lot of what I had done at my state primary school, so I was not at a massive disadvantage compared to others. Had CLSG run Common Entrance, it’s unlikely I would have even applied, let alone got in.
However, whilst the test I sat was more curriculum-neutral than Common Entrance, it was not completely curriculum-neutral, and nor was it immune to tutoring and preparation. In the last decade or so, even this kind of maths, reading and writing assessment has been criticised for excluding talented but disadvantaged students who don’t have access to good schools and tutors.
The tutor-proof test
Is it possible to design a test so content-free that it captures something like raw potential, or the underlying ability to flourish in an academic environment? Verbal reasoning tests reward vocabulary knowledge, which can be taught. Numerical reasoning tests reward maths knowledge, which can also be taught. But what about non-verbal reasoning tests? These are the kinds of tests where you are given four shapes and then asked: which shape continues the sequence?
You can see how these tests are less tied to curriculum knowledge, and there is serious research in this area suggesting that they might therefore be useful for identifying talented but disadvantaged students. David Card is a Nobel laureate who has done research showing that a non-verbal test administered at second grade in a district in Florida “led to large increases in the fractions of economically disadvantaged and minority students placed in gifted programs.” Jonathan Wai is another researcher who has done a lot of interesting work on these types of questions, and who has also been involved with talent identification programmes.
In large-scale government-run school systems with lots of disadvantaged students, non-verbal assessments can help identify students who are able but poorly served by their schooling.
But there are big differences between low-stakes talent-identification across a government school system and high-stakes entry to prestigious selective schools. When an expensive tutor hears the phrase “tutor-proof test”, he doesn’t interpret it as a warning but as a challenge.
Practice effects
There is a huge literature on “practice effects”, which essentially shows that if you practice a specific skill, you will get better at that specific skill. If you practice touch typing every day, you will get better at it. If you practice your multiplication tables every day, you’ll get better at them. If you practice tying your shoelaces every day, you’ll get better at it.
The practice effect is one of the most robust findings in cognitive psychology, and poses an enormous challenge to the idea of the tutor-proof test.
The response of test developers to this challenge is to say that they can create enough new question types that practice on past question types won’t deliver huge gains.
That is, they’ll say that you can practice tying your shoelaces, but then the test will be on a different kind of knot, so you won’t have any advantage. From a cognitive science point of view, this is a tricky one. It is true that the practice effect holds for practice of a specific skill. It is also true that transfer to different contexts is hard, and that so-called “far transfer” is exceptionally difficult. So yes, the test developers are right to say that the more novel the question type, the less valuable the practice of old question types is.
But “less valuable” is not the same as “not valuable at all”. And whilst far transfer is extremely difficult, near transfer is more possible. Even if practice of old question types gives you quite small gains, in a high-stakes environment those small gains can be the difference between success and failure.
Also, to make this system work, you require test developers to constantly create new types of question that are as different as possible from what has gone before. This poses a number of difficult technical challenges.
First, there are obvious constraints to just how many new types of short non-verbal test questions it is possible to create. If you are running 3 test sessions a year, after ten years you will need to come up with thirty different types of question. There are limits to how many ways you can vary the essential concept of looking at a 2D shape and moving it around in some way.
Second, if you really are creating very new questions for each round of tests, then you need to run a new validation process each time. Good validation processes take time: ideally you want to wait a few years and gather information on whether the students who passed that test are thriving at their new school. But if you are constantly having to create new question types, you don’t have the time for that.
Third, even if your system works for the first few years, there is no guarantee it will keep working over time as tutors learn more about it and optimise their teaching. This is a classic Goodhart’s Law problem: when a measure becomes a target, it loses value as a measure.
We see numerous examples of this in our work and research. A really famous one is that early AI essay markers delivered pretty good levels of agreement with human markers, and seemed to have solved the problem of AI marking. However, on closer investigation it turned out that they were largely just rewarding the length of the essay. In a low-stakes environment, it is possible that this wouldn’t cause too many problems. But in a high-stakes assessment where students, teachers and parents are all striving to do as well as they can, the system will break down, because students will realise that the way to succeed is to write the same sentence a couple of hundred times.
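To make that concrete, here is a purely illustrative sketch. It is not a reconstruction of any real marking engine: the `length_only_score` function and the example essays are invented for the purpose. It simply shows how a “marker” that rewards length alone can look plausible on ordinary essays, and then hand full marks to a gamed one.

```python
# Toy illustration only: a "marker" that scores essays purely on word count.
# On ordinary essays the proxy can look reasonable, because longer responses
# tend to be the more developed ones. Once students game it, it breaks.

def length_only_score(essay: str, max_words: int = 600) -> float:
    """Score from 0-10 based only on word count - a proxy, not a measure of quality."""
    words = len(essay.split())
    return round(min(words / max_words, 1.0) * 10, 1)

# Ordinary essays (invented): the longer one happens to be the better one.
ordinary = {
    "thin two-paragraph answer": "word " * 120,
    "solid developed answer": "word " * 420,
}

# A gamed essay: one sentence repeated a couple of hundred times.
gamed = "This point is very important and shows the theme clearly. " * 200

for label, essay in ordinary.items():
    print(label, "->", length_only_score(essay))
print("gamed repetition ->", length_only_score(gamed))
# The gamed essay gets full marks despite being a single repeated sentence:
# once the measure becomes a target, it stops measuring what it was meant to.
```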
Likewise, it is possible that tutors find ways of teaching tips and tricks that help students answer the non-verbal questions, but that systematically break the link between the question and what it is supposed to be measuring.
What is the impact on students?
The extensive literature on the practice effect shows it delivers substantial gains. But there is a chance that even the substantial gains reported in the literature underestimate its effect, because most of the research is lab-based, and may not properly account for the scale and effect of real-world intensive practice in some environments. Tutoring for entrance exams is taken very seriously by a lot of very smart people, and it is big business.
Many students will be preparing for their entrance exam 18 months or 2 years in advance, and will be doing several hours of practice every week. The question is, would you rather that prep was spent on shape rotation? Or would you rather students were reading interesting books and doing maths problems?
It’s also worth remembering that the original impulse for introducing tests like this was the social justice aspect – that schools wanted to find a way of identifying talented but disadvantaged students. But once a non-verbal test becomes a target, it is going to discriminate against those students too, as you are much less likely to get any practice of those tests in a typical state school – whereas you will be taught reading, writing and maths. The worst-case outcome is that the non-verbal test is as socially exclusionary as Common Entrance, just with none of its educational benefits.
When you stop and think about it, the concept of the tutor-proof test does not really hold water. Of course you get better at something if you practice it. That is a good thing, and that is why education works! The whole point of education is to practice valuable things and get better at the valuable things. A good assessment should promote practice of the valuable things. It shouldn’t remove the valuable things and replace them with less valuable things, on the grounds that some students will get more practice of the valuable things.
Which brings us to another popular concept: we should create “tests that are worth teaching to”. Is this a better guide to assessment design? We’ll discuss that in a future post.


I'm working on a different problem. We need to find out exactly what it is that the foreign students do not know (which we expect them to) so we can give them a remedial class if they need it. (Figuring out that some of them can skip an introductory course is another benefit.) It may be that this approach has promise for your problem: keep advancing the problems until, for every student, they are clearly up against new things they haven't learned. You will need an AI for this. Then evaluate them on how well they learn the new material.
Isn't part of the problem the way tests are marked? The drive to make assessment marking less subjective has created a situation where external exams are marked not for the intelligence of the answer but for the presence of discourse markers that imply evaluation, comparison or some other form of analysis, with no marks awarded for actual perception in the response to the question being asked.
This is not a random assertion. I have been teaching English for 20 years. To gain a pass grade for a question, a student must include the discourse markers, but they can obviously fail to understand the text or the question. I have been in an AQA meeting where the audience of teachers turned on the exam board representative because they presented us with two answers. The perceptive (but flawed) one got below half marks. The one that had the discourse markers but showed no understanding passed. This is reflected across all exam boards and all the difficult questions. And this is what tutors do: they rehearse students in writing in a very particular way. They do not aid the students' understanding, which is what a test that could not be practiced for would do.
I don't disagree with practice, but the way we award marks bears very little relation to student understanding, and as a result we are destroying not just English as a subject but all humanities subjects.