Bloom's famous 2 sigma tutoring paper is incredibly misleading
Flawed research, bad policy, wasted billions
In 1984, Benjamin Bloom published a famous paper: The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring.
The paper claims that one-to-one tuition produces 2 sigma improvements when compared to traditional whole-class teaching. This is a massive deal: it means that one-to-one tuition can raise the test scores of an average student to those of a student in the top two percent.
Imagine a year group of a couple of hundred students. Imagine the average students in that year group. Imagine an intervention that could move them all to the standard of the very best students in that year group - and that would simultaneously improve the scores of all the other students by an equivalent amount too. That is what 2 sigma means.
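To put a number on that: if test scores are roughly normally distributed, a 2 sigma gain takes a student from the 50th percentile to about the 98th. Here is a minimal sketch of the arithmetic, assuming normality (the function and numbers are just for illustration):

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Probability that a standard normal variable falls below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# An average student sits at the 50th percentile (z = 0).
# A 2 sigma improvement moves them to z = 2.
percentile = normal_cdf(2.0) * 100
print(f"z = 2 is roughly the {percentile:.1f}th percentile")
# -> roughly the 97.7th percentile, i.e. about the top 2%
```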
Although the paper was not about education technology, it has had enormous influence and impact in the ed tech world. The logic runs like this: Bloom has shown that one-to-one tuition is the best form of instruction; human one-to-one tuition is impossible at scale; technology could provide one-to-one tuition for everyone and provide 2 sigma gains for everyone.
Sal Khan of Khan Academy has built a theory of AI tutoring around the paper, the Chan Zuckerberg Initiative refers to it, and World Bank researchers love it.
The only problem is that the paper cannot bear anything like the weight of these conclusions. Here is why.
(I wrote a briefer critique of the paper in my 2020 book Teachers vs Tech, which you can purchase here.)
Six major problems with Bloom’s 2 sigma tutoring claim
The study taught and assessed students on narrow domains of cartography and probability. Most education studies measure performance on much broader domains – typically literacy or numeracy. The smaller the domain, the more sensitive it is to instruction, meaning that outsize gains are more likely. As well as the tests being closely linked to the study content, they were also designed by the researchers and were not standardised.
The participating students were novices. The topics were completely new to them. This matters because beginners tend to make rapid progress at first, and that progress slows over time. Again, this affects the statistics, making dramatic outlying gains more likely and less meaningful. We have found this with our assessments of writing: you have to be careful when comparing results from the first few months of instruction with results from later on.
It was a short-term intervention with short-term metrics. Students received eleven 40-minute lessons over 3 weeks and then had a test on the content straight away. We don’t know if those gains were a) maintained or b) sustainable – eg, if you came back after a year, would the students have maintained that standard? Would they have continued to improve at the rate of 2 sigma every 11 lessons? Learning is a change in long-term memory, and this study tells you nothing about the long term.
These three deliberate design choices all make big effects more likely and less meaningful. They don’t invalidate the results, but they do severely limit the conclusions you can draw. You can’t conclude from a 3-week intervention in a small, new domain that you can turn a median student into a Rhodes scholar.
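It helps to remember what "sigma" means here: it is an effect size - the difference between the tutored and control group means, divided by the spread of scores (essentially Cohen's d). A narrow, just-taught domain shrinks the spread, and novice learners on a fresh topic enlarge the raw gain, so the same teaching registers as more sigmas. A toy illustration, with invented numbers:

```python
def effect_size(mean_tutored: float, mean_control: float, score_sd: float) -> float:
    """Effect size in 'sigmas': difference in group means divided by the spread of scores."""
    return (mean_tutored - mean_control) / score_sd

# Invented numbers, purely to show the mechanics.
# Broad standardised test (e.g. general numeracy): scores are widely spread.
print(effect_size(mean_tutored=60, mean_control=50, score_sd=20))  # 0.5 sigma

# Narrow, researcher-designed test of just-taught content: scores bunch tightly,
# so the same 10-point raw gain now looks like a dramatic effect.
print(effect_size(mean_tutored=60, mean_control=50, score_sd=5))   # 2.0 sigma
```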
I think the root cause of all three of these problems is a conflation of formative and summative assessment. Bloom’s assessments are optimised to provide short-term formative feedback, which is fine, but you cannot then use that same information to draw summative conclusions. This is something I write about at much greater length in my 2017 book Making Good Progress, and you can see schools in the UK and America making the same error. Like Bloom, they will give students tests on small, recently studied domains which all the students will ace. This is totally fine if you want to check that students have understood what you have just taught. However, they will then want to claim much more than that, and will say that high performance on this test is predictive of getting the top grade on national exams. This is not a valid inference!
There are then three further methodological problems with the Bloom paper that are worth mentioning.
Bloom didn’t actually carry out any of the studies in question. He was reporting data from two PhD students. One of those dissertations is available online - the other isn’t. My analysis is based on the one that is. Bloom’s paper includes a famous graph showing students jumping from the 50th to the 98th percentile. That graph isn’t based on the underlying data: it’s just a stylised representation of what that type of progress looks like.
The studies divided students into three groups: traditional whole-class instruction, mastery whole-class instruction and one-to-one tuition. The one-to-one groups got extra input: they were given more feedback and corrective tests than the other two groups, so the comparison confounds the effect of the tutoring itself with the effect of the extra feedback and correction.
The numbers involved in each study were very small - just a couple of hundred students in total. We have no idea whether these effects would hold if tuition were scaled up. This is a major problem with all educational interventions, particularly those which involve reducing class sizes – and one-to-one human tuition is basically just the most extreme version of reducing class sizes. The literature on reducing class sizes shows that it can be effective at a small scale, but it is hard to scale up – because to reduce class sizes at scale, you have to recruit a lot of new teachers, and often the new teachers you recruit are not as good as the existing teachers in the system. Interestingly, one of the Bloom studies ran into this exact recruitment problem. They used undergraduate students as tutors, and in two of the grades being studied they couldn’t recruit enough – so they increased the size of the tutor groups from one student to three. This suggests that at scale and in real-world contexts, the gains from reducing class sizes may not be as great as the gains from improving whole-class instruction - which is the exact opposite of the message conveyed by the paper.
A sporting diversion: can we use the standard deviation to find the best sportsperson ever?
These students really did make 2 sigma improvements. But they did it in such a narrow domain, in such an early part of their training, and over such a short period of time that it provides us with very few generalisable insights.
To see why, here’s an extended sporting analogy.
Don Bradman is widely regarded as the best cricketer ever. He has a batting average of 99.94. This is crazily exceptional, and one way of explaining to a non-cricket fan why this is such a big deal is to use the standard deviation.
Cricket batsmen average about 40 runs per innings, with a standard deviation of about 9. Bradman is therefore over 6 standard deviations better than the average batsman. This is the equivalent of meeting a man with a height of over 7 feet 6 inches. It’s insane!!
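For anyone who wants to check the arithmetic, here is a rough sketch. The batting figures are the ones above; the height figures are ballpark assumptions (a mean of about 70 inches with a standard deviation of about 3 inches):

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """How many standard deviations a value sits above the mean."""
    return (value - mean) / sd

# Bradman against a rough batting distribution (mean ~40 runs, SD ~9).
bradman_z = z_score(99.94, mean=40, sd=9)
print(f"Bradman is {bradman_z:.1f} standard deviations above the mean")  # ~6.7

# The same z-score translated into adult male height
# (ballpark assumptions: mean 70 inches, SD 3 inches).
height_inches = 70 + bradman_z * 3
print(f"Equivalent height: about {height_inches // 12:.0f} ft {height_inches % 12:.0f} in")
```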
You can use the standard deviation to measure exceptional performance in other sports, and it’s very rare to see anyone being more than 2 or 3 SD away from the mean. So does this mean Bradman is not just the greatest cricketer of all time, but the greatest sportsperson of all time - the GOAT to end all GOATs?
Maybe.
The power of the standard deviation is abstraction. What the standard deviation lets you do is take cricket runs, football goals, 100-metre sprint times, and gymnastics scores, and essentially put them all onto the same scale. It means you are no longer comparing apples with oranges, but apples with apples.
The limitation of the SD is also abstraction. It takes away a lot of the underlying domain specific detail of different sets of numbers and enables a comparison that may not really be legitimate. There is a risk that you are still comparing apples with oranges, but you’re just pretending that you’ve turned some oranges into apples.
The case against Bradman being the greatest sportsperson ever is that 1930s cricket was not as professional or as global a sport as modern football, sprinting and gymnastics. The talent pool Bradman was competing against simply wasn’t as competitive, and that has the potential to skew his stats.[1]
Basically, in order to see whether it is legitimate to compare the standard deviations of Bradman to Messi or Federer or Bolt or Biles, you need some domain-specific understanding of each sport and its historic context.
In this particular case, I think the standard deviation is useful and appropriate, but not conclusive. However, there are ways in which you can use the standard deviation which are obviously just absurdly inappropriate.
Imagine a group of 8-year-old footballers who get some extra instruction on doing keepy-uppies. One kid gets some extra one-to-one coaching from his dad. A week later, his dad devises a keepy-uppy tournament for all the kids. His son wins! He completes 20 keepy-uppies when the tournament average is 8 and the standard deviation is 2.
If you then said, “This kid is 6 standard deviations above the mean, therefore he is a better footballer than Lionel Messi”, that would obviously be absurd.
That is what I think happens with the Bloom 2 sigma study. Novice students make rapid progress on a new, small domain over a short period of time when given extra coaching and assessed with a non-standardised test. We then fall over ourselves not just to declare that the students are better than Messi – but that their coach is the next Alex Ferguson or Pep Guardiola and we should all be copying their methods.
Are outsize gains like Bloom reports really possible?
At this point it is customary to say that Bloom sets our expectations too high. I don’t think this is the case. I think education has for a long time been in a pre-scientific phase, and that if we could better align it with science, then big, 2 sigma-style gains are possible. My issue with the Bloom paper is not that it sets unrealistic expectations, but that it won’t help us achieve expectations of any kind.
Does any of this matter? Surely we know that one-to-one tuition is better than whole-class teaching?
You might say: OK, who cares? Maybe the study is slightly ropey, but we all know that one-to-one tuition is better than whole-class instruction, so is it really that misleading?
Yes. As we have seen, human one-to-one tuition is extraordinarily expensive and hard to scale. Bloom and his grad students acknowledged this, and the point of their research was to try and find whole-group methods that were as effective as one-to-one tuition.
However, by emphasising the impact of one-to-one tuition so much, the effect has been to make human one-to-one tuition seem like the gold standard to which we should all be aspiring. Post-Covid, many governments spent huge sums of money on catch-up human tuition, often implicitly or explicitly justified by Bloom’s research. The programmes ran into predictable problems of recruitment and training and had underwhelming results - nothing like 2 sigma every 3 weeks.
Similarly, the impact on ed tech has been to encourage learning platforms to mimic one-to-one tuition and to focus on personalising instruction for the individual student.
But what if this is the wrong way round? What if the gold standard of effective human pedagogy at scale is actually whole-class instruction, and ed tech platforms should take that as their starting point instead? Interestingly, a recent study from Google DeepMind embedded LLM tutors within a typical whole-class environment and showed some impressive results.
We also have better and more robust data about what works in whole-class instruction - including, in England, some much better uses of standard deviations.
What 2 sigma progress really looks like
Every secondary school in England gets a Progress 8 score, measuring how much progress students make across eight subjects from age 11 to 16. A Progress 8 score of 0 means that, on average, pupils at the school made the same amount of progress as pupils nationally with similar starting points.
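For readers outside England, here is roughly how the calculation works, as I understand the DfE methodology (the pupil numbers below are invented): each pupil's Progress 8 score is the gap between their Attainment 8 score and the national average for pupils with the same Key Stage 2 starting point, divided by ten; the school's score is the average across its pupils.

```python
def pupil_progress_8(attainment_8: float, national_avg_same_start: float) -> float:
    """Gap to the national average for pupils with the same starting point,
    divided by the ten Attainment 8 slots, so +1.0 means roughly one grade
    higher per subject than similar pupils nationally."""
    return (attainment_8 - national_avg_same_start) / 10

# Invented pupils: (their Attainment 8 score, national average for their starting point)
pupils = [(52.0, 48.0), (61.0, 58.5), (44.0, 46.0)]
school_score = sum(pupil_progress_8(a8, avg) for a8, avg in pupils) / len(pupils)
print(f"School Progress 8: {school_score:+.2f}")  # +0.15
```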
The mean is always close to 0, and most schools cluster around the mean, with over half of schools getting a score between -0.25 and +0.25.
However, there are a handful of outliers scoring above 1.5. These schools are achieving something close to a 2 sigma improvement.
Now of course, this is a school-level measure, not a pupil-level intervention like Bloom’s. But it can still give us some useful insights. And if we run back through the flaws of Bloom’s study, Progress 8 avoids them:
It measures progress on 8 big subjects – not one sub-topic!
The tests at the end are standardised and not designed by the teachers.
It measures gains over 5 years, not 3 weeks.
It includes the performance of about 3,500 schools and 600,000 students – a big sample.
Most of the schools in the sample have broadly equivalent resources.
Obviously no metric is perfect and Progress 8 has its flaws too. But it is far less flawed than Bloom’s study, and a far better guide to what 2 sigma improvement in education actually looks like.
[1] Stephen Jay Gould discusses this problem in his book Full House: The Spread of Excellence from Plato to Darwin. He argues that the greater the standard deviation in a sports league, the lower its quality, and the greater the chance of exceptional players registering exceptional scores. In a higher-quality league, we will see narrower standard deviations and it will be harder for exceptional players to register exceptional scores.


Interesting take. I don’t think Progress 8 attainment data is at all comparable - it doesn’t measure an intervention. There are many other studies besides Bloom supporting high-dosage, high-quality tutoring, and others still supporting the kind of spaced repetition you can get with high-dosage tutoring.
The issue with the UK post-Covid tutoring investment is that it lacked the high-quality component - in fact the quality was so low it was detrimental. I interviewed a couple of the key players - it was outsourced to low-quality, call-centre-type operations in India. It was scammy. So I wouldn’t use that as evidence against Bloom. Read Koedinger’s 'An Astonishing Regularity in Student Learning Rate'. It supports both high-dosage tutoring and high-quality classroom instruction, backed by data. The same goes for a recent large-scale Harvard study on the science of reading in classrooms - not being upheld despite training. There are so, so many wrongs stacking up here, inside and outside the classroom.
The point is that the focussed, good stuff moves the needle, irrespective of subject. And that’s expensive. So how do we manufacture it at reduced cost?
I've always been astonished by how uncritical people are about Bloom's claim. It never seems to cross anyone's mind that it might not be entirely true.
The claim is clearly false, at least in the simplistic manner in which it's usually interpreted. Can *every* student experience 2-sigma growth, in *every* subject? Surely not. There are lots of students who will benefit enormously from 1-on-1 tutoring, but there are also students who won't show such progress.
Grim observation: the students who need tutoring and extra assistance the most are generally not able to take advantage of it, because they are so far behind. If you are working with a "B" student, that person is already understanding most of the material, and just needs a little extra help to get up to an "A" level. But if a student is getting "D" and "F" grades, then they are really struggling with the material and a few tutoring sessions probably aren't going to make much difference.