Critique of Paper: An astonishing regularity in student learning rate

by Justin Skycak

It rests on a critical assumption that the amount of learning that occurs during initial instruction is zero or otherwise negligible, which is not true.

The paper An astonishing regularity in student learning rate has been making the rounds in the news lately.

I’ve been asked about this paper multiple times. For most people, the result is counterintuitive and doesn’t pass the sniff test – but they’re not able to pinpoint a specific issue with the setup, data, or interpretation of the experiment. The purpose of this post is to suggest a specific issue.

The Critique

The paper finds that students have wildly different baseline knowledge after initial instruction. Immediately after being shown how to do a problem, some students are almost at mastery right away and only need a couple practice problems. Other students need many more practice problems. However, during those practice problems, students’ knowledge increases at more similar rates (*). The paper interprets this to mean that students come in with wildly different prior knowledge, but learn at similar rates.

I don’t think that’s the right interpretation. It rests on a critical assumption that the amount of learning that occurs during initial instruction is zero or otherwise negligible. But that assumption just doesn’t make any sense to me, having worked with lots of kids across different ability levels. Even if they have exactly the same prior knowledge, some kids just get it right away after you demonstrate just one instance of a new skill, whereas other kids need lots of different examples and plenty of practice with feedback before they really start to grok it.

(*) Note that if you measure in raw percents, as the paper does, the 75th percentile learners are found to increase their knowledge about 1.5x as fast as 25th percentile learners per problem. If you measure performance in log-odds, which is a more appropriate metric that accounts for the fact that it's harder to increase performance when one's performance is high to begin with, the multiplier rises from 1.5x to 2x. It's debatable whether 2x is really a "similar" learning rate. Personally, I think it is not -- not only does "learns twice as fast" feel like a substantial difference, but it is also only comparing the 25th and 75th percentiles, and even the 75th percentile is far lower than the kind of person we have in mind when we think of somebody who is shockingly good at math. For instance, math majors at elite universities tend to be well above the 99th percentile in math. However, this is not the focus of my critique. In this critique, I wish to highlight a more subtle methodological issue and demonstrate that even if the performance improvement per practice opportunity came out to be exactly the same for all students, this would still not be enough to conclude that all students learn at the same rate.

Concrete Illustration

Here’s a concrete illustration using numbers pulled directly from the paper (the 25th and 75th percentile students in Table 2). Suppose you’re teaching two students how to solve a type of math problem.

  • Student A gets it pretty much immediately and starts off at a performance level of 75% (i.e. their initial knowledge level is such that they have a 75% chance of getting a random question right). After 3 or 4 practice questions, their performance level is 80%.
  • Student B kind-of, sort-of gets it and starts off at a performance level of 55%. After 13 practice questions, their performance level reaches 80%.

This clearly illustrates a difference in learning rates, right? Student A needed 3 or 4 questions. Student B needed 13. Student A learns faster, student B learns slower.

Well, in the study, the operational definition of “learning rate” is, to quote, “log-odds increase in performance per opportunity . . . to reach mastery after noninteractive verbal instruction (i.e., text or lecture).” Opportunities mean practice questions. Log-odds just means you take the performance $P$ and plug it into the formula $\ln \left( \frac{P}{1-P} \right).$

  • Student A's log-odds performance goes from $\ln \left( \frac{0.75}{1-0.75} \right) = 1.10$ to $\ln \left( \frac{0.8}{1-0.8} \right) = 1.39.$ That's an increase of 0.29, over the course of 3 to 4 opportunities (let's say 3.5), for a learning rate of 0.08.
  • Student B's log-odds performance goes from $\ln \left( \frac{0.55}{1-0.55} \right) = 0.20$ to $\ln \left( \frac{0.8}{1-0.8} \right) = 1.39.$ That's an increase of 1.19, over the course of 13 opportunities, for a learning rate of 0.09.

So… according to this definition of learning rate, students A and B learn at roughly the same rate, about 0.1 log odds per practice opportunity.
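To make the arithmetic easy to check, here is a minimal Python sketch of the same calculation. The function names `log_odds` and `learning_rate` are my own illustrative choices, not anything from the paper, and the 3.5 opportunities for student A is the same midpoint assumption used above.

```python
import math

def log_odds(p: float) -> float:
    """Log-odds of a performance level p, where 0 < p < 1."""
    return math.log(p / (1 - p))

def learning_rate(p_start: float, p_end: float, opportunities: float) -> float:
    """Log-odds increase in performance per practice opportunity."""
    return (log_odds(p_end) - log_odds(p_start)) / opportunities

# Student A: 75% -> 80% over 3 to 4 opportunities (assumed midpoint 3.5).
rate_a = learning_rate(0.75, 0.80, 3.5)

# Student B: 55% -> 80% over 13 opportunities.
rate_b = learning_rate(0.55, 0.80, 13)

print(f"Student A: {rate_a:.2f} log-odds per opportunity")
print(f"Student B: {rate_b:.2f} log-odds per opportunity")
```

The two rates round to 0.08 and 0.09, even though student B needed roughly four times as many practice questions to reach the same 80% level.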

Critique Stated More Precisely

Now that we’ve worked through an example with the metrics involved in the paper, I can phrase my critique more precisely. For simplicity, I’ll use “y-intercept” to refer to the student’s level of performance immediately after initial instruction and “slope” to refer to a student’s log-odds performance increase per opportunity.

While I was surprised to read that the 75th percentile students don’t have that much greater slope than the 25th percentile students, it seems at least within the realm of possibility to me, and that’s not what I’m arguing against.

My critique is that I don’t think the y-intercept only measures differences in background knowledge. In my experience teaching and tutoring, I have noticed a second component that is independent from the student’s level of background knowledge, and this second component becomes increasingly relevant as you get up into higher levels of math.

I would loosely describe this second component as some kind of generalization ability. In my experience, individual differences in generalization ability create a phenomenon in which math gets really hard for different people at different levels.

While I would agree that most people have enough generalization ability to get through algebra and geometry with a reasonable amount of productive practice, I’ve noticed lots of students run into issues learning higher math (especially calculus and beyond) as the level of technicality and abstraction increases, even in a mastery-based learning environment where they engage in numerous practice opportunities with immediate explanatory feedback.

For this reason I don’t agree that the study results naturally extrapolate as suggested:

  • "Some readers may object that near constant student learning rate unrealistically implies that everyone can master advanced level calculus or interpret abstract data. Indeed, not everyone has favorable learning conditions nor will everyone choose to engage in the substantial number of practice opportunities required. However, our results suggest that if a learner has access to favorable learning conditions and engages in the many needed opportunities, they will master advanced level calculus."


Okay, now it’s time for me to back up my claims about this so-called “generalization ability.” I’ll provide some case studies from personal experience, in which students with the same background knowledge learned new material at very different rates due to differences in generalization ability.

Case Study 1

I spent several years teaching in a radically accelerated math sequence that used a fully individualized, mastery-based learning system. Kids came into the program in 6th grade with knowledge of arithmetic. Granted, at this point, they all had different background knowledge and were starting at different places in the curriculum (typically between 20% and 60% of the way through pre-algebra). But each time they were given a topic, they had mastered all the prerequisites leading up to that topic.

Students’ knowledge profiles were initially estimated through a diagnostic, so there was of course a bit of uncertainty in their knowledge profile when they first started on the system. But after a couple years of working on the system, their knowledge profiles were grounded very soundly in actual work they had completed on the system (not just an “estimate” like the initial diagnostic). The kids were not doing extra math outside of school – they did not do math “for fun” and they would even stop doing their homework if I didn’t stay on top of them.

So we’re talking about 8th graders who are taking calculus, who have completed pre-algebra / algebra / geometry / algebra 2 / precalculus using the same curriculum, and who have not been learning extra math outside of school. They are each moving through individualized learning paths so that whenever they are asked to learn something new, they have evidenced knowledge of all the prerequisites. So it seems more than reasonable to say that they have the same background knowledge.

When these kids hit calculus, everybody finds derivative computations (e.g. the power rule) to be fairly straightforward, but there’s a particular topic that throws some students for a loop: the idea that the derivative is the slope of the tangent line to a curve. Different kids pick up on this topic at wildly different rates. Some kids just get it right away because the difference quotient is an instance of the slope formula. Some kids don’t really make the connection until they see an animation of the secant line turning into the tangent line. Other kids don’t get it even after you show that animation to them, and they continually forget that “slope of tangent line” means the same thing as “derivative” – even though they can tell you what the slope of a line is, compute the slope given two points on a curve, tell you what a limit is, evaluate limits, etc.

The same thing happens in tests of convergence, in particular, the limit comparison test. If you take a really strong student who has learned all the prerequisites, and you show them the series $\sum \frac{1}{-1+n+n^2}$ and ask them to guess whether it converges by thinking of a similar series, they might correctly guess that it converges because it’s similar to the convergent series $\sum \frac{1}{n^2}.$ But if you do the same with a different student who has also learned all the prerequisites, they might incorrectly guess that it diverges because it’s similar to the divergent series $\sum \frac{1}{n}.$ Even though they’ve never encountered this material before, the really strong student intuitively knows what we mean by a “similar series” in the context of this question, while the other student needs many more practice problems to develop that understanding.
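For reference, here is the limit comparison computation that backs up the strong student's guess (a standard calculation, written out in my own notation). Comparing against $\frac{1}{n^2}$:

$$\lim_{n \to \infty} \frac{\frac{1}{-1+n+n^2}}{\frac{1}{n^2}} = \lim_{n \to \infty} \frac{n^2}{n^2+n-1} = \lim_{n \to \infty} \frac{1}{1 + \frac{1}{n} - \frac{1}{n^2}} = 1.$$

The limit is finite and positive, and both series have positive terms for $n \geq 1,$ so the two series behave the same way. Since $\sum \frac{1}{n^2}$ converges, so does $\sum \frac{1}{-1+n+n^2}.$ Comparing against $\frac{1}{n}$ instead gives

$$\lim_{n \to \infty} \frac{\frac{1}{-1+n+n^2}}{\frac{1}{n}} = \lim_{n \to \infty} \frac{n}{n^2+n-1} = 0,$$

which is inconclusive when the comparison series diverges. In that sense, $\sum \frac{1}{n}$ is not the "similar series" the question is looking for.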

Case Study 2

To offer another case study with some more concrete numbers: in an extreme case, while teaching these radically accelerated courses, I was also mentoring another 7th grade student (let’s call him M for “mentoring”) who was learning calculus despite using far fewer practice opportunities. M was incredibly gifted at math and I had worked with him for 3 years beforehand, but not through any sort of structured curriculum, just chatting about math for an hour each weekend. He wasn’t taking any special math courses at school, just his grade-level courses. A couple times I tried to get him doing some more structured work out of some textbooks (I felt like he could be making a lot faster progress that way), but he and his parents weren’t really interested in it, and I didn’t want to push too hard.

Midway through 7th grade, M and his parents started thinking more about his future and came to agree that it would be a good idea for him to knock out the AP Calculus BC exam in 8th grade so that he could make a convincing case for enrolling in university courses the following year. But at the same time, he wasn’t going to take a separate calculus course, and he didn’t want to do a whole lot more work outside of our weekly discussions. The totality of M’s calculus instruction before the exam was limited to an hour-long chat each weekend and 5-10 homework problems per week for about 12 months, about 400 problems total, plus 3 or 4 practice exams.

Meanwhile, students that I taught in the radically accelerated school program did about 300 lessons (each with about 10 questions) and 300 reviews (each with about 3 questions) for a total of about 4000 questions, plus 6 practice exams. Not only did they solve an order of magnitude more problems than M, they also had more one-on-one time with me (there were only 5 students in the class), they were doing this every single school day for an hour, and they were also working from a far more scaffolded and comprehensive curriculum (whereas for M, I had to slim down the curriculum to the bare essentials, otherwise we wouldn’t get through it all). Yet, M thought the AP exam was pretty easy, came out of the exam fairly confident that he got a 5 out of 5, and indeed he did – whereas in my class, the average score was 3.6 out of 5 (two 5’s, a 4, two 2’s), and even the students who ended up getting 5’s did not come out of the exam confident that they got a 5.

Based on my experiences interacting with M and my classroom students, if I had to pick one defining trait that separated them, it would be the generalization ability I described earlier. On any topic, M would require very minimal explanation and he would naturally fill in most of the details. Most students will only absorb a fraction of information presented during initial instruction and will fill in the rest of their understanding as they solve problems that force them to grapple with things they hadn’t absorbed or hadn’t generalized. But M would typically absorb way more information from the initial explanation and generalize it much further. M would also retain it much longer after the initial practice – for instance, we could cover a new topic one week and then he’d be able to recall most of it a week or two later, whereas many students in my class would forget most of a new topic within a couple days of learning it if they did not receive additional practice. That said, M is not immune to forgetting, and it’s not like he’s “locking things into place” indefinitely in his brain. It’s just that his rate of forgetting is much slower.

I’ve worked with plenty of students who are well above average mathematically but not nearly to the extent of M. They are much slower to absorb new information, and even after they are able to consistently solve problems correctly, they will forget it almost entirely within a week or two. Imagine you’re writing code to develop some application, but you’re using some buggy version control where each day, 10% of your code is deleted. That’s how it feels working with these other students. It’s like writing in disappearing ink. On the other hand, for M, it’s like his code gets implemented in a more robust way, and less than 1% gets deleted each day.
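To put rough numbers on the version-control metaphor, here is a small Python sketch. It is only an illustration of the metaphor, not a model of real forgetting: it assumes the deletion compounds daily on whatever code remains, that nothing is rewritten in the meantime, and it uses exactly 1% for the slow case as an upper bound on "less than 1%." The function name `fraction_remaining` is hypothetical.

```python
def fraction_remaining(daily_loss: float, days: int) -> float:
    """Fraction of the original code left after `days` days,
    assuming a fixed fraction `daily_loss` of what remains is deleted each day."""
    return (1 - daily_loss) ** days

for days in (7, 14, 30):
    typical = fraction_remaining(0.10, days)   # 10% deleted per day
    m_like = fraction_remaining(0.01, days)    # 1% deleted per day (upper bound)
    print(f"after {days:2d} days: {typical:.0%} vs {m_like:.0%} remaining")
```

Under these assumptions, after one week roughly 48% of the code survives at a 10% daily loss, versus roughly 93% at a 1% daily loss, and the gap keeps widening from there.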

More About Exceptionally Fast Learners

M and other exceptionally fast learners are very rare, but they do exist, and as you narrow down to world-class mathematicians, physicists, programmers, etc, they become a lot more common. It’s like how world-class basketball talent is very rare, but simultaneously common enough that the vast majority of people could not hope to become NBA players, or even minor league players, even if they engaged in optimal training for years.

Although rare, these exceptionally fast learners illustrate just how incorrect it is to assume that differences in problem-solving ability after initial instruction (or even before initial instruction) indicate differences in prior knowledge.

In my experience, they can sometimes pick up on things so quickly that if you just show them a question that is “beyond, but not too far beyond” for them, then they will figure it out on the fly without even seeing a demonstration.

  • For instance, I've worked with kids who, having learned only arithmetic but no algebra, were able to figure out how to solve the following question on the fly: "if $x$ represents a number, and $2 \ast x+3$ equals $11,$ then can you tell me what $x$ is?"
  • And I've also worked with students who have no issues with arithmetic but struggle to solve that kind of question even after seeing a demonstration of how to solve it and going through a couple practice questions with feedback.

As a specific example, back when I was chatting with M about algebra, pretty quickly I realized that the quickest way to get him through the content was to just ask him questions like that (“if $x$ represents a number, and $2x+3=11,$ what is $x$”) without even demonstrating anything.

  • Most of the time he would completely figure it out on the fly rather quickly.
  • Sometimes I'd have to draw his attention to edge cases: "okay, you told me that the solution to $x^2=9$ is $x=3,$ but is there any other number that you can square and get the same result?"
  • Other times I'd have to give him a tip to get him started: "okay, you need some help getting started with solving $x^2 + 2x = 9,$ see if you can add something to both sides of the equation that lets you write the left-hand side as a perfect square." (M had not previously learned the trick for completing the square.)
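For readers who want to see where that last tip leads, here is the completing-the-square computation written out (my own worked version, not a transcript of what M wrote). Adding $1$ to both sides makes the left-hand side a perfect square:

$$x^2 + 2x + 1 = 9 + 1 \quad \Longrightarrow \quad (x+1)^2 = 10 \quad \Longrightarrow \quad x = -1 \pm \sqrt{10}.$$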

With these kinds of students, you can often tell them facts and they immediately understand why the facts are true. For instance, if you tell them that $y = a(x - h)^2 + k$ is the general formula for a parabola with vertex at $(h,k),$ they immediately intuitively understand why that is: the graph turns when the $(x-h)^2$ part hits zero. (Almost all students need to be explicitly taught this intuition, and for most students, it does not fully “click” even after being taught it – they “kind of, sort of” get it, and while they can use the formula to solve problems, they have to be exposed to the intuition periodically into the future across a wider variety of settings/examples before the formula makes perfect intuitive sense to them.)

You can sometimes even teach formulas by having the students derive the formulas themselves. For example:

  • Student: I don't like the quadratic equations where the answer comes out with messy numbers...
  • Teacher: Why don't you just solve the general equation $ax^2 + bx + c = 0$ to get a general formula for the solution? Then you can just plug the numbers into that formula when you know the manipulations would otherwise get messy.
  • Student: Good idea! ... (5 minutes later) Found it: $-\frac{b}{2a} \pm \sqrt{ \left( \frac{b}{2a} \right)^2 - \frac{c}{a}}$
  • Teacher: Great. By the way, that thing is called the quadratic formula. It's usually written in the form $\frac{-b \pm \sqrt{b^2-4ac}}{2a}.$ See if you can rearrange your expression into that.
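For completeness, here is one way the derivation and the rearrangement can go (a standard completing-the-square argument, assuming $a \neq 0$). Dividing $ax^2 + bx + c = 0$ by $a$ and completing the square gives

$$x^2 + \frac{b}{a}x = -\frac{c}{a} \quad \Longrightarrow \quad \left( x + \frac{b}{2a} \right)^2 = \left( \frac{b}{2a} \right)^2 - \frac{c}{a} \quad \Longrightarrow \quad x = -\frac{b}{2a} \pm \sqrt{ \left( \frac{b}{2a} \right)^2 - \frac{c}{a} }.$$

Putting the terms under the square root over a common denominator then yields the usual form:

$$x = -\frac{b}{2a} \pm \sqrt{ \frac{b^2 - 4ac}{4a^2} } = \frac{-b \pm \sqrt{b^2-4ac}}{2a}.$$

(The $\pm$ absorbs the sign of $a$ when $\sqrt{4a^2} = 2|a|$ is pulled out of the root.)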

Again, these kinds of students are very rare, but they illustrate just how incorrect it is to assume that differences in problem-solving ability after initial instruction (or even before initial instruction) indicate differences in knowledge that the student was previously taught.

Why Care About Exceptionally Fast Learners if They're So Rare?

I realize that I’m making a big fuss about a very small segment of the population – but I think it’s warranted because if a student wants to learn a subject to an exceptionally high level, enough to build a career around it and achieve a high level of success in their field, then these are the types of other people they’re going to have to compete against. Just like professional sports – at a high enough level, everybody who plays is a member of what was (at lower levels) a very small segment of the population.

Think about it this way. Suppose we ignore the really exceptionally fast learners because they’re rare and effectively invisible in aggregate statistics. For instance, the vast majority of people do not pick up on basketball exceptionally quickly, so suppose we ignore those people who do. We’ll make the (incorrect) assumption that, because they’re a minuscule segment of the population, they will have negligible impact on any sort of conclusion we make about talent development in basketball. Consequently, when we watch a professional basketball game on TV, we’ll come to the conclusion that anyone can become a professional basketball player if they just put in enough work. Do you see the logical flaw that’s happening here?

It may help to hear the famed Douglas Hofstadter (2012) recount the time when he realized that he did not have enough of that so-called “generalization ability” to stand out as a professional mathematician:

  • "I am a 'mathematical person', that's for sure, having grown up profoundly in love with math and having thought about things mathematical for essentially all of my life (all the way up to today), but in my early twenties there came a point where I suddenly realized that I simply was incapable of thinking clearly at a sufficiently abstract level to be able to make major contributions to contemporary mathematics.
    I had never suspected for an instant that there was such a thing as an 'abstraction ceiling' in my head. I always took it for granted that my ability to absorb abstract ideas in math would continue to increase as I acquired more knowledge and more experience with math, just as it had in high school and in college.
    I found out a couple of years later, when I was in math graduate school, that I simply was not able to absorb ideas that were crucial for becoming a high-quality professional mathematician. Or rather, if I was able to absorb them, it was only at a snail's pace, and even then, my understanding was always blurry and vague, and I constantly had to go back and review and refresh my feeble understandings. Things at that rarefied level of abstraction ... simply didn't stick in my head in the same way that the more concrete topics in undergraduate math had ... It was like being very high on a mountain where the atmosphere grows so thin that one suddenly is having trouble breathing and even walking any longer.
    To put it in terms of another down-home analogy, I was like a kid who is a big baseball star in high school and who is consequently convinced beyond a shadow of a doubt that they are destined to go on and become a huge major-league star, but who, a few years down the pike, winds up instead being merely a reasonably good player on some minor league team in some random podunk town, and never even gets to play one single game in the majors. ... Sure, they have oodles of baseball talent compared to most other people -- there's no doubt that they are highly gifted in baseball, maybe 1 in 1000 or even 1 in 10000 -- but their gifts are still way, way below those of even an average major leaguer, not to mention major-league superstars!
    On the other hand, I think that most people are probably capable of understanding such things as addition and multiplication of fractions, how to solve linear and quadratic equations, some Euclidean geometry, and maybe a tiny bit about functions and some inklings of what calculus is about."

A Potential Way to Reconcile the Conclusions of the Paper: Tightening the Definition of "Favorable Learning Conditions"

There is one setting in which the conclusions of the paper might make sense to me. It involves tightening the definition of “favorable learning conditions” to the point that it becomes more theoretical than practical, and it doesn’t imply that students actually learn at similar absolute rates, but here it is.

The paper limits its conclusions to the context of “favorable learning conditions,” which it describes as follows:

  • "a) provide immediate feedback on errors in problem solving or performance contexts (21, 22),
  • b) provide explanatory context-specific instruction on demand (e.g., ref. 23), including an example correct response if needed (24–26),
  • c) highly encourage or enforce students to enter or observe a correct response before moving on,
  • d) provide tailored tasks designed through data-based cognitive task analysis to practice specific cognitive competences aligned with course goals for improving student thinking (e.g., refs. 27 and 28), and
  • e) give repeated opportunities to ensure student mastery of these cognitive competences (e.g., ref. 29) in varied tasks that require appropriate generalized, but not overgeneralized, knowledge and skill acquisition (e.g., ref. 30)."

I wonder if the definition of “favorable learning conditions” also needs to specify (in some more precise way) that the curriculum is sufficiently granular relative to most students’ comfortable “bite sizes” for learning new information, and includes sufficient review relative to their forgetting rates.

Under that definition, it would make more intuitive sense to me that (barring hard cognitive limits) such favorable learning conditions could to some extent factor out cognitive differences, causing learning rates to appear surprisingly regular. A metaphor: “students eat meals of information at similar bite rates when each spoonful fed to them is sufficiently small.”

Though, ceiling effects may confound the results when the curriculum is too granular or provides too much review relative to the learner’s needs – so perhaps the definition would need to be amended once more to specify that the curriculum’s granularity is equal to the student’s bite size and the rate of review is equal to the student’s rate of forgetting. The amended metaphor: “students eat meals of information at similar bite rates when each spoonful fed to them is sized appropriately relative to the size of their mouth.” (Note that equal bite rates do not imply equal rates of food volume intake.)

This definition of “favorable learning conditions” would also allow for anecdotes / case studies of math becoming hard for different students at different levels, because the following factors affect students differentially as they move up the levels of math:

  • Combinatorial explosion in the problem space -- lowers the "bite size" more for students with lower generalization ability, or, equivalently, reduces the perceived granularity of the curriculum. (Side note: I've always suspected combinatorial explosion was the reason why I encountered so many high schoolers who did well in math but struggled in physics.)
  • Large body of knowledge to maintain -- increases the amount of review more for students with higher forgetting rates. Also reduces effective "bite size" since an increasing portion of each bite will consist of reviewing fuzzy prerequisite material.

It would even allow for the concept of soft and hard ceilings on the highest level of math that one can reach:

  • Say we have a student with low generalization ability and high forgetting rate. Then a favorable curriculum takes more time to work through (as compared to a favorable curriculum for an average student) due to increased granularity and review, and that multiplier increases as they go up the levels of math.
  • At some point "it requires lots of practice to learn" becomes synonymous with "can't learn" -- first in a soft sense of "the benefits of engaging in this much practice do not outweigh the opportunity costs of neglecting to develop my skills in other domains that I find easier," and then in a hard sense of "the amount of practice required exceeds the sum of waking hours over the remainder of my life."

Defense Against Misinterpretation

I am NOT saying that anyone’s level of knowledge is set in stone.

I want to see every single student grow and learn as much as they can. But in order to support every student and maximize their learning, it’s necessary to provide some students with more practice than others. If a student is catching on slowly, and you don’t give them enough practice and instead move them on to the next thing before they are able to do the current thing, then you’ll soon push them so far out of their depth that they’ll just be struggling all the time and not actually learning anything, thereby stunting their growth.

Likewise, if a student picks up on something really quickly and you make them practice it for way longer than they need to instead of allowing them to move onward to more advanced material, that’s also stunting their growth.

I’m 100% in the camp of maximizing each individual student’s growth on each individual skill that they’re learning, giving them enough practice to achieve mastery and allowing them to move on immediately after mastery.

I am NOT saying that background knowledge is unimportant.

In hierarchical subjects like mathematics, background knowledge is one of the largest, if not the largest, determinants regarding whether a student will succeed in learning new material. And that’s obvious: how can a student learn something new if they do not know the prerequisites?

All I’m saying is that background knowledge is not the sole determinant. In other words, the reason why professional athletes, musicians, mathematicians, etc. are so good at their skill is typically not just that they started training earlier.

In most complex skill domains, there are typically other factors (cognitive, physical, dispositional, etc.) that come into play. Sometimes these other factors can be improved through extra training, but other times they can’t (e.g., height of basketball players). Even factors that can be improved often have soft limits to the range of improvement that can be accomplished in a reasonable amount of training time.

Of course, extra practice is a big advantage that can, to some extent, make up for a lower rate of skill acquisition. It’s often true that “hard work beats talent when talent doesn’t work hard.” And as a corollary, it’s often true that unreasonably hard work catches up to talent even when talent works reasonably hard.

But the catch is that you have to be working way harder than the people you’re trying to catch up to, and if your rate of skill acquisition is low then even the theoretical maximum possible amount of work you could put forth in your lifetime might not be enough to catch you up and make you competitive.

The good news is that in the early stages of these talent domains, the skills can be learned by virtually everybody. Virtually everybody can learn counting and basic arithmetic; virtually everybody can learn how to dribble a basketball and shoot a free throw; virtually everybody can learn how to play Hot Cross Buns or Ode to Joy on an instrument.

And more good news is that the vast majority of people (not all, but most) can learn way more than the basics. Most people can learn algebra and some basic calculus; most people can learn to dribble between the legs and sink three-pointers; most people can learn to play numerous pop songs on an instrument.

But the thing is – even though these seem like advanced skills to the general population, they’re not anywhere close to the skills that you need to become a successful professional in any of these domains, much less a world-class standout professional.

I am NOT saying that students can learn a lot without solving problems.

I was asked the following question about this critique:

  • "We do not know for sure how much students learn from readings and videos but 'doer effect' studies indicate it is up to 6 times less than from practice questions. If that's true, then how is it possible for some students to learn a lot from text or lecture before the y-intercept, and wouldn't we expect a higher average correlation between text or video experiences and learning outcomes?"

Yes, I totally agree that actively solving practice questions is where the vast majority of the learning typically happens. I don’t mean to suggest otherwise.

When I think of the students I’ve worked with who I would characterize as having high generalization ability, they’re a minuscule segment of the total population, so I would not expect them to influence any aggregate metrics. I also don’t think that high y-intercept can be used as a proxy for high generalization ability. I would expect students with high generalization ability to have high y-intercept, but not the converse, because a high y-intercept will also be produced by a student having previously been taught the material.

Additionally, when I claim that there exist students who can learn a lot from text or lecture, I don’t mean that they are learning from passively reading the text or watching the video. What I mean is this: when I’ve seen these students read a text or watch a video, they tend to actively relate it to concrete examples and prior knowledge. It’s like they self-construct their own active learning experience.

  • If they view a definition, they immediately try to think of some objects that fit the definition.
  • If they view a theorem, they immediately try to think of the simplest non-trivial case in which the theorem applies, and why the theorem might make sense intuitively.
  • If they view a formula, they pay close attention to where that formula came from (i.e., how it was derived) and they "play" with the formula a bit in their head to get an intuitive grasp of the properties it encodes.
  • If they view a procedure, they think critically about the "why" behind each step and might even attempt to come up with a shortcut. (Often the shortcut turns out not to work in all cases, but they realize this quickly and the experience deepens their understanding.)

It’s totally different from typical students, who don’t think much beyond the words on the page, even if their teacher tries to get them to engage with it. Some part of that is likely due to interest/motivation, but I’d be surprised if cognitive differences (e.g. working memory capacity) didn’t play a role too.

Because these students take such an active role in constructing their own active learning experience, they often end up extrapolating large-scale implications well beyond the scope of what they are expected to learn. As a result, I suppose it might be technically true that these students come in with more prior knowledge, but not because it was actually taught to them, and not in a way that provides evidence for the last sentence in the abstract of the paper:

  • "[these results] suggest that educational achievement gaps come from differences in learning opportunities and that better access to such opportunities can help close those gaps."

To be clear: I would agree that often, some portion of the difference in educational achievement across students comes from differences in access to learning opportunities. But I would not agree that all students would achieve the same if they all had the same access to such opportunities.

Additionally, as educational technology increases the degree to which individualized instruction is universally available, I would expect differences in achievement to shrink or grow depending on how achievement is measured. For instance:

  • If achievement is measured as the likelihood of passing, say, Algebra 2 by the end of high school, then I would expect differences in achievement to shrink.
  • If achievement is measured as the highest level of math learned, then I would expect differences in achievement to grow. (Why? Because aside from access to learning opportunities, whatever factors predispose one student to learning more math than another student will continue predisposing them to learning more math.)
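
To make the two bullets above concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made purely for illustration: the multiplicative model (level reached = learning rate × opportunity), the names `Student`, `highest_level`, `passes`, and `gaps`, the threshold `ALGEBRA_2_LEVEL`, and all the numbers. It also assumes that individualized instruction raises the opportunity available to every student, not only to those who previously had less. It is not fitted to any data and is not the model used in the paper.

```python
from dataclasses import dataclass

# Hypothetical threshold (arbitrary units) standing in for "passed Algebra 2".
ALGEBRA_2_LEVEL = 8.0


@dataclass
class Student:
    name: str
    rate: float         # assumed: levels gained per unit of learning opportunity
    opportunity: float  # assumed: units of learning opportunity accessed


def highest_level(student: Student) -> float:
    """Toy model: highest level reached = learning rate x opportunity."""
    return student.rate * student.opportunity


def passes(student: Student) -> bool:
    """A capped measure: did the student reach the threshold or not?"""
    return highest_level(student) >= ALGEBRA_2_LEVEL


def gaps(a: Student, b: Student) -> tuple[int, float]:
    """Return (gap in pass/fail outcome, gap in highest level reached)."""
    pass_gap = abs(int(passes(a)) - int(passes(b)))
    level_gap = abs(highest_level(a) - highest_level(b))
    return pass_gap, level_gap


# Before: unequal access to learning opportunities.
before = (Student("faster learner", 2.0, 10.0), Student("slower learner", 1.0, 5.0))
# After: the same, larger amount of opportunity is available to both students.
after = (Student("faster learner", 2.0, 20.0), Student("slower learner", 1.0, 20.0))

pass_gap_before, level_gap_before = gaps(*before)
pass_gap_after, level_gap_after = gaps(*after)

print("pass/fail gap:     ", pass_gap_before, "->", pass_gap_after)
print("highest-level gap: ", level_gap_before, "->", level_gap_after)
```

Under these made-up numbers, the pass/fail gap closes while the gap in highest level reached widens. The direction of the second result depends on the assumption that opportunity increases for everyone; if access were merely equalized at the faster learner's original amount, this particular toy model would show the highest-level gap narrowing, though not closing.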

Literature Search: Speed of Learning

Q&A #1

I received the following question about this critique:

  • "Maybe there is other [published empirical] evidence out there of the 'just get it right away' -- I suspect, though, that many of those situations is more 'had already had some exposure' -- but I would much love to see any evidence otherwise if you know of any?"

To be clear: I would agree that many instances of “just get it right away” are “already had some exposure.” My claim is just that there are also instances to the contrary, where the student gets it right away but has not been previously introduced to it by an external entity (though maybe it’s possible they’ve thought about something similar internally when organizing information within their own mind).

I haven’t really dug into the literature surrounding the existence or non-existence of fast learners – having worked longitudinally with many students who I would call fast learners (not just “previously-exposed” learners) and many who I would not, I never realized there was even a debate about whether fast learners exist. Though now that I’m aware of this debate, I’ll put it on my reading list.

In the meantime, I’ve previously read about individual differences in working memory capacity (WMC) impacting speed of learning, so I can point to a couple references there. I realize “variation in speed of learning” does not necessarily imply the existence of “just get it right away,” but I think the heart of the question here is less about whether there is evidence some students learn fast in absolute terms, and more about whether there is evidence that some students learn fast relative to other students, due to some factor other than prior exposure to the task being learned.

McDaniel et al. (2014) summarize that multiple studies have linked individual differences in speed of learning and WMC:

  • "...[A]cross several types of categorization tasks, Craig and Lewandowsky (2012) and Lewandowsky (2011) reported significant correlations between speed of learning and working memory capacity. In the present study, we found a similar general association between speed of learning in the function task and working memory capacity as indexed by Ospan alone.
    Learning the function rule presumably requires maintaining and comparing stimuli across trials ("comparative hypothesizing", Klayman, 1988) and possibly partitioning the stimuli into subsets for the different slopes and switching back and forth across these partitioned segments during training (Lewandowsky et al., 2002; Sewell & Lewandowsky, 2012), and these processes require working memory capacity (both from a theoretical perspective, Craig & Lewandowsky, 2012; and based on empirical findings, Sewell & Lewandowsky, 2012). Consequently, for participants attempting to abstract the function rule, higher working memory capacity (as indexed by Ospan scores), would facilitate learning."

    "The implication is that for the rule learners, those with higher working memory capacity were able to more effectively support the processing needed to determine the functional relation among the training points, thereby supporting faster learning."

These authors suggest that high WMC facilitates abstraction, that is, seeing “the forest for the trees” by learning underlying rules as opposed to memorizing example-specific details:

  • "...[A]fter training (on a function-learning task), participants either displayed an extrapolation profile reflecting acquisition of the trained cue-criterion associations (exemplar learners) or abstraction of the function rule (rule learners; Studies 1a and 1b).
    Studies 1c and 2 examined the persistence of these learning tendencies on several categorization tasks. Study 1c showed that rule learners were more likely than exemplar learners (indexed a priori by extrapolation profiles) to resist using idiosyncratic features (exemplar similarity) in generalization (transfer) of the trained category. Study 2 showed that the rule learners but not the exemplar learners performed well on a novel categorization task (transfer) after training on an abstract coherent category.
    [W]orking memory capacity (as measured by Ospan following Wiley et al., 2011) was a significant and unique predictor of the tendency to rely on rule versus exemplar processes in the function learning task, such that higher working memory capacity was related to reliance on rule learning.

    For a number of reasons, greater working memory capacity could facilitate abstracting the function rule during learning, including the ability to maintain and compare several stimuli concurrently (Craig & Lewandowsky, 2012), to partition the training stimuli into two linear segments and switch back and forth between them during learning (Erickson, 2008; Sewell & Lewandowsky, 2012), and to reject or ignore initial biases (e.g., a positive linear) in order to discern the given function (cf., Wiley et al., 2011).

    Thus, learners enjoying greater working memory capacity might be more inclined to engage processes that would support rule learning (relating several training trials, partitioning training trials, ignoring initial biases) than would learners with more limited working memory capacity."

At the other end of the spectrum, Swanson & Siegel (2011) found that students with learning disabilities generally have lower WMC:

  • "We argue that in the domain of reading and/or math, individuals with LD have smaller general working-memory capacity than their normal achieving counterparts and this capacity deficit is not entirely specific to their academic disability (i.e., reading or math). ... We find that in situations that place high demands on processing, individuals with LD have deficits related to controlled attentional processes (e.g., maintaining task relevant information in the face of distraction or interference) when compared to their chronological aged-matched counterparts.
    One conclusion from the experimental literature is that individual differences in WM (of which executive processing is a component) are directly related to achievement (e.g., reading comprehension) in individuals with average or above average intelligence (e.g., Daneman & Carpenter, 1980). Thus, children or adults with normal IQs have difficulty (or efficiency varies) in executive processing and that such difficulties are not restricted to those with depressed intelligence.
    Our conclusions from approximately two decades of research are that WM deficits are fundamental problems of children and adults with LD. Further, these WM problems are related to difficulties in reading and mathematics, and perhaps writing. Although WM is obviously not the only skill that contributes to academic difficulties [e.g., vocabulary and syntactical skills are also important (Siegal and Ryan, 1988)], WM does play a significance role in accounting for individual differences in academic performance."

Q&A #2

This is a follow-up to Q&A #1 above.

  • "Much of the rule learning (and rule discovery) literature does not provide learners with immediate explanatory feedback. Under such less favorable learning conditions, indeed there may well be high variability. If you have had a chance to dig into any of these primary sources [McDaniel et al. (2014) and sources it references], I'd love to hear whether or not they go outside that."

I checked out the primary sources that McDaniel et al. (2014) referenced as reporting significant correlations between speed of learning and working memory capacity. In both of those primary sources, immediate feedback was provided but it was not explanatory. I also dug deeper into separate literature and did not see any studies of learning rate in the context of explanatory feedback. I agree that it would be interesting to see some of these studies re-run in the context of immediate explanatory feedback.

That said, I did come across some other studies that may be relevant here.


Renkl (1997) found that when studying worked examples, self-explanation characteristics correlated with learning outcomes, even when controlling for study time and prerequisite knowledge. Here is their description of the most successful group:

  • "Cluster 2 was – according to the adjusted means – the most successful group. It could be characterized by a self-explanation style that emphasized the assignment of "meaning" to the operators, both by explicating the underlying principle and the corresponding subgoal. Anticipative reasoning was performed merely at a medium level. The learners in this group infrequently noticed comprehension failures and inspected a slightly below-average number of examples. They started at a relatively low prior knowledge level, but reached a high level of learning success. This self-explanation style could be characterized as data-driven, but nevertheless active (cf. Reimann, 1994; Reimann & Schult, 1996). It was named principle-based."

And other groups:

  • "Cluster 1 concentrated their efforts on the anticipative computation of to-be-found probabilities and did not provide either many principle-based explanations nor explicated many goal-operator combinations. This group self-diagnosed just a few comprehension impasses and inspected a medium number of examples. The post-test performance was high. However, this group started at a relatively high level of prior knowledge. Members of this group could be labeled as anticipative reasoners.
    Cluster 3 was the comparatively large group of 19 learners that could be described as unsuccessful. This relative failure to profit from studying worked-out examples obviously resulted from the poor quality of self-explanations: There were few principle-based explanations, few nominations of goal-operator combinations, and a low level of anticipative reasoning. In addition, this group also noticed many comprehension impasses and did not inspect many examples. This self-explanation style was labeled passive.

    The individuals in cluster 4 engaged in an average amount of principled-based explanations and explications of goal-operator combinations. Anticipative reasoning was relatively infrequent. Interestingly, although they were merely medium successful learners, they very seldom noticed comprehension problems and assigned relatively little time to each example (i.e., inspected many examples). There is some similarity of this style to the unsuccessful learners described by Chi et al. (1989). Those subjects also inspected the examples for a relatively brief time and their learning success contrasted with the low extent of self-diagnosed comprehension difficulties. This self-explanation style is named superficial."

I came across several other studies that found correlations between individual differences in metacognitive abilities and working memory capacity (WMC), which makes me suspect that WMC may be at least partially implicated in self-explanation characteristics.

For instance, Linderholm, Cong, & Zhao (2008) found that low-WMC readers were overconfident in their comprehension of a text, while high-WMC readers were similarly underconfident – I expect this would predispose high-WMC readers to analyze more deeply while reading.

More generally, Komori (2016) summarizes that differences in WMC can impact attentional control, which seems important for learning deep structure from a worked example:

  • "Differences in people's cognitive abilities derived from WMC have been explained by attentional control processes (allocation: e.g., Just and Carpenter, 1992; focusing: e.g., Baddeley, 2007; Osaka et al., 2007; maintenance and inhibition of proper information: e.g., Engle and Kane, 2004; Unsworth and Engle, 2007; Miyake and Friedman, 2012; scope and control: e.g., Chow and Conway, 2015) ..."

Prat, Seo, & Yamasaki (2015) discuss all these things in more detail:

  • "In a series of reading time experiments, Miyake, Just, and Carpenter (1994) showed that while waiting for disambiguating context, high-span readers were more likely to maintain both senses of ambiguous words in mind for long periods of time. Thus, in sentences such as "Because she was a boxer, she was the most popular dog in the pet store," high-span readers were able to easily comprehend the words "dog in the pet store," whereas low-span readers had difficulty reading that section (as indexed by slower reading times) when the context resolved to the less-frequent interpretation of the word. Subsequent research on individual differences in lexical ambiguity resolution has focused on the ability to select appropriate meanings of words in the face of early disambiguating information (e.g., Gadsby, Arnott, & Copland, 2008; Gunter, Wagner, & Friederici, 2003). This research suggests that high-span readers are more successful than low-span readers at inhibiting the inappropriate meanings of ambiguous words. Taken together, these results suggest that high-span readers are able to maintain multiple meanings of ambiguous words strategically, selecting the appropriate meaning as soon as disambiguating context is provided.

    At the sentence level, a plethora of research has shown that working memory capacity facilitates syntactic parsing. For instance, King and Just (1991) showed that high-span readers were faster and more accurate than low-span readers when comprehending syntactically complex, object-relative sentences. Additionally, sensitivity to syntactic ambiguity also varies as a function of individual working memory span (e.g., MacDonald, Just, & Carpenter, 1992). Specifically, high-span readers are more likely than low-span readers to slow down during regions of a sentence that are syntactically ambiguous (see Waters & Caplan, 1996 for a counter-example). MacDonald and colleagues (1992) argued that the slowed reading times observed in high-span readers reflected the cost placed on working memory when multiple possible sentence structures were constructed and maintained in parallel.

    Multiple follow-up experiments have investigated differences in high- and low-span readers' sensitivity to probabilistic constraints during syntactic parsing (e.g., Long & Prat, 2008; Pearlmutter & MacDonald, 1995). In one such experiment, Long and Prat (2008) demonstrated that both high- and low-span readers have access to information about contextually based probabilities (e.g., as indexed by sentence completion tasks), but that only high-span individuals use such information during online sentence comprehension (e.g., as indexed by slowing reaction times during potentially ambiguous sentence regions). Thus, the results of individual differences in syntactic parsing converge with those investigating lexical processes. Specifically, high-span readers seem able to represent multiple possibilities when parsing syntactically ambiguous sentences, and at the same time, are able to more readily use contextual information online to flexibly decide which structure is more likely, given all of the information available.

    Because discourse comprehension relies on the outputs of word- and sentence-level processes, it is not surprising that individual differences in working memory capacity are readily apparent at the discourse level (Prat, Mason, & Just, 2012). For instance, high-span readers are more likely to engage in optional, elaborative processes (such as inference generation) during discourse comprehension than are low-span readers (e.g., Barreyro, Cevasco, Burin, & Marotto, 2012; St. George, Mannes, & Hoffman, 1997). Additionally, high-span readers are better able than low-span readers at focussing on the key details of a passage during comprehension. For example, Sanchez and Wiley (2006) demonstrated that high-span readers are less susceptible than low-span readers to the seductive details effect (Harp & Mayer, 1997), or the reduction in comprehension that occurs when texts are accompanied by "seductive" distractors such as illustrations that are not central to the themes of the text. Using an eye-tracking paradigm, Sanchez and Wiley demonstrated that low-span readers spent almost as much time viewing illustrations as they did reading texts, and spent significantly more time viewing illustrations than did high-span individuals. Finally, a recent study by Unsworth and McMillan (2013) showed that individuals with higher working memory capacity (as indexed by a battery of complex span tasks) are less likely to mind wander, or shift attention away from the task at hand to internal thoughts or feelings that are unrelated to the task, while reading."

Anecdotally, when I recall tutoring students who seemed to generalize less from worked examples, one feature that sticks out is that they would often grasp incorrect structure from the worked example (they would typically over-simplify it), and only after struggling with a practice problem (i.e., getting it wrong or getting stuck and having to refer back to the worked example) would they notice some portion of the structure that they had failed to grasp. Gradually, as they struggled with more practice problems, they would chip away at enough of the structure in the worked example – not the full structure in its entirety, but enough to get them to the point of mastery. Students who generalized well, on the other hand, would excavate far more of the correct structure in a worked example.

Forgetting Rate

I ran across a couple studies (Zerr et al., 2018; McDermott & Zerr, 2019; both in the context of non-explanatory feedback) highlighting that it is not only the learning rates that are variable, but also the forgetting rates – in particular, faster learners tend to be slower forgetters. This seems noteworthy to me because I would expect forgetting rate to depend less on the means by which the learning was acquired (i.e., favorable vs less favorable).

Additionally, in the context of An astonishing regularity in student learning rate, I wonder if individual differences in forgetting are also tangled up in the initial performance measurement, i.e., “forgotten background knowledge” being interpreted as “lack of background knowledge.”
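
As a rough illustration of how this confound could arise, here is a minimal sketch in Python. It assumes a simple exponential forgetting curve, which is a common textbook simplification and not something taken from the paper or from the studies cited above. The function name `retained_knowledge`, the forgetting rates, and the other numbers are all hypothetical.

```python
import math


def retained_knowledge(mastery_at_learning: float,
                       forgetting_rate: float,
                       days_elapsed: float) -> float:
    """Assumed exponential forgetting: knowledge decays at a constant rate per day."""
    return mastery_at_learning * math.exp(-forgetting_rate * days_elapsed)


# Hypothetical scenario: two students were taught the same background material
# to the same level of mastery, the same number of days before the measurement.
MASTERY_AT_LEARNING = 0.9
DAYS_SINCE_LEARNING = 30

slow_forgetter = retained_knowledge(MASTERY_AT_LEARNING, 0.01, DAYS_SINCE_LEARNING)
fast_forgetter = retained_knowledge(MASTERY_AT_LEARNING, 0.05, DAYS_SINCE_LEARNING)

# Identical prior exposure, yet different measured starting points.
print(f"slow forgetter's measured initial knowledge: {slow_forgetter:.2f}")
print(f"fast forgetter's measured initial knowledge: {fast_forgetter:.2f}")
```

In this toy setup, an initial performance measurement alone cannot distinguish the fast forgetter from a student who was never taught the background material in the first place.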

Anecdotally, when managing courses that used a mastery-based learning system, I noticed that weaker students would more frequently need to refer back to reference material on prerequisites even if they had mastered those prerequisites recently, and they would also do worse on quizzes where they were unable to refer back to reference material. The effect of forgetting was more clearly represented in their quiz accuracy and their raw amount of practice time than in their accuracy on, or quantity of, practice opportunities.

References


Hofstadter, D., & Carter, K. (2012). Some Reflections on Mathematics from a Mathematical Non-mathematician. Mathematics in School, 41(5), 2-4.

Koedinger, K. R., Carvalho, P. F., Liu, R., & McLaughlin, E. A. (2023). An astonishing regularity in student learning rate. Proceedings of the National Academy of Sciences, 120(13), e2221311120.

Komori, M. (2016). Effects of working memory capacity on metacognitive monitoring: A study of group differences using a listening span test. Frontiers in Psychology, 7, 172995.

Linderholm, T., Cong, X., & Zhao, Q. (2008). Differences in low and high working-memory capacity readers’ cognitive and metacognitive processing patterns as a function of reading for different purposes. Reading Psychology, 29(1), 61-85.

McDaniel, M. A., Cahill, M. J., Robbins, M., & Wiener, C. (2014). Individual differences in learning and transfer: stable tendencies for learning exemplars versus abstracting rules. Journal of Experimental Psychology: General, 143(2), 668.

McDermott, K. B., & Zerr, C. L. (2019). Individual differences in learning efficiency. Current Directions in Psychological Science, 28(6), 607-613.

Prat, C. S., Seo, R., & Yamasaki, B. L. (2015). The role of individual differences in working memory capacity on reading comprehension ability. In Handbook of Individual Differences in Reading (pp. 331-347). Routledge.

Renkl, A. (1997). Learning from worked-out examples: A study on individual differences. Cognitive Science, 21(1), 1-29.

Swanson, H. L., & Siegel, L. (2011). Learning disabilities as a working memory deficit. Experimental Psychology, 49(1), 5-28.

Zerr, C. L., Berg, J. J., Nelson, S. M., Fishell, A. K., Savalia, N. K., & McDermott, K. B. (2018). Learning efficiency: Identifying individual differences in learning rate and retention in healthy adults. Psychological Science, 29(9), 1436-1450.