In 2002, a chapter by Kahneman and Frederick introduced “the bat and ball problem”:
> A bat and a ball cost $1.10 in total.
> The bat costs $1.00 more than the ball.
> How much does the ball cost?
By 2005, Frederick’s Cognitive Reflection Test paper added the lesser-known Widgets and Lily Pad problems. In the intervening 20-ish years, each paper seems to have accrued over 5,000 citations.
In 2023, Meyer and Frederick published a massive follow-up paper about the first problem: 59 studies, over 73,000 participants, and more pages of appendices than pages in the main article. As someone studying various reflection tests and interventions, I had to take a look right away. In this post, I list five initial takeaways and two things to like about the paper.
1. The lure is strong!
Both online and paper-and-pencil studies repeatedly found that the incorrect lured answer (10 cents) is outstandingly attractive to test takers (Section 2). Although various attempts to steer people away from that lure did help some choose the correct response, many (many!) participants doubled down. For example, even after straight-up telling participants that “10 cents” is not correct, nearly half of online respondents stuck with it—these respondents are referred to as “hopeless” in the paper. 😜
2. The path to the lure is subtraction?
A clever experiment reported early in the paper suggested that most people arrive at the lured “10 cents” response via subtraction (page 3). That is, they just subtract $1.00 from $1.10.
Of course, process-tracing methods could provide even more confidence about the status of that hypothesis.
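For readers who want the arithmetic spelled out, here is a quick sketch of both routes to an answer. The variable names and checks are mine, not the paper’s; this is just an illustration of why subtraction yields the lure while solving the stated constraints yields 5 cents:

```python
# Why the lured "10 cents" answer fails, and what the correct answer is.
# (Illustrative sketch only; names are my own, not from Meyer & Frederick.)

TOTAL = 1.10  # bat + ball, in dollars
DIFF = 1.00   # the bat costs this much MORE than the ball

# Lured route: simply subtract the difference from the total.
lured_ball = TOTAL - DIFF       # 0.10
lured_bat = TOTAL - lured_ball  # 1.00
# But then the bat exceeds the ball by only $0.90, violating the problem:
assert round(lured_bat - lured_ball, 2) == 0.90

# Correct route: solve ball + (ball + DIFF) = TOTAL for the ball's price.
ball = (TOTAL - DIFF) / 2  # 0.05
bat = ball + DIFF          # 1.05
assert round(bat + ball, 2) == round(TOTAL, 2)  # total holds
assert round(bat - ball, 2) == round(DIFF, 2)   # difference holds
```

(Floating-point currency is rounded before comparison here; real financial code would use exact decimal arithmetic instead.)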
3. The lured response does not satisfy all “System 1” characteristics
The bat and ball problem may not be as “emblematic of the dual system framework … as its frequent citation suggests” (page 7). Meyer and Frederick list at least three reasons. For example, people are often consciously aware of how they arrived at the lured response. (So, as with social biases, we cannot infer that cognitive biases are unconscious just because they exhibit other “System 1” features.)
4. The debate about “Smart System 1” is not over
Meyer and Frederick remain unconvinced by the arguments that “most” correct responses occur “intuitively” (Appendix L). The disagreement between M&F and their interlocutors is not about whether people respond correctly without reflection; it’s about how often and the process(es) by which people respond correctly without reflection.
Our think-aloud reflection testing research indicates that correct-but-unreflective response rates range from around 20% to 80% — the higher rates are from the bat and ball problem. You can analyze our public think-aloud transcripts from that paper if you want to try to resolve the debate (e.g., whether correct-but-unreflective responses involve solving the relevant equations or just a heuristic).
5. The bat and ball problem tests more than math ability?
Meyer and Frederick think so, partly because of the results of their “hint” procedure. They also appeal to data from people who completed the bat and ball problem as well as the Raven’s and Linda problems (page 8). Not exactly direct evidence, but interesting empirical arguments.
Our own indirect evidence suggests that the correlates of reflection can depend on whether reflection is tested using math: mathy reflection tests correlated with moral decisions the same way that a math test did, but not the way that logical reflection tests did (see Table 2 from Byrd & Conway 2019 below). So even though we can theoretically distinguish reflection from numeracy, mathematical reflection tests may not empirically distinguish reflection from numeracy.
6. What I Like About Meyer & Frederick’s 2023 Paper
A. Deep Science. Psychological scientists often seem more interested in superficial inquiries than deep inquiries. After they publish an attention-grabbing result, they begin looking for another one (rather than spend a few years unraveling the remaining layers of uncertainty about the initial result). There’s nothing wrong with researching more than one topic, as long as scientists keep running down the unknowns about their results.
This paper suggests that Frederick et al. have been doing some deeper science over the past 20 years. I’m not saying I buy the paper wholesale, of course. I think much more process-tracing research is needed and I think their data should be freely available to the many scientists studying the bat and ball problem (or its ilk). But I do appreciate the Herculean and often underrated effort of synthesizing dozens of studies about an influential psychological test — and in less than 10 pages(!), speaking of which:
B. Unconventional concision. What was the last cognitive science paper you read that reported the methods and results of more than 50 new studies in under 10 pages? Take your time. We’ll wait. Actually, we don’t have time! And that’s why I wish more scholars, reviewers, and journal editors demanded more concision than the status quo. Although formatting and reporting standards are probably intended to facilitate the dissemination of scientific knowledge, they often impede it, because not every paper needs:
- section headings for every study, each with separate Methods and Results subheadings.
- omnibus test descriptions and results for every experiment.
- p-values, standardized effect sizes, and confidence intervals for every “significant” result.
Many research questions are neither answered by nor dependent on the results of omnibus tests. Some descriptive statistics are so revealing and well-powered that inferential statistics add no substantial value. The journal (Cognition) allowed Meyer & Frederick to forgo these conventions where doing so served a greater good — a welcome surprise!
(Confession: I was about to add a conclusion heading and paragraph, but it seemed hypocritical after the prior paragraph. If you are reading this, I deviated from that norm for the sake of concision — without sacrificing something of greater value, I hope.)