A 2019 paper in Advances in Methods and Practices in Psychological Science found that most psychology textbooks, instructors, and students misinterpret ‘statistical significance’ and p-values. Talk about a headline! More important than the headline, however, are the correct interpretations and what we can do to fix the widespread misinterpretations. In this post, I explain the authors’ findings and the three solutions they propose.
The Bad News
Turns out 89% of introductory psychology textbooks described statistical significance incorrectly (Cassidy et al. 2019). So did 100% of psychology undergraduates, 80% of methodology instructors, and 90% of scientific psychologists (Haller & Krauss 2002). Yikes!
So what’s the right description of a p-value?
The Meaning of ‘Statistical Significance’ & P-value
Can the meaning of ‘statistical significance’ be derived from its parts, ‘statistical’ and ‘significance’? No. It’s more technical than that. So what’s the technical definition?
Here’s a jargony definition of “statistically significant” (e.g., p < 0.05):
Assuming that the null hypothesis is true and the study is repeated an infinite number of times by drawing random samples from the same population(s), less than 5% of these results will be more extreme than the current result.
Cassidy and colleagues’ adaptation of Kline 2013, p. 75
In other words, a p-value is the probability of observing a statistical relationship at least as extreme as what was observed, assuming the null hypothesis is true.
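To make that definition concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration (two hypothetical groups of 30 scores, an arbitrary test statistic); it is not from the paper. The point is just that the p-value is the fraction of results, generated in a world where the null hypothesis is true, that are at least as extreme as the result we actually observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "observed" data: two groups of 30 scores each (numbers invented for illustration).
group_a = rng.normal(loc=100, scale=15, size=30)
group_b = rng.normal(loc=108, scale=15, size=30)
observed_diff = abs(group_a.mean() - group_b.mean())

# Simulate many studies in a world where the null hypothesis is true:
# both groups come from the same population, so any difference is chance.
n_sims = 100_000
a = rng.normal(loc=100, scale=15, size=(n_sims, 30))
b = rng.normal(loc=100, scale=15, size=(n_sims, 30))
null_diffs = np.abs(a.mean(axis=1) - b.mean(axis=1))

# The p-value: the proportion of null-world results at least as extreme
# as the difference we actually observed.
p_value = (null_diffs >= observed_diff).mean()
print(f"observed difference: {observed_diff:.2f}, simulated p-value: {p_value:.4f}")
```

Note what the number does and does not say: it describes how often chance alone would produce a result this extreme, not the probability that the null hypothesis is true.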
[2023 Update: The “null hypothesis” often refers to a zero value, but it can also be “another value” of interest against which we want to compare our observed results (Cassidy et al., 2019, p. 234). For example, you may want to test whether a coin is “fair” by keeping track of the number of times it lands heads in 100 flips. You can calculate a p-value representing the probability of observing at least as many heads as you actually observed, assuming the rate of heads expected from a “fair” coin: 50% (rather than 0%).]
These definitions of ‘p-value’ and ‘null hypothesis’ make sense of why smaller p-values are seen as reasons to reject null hypotheses. Small p-values mean there is a low probability of a result at least as extreme as the observed one if the null hypothesis is true.
[2023 Update: The rarity implied by a low p-value may tempt you to infer too much, especially when the p-value is less than a conventional threshold like 5% (i.e., 0.05). However, a 5% probability outcome is not as rare as you might think: the probability of a fair coin landing heads at least 59 times in 100 flips is about 4%.
Is 59 heads out of 100 flips really so different from the expected 50 out of 100 that we should reject the null hypothesis that the coin is fair? Conventional “null hypothesis significance testing” would say yes! As you can imagine, some people want us to reconsider that convention (Kline 2013; Cumming & Calin-Jageman 2016, 2024).]
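If you want to check the arithmetic behind that “about 4%” yourself, here is a short Python sketch using only the standard library. The 59-heads count comes from the example above; the calculation is the exact one-sided binomial tail probability of getting at least that many heads from a fair coin.

```python
from math import comb

n_flips = 100
observed_heads = 59   # the count from the coin example above
p_fair = 0.5          # the null hypothesis: a fair coin

# One-sided p-value: probability of at least 59 heads in 100 flips,
# assuming the null hypothesis (fair coin) is true.
p_value = sum(comb(n_flips, k) for k in range(observed_heads, n_flips + 1)) * p_fair**n_flips
print(f"P(heads >= {observed_heads} | fair coin) = {p_value:.4f}")  # about 0.044, i.e. roughly 4%
```

Changing observed_heads shows how quickly that tail probability shrinks or grows, which is a useful way to build intuition about what a threshold like 5% actually rules in or out.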
P-value & Statistical Significance Fallacies
How do the correct descriptions of ‘statistical significance’ and p-values above differ from what we get from textbooks, methodology instructors, scientists, and their students?
Below are 8 categories of “fallacious” descriptions from Cassidy and colleagues’ Table 1 (2019). The 1st, 5th, and 6th fallacies are the ones I hear most often.
What Can You Do About It?
Now that you see how widespread the misunderstanding is and how simple the correct definition is, what should you do? Cassidy et al. give the following tips:
- Fix the textbooks.
- Remove discussions of the term ‘statistical significance’ from textbooks.
- Use their free teaching materials on the Open Science Framework: https://osf.io/qg9t2/
In addition to trying to spread the word with this post, I also teach p-values in the probability section of my Logic course. If you’re not sure what you can do, you could share this post with someone who might be interested in it.