Statistical Significance as Social Convention

Few concepts in psychology carry as much authority, and as little sustained reflection, as statistical significance. The p-value has become a gatekeeper of legitimacy, determining which findings are publishable, citable, and fundable. Its presence signals rigor; its absence raises suspicion. Yet statistical significance was never designed to bear the epistemic weight psychology has placed upon it. What now functions as a marker of truth began as a pragmatic convention, adopted to manage uncertainty rather than resolve it.

Statistical significance emerged from early twentieth-century efforts to formalize inference under conditions of incomplete knowledge. It was not intended to certify truth claims, but to provide a rule of thumb for distinguishing signal from noise. The threshold of p < .05 was not derived from theory, ontology, or psychology’s subject matter. It was a compromise, chosen for convenience and gradually institutionalized through repetition. Over time, this convention hardened into a norm, and the norm into an expectation.

Psychology embraced statistical significance enthusiastically because it solved a practical problem. As the discipline moved toward large-scale empirical research, it needed a standardized criterion for adjudicating findings. The p-value offered a simple, ostensibly objective rule. It allowed researchers to present results as either significant or not, reducing complex patterns of uncertainty into a binary decision. This simplification was attractive, particularly in a field dealing with noisy, variable phenomena.

What was lost in this adoption was a clear distinction between statistical and substantive significance. A result could reach statistical significance while explaining trivial variance or lacking theoretical coherence. Conversely, theoretically important effects could fail to reach significance because of sample size, measurement limitations, or design constraints. Although these caveats were well known, the discipline gradually came to treat statistical significance as a proxy for importance.
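To make that gap concrete, here is a small simulation sketch of my own, using made-up numbers and the NumPy and SciPy libraries rather than anything drawn from the studies discussed here. With a large enough sample, an effect of one twentieth of a standard deviation sails past p < .05 while accounting for a vanishingly small share of the variance.

```python
# Illustrative only: hypothetical numbers showing how a trivially small
# effect reaches p < .05 once the sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20_000                                            # participants per group
control = rng.normal(loc=0.00, scale=1.0, size=n)
treated = rng.normal(loc=0.05, scale=1.0, size=n)     # true effect of 0.05 SD

res = stats.ttest_ind(treated, control)
t, p = res.statistic, res.pvalue

pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
d = (treated.mean() - control.mean()) / pooled_sd     # standardized effect size
eta_sq = t**2 / (t**2 + 2 * n - 2)                    # share of variance explained

print(f"p = {p:.2e}, d = {d:.3f}, variance explained = {eta_sq:.4%}")
# Typical output: p far below .05, d near 0.05, variance explained well under 1%.
```

The point of the sketch is not the particular numbers, which are invented, but the structure of the result: the binary verdict "significant" is silent about how much the effect matters.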

By the time I began studying psychology in the early 1980s, this conflation was already entrenched. Students were trained to design studies around achieving significance, often before clarifying what would count as meaningful evidence for a theory. Null results were quietly discouraged. Significance testing was presented less as a tool and more as a requirement. The logic of inquiry subtly shifted: the question became not What does this tell us? but Did it cross the threshold?

This shift shaped research behavior in predictable ways. Studies were powered just enough to detect effects. Analytic decisions were made with an eye toward significance rather than theoretical clarity. Reporting practices favored clean narratives of confirmation over messy accounts of ambiguity. Over time, this produced a literature dense with statistically significant findings and thin on cumulative understanding.

The replication crisis did not create these problems; it exposed them. When many well-established findings failed to replicate, the issue was not merely technical. It revealed how deeply psychology had come to rely on statistical convention as a stand-in for epistemic warrant. Results that were once treated as solid dissolved under repeated scrutiny, not because the original researchers were careless, but because the inferential framework encouraged overconfidence.

Statistical significance functions socially as much as it does analytically. It organizes incentives, shapes publication decisions, and structures academic careers. Journals reward significance. Grant panels expect it. Hiring committees count it. Under these conditions, it is unsurprising that researchers orient their work toward achieving significant results. The convention becomes self-reinforcing, not because it is epistemically optimal, but because it is institutionally efficient.

This social role complicates attempts at reform. Calls to abandon p-values or lower thresholds often underestimate how deeply significance testing is woven into psychology’s professional infrastructure. Even when alternative approaches such as confidence intervals, Bayesian inference, or estimation are adopted, they frequently inherit the same normative function. A new metric replaces the old one, but the underlying desire for a decisive cutoff remains.

The deeper problem is not the p-value itself. It is psychology’s discomfort with uncertainty. Human behavior is probabilistic, context-dependent, and multiply determined. Clean causal claims are rare. Statistical significance offers relief from this discomfort by providing a rule that appears to settle questions definitively. Once the threshold is crossed, ambiguity recedes. The result can be declared real.

Yet this relief is illusory. Statistical significance does not tell us whether a finding is true, robust, or theoretically meaningful. It tells us something much narrower: how probable data at least as extreme as those observed would be if a specific null hypothesis were true. Treating this as a verdict on reality confuses a technical calculation with a substantive judgment.
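A brief sketch, again my own illustration with hypothetical numbers rather than anything drawn from the literature, makes the narrowness of that statement visible. The p-value is computed entirely within a null model, by asking how often that model would produce data at least as extreme as the data actually observed.

```python
# Illustrative only: the p-value answers a question about a null model,
# not about whether the finding is true, robust, or important.
import numpy as np

rng = np.random.default_rng(1)
n = 50
observed_mean = 0.30        # hypothetical observed sample mean

# Simulate the null world: true mean 0, SD 1, many samples of size n.
null_means = rng.normal(loc=0.0, scale=1.0, size=(100_000, n)).mean(axis=1)

# Two-sided p-value: how often the null model produces a result
# at least as extreme as the one observed.
p_value = np.mean(np.abs(null_means) >= abs(observed_mean))
print(f"p ≈ {p_value:.3f}")  # a statement about the null model, nothing more
```

Everything in that calculation happens inside an imagined world where the effect is zero; nothing in it certifies what is true in ours.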

Case-level thinking makes this confusion especially visible. Individual cases often demonstrate patterns that are psychologically coherent but statistically anomalous. Conversely, statistically significant trends may obscure meaningful heterogeneity. When significance dominates inference, psychology risks privileging what is frequent over what is informative.

The discipline’s reliance on statistical convention also shapes theory development. Theories become optimized for detectability rather than depth. Constructs are defined narrowly to produce measurable effects. Predictions are framed conservatively to avoid null results. Over time, theory adapts to the demands of inference rather than the other way around.

None of this implies that statistical testing should be discarded. It remains a powerful tool for managing uncertainty and guarding against overinterpretation. The problem arises when the tool is mistaken for an epistemic foundation. Statistical significance cannot bear the weight psychology has placed upon it without distorting inquiry.

A more mature relationship to significance would treat it as one source of evidence among others, subordinate to theoretical coherence, methodological transparency, and cumulative plausibility. This requires tolerating ambiguity, reporting uncertainty honestly, and valuing null results where they are informative. It also requires resisting the temptation to let convention substitute for judgment.

Psychology’s credibility will not be restored by replacing one threshold with another. It will be restored by recognizing that inference is an interpretive act, not a mechanical one. Statistical tools assist that act; they do not absolve psychologists of responsibility for making it wisely.

Letter to the Reader

If significance testing has ever felt both indispensable and unsatisfying, that tension is not a personal failing. When I was trained in the early 1980s, statistical significance was already treated as a rite of passage, something one mastered and then rarely questioned.

With time, what becomes clear is how much of our confidence rested on convention rather than reflection. Learning statistics is necessary. Learning what statistics can and cannot tell you is just as important.

Be wary of results that feel settled simply because they passed a threshold. In psychology, certainty is often a social achievement before it is an epistemic one.
