Overview on the Null Hypothesis Significance Test

A Systematic Review on Essay Literature on its Problems and Solutions

Authors

  • Noah Dongen University of Amsterdam
  • Leonie van Grootel Rathenau Instituut

DOI:

https://doi.org/10.15626/MP.2021.2927

Keywords:

null hypothesis significance test, systematic review, essay literature, opinion literature, thematic synthesis

Abstract

For decades, waxing and waning, there has been an ongoing debate on the values and problems of the ubiquitously used null hypothesis significance test (NHST). With the start of the replication crisis, this debate has flared up once again, especially in the psychology and psychological methods literature. Arguments for or against NHST are usually made in essays and opinion pieces that cover some, but not all, of the qualities and problems of the method. The NHST literature landscape is vast, a clear overview is lacking, and participants in the debate seem to be talking past one another. To contribute to a resolution, we conducted a systematic review of essay literature concerning NHST published in psychology and psychological methods journals between 2011 and 2018. We extracted all arguments in defense of (20) and against (70) NHST, as well as the solutions (33) that were proposed to remedy (some of) the perceived problems of NHST. Unfiltered, these 123 items form a landscape that is prohibitively difficult to keep in one’s sights. Our contribution to the resolution of the NHST debate is twofold. 1) We performed a thematic synthesis of the arguments and solutions, which carves the landscape into a framework of three zones: mild, moderate, and critical. This reduction summarizes groups of arguments and solutions, thus offering a manageable overview of NHST’s qualities, problems, and solutions. 2) We provide the data on the arguments and solutions as a resource for those who will carry on the debate and/or study the use of NHST.

Published

2025-05-07

Section

Original articles