Power or Alpha? The Better Way of Decreasing the False Discovery Rate

The replication crisis in psychology has led to an increased concern regarding the false discovery rate (FDR) – the proportion of false positive findings among all significant findings. In this article, we compare two previously proposed solutions for decreasing the FDR: increasing statistical power and decreasing the significance level α. First, we provide an intuitive explanation of α, power, and the FDR to improve the understanding of these concepts. Second, we investigate the relationship between α and power. We show that, for decreasing the FDR, reducing α is more efficient than increasing power. We suggest that researchers interested in reducing the FDR should decrease α rather than increase power. By investigating the relative importance of both the α level and power, we connect the literatures on these topics, and our results have implications for increasing the reproducibility of psychological science.

The reproducibility of studies in psychology has been questioned in the last few years. Massive replication initiatives found that replicability can be as low as 36% (Open Science Collaboration, 2015; but see Camerer et al., 2018; Ebersole et al., 2016; Klein et al., 2014; Klein et al., 2018 for more optimistic estimates), and many researchers have tried to identify the factors affecting the replicability of studies. While a comprehensive overview is beyond the scope of a single article (a whole issue of Perspectives on Psychological Science was dedicated to the problem; Pashler and Wagenmakers, 2012), we focus on statistical power, the significance level α, and the false discovery rate (FDR; the proportion of false positive findings among all statistically significant findings). 1 While some papers emphasize the importance of increasing statistical power to decrease the FDR (Button et al., 2013; Christley, 2010), others call for decreasing α (Benjamin et al., 2018). However, these two views seem disconnected, and it is unclear whether (or under which conditions) researchers should decrease α and when they should increase power in order to reduce the FDR. To further explore this disconnect, we reviewed all articles mentioning the FDR (or related terms) in the context of power and α in five methods and evidence synthesis journals within psychology (for more details see: https://osf.io/9cfg8/). Out of 106 reviewed articles, nine explicitly stated the importance of increasing power to reduce the FDR, while five discussed the importance of decreasing α. 2 Notably, only Miller and Ulrich (2019) discussed that both decreasing α and increasing power would reduce the FDR. However, the efficiency of these two options has not been compared so far.
The current article aims to bridge the discussion over α and power regarding the FDR and investigate the more efficient way of reducing the FDR. To achieve this, we first reiterate the concepts of power, false positives, and false discovery rate. We explain them using intuitive examples to deepen the understanding of these concepts. Next, we examine two possible views and their impact on reducing the FDR. The first view concerns planning a study and deciding on α and power independently. The second view concerns balancing between α and power for a fixed design, where setting α determines power and vice versa.

False Positives and α
In his pivotal book "Statistical Methods for Research Workers", Fisher (1925) was the first to widely popularize the concept of hypothesis testing and statistical significance as a means to differentiate signal from noise. Neyman and Pearson (1928) introduced the conceptualization of the significance level α as a tool for controlling long-term error rates. In other words, decisions from a statistical test with significance level α (e.g., 5%) would, in the long run, not incorrectly reject true null hypotheses at a rate higher than α. Thus, α determines the long-term rate of false positives. If researchers set their α to 5%, they will accept the alternative hypothesis whenever the probability of the observed or more extreme data, assuming the null hypothesis is true (the p-value), falls below α.
Let us illustrate this concept with an example from Fisher's (1935) famous experiment, "The Lady Tasting Tea". Lady Muriel Bristol claims that she can detect whether tea or milk was added first to a cup. To test whether the Lady has these tea tasting abilities, Fisher gives Lady Bristol eight cups of tea, four of which have milk added first, while the other four have tea added first. Fisher wants to keep his long-term rate of false positives below 5%. Since the Lady knows that half of the cups are tea-first, Fisher focuses only on the number of correctly classified tea-first cups (the correctly classified milk-first cups follow from the correctly classified tea-first cups). How many of the four tea-first cups would the Lady need to classify correctly to convince Fisher of her abilities? The probability of correctly guessing x tea-first cups in four trials can be obtained from the hypergeometric distribution (Figure 1, left). All four tea-first cups would be guessed correctly with a probability of 1.43%. Observing this event would thus be improbable if the Lady had no tea tasting abilities and guessed entirely at random. But what if she makes one mistake? The probability of classifying at least three out of four tea-first cups correctly by pure guessing is 24.3%. In other words, this would not provide sufficient evidence against her lack of abilities, and Fisher would be unable to tell whether she can differentiate between the cups: even if she were guessing entirely at random, she would achieve at least three out of four correctly classified tea-first cups 24.3% of the time. Neyman and Pearson (1928) later introduced the concept of statistical power to address a fundamental asymmetry in this approach: Type I error rates are controlled, but Type II errors (concluding the absence of an effect when it exists) are not explicitly formalized (Lehmann, 1992).
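The guessing probabilities above follow directly from the hypergeometric distribution. A minimal sketch in Python (scipy assumed; this code is illustrative and not part of the article's own materials):

```python
# "Lady tasting tea" under pure guessing: 8 cups (M), 4 tea-first (n),
# and the Lady selects 4 cups as tea-first (N). The number of correctly
# identified tea-first cups is then hypergeometric.
from scipy.stats import hypergeom

guessing = hypergeom(M=8, n=4, N=4)

p_all_four = guessing.pmf(4)        # all four tea-first cups correct
p_at_least_three = guessing.sf(2)   # three or four correct, P(X > 2)

print(f"P(4 correct)  = {p_all_four:.4f}")        # ~0.0143, i.e. 1.43%
print(f"P(>=3 correct) = {p_at_least_three:.4f}") # ~0.2429, i.e. 24.3%
```

The exact values are 1/70 and 17/70, matching the 1.43% and 24.3% reported above.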
Statistical power describes the probability that a statistical test rejects the null hypothesis when it is false. In other words, power refers to the probability of rejecting the null hypothesis, assuming that the hypothesized effect is present. The statistical power of a test depends on α, the sample size, and the magnitude of the true effect. A higher α, a larger sample size, and a larger true effect all contribute to increased statistical power (Cohen, 1992). Power is thus related to false negatives, with higher statistical power decreasing the probability of finding a false negative result.

Power
Let us continue with the previous example but look at it from the other side. Assume that the Lady can indeed distinguish whether the milk or tea was added first. It is a difficult task, and she makes a mistake from time to time: her probability of classifying a cup correctly is 0.7. The resulting probabilities now follow a noncentral hypergeometric distribution (Liao and Rosen, 2001; Figure 1, right). The probability of her classifying all eight cups correctly is 19%. In other words, if the Lady can classify cups correctly in 70% of cases, Fisher would detect this only 19% of the time.
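This 19% can be reproduced numerically. Conditioning independent per-cup judgments with accuracy p on the Lady selecting exactly four cups yields Fisher's noncentral hypergeometric distribution with odds ratio (p/(1 − p))²; this parameterization is our reading of the setup, and scipy ≥ 1.6 is assumed:

```python
# Power of Fisher's tea test when the Lady is right 70% of the time.
# scipy's nchypergeom_fisher implements Fisher's noncentral
# hypergeometric distribution; the odds ratio (p/(1-p))**2 follows
# from conditioning independent per-cup judgments on 4 picks.
from scipy.stats import nchypergeom_fisher

p = 0.7
odds = (p / (1 - p)) ** 2               # ~5.44

skilled = nchypergeom_fisher(M=8, n=4, N=4, odds=odds)
power = skilled.pmf(4)                  # she passes only with 4/4 correct

print(f"power = {power:.3f}")           # ~0.190, i.e. 19%
```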

False Discovery Rate
It follows from the previously outlined definition that power does not influence the probability of observing a false-positive result for any single study. However, since negative results are rarely published (Masicampo and Lalande, 2012; Mathur and VanderWeele, 2020; Nelson et al., 1986; Rosenthal, 1979; Rosenthal and Gaito, 1963, 1964; but see Wicherts, 2017 for contrary evidence), it is more interesting to investigate the proportion of false positives among significant findings, i.e., the false discovery rate (FDR). This proportion depends on the number of true positives (believing that someone possesses the tea tasting abilities when they truly do) and the number of false positives (believing that someone possesses the tea tasting abilities when they do not). While the number of true positives depends on power and the proportion of true alternative hypotheses, the number of false positives depends on α and the proportion of true null hypotheses. So, the FDR connects both previously mentioned concepts, and we illustrate it with our running example. Her Majesty The Queen decides to start a Royal Tea Tasting Society (RTTS) and requests Fisher to recruit new members based on their tea tasting abilities. Assume that one-fifth of the population possesses such abilities and can identify the order of milk and tea in 70% of cases. The remaining four-fifths do not possess this skill, and their answers are equal to random guessing. Fisher decides to use an α of 5%; therefore, 0.05 × 0.80 = 4% of the tests he administers result in false positives. Because he conveniently uses the same set-up as in the previous example, we know that the power of the test is 19%. Therefore, 0.19 × 0.20 = 3.8% of the tests he administers yield true positives. Subsequently, he introduces all citizens who passed the test to the Queen, who promotes them to members of the RTTS. However, what the Queen does not realize is that 0.04/(0.04 + 0.038) = 51% of her RTTS members do not possess any tea tasting abilities (the FDR).

Figure 1. The hypergeometric distribution shows the probability of x successes (x-axis) with the probability of success 0.50 (left) and 0.70 (right). Note that we only display up to four successes. We can think of those bars as the number of tea-first cups classified correctly. The Lady knows how many (but not which) cups have tea added to them first. Therefore, if she classifies all tea-first cups correctly, she necessarily also classifies the milk-first cups correctly. The dark-filled bars correspond to the probability of four correct answers.
As can be deduced from the example, there are two ways to decrease the FDR: either increase power and thus the number of true positives, or reduce α and thus the number of false positives. This relationship is depicted in Equation (1), which illustrates how power and α influence the FDR, with P(H0) standing for the proportion of true null hypotheses, α for the significance level, and ρ for statistical power,

FDR = αP(H0) / (αP(H0) + ρ(1 − P(H0))). (1)

This is the reason why many argue that researchers need to increase statistical power to reduce the FDR. However, we show in the following paragraphs that reducing α is usually the preferable option by investigating two ways of considering the trade-off between power and α. In the first way, researchers plan a study and independently determine what levels of α and power should be used. In the second, researchers balance between α and power for a fixed design, where setting α determines the power and vice versa.
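The FDR formula in Equation (1) can be sketched directly, with the RTTS example serving as a check (illustrative Python, not the article's own materials):

```python
# FDR from Equation (1): the share of false positives among all
# significant results, given alpha, power (rho), and P(H0).
def fdr(alpha, power, p_h0):
    false_pos = alpha * p_h0         # rate of false positives
    true_pos = power * (1 - p_h0)    # rate of true positives
    return false_pos / (false_pos + true_pos)

# Royal Tea Tasting Society: alpha = 5%, power = 19%, P(H0) = 0.8
rtts_fdr = fdr(alpha=0.05, power=0.19, p_h0=0.8)
print(f"FDR = {rtts_fdr:.3f}")       # ~0.513, i.e. about 51%
```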

Determining α and Power Independently
The first view assumes that α and power are set independently. 3 For example, researchers plan a study with a desired α and power and compute the required sample size for achieving them. Subsequently, we can study how changing either α or power in the planning phase influences the FDR. To do so, we present the derivatives of Equation (1) with respect to α and power,

∂FDR/∂α = ρP(H0)(1 − P(H0)) / (αP(H0) + ρ(1 − P(H0)))², (2)

∂FDR/∂ρ = −αP(H0)(1 − P(H0)) / (αP(H0) + ρ(1 − P(H0)))². (3)

Equations (2) and (3) connect a change in α or power to a change in the FDR. Since the denominators are the same and P(H0) is bound between 0 and 1, comparing Equations (2) and (3) shows that the gradient of the FDR with respect to α dominates the gradient with respect to power as long as power is larger than α (Figure 2). This is generally true because α is the lower bound on power, unless a one-sided test is used and the effect is in the opposite direction. In that case, power is lower than α and the gradient of the FDR with respect to power dominates the gradient with respect to α. In addition, when a two-sided test is used but power is low, many significant results will be in the opposite direction (Type S errors; Gelman and Carlin, 2014). Including those in the FDR would further change the results. Compelling visualizations that support this claim are available in the online materials (https://osf.io/gbtku/), and a more detailed discussion of this approach can be found in the open review (https://osf.io/sp95d/).

Figure 2. The logarithm of the FDR gradient (z-axis) is dependent on α (Alpha, x-axis) and power (y-axis) for a probability of the null hypothesis being true equal to 0.5. The red surface (with blue lines) depicts the gradient of the FDR with respect to α, and the green surface (with red lines) depicts the gradient of the FDR with respect to power. Note that they intersect when α is equal to power. When α is lower than power (right side), the gradient of the FDR with respect to α dominates the gradient with respect to power. An animated version is accessible at https://osf.io/gbtku/.
Overall, this indicates that under all conditions typically encountered in hypothesis testing, the gradient with respect to α will dominate the gradient with respect to power.
In other words, when designing a study, planning a lower α has a larger effect than planning higher power, as long as power is kept higher than α. So, if Fisher wanted to mitigate the proportion of RTTS members with no tea tasting abilities before the experiment was conducted, the best solution would be to decrease α as much as possible.
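The dominance of the α gradient can be checked numerically. A minimal sketch of Equations (2) and (3), with illustrative values we chose for the example (α = .05, power = .80, P(H0) = .5):

```python
# Gradients of the FDR (Equations (2) and (3)) at illustrative values.
def fdr(alpha, power, p_h0):
    return alpha * p_h0 / (alpha * p_h0 + power * (1 - p_h0))

def fdr_gradients(alpha, power, p_h0):
    denom = (alpha * p_h0 + power * (1 - p_h0)) ** 2
    d_alpha = power * p_h0 * (1 - p_h0) / denom     # Equation (2)
    d_power = -alpha * p_h0 * (1 - p_h0) / denom    # Equation (3)
    return d_alpha, d_power

alpha, power, p_h0 = 0.05, 0.80, 0.5
d_alpha, d_power = fdr_gradients(alpha, power, p_h0)

# The magnitudes differ by a factor of power / alpha, so here a change
# in alpha moves the FDR 16 times as much as the same change in power.
print(f"ratio = {abs(d_alpha) / abs(d_power):.1f}")  # 16.0
```

The ratio of the two gradients is always power/α, which is exactly why the α gradient dominates whenever power exceeds α.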

Trading α and Power
The second view goes one step further. If we assume that researchers operate with limited resources (i.e., a limited number of participants or time), then α determines power and vice versa. In other words, for a fixed design, researchers can either set α, in which case power can be expressed as a function of α, or set a desired power, in which case α can be expressed as a function of power. Equation (4) shows power (ρ), on the left side, as a function of α for a two-tailed independent samples z-test. In addition, the sample size n and effect size d determine the parameter µ of the normal distribution of z-statistics expected under the alternative hypothesis. The significance level α determines the upper and lower cut-off values used for significance testing through the quantile function of the standard normal distribution, Φ⁻¹. The cut-offs are subsequently used in the cumulative distribution function Φµ of the normal distribution with mean µ and standard deviation 1 to determine the probability of obtaining z-values more extreme than the cut-offs,

ρ = 1 − Φµ(Φ⁻¹(1 − α/2)) + Φµ(Φ⁻¹(α/2)). (4)

The µ parameter for a two-sample independent z-test depends only on the effect size d and the number of participants n, split equally between the groups (Equation (5)). More participants or a larger effect size means that the distribution of z-statistics has a higher mean µ,

µ = d√(n/4). (5)

Equations (4) and (5) are also depicted for a concrete example with n = 100, d = 0.5, and α = .05 (Figure 3). If α is decreased, the vertical lines placed at the cut-off z-statistics determined by the quantile function of the standard normal distribution move further apart from the center and thus reduce the grey-filled area corresponding to power. On the other hand, one could also increase α and thus increase the area corresponding to power.
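Equations (4) and (5) translate directly into code. A minimal Python sketch of the Figure 3 example (scipy assumed; variable names are ours):

```python
# Power of a two-sided, two-sample z-test as a function of alpha
# (Equations (4) and (5)): mu = d * sqrt(n / 4), n split equally.
from scipy.stats import norm

def ztest_power(alpha, n, d):
    mu = d * (n / 4) ** 0.5                    # Equation (5)
    upper = norm.ppf(1 - alpha / 2)            # two-sided cut-offs
    lower = norm.ppf(alpha / 2)
    # Equation (4): mass of N(mu, 1) beyond the cut-offs
    return 1 - norm.cdf(upper, loc=mu) + norm.cdf(lower, loc=mu)

power = ztest_power(alpha=0.05, n=100, d=0.5)  # mu = 2.5
print(f"power = {power:.3f}")                  # ~0.705
```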
So, given constant sample size and effect size, researchers are faced with two possibilities: they can either (a) increase α, relaxing the cut-off and thus achieving higher power; or (b) decrease α and thereby lower the power. We know that there is a convention to set α in statistical tests to 5%. However, there is no reason why α should remain constant at this fixed value. Fisher (1956) explained that the 5% should be disregarded whenever there are other substantial reasons for determining α. More recently, scientists again called for a more flexible adaptation of α (Lakens et al., 2018).
In other words, in a psychological science that operates with limited resources, there is always a trade-off to be made between avoiding false positives and detecting true positives. If Fisher wants to mitigate the proportion of RTTS members with no tea tasting abilities (assuming he has a constrained budget), he is faced with two options. On the one hand, he can decrease α and lower the number of false positives at the cost of decreased power and fewer true positives. On the other hand, he can increase power and the number of true positives at the cost of an increased α and thus more false positives. The important question is, which is more efficient in lowering the FDR: lowering α or increasing power? We show that for a two-sided z-test, and for a one-sided z-test with the true effect in the predicted direction, given a constant sample size, decreasing α leads to a lower FDR than increasing statistical power. Figure 4 shows this relationship on an example with an independent samples z-test for a proportion of true null hypotheses P(H0) = 0.5, effect size d = 0.5, and sample size n = 100 (50 per group).
Similar results can be obtained for different sample sizes, effect sizes, proportions of null hypotheses being true, and statistical tests (code to generate 3D plots across different µs can be found at https://osf.io/uszxk/). There is always a decrease in the FDR with decreasing α, with two exceptions. First, if the null hypotheses are either all false or all true (the latter including an effect size equal to 0), the FDR is 0 or 1, respectively, independent of power and α. Second, for one-sided tests where the true effect is opposite to the expected direction, the FDR will increase with decreasing α. However, these two situations should be relatively rare in practice; therefore, reducing α is usually the most efficient way to decrease the FDR.
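The fixed-design trade-off can be illustrated with the same z-test setup (n = 100, d = 0.5, P(H0) = 0.5). The comparison of α = .05 with α = .005 below is our illustrative choice, not a figure from the article:

```python
# Fixed design: lowering alpha from .05 to .005 costs power but still
# reduces the FDR, sketching the trade-off discussed above.
from scipy.stats import norm

def ztest_power(alpha, n, d):
    mu = d * (n / 4) ** 0.5
    return (1 - norm.cdf(norm.ppf(1 - alpha / 2), loc=mu)
            + norm.cdf(norm.ppf(alpha / 2), loc=mu))

def fdr(alpha, power, p_h0=0.5):
    return alpha * p_h0 / (alpha * p_h0 + power * (1 - p_h0))

for alpha in (0.05, 0.005):
    power = ztest_power(alpha, n=100, d=0.5)
    print(f"alpha = {alpha:.3f}: power = {power:.3f}, "
          f"FDR = {fdr(alpha, power):.3f}")
```

Despite power dropping from roughly .71 to roughly .38, the FDR falls from about .066 to about .013: the tenfold reduction in false positives outweighs the loss of true positives.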
For a more formal analysis, we also calculated the gradient of the FDR with respect to α (see Supplementary Materials at https://osf.io/svu7r). This elaborates the conclusion that reducing α is more efficient in reducing the FDR, since the derivative is positive for all values of α apart from one-sided tests with an effect in the opposite direction. 3D plots showing the derivative for different noncentrality parameters (ncps) can be found at https://osf.io/uszxk/. Figure 5 shows the gradient for an independent samples z-test with a proportion of true null hypotheses P(H0) = 0.5, effect size d = 0.5, and sample size n = 100 (50 per group).

Figure 4. The double x-axis shows α with its corresponding power, scaled according to α in the left chart and according to power in the right chart.

An expected objection is that instead of trading off power against α, one can achieve an increase in power by increasing the sample size. As explained before, there is no apparent reason for keeping α constant when increasing the sample size. Instead, one can keep power fixed and use the larger sample size to decrease α. Figure 6 shows that keeping power constant and decreasing α by increasing the sample size is more efficient in lowering the FDR.
Again, a similar pattern can be observed irrespective of the starting sample size, α, power, effect size, and proportion of true null hypotheses. The decrease in the FDR is stronger when the increase in sample size is used to reduce α rather than to increase power.
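This reallocation can be sketched numerically. Growing n from 100 to 150 (our illustrative numbers), we either (a) keep α = .05 and let power rise, or (b) keep power at its n = 100 level and shrink α instead, solving for α with a standard root finder:

```python
# Spending extra participants: raise power at fixed alpha (a), or
# lower alpha at fixed power (b). Power follows Equations (4) and (5).
from scipy.optimize import brentq
from scipy.stats import norm

def ztest_power(alpha, n, d=0.5):
    mu = d * (n / 4) ** 0.5
    return (1 - norm.cdf(norm.ppf(1 - alpha / 2), loc=mu)
            + norm.cdf(norm.ppf(alpha / 2), loc=mu))

def fdr(alpha, power, p_h0=0.5):
    return alpha * p_h0 / (alpha * p_h0 + power * (1 - p_h0))

baseline_power = ztest_power(0.05, n=100)

# (a) same alpha, more power (power rises to ~.86)
fdr_more_power = fdr(0.05, ztest_power(0.05, n=150))

# (b) same power, smaller alpha, solved numerically (~.012)
alpha_new = brentq(lambda a: ztest_power(a, n=150) - baseline_power,
                   1e-6, 0.05)
fdr_lower_alpha = fdr(alpha_new, baseline_power)

print(f"(a) FDR = {fdr_more_power:.4f}")
print(f"(b) FDR = {fdr_lower_alpha:.4f}")
```

Option (b) yields an FDR roughly three times lower than option (a) in this setup, in line with the pattern described above.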

Discussion
Our analysis shows that reducing α is usually more effective in reducing the false discovery rate than increasing power. Researchers striving to reduce the false discovery rate should reduce their α instead of increasing power. This is not only true when planning a study and deciding on the levels of α and power, but also when balancing power and α at a constant sample size or when increasing sample size and considering whether to "spend" the additional participants on increasing power or reducing α.
Our conclusion is similar to the long-standing literature on α adjustments for controlling the false discovery rate in multiple testing (e.g., Benjamini & Hochberg, 1995). However, the main goal of that literature is to keep the false discovery rate for a set of tests below a certain threshold, rather than trading α against power with respect to the FDR. We also need to consider several limitations of our analyses. In the case of one-sided tests, reducing α is only more beneficial if the true effect is in the expected direction. In the case of two-sided tests, incorporating Type S errors into the definition of the FDR increases the effectiveness of power when power is close to α. However, neither of these scenarios is plausible under common conditions. In addition, for balancing α and power, we only present results for the two-sample z-test, assuming that the assumptions of the statistical test (e.g., homoscedasticity and normality) are fulfilled. While the relation between power, α, and the FDR for a variety of other tests can be found at https://osf.io/uwkqz/ and is in line with our analysis, a formal proof that the proposed relationship holds for all tests under different conditions is not presented in this paper. More research is needed to generalize our results to other kinds of tests and settings. We also analyze only the effect of α and power, while an additional issue causing nonreplicability can be a low prior probability of the tested hypotheses (Benjamin et al., 2018; Hoogeveen et al., 2020; Ioannidis, 2005), which plays a direct role in the FDR formula.
In addition, we want to emphasize that we are still advocates of high power for several reasons. 4 First, high power is crucial for avoiding Type II errors. Controlling Type I errors is often perceived as more important than controlling Type II errors (e.g., Cohen, 1956); however, in some contexts, Type II errors might be more problematic (Fiedler et al., 2012). For example, consider researchers first investigating a new, potentially groundbreaking treatment for depression. Here, the Type II error of not detecting the effectiveness of the treatment might be more costly than concluding that the treatment is effective when it is not. This error (and the consequent abandoning of this line of research) would mean missing an opportunity to improve the lives of people with depression. Another example might be replication studies, where the primary focus is to test whether a previously reported effect exists, with lesser concern about inflating the FDR. Here, high power is crucial to avoid such Type II errors. In addition, low power combined with conditioning on significance leads to overestimated effect sizes (Type M errors) and to effect size estimates in the wrong direction (Type S errors; Gelman and Carlin, 2014). For these reasons, highly powered studies are crucial for cumulative science. Therefore, we recommend that in practice, researchers think about their inferential goals, weighing the costs of both Type I and Type II errors, to determine an optimal α and power (Lakens et al., 2018; Maier & Lakens, 2022; Miller & Ulrich, 2019; Mudge et al., 2012). If an important goal is to reduce the FDR, our analyses show that reducing α is more effective than increasing power.
Last but not least, we want to point out that the actual α level is often higher than the nominal α level due to questionable research practices, such as optional stopping or failure to report all dependent variables (John et al., 2012; Simmons et al., 2011; Wicherts, 2017). Therefore, finding ways to prevent these practices using tools such as preregistration and registered reports (Chambers et al., 2015) is probably one of the most critical tasks psychological science is facing. Some researchers also argue that we should abandon the framework of statistical testing and instead focus solely on summarizing the full information about effect size estimates (McShane et al., 2019).

Conclusion
We strove for two objectives in this paper. First, we reiterated the concepts of α, power, and the false discovery rate, hopefully improving the understanding of these concepts. Second, we compared two previously proposed solutions for decreasing the false discovery rate. Our results show that, with respect to the false discovery rate, it is usually more effective to decrease α than to increase statistical power. We suggest that researchers interested in reducing the false discovery rate focus on reducing α.