Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance

In scientific fields that use significance tests, statistical power is important for successful replications of significant results because it is the long-run success rate in a series of exact replication studies. For any population of significant results, there is a population of power values of the statistical tests on which conclusions are based. We give exact theoretical results showing how selection for significance affects the distribution of statistical power in a heterogeneous population of significance tests. In a set of large-scale simulation studies, we compare four methods for estimating the population mean power of a set of studies selected for significance (a maximum likelihood model, extensions of p-curve and p-uniform, and z-curve). The p-uniform and p-curve methods performed well with a fixed effect size and varying sample sizes. However, when there was substantial variability in effect sizes as well as sample sizes, both methods systematically overestimated mean power. With heterogeneity in effect sizes, the maximum likelihood model produced the most accurate estimates when the distribution of effect sizes matched the assumptions of the model, but z-curve produced more accurate estimates when the assumptions of the maximum likelihood model were not met. We recommend z-curve for estimating the typical power of significant results, which has implications for the replicability of significant results in psychology journals.

The purpose of this paper is to develop and evaluate methods for predicting the success rate if sets of significant results were replicated exactly. We call this statistical property the average power of a set of studies. Average power can range from the criterion for a type-I error, if all significant results are false positives, to 100%, if the statistical power of the original studies approaches 1. Average power can be used to quantify the degree of evidential value in a set of studies (Simonsohn et al., 2014b). In the end, we estimate the mean power of studies that were used to examine the replicability of psychological research, and compare the results to actual replication outcomes (Open Science Collaboration, 2015). Estimating the average power of original studies is interesting because it is tightly connected with the outcome of replication studies (Greenwald et al., 1996; Yuan & Maxwell, 2005). To claim that a finding has been replicated, a replication study should reproduce a statistically significant result, and the probability of a successful replication is a function of statistical power. Thus, if reproducibility is a requirement of good science (Bunge, 1998; Popper, 1959), it follows that high statistical power is a necessary condition for good science.
Information about the average power of studies is also useful because selection for significance increases the type-I error rate and inflates effect sizes (Ioannidis, 2008). However, these biases are relatively small if the original studies had high power. Thus, knowledge about the average power of studies is useful for the planning of future studies. If average power is high, replication studies can use the same sample sizes as original studies, but if average power is low, sample sizes need to be increased to avoid false negative results.
Given the practical importance of power for good science, it is not surprising that psychologists have started to examine the evidential value of results published in psychology journals. At present, two statistical methods have been used to make claims about the average power of psychological research, namely p-curve (Simonsohn et al., 2017) and z-curve (Schimmack, 2015, 2018a), but so far neither method has been peer-reviewed.

Statistical Power Before and After A Study Has Been Conducted
Before we proceed, we would like to clarify that statistical power of a statistical test is defined as the probability of correctly rejecting the null hypothesis (Neyman & Pearson, 1933). This probability depends on the sampling error of a study and the population effect size. The traditional definition of power does not consider effect sizes of zero (false positives) because the goal of a priori power planning is to ensure that a non-zero effect can be demonstrated.
However, our goal is not to plan future studies, but to analyze results of existing studies. For post-hoc power analysis, it is impossible to distinguish between true positives and false positives and to estimate the average power conditional on the unknown status of hypotheses (i.e., the null-hypothesis is true or false). Thus, we use the term average power as the probability of correctly or incorrectly rejecting the null-hypothesis (Sterling et al., 1995). This definition of average power includes an unknown percentage of false positives that have a probability equal to alpha (typically 5%) to reproduce a significant result in a replication attempt. At the same time, we believe that the strict null-hypothesis is rarely true in psychological research (Cohen, 1994).
It would be ideal if it were possible to estimate the power of a single statistical test that supports a particular finding. Unfortunately, well-documented problems with the "observed power" method suggest that the goal of estimating the power of an individual test may be out of reach (Boos & Stefanski, 2012; Hoenig & Heisey, 2001). Often the main problem is that estimates for a single result are too variable to be practically useful (Yuan & Maxwell, 2005; but see also Anderson, Kelley, & Maxwell, 2017).
It is important to distinguish our undertaking from that of Cohen (1962) and the follow-up studies by Chase and Chase (1976) and Sedlmeier and Gigerenzer (1989). In Cohen's classic survey of power in the Journal of Abnormal and Social Psychology, the results of the studies were not used in any way. Power was never estimated. It was calculated exactly for a priori effect sizes deemed "small," "medium" and "large." If a "medium" effect size referred to the population mean (which Cohen never claimed), power at the mean effect size is still not the same as mean power. In contrast, we aim to estimate the mean power given the actual population effect sizes in a set of studies.

Two Populations of Studies
We distinguish two populations of tests. One population contains all tests that have been conducted. This population contains significant and non-significant results. The other population contains the subset of studies that produced a significant result. We focus on the population of studies selected for significance for two reasons.
First, non-significant results are often not available because journal articles mostly report significant results (Rosenthal, 1979; Sterling, 1959; Sterling et al., 1995). Second, only significant results are used as evidence for a theoretical prediction. It is irrelevant how many tests produced non-significant results because these results are inconclusive. As psychological theories mainly rest on studies that produced significant results, only the evidential value of significant results is relevant for evaluations of the robustness of psychology as a science. In short, we are interested in statistical methods that can estimate the average power of a set of studies with significant results.

The Study Selection Model
We developed a number of theorems that specify how selection for significance influences the distribution of power. These theorems are very general. They do not depend on the particular population distribution of power, the significance tests involved, or the Type I error probabilities of those tests. The only requirement is that for every study with a specific population effect size, sample size, and statistical test, the probability of a result being selected is the true power of a study. We discuss the two most important theorems in detail. All six theorems are provided in the appendix, along with an illustration of the theorems by simulation.

Theorem 1 Population mean true power equals the overall probability of a significant result.
Theorem 1 establishes the central importance of population mean power after selection for significance for predicting replication outcomes. Think of a coin-tossing experiment in which a large population of coins is manufactured, each with a different probability of heads; that is, these coins are not fair coins with equal probabilities for both sides. Also consider heads to be successes or wins. Repeatedly tossing the set of coins and counting the number of heads produces an expected value of the number of successes. For example, the experiment may yield 60% heads and 40% tails. While the exact probabilities of heads for individual coins are unknown, the observable success rate is equivalent to the mean power of all coins. Theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins. For example, assume all coins were tossed once and only coins showing heads were retained. Repeating the coin toss experiment, we would still find that the success rate for the set of selected coins matches the mean probability of the selected coins.
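The logic of this illustration can be checked with a few lines of R. The simulation below is our illustration, not part of the reported studies, and the variable names are ours.

# Coin-tossing illustration of Theorem 1: each 'coin' has its own
# probability of heads (its power). Coins showing heads on the first
# toss are selected, then tossed again.
set.seed(123)
power = runif(10^6, 0.05, 1)                   # heterogeneous success probabilities
heads1 = rbinom(10^6, 1, power) == 1           # first toss: selection
selected = power[heads1]
heads2 = rbinom(length(selected), 1, selected) # second toss: replication
mean(selected) # mean power of the selected coins
mean(heads2)   # success rate on the second toss is nearly identical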
Theorem 2 The effect of selection for significance on power after selection is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection. If the distribution of power is continuous, this statement applies to the probability density function.

Figure 1 illustrates Theorem 2 for a simple, artificial example in which power before selection is uniformly distributed on the interval from 0.05 to 1.0. The corresponding distribution after selection for significance is triangular; studies with more power are more likely to be selected. In Figure 2, power before selection is less heterogeneous, and higher on average. Consequently, the distributions of power before selection and after selection are much more similar. In both cases, though, mean true power after selection for significance is higher than mean true power before selection for significance.

[Note to Figure 2. Power before selection follows a beta distribution with a = 13 and b = 6, multiplied by .95 plus .05, so that it ranges from .05 to 1.]
The coin-tossing selection model proposed here may seem overly simplistic and unrealistic. Few researchers conduct a study and give up after a first attempt produces a non-significant result. For example, Morewedge et al. (2014) disclosed that they did not report "some preliminary studies that used different stimuli and different procedures and that showed no interesting effects." For a theoretical integration of such studies it matters whether they all test the same hypothesis, but for our selection model it does not. Even if all studies used exactly the same procedures and had exactly the same power, the probability of being selected into the set of reported studies matches their power, and Theorem 2 holds. Each study that was conducted by Morewedge et al. has an unknown true power to produce a significant result, and Theorem 2 implies (via Theorem 5 in the appendix) that their selected studies with significant results have higher mean power than the full set of studies that were conducted. We are only interested in the statistical power and replicability of the published studies with significant results.

Estimation Methods
In this section, we describe four methods for estimating population mean power under conditions of heterogeneity, after selection for statistical significance.

Notation and statistical background
To present our methods formally, it is necessary to introduce some statistical notation. Rather than using traditional notation from statistics that might make it difficult for non-statisticians to understand our method, we follow Simonsohn et al. (2014a), who employed a modified version of the S syntax (Becker et al., 1988) to represent probability distributions. The S language is familiar to psychologists who use the R statistical software (R Core Team, 2017). The notation also makes it easier to implement our methods in R, particularly in the simulation studies.
The outcome of an empirical study is partially determined by random sampling error, which implies that statistical results will vary across studies. This variation is expected to follow a random sampling distribution. Each statistical test has its own sampling distribution. We will use the symbol T to denote a general test statistic; it could be a t-statistic, F, chi-squared, Z, or something else. Assume an upper-tailed test, so that the null hypothesis will be rejected at significance level α (usually α = 0.05), when the continuous test statistic T exceeds a critical value c.
Typically there is a sample of test statistic values T1, . . . , Tk, but when only one is being considered the subscript will be omitted. The notation p(t) refers to the probability under the null hypothesis that T is less than or equal to the fixed constant t. The symbol p would represent pnorm if the test statistic were standard normal, pf if the test statistic had an F-distribution, and so on. While p(t) is the area under the curve, d(t) is the value on the y axis for a particular t, as in dnorm. Following the conventions of the S language, the inverse of p is q, so that p(q(t)) = q(p(t)) = t.
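For readers who are less familiar with this notation, a few lines of R (our illustration) show the correspondence for the standard normal case:

pnorm(1.96)        # p(t): area under the curve below t = 1.96, about .975
dnorm(1.96)        # d(t): height of the density at t = 1.96
qnorm(pnorm(1.96)) # q undoes p, returning 1.96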
Sampling distributions when the null-hypothesis is true are well known to psychologists because they provide the foundation of null-hypothesis significance testing. Most psychologists are less familiar with noncentral sampling distributions (see Johnson et al., 1995, for a detailed and authoritative treatment). When the null hypothesis is false, the area under the curve of the test statistic's sampling distribution is p(t,ncp), representing particular cases like pf(t,df1,df2,ncp). The initials ncp stand for "non-centrality parameter." This notation applies directly when T has one of the common non-central distributions like the non-central t, F or chi-squared under the alternative hypothesis, but it can be extended to the distribution of any test statistic under any specific alternative, even when the distribution in question is technically not a non-central distribution. The non-centrality parameter is positive when the null hypothesis is false, and statistical power is a monotonically increasing function of the non-centrality parameter.
This function is given explicitly by Power = 1 − p(c,ncp). For the most important non-central distributions (Z, t, chi-squared and F), the non-centrality parameter can be factored into the product of two terms. The first term is an increasing function of sample size, and the second term is an increasing function of effect size.
In symbols,

ncp = f1(n) · f2(es).    (1)

This formula is capable of accommodating different definitions of effect size (Cohen, 1988; Grissom & Kim, 2012) by making corresponding changes to the function f2 in f2(es). As an example of Equation (1), consider a standard F-test for the difference between the means of two normal populations with a common variance. After some simplification, the non-centrality parameter of the non-central F may be written as

ncp = n ρ (1 − ρ) d²,

where n = n1 + n2 is the total sample size, ρ is the proportion of cases allocated to the first treatment, and d is Cohen's (1988) effect size for the two-sample problem. This expression for the non-centrality parameter can be factored in various ways to match Equation (1); for example, f1(n) = n ρ (1 − ρ) and f2(es) = es².
Note that this is just an example; Equation (1) applies to the non-centrality parameters of the non-central Z, t, chi-squared and F distributions in general. Thus for a given sample size and a given effect size, the power of a statistical test is

Power = 1 − p(c, f1(n) · f2(es)).    (2)

In this formula, c is the criterion value for statistical significance; the test is significant if T > c. The function f2(es) can also be applied to sets of studies with different traditional effect sizes. For example, es could be Cohen's d, and an alternative effect size es′ could be the point-biserial correlation r (Cohen, 1988, p. 24). Symbolically, es′ = g(es). Since the function g(es) is monotone increasing, a corresponding inverse function exists, so that es = g⁻¹(es′). Then Equation (2) becomes

Power = 1 − p(c, f1(n) · f2(g⁻¹(es′))) = 1 − p(c, f1(n) · f2′(es′)),

where f2′ just means another function f2. That is, if the definition of effect size is changed (in a monotone way), the change is absorbed by the function f2, and Equation (2) still applies.
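As a concrete illustration of Equation (2) for the two-sample F-test described above, the following R function (ours; the name power.F2 is arbitrary) computes power from the total sample size and Cohen's d:

# Power of a two-sample F-test via Equation (2):
# ncp = f1(n) * f2(es), with f1(n) = n*rho*(1-rho) and f2(es) = d^2.
power.F2 = function(n, d, rho = 0.5, alpha = 0.05) {
  crit = qf(1 - alpha, df1 = 1, df2 = n - 2)    # critical value c
  ncp = n * rho * (1 - rho) * d^2               # non-centrality parameter
  1 - pf(crit, df1 = 1, df2 = n - 2, ncp = ncp) # Power = 1 - p(c, ncp)
}
power.F2(n = 86, d = 0.4) # about .45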
We are now ready to introduce our four methods for the estimation of mean power based on a set of studies that vary in power, with known sample sizes and unknown population effect sizes. The four methods are p-curve 2.1, p-uniform, the maximum likelihood model, and z-curve.

Estimation Methods
The first two estimation methods are based on methods that were developed for the estimation of effect sizes; our use of them for the estimation of mean power is an extension. Our simulation studies should therefore not be considered tests of these methods for the estimation of effect sizes. We extended these methods because power is a function of effect size and sample size, and sample sizes are known. Thus, only the unknown effect sizes need to be estimated, and power estimation requires only the simple additional step of computing power for each study as a function of the effect size estimate and the sample size of that study. These models should work well when all studies have the same effect size and heterogeneity in power is only a function of heterogeneity in sample size, as these models assume.

P-curve 2.1 and p-uniform
A p-curve method for the estimation of mean power is available online (www.p-curve.com). It is important to point out that this online method, called p-curve 4.06, differs from the p-curve method that we developed. We built our p-curve method on the effect size p-curve method with the version code p-curve 2.0 (Simonsohn et al., 2014b). Hence, we refer to our p-curve method as p-curve 2.1. P-uniform is very similar to p-curve (van Assen et al., 2014). Both methods aim to find an effect size that produces a uniform distribution of p-values between 0 and .05. After we developed our p-uniform method for power estimation, a new estimation method for p-uniform was introduced (van Aert et al., 2016).
We conducted our studies with the original estimation method, and our results are limited to the performance of this implementation of p-uniform. To find the best fitting effect size for a set of observed test statistics, p-curve 2.1 and p-uniform compute p-values for various effect sizes and choose the effect size that yields the best approximation of a uniform distribution. If the modified null hypothesis that effect size = es is true, the cumulative distribution function of the test statistic is the conditional probability using ncp = f1(n) · f2(es) as given in Equation (1). The corresponding modified p-value is

P = [1 − p(T, ncp)] / [1 − p(c, ncp)].

Note that since the sample sizes of the tests may differ, the symbols p, n and c as well as T may have different referents for j = 1, . . . , k test statistics. The subscript j has been omitted to reduce notational clutter. If the modified null hypothesis were true, the modified p-values would have a uniform distribution. Both p-curve 2.1 and p-uniform choose as estimated effect size the value of es that makes the modified p-values most nearly uniform. They differ only in the criterion for deciding when uniformity has been reached. P-curve 2.1 is based on a Kolmogorov-Smirnov test for departure from a uniform distribution, choosing the es value yielding the smallest value of the test statistic. P-uniform is based on a different criterion. Denoting by Pj the modified p-value associated with test j, calculate

Y = −[ln(P1) + ln(P2) + · · · + ln(Pk)],

where ln is the natural logarithm. If the Pj values were uniformly distributed, Y would have a Gamma distribution with expected value k, the number of tests. The p-uniform estimate is the modified null hypothesis effect size es that makes Y equal to k, its expected value under uniformity.
These methods are designed for heterogeneity in sample size only, and assume a common effect size for all the tests. Given an estimate of the common effect size, estimated power for each test varies only as a function of sample size, and can be determined by Equation (2) because sample sizes are known. Population mean power can then be estimated by averaging the k power estimates.
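To make this concrete, here is a minimal R sketch of the p-uniform power estimate for the special case of two-group F(1, n − 2) tests. It is our simplification, not the code used in the simulation studies, and it assumes the effect size solution lies between 0 and 2.

# P-uniform sketch: find the effect size d that makes Y equal k,
# then average the implied power values across studies.
puniform.power = function(Fstat, n, alpha = 0.05, rho = 0.5) {
  k = length(Fstat)
  crit = qf(1 - alpha, 1, n - 2)  # per-study critical values
  Y = function(d) {               # sum of -log modified p-values
    ncp = n * rho * (1 - rho) * d^2
    P = (1 - pf(Fstat, 1, n - 2, ncp)) / (1 - pf(crit, 1, n - 2, ncp))
    sum(-log(P))
  }
  d.hat = uniroot(function(d) Y(d) - k, c(0, 2))$root   # es making Y = k
  mean(1 - pf(crit, 1, n - 2, n * rho * (1 - rho) * d.hat^2))
}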

Maximum likelihood model
Our maximum likelihood (ML) model also first estimates effect sizes and then combines effect size estimates with known sample sizes to estimate mean power. Unlike p-curve2.1 and p-uniform, the ML model allows for heterogeneity in effect sizes. In this way, the model is similar to Hedges and Vevea's (1996) model for effect size estimation before selection for significance. To take selection for significance into account, the likelihood function of the ML model is a product of k conditional densities; each term is the conditional density of the test statistic T j , given N j = n j and T j > c j , the critical value.
Likelihood function. The model assumes that sample sizes and effect sizes are independent before the selection for significance. Suppose that the distribution of effect size before selection is continuous with probability density gθ(es). This notation indicates that the distribution of effect size depends on an unknown parameter or parameter vector θ. In the appendix, it is shown that the likelihood function (a function of θ) is a product of k terms of the form

∫ d(tj, f1(nj) · f2(es)) gθ(es) des / ∫ [1 − p(cj, f1(nj) · f2(es))] gθ(es) des,

where the integrals denote areas under curves that can be computed with R's integrate function. The maximum likelihood estimate is the parameter value yielding the highest product. To be applicable to actual data, the ML model has to make assumptions about the distribution of effect sizes. The ML model that was used in the simulation studies assumed a gamma distribution of effect sizes. A gamma distribution is defined by two parameters that need to be estimated from the data. The estimated effect size distribution is then combined with information about sample sizes to obtain power estimates for each study. An estimate of population mean power is then produced by averaging estimated power for the k significance tests. As shown in the appendix, the terms to be averaged are

∫ [1 − p(cj, f1(nj) · f2(es))]² gθ(es) des / ∫ [1 − p(cj, f1(nj) · f2(es))] gθ(es) des,    (4)

with θ set to its maximum likelihood estimate.

Z-curve

Z-curve follows traditional meta-analyses that convert p-values into Z-scores as a common metric to integrate results from different original studies (Rosenthal, 1979; Stouffer et al., 1949). The use of Z-scores as a common metric makes it possible to fit a single function to p-values arising from different statistical methods and tests. The method is based on the simplicity and tractability of power analysis for Z-tests, in which the distribution of the test statistic under the alternative hypothesis is just a standard normal shifted by a fixed quantity that plays the role of a non-centrality parameter, and will be denoted by m. Input to z-curve is a sample of p-values, all less than α = 0.05. These p-values are processed in several steps to produce an estimate.
1. Convert p-values to Z-scores. The first step is to imagine, for simplicity, that all the p-values arose from two-tailed Z-tests in which results were in the predicted direction. This is equivalent to an upper-tailed Z-test. In our simulations, alpha was set to .05, which results in a selection criterion of z = 1.96. The conversion to Z-scores (Stouffer et al., 1949) consists of finding the test statistic Z that would have produced that p-value. The formula is

Z = qnorm(1 − p/2).

2. Set aside Z > 6. We set aside extreme z-scores. This avoids fitting a large number of normal distributions to extremely small p-values. This step has no influence on the final result because all of these p-values have an observed power of 1.00 (rounded to the second decimal). This step also avoids numerical problems that arise from small p-values being rounded to 0.
3. Fit a finite mixture model. Before selection for significance and before setting aside values above six, the distribution of the test statistic Z given a particular non-centrality parameter value m is normal with mean m. Afterwards, it is a normal distribution truncated on the left at the critical value c (usually 1.96) and on the right at 6, and rescaled to have area one under the curve. Because of heterogeneity in sample size and effect size, the full distribution of Z is an average of truncated normals, with potentially a different value of m for each member of the population. As a simplification, heterogeneity in the distribution of Z is represented as a finite mixture with r components. The model is equivalent to the following two-stage sampling plan.
First, select a non-centrality parameter m from m 1 , . . . , m r according to the respective probabilities w 1 , . . . , w r . Then generate Z from a normal distribution with mean m and standard deviation one. Finally, truncate and re-scale.
Under this approximate model, the probability density function of the test statistic after selection for significance is

f(z) = Σ wj d(z − mj) / [p(6 − mj) − p(c − mj)]  for c < z < 6,    (6)

where the sum runs over the r components, and d and p denote the standard normal density and cumulative distribution functions. The finite mixture model is only an approximation because it approximates k truncated standard normal distributions with a smaller set of truncated standard normal distributions. Preliminary studies showed negligible differences between models with 3 or more components. Thus, the z-curve method that was used in the simulation studies approximated the observed distribution of z-scores between 1.96 and 6 with three truncated standard normal distributions. The observed density was estimated from the observed z-scores using the kernel density estimate (Silverman, 1986) as implemented in R's density function, with the default settings.
The default settings use a Gaussian kernel and 512 nodes. The most critical default parameter is the bandwidth. The default bandwidth is 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34, times the sample size raised to the negative one-fifth power (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html).
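Written out explicitly, this default (Silverman's rule of thumb, bw.nrd0 in R) can be sketched in one line; the function name below is ours:

# Default bandwidth of density(): Silverman's rule of thumb.
bw.silverman = function(z) 0.9 * min(sd(z), IQR(z)/1.34) * length(z)^(-1/5)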
Specifically, the fitting step proceeds as follows.
First, obtain the kernel density estimate based on the sample of significant Z values, re-scaling it so that the area under the curve between 1.96 and 6 equals one. To do so, all density values are divided by the sum of the density values times the bandwidth parameter of the density function. Then, numerically choose wj and mj values so as to minimize the sum of absolute differences between Expression (6) and the density estimate.
4. Estimate mean power for Z < 6. The estimate of the rejection probability upon replication for Z < 6 is the area under the curve above the critical value, with weights and non-centrality values from the curve-fitting step. The estimate is

Σ wj [1 − p(c − mj)],    (7)

where w1, . . . , wr and m1, . . . , mr are the values located in Step 3. Note that while the input data are censored both on the left and right as represented in Formula (6), there is no truncation in Formula (7) because it represents the distribution of Z upon replication.
5. Re-weight using Z > 6. Let q denote the proportion of the original set of Z statistics with Z > 6. Again, we assume that the probability of significance for those tests is essentially one. Bringing this in as one more component of the mixture estimate, the final estimate of the probability of rejecting the null hypothesis for an exact replication of a randomly selected test is

(1 − q) Σ wj [1 − p(c − mj)] + q.

By Theorem 1, this is also an estimate of population true mean power after selection. Unlike the other estimation methods, z-curve does not require information about sample size. Unlike p-curve 2.1 and p-uniform, z-curve does not assume a fixed effect size. Finally, z-curve does not make assumptions about the distribution of true effect sizes or true power, but approximates the actual distribution with a weighted combination of three standard normal distributions.
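The five steps can be condensed into a short R sketch. This is our simplified illustration with hard-coded choices (three components, the default kernel density grid); the z-curve code used in the simulation studies is available at https://osf.io/w8nq4 and differs in detail.

# Z-curve sketch: estimate mean power from significant two-tailed p-values.
zcurve.sketch = function(p, r = 3, crit = 1.96) {
  z = qnorm(1 - p/2)       # Step 1: p-values to Z-scores
  q = mean(z > 6)          # Step 5 weight: proportion of extreme z-scores
  z = z[z > crit & z <= 6] # Step 2: set aside Z > 6
  dens = density(z)        # kernel density estimate, default bandwidth
  keep = dens$x > crit & dens$x < 6
  x = dens$x[keep]
  y = dens$y[keep] / (sum(dens$y[keep]) * diff(dens$x)[1]) # area one
  loss = function(par) {   # Step 3: mixture of truncated normals
    w = exp(par[1:r]); w = w / sum(w)
    m = par[(r + 1):(2 * r)]
    f = sapply(1:r, function(j)
      dnorm(x - m[j]) / (pnorm(6 - m[j]) - pnorm(crit - m[j])))
    sum(abs(as.vector(f %*% w) - y)) # sum of absolute differences
  }
  par = optim(c(rep(0, r), seq(2, 4, length.out = r)), loss)$par
  w = exp(par[1:r]); w = w / sum(w)
  m = par[(r + 1):(2 * r)]
  pow = sum(w * (1 - pnorm(crit - m))) # Step 4: mean power for Z < 6
  (1 - q) * pow + q                    # Step 5: re-weight using Z > 6
}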

Simulations
The simulations reported here were carried out using the R programming environment (R Core Team, 2017) distributing the computation among 70 quad core Apple iMac computers. The R code is available in the supplementary materials, at https://osf.io/bvraz.
In the simulations, the four estimation methods (p-curve 2.1, p-uniform, maximum likelihood and z-curve) were applied to samples of significant chi-squared or F statistics, all with p < 0.05. This covers most cases of interest, since t statistics may be squared to yield F statistics, while Z may be squared to yield chi-squared with one degree of freedom.

Heterogeneity in Sample Size Only: Effect size fixed
Sample sizes after selection for significance were randomly generated from a Poisson distribution with mean 86, so that they were approximately normal, with population mean 86 and population standard deviation 9.3. Population mean power, number of test statistics on which the estimates were based, type of test (chi-squared or F) and (numerator) degrees of freedom were varied in a complete factorial design. Within each combination, we generated 10,000 samples of significant test statistics and applied the four estimation methods to each sample. In these simulations, it was not necessary to simulate test statistic values and then literally select those that were significant. A great deal of computation was saved by using the R functions rsigF and rsigCHI (available from the supplementary materials) to simulate directly from the distribution of the test statistic after selection. A description of the simulation method and a proof of its correctness are given in the appendix.
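The idea behind these functions can be sketched with the inverse-CDF method. The sketch below is our reconstruction for the F-test case; the actual rsigF code is in the supplementary materials.

# Sample F statistics directly from the post-selection distribution:
# draw uniformly from the part of the CDF above the critical value,
# then invert. No non-significant values are generated and discarded.
rsigF.sketch = function(k, df1, df2, ncp, alpha = 0.05) {
  crit = qf(1 - alpha, df1, df2)
  lo = pf(crit, df1, df2, ncp = ncp) # P(T <= c) under the alternative
  u = runif(k, min = lo, max = 1)
  qf(u, df1, df2, ncp = ncp)         # all values exceed the critical value
}
Fstat = rsigF.sketch(10000, df1 = 1, df2 = 84, ncp = 3.44)
min(Fstat) > qf(0.95, 1, 84)         # TRUE: every simulated test significant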

The first simulation had a 4 × 5 × 3 design, with true power after selection for significance (.05, .25, .50, and .75), number of test statistics k on which estimates were based (15, 25, 50, 100, and 250) and numerator degrees of freedom (just degrees of freedom for the chi-squared tests; 1, 3 and 5) as factors. To obtain the desired levels of power, we used the effect size metric f for F-tests and w for chi-squared tests (Cohen, 1988, p. 216).
Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.
Average performance. Table 1 shows means and standard deviations of the mean power estimates based on 10,000 simulations in each cell of the design. Differences between the estimates and the true values represent systematic bias in the estimates. The results show that all methods performed fairly well, with z-curve showing more bias than the other methods, especially for small sets of studies.
Absolute error of estimation. Although the standard deviations in Table 1 provide some information about estimation errors in individual simulations, we also computed mean absolute errors, abs(True Power − Estimated Power), to supplement this information. With 50% power, at least 100 studies would be needed to reduce the mean absolute error to less than 6% for all methods. Thus, fairly large sets of studies are needed to obtain precise estimates of mean power.

Heterogeneity in Both Sample Size and Effect Size
The results of the first simulation study were reassuring in that our methods performed well under conditions that were consistent with model assumptions. P-curve 2.1, p-uniform and the ML model performed better than z-curve because they used information about sample sizes and correctly assumed that all studies have the same population effect size. However, our main goal was to test these methods under more realistic conditions where effect sizes vary across studies.
To model heterogeneity in effect size, we let effect size before selection vary according to a gamma distribution (Johnson et al., 1995), a flexible continuous distribution taking positive values. Sample size before selection remained Poisson distributed with a population mean of 86. For convenience, sample size and effect size were independent before selection for significance. The maximum likelihood model correctly assumed a gamma distribution for effect size, and the likelihood search was over the two parameters of the gamma distribution. The other three methods were not modified in any way. P-curve 2.1 and p-uniform continued to assume a fixed effect size, and z-curve continued to assume heterogeneity in the non-centrality parameter without distinguishing between heterogeneity in sample size and heterogeneity in effect size.
We used the same design as in Study 1 with one additional factor: the amount of heterogeneity in effect size, as represented by the standard deviation of the effect size distribution; Figure 3 shows the resulting effect size distributions. We dropped the condition with 5% power because it implies a fixed effect size of 0. We also varied the number of test statistics in a simulation (k = 100, 250, 500, 1,000 or 2,000), experimental degrees of freedom (1, 3 or 5), and type of test (F or chi-squared). Within each cell of the design, ten thousand significant test statistics were randomly generated, and population mean power was estimated using all four methods. For brevity, we only present results for F-tests with numerator df = 1. Full results are given in the supplementary materials.
[Figure 3. Effect size distributions (Cohen's d). Heterogeneity: black = .1, blue = .2, red = .3; power: solid = 25%, dots = 50%, dashes = 75%.]

In our simulations with heterogeneity in effect sizes, maximum likelihood is computationally demanding. Using R's integrate function, the calculation involves fitting a histogram to each curve and then adding the areas of the bars. Numerical accuracy is an issue, especially for ratios of areas when the denominators are very small. In addition, it is necessary to try more than one starting value to have a hope of locating the global maximum, because the likelihood function has many local maxima. In our simulations, we used three random starting points. The ML model benefited from the fact that it assumed a gamma distribution of effect sizes, which matched the simulated effect size distributions. In contrast, z-curve made no assumptions, and the other two methods falsely assumed a fixed effect size.
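For concreteness, a single likelihood term of the form shown earlier can be computed as follows. This is our sketch for F(1, n − 2) tests with a gamma distribution of effect sizes; the function name and parameterization are ours.

# One term of the ML likelihood: conditional density of a significant
# F statistic, integrating over a gamma(shape, rate) effect size prior.
lik.term = function(Fj, n, shape, rate, alpha = 0.05, rho = 0.5) {
  crit = qf(1 - alpha, 1, n - 2)
  ncp = function(es) n * rho * (1 - rho) * es^2
  num = integrate(function(es)
    df(Fj, 1, n - 2, ncp = ncp(es)) * dgamma(es, shape, rate), 0, Inf)$value
  den = integrate(function(es)
    (1 - pf(crit, 1, n - 2, ncp = ncp(es))) * dgamma(es, shape, rate),
    0, Inf)$value
  num / den # small denominators are the numerical hazard noted above
}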
Average performance. Table 3 shows estimated population mean power as a function of true population mean power. Results were consistent with the differences in assumptions. P-curve 2.1 and p-uniform overestimated mean power, and this bias increased with increasing heterogeneity and increasing mean power. Z-curve estimates were actually better than in the previous simulations with fixed effect sizes. The maximum likelihood model produced the most accurate estimates, presumably because it anticipated the actual effect size distribution.
Absolute error of estimation. Table 4 shows the mean absolute error of estimation. It confirms the pattern of results seen in Table 3. Most important are the large absolute errors for the two methods that assumed a fixed effect size. These large mean absolute errors are obtained despite small standard deviations because p-curve 2.1 and p-uniform systematically overestimate mean power. Large sample sizes cannot correct for systematic estimation errors. These results show that fixed effect size models cannot be used for the estimation of mean power when there is substantial heterogeneity in power. The results also show that the differences between z-curve and the ML model are slight and have no practical significance. The good performance of z-curve is encouraging because it does not require assumptions about the effect size distribution.

Violating the Assumptions of the ML model
In the preceding simulation study, heterogeneity in effect size before selection was modeled by a gamma distribution, with effect size independent of sample size before selection. The maximum likelihood model had a substantial and arguably unfair advantage, since the simulation was consistent with the assumptions of the ML model. It is well known that maximum likelihood models are very accurate compared to other methods when their assumptions are met (Stuart & Ord, 1999, Ch. 18). We used a beta distribution of effect sizes to examine how the ML model performs when its assumption of a gamma distribution is violated.
In this simulation, z-curve may have the upper hand because it makes no assumptions about the distribution of effect sizes or the correlation between effect sizes and sample sizes. It is well known that selection for significance (e.g. publication bias) introduces a correlation between sample sizes and effect sizes. However, there might also be negative correlations between sample sizes and effect sizes before selection for significance if researchers conduct a priori power analysis to plan their studies or if researchers learn from non-significant results that they need larger samples to achieve significance.
The design of this simulation study was similar to the previous design, but we only simulated the most extreme heterogeneity (SD = .3) condition and added a factor for the correlations between sample size and effect size (r = 0, -.2, -.4, -.8). As before, we ran 10,000 simulations in each condition.
To make results comparable to the results in Table 4, we show the results for the simulation with k = 1,000 per simulated meta-analysis. Figure 4 shows the effect size distributions after selection for significance. As before, effect sizes were transformed into Cohen's d values so that they can be compared to the distributions in Figure 3.

[Figure 4. Effect size distributions (Cohen's d) after selection for significance. Correlation: black = 0, red = −.8; power: solid = 25%, dots = 50%, dashes = 75%.]

Average performance. Table 5 shows average estimated population mean power as a function of the correlation between sample size and effect size and different levels of power. One interesting finding is that the correlation between effect size and sample size has no influence on any of the four estimation methods. This is reassuring because the correlation before selection for significance is typically unknown.
As Table 5 shows, p-curve 2.1 and p-uniform again overestimated mean power. More important is the comparison of the ML model and z-curve. Both methods perform reasonably well with mean true power of 50%, although z-curve performs slightly better. With low or high power, however, the ML model overestimates mean power by 5 and 8 percentage points, respectively. The bias for z-curve is smaller, although even z-curve overestimates high power by 4 percentage points. We explored the cause of this systematic bias and found that it is caused by the default bandwidth method with smaller sets of studies. When we set the bandwidth to a value of 0.05, z-curve estimates with a correlation of zero were .235, .492, and .743, respectively.

Discussion
In this paper, we have compared four methods for estimating the mean statistical power of a heterogeneous population of significance tests, after selection for significance. We have discovered and formally proved a set of theorems relating the distribution of power values before and after selection for significance.

Mean Power and Replicability
Several events in 2011 have triggered a crisis of confidence about the replicability and credibility of published findings in psychology journals. As a result, there have been various attempts to assess the replicability of published results. The most impressive evidence comes from the Open Science Reproducibility Project, which conducted 100 replication studies of results from articles published in 2008. The key finding was that 50% of significant results from cognitive psychology could be replicated successfully, whereas only 25% of significant results from social psychology could be replicated successfully (Open Science Collaboration, 2015).
Social psychologists have questioned these results. Their main argument is that the replication studies were poorly done: "Nosek's ballyhooed finding that most psychology experiments didn't replicate did enormous damage to the reputation of the field, and that its leaders were themselves guilty of methodological problems" (Nisbett, quoted in Bartlett, 2018). Estimating mean power provides an empirical answer to the question whether replication failures are caused by problems with the original studies or the replication studies. If the original studies achieved significance only by means of selection for significance or other questionable research practices, estimated mean power would be low. In contrast, if original studies had good power and replication failures are due to methodological problems of the replication studies, estimated mean power would be high.
We have applied z-curve to the original studies that were replicated in the Open Science project and found an estimate of 66% mean power (Schimmack & Brunner, 2016). This estimate is higher than the overall success rate of 37% for the actual replication studies, which suggests (though not conclusively) that problems with conducting exact replication studies contributed to the low success rate of 37%. At the same time, the estimate of 66% is considerably lower than the success rate of 97% for the original studies. This discrepancy shows that success rates in journals are inflated by selection for significance, which partially explains replication failures in psychology, especially in social psychology.
This example shows that estimates of mean power provide useful information for the interpretation of replication failures. Without this information, precious resources might be wasted on further replication studies that fail simply because the original results were selected for significance.

Historic Trends in Power
Our statistical approach of estimating mean power is also useful for examining changes in statistical power over time. So far, power analyses of psychology have relied on fixed values of effect sizes that were recommended by Cohen (1962, 1988). However, actual effect sizes may change over time or from one field to another. Z-curve makes it possible to examine the actual power in a field of study and whether this power has changed over time. Despite much talk about improvement in psychological science in response to the replication crisis, mean power has increased by less than 5 percentage points since 2011, and improvements are limited to social psychology (Schimmack, 2018b).

Mean Power as a Quality Indicator
One problem in psychological science is the use of quantitative indicators like number of publications or number of studies per article to evaluate productivity and quality of psychological scientists. We believe that mean power is an important additional indicator of good science.
A single study with good power provides more credible evidence and a sounder theoretical foundation than three or more studies with low power that were selected from a larger population of studies with non-significant results (Schimmack, 2012). However, without quantitative information about power, it is unclear whether reported results are trustworthy. Reporting the mean power of studies from a lab or a particular field of research can provide this information. This information can be used by journalists or textbook writers to select articles that reported credible empirical evidence that is likely to replicate in future studies. Simonsohn et al. (2017) provided users with a free online app to compute mean power. However, they did not report the performance of their method in simulation studies, and their method has not been peer-reviewed. We evaluated their online method and found that the current version, p-curve 4.06, overestimates mean power under conditions of heterogeneity (Schimmack & Brunner, 2017). Moreover, even heterogeneity in sample sizes alone can produce biased estimates with p-curve 4.06 (Brunner, 2018).

P-Curve Estimates of Mean Power
However, we agree with Simonsohn et al. (2014b) that p-curve 2.0 can be used for the estimation of mean effect sizes and that these estimates are relatively bias free even when there is moderate heterogeneity in effect sizes. Importantly, these estimates are only unbiased for the population of studies that produced significant results; they are inflated estimates for the population of studies before selection for significance. Failing to distinguish these two populations of studies (i.e., before and after selection for significance) has produced a lot of confusion and unnecessary criticism of selection models in general (McShane et al., 2016). While it is difficult to obtain accurate estimates of effect sizes or power before selection for significance from the subset of studies that were selected for significance, p-curve 2.0 provides reasonably good estimates of effect sizes after selection for significance, which is the reason we built p-curve 2.1 in the first place. However, p-curve 2.1, and especially p-curve 4.06, produce biased estimates of mean power even for the set of studies selected for significance. Therefore, we do not recommend using p-curve to estimate mean power.

P-uniform Estimation of Mean Power
Unlike the p-curve authors, the authors of p-uniform limited their method to the estimation of effect sizes before selection for significance. We used their estimation method to create a method for the estimation of mean power after selection. Like p-curve, the method had problems with heterogeneity in effect sizes, and it performed even worse than p-curve. Recently, the developers of p-uniform changed the estimation method to make it more robust in the presence of heterogeneity and outliers (van Aert et al., 2016).
The new approach simply averages the rescaled p-values and finds the effect size that produces a mean p-value of 0.50. This method is called the Irwin-Hall method. We conducted new simulation studies with this method for the no-correlation condition in Table 5 for 25%, 50%, and 75% true power. We found that it performed much better (24%, 76%, 99%) than the old p-uniform method (85%, 91%, 97%), and slightly better than p-curve 2.1 (40%, 84%, 99%). However, the method still produces inflated estimates for medium and high mean power.
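A minimal sketch of this mean-p approach (ours, for F(1, n − 2) tests, reusing the modified p-values defined in the Estimation Methods section and assuming the solution lies in (0, 2)):

# Mean-p sketch: find the effect size d making the average modified
# p-value equal 0.50, then average the implied power values.
meanp.power = function(Fstat, n, alpha = 0.05, rho = 0.5) {
  crit = qf(1 - alpha, 1, n - 2)
  meanP = function(d) {
    ncp = n * rho * (1 - rho) * d^2
    mean((1 - pf(Fstat, 1, n - 2, ncp)) / (1 - pf(crit, 1, n - 2, ncp)))
  }
  d.hat = uniroot(function(d) meanP(d) - 0.5, c(0, 2))$root
  mean(1 - pf(crit, 1, n - 2, n * rho * (1 - rho) * d.hat^2))
}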

Maximum Likelihood Model
Our ML model is similar to Hedges and Vevea's (1996) ML method that corrects for publication bias in effect size meta-analyses. Although this model has rarely been used in actual applications, it received renewed attention during the current replication crisis. McShane et al. (2016) argued that p-curve and p-uniform produced biased effect size estimates, whereas a heterogeneous ML model produced accurate estimates. However, their focus was on estimating the average effect size before selection for significance. This aim is different from our aim to estimate mean power after selection for significance. Moreover, in their simulation studies the ML model benefited from the fact that the model assumed a normal distribution of effect sizes and this was the distribution of effect sizes in the simulation study. In our simulation studies, the ML model also performed very well when the simulated data met model assumptions. However, estimates were biased when model assumptions differed from the effect size distribution in the data. Hedges and Vevea (1996) also found that their ML model is sensitive to the actual distribution of population effect sizes, which is unknown. The main advantage of z-curve over ML models is that it does not make any distributional assumptions about the data. However, this advantage is limited to the estimation of mean power. Whether it is possible to develop finite mixture models without distributional assumptions for the estimation of the mean effect size after selection for significance remains to be examined.

Future Directions
One concern about z-curve was the suboptimal performance when effect sizes were fixed. However, an improved z-curve method may be able to produce better estimates in this scenario as well. As most studies are likely to have some heterogeneity, we recommend using z-curve as the default method for estimating mean power.
Another issue is to examine the performance of z-curve when researchers used questionable research practices (John et al., 2012). One questionable research practice is to include multiple dependent variables (DVs) and to report only those that produced a significant result. This practice is no different from researchers running multiple exact replication studies with the same dependent variable and reporting only the studies that produced significant results: the probability that a result is selected equals the true power of the study with the chosen DV, and the probability that the finding will replicate also equals the true power for the chosen DV. Power can vary across DVs, but the power of the DVs that were discarded is irrelevant.
Things become more complicated, however, if multiple DVs are selected or if only the strongest result is selected among several significant DVs (van Aert et al., 2016). Some questionable research practices may cause z-curve to underestimate mean power. For example, researchers who conduct studies with moderate power may deal with marginally significant results by removing a few outliers to get a just significant result (John et al., 2012). This would create a pile of z-scores close to the critical value, leading z-curve to underestimate mean power. We recommend inspecting the z-curve plot to look for this QRP, which should produce a spike in z-scores just above 1.96.
Another issue is that studies may use different significance thresholds. Although most studies use p < .05 (two-tailed) as a criterion, some studies use more stringent criteria, for example to correct for multiple comparisons. Including these results would lead to an overestimation of mean power, just as using p < .05, one-tailed, as a criterion would lead to overestimation because most studies used the more stringent two-tailed criterion to select for significance.
One solution would be to exclude studies that did not use alpha = .05 or to run separate analyses for sets of studies with different criteria for significance. However, these results are currently so rare that they have no practical consequences for mean power estimates.

Conclusion
Although this article is the formal introduction of z-curve, we have been writing about z-curve and applications of z-curve on social media since 2015. Thus, there have already been peer-reviewed criticisms of our aims and methods before we were able to publish the method itself. We would like to take this opportunity to correct some of these criticisms and to ask future critics to base their criticism on this article.
De Boeck and Jeon (2018) claim that estimation methods for mean power are problematic because they "aim at rather precise replicability inferences based on other not always precise inferences, without knowing the true values of the effect size and whether the effect is fixed or varies" (p. 769). Contrary to this claim, our simulations show that z-curve can provide precise estimates of replicability, that is, the success rate in a set of exact replication studies, without information about population effect sizes. To do so, only test statistics or exact p-values are needed. If neither these nor related statistical information (e.g., means, SDs, and N) is reported, an article does not contain quantitative information.
We hope that researchers will use z-curve (https://osf.io/w8nq4) to estimate mean power when they conduct meta-analyses. Hopefully, the reporting of mean power will help researchers to pay more attention to power when they plan future studies, and we might finally see an increase in statistical power, more than 50 years after Cohen (1962) pointed out the importance of power for good psychological science.
More awareness of the actual power in psychological science could also be beneficial for grant applications to fund research projects properly and to reduce the need for questionable research practices to boost power by inflating the risk of type-I errors. Thus, we hope that estimation of mean power serves the most important goal in science, namely to reduce errors. Conducting studies with adequate power reduces type-II errors (false negatives) and in the presence of selection bias it also reduces type-I errors. The downside appears to be that fewer studies would be published, but underpowered studies selected for significance do not provide sound empirical evidence. Maybe reducing the number of published studies would be beneficial, or to paraphrase Cohen (1990), "Less is more, except for statistical power".

Author Contributions
Most of the ideas in this paper were developed jointly. An exception is the z-curve method, which is solely due to Schimmack. Brunner is responsible for the theorems.
Appendix

We present proofs of six theorems about the relationship between power and the outcome of replication studies. The first two theorems are assumptions of z-curve. The other four theorems are theoretically interesting, very useful for simulation studies, and can be used to further develop z-curve in the future. The theorems are also illustrated with a numerical example. Consider a population of F-tests with 3 and 26 degrees of freedom, and varying true power values. Variation in power comes from variation in the non-centrality parameter, which is sampled from a chi-squared distribution with degrees of freedom chosen so that population mean power is very close to 0.80.
Denoting a randomly selected power value by G and the non-centrality parameter by λ, population mean power is

E(G) = ∫ [1 − pf(c; 3, 26, λ)] f(λ) dλ,

where c is the critical value and f(λ) is the chi-squared density of the non-centrality parameter. To verify the numerical value of expected power for the example,

> alpha = 0.05; criticalvalue = qf(1-alpha,3,26)
> fun = function(ncp,DF)
+   (1 - pf(criticalvalue,df1=3,df2=26,ncp))*dchisq(ncp,DF)
> integrate(fun,0,Inf,DF=14.36826)
0.8000001 with absolute error < 5.9e-06

The strange fractional degrees of freedom were located using the R function uniroot, minimizing the absolute difference between the output of integrate and the value 0.8 numerically over the degrees of freedom value. The minimum occurred at 14.36826.

Theorem 1 Population mean true power equals the overall probability of a significant result.
Proof. Suppose that the distribution of true power is discrete. Again denoting a randomly chosen power value by G, the probability of rejecting the null hypothesis is

P{T > c} = Σ_g P{T > c | G = g} P{G = g} = Σ_g g P{G = g} = E(G),

which is population mean power. If the distribution of power is continuous with probability density function fG(g), the calculation is

P{T > c} = ∫ g fG(g) dg = E(G).

Continuing with the numerical example, we first sample one million non-centrality parameter values from the chi-squared distribution that yields an expected power of 80%. These values are in the vector NCP. We then calculate the corresponding power values, placing them in the vector Power. Next, we generate one million random F statistics from non-central F distributions, using the non-centrality parameter values in NCP. In the R output below, observe that mean power is very close to the proportion of F statistics exceeding the critical value. This illustrates Theorem 1 for the distribution of power before selection.

> NCP = rchisq(1000000,df=14.36826)
> Power = 1 - pf(criticalvalue,df1=3,df2=26,ncp=NCP)
> Fstat = rf(1000000,df1=3,df2=26,ncp=NCP)
> mean(Power); mean(Fstat > criticalvalue)

Needless to say, Theorem 1 applies both before and after selection. To show how Theorem 1 applies to the distribution of power after selection, the sub-population of power values corresponding to significant results is stored in SigPower. The tests that were significant are repeated (with the same non-centrality parameters), and the test statistics placed in Fstat2. The proportion of test statistics in Fstat2 that are significant is very close to the mean of SigPower. This gives empirical support to the statement that population mean power after selection for significance equals the probability of obtaining a significant result again.

> SigPower = subset(Power,Fstat>criticalvalue)
> mean(SigPower) # Mean power after selection
[1] 0.8274357
> # Replicate the tests that were significant.
> sigNCP = subset(NCP,Fstat>criticalvalue)
> sigF = subset(Fstat,Fstat>criticalvalue)
> Fstat2 = rf(length(sigF),df1=3,df2=26,ncp=sigNCP)
> # Proportion of replications significant
> length(subset(Fstat2,Fstat2>criticalvalue)) /
+   length(sigF)
[1] 0.827172

Theorem 2
The effect of selection for significance is to multiply the probability of each power value by a quantity equal to the power value itself, divided by population mean power before selection. If the distribution of power is continuous, this statement applies to the probability density function.
Proof. Suppose the distribution of power is discrete. Using Bayes' Theorem,

$$P\{G = g \mid T > c\} = \frac{P\{T > c \mid G = g\} \, P\{G = g\}}{P\{T > c\}} = \frac{g \, P\{G = g\}}{E(G)}. \quad (10)$$

If the distribution of power is continuous with density $f_G(g)$,

$$P\{G \le g \mid T > c\} = \frac{P\{G \le g, \, T > c\}}{P\{T > c\}} = \frac{\int_0^g t \, f_G(t) \, dt}{E(G)}.$$

By the Fundamental Theorem of Calculus, the conditional density of power given significance is

$$f_G(g \mid T > c) = \frac{g \, f_G(g)}{E(G)}. \quad (11)$$

For the numerical example we are pursuing by simulation, deriving the density function of power before selection is a technical challenge, and we will not attempt it. As a substitute, suppose that power before selection follows a beta distribution, a very flexible family on the interval from zero to one (Johnson et al., 1995). If power before selection (denoted by $G$) has a beta distribution with parameters $\alpha$ and $\beta$, Theorem 2 says that the density of power after selection (a function of the power value $g$) is

$$g \cdot \frac{g^{\alpha-1}(1-g)^{\beta-1}}{B(\alpha, \beta) \, E(G)} = \frac{g^{(\alpha+1)-1}(1-g)^{\beta-1}}{B(\alpha+1, \beta)},$$

using $E(G) = \alpha/(\alpha+\beta)$ and $B(\alpha+1, \beta) = B(\alpha, \beta) \, \alpha/(\alpha+\beta)$. This is again a beta density, this time with parameters $\alpha + 1$ and $\beta$. M.A.L.M. van Assen has pointed out the similarity of this result to conjugate prior-posterior updating in Bayesian statistics. Figure 5 shows how a beta with $\alpha = 2$ and $\beta = 4$ is transformed into a beta with $\alpha = 3$ and $\beta = 4$.
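The beta example is easy to check by simulation. In the minimal sketch below (our own illustration, not the paper's code), each test is significant with probability equal to its own power value, so the surviving power values should follow the Beta(3, 4) density:

set.seed(123)
G    <- rbeta(10^6, 2, 4)          # power before selection: Beta(2, 4)
sig  <- runif(10^6) < G            # each test significant with probability G
Gsig <- G[sig]                     # power values surviving selection
mean(Gsig)                         # close to 3/(3+4) = 0.4286, the Beta(3, 4) mean
hist(Gsig, breaks = 50, freq = FALSE)
curve(dbeta(x, 3, 4), add = TRUE)  # histogram tracks the Beta(3, 4) density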
Theorem 3 Population mean power after selection for significance equals the population mean of squared power before selection, divided by the population mean of power before selection.
Proof. Suppose that the distribution of power is discrete. Then using (10),

$$E(G \mid T > c) = \sum_g g \, P\{G = g \mid T > c\} = \sum_g g \cdot \frac{g \, P\{G = g\}}{E(G)} = \frac{E(G^2)}{E(G)}. \quad (12)$$

If the distribution of power is continuous, (11) is used to obtain

$$E(G \mid T > c) = \int_0^1 g \cdot \frac{g \, f_G(g)}{E(G)} \, dg = \frac{E(G^2)}{E(G)}. \quad (13)$$

In the example, SigPower contains the sub-population of power values corresponding to significant results, so Formula 13 can be verified directly, as in the sketch below.
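A verification sketch using the vectors from the running example (outputs omitted; with a million simulated tests, the two quantities should agree closely):

mean(Power^2) / mean(Power)   # E(G^2)/E(G), computed before selection
mean(SigPower)                # mean power after selection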
Theorem 4 Population mean power after selection for significance is greater than or equal to population mean power before selection.

Proof. Using Formula 10, $E(G \mid T > c) = \sum_g g^2 \, P\{G = g\} / E(G) = E(G^2)/E(G)$, so that

$$E(G \mid T > c) = \frac{E(G^2)}{E(G)} \ge \frac{E(G)^2}{E(G)} = E(G),$$

because $E(G^2) \ge E(G)^2$. A similar calculation applies in the continuous case.

To illustrate Theorem 4, recall that the example was constructed so that mean power before selection was equal to 0.80.
In the example, population mean power is 0.80, while population mean power given significance is roughly 0.83. It is reasonable that selecting significant tests would also tend to select higher power values on average, and in fact this intuition is correct. Since $E(G^2) = E(G)^2 + Var(G)$, population mean power given significance is strictly greater than the mean power of the entire population, except in the homogeneous case where $Var(G) = 0$. The exact amount of increase has a compact and somewhat surprising form.
Theorem 5 The increase in population mean power due to selection for significance equals the population variance of power before selection divided by the population mean of power before selection.

Proof. By Theorem 3,

$$E(G \mid T > c) - E(G) = \frac{E(G^2)}{E(G)} - E(G) = \frac{E(G^2) - E(G)^2}{E(G)} = \frac{Var(G)}{E(G)}.$$

Illustrating Theorem 5 for the ongoing example, the increase due to selection is approximately 0.827 − 0.800 = 0.027, which Theorem 5 says must equal $Var(G)/E(G)$; the simulated power values bear this out, as in the sketch below.
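A quick check with the vectors from the running example (outputs omitted; both quantities should be near 0.027 in a simulation of this size):

var(Power) / mean(Power)        # predicted increase, Theorem 5
mean(SigPower) - mean(Power)    # observed increase due to selection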
Theorem 6 The effect of selection for significance on the joint distribution of sample size and effect size is to multiply the probability of each (sample size, effect size) pair by the power of the corresponding test, divided by population mean power before selection.

Proof. Note that power for a given sample size and effect size is $P\{T > c \mid X = es, N = n\}$. Suppose effect size is discrete. Then $P\{X = es, N = n \mid T > c\}$ is

$$\frac{P\{T > c \mid X = es, N = n\} \, P\{X = es, N = n\}}{P\{T > c\}} = \frac{P\{T > c \mid X = es, N = n\} \, P\{X = es, N = n\}}{E(G)},$$

where $E(G)$ is expected power before selection, equal to $P\{T > c\}$ by Theorem 1. Suppose that effect size is continuous with density $g(es)$. The joint distribution of sample size and effect size before selection is determined by $P\{N = n \mid X = es\} \, g(es)$. The joint distribution after selection is determined by

$$P\{N = n \mid X = es, T > c\} \, g(es \mid T > c) = \frac{P\{T > c \mid X = es, N = n\} \, P\{N = n \mid X = es\} \, g(es)}{E(G)}.$$

It is also possible to write the joint distribution of sample size and effect size as the conditional density of effect size given sample size, times the discrete probability of sample size. That is, the joint distribution before selection is determined by $g(es \mid N = n) \, P\{N = n\}$, and the joint distribution after selection is determined by

$$g(es \mid N = n, T > c) \, P\{N = n \mid T > c\} = \frac{P\{T > c \mid X = es, N = n\} \, g(es \mid N = n) \, P\{N = n\}}{E(G)}. \quad (14)$$

Theorem 6 cannot be illustrated for the ongoing numerical example, because the example employs a distribution of the non-centrality parameter, rather than of sample size and effect size jointly. As a substitute, consider that an observed distribution of sample size after selection must imply a distribution of sample size in the unpublished studies before selection. If that distribution is too outlandish (for example, implying an enormous "file drawer" of pilot studies with tiny sample sizes), we may be forced to another model of the research and publication process. Theorem 6 allows one to solve for $P\{N = n\}$, the unconditional probability distribution of sample size before selection, though an estimated or hypothesized distribution of effect size given sample size before selection is needed. When sample size and effect size are deemed independent before selection, this is not a serious obstacle.
Expression (14) says that $g(es \mid N = n, T > c) \, P\{N = n \mid T > c\}$ is equal to

$$\frac{P\{T > c \mid X = es, N = n\} \, g(es \mid N = n) \, P\{N = n\}}{E(G)},$$

so that integrating both sides with respect to $es$,

$$P\{N = n \mid T > c\} = \frac{P\{N = n\}}{E(G)} \int_0^\infty \left[ 1 - p(c, f_1(n) f_2(es)) \right] g(es \mid N = n) \, des,$$

and we have

$$P\{N = n\} = E(G) \left( \frac{P\{N = n \mid T > c\}}{\int_0^\infty \left[ 1 - p(c, f_1(n) f_2(es)) \right] g(es \mid N = n) \, des} \right). \quad (15)$$

The numerator of the fraction is the probability of observing a sample size of $n$ after selection for significance. The denominator is expected power given that sample size, and could be calculated with R's integrate function. By Theorem 1, the quantity $E(G)$ is both population mean power before selection and $P\{T > c\}$, the probability of randomly choosing a significant result from the population of tests before selection. In Equation 15, though, it is just a proportionality constant. In practice, one obtains $P\{N = n\}$ by calculating the fraction in parentheses for each $n$, and then dividing by the total to obtain numbers that add to one.
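A sketch of that recipe, under illustrative assumptions of our own: one-sided Z-tests with $ncp = \sqrt{n} \cdot es$ (the paper does not fix a test here), a hypothesized gamma effect size density independent of sample size, and a small made-up vector nsig of post-selection sample sizes.

# Hypothetical post-selection sample sizes and effect size density
nsig <- sample(c(20, 50, 100, 200), 500, replace = TRUE,
               prob = c(0.4, 0.3, 0.2, 0.1))
g  <- function(es) dgamma(es, shape = 2, rate = 5)
cv <- qnorm(0.95)                      # one-sided Z-test, alpha = 0.05
# Expected power for a given sample size (denominator of Equation 15)
expected_power <- function(n)
  integrate(function(es) (1 - pnorm(cv - sqrt(n) * es)) * g(es), 0, Inf)$value
# Relative frequencies after selection (numerator of Equation 15)
freq  <- table(nsig) / length(nsig)
nvals <- as.numeric(names(freq))
w  <- as.numeric(freq) / sapply(nvals, expected_power)
Pn <- w / sum(w)   # estimated P{N = n} before selection; sums to one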

Maximum Likelihood
Even though sample size is a random variable, the quantities $n_1, \ldots, n_k$ are treated as fixed constants. This is similar to the way that x values in normal regression and logistic regression are treated as fixed constants in the development of the theory, even though they are often random variables in practice. Making the estimation conditional on the observed values $n_1, \ldots, n_k$ allows it to be distribution free with respect to sample size, just as regression and logistic regression are distribution free with respect to x. This is preferable to adopting parametric assumptions about the joint distribution of sample size and effect size.
Suppose there is heterogeneity in both sample size and effect size, and that effect size is continuous. The likelihood function given significance is a product of conditional densities evaluated at the observed values of the test statistics. Each term is the conditional density of the test statistic given both the sample size and the event that the test statistic exceeds its respective critical value.
The joint probability distribution of sample size and effect size before selection is determined by the marginal distribution of sample size $P\{N = n\}$ and the conditional density of effect size given sample size $g_\theta(es \mid n)$, where $\theta$ is a vector of unknown parameters. Denoting the random effect size by $X$, the conditional distribution function of an observed test statistic $T$ given significance and a particular sample size $n$ is

$$P\{T \le t \mid T > c, N = n\} = \frac{\int_0^\infty \left[ p(t, f_1(n) f_2(es)) - p(c, f_1(n) f_2(es)) \right] g_\theta(es \mid n) \, des}{\int_0^\infty \left[ 1 - p(c, f_1(n) f_2(es)) \right] g_\theta(es \mid n) \, des},$$

and differentiating with respect to $t$ yields the conditional density, where moving the derivative through the integral sign is justified by dominated convergence. The likelihood function is a product of $k$ such terms. In the main paper, the simplifying assumption that sample size and effect size are independent before selection means that $g_\theta(es \mid n)$ is replaced by $g_\theta(es)$, yielding Expression (3). In the problem of estimating power under heterogeneity in effect size, the unknown parameter is the vector $\theta$ in the density of effect size. Let $\hat\theta$ denote the maximum likelihood estimate of $\theta$. This yields a maximum likelihood estimate of the true power of each individual test in the sample, and the estimates are then averaged to obtain an estimate of mean power. We now give details.
Consider randomly sampling a single test from the population of tests that were significant the first time they were carried out. Let $T_1$ denote the value of the test statistic the first time a hypothesis is tested, and let $T_2$ denote the value of the test statistic the second time that particular hypothesis is tested, under exact repetition of the experiment. Conditionally on fixed values of sample size $n$ and effect size $es$, $T_1$ and $T_2$ are independent. By Theorem 1, population mean power after selection is

$$P\{T_2 > c \mid T_1 > c\} = \sum_n P\{T_2 > c \mid T_1 > c, N = n\} \, P\{N = n \mid T_1 > c\}. \quad (16)$$

This is the expression we seek to estimate. Applying Theorem 3 to the sub-population of tests based on a sample of size $n$,

$$P\{T_2 > c \mid T_1 > c, N = n\} = \frac{\int_0^\infty \left[ 1 - p(c, f_1(n) f_2(es)) \right]^2 g_\theta(es \mid n) \, des}{\int_0^\infty \left[ 1 - p(c, f_1(n) f_2(es)) \right] g_\theta(es \mid n) \, des}. \quad (17)$$

Substituting (17) into (16) yields

$$P\{T_2 > c \mid T_1 > c\} = \sum_n \frac{\int_0^\infty \left[ 1 - p(c, f_1(n) f_2(es)) \right]^2 g_\theta(es \mid n) \, des}{\int_0^\infty \left[ 1 - p(c, f_1(n) f_2(es)) \right] g_\theta(es \mid n) \, des} \; P\{N = n \mid T_1 > c\}. \quad (18)$$
Expression (18) has two unknown quantities: the parameter $\theta$ of the effect size distribution, and $P\{N = n \mid T_1 > c\}$. For the former quantity, we use the maximum likelihood estimate, while the $P\{N = n \mid T_1 > c\}$ values are estimated by the empirical relative frequencies of sample size, which is the non-parametric maximum likelihood estimate. The result is a maximum likelihood estimate of population mean power given significance:

$$\frac{1}{k} \sum_{j=1}^{k} \frac{\int_0^\infty \left[ 1 - p(c_j, f_1(n_j) f_2(es)) \right]^2 g_{\hat\theta}(es \mid n_j) \, des}{\int_0^\infty \left[ 1 - p(c_j, f_1(n_j) f_2(es)) \right] g_{\hat\theta}(es \mid n_j) \, des}.$$
In the simulations, the density $g$ of effect size is assumed to be gamma, there is no dependence on $n$, and the parameter $\theta$ is the pair $(a, b)$ that parameterizes the gamma distribution.
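To make the estimator concrete, here is a minimal self-contained sketch (entirely our own illustration, not the paper's code), assuming one-sided Z-tests with $ncp = \sqrt{n} \cdot es$, a gamma effect size density independent of sample size, and data generated by simulating studies and keeping only the significant ones:

set.seed(2)
cv <- qnorm(0.95)                        # one-sided Z-test, alpha = 0.05
# Simulate studies before selection, then keep only the significant ones
n  <- sample(20:200, 1000, replace = TRUE)
es <- rgamma(1000, shape = 2, rate = 5)  # true effect size distribution
t  <- rnorm(1000, mean = sqrt(n) * es)   # Z statistics, ncp = sqrt(n)*es
keep <- t > cv
tsig <- t[keep]; nsig <- n[keep]

# Negative log likelihood of (a, b): one term per significant test, the
# density of t given significance, marginalized over effect size
negloglik <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])     # log scale keeps a, b positive
  -sum(mapply(function(tj, nj) {
    num <- integrate(function(e)
      dnorm(tj - sqrt(nj)*e) * dgamma(e, a, rate = b), 0, Inf)$value
    den <- integrate(function(e)
      (1 - pnorm(cv - sqrt(nj)*e)) * dgamma(e, a, rate = b), 0, Inf)$value
    log(num) - log(den)
  }, tsig, nsig))
}
fit  <- optim(c(log(1), log(1)), negloglik)
ahat <- exp(fit$par[1]); bhat <- exp(fit$par[2])

# Expression (18) with empirical sample size frequencies: average the
# power ratio in (17) over the observed significant tests
ratio <- function(nj) {
  num <- integrate(function(e)
    (1 - pnorm(cv - sqrt(nj)*e))^2 * dgamma(e, ahat, rate = bhat), 0, Inf)$value
  den <- integrate(function(e)
    (1 - pnorm(cv - sqrt(nj)*e)) * dgamma(e, ahat, rate = bhat), 0, Inf)$value
  num / den
}
mean(sapply(nsig, ratio))  # ML estimate of mean power after selection

The point of the sketch is only to show how Expressions (17) and (18) translate into integrate and optim calls; the Z-test and the specific gamma parameters are placeholders.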

Simulation
Direct simulation from the distribution of the test statistic given significance. To study the behaviour of an estimation method under selection for significance, it is natural to simulate test statistics from the distribution that applies before selection, and then discard the ones that are not significant. But if one can simulate from the joint distribution of sample size and effect size after selection, the wasteful discarding of non-significant test statistics can be avoided. The idea is to do the simulation in two stages. First, simulate pairs from the joint distribution of sample size and effect size after selection, and calculate a non-centrality parameter as $ncp = f_1(n) f_2(es)$. Then, using that ncp value, simulate from the distribution of the test statistic given significance. We now show how to do the second step.
It is well known that if $F(t)$ is the cumulative distribution function of a continuous random variable and $U$ is uniformly distributed on the interval from zero to one, then the random variable $T = F^{-1}(U)$ has cumulative distribution function $F(t)$. In this case, the cumulative distribution function from which we wish to simulate is $P\{T \le t \mid T > c, X = es, N = n\}$ for $t > c$, where as usual $ncp = f_1(n) f_2(es)$. To obtain the inverse, set $u$ equal to the probability and solve for $t$, as follows. Denoting the power of the test by $\gamma = 1 - p(c, ncp)$,

$$u = \frac{p(t, ncp) - p(c, ncp)}{1 - p(c, ncp)}
\iff u \left( 1 - p(c, ncp) \right) = p(t, ncp) - p(c, ncp)
\iff p(t, ncp) = u \left( 1 - p(c, ncp) \right) + p(c, ncp)
\iff p(t, ncp) = \gamma u + 1 - \gamma
\iff t = q(\gamma u + 1 - \gamma, ncp).$$
Since $1 - U$ also has a Uniform(0,1) distribution, one may proceed as follows. For a given sample size and effect size, first calculate the non-centrality parameter $ncp = f_1(n) f_2(es)$, and use it to compute the power value $\gamma = 1 - p(c, ncp)$. Then calculate the significant test statistic $T = q(1 - \gamma U, ncp)$, where $U$ is a pseudo-random variate from a Uniform(0,1) distribution. In R, the process can be applied to a vector of ncp values and a vector of independent U values of the same length. Again, this is the second step. The first step is to simulate a collection of ncp values from the desired joint distribution of sample size and effect size after selection for significance. Naturally, simulation is easiest if sample size and effect size come from well-known distributions with built-in random number generation, and if sample size and effect size are specified to be independent after selection. In one of our simulations, sample size and effect size after selection were correlated. The next section describes how this was done.
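For the running F(3, 26) example, the second step might look like this (a sketch; the function name rsigF is ours). For illustration we feed in ncp values from the example's chi-squared distribution; in the actual two-stage scheme these would come from the post-selection joint distribution of sample size and effect size.

# Simulate test statistics that are significant by construction,
# one for each supplied non-centrality parameter value
rsigF <- function(ncp, df1, df2, cv) {
  gamma <- 1 - pf(cv, df1, df2, ncp = ncp)   # power for each ncp
  u <- runif(length(ncp))
  qf(1 - gamma * u, df1, df2, ncp = ncp)     # t = q(1 - gamma*U, ncp)
}
cv <- qf(0.95, 3, 26)
x  <- rsigF(rchisq(10000, df = 14.36826), 3, 26, cv)
all(x > cv)  # TRUE: every simulated statistic is significant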
Correlated sample size and effect size. Let effect size $X$ have density $g_\theta(es)$, where $\theta$ represents a vector of parameters for the distribution of effect size. Conditionally on $X = es$, let sample size be Poisson distributed with expected value $\exp(\beta_0 + \beta_1 es)$. This is standard Poisson regression. Simulation from the joint distribution is easy: one simply simulates an effect size $es$ according to the density $g_\theta$, computes the Poisson parameter $\lambda = \exp(\beta_0 + \beta_1 es)$, and then samples a value $n$ from a Poisson distribution with parameter $\lambda$. The challenge is to choose the parameters $\theta$, $\beta_0$ and $\beta_1$ so that after selection, (a) the population mean power has a desired value, and at the same time (b) the population correlation between sample size and effect size has a desired value. Population mean power is

$$\gamma = \int_0^\infty \sum_n \left[ 1 - p(c, f_1(n) f_2(es)) \right] P\{N = n \mid X = es\} \, g_\theta(es) \, des.$$
Given values of $\theta$, $\beta_0$ and $\beta_1$, this expression can be calculated by numerical integration; recall that $P\{N = n \mid X = es\}$ is a Poisson probability.
The population correlation between sample size and effect size is

$$\rho = \frac{E(NX) - E(N) E(X)}{\sqrt{Var(N) \, Var(X)}},$$

where, because the conditional distribution of $N$ given $X = es$ is Poisson with mean $e^{\beta_0 + \beta_1 es}$,

$$E(N) = \int_0^\infty e^{\beta_0 + \beta_1 es} \, g_\theta(es) \, des, \qquad
E(NX) = \int_0^\infty es \, e^{\beta_0 + \beta_1 es} \, g_\theta(es) \, des,$$

and $Var(N) = E(N) + Var\!\left( e^{\beta_0 + \beta_1 X} \right)$ by conditioning on $X$. The moments $E(X)$ and $Var(X)$ come directly from the density $g_\theta$.
All these expected values can be calculated by numerical integration using R's integrate function, so that the correlation $\rho$ can be evaluated for any set of $\theta$, $\beta_0$ and $\beta_1$ values. In our simulation of correlated sample size and effect size, $g_\theta(es)$ was a beta density, re-parameterized so that $\theta = (\mu, \sigma^2)$ consisted of the mean $\mu$ and variance $\sigma^2$. Conditionally on effect size, sample size was Poisson distributed with expected value $\exp(\beta_0 + \beta_1 es)$. We set the variance of effect size $\sigma^2$ to a fixed value of 0.09, so that the standard deviation of effect size after selection was 0.30, a high value. Given any mean effect size $\mu$ and slope $\beta_1$, the parameter $\beta_0$ (the intercept of the Poisson regression) was adjusted so that the expected sample size at the mean effect size equaled 86: $\beta_0 = \ln(86) - \beta_1 \mu$.
With these constraints, the population mean power $\gamma$ and the correlation $\rho$ were functions of the two free parameters $\mu$ and $\beta_1$. Let $\gamma_0$ be a desired value of mean power, for example $\gamma_0 = 0.5$, and let $\rho_0$ be a desired value of the correlation between sample size and effect size, for example $\rho_0 = -0.8$. Values of $\mu$ and $\beta_1$ were located by numerically minimizing the function $f(\mu, \beta_1) = |\gamma - \gamma_0| + |\rho - \rho_0|$. We used R's optim function.
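A sketch of the calibration, under illustrative assumptions of our own: one-sided Z-tests with $ncp = \sqrt{n} \cdot es$ (the paper's choice of test is not specified here), and the function names gamma_rho and obj are ours. For simplicity, the sketch has no safeguards against parameter values for which the beta re-parameterization is invalid (it requires $\mu(1-\mu) > \sigma^2$).

cv <- qnorm(0.95)
gamma_rho <- function(mu, beta1, sigma2 = 0.09) {
  nu <- mu * (1 - mu) / sigma2 - 1       # beta shapes from (mu, sigma2)
  a  <- mu * nu; b <- (1 - mu) * nu
  beta0 <- log(86) - beta1 * mu          # expected N is 86 at es = mu
  g   <- function(es) dbeta(es, a, b)
  lam <- function(es) exp(beta0 + beta1 * es)
  # Moments by numerical integration; E(N^2 | X) = lam + lam^2 for Poisson
  EX  <- integrate(function(es) es * g(es), 0, 1)$value
  VX  <- integrate(function(es) (es - EX)^2 * g(es), 0, 1)$value
  EN  <- integrate(function(es) lam(es) * g(es), 0, 1)$value
  EN2 <- integrate(function(es) (lam(es) + lam(es)^2) * g(es), 0, 1)$value
  ENX <- integrate(function(es) es * lam(es) * g(es), 0, 1)$value
  rho <- (ENX - EN * EX) / sqrt((EN2 - EN^2) * VX)
  # Mean power: sum over the Poisson distribution of n inside the integral
  pow <- function(es) sapply(es, function(e) {
    n <- 0:2000
    sum((1 - pnorm(cv - sqrt(n) * e)) * dpois(n, lam(e)))
  })
  gam <- integrate(function(es) pow(es) * g(es), 0, 1)$value
  c(gamma = gam, rho = rho)
}
# Find (mu, beta1) giving mean power 0.5 and correlation -0.8
obj <- function(par) {
  gr <- gamma_rho(par[1], par[2])
  abs(gr["gamma"] - 0.5) + abs(gr["rho"] - (-0.8))
}
fit <- optim(c(0.3, -3), obj)

When both targets are attainable, the absolute-difference objective can be driven to nearly zero, so the minimizer delivers both the desired mean power and the desired correlation simultaneously.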