Testing ANOVA Replications by Means of the Prior Predictive p-Value

In the current study, we introduce the prior predictive p-value as a method to test replication of an analysis of variance (ANOVA). The prior predictive p-value is based on the prior predictive distribution. If we use the original study to compose the prior distribution, then the prior predictive distribution contains datasets that are expected given the original results. To determine whether the new data resulting from a replication study deviate from the data in the prior predictive distribution, we need to calculate a test statistic for each dataset. We propose to use $\bar{F}$, which measures to what degree the results of a dataset deviate from an inequality constrained hypothesis capturing the relevant features of the original study: $H_{RF}$. The inequality constraints in $H_{RF}$ are based on the findings of the original study and can concern, for example, the ordering of means and interaction effects. The prior predictive p-value consequently tests to what degree the new data deviate from the data predicted given the original results, considering the findings of the original study. We explain the calculation of the prior predictive p-value step by step, elaborate on the topic of power, and illustrate the method with examples. The replication test and its integrated power and sample size calculator are made available in an R-package and an online interactive application. As such, the current study supports researchers who want to adhere to the call for replication studies in the field of psychology.


Introduction
New studies conducted to replicate earlier original studies are often referred to as replication studies. After the latest "crisis in confidence" in the field of psychology, the call to conduct replication studies is stronger than ever (Anderson & Maxwell, 2016; Asendorpf et al., 2013; Cumming, 2014; Earp & Trafimow, 2015; Ledgerwood, 2014; Open Science Collaboration, 2012; Pashler & Wagenmakers, 2012; Schmidt, 2009; Verhagen & Wagenmakers, 2014), and large replication projects such as the Reproducibility Project Psychology (Open Science Collaboration, 2015), the Reproducibility Project: Cancer Biology (RP:CB; Errington et al., 2019), and the Many Labs projects (Ebersole et al., 2016; Klein et al., 2014; Klein et al., 2018) have been launched. As a result, methodology for conducting replication studies has received increasing attention (see, for example, Anderson and Maxwell, 2016; Asendorpf et al., 2013; Brandt et al., 2014; Schmidt, 2009). There is, however, no standard methodology to determine whether a replication is successful or not (Open Science Collaboration, 2015).
The results of an original study are replicated when a new study corroborates the original findings. A common and intuitive method to assess whether a result is replicated is 'vote-counting': assessing whether the new effect is statistically significant and in the same direction as the significant effect in the original study (Anderson & Maxwell, 2016; Simonsohn, 2015). Vote-counting, however, has serious shortcomings. First of all, it is a dichotomous evaluation that does not take into account the magnitude of differences between the effect sizes of the original and the new study (Asendorpf et al., 2013; Simonsohn, 2015). Secondly, each of the effect sizes being significant does not imply that both effect sizes are the same, nor does one significant effect and one non-significant effect imply that both effects are different (Gelman & Stern, 2006; Nieuwenhuis et al., 2011). Stated otherwise, vote-counting does not formally test whether a result is replicated (Anderson & Maxwell, 2016; Verhagen & Wagenmakers, 2014). Thirdly, underpowered replication studies are less likely to replicate significance, which can lead to misleading conclusions (Asendorpf et al., 2013; Cumming, 2008; Hedges & Olkin, 1980; Simonsohn, 2015).
In the current study, we address the following replication research question: "Does the new study fail to replicate relevant features of the original study?" For example, the result of an original ANOVA study is: Group A > Group B > Group C. The finding can be: "Group A performs better than group B, which performs better than group C"; "Group A performs better than groups B and C"; or "Groups A and B perform better than group C". The 'relevant features' subjected to the replication test always have to be in line with the original result (i.e., Group A > Group B > Group C) for the test to function properly. If the purpose of the replication test is to put the theory proclaimed by the original study to the test, then the claims of the original study determine the exact relevant features to be evaluated. However, if there is reason to test another feature, it is possible to let the relevant features deviate from the claims in the original study. The relevant features of original studies will be captured in the form of an informative hypothesis (Hoijtink, 2012), which is specified using inequality constraints among the means of the ANOVA model. We propose to evaluate the replication of these hypotheses with the prior predictive p-value (Box, 1980).
The prior predictive p-value was not introduced to test replication. It was originally presented as a method to test whether the current data are unexpected given prior expectations concerning the parameter values of a statistical model. A disadvantage of the prior predictive check as a test of model fit is that it leaves undetermined whether the prior expectations about the parameter values or the model assumptions are incorrect. Hence, as a model test the prior predictive check has been replaced by the posterior predictive check (Gelman et al., 1996), which does not make prior assumptions about expected parameter values, but instead uses the posterior results given the current data.
With respect to testing replication, however, the prior predictive check is a good method for three reasons. First, instead of non-empirical prior expectations, we use the posterior distribution of the model parameters given the original data as the prior distribution. Consequently, we have a well-founded and clear-cut prior. Second, the prior predictive check uses a distribution of datasets (i.e., the prior predictive distribution) that are expected given the prior (i.e., the posterior of the original study). In this manner, the prior predictive distribution takes into account that results in a new dataset resulting from a replication study may deviate from the original results because of random variation instead of meaningful differences. According to our definition, a study replicates if the new dataset is drawn from the same population as the original dataset. Third, the prior predictive check uses a 'relevant checking function', for which we propose $\bar{F}$ (Silvapulle & Sen, 2005, pp. 38-39). The statistic $\bar{F}$ captures the deviance from a constrained hypothesis that we base on the findings of the original study. As a result, we can check whether the new study significantly fails to replicate relevant features of the original study, while taking variation into account.
[Table 1: replication research questions and the methods proposed to address them. One row recoverable from the extraction: "What is the effect size (corrected for publication bias) in the population?" — hybrid meta-analysis, t-test, Van Aert and Van Assen (2017). Notes: (a) all models for which a Bayes factor can be computed; (b) the reconceptualization by Ly et al. (2018) generalizes to most common experimental designs; (c) the telescope test is explained in the t-test setting, but is applicable to any model for which a power analysis can be conducted.]

Table 1 shows how our research question and proposed method relate to other replication research questions and associated methods that have been proposed. Our method addresses a question similar to that in Anderson and Maxwell (2016), Harms (2018), Ly et al. (2018), Verhagen and Wagenmakers (2014), and Patil et al. (2016), but now enables researchers to evaluate the replication of relevant features of an original ANOVA study. The bottom panel of Table 1 shows other replication research questions that will not be pursued in this paper. The reader interested in these questions should consult the given references.
The goal of this paper is to introduce the prior predictive p-value as a method to test replication of relevant features of original ANOVA studies. In the first section, we provide a step-by-step introduction of the prior predictive p-value as included in the ANOVAreplication R-package and the online interactive application (see osf.io/6h8x3). In the second section, we discuss the statistical power of the prior predictive p-value. In the third section, we explain how to use and interpret the prior predictive p-value by means of a workflow. In the fourth section, we use several studies from the Reproducibility Project Psychology (Open Science Collaboration, 2012) to demonstrate the use of the prior predictive p-value. The paper ends with a discussion and conclusion section.

Prior Predictive p-Value
The evaluation of the replication of an ANOVA study by means of the prior predictive p-value (Box, 1980) consists of three steps that will be explained below.

Step 1: Prior Predictive Distribution of the Data
The ANOVA model is given by:

$$y_{ijd} = \mu_{jd} + \epsilon_{ijd}, \qquad \epsilon_{ijd} \sim N(0, \sigma^2_d), \qquad (1)$$

where $y_{ijd}$ is observation $i = 1, ..., n_{jd}$ in group $j = 1, ..., J$ for dataset $d \in \{o, r, sim\}$, where $o$ denotes the original data, $r$ denotes the new data, and $sim$ denotes simulated data; the latter will be introduced towards the end of this section. Furthermore, $\mu_{jd}$ is the mean of group $j$ in dataset $d$, $\epsilon_{ijd}$ is the error term, and $\sigma^2_d$ is the pooled variance over all $J$ groups.
The original ANOVA results can be summarized in the posterior distribution of the parameters $g(\mu_o, \sigma^2_o | y_o)$, where $\mu_o = [\mu_{1o}, ..., \mu_{Jo}]$ and $y_o$ includes all observations $y_{ijo}$:

$$g(\mu_o, \sigma^2_o | y_o) \propto f(y_o | \mu_o, \sigma^2_o)\, h(\mu_o, \sigma^2_o), \qquad (2)$$

where

$$f(y_o | \mu_o, \sigma^2_o) = \prod_{j=1}^{J} \prod_{i=1}^{n_{jo}} N(y_{ijo}; \mu_{jo}, \sigma^2_o) \qquad (3)$$

is the density of the data and $h(\mu_o, \sigma^2_o) \propto 1/\sigma^2_o$ is the standard prior distribution, that is, a uniform prior on the means and Jeffreys' prior on the variance. The prior distribution for the analysis of the original data is uninformative, that is, the posterior distribution is completely determined by the original data in order to match the results of the original study. If the original study used a Bayesian analysis, the priors should match those of the original study in order to reproduce the original study results. Given the observed original results, the prior distribution for future parameters is $h(\mu_r, \sigma^2_r) = h(\mu_{sim}, \sigma^2_{sim}) = g(\mu_o, \sigma^2_o | y_o)$. With the prior predictive p-value, we then test $H_0: \mu_r, \sigma^2_r \sim h(\mu_r, \sigma^2_r)$. $H_0$ states that $\mu_r, \sigma^2_r$ follow the distribution of the prior for $\mu_r, \sigma^2_r$. Loosely formulated, $H_0$ states that the parameters in the new data are in line with our expectations given the original results.
To test $H_0$, we obtain datasets that are to be expected given the original data. Using this prior we simulate data $y_{sim}$ that are to be expected given the results of the original study:

$$f(y_{sim}) = \int \int f(y_{sim} | \mu_{sim}, \sigma^2_{sim})\, h(\mu_{sim}, \sigma^2_{sim})\, d\mu_{sim}\, d\sigma^2_{sim}, \qquad (4)$$

where $f(y_{sim})$ is the prior predictive distribution of the data. Note that $f(y_{sim} | \mu_{sim}, \sigma^2_{sim})$ is the counterpart of Equation 3 for dataset $sim$ instead of $o$. Datasets $y^t_{sim}$ for $t = 1, ..., T$, where $T$ denotes the number of samples from the prior predictive distribution, are obtained by sampling $\mu^t_{sim}, \sigma^{2t}_{sim}$ from $h(\mu_{sim}, \sigma^2_{sim})$ and subsequently sampling $y^t_{sim}$ from $f(y_{sim} | \mu^t_{sim}, \sigma^{2t}_{sim})$. Datasets $y^t_{sim}$ have sample sizes $n_{1r}, ..., n_{Jr}$, because the predicted data need to be compared to the new data $y_r$ that has sample sizes $n_{1r}, ..., n_{Jr}$.
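To make Step 1 concrete, the following R sketch simulates the prior predictive distribution for a hypothetical original study (all summary statistics below are assumptions for illustration, not taken from a real study; the ANOVAreplication package implements this internally). It exploits the fact that, under the uniform prior on the means and Jeffreys' prior on the variance, $\sigma^2_o | y_o$ follows a scaled inverse chi-square distribution and $\mu_{jo} | \sigma^2_o, y_o$ is normal.

```r
## Minimal sketch of Step 1 (our illustration, not the ANOVAreplication internals)
set.seed(123)

# summary statistics of the (possibly reconstructed) original study -- assumed values
ybar.o <- c(1, 2, 3)        # group means
n.o    <- c(50, 50, 50)     # group sample sizes
s2.o   <- 5                 # pooled variance (mean squared error)
J      <- length(ybar.o)
N.o    <- sum(n.o)
rss.o  <- s2.o * (N.o - J)  # residual sum of squares of the original study

n.r   <- c(50, 50, 50)      # group sample sizes of the new study
T.sim <- 10000              # number of prior predictive draws

y.sim <- vector("list", T.sim)
for (t in seq_len(T.sim)) {
  # posterior draws: sigma2 | y_o ~ scaled inverse chi-square(N-J, s2),
  # mu_j | sigma2, y_o ~ N(ybar_j, sigma2 / n_j)
  sigma2.t <- rss.o / rchisq(1, df = N.o - J)
  mu.t     <- rnorm(J, mean = ybar.o, sd = sqrt(sigma2.t / n.o))
  # predicted dataset with the sample sizes of the new study
  y.sim[[t]] <- data.frame(
    y = rnorm(sum(n.r), mean = rep(mu.t, n.r), sd = sqrt(sigma2.t)),
    g = rep(seq_len(J), n.r))
}
```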
The steps in the following sections elaborate how the new data $y_r$ can be compared to the $T$ data matrices sampled from $f(y_{sim})$ that are to be expected given $H_0$, using a test statistic that evaluates relevant features of the original data.

Step 2: Test Statistic Evaluating Relevant Features
We propose to use $\bar{F}$ (Silvapulle & Sen, 2005, pp. 38-39) as a test statistic to evaluate how much the predicted data and the observed data deviate from an inequality constrained hypothesis capturing the relevant features of the original study, $H_{RF}$:

$$\bar{F}_{y_d} = \frac{RSS_{d,H_{RF}} - RSS_{d,H_u}}{S^2_d},$$

where

$$RSS_{d,H_u} = \sum_{j=1}^{J} \sum_{i=1}^{n_{jd}} (y_{ijd} - \bar{y}_{jd})^2$$

denotes the residual sum of squares in dataset $d \in \{r, sim\}$ for the unrestricted hypothesis $H_u: \mu_{1d}, ..., \mu_{Jd}$, where $\bar{y}_{jd}$ denotes the mean for group $j$ in dataset $d$. Furthermore, $S^2_d = RSS_{d,H_u} / (N - J)$ denotes the mean squared error, where $N = \sum_{j=1}^{J} n_{jr}$, and

$$RSS_{d,H_{RF}} = \sum_{j=1}^{J} \sum_{i=1}^{n_{jd}} (y_{ijd} - \tilde{\mu}_{jd})^2,$$

where $\tilde{\mu}_d$ thus contains the set of parameter estimates that minimize the residual sum of squares for $y_d$ under the constraints imposed by $H_{RF}$. $\bar{F}_{y_d}$ is the scaled difference between the residual sum of squares under the constraints imposed by $H_{RF}$ and the residual sum of squares for $y_d$ under $H_u$. As $H_u$ is unrestricted, $\bar{F}_{y_d}$ quantifies the misfit of $y_d$ with $H_{RF}$. The hypothesis capturing the relevant features of the original data, $H_{RF}$, is of the form $R\mu_d > 0$, where $R$ is a $K \times J$ restriction matrix, $J$ denotes the number of groups in the ANOVA study, and $K$ the number of restrictions in $H_{RF}$, while $\mu_d$ is the mean vector of length $J$.
Examples of constraints that can be applied under $R\mu_r > 0$ are: • Simple order constraints: $\mu_{jd} > \mu_{j'd}$, or $\mu_{jd} < \mu_{j'd}$ for a pair $j, j'$.
The constraints in $H_{RF}$ should be based on the findings of the original study, which implies and requires that $H_{RF}$ is always in agreement with the results of the original study (i.e., $\bar{F}_{y_o} = 0$). The results of the original study alone are usually not enough to determine which $H_{RF}$ is to be evaluated. For example, an original study shows that $\bar{y}_{1o} < \bar{y}_{2o} < \bar{y}_{3o}$. This finding may lead to $H_{RF}: \mu_{1d} < \mu_{2d} < \mu_{3d}$, but also to $H_{RF}: (\mu_{1d}, \mu_{2d}) < \mu_{3d}$ or $H_{RF}: \mu_{1d} < (\mu_{2d}, \mu_{3d})$. Which exact features should be covered in $H_{RF}$ can be guided by the conclusions of the original study. For example, if in the original study it is concluded that a treatment condition leads to better outcomes than two control conditions, the most logical specification of the relevant features is $H_{RF}: (\mu_{\text{controlA},d}, \mu_{\text{controlB},d}) < \mu_{\text{Treatment},d}$. Alternatively, if in the original study it is concluded that treatment A is better than treatment B, which is better than the control condition, a logical relevant feature hypothesis would be $H_{RF}: \mu_{\text{TreatmentA},d} > \mu_{\text{TreatmentB},d} > \mu_{\text{Control},d}$. It may also occur that the researcher conducting the replication test has an interest in evaluating a claim that is not in the original study, but could be made based on its results. In all cases, the researcher conducting the replication test should substantiate the choices made in the formulation of $H_{RF}$ with results from the original study. It is good practice to also pre-register $H_{RF}$. In the Examples Section, we demonstrate for two studies how the original study is linked to $H_{RF}$. First, however, we explain how the prior predictive p-value is calculated.
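The sketch below implements $\bar{F}$ for a dataset with integer group codes. The inequality-constrained estimates $\tilde{\mu}_d$ are obtained here with the quadprog package, which is one of several ways to solve this quadratic program; this is our illustration, not necessarily how the ANOVAreplication package computes it.

```r
## F-bar for outcome y, integer groups g (coded 1..J), and constraints R %*% mu > 0
library(quadprog)

Fbar <- function(y, g, R) {
  J    <- length(unique(g))
  n    <- tabulate(g)                 # group sample sizes
  ybar <- tapply(y, g, mean)          # unrestricted group means
  N    <- length(y)
  rss.u <- sum((y - ybar[g])^2)       # RSS under the unrestricted H_u
  s2    <- rss.u / (N - J)            # mean squared error S^2
  # minimize sum_j n_j (ybar_j - mu_j)^2 subject to R mu >= 0
  qp <- solve.QP(Dmat = diag(n), dvec = n * ybar,
                 Amat = t(R), bvec = rep(0, nrow(R)))
  mu.tilde <- qp$solution             # constrained estimates under H_RF
  rss.rf <- rss.u + sum(n * (ybar - mu.tilde)^2)  # RSS under H_RF
  (rss.rf - rss.u) / s2
}

# example: H_RF: mu1 < mu2 < mu3, one row per inequality
R <- rbind(c(-1, 1, 0),   # mu2 - mu1 > 0
           c(0, -1, 1))   # mu3 - mu2 > 0
```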
Step 3: p-value

The third and final step is to compute the prior predictive p-value. When we calculate $\bar{F}_{y^t_{sim}}$ for each dataset $y^t_{sim}$ obtained in Step 1 with respect to $\bar{F}$ as defined in Step 2, a sampling-based representation of the prior predictive distribution of the test statistic, $f(\bar{F}_{y_{sim}})$, is obtained. Consequently,

$$\text{prior predictive } p = P(\bar{F}_{y_{sim}} \geq \bar{F}_{y_r} | H_0) \approx \frac{1}{T} \sum_{t=1}^{T} I(\bar{F}_{y^t_{sim}} \geq \bar{F}_{y_r}),$$

where $H_0$ denotes "Replication", that is: $H_0: \mu_r, \sigma^2_r \sim h(\mu_r, \sigma^2_r)$. Furthermore, $I$ is an indicator function that takes on the value 1 if the argument is true and 0 otherwise.
As illustrated in Figure 1, the prior predictive p-value indicates how exceptional the observed statistic for the new data, $\bar{F}_{y_r}$, is compared to its prior predictive distribution $f(\bar{F}_{y_{sim}})$. The shaded area on the right side of $\bar{F}_{y_r}$ is $P(\bar{F}_{y_{sim}} \geq \bar{F}_{y_r} | H_0)$, that is, the prior predictive p-value. If the prior predictive p-value is significant, we reject replication of the relevant features of the original study by the new data. Note that the focus is on rejecting replication of the original results, and not on rejecting $H_{RF}$ in itself for the new study.
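Continuing the sketches above, the p-value is simply the proportion of predicted datasets whose $\bar{F}$ is at least as large as that of the observed new data; here y.r is an assumed data frame with outcome y and integer group g.

```r
## Prior predictive p-value, combining Steps 1-3 of the sketches above
Fbar.sim <- vapply(y.sim, function(d) Fbar(d$y, d$g, R), numeric(1))
Fbar.r   <- Fbar(y.r$y, y.r$g, R)    # y.r: the observed new dataset (assumed)
p.value  <- mean(Fbar.sim >= Fbar.r) # proportion of predicted data at least as extreme
```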

Uniformity.
To determine the significance of a p-value by comparing it to some preselected value $\alpha$, the p-value needs to be uniformly distributed if $H_0$ is true. Only when the p-value is uniform is $\alpha$ equal to the nominal Type I error rate. We will demonstrate that this holds for the prior predictive p-value if $f(\bar{F}_{y_{sim}})$ is continuous, and that it holds up to some $\alpha_0$ if $f(\bar{F}_{y_{sim}})$ is discrete.
A p-value is uniform if:

$$P(p \leq \alpha | H_0) = \alpha \quad \text{for every } \alpha \in [0, 1], \qquad (12)$$

where $p$ denotes a p-value from $f(p | H_0)$, that is, the null distribution of the p-values.
The following three steps prove that Equation 12 holds for the prior predictive p-value when $f(\bar{F}_{y_{sim}})$ is continuous:

1. $P(p \leq \alpha | H_0) = P(\bar{F}_{y_r} \geq \bar{F}_{y_{sim}, 1-\alpha} | H_0)$, where $\bar{F}_{y_{sim}, 1-\alpha}$ denotes the 100(1−α)th percentile of $f(\bar{F}_{y_{sim}})$.

2. $P(\bar{F}_{y_r} \geq \bar{F}_{y_{sim}, 1-\alpha} | H_0) = \int_{\bar{F}_{y_{sim}, 1-\alpha}}^{\infty} f(\bar{F}_{y_r} | H_0)\, d\bar{F}_{y_r}$, where $f(\bar{F}_{y_r} | H_0)$ denotes the distribution of $\bar{F}_{y_r}$ under $H_0$.

3. For the situations considered in this paper it holds that $f(\bar{F}_{y_r} | H_0) = f(\bar{F}_{y_{sim}})$, therefore $\int_{\bar{F}_{y_{sim}, 1-\alpha}}^{\infty} f(\bar{F}_{y_r} | H_0)\, d\bar{F}_{y_r} = \alpha$, which renders Equation 12 true.

With constraints of the form $R\mu_r > 0$, however, $f(\bar{F}_{y_{sim}})$ will often be discrete. When $f(\bar{F}_{y_{sim}})$ is discrete, the prior predictive p-value is not uniform for all $\alpha \in [0, 1]$. For example, let us obtain $g(\mu_o, \sigma^2_o | y_o) = h(\mu_r, \sigma^2_r)$ for an original study with $\bar{y}_{1o} = 1$, $\bar{y}_{2o} = 2$, $\bar{y}_{3o} = 3$, $s^2_o = 5$, and $n_{jo} = 50$, with $n_{jr} = 50$ and $H_{RF}: \mu_{1r} < \mu_{2r} < \mu_{3r}$. Subsequently, we simulate $y^t_r$ for $t = 1, ..., 100{,}000$, and calculate the prior predictive p-value for each $y^t_r$. The result is $f(p | H_0)$, which is plotted in Figure 2a. In Figure 2a, we see a thick vertical line that indicates a set of p-values with exactly the same value, namely 1.00. This set of equal p-values results from the fact that $H_{RF}: \mu_{1r} < \mu_{2r} < \mu_{3r}$ is true for a substantial number of datasets $y^t_r$, causing the associated $\bar{F}_{y^t_r}$ to be exactly equal to 0 and the associated prior predictive p-values to be exactly equal to 1 (see Figure 2b). Generally, however, there exists an $\alpha_0$ for which $f(p | H_0)$ is uniform (Meng, 1994), since all values in $f(\bar{F}_{y_{sim}})$ other than 0 will occur in a continuous fashion. Thus, $f(p | H_0)$ is uniform for $p \in [0, \alpha_0]$. If the preselected $\alpha < \alpha_0$, $\alpha$ is equal to the nominal Type I error rate. $\alpha_0$ can be computed as $1 - P(\bar{F}_{y_{sim}} = 0)$. For example, $\alpha_0 \geq .05$ if no more than 95% of $\bar{F}_{y_{sim}}$ is exactly 0. It would be exceptional if more than 95% of $\bar{F}_{y_{sim}} = 0$, but it could occur with extremely low power in the original study and an unspecific $H_{RF}$. A visualization of $f(\bar{F}_{y_{sim}})$ can help to roughly estimate $\alpha_0$. For the discrete $f(\bar{F}_{y_{sim}})$ considered here, 53% of $f(\bar{F}_{y_{sim}}) = 0$ and $\alpha_0 = .47$ (Figure 2b). In the next section, we deal with another important property of null hypothesis significance testing methods: power.
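Given the sampled $\bar{F}_{y_{sim}}$ values from the sketch above, $\alpha_0$ can be estimated directly:

```r
## alpha_0 = 1 - P(Fbar_ysim = 0), estimated from the sampled prior predictive
## distribution of the test statistic; the preselected alpha should stay below it
alpha0 <- 1 - mean(Fbar.sim == 0)
```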

Power
Power is the probability of rejecting the null hypothesis (of replication) at a preselected $\alpha$ when not the null but an alternative hypothesis is true. Researchers typically pursue a power of .80. Let us denote power by $\gamma$:

$$\gamma = P(\bar{F}_{y_r} \geq \bar{F}_{y_{sim}, 1-\alpha} | H_a), \qquad (13)$$

where $H_a$ is the population under the alternative hypothesis for which replication is to be rejected. Note that any population for which $H_0$ is not true can qualify to reject replication. The population used is determined by the theoretical context in which the replication test takes place. The population with $\mu_{1a} = ... = \mu_{Ja}$ is a special population that is generally considered to display a non-effect in ANOVA studies. Hence, $\mu_{1a} = ... = \mu_{Ja}$ seems a natural default choice for the population under the alternative hypothesis. As a best guess for $\mu_{ja}$ and $\sigma^2_a$ in a power analysis, the grand mean $\bar{y}_o$ and variance $\sigma^2_o$ of the original study can be used. The population under the alternative hypothesis with $\mu_{1a} = ... = \mu_{Ja}$ is on the edge of $H_{RF}$: it deviates minimally from $H_{RF}$; hence, the associated $\gamma$ is a lower limit. Power increases when the population under the alternative hypothesis differs more from $H_{RF}$ than the population with equal means does, for example, when the means are ordered differently.

Simulation Study
To illustrate the power of the prior predictive p-value, we conducted a simulation study in which we varied the effect size in the original study ($f_o$), the sample size of the original study ($n_{jo}$), the sample size of the new study ($n_{jr}$), the relevant feature of interest ($H_{RF}$), and the population under the alternative hypothesis ($H_a$), as specified in Table 2. For each cell in the simulation study, 10,000 samples were drawn from $H_a$ and power was calculated according to Equation 13.
The results of the simulation study are provided in Table 3. As expected, power generally increases with increasing effect sizes, increasing sample sizes, and increasing deviation between $y_o$ and $H_a$. There are, however, some exceptions: with small $f_o$ and low $n_{jo}$, larger $n_{jr}$ only emphasize the noise in the original study more and do not lead to an increase in power. Similarly, a more specific $H_{RF}$ does not always increase power. Given original studies with smaller samples and smaller effect sizes, $h(\mu_r, \sigma^2_r)$ is so uninformative that more specific $H_{RF}$ are only more inaccurate under $H_0$, and $\bar{F}_{y_r}$ needs to be extremely large to reject the null. Table 3 also shows that the power on the edge (i.e., the power for $H_{a1}$) is insufficient for original studies with small and medium effect sizes ($\gamma < .60$ in all cells). With medium $f_o$, power is only sufficient if the new study originates from a population in which the means are ordered differently (e.g., $H_{a2}$). For original studies with large effect sizes and at least 50 participants per group, power can be sufficient under $H_{a1}$. Power levels off, however, for $H_{RF1}$ and $H_{RF2}$ at .67 and .83, respectively. Under $\mu_{1a} = \mu_{2a} = \mu_{3a}$, $H_{RF1}: \mu_{1r} < (\mu_{2r}, \mu_{3r})$ is true in 1/3 of the situations by chance. Consequently, power cannot exceed $1 - 1/3 = .67$. For $H_{RF2}: \mu_{1r} < \mu_{2r} < \mu_{3r}$, 1/6 of the combinations under $H_{a1}$ are in line with replication by chance. Hence, power cannot exceed $1 - 1/6 = .83$. If we move further from the edge of $H_{RF}$, as we do with $H_{a2}$, power increases. Thus, the power of the prior predictive p-value considering an $H_{RF}$ with three or fewer order constraints will almost never be high if the true means are equal, but can be high if the true ordering differs from the one in $H_{RF}$.
The results demonstrate that imprecise estimates (i.e., large standard errors leading to a low informative prior) in the original study lead to low power, especially on the edge of $H_{RF}$. This is as true for the prior predictive p-value as it is for other approaches. For example, in a classical ANOVA study with three groups of 20 participants each, power is <.10, <.40, and <.80 for small, medium, and large effect sizes, respectively; a result that was already pointed out in Cohen (1988, p. 313). Zondervan-Zwijnenburg and Rijshouwer (2020) demonstrate the application of different methods to evaluate replication within the context of small samples. Not a single method is unaffected by small sample sizes. As highlighted by Morey and Lakens (2019) and Patil et al. (2016): replication can only be rejected based on the findings of the original study, and when these findings are highly imprecise due to large standard deviations and small sample sizes, rejecting them is hard or even impossible.
Underpowered original studies may result in non-significant prior predictive p-values that have a high probability of being Type II errors (Morey & Lakens, 2019). Therefore, reporting only the prior predictive p-value is not enough; the probability of a Type II error (i.e., $1 - \gamma$) given the population under the alternative hypothesis should be communicated to the reader as well. The next section elaborates on the computation of power and of the sample size required for sufficient statistical power. The Workflow and Examples sections explain how researchers should incorporate prior predictive p-values and power. One of the examples will also demonstrate rejected replication despite low power on the edge of $H_0$.

Power and Sample Size Determination
As highlighted in the previous sections and in the literature (e.g., Brandt et al., 2014;Simonsohn, 2015), power is an important characteristic of a convincing replication study. It is thus important that researchers can calculate the power of the prior predictive check, and can determine the sample size for a new study such that the replication test has high statistical power. Therefore, the ANOVAreplication R-package and the online interactive application (see osf.io/6h8x3) include a power and sample size calculator.
Given the vector of group sample sizes in the new study $n_r$, $h(\mu_r, \sigma^2_r)$, $H_a$, $H_{RF}$, and $\alpha$, the power $\gamma$ is calculated as follows:

1. Following Steps 1 and 2 of the prior predictive check, $t = 1, ..., T$ datasets are simulated, yielding $f(\bar{F}_{y_{sim}})$, from which $\bar{F}_{y_{sim}, 1-\alpha}$ can be calculated.
2. Given $\mu_a$, $\sigma_a$, $t = 1, ..., T$ datasets are simulated from $f(\bar{F}_{y_r} | H_a)$ with sample sizes $n_r$. Following Step 2 of the prior predictive check, $\bar{F}_{y_r}$ is calculated for each dataset.

3. Following Equation 13, $\gamma$ is the proportion of these $\bar{F}_{y_r}$ values that is at least as large as $\bar{F}_{y_{sim}, 1-\alpha}$.
As default choice for $\mu_a$, we recommend using $\bar{y}_o$ for each group. With this setting, the power to reject replication in case of equal group means is calculated. As default choice for $\sigma_a$, we recommend the pooled standard deviation of the original study.
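A self-contained Monte Carlo version of this power calculation, reusing the Fbar function and the sampled Fbar.sim from the earlier sketches (again our illustration rather than the internals of the package's power.calc function), could look as follows:

```r
## Monte Carlo power following Equation 13: the proportion of new datasets
## simulated under H_a whose F-bar exceeds the 1 - alpha percentile of the
## prior predictive distribution of the statistic.
power.mc <- function(Fbar.sim, mu.a, sigma.a, n.r, R, alpha = .05, T.sim = 10000) {
  crit <- quantile(Fbar.sim, probs = 1 - alpha)  # Fbar_{ysim, 1 - alpha}
  J <- length(mu.a)
  g <- rep(seq_len(J), n.r)
  Fbar.a <- replicate(T.sim, {
    y <- rnorm(sum(n.r), mean = rep(mu.a, n.r), sd = sigma.a)
    Fbar(y, g, R)
  })
  mean(Fbar.a > crit)                            # approximates gamma
}

# default H_a: equal means at the grand mean, pooled SD of the original study
gamma <- power.mc(Fbar.sim, mu.a = rep(mean(ybar.o), 3), sigma.a = sqrt(s2.o),
                  n.r = n.r, R = R)
```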
To determine the required sample size to reject replication with sufficient power, we use an iterative procedure. In addition to $h(\mu_r, \sigma^2_r)$, $H_a$, $H_{RF}$, and $\alpha$, we use the following information to calculate the required sample size: a target power level $\gamma$; a small margin $\gamma_{margin}$ covering acceptable values around the target power, because the calculated power may not be exactly equal to the target power; a starting value for the group sample size $n_{jr0}$; a maximum number of iterations $Q_{max}$; and a maximum total sample size for the new study $N_{rmax}$. Our default values are: $\gamma = .825$, $\gamma_{margin} = .025$, $\alpha = .05$, $n_{jr0} = 20$, $Q_{max} = 10$, and $N_{rmax} = 600$.
1. In every iteration $q$, $\gamma_q$ is calculated given $n_{jrq}$.
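The updating rule for $n_{jrq}$ is not reproduced here, so the sketch below uses a crude proportional update of our own devising under the stated defaults; note that power.fun must re-simulate $f(\bar{F}_{y_{sim}})$ for each candidate $n_{jr}$, because the predicted datasets share the sample sizes of the new study.

```r
## Sample size search sketch (the update rule is our assumption, not the
## package's exact procedure). power.fun(n) should return the power for
## n_jr = n per group, re-simulating Fbar.sim internally for that n.
sample.size.search <- function(power.fun, gamma.target = .825, margin = .025,
                               n.start = 20, Q.max = 10, N.max = 600, J = 3) {
  n.q <- n.start
  for (q in seq_len(Q.max)) {
    gamma.q <- power.fun(n.q)
    if (abs(gamma.q - gamma.target) <= margin) break     # target power reached
    n.q <- ceiling(n.q * gamma.target / max(gamma.q, .05))  # crude update
    if (n.q * J > N.max) { warning("N_rmax exceeded"); break }
  }
  c(n.jr = n.q, gamma = gamma.q)
}
```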

Workflow
To clarify the procedure to obtain the prior predictive p-value, the workflow is depicted in Figure 3.
Step 1. The first steps (1a-1c) only require the original study.
Step 1a is to derive the relevant features to be evaluated in the test statistic from the findings of the original study. Next (Step 1b), the population for which replication should be rejected (i.e., $H_a$) can be defined: what is the ordering of the means in this population, and what is the effect size given that ordering? $H_a$ can be a population in which all means are equal, but it does not have to be.
Step 1c is to obtain the data of the original study, or reconstruct the data based on reported means, standard deviations and group sample sizes. If the new study is not yet conducted, the second step is to calculate the required sample size per group for the new study to reject replication with sufficient power (i.e., γ).
Step 2. The sample size calculation can be conducted with the sample.size.calc function in the ANOVAreplication package. If the function cannot find a (reasonable) group sample size for which γ is sufficient, this implies that the original study is not suited for replication testing with the prior predictive p-value for the specified $H_a$: its conclusions are too vague (i.e., the standard errors are too large) to reject replication if $H_a$ is true. There is still a chance that the prior predictive p-value turns out significant, especially if the observed data are more extreme than most samples from $H_a$, but the researcher should consider whether collecting data with such a low probability of a meaningful result is ethically acceptable.
Step 3. As a third step, the prior predictive p-value can be computed with the function prior.predictive.check. The power associated with the sample size of the new study can be calculated with power.calc. Note that this is not a post-hoc power analysis, as the definition of $H_a$ is unrelated to the new study. Hence, the power to reject replication for $H_a$ can be insufficient (i.e., lower than $1 - \beta$, where $\beta$ is the preset Type II error rate), while the prior predictive p-value is statistically significant, or vice versa. Figure 3 assists in interpreting the resulting p-value, considering the statistical power to reject replication for $H_a$, unless $\bar{F}_{y_r}$ is exactly 0. If the new study perfectly meets the features of the original study as described in $H_{RF}$, $\bar{F}_{y_r}$ will be 0 and the prior predictive p-value 1.00. In such a case, we confirm replication of the relevant features of the original study as captured in $H_{RF}$, irrespective of power. Theoretically it is possible that $\bar{F}_{y_r} = 0$ while the new study is an extreme sample from a population in which $H_{RF}$ is not true. That, however, is not under consideration here, as our question was whether the observed new study replicates, or fails to replicate, relevant features of the original study.

Figure 3. The prior predictive p-value workflow.
In case of a non-significant result in combination with low power, the researcher should emphasize the probability that the failure to reject replication is a Type II error, and it is advised to conduct a replication study with larger $n_{jr}$. The required sample size per group can again be calculated with the sample.size.calc function in the ANOVAreplication package. If the required $n_{jr}$ is excessive given $H_a$, it may be an inevitable conclusion that the original study is not suited for replication testing by means of the prior predictive p-value. If replication is rejected despite low power, this implies that the observed new dataset deviates more from $H_{RF}$ than most datasets under $H_a$. With sufficient statistical power, it is still informative to notify the reader of the achieved power and/or the probability of a Type II error given the population under $H_a$.
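The decision logic of Figure 3, as described in the two paragraphs above, can be summarized in a small helper (our summary, with a conventional .80 power target assumed; p, Fbar.r, and gamma are as computed in the earlier sketches):

```r
## Compact restatement of the Figure 3 decision logic (our summary)
interpret <- function(p, Fbar.r, gamma, alpha = .05, gamma.target = .80) {
  if (Fbar.r == 0)               "Fbar_yr = 0: replication of H_RF confirmed, irrespective of power"
  else if (p <= alpha)           "Replication rejected (informative even under low power)"
  else if (gamma < gamma.target) "Not rejected, but low power: possible Type II error; consider larger n_jr"
  else                           "Replication not rejected, with adequate power to detect H_a"
}
```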

Examples
To illustrate the use of the prior predictive check to assess whether relevant ANOVA features are replicated, we selected two replication studies that were part of the Reproducibility Project Psychology initiated by the Open Science Collaboration (2012, 2015). All calculations can be performed with the ANOVAreplication R-package (Zondervan-Zwijnenburg, 2018).
The first study is Fischer et al. (2008), who studied the impact of self-regulation resources on confirmatory information processing. According to the theory, people who have low self-regulation resources (i.e., depleted participants) will prefer information that matches their initial standpoint. An ego-threat condition was added, because the literature proposes that ego-threat affects decision-relevant information processing, although the direction of this effect is not clear. To determine which relevant feature of the results (see Table 4) should be tested for replication, we follow the original findings: "Planned contrasts revealed that the confirmatory information processing tendencies of participants with reduced self-regulation resources [...] were stronger than those of nondepleted [...] and ego threatened participants [...]" (Fischer et al., 2008, p. 387). This translates to $H_{RF}: \mu_{\text{low self-regulation},r} > (\mu_{\text{high self-regulation},r}, \mu_{\text{ego-threatened},r})$ (Workflow Step 1a). We want to reject replication when all means in the population are equal, that is, $H_a: \mu_{\text{low self-regulation},r} = \mu_{\text{high self-regulation},r} = \mu_{\text{ego-threatened},r}$ (Workflow Step 1b). We simulate the original data based on the means, standard deviations, and sample sizes reported in Fischer et al. (2008) (Workflow Step 1c). As the replication study has already been conducted by Galliani (2015) (see Table 4 for results), we do not calculate the required sample size to test replication (Workflow Step 2), and proceed to calculate the prior predictive p-value and the power of the replication test (Workflow Step 3). The new study by Galliani (2015) results in an extreme $\bar{F}$ score compared to the predicted data. Figure 4 illustrates this conclusion: over 90% of the predicted datasets score perfectly in line with $H_{RF}$, but the new study by Galliani (2015) deviates from $H_{RF}$ and scores in the extreme 0.3% of the predicted data. The replication of the original study conclusions is thus rejected.

The second study is Janiszewski and Uy (2008), who studied numerical judgements with five experiments. More specifically, they studied the impact of the precision of an anchor, and of the motivation to adjust from the anchor, on judgement bias. The group means, standard deviations, and sample sizes of experiment 4a in the original study by Janiszewski and Uy (2008) and the replication study by Chandler (2015) are provided in Table 5. Based on these results, Janiszewski and Uy (2008) draw two conclusions. "First, a precise anchor results in less adjustment than a rounded anchor" (p. 126). For experiment 4a, which was replicated by Chandler (2015), this conclusion translates to $H_{RF}: (\mu_{\text{low motivation,round},r} > \mu_{\text{low motivation,precise},r})\ \&\ (\mu_{\text{high motivation,round},r} > \mu_{\text{high motivation,precise},r})$ (Workflow Step 1a). We want to reject replication when all means in the population are equal, that is, $H_a: \mu_{\text{low motivation,round},r} = \mu_{\text{low motivation,precise},r} = \mu_{\text{high motivation,round},r} = \mu_{\text{high motivation,precise},r}$ (Workflow Step 1b). We simulate the original data based on the means, standard deviations, and sample sizes reported in Janiszewski and Uy (2008) (Workflow Step 1c). As the replication study has already been conducted by Chandler (2015), we do not calculate the required sample size to test replication (Workflow Step 2), and proceed to calculate the prior predictive p-value and the power of the replication test (Workflow Step 3). The resulting prior predictive p-value is 1.00.
The data obtained by Chandler (2015) were perfectly in line with the $H_{RF}$ describing the effect as observed by Janiszewski and Uy (2008). Therefore, we do not have further concerns about the obtained power. Hence, we conclude that the results of Janiszewski and Uy (2008) with respect to $H_{RF}: (\mu_{\text{low motivation,round},r} > \mu_{\text{low motivation,precise},r})\ \&\ (\mu_{\text{high motivation,round},r} > \mu_{\text{high motivation,precise},r})$ are replicated by Chandler (2015).
The other conclusion that Janiszewski and Uy (2008) draw concerns the presence of an interaction effect of adjustment motivation and anchor rounding: "The difference in the amount of adjustment between the rounded- and precise-anchor conditions increased as the motivation to adjust went from low [...] to high" (p. 125). The results and conclusions of Janiszewski and Uy with respect to experiment 4a translate to $H_{RF}: (\mu_{\text{low motivation,round},r} > \mu_{\text{low motivation,precise},r})\ \&\ (\mu_{\text{high motivation,round},r} > \mu_{\text{high motivation,precise},r})\ \&\ (\mu_{\text{low motivation,round},r} - \mu_{\text{low motivation,precise},r}) < (\mu_{\text{high motivation,round},r} - \mu_{\text{high motivation,precise},r})$.
The prior predictive p-value related to this $H_{RF}$ is .014 with $\gamma = .87$. Thus, we reject replication of the interaction effect.

Figure 4. Prior predictive distribution of $\bar{F}$ with the observed $\bar{F}$ for Galliani (2015). The histogram bars represent $\bar{F}$ for the predicted data. The thick line on the left represents $\bar{F}$ for the predicted data that are exactly 0 (i.e., over 90% of the total), whereas the red line represents $\bar{F}$ for Galliani (2015).
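To illustrate how an interaction hypothesis such as the one above maps onto the $R\mu_d > 0$ form of Step 2, the matrix below encodes its three constraints; the column order (low/round, low/precise, high/round, high/precise) is our assumption for illustration.

```r
## R matrix for the interaction H_RF; columns assumed ordered as:
## low/round, low/precise, high/round, high/precise
R.int <- rbind(
  c( 1, -1,  0,  0),   # mu_low,round  > mu_low,precise
  c( 0,  0,  1, -1),   # mu_high,round > mu_high,precise
  c(-1,  1,  1, -1))   # (mu_high,round - mu_high,precise) > (mu_low,round - mu_low,precise)
```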

Discussion & Conclusion
The goal of the current paper was to introduce the prior predictive check as a manner to test replication of ANOVA features. With the prior predictive check, researchers can find an answer to the question: "Does the new study fail to replicate relevant features of the original study?" Identifying a non-replication may make us wonder about the representativeness of the original study, the new study, and the comparability of both studies. Or, as stated by Simonsohn (2015, p. 9): "Statistical techniques help us identify situations in which something other than chance has occurred. Human judgment, ingenuity, and expertise are needed to know what has occurred instead." In the current paper, we discussed the prior predictive p-value for the ANOVA setting, and explained how to test relevant features of the form $R\mu_r > 0$. Technically, however, the relevant features evaluated by the ANOVAreplication R-package can also be of the form $R\mu_r > r$ and $S\mu_r = s$, where $r$ and $s$ are vectors of length $K$ containing the constants in $H_{RF}$, and $S$ is a $K \times J$ restriction matrix like $R$. Accordingly, minimum (effect size) differences between means can be evaluated, and means can be constrained to equal specific values. Even though constraints of these forms can be evaluated with the R-package and in the online application, they are not emphasized in the current paper because they will less often directly relate to the findings of an original study.
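In terms of the quadratic program used in the $\bar{F}$ sketch above, these more general constraints only change the right-hand side and add equality rows (n, ybar, R, S, r, and s as defined previously; again our illustration, not the package's implementation):

```r
## Constrained estimates under R mu > r and S mu = s: equality rows go first
## in Amat/bvec and are flagged via meq (quadprog convention)
qp <- quadprog::solve.QP(Dmat = diag(n), dvec = n * ybar,
                         Amat = t(rbind(S, R)),
                         bvec = c(s, r), meq = nrow(S))
mu.tilde <- qp$solution   # estimates under H_RF with constants r and s
```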
The prior predictive p-value is generalizable to statistical models other than the ANOVA. That is, for any model a predictive distribution can be obtained, constrained hypotheses can be constructed, and a test statistic evaluating the constraints can be calculated. The test as currently provided can already be used for the repeated measures ANOVA by means of contrast weights (see, for example, Furr and Rosenthal, 2003). With contrast weights, a score can be calculated for each participant indicating to what degree the participant follows the expected pattern. Subsequently, the replication of relevant features of these contrast scores over groups can be tested. A preprint introducing a test of replication with the prior predictive p-value for structural equation models has been published at https://psyarxiv.com/uvh5s.
In the current paper, we introduced the prior predictive p-value as a new tool to quantify replication failure.