A Brief Guide to Evaluate Replications

https://doi.org/10.15626/MP.2018.843
Article type: Original Article
Published under the CC-BY 4.0 license
Open data: Not relevant
Open materials: Not relevant
Open and reproducible analysis: Not relevant
Open reviews and editorial process: Yes
Preregistration: Not relevant
Edited by: Rickard Carlsson
Reviewed by: Nuijten, M., & Schimmack, U.
All supplementary files can be accessed at the OSF project page: https://doi.org/10.17605/OSF.IO/Q56E8

There is growing consensus in the psychology community regarding the fundamental scientific value and importance of replication. Considerably less consensus, however, exists about how to evaluate the design and results of replication studies. In this article, we make concrete recommendations for evaluating replications with more nuance than is currently typical in the literature. These recommendations aim to maximize the likelihood that replication results are interpreted in a fair and principled manner.
We propose a two-stage approach. The first stage involves considering and evaluating six crucial study characteristics (the first three specific to replication studies, the last three relevant for any study): (1) replication method similarity, (2) replication differences, (3) investigator independence, (4) method/data transparency, (5) analytic result reproducibility, and (6) evidence for the plausibility of auxiliary hypotheses. The second stage, assuming sound study characteristics, involves more nuanced ways to interpret replication results at the individual-study and meta-analytic levels. Finally, we propose the use of clearer and less ambiguous language to more effectively communicate the results of replication studies.
These recommendations are directly based on curating N = 1,127 replications (as of August 2018) available at Curate Science (CurateScience.org), a web platform that organizes and tracks the transparency and replications of published findings in the social sciences (LeBel, McCarthy, Earp, Elson, & Vanpaemel, 2018). This is the largest known metascientific effort to evaluate and interpret replication results of studies across a wide and heterogeneous set of study types, designs, and methodologies.

Replication-Specific Study Characteristics
When evaluating replication studies, the following three study characteristics are of crucial importance:

Methodological similarity.
A first aspect is whether a replication study employed a sufficiently similar methodology to the original study (i.e., at minimum, used the same operationalizations for the independent and dependent variables, as in "close replications"; LeBel et al., 2018). This is required because only such replications can cast doubt upon an original hypothesis (assuming sound auxiliary hypotheses; see section below), and hence, in principle, falsify a hypothesis (LeBel, Berger, Campbell, & Loving, 2017; Pashler & Harris, 2012). Studies that are not sufficiently similar can only speak to the generalizability, but not the replicability, of a phenomenon under study, and should therefore be treated as "generalizability studies" rather than "replication studies". Such studies are sometimes called "conceptual replications", but this is a misnomer given that it is more accurate to conceptualize them as "extensions" rather than replications (LeBel et al., 2017; Zwaan, Etz, Lucas, & Donnellan, 2017).

Replication differences.
A second aspect to carefully consider is whether any study design characteristics differed from the comparison original study, and whether those differences were within or beyond the replicator's control (LeBel et al., 2018). Such differences are critical to consider because they help the community begin to understand the replicability and generalizability of an effect. Consistent positive evidence across replications with minor design differences suggests an effect is likely robust across those differences. For inconsistent replication evidence, on the other hand, such differences may provide initial clues regarding potential boundary conditions of an effect.

Investigator independence.
A final important consideration is the degree of independence between the replication investigators and the researchers who conducted the original study. This is important for mitigating the problem of "correlated investigators" (Rosenthal, 1991), whereby non-independent investigators may be more susceptible to confirmation biases given their vested interest in an effect (although preregistration and other transparent practices can alleviate these issues; see next section).

General Study Characteristics
When evaluating studies in general, the following three study characteristics are important to consider.

Study transparency.
Sufficient transparency is required to allow comprehensive scrutiny of how any study was conducted. Sufficient transparency means posting the experimental materials and underlying data in a readable format (e.g., with a codebook) on a public repository (the criteria for earning the open materials and open data badges, respectively; Kidwell et al., 2016) and following the relevant reporting standards for the type of study and methodology used (e.g., the CONSORT reporting standard for experimental studies; Schulz, Altman, & Moher, 2010). If a study is not reported with sufficient transparency, it cannot be properly scrutinized, and its findings are consequently of little value because the target hypothesis was not tested in a sufficiently falsifiable manner. Preregistering a study (which publicly commits researchers to their data collection, processing, and analysis plans before data collection) offers even more transparency and limits researcher degrees of freedom (assuming that the preregistered procedure was actually followed).

Analytic result reproducibility.
For any study, it is also important to consider whether the study's primary result (or set of results) is analytically reproducible, that is, whether it can be successfully reproduced (within a certain margin of error) from the raw or transformed data. This of course requires that the data are actually available, whether publicly (as in the case of "open data") or otherwise.
If analytic reproducibility is confirmed, then our confidence in a study's reported results is boosted (and ideally results can also be confirmed to be robust across alternative justifiable data-analytic choices; Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016). If analytic reproducibility is not confirmed and/or if discrepancies are detected, then our confidence should be reduced and this should be taken into account when interpreting a study's results.
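To make this concrete, the following minimal sketch (in Python) shows what such a reproducibility check could look like for a study whose primary result is an independent-samples t test. The function name, tolerance, and simulated data are illustrative assumptions, not part of any established tool:

```python
import numpy as np
from scipy import stats

def check_reproducibility(group_a, group_b, reported_t, tolerance=0.01):
    """Recompute an independent-samples t statistic from the posted data and
    compare it to the value reported in the paper, allowing for rounding."""
    recomputed_t, _ = stats.ttest_ind(group_a, group_b)
    return {
        "recomputed_t": round(float(recomputed_t), 3),
        "reported_t": reported_t,
        "reproduced": abs(recomputed_t - reported_t) <= tolerance,
    }

# Hypothetical "open data" for two conditions:
rng = np.random.default_rng(0)
condition_a = rng.normal(0.4, 1.0, 50)
condition_b = rng.normal(0.0, 1.0, 50)
print(check_reproducibility(condition_a, condition_b, reported_t=2.10))
```

A stricter workflow would also rerun the full data-processing pipeline from the raw data and probe robustness across alternative justifiable analytic choices (cf. Steegen et al., 2016).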

Auxiliary hypotheses.
Finally, for any study, researchers should consider all available evidence regarding the plausibility of the auxiliary hypotheses that needed to hold for the substantive hypothesis to be tested (LeBel et al., 2018). Auxiliary hypotheses include, for example, the psychometric validity of the measuring instruments and the sound realization of the experimental conditions (Meehl, 1990). Their plausibility can be gauged by examining reported evidence from positive controls or other evidence that a replication sample had the ability to detect some effect (e.g., replicating a past known effect; manipulation check evidence). These considerations are particularly crucial when interpreting null results, so that one can rule out more mundane reasons for not having detected a signal (e.g., fatal experimenter or data-processing errors; though such fatal errors can also sometimes cause false positive results).
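As a concrete illustration, a positive-control check might be sketched as follows (Python; the one-sided test and simulated data are hypothetical choices for illustration): before interpreting a null result for the focal effect, one verifies that the sample detected a well-established effect.

```python
import numpy as np
from scipy import stats

def positive_control_detected(control, treatment, alpha=0.05):
    """One-sided test that a well-established manipulation moved its measure
    in the expected direction, i.e., that the sample could detect *some* effect."""
    _, p = stats.ttest_ind(treatment, control, alternative="greater")
    return p < alpha

# Hypothetical manipulation-check data with a known, sizeable effect:
rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 60)
treatment = rng.normal(0.6, 1.0, 60)
print(positive_control_detected(control, treatment))  # expected: True
```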

Nuanced Statistical Interpretation and Language
Once these six study characteristics have been evaluated and taken into account, we recommend statistical approaches for interpreting the results of a replication study at the individual-study and meta-analytic levels that are more nuanced than current practice. We then propose the use of clearer language to communicate replication results.
1 The precision of an original study's ES estimate is not currently accounted for because the vast majority of original studies in the legacy literature do not report 95% CIs (and CIs most often cannot be calculated because insufficient information is reported). In the rare cases where CIs are reported, they are typically so wide (given the underpowered nature of the original studies) that they add little information.

Statistical interpretation: Individual-study level.
At the individual-study level, we recommend considering the following three distinct statistical aspects of a replication result: (1) whether a signal was detected, (2) the consistency of the replication effect size (ES) with the original study ES, and (3) the precision of the replication ES estimate relative to the original study. These considerations yield a set of replication outcome categories for the situation where the original study detected a signal (see Figure 1, Panel A, for visual depictions of these distinct scenarios).1 For example, in cases where a replication ES estimate was less precise than the original (i.e., the replication ES confidence interval is wider than the original's), which can occur when a replication uses a smaller sample size and/or when the replication sample exhibits higher variability, we propose the label "less precise" be used to warn readers that such a replication result should only be interpreted meta-analytically (Panel A, replication scenario #7; e.g., Schuler & Wanke's, 2016, Study 2 replication of Caruso et al.'s, 2013, Study 2).
In the situation where an original study did not detect a signal, the same considerations yield an analogous set of replication outcome categories (see Figure 1, Panel B).

From this perspective, the proposed improved language to describe a replication study under replication scenario #6 would be: "We report a replication study of effect X. No signal was detected and the effect size was inconsistent with the original one." This terminology contrasts favorably with several ambiguous or unclear terms commonly used to describe replication results (e.g., "unsuccessful", "failed", "failure to replicate", "non-replication"). The terms "unsuccessful" and "failed" (or "failure to replicate") are ambiguous: was it the replication methodology or the replication result that was unsuccessful or failed (similar logic applies to the ambiguous term "non-replication")? These terms are also problematic because of the implicit message that something was "wrong" with the replication. For example, though the "small telescopes" approach (Simonsohn, 2015) was an improvement over the prior simplistic standard of considering a replication with p < .05 as "successful" and p > .05 as "unsuccessful", it nonetheless uses ambiguous language that does not actually describe a replication result (e.g., "uninformative" vs. "informative failure to replicate"). In contrast, the terminology we propose is unambiguous and descriptively accurate, stating both whether a signal was detected and whether the replication ES estimate is consistent with the original study's. The proposed nuanced approach to statistically interpreting replication evidence thus improves the clarity of the language used to describe and communicate replication results.
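To illustrate, here is a minimal sketch (Python) of how the three statistical aspects above could be operationalized. The specific decision rules (signal = replication 95% CI excludes zero; consistency = original ES point estimate falls within the replication CI; precision = relative CI width) are simple illustrative choices, not a definitive implementation of the framework:

```python
def interpret_replication(rep_ci, original_es, original_ci=None):
    """Describe a replication result using the proposed language.

    rep_ci:      (lower, upper) 95% CI of the replication ES
    original_es: the original study's ES point estimate
    original_ci: (lower, upper) 95% CI of the original ES, if reported
    """
    rep_lo, rep_hi = rep_ci
    parts = []
    # (1) Signal detection: does the replication CI exclude zero?
    parts.append("signal detected" if not (rep_lo <= 0.0 <= rep_hi)
                 else "no signal detected")
    # (2) Consistency: does the replication CI include the original ES?
    parts.append("ES consistent with original"
                 if rep_lo <= original_es <= rep_hi
                 else "ES inconsistent with original")
    # (3) Relative precision: only assessable when the original CI is
    # reported, which footnote 1 notes is rare in the legacy literature.
    if original_ci is not None and \
            (rep_hi - rep_lo) > (original_ci[1] - original_ci[0]):
        parts.append("less precise (interpret meta-analytically only)")
    return "; ".join(parts)

# A scenario-#6-style result: no signal, ES inconsistent with the original.
print(interpret_replication(rep_ci=(-0.15, 0.10), original_es=0.45))
```

Run on a scenario-#6-type input (a replication CI that includes zero but excludes the original ES), the function produces exactly the kind of descriptive statement proposed above.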

Statistical interpretation: Meta-analytic level.
Interpreting the outcomes of a set of replication studies can proceed in two ways: an informal approach when only a few replications are available, and a quantitative meta-analytic approach when several replications are available for a specific operationalization of an effect. The informal approach considers whether the replications consistently detect a signal and whether each replication ES is consistent (i.e., of similar magnitude) with the original study's ES point estimate (Panel A, replication scenario #1). In this situation, one can informally say that an effect is "replicable." When several replications are available, a more quantitative meta-analytic approach can be taken: an effect can be considered "replicable" when the meta-analytic ES estimate excludes zero and is consistent with the original ES point estimate (again replication scenario #1, see Figure 1, Panel A; see also Mathur & VanderWeele, 2018).
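For the quantitative route, a minimal fixed-effect (inverse-variance) meta-analysis sketch in Python follows. Real analyses would typically use a random-effects model and dedicated software (e.g., the R package metafor), and the ESs, standard errors, and decision rule below are hypothetical illustrations:

```python
import math

def meta_analytic_es(effect_sizes, std_errors):
    """Inverse-variance-weighted mean ES with an approximate 95% CI."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

# Hypothetical replication ESs (e.g., Cohen's d) with standard errors:
es, (lo, hi) = meta_analytic_es([0.35, 0.42, 0.28], [0.10, 0.12, 0.09])
original_es = 0.40  # hypothetical original point estimate
# "Replicable" per scenario #1: CI excludes zero AND includes original ES.
replicable = (lo > 0 or hi < 0) and (lo <= original_es <= hi)
print(f"meta-analytic ES = {es:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]; "
      f"replicable = {replicable}")
```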

Conclusion
It is important to note that replicability should be seen as a minimum requirement for scientific progress rather than an arbiter of truth. Replicability ensures that a research community avoids going down blind alleys, chasing anomalous results that emerged due to chance, noise, or other unknown errors. When adjudicating the replicability of an effect, however, it is important to keep in mind that an apparently non-replicable effect does not necessarily mean the tested hypothesis is false: It is always possible that the effect is replicable via alternative methods or operationalizations and/or that there were problems with some of the auxiliary hypotheses (e.g., invalid measurement or unclear instructions). This possibility, however, should not be exploited: eventually one must weigh the value of continued testing of a hypothesis across different operationalizations and contexts. Conversely, an effect that appears replicable does not necessarily mean the tested hypothesis is true: A replicable effect may not reflect a valid and/or generalizable effect (e.g., it may simply reflect a measurement artifact and/or may not generalize to other methods, populations, or contexts).
The recommendations advocated in this article are based on curating over one thousand replications at Curate Science (as of August 2018). They have been applied to each of the replications in its database, including using our suggested language to describe the outcome of each curated replication. We expect, however, that these recommendations will evolve as additional replications, from an even wider set of studies, are curated and evaluated (indeed, as of September 2018, approximately 1,800 replications were in the queue to be curated at Curate Science). Consequently, these recommendations should be seen as a starting point for the research community to more accurately evaluate replication results, to be refined as we gradually develop more sophisticated approaches to interpreting replication evidence. We hope that our proposed recommendations will be a stepping stone in this direction and consequently accelerate psychology's path toward becoming a more cumulative and valid science.