Exploring reliability heterogeneity with multiverse analyses: Data processing decisions unpredictably influence measurement reliability

Analytic flexibility is known to influence the results of statistical tests, e.g. effect sizes and p-values. Yet, the degree to which flexibility in data processing decisions influences measurement reliability is unknown. In this paper I attempt to address this question using a series of 36 reliability multiverse analyses, each with 288 data processing specifications, including accuracy and response time cut-offs. I used data from a Stroop task and a Flanker task at two time points, as well as a Dot Probe task across three stimuli conditions and three timepoints. This allowed for a broad overview of internal consistency and test-retest reliability estimates across a multiverse of data processing specifications. Largely arbitrary decisions in data processing led to differences between the highest and lowest reliability estimate of at least 0.2, but potentially exceeding 0.7. Importantly, there was no consistent pattern in the reliability estimates resulting from the data processing specifications, across time as well as tasks. Together, these results show that data processing decisions have a large, and largely unpredictable, influence on measure reliability. I discuss actions researchers could take to mitigate some of the influence of reliability heterogeneity, including adopting hierarchical modelling approaches. Yet, there are no approaches that can completely save us from measurement error. Measurement matters and I call on readers to help us move from what could be a measurement crisis towards a measurement revolution.

In this paper I was concerned with the influence of analytic flexibility on measurement reliability, specifically in data processing or data cleaning. I took inspiration from numerous papers reporting the unsettlingly low reliability of Dot Probe attention bias indices (e.g. Jones et al., 2018; Schmukle, 2005; Staugaard, 2009) and other work investigating alternative analyses and data processing strategies with the intention of yielding a more reliable measurement (e.g. Jones et al., 2018; Price et al., 2015). When considering the impact of researcher degrees of freedom, focus is drawn to decisions made at the beginning (task design) or at the end (data analysis) of the research process. I was interested in the middle step: data processing and measure reliability. In this paper, I explore and visualise the influence of data processing steps on reliability using a series of reliability multiverse analyses.

Getting up to speed with reliability
The accuracy of our conclusions rests on the quality, and the strength, of our evidence. Our evidence rests on the bedrock of our measurements. The quality of our measures defines the quality of our results. Without adequate focus on the validity of our measures, how can we be assured that we are capturing the concept or process that we are interested in? Without any attention to the reliability of our measures, how can we be sure that we are capturing a phenomenon with any precision? Psychological science has a guilty habit of neglecting these foundations, though of course some areas fare better than others.
In a recent paper, my colleagues and I argued for a widespread appreciation for the reliability of our cognitive measures. Briefly: low reliability places doubt on the veracity of statistical analyses using that measure; measurement reliability restricts the observable range of effect sizes in simple correlational analyses, and unpredictably so in more complicated models; and failing to correct for measurement error makes comparing effect sizes between, and within, studies difficult. These issues are compounded by the sad observation that the reporting of reliability (and validity) evidence is woefully poor. Scale validity and reliability are not routinely examined, and many scales are adapted on an ad hoc basis with little or no validation (Flake et al., 2017). In other cases scales fail to pass deeper psychometric evaluation, including tests of measurement invariance (Hussey and Hughes, 2018). This likely reflects more superficial approaches to establishing validity evidence - i.e. reporting Cronbach's alpha, stating it is adequate, and moving on. Pockets of psychological science take a more enlightened approach. However, I feel it is reasonable to argue that the field at large is not doing well in its measurement practices. Most relevant to this paper: it is the exception rather than the norm to evaluate the psychometric properties of cognitive measurements (Gawronski et al., 2011).
Strictly speaking, we cannot state that a task is unreliable; although we might observe a consistent pattern of unreliability in measurements obtained that causes us to question further use of the task. An important reminder: estimates of reliability refer to the measurement obtained -in a specific sample and under particular circumstances, including the task parameters. Reliability is therefore not fixed; it may differ between populations, samples, and testing conditions. Variations of a task may lead to the generation of more or less reliable measurements. For example, the stimulus presentation duration will likely influence the cognitive processes involved in completing the task, perhaps leading participants to perform more consistently in one version, relative to another. Reliability is a property of the measurement, not of the task used to obtain it. In this study, we are concerned with the data processing steps researchers take and how these influence our measurement, and the resulting reliability estimates. To explore this, I invite you to join me, dear reader, on a walk through the garden of forking paths.

Analytic flexibility and the garden of forking paths
Every result presented in every research article is the culmination of many decisions made by one or more researchers; the sheer number of combinations of valid decisions is likely incalculable. The "garden of forking paths" (Gelman and Loken, 2013) is a useful analogy to illustrate this. With each decision that must be made, however arbitrary, the researcher comes to a fork in their research path and selects one branch. To add a little suspense, there will be many cases when the researcher does not notice a fork in the road. Perhaps the researcher unconsciously makes the same turn as always, their feet working of their own accord. These forks in the path, the decisions researchers make (whether they are aware of them or not), may be reasonably combined to make a near-uncountable number of paths. Each path leads to a location; some paths end close to one another, and other times the paths diverge wildly. We can think of the end of the path as the statistical result our researcher arrives at.
The researcher has to decide their path based on the soundest justifications they can make at each fork (e.g. Lakens et al., 2018). Of course, psychological science has become fully aware of the detrimental effects of selecting one's path retrospectively, based on where the path ends or the results most exciting to the researcher (read as: p < .05; e.g. Simmons et al., 2011). Analytic flexibility is not inherently bad. However, we must acknowledge the ramifications. The effects we observe, or do not, are potentially influenced by all of the decisions made to arrive at them. Thus, a range of possible effects may have been observed that could be more or less equally valid or justifiable based on the analytical decisions made.
In discussions of analytical flexibility, focus is usually given primarily to decisions made during statistical analysis. For example, should I control for age and gender? Is this model more appropriate than that one? Where should I set my alpha, and how should I justify the decision? Discussions of analytical flexibility often concern issues around p-hacking and other QRPs (intended or unintended). However, as Leek and Peng (2015) note, p-values are the tip of the iceberg; not enough scrutiny is given to the impact of the many steps in the research pipeline that precede inference testing. I agree. In my estimation, flexibility in measurement and data handling does not receive the scrutiny it deserves. If the garden of forking paths concerns analytic flexibility, then measurement flexibility decides which gateway one enters the garden through in the first place. As an example, a recent review highlighted the lack of consensus around the processing of task data in the attention control literature, including but not limited to the data pre-processing used in this paper (von Bastian et al., 2020, pp. 47-48).

Mapping the garden of forking paths with multiverse analyses
Multiverse analyses (Steegen et al., 2016) offer us a "GPS in the garden of forking paths" (Quintana and Heathers, 2019). The process is simpler than one might expect. First, we define a set of reasonable data processing and analysis decisions. Second, we run the entire set of analyses. We can then examine the results across the entire set of specifications. Specification curve analysis (Simonsohn et al., 2015) adds a third step, allowing for inference tests across the distribution of results generated in the multiverse (for insightful applications of specification curve analyses, see Orben and Przybylski, 2019; Rohrer et al., 2017). In this paper I use 'specification' to refer to each combination of data processing decisions in the multiverse analysis.
Multiverse analyses enable us to explore how a researcher's -sometimes arbitrary -choices in data processing (e.g. outlier removal) and analysis decisions (e.g. including covariates, splitting samples) influence statistical results, and the conclusions drawn from the analysis. From this we can examine which choices are more or less influential than others, as well as how robust the result is across the full set of specifications.

A reliability multiverse from many data processing decisions
In this paper I report multiverse analyses exploring the influence of data processing specifications on the reliability of a calculated measurement. I used openly accessible Stroop task and Flanker task data generously shared by Hedge and colleagues (Hedge et al., 2018) and Dot Probe task data from the CogBIAS project (Booth et al., 2017; Booth et al., 2019). Following our previous work in this area, I was interested in the stability and range of reliability estimates for cognitive-behavioural measures. Broadly, I was interested in the impact of data processing decisions on reliability. It is possible that certain analytic decisions tend to yield higher reliability estimates; it may be that particular combinations of decisions are also better, or worse, than others. Beyond that, I was interested in the range of estimates. A small range would suggest that measure reliability is relatively stable as we make potentially arbitrary data processing decisions while walking the garden of forking paths. A large range suggests hidden measurement reliability heterogeneity. This is potentially an important, and underappreciated, contributor to the replicability crisis (Loken and Gelman, 2017). Alternatively, it could be a herald for a crisis of measurement.

Data
Stroop and Flanker task data were obtained from the online repository for Hedge, Sumner, and Powell (Hedge et al., 2018, https://osf.io/cwzds/). Full details of the data collection, study design, and procedure can be found in Hedge et al. (2018). These data are ideal for our purposes as they a) contain many trials, helping us obtain more precise estimates of reliability, and b) include two assessment time-points approximately 3-4 weeks apart, allowing us to explore both internal consistency and test-retest reliability. The data were collected across different studies; for simplicity, in this paper the data were pooled across studies (n = 107 before any data processing - note that this may differ from the sample size presented by Hedge et al. due to differences in data processing).
Dot Probe data were obtained from the CogBIAS project (Booth et al., 2017; Booth et al., 2019). Full details of the study and data collection can be found in Booth et al. (2017; 2019). These data complement the Stroop and Flanker data as they provide a longer test-retest duration (approximately 1.5 years between repeated measures) across three timepoints. In addition, the task incorporated three stimuli conditions, allowing cross-sectional comparisons of reliability stability within the same task. The Dot Probe data were pooled such that only the subset of participants who completed the task at all three timepoints was retained (n = 285).
Interested readers can find the data and code used to perform the multiverse analyses and generate this manuscript in the Open Science Framework repository for this project (https://osf.io/haz6u/). 1

Stroop task
Participants made keyed responses to the colour of a word presented in the centre of the screen. In congruent conditions the word was the same as the font colour, whereas, in incongruent trials, the word was a different colour from the font colour. In a neutral condition, the word was not a colour word. Participants completed 240 of each trial type. The outcome index we explore here is the RT cost, calculated as the average RT for incongruent trials minus the average RT for congruent trials.

Flanker task
Participants made keyed responses to the direction of a centrally presented arrow. In congruent trials, the central arrow was flanked by arrows pointing in the same direction, whereas, in incongruent trials, the flanking arrows pointed in the opposite direction. In a neutral condition, the flanking stimuli were not arrows. Participants completed 240 of each trial type. The outcome index we explore here is the RT cost, calculated as the average RT for incongruent trials minus the average RT for congruent trials.

Dot Probe task
Participants made keyed responses to the identity of a probe presented on screen. The probe was presented in the same location as one of a pair of faces presented on screen for 500ms prior. The face pairs comprised an emotional face (angry, pained, or happy) paired with a neutral face (taken from the STOIC faces database; Roy et al., 2009). In congruent trials, the probe was presented in the same location as the emotional face. In incongruent trials, the probe was presented in the same location as the neutral face. Participants completed three blocks of 56 trials, corresponding to the emotion presented. The 'attention bias' outcome index (MacLeod et al., 1986) was calculated as the average RT for incongruent trials minus the average RT for congruent trials.
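The RT cost and attention bias indices described above share the same arithmetic. A minimal sketch in Python, with hypothetical condition labels and made-up data (none of this code comes from the original study), might look like:

```python
import statistics

def rt_cost(trials, average="mean"):
    """Average RT for incongruent trials minus average RT for congruent
    trials. `trials` is a list of (condition, rt_ms) pairs; the condition
    labels are illustrative assumptions, not the study's actual coding."""
    avg = statistics.mean if average == "mean" else statistics.median
    incongruent = [rt for cond, rt in trials if cond == "incongruent"]
    congruent = [rt for cond, rt in trials if cond == "congruent"]
    return avg(incongruent) - avg(congruent)

trials = [("congruent", 500.0), ("congruent", 520.0),
          ("incongruent", 560.0), ("incongruent", 580.0)]
print(rt_cost(trials))  # 60.0: slower responding on incongruent trials
```

The `average` parameter anticipates the mean-vs-median processing decision discussed in the multiverse analysis section.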

Multiverse analysis
In a personal effort to make my research reproducible, and to help others perform similar analyses, I have developed simple functions to perform the multiverse analyses reported in this paper. Readers interested in performing similar analyses can find these functions within the splithalf package (Parsons, 2021) and tutorials on the related GitHub page (https://github.com/sdparsons/splithalf).
Step 1. Creating a list of all specifications. No data were removed before the multiverse analysis. To my knowledge, there are no fixed standards in the literature for processing data from any of these tasks. I identified six decisions common to processing RT data, though there are many more. For simplicity I stuck to RT difference scores as the outcome measure of interest. However, very different analytical techniques might be applied to RT tasks such as these (for example, multilevel modelling and drift-diffusion modelling approaches). The decisions were as follows:
• Total accuracy. Researchers may opt to remove participants with accuracy lower than a prespecified cut-off, for example 80 or 90 per cent. I used three options: no cut-off, 80 per cent, and 90 per cent.
• Absolute response time removals. Researchers will often remove trials faster than a minimum RT threshold and trials that exceed a maximum RT threshold. I used minimum RT cut-offs of 100ms and 200ms, as well as no cut-off, and two maximum RT cut-offs: 2000ms and 3000ms.
• Relative RT cut-offs. After absolute RT cut-offs, researchers can decide to remove trials with RTs greater than a number of standard deviations from the mean (sometimes called relative cut-offs or trimmed means). Three SDs from the mean would remove only very extreme outliers; two SDs from the mean is common. I have not seen researchers use one SD from the mean as a cut-off, as it is likely too conservative a threshold. However, as I was interested in a wide range of possible specifications, I included one SD. In the multiverse I use no relative cut-off, and cut-offs of one, two, and three SDs from the mean.
• Where to apply the relative cut-off. The decision to remove trials based on an SD cut-off comes with its own decision: the level at which it is applied. We could remove trials with RTs greater than 2 SDs from the participant's average RT, for example (e.g. Barth, 2022). We could also remove trials with RTs greater than 2 SDs from the mean RT within each trial type (congruent and incongruent, for example). I included both options: participant level and trial type level.
• Averaging. Most often the mean RT within each trial type is calculated, which may then be analysed directly or used to compute a difference score. Researchers may opt to use the median RT instead. I included both options.
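To make the combinatorics concrete, the six decisions above can be crossed into a full specification list. A sketch in Python (the option values follow the text; the variable and key names are my own):

```python
from itertools import product

# Each option list mirrors one of the six data processing decisions.
accuracy_cutoffs = [None, 0.80, 0.90]            # participant removal threshold
min_rt_cutoffs = [None, 100, 200]                # ms; faster trials removed
max_rt_cutoffs = [2000, 3000]                    # ms; slower trials removed
sd_cutoffs = [None, 1, 2, 3]                     # relative (trimmed) cut-offs
sd_levels = ["participant", "trial_type"]        # where the SD cut-off applies
averaging = ["mean", "median"]

specifications = [
    dict(zip(
        ["accuracy", "min_rt", "max_rt", "sd", "sd_level", "average"],
        combo,
    ))
    for combo in product(
        accuracy_cutoffs, min_rt_cutoffs, max_rt_cutoffs,
        sd_cutoffs, sd_levels, averaging,
    )
]

print(len(specifications))  # 3 * 3 * 2 * 4 * 2 * 2 = 288
```

Each dictionary in `specifications` corresponds to one path through the garden of forking paths; the multiverse simply walks all of them.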
Step 2. Run all specifications and extract reliability estimates. From this decision list, we have a complete list of 288 data processing specifications. In the multiverse analysis the data are processed following each specification's parameters, before the reliability of the resulting outcome measure is estimated. Internal consistency was estimated using 500 permutations of the splithalf procedure for each specification (5000 is standard, but 500 was selected to reduce processing time). Following Hedge et al. (2018), and because the ICC relates to both the correlation and the agreement among repeated measures, test-retest reliability was estimated using ICC2k (Koo & Li, 2016).
Step 3. Visualising the multiverse. I find that one of the joys of multiverse analyses is the visualisations, because sometimes science is more art than science. I explain the visualisations in the results section.

Analysis plan
For the core analysis I performed 18 multiverse analyses following the steps described above. Separately for each of the Stroop and Flanker task datasets, I examined internal consistency reliability at time 1 and at time 2, as well as test-retest reliability from time 1 to time 2. For the Dot Probe data, I examined internal consistency reliability at each of the three timepoints, separately for the three task conditions, as well as test-retest reliability across timepoints. For each multiverse I report the median estimate and its 95% confidence interval, the proportion of estimates exceeding 0.7, and the range of estimates in that multiverse. In addition to visualising each multiverse, I also include visualisations overlapping the internal consistency multiverses over time. These overlapped plots allow us to visually inspect whether the pattern of reliability estimates across the full range of data processing specifications is comparable at each time point.
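The per-multiverse summary statistics just described can be computed in a few lines. This sketch uses my own naming and a simple 95% percentile interval across specifications as a stand-in for the reported interval; it assumes a flat list of reliability estimates, one per specification:

```python
import statistics

def summarise_multiverse(estimates):
    """Summarise one multiverse: the median estimate, a 95% percentile
    interval across specifications, the proportion of estimates exceeding
    0.7, and the range. Percentiles use simple index truncation rather
    than any particular interpolation rule."""
    ordered = sorted(estimates)
    n = len(ordered)
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]
    return {
        "median": statistics.median(ordered),
        "interval_95": (lower, upper),
        "prop_above_0.7": sum(e > 0.7 for e in ordered) / n,
        "range": ordered[-1] - ordered[0],
    }
```

Running this once per multiverse (18 times for the core analysis) reproduces the shape of the summary table described in the text.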

Inferences from the multiverse
It is not my aim in this paper to make inferences from these reliability multiverse analyses as one would in a specification curve analysis (Simonsohn et al., 2015). One could use this method to perform inference testing against the curve of reliability estimates. However, it is not clear what this would add: testing whether the reliability estimates significantly differ from zero is a low bar for assessing the reliability of a measure.

Results
I include a visualisation for each multiverse analysis. The reliability estimates are presented on the y-axis at the top of the figure; each estimate is represented by a black dot and the 95% confidence interval is represented by the shaded band. The x-axis indicates each individual multiverse specification of processing decisions (288 total), displayed in the 'dashboard' at the bottom of the figure. The vertical dashed line running through the top panel and the bottom dashboard represents the median reliability estimate. This line is extended through the dashboard to show that the estimate is derived from a unique combination of data processing decisions, comprising (from top to bottom, in order of processing step): 1) participant removal below the total accuracy threshold, 2) maximum RT cut-off, 3) minimum RT cut-off, 4) removal of RTs more than this number of SDs from the mean, 5) whether this removal is at the trial or subject level, and 6) use of the mean or median to derive averages.

Overlapping time 1 and time 2 multiverses
In the next two figures I overlap the time 1 and time 2 multiverses, separately for the Stroop and Flanker data. The specifications are ordered by the reliability estimates at time 1 for each measure (Figures 1 and 3). These figures allow us to compare the patterns of reliability estimates following the same data processing decisions.

Dot Probe Task
For ease of presentation (and to reduce the total number of figures), we visualise the Dot Probe task reliability multiverses entirely as overlapping plots.

Secondary analyses: reliability and number of trials
Increasing the number of trials typically increases reliability estimates (e.g. Hedge et al., 2018; von Bastian et al., 2020). A visual inspection of the multiverses suggests that specifications involving the removal of more trials (i.e. removing trials greater than 1 standard deviation from the average) lead to higher reliability estimates. Table 1 presents the Pearson correlations between the reliability estimates and the number of trials retained in each specification. For internal consistency reliability these correlations typically ran counter to the expectation that fewer trials lead to reduced reliability. In most cases the association was negative - removing more trials during data processing was associated with higher reliability estimates. In contrast, for most of the test-retest reliability multiverses, removal of more trials led to lower reliability estimates.

Figure 3. Test-retest reliability multiverse for Stroop RT cost
To investigate this further, I reran the multiverses for the Stroop and Flanker data using only the first half of trials collected for each participant. I also reran the multiverses for the Dot Probe data using only the first 20 trials for each trial type (I attempted to rerun the Dot Probe data with only 14 trials per trial type, but this led to errors under stricter specifications where there were too few trials to run the reliability estimation). To save the reader from viewing all 18 multiverses a second time, the code and all outputs can be found in the supplementary materials. On visual inspection of the multiverse visualisations, the overall pattern of results is similar: specifications resulting in the removal of more trials tend to result in higher reliability estimates. The final column in Table 1 presents the mean difference in reliability estimates for each of the 18 multiverses (positive values indicate higher reliability estimates with the full number of trials). For internal consistency estimates, multiverses with fewer trials had lower reliability estimates, on average, for the Stroop and Flanker tasks. But, against expectations, reliability estimates increased for the Dot Probe task when the number of trials was reduced. In contrast, almost all test-retest estimates were reduced in the reduced-trials analyses. Figure 13 presents the difference between reliability estimates in the full vs reduced trials multiverses for all 18 multiverse analyses.

Discussion
Across 18 reliability multiverse analyses, and their colourful visualisations, we explored the influence of data pre-processing specifications on measure reliability. To briefly summarise: internal consistency reliability estimates ranged from 0.58 to 0.92 in the Stroop data, 0.59 to 0.93 in the Flanker data, and -0.28 to 0.68 in the Dot Probe data. Test-retest reliability estimates ranged from 0.47 to 0.63 in the Stroop data, 0.29 to 0.72 in the Flanker data, and 0 to 0.11 in the Dot Probe data. From the introduction we remember that reliability estimates are a product of: the sample and the population they are drawn from, the task (including any differences in implementation), and the circumstances in which the measurement was obtained, i.e. reliability is not an inherent quality of the task itself. The first conclusion we can draw from these multiverse analyses is that data processing specifications are also an integral part of this list.

Figure 4. Internal consistency reliability multiverse for Flanker RT cost at time 1
At the outset of this project, I thought it reasonable to assume that a particular feature of the data processing path might result in consistently higher (or lower) reliability estimates. The clearest indication we can take from these analyses is that there is no single set of data processing specifications, or combination of data processing decisions, that leads to improved reliability. The wide ranges of estimates are an additional cause for concern. Seemingly arbitrary data processing decisions can lead to differences of more than .3 in the reliability of a measure. These decisions are equally reasonable and logical choices, and we should not expect them to have a meaningful impact on the theoretical questions being asked of the data. The reliability multiverse analyses presented here demonstrate this using data from a Stroop and a Flanker task. As well as across tasks, overlapping the time 1 and time 2 multiverses for both tasks highlights that even the same set of specifications does not lead to directly comparable internal consistency reliability estimates over time. Data processing decisions appear to be extremely important contributors to measure reliability, but their influence is unpredictable and arbitrary.

Figure 5. Internal consistency reliability multiverse for Flanker RT cost at time 2
The secondary analyses give us more insight into the relationship between the number of trials retained through data processing and the resultant reliability estimates. The picture is not a simple one. Figure 13 highlights the unpredictable influence of what is essentially another multiverse specification decision -do I remove half of trials before any other data processing? While the underlying pattern of more data reduction leading to greater reliability generally holds across tasks, within tasks fewer trials led to lower reliability on average for the Stroop and Flanker tasks (as we should expect) but not the Dot Probe. More work is needed to unravel these influences, but a take-home message may be: while administering more trials to participants is typically a good thing for reliability, there may be some benefit (in terms of reliability) of removing more trials. Though, as I discuss below, pursuit of reliability alone should not be the goal.
In the core of this discussion I raise several open questions and suggest some plausible actions that could be taken to mitigate some of the risk reliability heterogeneity poses.

How do we guard against reliability heterogeneity?
In simple bivariate analyses, we usually think low reliability will simply attenuate estimated effect sizes (e.g. Spearman, 1904). But the influence can be far less predictable (the reader may be noticing a trend of unpredictability in this paper). Low reliability can lead to elevation of effect size estimates and even reversals in direction (for examples, see Brakenhoff et al., 2018; Segerstrom and Boggero, 2020), with the influence becoming more unpredictable in more complex models. It is therefore important to take reliability heterogeneity into account when comparing effect sizes (for several clear examples, see Cooper et al., 2017). It is plausible that some studies may have obtained smaller or larger effect sizes than others based, in part, on the reliability of the measurements taken. Similarly, identical observed effect sizes may represent very different 'true' effect sizes, if reliability is taken into account. Recently, Wiernik and Dahlke (2020) made a strong case for correcting for measurement error in meta-analyses and provide the necessary formulae and code for doing so. There are several actions we can take to begin to account for reliability heterogeneity.

Figure 6. Internal consistency reliability multiverse for Flanker RT cost at time 2

Two simple recommendations
To briefly reiterate two recommendations my colleagues and I have made previously: a) report all data processing steps taken, and b) report the reliability of the measures analysed. These recommendations will not 'fix' potential psychometric issues within one's study, or reliability heterogeneity across studies. However, complete reporting of data processing will assist in the computational reproducibility of one's results. Reporting psychometric information will assist in the interpretation of results, including comparisons of effect sizes, as well as provide useful information about the utility of a task in studies of individual differences.

Multiverse analyses as a robustness check
One approach is running a multiverse across a justified set of data processing specifications (ones that yield the same theoretically justified construct of interest; see the section on validity below) and generating a distribution of effect sizes from the final analyses under these specifications. In principle this is the same as a sensitivity or robustness analysis, and it acts as a check on the reliability heterogeneity introduced by different (but equally justifiable) data processing specifications.

Adopt a modelling approach
Incorporating trial-level variation into our analyses with hierarchical modelling approaches (aka mixed models, multilevel models) will likely be a vital step in protecting us against reliability heterogeneity. Psychological effects are often heterogeneous across individuals (Bolger et al., 2019), and factors within tasks have important effects (e.g. stimuli differences; DeBruine and Barr, 2021). It follows that our models should take trial-level variation into account. More than this, models that capture the theorized data generating process, including relevant distributions (e.g. response time distributions are typically very right skewed), likely have a better chance of capturing the process of interest in the first place. Using the Stroop and Flanker data from Hedge et al. (2018), Rouder, Kumar, and Haaf (2019; also see Rouder and Haaf, 2018) demonstrated that hierarchical models should be used to account for error in measurement (for additional guidance on applying this modelling, see Haines, 2019). Adopting this approach has the benefit of 'correcting' the effect size estimate (and standard error) for measurement error as part of the model, rather than as an additional step to aid in interpretations and effect size comparisons (a step that is often missed once reliability is deemed "acceptable", assuming that reliability is estimated in the first place). Rouder and colleagues demonstrate that this is also a more effective approach than 'correcting' the effect size estimate using e.g. Spearman's correction for attenuation (Spearman, 1904). Yet, even better corrections cannot fully save us from measurement error.

Figure 8. Overlapped internal consistency reliability multiverse for Flanker RT cost at times 1 and 2
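For reference, Spearman's correction for attenuation divides the observed correlation by the square root of the product of the two measures' reliabilities. A minimal sketch, with made-up values:

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's (1904) correction for attenuation: the estimated 'true'
    correlation given the observed correlation and each measure's
    reliability. Illustrative only; the example values are invented."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# The same observed r implies very different 'true' correlations under
# different reliabilities - one face of reliability heterogeneity:
print(round(disattenuate(0.30, 0.9, 0.9), 3))  # 0.333
print(round(disattenuate(0.30, 0.6, 0.5), 3))  # 0.548
```

This makes concrete why comparing uncorrected effect sizes across studies with heterogeneous reliability can mislead, and why hierarchical models that absorb measurement error into the model itself are preferable to after-the-fact correction.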
Hierarchical models do bring their own considerations and potential issues. Applied researchers, or those without training, may need further support to ensure the model specifications are appropriate. The model covariance structure, and appropriate priors in the case of Bayesian approaches, have the potential to introduce additional sources of bias/researcher degrees of freedom. But, given existing resources and a growing body of training materials and work in this area, it is my view that a modelling approach is likely the best next step (Haines et al., 2020; Rouder et al., 2019; Sullivan-Toole et al., 2021; DeBruine and Barr, 2021). An additional benefit of these approaches is that they typically avoid much of the data pre-processing discussed in this paper, and thus the reliability heterogeneity it generates.

Limitations and room for expansion
A small number of tasks. One limitation of this study is the focus on a small sample of tasks. It is possible that data from other tasks tend to yield more or less consistent patterns of reliability estimates across data processing specifications. Similarly, I have only examined RT costs (i.e. a difference score between two trial types) as the outcome measure. The analyses could have examined accuracy rates, RT averages, signal detection, and a wide variety of outcome measures. It is very possible that other outcome indices would be more or less consistently reliable across the range of data processing specifications. I opted for brevity in this paper by selecting only these tasks; I welcome future work seeking to examine a wider range of tasks and outcome indices.

Figure 9. Internal consistency reliability multiverse for Dot Probe attention bias (angry faces) at times 1, 2, and 3
Extracting the influence of individual decisions. The analyses here do not allow for an in-depth examination of the influence of specific data processing decisions. Given the lack of consistency across timepoints and measures, I am not confident that robust conclusions could be drawn about one specific decision compared to another. A plausible approach to examine this is a Vibration of Effects analysis (e.g. Klau et al., 2021), in which the variance of the final distribution of estimates is decomposed to examine the relative influence of different categories of decisions, e.g. model specifications and data processing decisions. Using this information, we might be able to prioritise sources of measurement heterogeneity more accurately.
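The core of such a decomposition can be sketched in a few lines. The example below is a minimal illustration with made-up reliability estimates and invented decision factors (an accuracy cut-off and an RT outlier rule); in a real analysis the estimates would come from the multiverse pipeline itself, and Klau et al.'s method involves more than this simple between-group variance ratio.

```python
from itertools import product
from statistics import mean, pvariance

# Hypothetical multiverse: every combination of an accuracy cut-off and
# an RT outlier rule, paired with a made-up reliability estimate.
acc_cutoffs = [0.5, 0.7, 0.9]
rt_rules = ["none", "2sd", "fixed"]
made_up = [0.62, 0.48, 0.55, 0.66, 0.41, 0.58, 0.71, 0.39, 0.60]
estimates = dict(zip(product(acc_cutoffs, rt_rules), made_up))

def eta_squared(estimates, factor_index):
    """Share of the total variance in the estimates attributable to one
    decision factor (between-group variance / total variance)."""
    values = list(estimates.values())
    grand_mean = mean(values)
    groups = {}
    for spec, rel in estimates.items():
        groups.setdefault(spec[factor_index], []).append(rel)
    between = sum(
        len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()
    ) / len(values)
    return between / pvariance(values)

share_acc = eta_squared(estimates, 0)  # accuracy cut-off decision
share_rt = eta_squared(estimates, 1)   # RT outlier rule decision
```

With these invented numbers, nearly all of the spread is attributable to the RT outlier rule, illustrating how such a decomposition could flag which category of decision deserves the most scrutiny.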
Applicability to experimental vs correlational analyses. There is a paradox in measurement reliability (see Hedge et al., 2018): experimental effects that are highly replicable (for example, the Stroop effect) may also show low reliability. Homogeneity within groups or experimental conditions allows for larger and more robust effects; researchers can opt to develop tasks that capitalise on this homogeneity. Unfortunately, reliability requires robust individual differences (and vice versa). Highly reliable measures by necessity show consistent, potentially large, individual differences and would not be suitable for group differences or experimental research. As a result, measures tend to be more appropriate for questions of either a) differences between groups or experimental conditions, or b) correlational or individual differences. I was primarily concerned with the use of these measures in individual differences research, hence the focus on reliability. Yet, it would be overly simplistic to assert that the discussions in this paper do not also relate to experimental questions. Indeed, the data processing specifications that maximise a measure's utility in individual differences analyses can also hinder its utility in experimental questions. Further research would be needed to quantify the relative influences on correlational vs experimental analyses. Yet, large fluctuations in the relative between-subjects vs within-subjects variance, due to data processing, hold importance for any research question.

Figure 10. Internal consistency reliability multiverse for Dot Probe attention bias (happy faces) at times 1, 2, and 3
Simulation studies. Several valuable extensions to the current approach could be made via simulation. By simulating data with a known measurement structure, we could examine the variance in reliability estimates that operates purely by chance: i.e. where no systematic differences in reliability exist across pre-processing decisions. Comparing these distributions to those observed in tasks such as those analysed here would offer insight into the severity of the reliability heterogeneity introduced in "real world" data. These simulations are beyond the scope of this initial paper; however, they hold promise for detecting variance and bias relative to a 'true' value of reliability in the simulated data.

Figure 11. Internal consistency reliability multiverse for Dot Probe attention bias (pain faces) at times 1, 2, and 3
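To make the simulation idea concrete, here is a minimal standard-library sketch. The true-score-plus-noise generating process and every parameter value are illustrative assumptions, not the design of the proposed studies: because the process never changes across repetitions, all of the resulting spread in reliability estimates is sampling error alone, the chance-only baseline against which observed heterogeneity could be compared.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def simulate_split_half(n_subjects=100, n_trials=40,
                        true_sd=1.0, error_sd=2.0):
    """Simulate one dataset from a known true-score-plus-noise process
    and return its Spearman-Brown corrected odd/even split-half
    reliability."""
    odd_means, even_means = [], []
    for _ in range(n_subjects):
        true_score = random.gauss(0, true_sd)
        trials = [true_score + random.gauss(0, error_sd)
                  for _ in range(n_trials)]
        odd_means.append(mean(trials[0::2]))
        even_means.append(mean(trials[1::2]))
    r = pearson(odd_means, even_means)
    return 2 * r / (1 + r)  # Spearman-Brown correction

random.seed(2021)
# Identical generating process each time: the spread is pure chance.
estimates = [simulate_split_half() for _ in range(200)]
spread = max(estimates) - min(estimates)
```

Under these parameters the theoretical split-half reliability works out to roughly .91 (each half-score has true variance 1 and error variance 4/20), so the simulated estimates scatter around that value by sampling error alone.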

What about validity?
Others have previously demonstrated that measures are often used ad hoc or with little reported validation effort (e.g. Flake et al., 2017; Hussey and Hughes, 2018). This study cannot begin to assess the influence of data processing flexibility on measure validity, nor did it attempt to address this question. Reliability is only one piece of evidence needed to demonstrate the validity of a measure. Yet, it is an important piece of evidence, as "reliability provides an upper bound for validity" (Zuo et al., 2019, page 3). While we cannot directly conclude that flexibility in data processing influences measure validity, we should look to further research to investigate. One possibility would be to conduct a validity multiverse analysis similar to the "Many Analysts, One Data Set" project (Silberzahn et al., 2018). In this project, 29 teams (61 analysts total) analysed the same dataset. The teams adopted a number of different analytic approaches, which resulted in a range of results. The authors concluded that, "Uncertainty in interpreting research results is therefore not just a function of statistical power or the use of questionable research practices; it is also a function of the many reasonable decisions that researchers must make in order to conduct the research" (page 354).
Another important validity consideration is the relationship between our data processing pipelines and the (latent) construct of interest. In questionnaire development, removing or adapting items might influence reliability. But, more importantly, it will give rise to a different measure that may be more or less related to our latent construct of interest. For example, Fried (2017) found that several common depression questionnaires captured very different clusters of symptoms, which should make us question what is meant by "depression" in the first place when using these measures.
More relevant to task measures, to maximise reliability we might seek to develop a novel version of a task that relies on average response times instead of a difference score between average response times. While this would yield a highly reliable measure, the purpose of the difference score is to isolate the process of interest. Therefore, while we would have maximised reliability, we would also have changed both the construct of interest and the validity of the measure. Perhaps this more reliable measure fails to capture the effect we intended to measure. For a more in-depth discussion of balancing these theoretical, validity, and reliability considerations, see von Bastian et al. (2020) and Goodhew and Edwards (2019).
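The trade-off described above has a well-known classical test theory expression: the reliability of a difference score depends on the reliabilities of its components and, crucially, on the correlation between them. The sketch below is my own illustration (function name and example values are hypothetical); it assumes, by default, equal variances in the two conditions.

```python
def difference_score_reliability(rel_x, rel_y, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical test theory reliability of a difference score D = X - Y.
    As the correlation between the two components rises, the reliability
    of their difference falls, even when each component is reliable."""
    num = sd_x ** 2 * rel_x + sd_y ** 2 * rel_y - 2 * r_xy * sd_x * sd_y
    den = sd_x ** 2 + sd_y ** 2 - 2 * r_xy * sd_x * sd_y
    return num / den

# Two conditions that are individually reliable (.80 each) but highly
# correlated (.70) yield a poorly reliable difference score.
rel_diff = difference_score_reliability(0.80, 0.80, 0.70)
```

This makes the paradox concrete: the strong correlation between congruent and incongruent trials that makes an RT cost a clean isolation of the process of interest is the very thing that suppresses its reliability.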
With respect to the data pre-processing steps taken in this paper, it could reasonably be argued that some pre-processing specifications yield different constructs of interest, or could be more or less valid for the process of interest. Are we really interested in a construct defined over only very accurate participants and only 60% of trials close to the average response time? In this sense, the data pre-processing decisions a researcher might adopt are certainly not arbitrary from a validity standpoint. A reasonable approach in applied work would be to select a narrower set of processing specifications that the researcher believes are theoretically similar enough that the same construct is being measured.

Returning to the garden
My intention for this project was to provide some indication of the influence of data processing pathways on the reliability of our cognitive measurements. The influence can be profound; the multiverse analyses show large differences between the highest and lowest reliability estimates. Yet, we see little consistency in the pattern of decisions leading to higher, or lower, estimates. We have the worst of both worlds: data processing decisions are largely arbitrary yet can have a large, and relatively unpredictable, impact on the resulting reliability estimates. Briefly returning to the garden of forking paths metaphor: I imagined that this project would help illuminate the point at which our hypothetical researcher enters the garden, based on their data processing decisions. But our investigation has uncovered an unfortunate secret: our researcher's forking paths are almost entirely arbitrary and interwoven. Each path diverges wildly, leading almost anywhere in the garden. It is as if our researcher is simply spinning in dizzy circles until they stumble somewhere along the fence of reliability. I discussed several actions researchers can take collectively to help with the issue. But these are by no means complete remedies for our reliability issues, nor would they directly help with the validity of our measurements. Thankfully, there is a growing awareness that measurement matters (Fried & Flake, 2018). A valuable term, Questionable Measurement Practices (QMPs), was recently added to our vernacular by Flake and Fried (2020). QMPs describe "decisions researchers make that raise doubts about the validity of the measures used in a study, and ultimately the validity of the final conclusion" (p. 458). I hope that QMPs and the importance of measurement become as widely discussed as the parallel idiom, 'Questionable Research Practices' (QRPs).
Most importantly, wider discussion of these practices should make it clear to all researchers that we make many potentially impactful decisions in the design of our measures, our data processing or cleaning, and our data analysis.
I am concerned that we sit on the precipice of a measurement crisis. The so-called replication crisis shook much of our field into widespread and ongoing reforms. Yet, much of the focus has been on improving methodological and statistical practices. This is undoubtedly worthwhile, but largely omits discussion of the reliability and validity of our measurements, despite our measurements forming the basis of any outcome or inference. This oversight feels like repairing a damaged wall while ignoring the shifting foundations under it. I hope that this paper, along with other related work, highlights the issue and encourages researchers to place more emphasis on quality measurement. As a field, we can orchestrate a measurement revolution (cf. the "credibility revolution"; Vazire, 2018) in which the quality and validity of our measurements are placed above obtaining desired results. If the reader takes home a single message from this paper, please let it be "measurement matters."

Figure 13. Difference in reliability estimates from all trials to reduced trials. Note: red = test-retest ICC2, blue = internal consistency estimate

I declare no conflicts of interest.
Funding: SP is currently supported by a Radboud Excellence Fellowship. This work was initially supported by an ESRC grant [ES/R004285/1].