What to make of equivalence testing with a post-specified margin?

In order to determine whether or not an effect is absent based on a statistical test, the recommended frequentist tool is the equivalence test. Typically, it is expected that an appropriate equivalence margin has been specified before any data are observed. Unfortunately, this can be a difficult task. If the margin is too small, then the test’s power will be substantially reduced. If the margin is too large, any claims of equivalence will be meaningless. Moreover, it remains unclear how defining the margin afterwards will bias one’s results. In this short article, we consider a series of hypothetical scenarios in which the margin is defined post-hoc or is otherwise considered controversial. We also review a number of relevant, potentially problematic actual studies from clinical trials research, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.


Introduction
Consider the following hypothetical situation. After having collected data, we want to determine whether or not an effect is absent based on a statistical test. All too often, in such a situation, non-significance (i.e. p > 0.05), or a combination of both non-significance and supposed high power (i.e. a large sample size), is used as the basis for a claim that the effect is null. Unfortunately, such an argument is logically flawed. As the saying goes, "absence of evidence is not evidence of absence" (Altman and Bland, 1995; Hartung et al., 1983). Instead, to correctly conclude the absence of an effect under the frequentist paradigm, the recommended tool is the equivalence test (also known as a "non-inferiority test" for one-sided testing (Wellek, 2010)).
Let $\theta$ be our parameter of interest. An equivalence test reverses the question that is asked in a null hypothesis significance test (NHST). Instead of asking whether we can reject the null hypothesis of no effect, e.g., $H_0: \theta = 0$, an equivalence test examines whether the magnitude of $\theta$ is at all meaningful: Can we reject the possibility that $\theta$ is as large or larger than our smallest effect size of interest, $\Delta$? The null hypothesis for an equivalence test is defined as $H_0: \theta \notin (-\Delta, \Delta)$. In other words, equivalence implies that $\theta$ is small enough that any nonzero effect would be at most equal to $\Delta$. The interval $(-\Delta, \Delta)$ is known as the equivalence margin and represents a range of values for which $\theta$ can be considered negligible.
In psychology research and in the social sciences, where the practice of equivalence testing is relatively new (but now "rapidly expanding"; Koh and Cribbie, 2013), there are many questions about how to best conduct and interpret equivalence tests. For example, consider the question of a "post-specified" margin. It is generally accepted that one must specify the equivalence margin a priori, i.e. before any data have been observed (Wellek, 2010). However, in our hypothetical situation, suppose that we did not have the foresight needed to pre-specify this margin. Are we then simply out of luck?
It is worth noting that lack of foresight is only one reason we may have failed to have pre-specified an appropriate equivalence margin. Defining and justifying the equivalence margin is one of the "most difficult issues" (Hung et al., 2005) for researchers. If the margin we define is deemed too large, then any claim of equivalence will be considered meaningless. If the margin we define is somehow too small, then the probability of declaring equivalence will be substantially reduced (Wiens, 2002). While the margin is ideally chosen as a boundary to objectively exclude the smallest effect size of interest (Lakens et al., 2017), these "ideal" boundaries can be difficult to define, and there is generally no clear consensus among stakeholders (Keefe et al., 2013). Furthermore, previously agreed-upon meaningful effect sizes may be difficult to ascertain as they are rarely specified in protocols and published results (Djulbegovic et al., 2011).
Suppose now that, having failed to pre-specify an adequate equivalence margin, we define the equivalence margin post-hoc, having already collected and observed the data. Given the potential consequences of interpreting data based on post-hoc decisions, it is understandable that this idea may be alarming to some; e.g., see the "Harkonen case" (as discussed in Lee and Rubin, 2016) in which the U.S. Department of Justice prosecuted drug-maker InterMune (United States v. Harkonen, 2013), for making claims based on post-hoc subgroup analyses.
In the biostatistics literature there are many warnings about how and when to specify the equivalence margin. Hung et al., 2005 And Wiens, 2002 observes that: "The potential biases of defining the margin after the study should be weighed against the cost and inconvenience of better understanding the differences [between study groups]." Finally, the Committee for Proprietary Medicinal Products (CPMP), 2001 (the EU scientific advisory organization dealing with new human pharmaceuticals approval) notes that: "it is prudent to specify a noninferiority margin in the protocol in order to avoid the serious difficulties that can arise from later selection." Statements such as these lead one to ask the following. Under what circumstances would equivalence testing with a data-dependent margin "not be interpretable?" What are the "potential biases" and "serious difficulties" we should consider in these, less than ideal, circumstances? Walker and Nowacki, 2011 stress that defining the equivalence margin before observing the data is "essential to maintain the type I error at the desired level" suggesting that potential type I error inflation is the issue of concern. Yet this too remains unclear. With equivalence testing becoming more and more common for psychology researchers, these are important matters to address.
In this article we will shed light on these curious questions by considering a series of rather confounding hypothetical scenarios (Sections 2 and 3) as well as a number of relevant case studies from biomedical research, where equivalence testing has been widely used for decades (Section 4). We conclude (Section 5) with an invitation for further discussion about how best to address the title question: what to make of equivalence testing with a post-specified margin?

The Pseudo-type I error and a pathological case
Before going forward, we would be wise to recall that, under the frequentist paradigm, hypotheses are statements about parameters and therefore are nonrandom quantities. Hence, each hypothesis is either true or false, irrespective of how the data are realized.
Let us define a symmetric equivalence margin as $(-\Delta, \Delta)$. Then the standard equivalence testing hypotheses are defined as:
$$H_0: \theta \notin (-\Delta, \Delta) \quad \text{versus} \quad H_1: \theta \in (-\Delta, \Delta).$$
Figure 1. The one-to-one correspondence between $\alpha$ and $\Delta$. In this plot, an equivalence test is conducted on two-sample normally distributed data. The observed mean difference is $\hat{\theta} = 0.2$, and the observed pooled standard deviation is equal to 1, with $n_1 = n_2 = 50$. The shape of this particular curve is specific to these particular data. However, for any general case, the smallest value of $\alpha$ needed to reject the null (x-axis) decreases as $\Delta$ increases (y-axis). Furthermore, as the dashed lines indicate, when $\Delta = \hat{\theta}$, the corresponding value of $\alpha$ will be 0.5.
There is a one-to-one correspondence between symmetric confidence intervals and equivalence testing.
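The null hypothesis, $H_0$, can be rejected whenever the realized confidence bounds satisfy $[\underline{\theta}(X; \alpha), \overline{\theta}(X; \alpha)] \subset (-\Delta, \Delta)$, where $\underline{\theta}(X; \alpha)$ and $\overline{\theta}(X; \alpha)$ denote the lower and upper bounds of the $(1 - 2\alpha)$ confidence interval. Conversely, there will be insufficient evidence to reject the null hypothesis whenever $[\underline{\theta}(X; \alpha), \overline{\theta}(X; \alpha)] \not\subset (-\Delta, \Delta)$. For example, with the standard $\alpha = 0.05$, we can reject $H_0$ if and only if a 90% CI for $\theta$ fits entirely within the equivalence margin. Equivalence testing provides the standard guarantee about type I error that $\Pr(\text{reject } H_0 \mid H_0 \text{ is true}) \le \alpha$; see Wellek, 2017. If we reject the null hypothesis if and only if the 90% CI for $\theta$ fits within $(-\Delta, \Delta)$, we can rest assured that we will make a type I error in at most 5% of cases.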
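The confidence-interval rule can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not part of the original analysis: it uses a normal approximation in place of the t-distribution, and the function name `equivalence_test` and the two margins tried (0.6 and 0.5) are hypothetical. The inputs follow the setting of Figure 1 (observed difference 0.2, pooled standard deviation 1, 50 observations per group).

```python
from math import sqrt
from statistics import NormalDist


def equivalence_test(theta_hat, se, delta, alpha=0.05):
    """Reject H0 (theta outside (-delta, delta)) if the (1 - 2*alpha) CI fits in the margin.

    Assumption: theta_hat is approximately normal with known standard error se.
    """
    z = NormalDist().inv_cdf(1 - alpha)      # about 1.645 for alpha = 0.05
    lower = theta_hat - z * se               # lower confidence bound
    upper = theta_hat + z * se               # upper confidence bound
    reject = (-delta < lower) and (upper < delta)
    return lower, upper, reject


# Setting of Figure 1: theta_hat = 0.2, pooled SD = 1, n1 = n2 = 50.
se = 1.0 * sqrt(1 / 50 + 1 / 50)             # standard error of the difference
lower, upper, reject_wide = equivalence_test(0.2, se, delta=0.6)
_, _, reject_narrow = equivalence_test(0.2, se, delta=0.5)
print(f"90% CI: [{lower:.3f}, {upper:.3f}]")
print("equivalence claimed with delta = 0.6:", reject_wide)
print("equivalence claimed with delta = 0.5:", reject_narrow)
```

Under these assumptions the 90% interval is roughly [−0.129, 0.529], so equivalence is claimed with a margin of 0.6 but not with a margin of 0.5.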
Should the equivalence margin not be specified a priori, and instead be defined based on the observed data, we have the following admittedly improper hypothesis test:
$$\tilde{H}_0: \theta \notin (-\Delta(X), \Delta(X)) \quad \text{versus} \quad \tilde{H}_1: \theta \in (-\Delta(X), \Delta(X)).$$
In this case, we may not necessarily have that $\Pr(\text{reject } \tilde{H}_0 \mid \tilde{H}_0 \text{ is true}) \le \alpha$. To better understand, let us consider the following admittedly "pathological case." Let $\Delta(X)$ be chosen, based on the observed data, to be the smallest possible value for which one can claim equivalence (known in the literature as the "LEAD" boundaries, see Meyners, 2007). This is done by setting:
$$\Delta(X) = \max\left(|\underline{\theta}(X; \alpha)|, |\overline{\theta}(X; \alpha)|\right) + \epsilon,$$
where $\epsilon$ is a small positive real number. For example, if a 90% CI for $\theta$ is [−0.2, 0.5], the "pathological" equivalence margin might be defined as (−0.51, 0.51), with $\Delta(X) = 0.5 + 0.01$.
Given the monotonic relationship between a confidence interval and an equivalence test, there is a one-to-one correspondence between $\alpha$ and $\Delta$. For any given value of $\alpha$, conditional on a fixed sample of data, there is a value for $\Delta$ for which one can reject $H_0$. Conversely, for any given value of $\Delta$, there is a value of $\alpha$ for which one can reject $H_0$; see Figure 1.
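The correspondence can be made concrete numerically. The sketch below is our own illustration rather than the code behind Figure 1: it assumes a normal approximation, and the function names `smallest_alpha` and `smallest_delta` are hypothetical. For a fixed estimate and standard error, it returns the smallest α at which the null is rejected for a given ∆ (the larger of the two one-sided p-values) and, in the other direction, the boundary value of ∆ for a given α.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def smallest_alpha(theta_hat, se, delta):
    """Smallest alpha at which H0 is rejected for a given margin delta.

    Computed as the larger of the two one-sided p-values (normal approximation).
    """
    p_lower = 1 - STD_NORMAL.cdf((theta_hat + delta) / se)   # tests theta <= -delta
    p_upper = STD_NORMAL.cdf((theta_hat - delta) / se)       # tests theta >= +delta
    return max(p_lower, p_upper)


def smallest_delta(theta_hat, se, alpha):
    """Boundary margin for a given alpha.

    H0 is rejected for every margin strictly larger than this value.
    """
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return max(abs(theta_hat - z * se), abs(theta_hat + z * se))


# Setting of Figure 1: theta_hat = 0.2, pooled SD = 1, n1 = n2 = 50.
theta_hat, se = 0.2, 1.0 * sqrt(1 / 50 + 1 / 50)
for delta in (0.2, 0.4, 0.6, 0.8):
    alpha_min = smallest_alpha(theta_hat, se, delta)
    print(f"delta = {delta:.1f} -> smallest alpha = {alpha_min:.4f}")
```

The required α shrinks as ∆ grows, and when ∆ equals the observed difference of 0.2 the required α is 0.5, in line with the dashed lines described in the caption of Figure 1.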
In our pathological case, we have that $\Pr(\text{reject } \tilde{H}_0) = 1$, i.e., we will always claim equivalence. In this situation, the margin is entirely "data-dependent." In other words, the data (as summarized by the confidence interval) and the margin are perfectly correlated. We write $\text{cor}(f(X), \Delta) = 1$, where $f(X) = \max(|\underline{\theta}(X; \alpha)|, |\overline{\theta}(X; \alpha)|)$. Figure 2 displays the relationship between type I error and $\text{cor}(f(X), \Delta)$; see details in the Appendix. In the pathological case, since $\Pr(\text{reject } \tilde{H}_0) = 1$, we also have that $\Pr(\text{reject } \tilde{H}_0 \mid \tilde{H}_0) = 1$. As such, we have $\Pr(\text{reject } \tilde{H}_0 \mid \tilde{H}_0) > \alpha$, and therefore, the "pseudo-type I error" is not controlled. When there is less correlation, i.e. when the margin is not entirely data-dependent, we can expect to see less type I error inflation. In order for the test to be valid, the key is independence between the margin and the data. In the case when the data and the margin are entirely independent, the type I error rate will be at most equal to $\alpha$, as desired.
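The effect of a partly data-dependent margin can be explored with a small simulation. The sketch below is not the simulation described in the Appendix; it is a simplified illustration under our own assumptions. The estimate is drawn directly from a normal distribution with known standard error, the true effect is placed on the boundary of a fixed margin `delta0`, and with probability `p_lead` the analyst swaps in the LEAD-style margin chosen after seeing the data. The names `claim_rate`, `delta0` and `p_lead` are hypothetical.

```python
import random
from statistics import NormalDist


def claim_rate(theta, se, delta0, p_lead, alpha=0.05, eps=0.01,
               n_sim=20000, seed=1):
    """Share of simulated studies that claim equivalence.

    With probability p_lead the margin is the data-dependent LEAD-style
    margin; otherwise it is the fixed, data-independent margin delta0.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)
    claims = 0
    for _ in range(n_sim):
        theta_hat = rng.gauss(theta, se)                 # simulated estimate
        lower, upper = theta_hat - z * se, theta_hat + z * se
        if rng.random() < p_lead:
            delta = max(abs(lower), abs(upper)) + eps    # chosen after seeing the data
        else:
            delta = delta0                               # chosen independently of the data
        if -delta < lower and upper < delta:
            claims += 1
    return claims / n_sim


# Hypothetical setting: true effect on the boundary of the independent margin.
for p_lead in (0.0, 0.25, 0.5, 0.75, 1.0):
    rate = claim_rate(theta=0.5, se=0.2, delta0=0.5, p_lead=p_lead)
    print(f"p_lead = {p_lead:.2f} -> equivalence claimed in {rate:.3f} of studies")
```

When the margin is always independent of the data (`p_lead = 0`), equivalence is claimed in roughly α of the simulated studies; when it is always data-dependent (`p_lead = 1`), equivalence is claimed every time. Intermediate values fall in between, roughly α + p_lead(1 − α).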

A somewhat less pathological case
Figure 2. In order for the test to be valid, the key is independence between the margin and the data. The relationship between type I error and the correlation between the margin and the data. The correlation measure, $\text{cor}(f(X), \Delta)$, is obtained by varying the probability of setting $\Delta(X)$ equal to the LEAD margin vs. setting $\Delta(X)$ equal to a value entirely independent of the data. The curve is the result of repeated simulations of two-sample data; see details in Appendix.
Now let us consider a somewhat less pathological situation. The CPMP published an advisory report, "Points to consider on switching between superiority and noninferiority" (Committee for Proprietary Medicinal Products (CPMP), 2001), in which they describe another hypothetical situation where the margin is determined after the data are observed:
"Let us suppose that a bioequivalence trial finds a 90% confidence interval for the relative bioavailability of a new formulation that ranges from 0.90 to 1.15. Can we only conclude that the relative bioavailability lies between the conventional limits of 0.80 and 1.25 because these were the predefined equivalence margins? Or can we conclude that it lies between 0.90 and 1.15?
The narrower interval based on the actual data is the appropriate one to accept. Hence, if the regulatory requirement changed to +/-15%, this study would have produced satisfactory results. There is no question here of a data-derived selection process.
However, if the trial had resulted in a confidence interval ranging from 0.75 to 1.20, then a post hoc change of equivalence margins to +/-25% would not be acceptable because of the obvious conclusion that the equivalence margin was chosen to fit the data."
According to this recommendation, it seems that we are free to shrink a pre-specified margin as needed, without any scrutiny. However, we should never widen the pre-specified margin, even when that is what would be needed to claim equivalence. If this is the case, it would suggest that a prudent strategy would be to always pre-specify the largest possible margin before collecting data, and then shrink the margin as required. This may strike some as opportunistic and potentially problematic.
Ng, 2003 studies a similar hypothetical situation in which a large, possibly infinite number of margins are all pre-specified and all the corresponding hypotheses are tested (without any Bonferroni-type adjustment for multiple comparisons). Equivalence is then claimed using the narrowest of all potential pre-specified margins for which equivalence is statistically significant. Ng, 2003 explains why this hypothetical strategy may be problematic: "Although there is no inflation of the type I error rate [due to the fact that all hypotheses are nested], simultaneous testing of many nested null hypotheses is problematic in a confirmatory trial because the probability of confirming the finding of such testing in a second trial would approach 0.5 as the number of nested null hypotheses approaches infinity."
To better understand the concern raised by Ng, 2003, consider a similar setup where, for a standard null hypothesis significance test, a large, possibly infinite number of pre-specified α-levels (allowable type I error rates) are defined. The null is then rejected using the smallest of all potential pre-specified α values. Under this procedure, the probability of confirming a statistically significant finding in a second trial (with identical sample size and α) approaches 0.5; see Hoenig and Heisey, 2001, who describe this (often unappreciated) property of "retrospective power." As such, it is always expected that one specifies (and justifies) a single α-level prior to observing any data; see the recent commentary of Lakens et al., 2018. (These two situations are in fact identical, due to the aforementioned one-to-one correspondence between a data-driven selection of α and a data-driven choice of ∆; see Figure 1.)
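The limiting value of 0.5 can be checked with a short simulation. The following is a sketch under our own simplifying assumptions, not a reproduction of the argument in Ng, 2003: each trial's estimate is drawn from a normal distribution with a known, common standard error, and the "narrowest significant margin" from the first trial is taken to be the largest absolute confidence bound (the limit of infinitely many nested margins). The function name `replication_rate` is hypothetical.

```python
import random
from statistics import NormalDist


def replication_rate(theta, se, alpha=0.05, n_sim=20000, seed=2):
    """Share of second trials that confirm equivalence at the margin picked in trial 1."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)

    def ci_extent(theta_hat):
        # Largest absolute bound of the (1 - 2*alpha) confidence interval.
        return max(abs(theta_hat - z * se), abs(theta_hat + z * se))

    confirmed = 0
    for _ in range(n_sim):
        # Trial 1: narrowest margin at which equivalence is (just) significant.
        delta_star = ci_extent(rng.gauss(theta, se))
        # Trial 2: same design, tested against the margin selected in trial 1.
        # Its CI fits inside (-delta_star, delta_star) iff its extent is smaller.
        if ci_extent(rng.gauss(theta, se)) < delta_star:
            confirmed += 1
    return confirmed / n_sim


for theta in (0.0, 0.3, 1.0):
    print(f"theta = {theta:.1f} -> confirmed in {replication_rate(theta, se=0.2):.3f} of second trials")
```

Because the two trials are exchangeable in this setup, the second trial's interval is narrower in extent than the first's about half of the time, whatever the true effect, which is why the confirmation rate settles near 0.5.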

How hypothetical are situations like these?
While the cases described in the previous sections were purely hypothetical, similar situations do arise in practice. We consider a number of different clinical trial studies as examples, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.
First, consider cases of post-hoc judgement that often arise in the regulatory approval of drugs seeking a designation of bio-equivalence for approval. When the pre-specified margin is deemed too generous (i.e. too wide) by regulatory authorities only after the data have already been observed and analyzed, the regulator may decide that for the purposes of approval, the drug does not meet an appropriate standard for equivalence. Consider two examples:
1. The SPORTIF III and SPORTIF V randomized controlled trials (RCTs) were studies designed to investigate the potential of ximelagatran as the first oral alternative to warfarin in patients with nonvalvular atrial fibrillation to reduce the risk of thromboembolic complications. The primary end point in each study was the incidence of all strokes and systemic embolic events, and the primary objective was to establish the non-inferiority of ximelagatran relative to warfarin with a pre-specified margin of an absolute 2% difference in the event rate; see Halperin, 2003. Both studies met the primary objectives of non-inferiority with the pre-specified margin. As such, upon completion, the studies were heralded as a "major breakthrough" (Albers et al., 2005; Kulbertus, 2003). However, upon regulatory review by the FDA Cardiovascular and Renal Drugs Advisory Committee (CRDAC), the pre-specified margin was judged to be "too generous" (Boudes, 2006). This post-hoc criticism of the "unreasonably generous" (Kaul et al., 2005) margin, along with concerns about potential liver toxicity, led to a unanimous decision by the CRDAC to conclude that the benefit of ximelagatran did not outweigh the risk. The FDA then refused to grant approval of ximelagatran for any of the proposed indications; see Head et al., 2012 and Boudes, 2006, who provide a detailed timeline and description of the approval process.
2. The EVEREST II study was an RCT designed to evaluate percutaneous mitral valve repair relative to mitral valve surgery (Mauri et al., 2010). The primary efficacy end point was defined as the proportion of patients free from death, from surgery for valve dysfunction, and from moderate-severe (3+) or severe (4+) mitral regurgitation at 12 months. Upon completion, researchers claimed success when the primary non-inferiority objective was achieved. However, the conclusion of non-inferiority was "difficult to accept due to unduly wide margins" (Head et al., 2012). Thus, the FDA determined that despite the significant p-value, "non-inferiority is not implied due to the large margin" and therefore the data "did not demonstrate an appropriate benefit-risk profile when compared to standard mitral valve surgery and were inadequate to support approval" (FDA, 2013).
In other instances, the complete opposite has occurred. Despite the fact that the researchers fail to pre-specify a specific margin prior to observing the data, the regulatory agency will still accept a claim of equivalence/non-inferiority on the basis that, given some non-controversial post-hoc margin, there is sufficient evidence. Consider two examples:
1. The goal of MannKind's "Study 103" was to evaluate the inhaled insulin Afrezza for the treatment of diabetes mellitus in adults. Subjects were randomized to 12 weeks of continued treatment in one of three treatment arms. The pre-specified primary objective was to show superiority of the Afrezza TI+metformin arm relative to the secretagogue+metformin arm, with respect to change in HbA1c at 12 weeks. Upon completion, the superiority objective was not achieved and a non-inferiority margin had not been pre-specified by the researchers. However, the regulators were able to accept a claim of non-inferiority. The FDA clinical review states: "The sponsor did not specify a non-inferiority margin. However, the FDA statistical reviewer noted that Afrezza TI+metformin was non-inferior to secretagogue+metformin when the standard margin of 0.4% for insulins is used (the upper bound of the 95% confidence interval for the treatment difference in HbA1c is 0.3%)," (Yanoff, 2014).
2. The ALLY-3 trial was a one-arm phase 3 trial with the goal of evaluating the safety and efficacy of oral daclatasvir for chronic HCV genotype 3 infection (McCormack, 2015). There was no active or placebo control and as such it was impossible to conduct a non-inferiority or equivalence test based only on the trial data. As such the FDA looked to other trials to determine estimates for the effectiveness of competitor treatments. In addition, as noted by the Oregon Health Authority, "  (Herink, 2016). In this case, the FDA reviewers "clinically justified" their choice of a post-specified non-inferiority margin based on historical data; see Struble, 2015.
These studies illustrate the fact that, in some fields, there may be well-established "standard" margins or sufficient "historical data." Such standards no doubt make post-specification less controversial for regulatory agencies.
When it comes to peer-reviewed journals, researchers will often note that, while an equivalence margin was not pre-specified, a conclusion of equivalence can still be (cautiously) accepted. We consider two examples. In the first case, the margin was not pre-defined, yet claims of equivalence were nevertheless put forward. In the second case, while a margin was pre-defined, additional conclusions were made based on post-specified margins.
1. Chang et al., 2008 published the results of an RCT with the goal of evaluating a 5- versus 3-day course of oral corticosteroids (CS) for nonhospitalised children with asthma exacerbations. The primary outcome was 2-week morbidity of children. The study did not show a statistically significant difference between the two treatment arms. In the interpretation of the results, Chang et al. (2008) note that: "It would have been ideal to define a non-inferiority or equivalence margin a priori on the basis of a minimally important effect or historical controls. Our study was designed as a superiority trial, and we did not define a noninferiority margin a priori. Nevertheless, for the primary outcome measure, the chosen symptom score cut-off of 0.20 (i.e., chosen minimally important difference), the study shows equivalence." As such, the researchers concluded that the 3-day and 5-day treatment courses were "equally efficacious" in reducing the symptoms of asthma (A. Chang et al., 2007).
2. Jones et al., 2016 studied the efficacy of isoflurane relative to sevoflurane in cardiac surgery. When interpreting the results, the authors note that: "our choice of non-inferiority margin may seem to be overly generous; however, it is important to emphasize that, if the margin had been reduced to as low as 1.5%, the conclusions of this trial would not have changed," (Jones et al., 2016).
If, following a study's publication, other researchers take issue with how the study's equivalence margin was justified, they will often respond in a letter to the journal. The post-hoc debate between Groenewoud et al., 2017 and Gupta et al., 2016 about the appropriateness of the pre-specified non-inferiority margin defined in Groenewoud et al., 2016's study on methods for embryo transfer is an excellent example of this. In the end, readers are left to judge for themselves.

Conclusion
Researchers advocate that equivalence testing has great potential to "facilitate theory falsification" (Quintana, 2018). By clearly distinguishing between what is "evidence of absence" versus what is an "absence of evidence," equivalence testing may facilitate the long "series of searching questions" necessary to evaluate a "failed outcome" (Pocock and Stone, 2016). As a result, it may encourage greater publication of null results, which is desperately needed (Fanelli, 2011). Yet, outside of health research, guidelines on how best to define and interpret margins are lacking. We hope that the question posed in the title of this article will motivate researchers to further consider the delicate issues involved.
In clinical trials research, expectations that a margin be pre-specified have been well established for quite some time (Piaggio et al., 2006). This is not the case in other disciplines. In psychology research and in the social sciences, discussions of how best to execute equivalence tests are underway and appropriate recommendations are crucially needed.
One might argue that the pathological case of equivalence testing we considered does not actually qualify as testing per se, and is instead, simply a tool for describing the data. This is the opinion of Meyners, 2007, who concludes that, as a descriptor of the data, the "LEAD boundaries", (−∆(X), ∆(X)), provide "useful information" and in some cases are "even more important than confidence intervals" for reporting results.
At the end of the day, everyone must arrive at their own conclusions as to whether or not a sufficient standard of evidence for equivalence has been demonstrated. Obviously this is often easier said than done. As one final example from clinical trials, we turn to the infamous debate over using bevacizumab (Avastin) as a treatment for age-related macular degeneration. A noninferiority study was conducted to investigate (Group, 2011). However, some considered the pre-specified non-inferiority margin of 5 letters (on the ETDRS visual acuity chart) as "generous" even before the results of the trial were announced (Hirschler, 2011). This suggests that, regardless of the results, some would have remained skeptical of any claim of non-inferiority with the 5-letter margin. In stark contrast, the standard of evidence for many healthcare providers was much weaker. Indeed, many doctors determined that the use of bevacizumab (Avastin) as a substitute for ranibizumab (Lucentis) was justified (particularly given the "too big to ignore" price difference) even before the completion of the non-inferiority trial and were comfortable treating large numbers of patients with Avastin "off-label" (Steinbrook, 2006). In this situation, financial incentives clearly played a competing role with statistical considerations of clinical efficacy in what was to be considered "equivalent."
While the use of equivalence testing should be encouraged, caution is warranted. In a review of equivalence and non-inferiority clinical trials, Le Henanff et al., 2006 find that often studies "reported margins [that] were so large that they were clearly unconvincing." Indeed, as Gøtzsche, 2006 concludes: "clinicians should especially bear in mind that noninferiority margins are often far too large to be clinically meaningful and that a claim of equivalence may also be misleading if a trial has not been conducted to an appropriately high standard."
We conclude with the following general recommendations:
• If the parameter of interest is not measured in units that are interpretable, one should consider standardized effect sizes. Campbell, 2020 notes that: "equivalence tests for standardized effects may help researchers in situations when what is 'negligible' is particularly difficult to determine." For instance, if the outcome of interest is a depression scale, the clinical relevance of a certain x point improvement may not be intuitively meaningful. It may be difficult to define what number of points can be considered "negligible." However, since a Cohen's d = 0.2 is widely interpreted to be a "small" sized effect (Cohen, 1977; Fritz et al., 2012), one could conclude, based on an equivalence test which rejects the null with ∆ = d = 0.2, that any effect, if it exists, is at most small.
• The validity of an equivalence test does not depend on the margin being pre-specified. Rather, the necessary requirement for a valid test is that the margin is completely independent of the data. In one of our biomedical examples (Afrezza TI + metformin), we described a situation where the researchers had not specified a margin but the FDA adopted a "standard margin of 0.4%." While there are no comparable independent agencies to regulate psychology research, peer-reviewed journals do possess substantial leverage and would be wise to consider adopting a set of "default margins" (based on standardized effect sizes). While "default equivalence margins" may not be appropriate for all studies, their use would be similar to that of "default priors" for Bayesian inference (Rouder et al., 2012) and offer a potential for more objective analyses.
• Simply because a margin has been pre-specified (and is therefore guaranteed to be independent of the data), it is not necessarily an appropriate choice. Regardless of whether the margin is pre-specified, or defined post-hoc, we must acknowledge that a claim of "noninferiority [or equivalence] is almost certain with lenient noninferiority margins" (Flacco et al., 2016). One should always critically consider the practical implications of the given margin.
• If one is to suggest equivalence based on a post-hoc margin, one must, at the very least, be forthcoming and honest about the potential for bias. In such cases, every effort should be made to justify the appropriateness of the post-specified margin based on factors entirely independent of the observed data.
• In the absence of a pre-specified margin, one can always resort to simply reporting the associated confidence interval. If the confidence interval contains the null and is "narrow enough," the absence of an effect can be deemed likely. This tactic lacks the formalism of equivalence testing, yet avoids the difficulties of interpretation and justification with a post-hoc margin.
• Deliberate or not, questionable research practices cause major harm to the credibility of psychology research (Sijtsma, 2016). With this in mind, researchers, given their incentive to publish (Nosek et al., 2012), are not in the best position to define their own margins. This is true whenever the margin is pre-specified, and especially true when a margin is suggested post-hoc. As such, in order to avoid any potential scrutiny, researchers would be wise to seek an independent party, void of any potential biases, to define an appropriate margin. This is already common practice in clinical trial research, where sponsors have undeniable incentives to further drug development and the FDA and other regulators will (ideally) set a clear guidance for an acceptable margin. In other fields, such as psychology, the suggestion that an equivalence margin be defined/scrutinized by an independent party has recently been considered within the framework of a proposed publication policy. In the conditional equivalence testing (CET) publication policy, the independent journal editor/reviewers are tasked with critically evaluating a given margin prior to the start of a study (Campbell and Gustafson, 2018).
Please contact H. Campbell at harlan.campbell@stat.ubc.ca with any inquiries.

Conflict of Interest and Funding
We have no conflicts of interest to declare. The research was supported by NSERC Discovery Grant RGPIN-2019-03957.

Author Contributions
H. Campbell and P. Gustafson both contributed to the concept and writing of this article. H. Campbell drafted the original manuscript, and P. Gustafson provided critical revisions. Both authors approved the final version of the manuscript for submission.

Open Science Statement
This article earned the Open Materials badge for making the materials available. It was not pre-registered and had no collected data to share. It has been verified that the analysis reproduced the results presented in the article. The entire editorial process, including the open reviews, is published in the online supplement.