Differences of Type I error rates for ANOVA and Multilevel-Linear-Models using SAS and SPSS for repeated measures designs

To derive recommendations on how to analyze longitudinal data, we examined Type I error rates of multilevel linear models (MLM) and repeated measures analysis of variance (rANOVA) using SAS and SPSS. We performed a simulation with the following specifications: To explore the effects of high numbers of measurement occasions and small sample sizes on Type I error, measurement occasions of m = 9 and 12 were investigated as well as sample sizes of n = 15, 20, 25, and 30. Effects of non-sphericity in the population on Type I error were also inspected: 5,000 random samples were drawn from two populations containing neither a within-subject nor a between-group effect. They were analyzed with the most common options for correcting rANOVA and MLM results: the Huynh-Feldt correction for rANOVA (rANOVA-HF) and the Kenward-Roger correction for MLM (MLM-KR), which could help to correct the progressive bias of MLM with an unstructured covariance matrix (MLM-UN). Moreover, uncorrected rANOVA and MLM assuming a compound symmetry covariance structure (MLM-CS) were also taken into account. The results showed a progressive bias for MLM-UN in small samples, which was stronger in SPSS than in SAS. Moreover, an appropriate bias correction of the Type I error via rANOVA-HF and an insufficient correction by MLM-KR for n < 30 were found. These findings suggest MLM-CS or rANOVA if sphericity holds, and a correction of a violation via rANOVA-HF. If an analysis requires MLM, SPSS yields more accurate Type I error rates for MLM-CS and SAS yields more accurate Type I error rates for MLM-UN.

TYPE I ERROR CORRECTIONS FOR REPEATED MEASURES

Differences of Type I error rates for ANOVA and Multilevel-Linear-Models using SAS and SPSS for repeated measures designs

In times of a replication crisis that is yet to be overcome, we feel a need to improve methodological standards in order to regain the credibility of scientific knowledge. It is therefore important to generate clearly formulated "best practice" recommendations when there are multiple competing methodological approaches to the same issue. Progress in psychology requires researchers to understand and investigate their methodological tools in order to know their strengths and weaknesses as well as the circumstances under which they should or should not be used.
In this study, we will therefore focus on two popular methods to analyze dependent means as they occur, for example, in longitudinal data. It is important to examine whether a mean change over time is of statistical relevance or not. In recent longitudinal research, a trend to use multilevel linear models (MLM) instead of repeated measures analysis of variance (rANOVA) can be identified (Arnau, Balluerka, Bono, & Gorostiaga, 2010; Arnau, Bono, & Vallejo, 2009; Goedert, Boston, & Barrett, 2013; Gueorguieva & Krystal, 2004); even appeals to researchers in favor of MLM over rANOVA are made (Boisgontier & Cheval, 2016). Despite the high popularity of MLM, the terminology is not free of ambiguity. We follow the definition of Tabachnick and Fidell (2013) in using the term "MLM" to denote models with the following characteristics: regression models based on at least two data levels, where the levels are typically specified by measurement occasions nested within individuals; models containing covariance pattern models; and fixed as well as random effects. Although Tabachnick and Fidell (2013, p. 788) indicate that "MLM" is used for "a highly complex set of techniques", they mention the presence of at least two data levels first, giving the impression that this is the most important aspect of these techniques. As we noticed massive differences in Type I error rates for different approaches before (Haverkamp & Beauducel, 2017), we will furthermore focus on the Type I error corrections that are offered by the respective method. Moreover, the large Type I errors that we noticed before could trigger publications of results that cannot be replicated or reproduced. This is why we consider the focus on Type I error rates to be of special importance for the current debate on the reproducibility of results.
If the features of MLM and rANOVA are compared, three main advantages of MLM become apparent: First, MLM permits modeling data that are structured in levels. If there are reasons to suppose two or more nested data levels, MLM is applicable. In the case of one level of measurement occasions plus one level of individuals, rANOVA is also adequate.
However, if the structure is more complex, comprising several levels, rANOVA will always be less appropriate than MLM (Baayen, Davidson, & Bates, 2008). Second, several randomly distributed missing values can emerge in repeated measures designs containing a large number of measurement occasions. Even then, MLM is robust, because there is no requirement for complete data over occasions, as individual parameters (e.g., slope parameters) are estimated. A third comparative advantage over rANOVA is the potential to draw comparisons between MLMs with differing assumptions about the covariance structure inherent in the data (Baek & Ferron, 2013). For example, MLMs with compound symmetry (CS), with uncorrelated structure, or with auto-regressive covariance structure are feasible. If no particular preconceptions or assumptions on the covariances can be formulated a priori, MLM with an unstructured covariance matrix (UN) can be regarded as the most common choice for MLM (Tabachnick & Fidell, 2013). To the best of our knowledge, a comprehensive comparison of all advantages and disadvantages of MLM and rANOVA is not available. However, the reader may find a discussion of several advantages of MLM over rANOVA in Finch, Bolin, and Kelley (2014).
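To make the covariance structures named above concrete, the following sketch (in Python rather than the SAS/SPSS syntax used in this study, with illustrative parameter values) constructs the compound-symmetry and first-order autoregressive matrices for m occasions:

```python
import numpy as np

def cs_cov(m, var=1.0, rho=0.5):
    """Compound symmetry: equal variances, equal covariances for all pairs."""
    return var * ((1 - rho) * np.eye(m) + rho * np.ones((m, m)))

def ar1_cov(m, var=1.0, rho=0.5):
    """First-order autoregressive: correlation decays geometrically with lag."""
    lag = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    return var * rho ** lag

# For m = 3 and rho = .5: CS gives covariance .5 for every pair,
# AR(1) gives .5 at lag 1 but only .25 at lag 2.
print(cs_cov(3))
print(ar1_cov(3))
```

An unstructured (UN) matrix, by contrast, places no such constraints and estimates all m(m+1)/2 elements freely.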
In longitudinal research, small sample sizes occur frequently (McNeish, 2016). It is therefore of special interest how the issues related to sample size problems (e.g., incorrect Type I error rates) can adequately be addressed. In recent literature, MLM are recommended as more appropriate than rANOVA for small sample sizes if some precautions are taken: McNeish and Stapleton (2016b), among others, report that restricted maximum likelihood (REML) estimation improves small sample properties of MLM for sample sizes below 25 and even into the single digits. They give a clear recommendation against maximum likelihood (ML) if sample sizes are small because variance components are underestimated and Type I error rates are inflated (McNeish, 2016, 2017). However, as REML is seen as not completely solving these issues, the Kenward-Roger correction (Kenward & Roger, 1997) is suggested as best practice to maintain nominal Type I error rates (McNeish & Stapleton, 2016a). This correction is not yet available in SPSS but was recently included in SAS (McNeish, 2017). We therefore decided to follow these recommendations by using MLM with REML and considering the Kenward-Roger correction in our analyses of small sample properties for the different methods.
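The downward bias of ML variance estimates in small samples can be illustrated with the simplest possible case, estimating a single variance. The following Python sketch (an illustration, not part of the original study) contrasts the ML-type divisor n with the REML-type divisor n - 1 for a small sample of n = 10:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 20000                          # small sample, many replications
x = rng.normal(0.0, 1.0, size=(reps, n))     # true variance is 1
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
ml = ss / n          # ML-type estimator: biased downward, E = (n - 1) / n
reml = ss / (n - 1)  # REML-type estimator: unbiased
print(ml.mean(), reml.mean())   # roughly 0.9 versus roughly 1.0
```

The ML average falls about 10% short of the true variance at n = 10, mirroring in miniature the variance-component underestimation that inflates Type I error rates for ML in small samples.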
Another issue is the robustness of MLM and ANOVA results across different statistical software packages. So far, this has not been systematically examined. For simple tests, like t-tests or simple ANOVA models, no substantial differences between software packages are to be expected. However, for more complex statistical techniques like MLM, different explicit or implicit default settings (e.g., number of iterations, correction methods) may occur. This may also be related to the different purposes and abilities of the different software packages (Tabachnick & Fidell, 2013). As the number of options can be large and differences between the algorithms may sometimes not be made entirely transparent in the software descriptions (see Results section), we consider this a critical topic. For very simple repeated measures designs without any complex interaction or covariate, it should nevertheless be expected that different software packages provide the same results. To our knowledge, however, this has not been investigated until now, so we would like to shed some light on this topic by means of our study.
To compare the results of different MLM designs in this study, it is necessary for the respective software package to allow certain specifications of the model(s). Tabachnick and Fidell (2013) provide an overview of the abilities of the most popular packages: SPSS, SAS, HLM, MLwiN (R), and SYSTAT. For this simulation study, a few features are necessary: First, it must be possible to specify the structure of the variance-covariance matrix as unstructured or with compound symmetry. Second, probability values as well as degrees of freedom for effects have to be included in the output to allow for corrections if the sphericity assumption is violated. Following the specifications of Tabachnick and Fidell (2013), we decided to compare SAS and SPSS, as only these two software packages provide all of the required features mentioned above.
In accordance with this, a literature search shows SAS and SPSS to be among the most popular software packages. Table 1 shows the number of Google Scholar hits for a reference search for a slightly broader set of keywords ("SPSS", "SAS", "Stata", "R project", "R core team", "multilevel linear model", and "hierarchical linear model" as well as the relevant packages to perform MLM in R; see the Note of Table 1 for more details). We acknowledge that the validity of reference searches depends on the search terms and that some additional terms might also be considered relevant in the present context. For example, "mixed models", "random-effects models", and "nested data models" might also be interesting terms for a reference search.
However, we did not use "mixed models" and "random-effects models" here because conventional repeated measures ANOVA can also be described with these terms. Moreover, we did not use "nested data models" as this term could be used for several different techniques like non-linear mixed models. Thus, our keywords were chosen in order to enhance the probability that the search results are not merely specific to non-ANOVA methods but specific to multilevel/hierarchical linear models. Keeping the limitations of this reference search in mind, the results nevertheless indicate that SPSS and SAS are often used for MLM. Even if the relative number of hits might be questioned, the absolute number of hits indicates that several hundred researchers used SPSS or SAS for MLM, so that our comparison might be of interest at least for these researchers.

Note. The search was performed in Google Scholar on the 9th of September, 2018. The search strings entered were: ""SPSS" -"SAS" -"Stata" -"R Core team" -"R project" AND "multilevel linear model" OR "hierarchical linear model"" (for the SPSS search); "-"SPSS" "SAS" -Stata -"R Core team" -"R project" AND "multilevel linear model" OR "hierarchical linear model"" (for the SAS search); "-"SPSS" -"SAS" Stata -"R Core team" -"R project" AND "multilevel linear model" OR "hierarchical linear model"" (for the Stata search); "-"SPSS" -"SAS" -Stata "R Core team" OR "R project" OR "nlme" OR "lme4" OR "lmertest" OR "lme" OR "pbkrtest" AND "multilevel linear model" OR "hierarchical linear model"" (for the R search).

Previous simulation work found the Satterthwaite approximation (Satterthwaite, 1946) to be too liberal in contrast to the Kenward-Roger correction (Kenward & Roger, 1997), which delivered more robust results, but that work concentrated on split-plot designs only. On the other hand, the studies of Ferron and colleagues (2009) investigated Type I error rates for MLM in SAS as well, but were restricted to multiple-baseline data.
Paccagnella (2011), meanwhile, examined binary response 2-level model data in his study on sufficient sample sizes for accurate estimates and standard errors of estimates in MLM.
Nagelkerke, Oberski, and Vermunt (2016) delivered a detailed analysis of Type I error and power but limited themselves to multilevel latent class analysis. However, we are convinced that these specific simulation studies should be complemented by simulation studies focusing on rather simple, 'basic' models and data (Berkhof & Snijders, 2001), which are less contingent upon particular modeling options and data features. Although the coverage of simulation approaches will naturally be restricted, using basic models and population data specifications can build a background for the investigation of more specific models. Therefore, this simulation approach focuses solely on the effects of a violation of the sphericity assumption on mean Type I error rates in rANOVA models (without correction and with Huynh-Feldt correction) and MLM (based on compound symmetry as well as on an unstructured covariance matrix) for a within-subjects effect without any between-group effect.

As rANOVA is not capable of the simultaneous analysis of more than one data level, there is no point in a comparison of rANOVA and MLM for data of such complex structure. This study is therefore limited to a subset of simulated repeated measures data that allows for an analysis with rANOVA as well as MLM. Haverkamp and Beauducel (2017) also used rather basic population models and data, but their study was limited to the SPSS package, so that they could not include the options provided by SAS. The present study extends the study by Haverkamp and Beauducel (2017) in that SAS, the Kenward-Roger correction, smaller sample sizes, and a larger number of measurement occasions were investigated. The Kenward-Roger correction (Kenward & Roger, 1997) that is available in SAS but not in SPSS was considered here as it should result in a more appropriate Type I error rate for MLM based on an unstructured covariance matrix (Arnau et al., 2009; Gomez et al., 2005; McNeish & Stapleton, 2016a). As Kenward and Roger (1997, 2009) have shown that their correction works with sample sizes of about 12 cases, small sample sizes will also be investigated in the present simulation study. As violations of the sphericity condition or compound symmetry (CS) have been found to affect the Type I error rates in rANOVA and MLM, this aspect was also investigated here. It should be noted that CS is not identical but similar to the sphericity assumption of rANOVA. As the CS assumption is more restrictive than the sphericity assumption (Field, 1998), MLM with CS assumption will also satisfy the sphericity assumption. REML will be used as the estimation method for MLM because it is more suitable for small sample sizes than ML (McNeish & Stapleton, 2016b) and because it has been proven to be most accurate for random effects models, i.e., for models that do not contain any fixed between-group effects (West, Welch, & Galecki, 2007).
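The relation between compound symmetry and sphericity can be made tangible through Box's epsilon, the quantity estimated by the Greenhouse-Geisser and Huynh-Feldt corrections: epsilon equals 1 when sphericity holds and drops toward 1/(m - 1) as the violation grows. The following Python sketch (illustrative; gg_epsilon is a hypothetical helper computing the Greenhouse-Geisser value from a population covariance matrix) confirms that a CS matrix yields epsilon = 1, consistent with CS implying sphericity:

```python
import numpy as np

def gg_epsilon(S):
    """Greenhouse-Geisser (Box's) epsilon for a covariance matrix S."""
    m = S.shape[0]
    # Build m-1 orthonormal contrasts, each orthogonal to the constant vector:
    q, _ = np.linalg.qr(np.column_stack([np.ones(m), np.eye(m)[:, :m - 1]]))
    C = q[:, 1:].T                      # (m-1, m) orthonormal contrast matrix
    M = C @ S @ C.T                     # covariance of the contrast variables
    return np.trace(M) ** 2 / ((m - 1) * np.trace(M @ M))

m, rho = 9, 0.5
S_cs = (1 - rho) * np.eye(m) + rho * np.ones((m, m))   # compound symmetry
print(gg_epsilon(S_cs))                                # 1.0: sphericity holds
print(gg_epsilon(np.diag(np.arange(1.0, 10.0))))       # < 1: sphericity violated
```

In the corrected rANOVA, both degrees of freedom of the F-test are multiplied by the estimated epsilon, which is what makes the test more conservative under non-sphericity.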
To summarize, this simulation study has two major aims: First, the results of uncorrected rANOVA, rANOVA-HF, MLM-UN, and MLM-CS are compared for SAS and SPSS, as they are available in both packages. If the results show substantial differences between the software packages, this will have immediate consequences for software applications, as the software with the more correct Type I error rate should be preferred. Second, SAS offers the Kenward-Roger correction, which was developed to correct MLM-UN results for a progressive bias in Type I error (Kenward & Roger, 1997), especially for small sample sizes. Therefore, the samples were also analyzed under this condition (MLM-KR) to compare the results to those delivered by the other rANOVA and MLM specifications.
Our expectations are as follows: Normally, one would expect that statistical methods have a Type I error at the level of the a priori significance level when they are used appropriately. This implies that uncorrected ANOVA (rANOVA) and MLM-CS should have 5% of Type I errors at an alpha-level of 5% when they are used on data without violation of the sphericity assumption. However, when these methods are used with data violating the sphericity assumption, the percentage of Type I errors should be larger than 5%. We also expect that rANOVA-HF and MLM-KR result in 5% of Type I errors, even in data violating the sphericity assumption, whereas MLM-UN results in a larger percentage of Type I errors in small samples with and without violation of the sphericity assumption (Kenward & Roger, 1997; Haverkamp & Beauducel, 2017). Finally, if everything works fine, even in light of different default settings, no substantial differences between SPSS and SAS should occur for the simple repeated measures data structure that we will investigate, when identical methods (i.e., MLM-UN, MLM-CS, rANOVA, and HF) are performed.
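The first expectation, that uncorrected rANOVA holds the nominal 5% level when sphericity is satisfied, can be checked with a small Monte Carlo sketch. The following Python code (an illustration with assumed parameter values, not the study's SAS/SPSS setup) generates compound-symmetry null data and computes the one-way repeated measures F-test by hand:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
m, n, rho, reps = 9, 15, 0.5, 2000      # occasions, sample size, correlation
rejections = 0
for _ in range(reps):
    # Null data with compound symmetry (which satisfies sphericity):
    c = rng.normal(size=(n, 1))
    Y = np.sqrt(rho) * c + np.sqrt(1 - rho) * rng.normal(size=(n, m))
    grand = Y.mean()
    ss_time = n * np.sum((Y.mean(axis=0) - grand) ** 2)       # df = m - 1
    resid = Y - Y.mean(axis=0) - Y.mean(axis=1, keepdims=True) + grand
    ss_error = np.sum(resid ** 2)                             # df = (m-1)(n-1)
    F = (ss_time / (m - 1)) / (ss_error / ((m - 1) * (n - 1)))
    p = stats.f.sf(F, m - 1, (m - 1) * (n - 1))
    rejections += p < 0.05
rate = rejections / reps
print(rate)   # close to .05, i.e., within Bradley's 0.025-0.075 band
```

Because CS data satisfy sphericity, the F-test is exact here; the empirical rejection rate differs from .05 only by Monte Carlo noise.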

Material and methods
We performed the analyses of the simulated data with SAS Version 9.4 (SAS Studio 3.7) and with SPSS. Each simulated variable yti was composed of a common random variable c that is identical in all yti, whereas the random variables zti represent the specific scores that are different in each yti. The inter-correlation of the yti variables may be due to a constant variable across time or it may be due to other aspects inducing statistical dependency between the yti variables. This form of data generation for m = 9 can also be described in terms of the factor model with a pattern of population common factor loadings P and a pattern of unique factor loadings D = diag(I - PP′)^(1/2). As in Snook and Gorsuch (1989, pp. 149-150), the population matrix of correlated random variables Y can be written as

Y = cP′ + ZD     (3)

where vector c contains the common random variable and Z is a matrix of m independent random variables (an example population file for the sphericity condition and m = 9 containing the resulting yt-variables, the common variable c, and the independent random variables zt has been uploaded in SPSS-format and in ASCII-format; an SPSS-Syntax example of data generation can be found at https://osf.io/4g96f/). All tests were evaluated at the .05 alpha-level. To identify substantial bias in the results, we followed the criterion of Bradley (1978), by which a test is robust if the empirical error rate is within the range 0.025-0.075 for α = .05. A test is considered to be liberal when the empirical Type I error rate exceeds the upper limit. If the error rate is below the lower limit, the test is regarded as conservative.
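The factor-model data generation and Bradley's classification can be sketched in a few lines of Python (an illustration with an assumed common correlation of ρ = .5, not the study's actual population specification):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, rho = 9, 200_000, 0.5             # assumed common correlation rho
P = np.full(m, np.sqrt(rho))            # equal common-factor loadings
D = np.sqrt(1.0 - P ** 2)               # unique loadings: diag(I - PP')^(1/2)
c = rng.normal(size=(n, 1))             # common random variable
Z = rng.normal(size=(n, m))             # m independent unique variables
Y = c * P + Z * D                       # Equation 3: Y = cP' + ZD
R = np.corrcoef(Y, rowvar=False)
print(R[0, 1])                          # every off-diagonal entry is near rho

def bradley(rate, alpha=0.05):
    """Bradley's (1978) liberal robustness criterion."""
    if rate < 0.5 * alpha:
        return "conservative"           # below 0.025 for alpha = .05
    if rate > 1.5 * alpha:
        return "liberal"                # above 0.075 for alpha = .05
    return "robust"
```

With equal loadings sqrt(rho), every pair of yt-variables correlates rho in the population, so the resulting covariance matrix has compound symmetry and sphericity holds.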

Results
The results for nine measurement occasions under the condition of sphericity showed a progressive bias for MLM-UN and small sample sizes (Fig. 1). The Kenward-Roger correction for MLM-UN in SAS reduced the Type I error rate but did not fully solve the issues of small sample sizes, especially for n = 25 or below. The uncorrected rANOVA showed the expected Type I error rates close to five per cent when the sphericity condition holds, regardless of whether it was performed in SPSS or SAS and with or without Huynh-Feldt correction.
The results for nine measurement occasions showed higher inflation of Type I error rates for MLM-UN when sphericity was violated (Fig. 2). Again, this progressive bias turned out to be stronger for MLM-UN in SPSS than in SAS. The Kenward-Roger correction results did not differ much from the Type I error rates of this method for nine measurement occasions under the sphericity condition (cf. Fig. 1). The Type I error rates for the uncorrected rANOVA in SPSS and SAS as well as for MLM-CS in SPSS did not differ substantially and showed a moderately inflated Type I error. The Huynh-Feldt correction provided satisfying results with Type I error rates close to five per cent for both software packages, while MLM-CS showed a striking conservative bias when performed with SAS and a large difference from the results for the same method when performed in SPSS.
For twelve measurement occasions under the sphericity condition, the remaining results showed Type I error rates close to five per cent, regardless of whether they were performed in SPSS or SAS or, in the case of rANOVA, with or without Huynh-Feldt correction.
Figure 3: Average Type I error rates for 5,000 tests: twelve measurement occasions, sphericity assumption holds.
When sphericity was violated, the results for twelve measurement occasions showed a similarly high inflation of Type I error rates for MLM-UN as without violation (Fig. 4). As under the previous conditions, this progressive bias was stronger for MLM-UN in SPSS than in SAS.
Table 3 shows the results of the additional simulation. Concerning our expectations, two conclusions can be drawn: First, the trend of uncorrected MLM-UN results toward lower Type I error rates as sample size increases can be confirmed. However, even for a large sample size of n = 100, the average Type I error rates of MLM-UN still failed to meet Bradley's liberal criterion (Bradley, 1978). Second, the differences between the software packages remained (see Table 3). It is left for future research to explain these disparities between SAS and SPSS for supposedly identical methods in rather simple repeated measures data.

Discussion
As expected, we found that uncorrected ANOVA (rANOVA) and MLM-CS had 5% of Type I errors at an alpha-level of 5% when they were used on data without violation of the sphericity assumption. The expected increase of Type I error rates was also found for rANOVA and MLM-CS with data violating the sphericity assumption. Although we found the expected Type I error rate of 5% for rANOVA-HF, we found unexpectedly larger Type I error rates for MLM-KR in data violating the sphericity assumption. The larger Type I error rates for MLM-UN in small samples with and without violation of the sphericity assumption were again confirmed (Kenward & Roger, 1997; Haverkamp & Beauducel, 2017). As Kenward and Roger (1997) noted, the reason for the bias of MLM-UN is probably that the precision is obtained from an estimate of the variance-covariance matrix of its asymptotic distribution.
The results of this simulation study bear some implications for the analysis of repeated measures designs in terms of best practice recommendations. Note that these suggestions are based on very basic designs, as the simulated data contained no within-subject effect and neither a between-subjects nor a between-group effect. As pointed out before, we chose these restrictions to examine Type I error rates for within-subject models that are not distorted in any way by the influences of other effects or levels.
The following implications for simple within-subject repeated measures designs can be derived from this simulation study:
1. The use of MLM-UN to analyze data with nine or more measurement occasions with samples comprising 30 cases or less is generally not recommended without a correction method. This bias is stronger when MLM-UN is performed with SPSS.
2. When MLM-UN has to be applied, it is best to use it with the Kenward-Roger correction.
3. For nine measurement occasions, a conservative bias according to Bradley (1978) was found for MLM-CS if sphericity was violated. This effect was specific to the SAS software package, as the MLM-CS results for SPSS showed no conservative bias but Type I error rates that were on the verge of the liberal criterion.
There are, of course, several limitations to this study:
• The population data contained no within-subject effect and neither a between-subjects nor a between-group effect and no interactions. Accordingly, the model was restricted to a simple within-subjects design.
• Not all methods, particularly corrections for MLM such as Kenward-Roger, were available in both software packages. This is a limitation because we do not know how the Kenward-Roger correction would work in the context of the SPSS algorithm.
• SAS and SPSS do not provide a complete description of their algorithms and they do not provide the software scripts. Therefore, the exact reasons for the differences could not be determined. Of course, the software packages are protected by law because the people who develop the software scripts need to be paid for their work. However, when considerable differences between software packages occur even for rather simple data, this legal protection might constitute a limitation for the scientific value of the software.
Furthermore, this study yields some indications for future research:
• The examination of Type I error rates for the discussed methods should be expanded to more complex models including between-subject effects or between-within interaction effects.
• The differences in the results of very basic methods in statistical software packages have to be further explored, especially concerning MLM-UN and MLM-CS.
• The reasons for the massive Type I error inflation for MLM-UN at lower sample sizes have to be analyzed in depth. It may also be interesting to include R in further research in order to have at least one software package where all the scripts are available.
In the course of the ongoing debate about the lack of reproducibility of scientific studies, different recommendations have been developed: Benjamin et al. (2017) proposed to set the statistical standards of evidence higher by shifting the threshold for defining statistical significance for new discoveries from p < 0.05 to p < 0.005. Lakens et al. (2017), on the other hand, formulate a more general demand for justification of all key choices in research design and statistical practice to increase transparency.
We therefore see the results of this study as helpful for researchers' methodological decisions.