We are all less risky and more skillful than our fellow drivers: Successful replication and extension of Svenson (1981)

The better-than-average effect refers to the tendency to rate oneself as better than the average person on desirable traits and skills. In a classic study, Svenson (1981) asked participants to rate their driving safety and skill compared to other participants in the experiment. Results showed that the majority of participants rated themselves as far above the median, despite the statistical impossibility of more than 50% of participants being above the median. We report a preregistered, well-powered (total N = 1,203), very close replication and extension of the Svenson (1981) study. Our results indicate that the majority of participants rated their driving skill and safety as above average. We added different response scales as an extension and findings were stable across all three measures. Thus, our findings are consistent with the original findings by Svenson (1981). Materials, data, and code are available at https://osf.io/fxpwb/.

When people are asked to rate themselves on desirable traits and skills, most people rate themselves as above average. This is known as the better-than-average effect and has been demonstrated in a variety of domains. In one of the most well-known examples, Svenson (1981) asked participants to rate their safety and skill as drivers compared to other participants in the experiment. Results showed that the majority of participants rated themselves far above the median, despite the statistical impossibility of more than 50% being above the median. Here, we conducted a preregistered, very close replication and extension of the Svenson study to examine the replicability of the original finding.

The better-than-average effect
The better-than-average effect has been demonstrated in a variety of domains and is generally considered a manifestation of self-evaluation bias. Drivers believe that they are better drivers (Svenson, 1981; inspired by Preston and Harris, 1965), college instructors believe they are better teachers (Cross, 1977), social psychologists believe they are better researchers (Van Lange et al., 1997), couples believe they have better marriages (Rusbult et al., 2000), and undergraduates believe they have better leadership skills, athletic prowess, and ability to get along with others (Brown, 1986). People even believe that they are less biased than others, an effect known as the bias blind spot (Pronin et al., 2002). A recent meta-analysis of 124 published articles found that the better-than-average effect was large and robust across studies (Zell et al., 2020). Although it is closely related to several other biases, including unrealistic optimism (predicting that positive outcomes are more likely and negative outcomes are less likely to happen to oneself compared to others; Shepperd et al., 2013; Weinstein, 1980) and the Dunning-Kruger effect (overestimating the rank of one's performance compared to objective measures; Dunning, 2011), the better-than-average effect is unique in that it involves comparing the present self to an average other on a relatively enduring attribute or skill.
Much research has been dedicated to finding boundary conditions and explanations for the effect (for reviews, see Alicke and Govorun, 2005; Chambers and Windschitl, 2004; Moore and Healy, 2008; Sedikides and Alicke, 2012, 2019; Sedikides and Gregg, 2008; Zell et al., 2020). Yet, to the best of our knowledge, no direct replications exist of the original finding by Svenson (1981). The importance of replicability has received increasing recognition in the field of psychological science over the past few years (e.g., Asendorpf et al., 2013; Brandt et al., 2014; Camerer et al., 2018; Nosek and Errington, 2020; Nosek et al., 2021; Open Science Collaboration, 2015; Zwaan et al., 2017). Replication is considered a cornerstone of science, yet it is only recently that researchers have begun to systematically investigate the replicability of published findings. Here, we revisit the classic phenomenon to examine the replicability of the original finding with an independent replication.

Choice of study for replication
We chose the Svenson (1981) study based on two factors: absence of direct replications and impact. To the best of our knowledge, there are no published direct replications of this study thus far. The article has had significant impact on scholarly research in several areas of psychology, including social psychology and judgment and decision making. At the time of writing, there were 2,112 citations of the article in Google Scholar.

Findings in the original article
In the original study by Svenson (1981), participants were asked to rate either their skill or their safety as drivers in relation to other participants in the experiment. Data were collected in Sweden (n = 80) and in the US (n = 81) in lab experiments. The results indicated that the majority of participants regarded themselves as more skillful and less risky than the average driver in each group respectively. Among the Swedish participants, 77% ranked their safety as above average and 69% ranked their skill as above average. Among the US participants, 88% ranked their safety as above average and 93% ranked their skill as above average.

Adjustments and extensions
We had to make several adjustments to the original design. First, rather than including two different samples for the two different questions, we ran the questions together in a within-subjects design that would allow us to compare the effects of the two questions and their associated dependent variables. Second, we had to adjust the questionnaire to match the target sample: online American Amazon Mechanical Turk (MTurk) workers. We first introduced a few verification questions to ensure that workers were drivers. Because our study was conducted online, we also had to adjust the reference group. We chose to focus on the US state as the reference group. Third, we had issues with reproducing the question used to elicit rankings and so had to make adjustments. When doing so, we noticed issues with the 10 categories used for percentile ranks (e.g., the midpoint is grouped as 41-50%, and the first category includes a range of 11 percentiles compared to other categories with a range of 10). We therefore added an extension and chose to randomize the dependent variable question across three designs: (1) our best estimate of what the target article used, (2) an adjusted 11-item scale with a mid-point indicated as 50% (average), and (3) a simple 7-item Likert scale asking participants to compare to the average. We compared effects across the three designs. The use of three different response scales helps to check the robustness of the effect, as minor methodological features can influence the results of hypothesis tests (e.g., Baribault et al., 2018; Landy et al., 2020).

Method
We report all measures, conditions, data exclusions, and how we determined sample size.

Participants
A total of 1,203 American Amazon Mechanical Turk (MTurk) participants completed the study using TurkPrime.com (M age = 40.40, SD = 12.21; 641 females). A comparison of the target article sample and the replication samples is provided in Table 1. An a priori power analysis in G*Power 3.1 (exact test, two-tailed, with 95% power) indicated that 90 participants were needed to detect the smallest effect size from the original paper, Cohen's g = 0.19 (see Supplementary Materials). However, a sample size of n = 90 is smaller than the sample size in the original study (n = 161) and is based on an effect size estimate that might be larger than the true effect size. Therefore, we decided to follow suggestions from Simonsohn (2015) and aim for 2.5 times the original sample size. The data collection was combined with data collection for a different study (see Chen et al., 2021, Experiment 2) that required a much larger sample size (studies displayed in randomized order).
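The sample size reasoning above can be sketched numerically. The snippet below is only an illustration under stated assumptions, not the G*Power computation reported in the paper: it takes Cohen's g as the distance of a proportion from 0.50 (so the original Swedish skill result of 69% gives g = 0.19), uses an equal-tailed rejection region for the exact binomial test, and introduces the helper names `binom_pmf` and `exact_power` purely for illustration. G*Power's exact-test options may define the rejection region differently, so its numbers need not match this sketch exactly.

```python
from math import ceil, comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_power(n, p_alt, p_null=0.5, alpha=0.05):
    """Power of a two-tailed exact binomial test (equal-tailed region; an assumption)."""
    null_pmf = [binom_pmf(k, n, p_null) for k in range(n + 1)]

    # Upper critical value: smallest count whose upper-tail null probability is <= alpha / 2.
    tail, upper = 0.0, n + 1
    for k in range(n, -1, -1):
        if tail + null_pmf[k] > alpha / 2:
            break
        tail += null_pmf[k]
        upper = k

    # Lower critical value: largest count whose lower-tail null probability is <= alpha / 2.
    tail, lower = 0.0, -1
    for k in range(n + 1):
        if tail + null_pmf[k] > alpha / 2:
            break
        tail += null_pmf[k]
        lower = k

    # Power is the probability of landing in the rejection region under the alternative.
    return sum(binom_pmf(k, n, p_alt) for k in range(n + 1) if k <= lower or k >= upper)


# Cohen's g for the smallest original effect (69% above average, Swedish skill ratings).
cohens_g = 0.69 - 0.50

# Power at the sample size suggested by the a priori analysis.
power_at_90 = exact_power(90, p_alt=0.50 + cohens_g)

# Simonsohn (2015) target: 2.5 times the original total sample (80 + 81 = 161).
simonsohn_target = ceil(2.5 * 161)

print(f"g = {cohens_g:.2f}, power at n = 90: {power_at_90:.3f}, target n = {simonsohn_target}")
```

Setting `p_alt` equal to `p_null` returns the actual Type I error rate of the region, which is a quick sanity check that the critical values respect the nominal alpha.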
Participants first consented to participate in the study and were then asked verification questions regarding having a driver's license, year and location of license,

Procedure
Participants indicated how safe and how skilled they were as drivers (both questions included, displayed in random order). They then answered a funneling section and provided demographic information (age, gender, country of birth, family social class, English understanding of study), before being debriefed.

Measures
There were two dependent variables: driving safety and driving skill. The question about safety was phrased as follows:
We would like to know what you think about how safely you drive an automobile. All drivers are not equally safe drivers. We want you to compare your own skill to the skills of other people in your state. By definition, there is a least safe and a most safe driver. We want you to indicate your own estimated position among the people in your state. Of course, this is a difficult question because you do not know all the people in your state, much less how safely they drive. But please make the most accurate estimate you can.
The question about skill was phrased as follows:
We would like to know what you think about how skilled you are at driving an automobile. All drivers are not equally skilled drivers. We want you to compare your own skill to the skills of other people in your state. By definition, there is a least skilled and a most skilled driver. We want you to indicate your own estimated position among the people in your state. Of course, this is a difficult question because you do not know all the people in your state, much less how skilled drivers they are. But please make the most accurate estimate you can.
For each question, participants indicated their driving safety/skill compared to the average driver in their state using one of the three following response scales:

Evaluation criteria for replication
Table 2 provides a classification of the replication using criteria by LeBel et al. (2017).We summarize the replication as a "very close replication".We compare the replication effects with the original effects in the target article using criteria from LeBel et al. (2019).

Data analysis
The original article did not include any statistical tests, and the scale and design make it difficult to conduct such a test. Our best approximation of an analysis is to take the percentage of participants who selected the 50%+ categories and compare it to an expected 50% (binomial test). For the Likert scale, we conducted a one-sample t-test against a mean of 4, the scale midpoint. We examined normality in the distribution of frequencies, including parameters of skewness and kurtosis. Analysis code can be found in the supplementary materials.
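As a rough illustration of the two tests described above, the following sketch computes a two-sided exact binomial p-value against 0.50 and a one-sample t statistic against the Likert midpoint of 4. It is not the authors' analysis code (which is in the supplementary materials); the function names and all input values are hypothetical, and only the Python standard library is used, so the t-test stops at the statistic and degrees of freedom rather than a p-value.

```python
from math import comb, sqrt
from statistics import mean, stdev


def binom_two_sided_p(successes, n, p0=0.5):
    """Two-sided exact binomial p-value: total probability of outcomes
    that are no more likely than the observed count under p0."""
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def one_sample_t(ratings, mu0=4.0):
    """t statistic and degrees of freedom for a one-sample t-test against mu0."""
    n = len(ratings)
    standard_error = stdev(ratings) / sqrt(n)
    return (mean(ratings) - mu0) / standard_error, n - 1


# Hypothetical counts: 90 of 100 participants choose an above-50% category.
p_value = binom_two_sided_p(90, 100)

# Hypothetical 7-point Likert ratings compared with the midpoint of 4.
t_stat, df = one_sample_t([5, 5, 6, 6])

print(f"binomial p = {p_value:.3g}; t({df}) = {t_stat:.3f}")
```

Because the test proportion is 0.50, the null distribution is symmetric, so this "no more likely than observed" definition coincides with doubling the smaller tail.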

Replication
Descriptive statistics of all measures are presented in Table 3. Statistical tests of the hypotheses are summarized in Tables 4-5 and plotted in Figures 1-3.
The medians for the distributions of safety judgments in Table 3 fall in the interval 71-80%, for both percentile category response scales. This indicates that half of the participants believed themselves to be among the safest 30 percent of drivers. Over 90% of participants (93% for the reproduced materials and 91% for the adjusted materials with a 50% midpoint) believed themselves to be safer than the median driver. Binomial tests against a test proportion of 0.50 (two-tailed) indicated that this effect was statistically significant, ps < .001 (see Table 4). In comparison, the original study found that the medians for the distributions of safety judgments fell in the interval 81-90% for the US group and 71-80% for the Swedish group, indicating that half of the participants believed themselves to be among the safest 20 (US) or 30 (Sweden) percent of the drivers in the two groups respectively. In the original study, 88% in the US group and 77% in the Swedish group believed themselves to be safer than the median driver.
The medians for the distributions of skill judgments in Table 3 fall in the interval 71-80% (both for the reproduced and for the adjusted materials). This indicates that half of the participants believed themselves to be among the most skilled 30 percent of drivers. In total, 91% (for the reproduced materials) and 78% (for the adjusted materials) believed themselves to be more skilled than the median driver. Binomial tests against a test proportion of 0.50 (two-tailed) indicated that this effect was statistically significant, ps < .001 (see Table 4). In comparison, the original study found that the medians for the distributions of skill judgments fell in the interval 61-70% for the US group and 51-60% for the Swedish group. In the original study, 93% in the US sample and 69% in the Swedish sample believed themselves to be more skilled than the median driver.

Extensions
Figure 4 shows the effect size (Hedges's g) and 95% confidence intervals for each rating scale. For skills ratings, the CIs are overlapping in all cases, suggesting no evidence for a difference in the size of the better-than-average effect depending on the type of rating scale used. For safety ratings, the CIs for the two scales that involve percentile categories are overlapping, but the CIs for the Likert scale are slightly lower, suggesting a slightly smaller better-than-average effect when safety is rated on a Likert scale. Nevertheless, the effect is very large in all cases. For effect sizes, confidence intervals, and important study characteristics of the replication, original study, and meta-analysis by Zell et al. (2020), see Supplementary Table S7.
Figure 5 shows the mean safety and skills ratings in each state (excluding states with fewer than 5 responses). We find no obvious pattern in the effect across states. However, some states had very few observations and CIs are generally very large, which complicates interpretation. Therefore, we chose not to analyze these data further.

Exploratory analyses (not pre-registered)
A series of exploratory OLS regressions were run to investigate whether participants' gender, age, and driving experience (i.e., years since driver's license was obtained) predicted their ratings of driving safety and skill. The regressions also included item order (i.e., whether participants rated safety or skills first) and study order (i.e., whether participants completed this study or the study reported in Chen et al., 2021 first).
Note. Binomial tests comparing the percentage of participants who rated their driving safety and skill as above average to an expected 50%.
The analyses revealed that age and driving experience were associated with both safety and skills ratings, such that the rating increased with increasing age and experience (see Supplementary Tables S1-S6). In addition, there was a significant link between gender and safety ratings using the Likert scale and between gender and skills ratings using the Likert scale and the adjusted materials, indicating that women rated themselves lower. However, there was no such link in the other scales; thus, the results involving gender seem to depend on the response scale format and item content. Including item order and study order in the regression analyses did not alter the interpretation of the effects of gender, age, and driving experience. Item order and study order also had no consistent effect on participants' ratings, although completing the Svenson (1981) replication first was associated with higher safety ratings in one of the scales (the reproduced materials) and rating safety before skills was associated with lower safety ratings in another (the Likert scale; see Supplementary Tables S1-S6). Nevertheless, the regression results address the question of whether gender, age, and driving experience are associated with participants' ratings of driving safety and skill; they do not address the question of whether gender, age, and driving experience affect whether participants rate themselves above average. Because the vast majority of participants rated themselves as above average, we did not conduct such an analysis. Finally, we investigated the correlation between skills and safety ratings in the three response scales. This analysis indicated that participants' skills ratings were positively correlated with their safety ratings in all three scales (original scale: tau = .52, p < .001, n = 122; adjusted scale with 50% midpoint: tau = .48, p < .001, n = 136; Likert scale: tau = .47, p < .001, n = 121).
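For readers unfamiliar with the rank correlation reported above, here is a minimal sketch of Kendall's tau for ordinal ratings with ties. The text does not state which variant of tau was used; the tau-b variant below is an assumption chosen because category ratings produce many ties, and the function name and ratings are hypothetical.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, scaled by the
    geometric mean of the pair counts that are untied in x and untied in y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from every count
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    if untied_in_x == 0 or untied_in_y == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)


# Hypothetical skill and safety ratings from four participants.
skill = [1, 2, 2, 3]
safety = [1, 2, 3, 3]
print(f"tau-b = {kendall_tau_b(skill, safety):.2f}")
```

The pairwise loop is quadratic in the number of participants, which is fine for samples of the size reported here; a statistics package would normally be used to obtain the accompanying p-value.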

Figure 3
Proportion of participants in each percentile category of safety ratings and skills ratings, using the Likert scale.

Figure 4
Effect sizes (Hedges's g) and 95% CIs for each rating scale.

Discussion
We conducted a preregistered replication and extension of a classic phenomenon in the judgment and decision-making literature known as the better-than-average effect. The original article found that the majority of participants reported that they were safer and more skilled than the average driver (Svenson, 1981). The findings from our replication are consistent with the original findings. That is, the majority of participants rated their driving safety and skill as above the median. Results were stable across three different response scales: our best estimate of the original materials, an adjusted scale with a 50% midpoint, and a 7-item Likert scale.
Our replication adds to a larger literature investigating the replicability of published research in psychological science (e.g., Camerer et al., 2018; Open Science Collaboration, 2015). Importantly, our study design closely follows the original study by Svenson (1981) and thereby qualifies as a very close replication according to replication criteria by LeBel et al. (2017). Recently, Ziano et al. (2020) conducted a replication of another classic study on the better-than-average effect (Alicke, 1985), which indicated that college students' ratings of how characteristic a trait was of them (vs. an average student) increased with increasing desirability of the trait, and that this effect was stronger among more controllable traits. Findings from Ziano et al. (2020) were consistent with the original findings. In sum, findings from the present study are in line with the view of the better-than-average effect as a robust phenomenon.

Table 1
Differences and similarities between samples in original study and replication

Table 2
Classification of the replication, based on LeBel et al.

Table 3
Proportion of participants in each category