Wednesday, February 26, 2014

Type I error rates in test of normality by simulation

This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.

First, we run a simulation in R. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.

# type I error
alpha <- 0.05

# number of simulations
n.simulations <- 10000

# number of observations in each simulation 
n.obs <- 100

# a vector of test results <- replicate(n.simulations, 

# type I error for the whole simulation

# Store cumulative results in data frame for plotting 
sim <- data.frame(
    n.simulations = 1:n.simulations, = cumsum( / 

# plot type I error as function of the number of simulations 
plot(sim, xlab="number of simulations", 
    ylab="cumulative type I error rate")

# a line for the true error rate
abline(h=alpha, col="red")

# alternative plot using ggplot2
ggplot(sim, aes(x=n.simulations, + 
    geom_line() + 
    xlab('number of simulations') +
    ylab('cumulative type I error rate') + 
    ggtitle('Simulation of type I error in Shapiro-Wilk test') +
    geom_abline(intercept = 0.05, slope=0, col='red') +

As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.

It's elegant the whole simulation can be condensed to 60 characters:


Likewise, we now do a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.

data normal;
 length simulation 4 i 3; /* save space and time */
 do simulation = 1 to 10000;
  do i = 1 to 100;
   x = rand('normal');

proc univariate data=normal noprint ;
 by simulation;
 var x;
 output out=univariate n=n mean=mean std=std NormalTest=NormalTest probn=probn;

data univariate;
 set univariate;
 type_one_error = probn<0.05;

/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
 table type_one_error/nocum;

In my SAS simulation the type I error rate was 5.21%.

Tested with R 3.0.2 and SAS 9.3 on Windows 7.


  1. Nice simulation! In the the SAS code you said "SAS will not give the p-value." Actually, you can use the PROBN option to output the p-value. Then DATA step that creates the indicator variable becomes easier to read, and you can easily modify it to handle other significance levels:
    proc univariate...;
    output out=univariate probn=pvalue ...;
    data univariate;
    type_one_error = (pvalue < 0.05);
    For other examples of these kinds of simulations, see Chapters 4 and 5 in _Simulating Data with SAS_ (2013).

  2. Rick, thank you for correcting this mistake. SAS is powerful.

  3. I came to this nice post when looking for a solution to the following issue. If you run this same simulation but using 5000 instead of 100 normal data, then simulated error I converges to something like 0.44, not 0.5. Any idea why this is so? Thanks a lot and congrats for the blog.