This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.

First, we run a simulation in R. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.

# type I error alpha <- 0.05 # number of simulations n.simulations <- 10000 # number of observations in each simulation n.obs <- 100 # a vector of test results type.one.error <- replicate(n.simulations, shapiro.test(rnorm(n.obs))$p.value)<alpha # type I error for the whole simulation mean(type.one.error) # Store cumulative results in data frame for plotting sim <- data.frame( n.simulations = 1:n.simulations, type.one.error.rate = cumsum(type.one.error) / seq_along(type.one.error)) # plot type I error as function of the number of simulations plot(sim, xlab="number of simulations", ylab="cumulative type I error rate") # a line for the true error rate abline(h=alpha, col="red") # alternative plot using ggplot2 require(ggplot2) ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) + geom_line() + xlab('number of simulations') + ylab('cumulative type I error rate') + ggtitle('Simulation of type I error in Shapiro-Wilk test') + geom_abline(intercept = 0.05, slope=0, col='red') + theme_bw()

As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.

It's elegant the whole simulation can be condensed to 60 characters:

mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)

Likewise, we now do a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.

data normal; length simulation 4 i 3; /* save space and time */ do simulation = 1 to 10000; do i = 1 to 100; x = rand('normal'); output; end; end; run; proc univariate data=normal noprint ; by simulation; var x; output out=univariate n=n mean=mean std=std NormalTest=NormalTest probn=probn; run; data univariate; set univariate; type_one_error = probn<0.05; run; /* Summarize the type I error rates for this simulation */ proc freq data=univariate; table type_one_error/nocum; run;

In my SAS simulation the type I error rate was 5.21%.

Tested with R 3.0.2 and SAS 9.3 on Windows 7.

Nice simulation! In the the SAS code you said "SAS will not give the p-value." Actually, you can use the PROBN option to output the p-value. Then DATA step that creates the indicator variable becomes easier to read, and you can easily modify it to handle other significance levels:

ReplyDeleteproc univariate...;

output out=univariate probn=pvalue ...;

run;

data univariate;

...

type_one_error = (pvalue < 0.05);

run;

For other examples of these kinds of simulations, see Chapters 4 and 5 in _Simulating Data with SAS_ (2013).

Rick, thank you for correcting this mistake. SAS is powerful.

ReplyDeleteI came to this nice post when looking for a solution to the following issue. If you run this same simulation but using 5000 instead of 100 normal data, then simulated error I converges to something like 0.44, not 0.5. Any idea why this is so? Thanks a lot and congrats for the blog.

ReplyDelete