Wednesday, February 26, 2014

Type I error rates in tests of normality by simulation

This simulation estimates the type I error rate of the Shapiro-Wilk test of normality in R and SAS.

First, we run a simulation in R. Notice that the simulation uses replicate() rather than explicit "for" loops, which keeps the code uncluttered.

# type I error
alpha <- 0.05

# number of simulations
n.simulations <- 10000

# number of observations in each simulation 
n.obs <- 100

# a logical vector: TRUE when the test falsely rejects normality
type.one.error <- replicate(n.simulations,
    shapiro.test(rnorm(n.obs))$p.value) < alpha

# type I error for the whole simulation
mean(type.one.error)

# Store cumulative results in data frame for plotting 
sim <- data.frame(
    n.simulations = 1:n.simulations,
    type.one.error.rate = cumsum(type.one.error) / 
        seq_along(type.one.error))

# plot type I error as function of the number of simulations 
plot(sim, xlab="number of simulations", 
    ylab="cumulative type I error rate")

# a line for the true error rate
abline(h=alpha, col="red")

# alternative plot using ggplot2
library(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
    geom_line() +
    xlab('number of simulations') +
    ylab('cumulative type I error rate') +
    ggtitle('Simulation of type I error in Shapiro-Wilk test') +
    geom_hline(yintercept = alpha, col='red') +
    theme_bw()

As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.
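
For instance, wrapping the simulation in a small helper makes both parameters easy to vary. This is just a sketch, and the name sim.type.one.error is an arbitrary choice:

# rerun the whole simulation for any significance level and sample
# size (note shapiro.test() accepts between 3 and 5000 observations)
sim.type.one.error <- function(alpha = 0.05, n.obs = 100,
                               n.simulations = 10000) {
    p.values <- replicate(n.simulations,
        shapiro.test(rnorm(n.obs))$p.value)
    mean(p.values < alpha)  # proportion of false rejections
}

# for example, a stricter alpha with smaller samples
sim.type.one.error(alpha = 0.01, n.obs = 25)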

It's elegant that the whole simulation can be condensed to 60 characters:

mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)

Next, we run a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster with a BY statement.

data normal;
 length simulation 4 i 3; /* save space and time */
 do simulation = 1 to 10000;
  do i = 1 to 100;
   x = rand('normal');
   output;
  end;
 end;
run;

/* normality test on each simulated sample;
   probn= saves the p-value of the test */
proc univariate data=normal noprint;
 by simulation;
 var x;
 output out=univariate n=n mean=mean std=std
        NormalTest=NormalTest probn=probn;
run;

data univariate;
 set univariate;
 /* flag simulations where the test rejects normality at the 5% level */
 type_one_error = probn < 0.05;
run;

/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
 table type_one_error/nocum;
run;

In my SAS simulation the type I error rate was 5.21%.
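
That rate is consistent with the nominal 5% level once simulation error is taken into account. A quick check in R, assuming the rejection count is binomial:

# 521 rejections in 10000 simulations, versus an expected 500;
# the binomial standard error is sqrt(0.05 * 0.95 / 10000), about
# 0.0022, so the observed 0.0521 is less than one standard error
# above 0.05
binom.test(x = 521, n = 10000, p = 0.05)
# the large p-value confirms 5.21% is consistent with alpha = 0.05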

Tested with R 3.0.2 and SAS 9.3 on Windows 7.

3 comments:

  1. Nice simulation! In the SAS code you said "SAS will not give the p-value." Actually, you can use the PROBN option to output the p-value. Then the DATA step that creates the indicator variable becomes easier to read, and you can easily modify it to handle other significance levels:
    proc univariate...;
    output out=univariate probn=pvalue ...;
    run;
    data univariate;
    ...
    type_one_error = (pvalue < 0.05);
    run;
    For other examples of these kinds of simulations, see Chapters 4 and 5 in _Simulating Data with SAS_ (2013).

  2. Rick, thank you for correcting this mistake. SAS is powerful.

  3. I came across this nice post while looking for a solution to the following issue. If you run this same simulation with 5000 normal observations per sample instead of 100, the simulated type I error converges to something like 0.044, not 0.05. Any idea why this is so? Thanks a lot and congrats on the blog.

