Wednesday, February 26, 2014

Type I error rates in test of normality by simulation

This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.

First, we run the simulation in R. Notice that the simulation is vectorized: there are no "for" loops to clutter the code and slow the simulation.

# type I error
alpha <- 0.05

# number of simulations
n.simulations <- 10000

# number of observations in each simulation 
n.obs <- 100

# a vector of test results
type.one.error <- replicate(n.simulations, 
    shapiro.test(rnorm(n.obs))$p.value)<alpha

# type I error for the whole simulation
mean(type.one.error)

# Store cumulative results in data frame for plotting 
sim <- data.frame(
    n.simulations = 1:n.simulations,
    type.one.error.rate = cumsum(type.one.error) / 
        seq_along(type.one.error))

# plot type I error as function of the number of simulations 
plot(sim, xlab="number of simulations", 
    ylab="cumulative type I error rate")

# a line for the true error rate
abline(h=alpha, col="red")

# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) + 
    geom_line() + 
    xlab('number of simulations') +
    ylab('cumulative type I error rate') + 
    ggtitle('Simulation of type I error in Shapiro-Wilk test') +
    geom_abline(intercept = 0.05, slope=0, col='red') +
    theme_bw() 

As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.

It's elegant that the whole simulation can be condensed to 60 characters:

mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)
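To try other settings, the same one-liner can be rerun with any significance level and sample size; for example, a sketch with alpha = 0.10 and 50 observations per simulation (both values chosen arbitrarily for illustration):

```r
# same simulation, now with alpha = 0.10 and 50 observations each
# (illustrative values; the estimate should land near 0.10)
mean(replicate(10000, shapiro.test(rnorm(50))$p.value) < 0.10)
```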

Next, we run the same simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.

data normal;
 length simulation 4 i 3; /* save space and time */
 do simulation = 1 to 10000;
  do i = 1 to 100;
   x = rand('normal');
   output;
  end;
 end;
run;

proc univariate data=normal noprint ;
 by simulation;
 var x;
 output out=univariate n=n mean=mean std=std NormalTest=NormalTest probn=probn;
run;

data univariate;
 set univariate;
 type_one_error = probn<0.05;
run;

/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
 table type_one_error/nocum;
run;

In my SAS simulation the type I error rate was 5.21%.
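As a quick sanity check (a back-of-the-envelope calculation, not part of the original post): with 10,000 simulations, the Monte Carlo standard error of the estimated rate under a true 5% rate is sqrt(0.05 * 0.95 / 10000) ≈ 0.0022, so an observed 5.21% sits within about one standard error of the nominal 5%.

```r
# Monte Carlo standard error of the estimated type I error rate,
# treating each simulation as an independent Bernoulli(0.05) trial
alpha <- 0.05
n.simulations <- 10000
se <- sqrt(alpha * (1 - alpha) / n.simulations)
se                      # about 0.0022
(0.0521 - alpha) / se   # about 0.96 standard errors from nominal
```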

Tested with R 3.0.2 and SAS 9.3 on Windows 7.

3 comments:

  1. Nice simulation! In the SAS code you said "SAS will not give the p-value." Actually, you can use the PROBN option to output the p-value. Then the DATA step that creates the indicator variable becomes easier to read, and you can easily modify it to handle other significance levels:
    proc univariate...;
    output out=univariate probn=pvalue ...;
    run;
    data univariate;
    ...
    type_one_error = (pvalue < 0.05);
    run;
    For other examples of these kinds of simulations, see Chapters 4 and 5 in _Simulating Data with SAS_ (2013).

    ReplyDelete
  2. Rick, thank you for correcting this mistake. SAS is powerful.

    ReplyDelete
  3. I came to this nice post when looking for a solution to the following issue. If you run this same simulation but with 5000 instead of 100 normal observations, the simulated type I error rate converges to something like 0.044, not 0.05. Any idea why this is so? Thanks a lot and congrats for the blog.

    ReplyDelete
