<h1>Type I error rates in two-sample t-test by simulation</h1>
<p><em>2018-01-28</em></p>
<p>What do you do when analyzing data is fun, but you don't have any new data? You make it up.</p>
<p>This simulation tests the type I error rate of the two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminds the reader not to take the result of any single hypothesis test as gospel truth: there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.</p>
<p>A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact the null hypothesis is true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If the hypothesis tests in this simulation made no errors, we would never reject the null hypothesis, but by design we expect to reject it at a rate given by <i>alpha</i>, the significance level. The de facto standard for alpha is 0.05.</p>
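<p>To make this concrete, here is a minimal sketch of a single null test in R: both samples are drawn from the same normal distribution, so the null hypothesis is true and any rejection is a false positive. The seed is arbitrary and only for reproducibility.</p>
<pre class="prettyprint lang-r">
set.seed(1) # arbitrary seed, for reproducibility only
x <- rnorm(100)
y <- rnorm(100)
# both samples come from N(0,1), so rejecting is a type I error
t.test(x, y, var.equal=TRUE)$p.value < 0.05
</pre>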
<h2>R</h2>
<p>First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.</p>
<pre class="prettyprint lang-r">
# type I error
alpha.p <- 0.05
# number of simulations
n.simulations <- 1000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error <- replicate(n.simulations,
  t.test(rnorm(n.obs), rnorm(n.obs), var.equal=TRUE)$p.value) < alpha.p
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in a data frame for plotting
sim <- data.frame(
  n.simulations = 1:n.simulations,
  type.one.error.rate = cumsum(type.one.error) / seq_along(type.one.error))
# plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
  geom_line() +
  xlab('Number of simulations') +
  ylab('Cumulative type I error rate') +
  ggtitle('Simulation of type I error in t-test') +
  geom_abline(intercept = alpha.p, slope=0, col='red') +
  theme_bw()
</pre>
<div class="separator" style="clear: both; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrEh_hiU1LlPja3-mAisRhfUsnmdkMI4S9tEaMxIU5sdAngvMsv9U4v5RjdL7cgYFL-EdqXvMf9jYrchzcX9jyGnZRAuSGCGDS8Xz1moUFz3YcJo0DgiNAFQ93f-TmlC5u8uV8sDW-QYIR/s1600/t_test_2sample_r.png" data-original-width="550" data-original-height="550" /></div>
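<p>The essence of the simulation condenses to a single line, handy as a quick check (the same logic as the code above, without the plotting):</p>
<pre class="prettyprint lang-r">
mean(replicate(1000, t.test(rnorm(100), rnorm(100), var.equal=TRUE)$p.value) < 0.05)
</pre>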
<h2>SAS</h2>
<p>Likewise, here is the equivalent code in SAS. Notice the simulation is not implemented as a slow SAS macro loop; instead, it uses the BY statement in PROC TTEST to run all 1000 tests in one call.</p>
<pre class="prettyprint lang-sas">
/*
Create a data set with 1000 simulations. Each simulation
has 100 observations in each of two groups.
*/
data normal;
  length simulation 4 i 3; /* save space and time */
  do simulation = 1 to 1000;
    do i = 1 to 100;
      group='A';
      /* The values are normally distributed */
      x = rand('normal');
      output;
      group='B';
      x = rand('normal');
      output;
    end;
  end;
run;

/*
Run the two-sample t-test once for each simulation, and output
the results to a data set called ttests.
*/
ods _all_ close;
ods output ttests=ttests;
proc ttest plots=none data=normal;
  by simulation;
  class group;
  var x;
run;

data ttests;
  set ttests;
  /* Keep only the equal-variance (pooled) rows */
  if variances='Equal';
  /* Define the error as a boolean */
  type_one_error = probt<0.05;
  /* cumulative error */
  retain cumulative_error_count;
  format cumulative_error_rate percent10.2;
  label cumulative_error_rate = 'Cumulative error rate';
  if simulation eq 1 then cumulative_error_count = 0;
  cumulative_error_count+type_one_error;
  cumulative_error_rate = cumulative_error_count / simulation;
run;

/* Summarize the type I error rates for this simulation */
ods html;
proc freq data=ttests;
  table type_one_error/nocum;
run;

/* Draw a line plot */
proc sgplot data=ttests;
  series x=simulation y=cumulative_error_rate;
  refline 0.05 / axis=y lineattrs=(color=red);
run;
</pre>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb2XD-5tfMB8unjUV_XNAmXS49v7QinRqHosp1OnTKN-6ERtHsHL6VP3P_YTVWpDMfchWf9ZiAxV4aGUJHhZ8uOJBtgwICH_-KEq4v1PzpjtmxI0pB4u80FEJHlfC5Jz79bbwmXBuSvVww/s1600/sas_type_i_two_sample_t_test.png" data-original-width="640" data-original-height="480" />
<h2>Sawtooth</h2>
<p>Did you notice the sawtooth pattern in the error rate? A false positive is relatively rare, and when it happens, there is a spike in the error rate. Then, for each subsequent simulation without a false positive, the rate decays gradually because the numerator (the error count) stays constant while the denominator (the number of simulations) grows by one.</p>
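<p>A minimal sketch of that arithmetic, using a made-up error sequence rather than real test results: a single false positive at simulation 20 spikes the cumulative rate to 1/20, which then decays as 1/21, 1/22, and so on.</p>
<pre class="prettyprint lang-r">
# hypothetical error indicators: one false positive at simulation 20
errors <- c(rep(0, 19), 1, rep(0, 40))
cum.rate <- cumsum(errors) / seq_along(errors)
plot(cum.rate, type="l") # spike at 20, then gradual decay
</pre>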
<h2>Conclusion</h2>
<p>This article was developed on Ubuntu 16.04 with R 3.4 and Windows 7 with SAS 9.4.</p>
<p>See also the article: <a href="http://heuristicandrew.blogspot.com/2014/02/type-i-error-rates-in-test-of-normality.html">Type I error rates in test of normality by simulation</a>.</p>
<h1>R: InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'</h1>
<p><em>2016-03-10</em></p>
<p>Today the two-year-old TLS security certificate for <a rel="external nofollow" href="https://cran.r-project.org">cran.r-project.org</a> expired, so suddenly in R you are getting errors running <b>install.packages</b> or <b>update.packages</b>.</p>
<p>The error looks like this:</p>
<pre class="prettyprint lang-r">
> update.packages()
--- Please select a CRAN mirror for use in this session ---
Error in download.file(url, destfile = f, quiet = TRUE) :
cannot open URL 'https://cran.r-project.org/CRAN_mirrors.csv'
In addition: Warning message:
In download.file(url, destfile = f, quiet = TRUE) :
InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'
</pre>
<p>The workaround is simple: choose another repository! For example:</p>
<pre class="prettyprint lang-r">
options("repos"="https://cran.revolutionanalytics.com/")
update.packages(ask=T)
install.packages('gbm')
</pre>
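<p>If you want the replacement repository to persist across R sessions, one option is to set it in your <b>~/.Rprofile</b> startup file (a sketch; adjust the mirror to taste):</p>
<pre class="prettyprint lang-r">
# in ~/.Rprofile, run at the start of every session
options(repos=c(CRAN="https://cran.revolutionanalytics.com/"))
</pre>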
<p>This is bad timing with the release of R 3.2.4 today. If you need to download R using your web browser, visit a mirror, such as <a href="https://cran.revolutionanalytics.com" rel="external nofollow">cran.revolutionanalytics.com</a>.</p>
<p>Tested with R 3.2.3 on Windows 7 and Windows Server 2012.</p>
<h1>List of user-installed R packages and their versions</h1>
<p><em>2015-06-09</em></p>
<p>This R command lists all the packages installed by the user (ignoring packages that come with R such as <em>base</em> and <em>foreign</em>) and the package versions.</p>
<pre class="prettyprint lang-r">
ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)
</pre>
<p>Example output</p>
<pre>
Package Version
bitops 1.0-6
BradleyTerry2 1.0-6
brew 1.0-6
brglm 0.5-9
car 2.0-25
caret 6.0-47
coin 1.0-24
colorspace 1.2-6
crayon 1.2.1
devtools 1.8.0
dichromat 2.0-0
digest 0.6.8
earth 4.4.0
evaluate 0.7
[..snip..]
</pre>
<p>Tested with R 3.2.0.</p>
<p>This is a small step towards managing package versions: for a better solution, see the <em>checkpoint</em> package. You could also use the first column to reinstall user-installed R packages after an R upgrade, as sketched below.</p>
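<p>A minimal sketch of that reinstall step; the CSV file name here is hypothetical:</p>
<pre class="prettyprint lang-r">
# before the upgrade: save the list of user-installed packages
write.csv(ip, "user-packages.csv", row.names=FALSE)
# after the upgrade: reinstall from the first column
ip <- read.csv("user-packages.csv")
install.packages(as.character(ip$Package))
</pre>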
<h1>Autocommit with ceODBC is slow</h1>
<p><em>2015-02-10</em></p>
<p>You already know that in Python it is faster to call executemany() than to call execute() repeatedly for the same number of INSERTs, because executemany() avoids rebinding the parameters. But what about the effect of autocommit on performance? While the effect is probably not specific to ceODBC, using autocommit is astonishingly slow. Here is how slow.</p>
<p>First, the Python code to run the benchmark:</p>
<pre class="prettyprint lang-python">
import ceODBC
import datetime
import time

connection_string = "driver=sql server;database=database;server=server;"
print connection_string

conn = None
cursor = None

def init_db():
    global conn
    global cursor
    conn = ceODBC.connect(connection_string)
    cursor = conn.cursor()

def table_exists():
    cursor.execute("select count(1) from information_schema.tables "
                   "where table_name='zzz_ceodbc_test'")
    return cursor.fetchone()[0] == 1

def create_table():
    print('create_table')
    create_sql = """
    CREATE TABLE zzz_ceodbc_test (
        col1 INT,
        col2 VARCHAR(50)
    )"""
    try:
        cursor.execute(create_sql)
        assert table_exists()
    except:
        import traceback
        traceback.print_exc()

# 10,000 identical rows to insert in each benchmark run
rows = [(i, 'abcd') for i in xrange(0, 10000)]

def log_speed(start_time, end_time, records):
    elapsed_seconds = end_time - start_time
    if elapsed_seconds > 0:
        records_second = int(records / elapsed_seconds)
        # make elapsed_seconds an integer to shorten the string format
        elapsed_str = str(
            datetime.timedelta(seconds=int(elapsed_seconds)))
        print("{:,} records; {} records/sec; {} elapsed".format(
            records, records_second, elapsed_str))
    else:
        print("counter: %i records " % records)

def benchmark(bulk, autocommit):
    init_db()
    global conn
    global cursor
    conn.autocommit = True
    if not table_exists():  # create the test table on first run
        create_table()
    cursor.execute('truncate table zzz_ceodbc_test')
    conn.autocommit = autocommit
    insert_sql = 'insert into zzz_ceodbc_test (col1, col2) values (?,?)'
    start_time = time.time()
    if bulk:
        # one round trip binds all the parameter sets
        cursor.executemany(insert_sql, rows)
    else:
        for row in rows:
            cursor.execute(insert_sql, row)
    conn.commit()
    end_time = time.time()
    cursor.execute("select count(1) from zzz_ceodbc_test")
    assert cursor.fetchone()[0] == len(rows)
    log_speed(start_time, end_time, len(rows))
    conn.autocommit = True
    del cursor
    del conn
    return end_time - start_time

def benchmark_repeat(bulk, autocommit, repeats=5):
    description = "%s, autocommit=%s" % (
        'bulk' if bulk else 'one at a time', autocommit)
    print '\n******* %s' % description
    results = []
    for x in xrange(0, repeats):
        results.append(benchmark(bulk, autocommit))
    print results

benchmark_repeat(True, False)   # bulk, manual commit
benchmark_repeat(True, True)    # bulk, autocommit
benchmark_repeat(False, True)   # one at a time, autocommit
</pre>
<p>And to graph the results in R:</p>
<pre class="prettyprint lang-r">
results_table <- 'group seconds
bulk_manual 0.6710000038146973
bulk_manual 0.6710000038146973
bulk_manual 0.9830000400543213
bulk_manual 0.7330000400543213
bulk_manual 0.6710000038146973
bulk_auto 8.486999988555908
bulk_auto 8.269000053405762
bulk_auto 8.980999946594238
bulk_auto 8.453999996185303
bulk_auto 8.480999946594238
one_at_a_time 24.391000032424927
one_at_a_time 23.70300006866455
one_at_a_time 71.66299986839294
one_at_a_time 23.58899998664856
one_at_a_time 37.18400001525879'
results <- read.table(textConnection(results_table), header = TRUE)
closeAllConnections()
library(ggplot2)
ggplot(results, aes(group, seconds)) + geom_boxplot()
</pre>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCBHXcV0YSMgmS_kc2Xy8GKQlB1xJKebqaRJnCIP4bpZfZsmsplh6c094lYlAvC2cgbir4tdT-_AFDmNGGIFx3V4sVYX6a4ck_boA_GS1lYgNPVg_9ohhlvl0zc-vP3DFfrDvB0tKhQFB-/s1600/ceodbc_autocommit.png" />
<p><b>Conclusion: comparing mean times, executemany() with autocommit is 76% faster than execute(), and executemany() without autocommit is 91% faster than executemany() with autocommit.</b> Also, executemany() gives more consistent performance.</p>
<p>Ran on Windows 7 Pro 64-bit, Python 2.7.9 32-bit, ceODBC 2.0.1, Microsoft SQL Server 11.0 SP1, R 3.1.2.</p>
<h1>Fibonacci sequence in R and SAS</h1>
<p><em>2014-12-05</em></p>
<p>Because the Fibonacci sequence is simply defined by recursion, it makes for an elegant programming exercise. Here is one way to do it in SAS, and another way to do it in R. I've also included unit testing code to check that it works.</p>
<p>Fibonacci sequence in SAS using a recursive macro:</p>
<pre class="prettyprint">
%macro fib(n);
  %if &n = 1 %then 1; * first seed value;
  %else %if &n = 2 %then 1; * second seed value;
  %else %eval(%fib(%eval(&n-1))+%fib(%eval(&n-2))); * use recursion;
%mend;

* show values 1-5;
%put %fib(1);
%put %fib(2);
%put %fib(3);
%put %fib(4);
%put %fib(5);

* check values 1-10;
%macro check_fib;
  %if %fib(1) ne 1 %then %abort;
  %if %fib(2) ne 1 %then %abort;
  %if %fib(3) ne 2 %then %abort;
  %if %fib(4) ne 3 %then %abort;
  %if %fib(5) ne 5 %then %abort;
  %if %fib(6) ne 8 %then %abort;
  %if %fib(7) ne 13 %then %abort;
  %if %fib(8) ne 21 %then %abort;
  %if %fib(9) ne 34 %then %abort;
  %if %fib(10) ne 55 %then %abort;
  %put NOTE: OK!;
%mend;
%check_fib;
</pre>
<p>Fibonacci sequence in R using a recursive function that supports either single integers or a vector of integers:</p>
<pre class="prettyprint language-r">
fib <- function(n)
{
  if (length(n) > 1) return(sapply(n, fib)) # accept a numeric vector
  if (n == 1) return(1)  # first seed value
  if (n == 2) return(1)  # second seed value
  return(fib(n-1) + fib(n-2)) # use recursion
}
# print first five Fibonacci numbers
fib(1)
fib(2)
fib(3)
fib(4)
fib(5)
# verify the Fibonacci sequence 1 through 10
(actual <- fib(1:10))
(expected <- c(1,1,2,3,5,8,13,21,34,55))
all.equal(actual,expected)
</pre>
<p>For alternative implementations, see <a rel="external nofollow" href="http://sas-and-r.blogspot.com/2009/06/create-fibonacci-series.html">SAS and R: Example 7.1: Create a Fibonacci sequence</a>. In SAS, Nick Horton calculates the Fibonacci sequence using a DATA step, and in R he uses a for loop.</p>
<p>Adam Rich responded with his post <a href="http://adamleerich.com/2014/12/07/fibonacci-sequence-in-r-with-memoization/">Fibonacci Sequence in R with Memoization</a> which gives a performance boost by caching the results.</p>
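<p>The idea behind memoization can be sketched in a few lines, using an environment as a cache (an illustration, not Adam's exact code):</p>
<pre class="prettyprint language-r">
memo <- new.env()
fib.memo <- function(n)
{
  key <- as.character(n)
  if (exists(key, envir=memo, inherits=FALSE))
    return(get(key, envir=memo)) # cache hit: no recursion needed
  result <- if (n <= 2) 1 else fib.memo(n-1) + fib.memo(n-2)
  assign(key, result, envir=memo)
  result
}
fib.memo(30) # fast; the plain recursive fib makes exponentially many calls
</pre>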
<p>In the comments below, Rick Wicklin referred to his <a href="http://blogs.sas.com/content/iml/2010/09/30/twitter-and-the-fibonacci-sequence/">SAS/IML solution that generates the Fibonacci sequence iteratively</a> and <a href="http://blogs.sas.com/content/iml/2010/10/05/matrices-eigenvalues-fibonacci-and-the-golden-ratio/">Matrices, eigenvalues, Fibonacci, and the golden ratio</a>.</p>
<h1>Visualizing principal components with R and Sochi Olympic Athletes</h1>
<p><em>2014-03-27</em></p>
<p>Principal Components Analysis (PCA) is used as a dimensionality reduction method. Here we explain PCA step by step, using data about Sochi Olympic curlers.</p>
<p>It is hard to visualize a high dimensional space. When I took linear algebra, the book and teachers spoke about it as if it were easy to visualize a hyperspace, but later when I took the Coursera course Neural Networks for Machine Learning, Geoffrey Hinton gave the wise advice, "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it." In other words, people cannot visualize a high dimensional space, so we use a simpler problem (two dimensions of Olympic athlete data) to explain PCA.</p>
<p>First, we have one-dimensional data, where the only dimension is the curler's height.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBTBMdCF-JjjAelKOtBBXWoCe7xHeu_aM88Jvst9ThBXh2F_spWRy88khvBf_WYbgzlAgTSAx9TzBFVNRkTC4MMYkBmPrdTxW1EP7J0ZAQb8_opXYBemjEGOt6EsGEymFVV9P4C_MiH9bK/s1600/pca1-stripchart.png" />
<p>Next, we add a second dimension: the curler's weight. Notice there is a strong correlation between height and weight. Because of this redundancy, two dimensions are not necessary to represent most of the information.</p>
<p>By the way, if you look carefully at the first two images, notice the horizontal placement of the curlers is identical: adding the second axis moves the curlers only vertically.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmFARw6Si9LcAD7mxoJQdOE2QqPU2yWwzeQV_rmwYmQYT9H8wuH6hvCl5Wp2yOvk3Be2ebTXO9nmOKa9rPy6_E7CpcFGpzv_qyjZypJyWqE2Dk1zYgbY3kWzKehJ8rR5mdwKc7Uq85W0jC/s1600/pca2-scatterplot.png" />
<p>After performing PCA, there are two principal components. Because we want to simplify two dimensions into one dimension, we ignore the second principal component and plot the data onto the first component as red squares. The black lines join each original point (green) to its projection (red) onto a one-dimensional line.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzE9BHBqlAPeDL82Kbr6E48Vr4MaKZXor6h7YXBghSVoG2Xtb64mnn9C4l8BlOqm2S17ULlFIzxN5lLWNW8nj1w6sJl870aQLb36w44NcTLm_WVuPidUn9obAtT5dU7MatazplOKQbMKtm/s1600/pca3-pca-projection.png" />
<p>The blue line illustrates the first principal component. It's on this one-dimensional line that the two-dimensional space is projected.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTn_vAr8nJ6qUGb7qzsewpgxjdKtSQXIU7mnm10jOrbSbg33y9RDPiFjLub_R47Ihntm2YRVEUW7cEc1dwyAnj9mdU3FsA3mG2Pk1Z61JRTkW-tK-NfELe2LXsZ_FAwLiAH7v2fjP5ASLu/s1600/pca4-first-component-on-scatterplot.png" />
<p>Now we can show the same projections from the previous graph on their own one-dimensional strip chart, which captures most of the variation of the two-dimensional space in one dimension.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSKBtrj4FE7yaDeHnEwOa4wwo3khxqfNlb6KTtUfDodg0X1ZRwUkjbXPY1yRGsTfc0uE_LDfNv-dq5l0yBK_Wr_whPC8OtZ1wUWoB4Q2rfacx6W3dxK4GiwMEMe2ImS4ClFNeblpVC7kYC/s1600/pca5-first-component-stripchart.png" />
<p>So in general PCA reduces the number of dimensions by projecting high dimensional data into a lower dimensional space. With higher dimensional data, it is often useful to keep more of the principal components. For graphing, two or three principal components are retained. For other purposes, the optimal number of components may be chosen using a scree plot or the minimum number of components that captures some percentage of the variation, say 90%.</p>
<p>Here is the R code.</p>
<pre class="prettyprint lang-r">
# Read data from CSV
# Download from http://www.danasilver.org/static/assets/sochi-2014-athletes/athletes.csv
# See below for a faster option.
athletes <- read.csv('athletes.csv')

# Subset data
ath <- athletes[athletes$sport=='Curling', c('height','weight')]
ath <- ath[complete.cases(ath),]

# ALTERNATIVELY, instead of downloading:
ath <- structure(list(height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93,
  1.7, 1.69, 1.84, 1.75, 1.83, 1.8, 1.8, 1.64), weight = c(66L,
  84L, 74L, 66L, 73L, 80L, 58L, 60L, 88L, 85L, 80L, 71L, 85L, 69L
  )), .Names = c("height", "weight"), row.names = c(536L, 624L,
  640L, 820L, 930L, 949L, 1191L, 1632L, 1818L, 2349L, 2583L, 2609L,
  2641L, 2696L), class = "data.frame")

# Plot 1 dimension (just height)
png('pca1-stripchart.png')
stripchart(ath$height, col="green", pch=19, cex=2,
           xlab="Height (m)",
           main="Curlers at Sochi 2014 Winter Olympics")
dev.off()

# Plot 2 dimensions
x <- as.matrix(ath)
plot2d <- function(col=3)
{
  plot(x, asp = 0, col = col, pch = 19, cex = 2,
       xlab="Height (m)",
       ylab="Weight (kg)",
       main="Curlers at Sochi 2014 Winter Olympics")
}
png('pca2-scatterplot.png')
plot2d()
dev.off()

# Perform PCA
pcX <- prcomp(x, retx = TRUE, scale = FALSE, center = TRUE)

# Project each point onto the first principal component
# (the first *column* of the rotation matrix), then shift
# back to the original center
transformed <- pcX$x[,1] %*% t(pcX$rotation[,1])
transformed <- scale(transformed, center = -pcX$center, scale = FALSE)

# Plot PCA projection
plot_pca <- function()
{
  plot2d()
  points(transformed, col = 2, pch = 15, cex = 2)
  segments(x[,1], x[,2], transformed[,1], transformed[,2])
}
png('pca3-pca-projection.png')
plot_pca()
dev.off()

# Draw first principal component over scatterplot
png('pca4-first-component-on-scatterplot.png')
plot_pca()
lm.fit <- lm(transformed[,2] ~ transformed[,1])
abline(lm.fit, col="blue", cex=1.5)
dev.off()

# Plot first principal component by itself
png('pca5-first-component-stripchart.png')
stripchart(pcX$x[,1], col="red", cex=2, pch=15,
           xlab="First principal component")
dev.off()
</pre>
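<p>To choose the number of components by the percentage of variation captured, as described above, here is a short sketch using the <b>pcX</b> object from the code:</p>
<pre class="prettyprint lang-r">
# proportion of variance explained by each component
summary(pcX)
# smallest number of components capturing at least 90% of the variance
cum.var <- cumsum(pcX$sdev^2) / sum(pcX$sdev^2)
which(cum.var >= 0.90)[1]
</pre>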
<p>This was tested on R 3.0.2 (64-bit). Thank you to Dana Silver for the <a href="http://www.danasilver.org/2014/02/25/sochi-2014-athletes/">Sochi athlete data</a> and to cbeleites for explaining <a rel="external nofollow" href="https://stats.stackexchange.com/questions/69251/r-function-prcomp-doesnt-project-my-2d-cloud-onto-the-principal-vector-as-expec">how to plot PCA projections with line segments</a>.</p>
<h1>Type I error rates in test of normality by simulation</h1>
<p><em>2014-02-26</em></p>
<p>This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.</p>
<p>First, we run a simulation in R. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.</p>
<pre class="prettyprint lang-r">
# type I error
alpha <- 0.05
# number of simulations
n.simulations <- 10000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error <- replicate(n.simulations,
  shapiro.test(rnorm(n.obs))$p.value) < alpha
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in a data frame for plotting
sim <- data.frame(
  n.simulations = 1:n.simulations,
  type.one.error.rate = cumsum(type.one.error) /
    seq_along(type.one.error))
# plot type I error as a function of the number of simulations
plot(sim, xlab="number of simulations",
     ylab="cumulative type I error rate")
# a line for the true error rate
abline(h=alpha, col="red")
# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
  geom_line() +
  xlab('number of simulations') +
  ylab('cumulative type I error rate') +
  ggtitle('Simulation of type I error in Shapiro-Wilk test') +
  geom_abline(intercept = 0.05, slope=0, col='red') +
  theme_bw()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijh6_JhqKkhYbMrqvTEchJ-NPUwITZp8oSkGv6hWTfzlh5HazJyuqrIbZLMP-YY1iJK3lhHA8UER5nPt1f2gLASD0EbF50o1qUo94onuxPPWASHjC_XpHYmfC4kBQNDn-v7H_i1WuAkmMj/s1600/shapiro-wilk2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijh6_JhqKkhYbMrqvTEchJ-NPUwITZp8oSkGv6hWTfzlh5HazJyuqrIbZLMP-YY1iJK3lhHA8UER5nPt1f2gLASD0EbF50o1qUo94onuxPPWASHjC_XpHYmfC4kBQNDn-v7H_i1WuAkmMj/s1600/shapiro-wilk2.png" /></a></div>
<p>As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.</p>
<p>It's elegant that the whole simulation can be condensed to 60 characters:</p>
<pre class="prettyprint lang-r">
mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)
</pre>
<p>Likewise, we now do a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.</p>
<pre class="prettyprint lang-sas">
data normal;
  length simulation 4 i 3; /* save space and time */
  do simulation = 1 to 10000;
    do i = 1 to 100;
      x = rand('normal');
      output;
    end;
  end;
run;

proc univariate data=normal noprint;
  by simulation;
  var x;
  output out=univariate n=n mean=mean std=std
    NormalTest=NormalTest probn=probn;
run;

data univariate;
  set univariate;
  type_one_error = probn<0.05;
run;

/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
  table type_one_error/nocum;
run;
</pre>
<p>In my SAS simulation the type I error rate was 5.21%.</p>
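<p>As a quick sanity check on that 5.21%: the Monte Carlo standard error of an estimated rate is sqrt(alpha*(1-alpha)/n), so with 10,000 simulations the estimate should typically fall within a couple tenths of a percent of the nominal 5%.</p>
<pre class="prettyprint lang-r">
# Monte Carlo standard error of the estimated type I error rate
sqrt(0.05 * 0.95 / 10000) # about 0.0022, so 5.21% is within ~1 standard error
</pre>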
<p>Tested with R 3.0.2 and SAS 9.3 on Windows 7.</p>