<h1>Type I error rates in two-sample t-test by simulation</h1>
<p><em>2018-01-28</em></p>
<p>What do you do when analyzing data is fun, but you don't have any new data? You make it up.</p>
<p>This simulation tests the type I error rate of the two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminds the reader not to take the result of any single hypothesis test as gospel truth: there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.</p>
<p>A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact the null hypothesis is true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If the hypothesis tests in this simulation made no errors, we would never reject the null hypothesis, but by design we expect to reject it at a rate given by <i>alpha</i>, the significance level. The de facto standard for alpha is 0.05.</p>
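<p>To make this concrete, here is a minimal sketch of a single null test in R: both samples are drawn from the same normal distribution, so the null hypothesis is true and any rejection is a false positive. The seed is arbitrary and only for reproducibility.</p>
<pre class="prettyprint lang-r">
set.seed(1) # arbitrary seed, for reproducibility only
x <- rnorm(100)
y <- rnorm(100)
# both samples come from N(0,1), so rejecting is a type I error
t.test(x, y, var.equal=TRUE)$p.value < 0.05
</pre>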
<h2>R</h2>
<p>First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.</p>
<pre class="prettyprint lang-r">
# type I error
alpha.p <- 0.05
# number of simulations
n.simulations <- 1000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error <- replicate(n.simulations,
  t.test(rnorm(n.obs), rnorm(n.obs), var.equal=TRUE)$p.value) < alpha.p
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in a data frame for plotting
sim <- data.frame(
  n.simulations = 1:n.simulations,
  type.one.error.rate = cumsum(type.one.error) / seq_along(type.one.error))
# plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
  geom_line() +
  xlab('Number of simulations') +
  ylab('Cumulative type I error rate') +
  ggtitle('Simulation of type I error in t-test') +
  geom_abline(intercept = alpha.p, slope=0, col='red') +
  theme_bw()
</pre>
<div class="separator" style="clear: both; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrEh_hiU1LlPja3-mAisRhfUsnmdkMI4S9tEaMxIU5sdAngvMsv9U4v5RjdL7cgYFL-EdqXvMf9jYrchzcX9jyGnZRAuSGCGDS8Xz1moUFz3YcJo0DgiNAFQ93f-TmlC5u8uV8sDW-QYIR/s1600/t_test_2sample_r.png" data-original-width="550" data-original-height="550" /></div>
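<p>The essence of the simulation condenses to a single line, handy as a quick check (the same logic as the code above, without the plotting):</p>
<pre class="prettyprint lang-r">
mean(replicate(1000, t.test(rnorm(100), rnorm(100), var.equal=TRUE)$p.value) < 0.05)
</pre>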
<h2>SAS</h2>
<p>Likewise, here is the equivalent code in SAS. Notice the simulation is not implemented as a slow SAS macro loop; instead, it uses the BY statement in PROC TTEST to run all 1000 tests in one call.</p>
<pre class="prettyprint lang-sas">
/*
Create a data set with 1000 simulations. Each simulation
has 100 observations in each of two groups.
*/
data normal;
  length simulation 4 i 3; /* save space and time */
  do simulation = 1 to 1000;
    do i = 1 to 100;
      group='A';
      /* The values are normally distributed */
      x = rand('normal');
      output;
      group='B';
      x = rand('normal');
      output;
    end;
  end;
run;

/*
Run the two-sample t-test once for each simulation, and output
the results to a data set called ttests.
*/
ods _all_ close;
ods output ttests=ttests;
proc ttest plots=none data=normal;
  by simulation;
  class group;
  var x;
run;

data ttests;
  set ttests;
  /* Keep only the equal-variance (pooled) rows */
  if variances='Equal';
  /* Define the error as a boolean */
  type_one_error = probt<0.05;
  /* cumulative error */
  retain cumulative_error_count;
  format cumulative_error_rate percent10.2;
  label cumulative_error_rate = 'Cumulative error rate';
  if simulation eq 1 then cumulative_error_count = 0;
  cumulative_error_count+type_one_error;
  cumulative_error_rate = cumulative_error_count / simulation;
run;

/* Summarize the type I error rates for this simulation */
ods html;
proc freq data=ttests;
  table type_one_error/nocum;
run;

/* Draw a line plot */
proc sgplot data=ttests;
  series x=simulation y=cumulative_error_rate;
  refline 0.05 / axis=y lineattrs=(color=red);
run;
</pre>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb2XD-5tfMB8unjUV_XNAmXS49v7QinRqHosp1OnTKN-6ERtHsHL6VP3P_YTVWpDMfchWf9ZiAxV4aGUJHhZ8uOJBtgwICH_-KEq4v1PzpjtmxI0pB4u80FEJHlfC5Jz79bbwmXBuSvVww/s1600/sas_type_i_two_sample_t_test.png" data-original-width="640" data-original-height="480" />
<h2>Sawtooth</h2>
<p>Did you notice the sawtooth pattern in the error rate? A false positive is relatively rare, and when it happens, there is a spike in the error rate. Then, for each subsequent simulation without a false positive, the rate decays gradually because the numerator (the error count) stays constant while the denominator (the number of simulations) grows by one.</p>
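<p>A minimal sketch of that arithmetic, using a made-up error sequence rather than real test results: a single false positive at simulation 20 spikes the cumulative rate to 1/20, which then decays as 1/21, 1/22, and so on.</p>
<pre class="prettyprint lang-r">
# hypothetical error indicators: one false positive at simulation 20
errors <- c(rep(0, 19), 1, rep(0, 40))
cum.rate <- cumsum(errors) / seq_along(errors)
plot(cum.rate, type="l") # spike at 20, then gradual decay
</pre>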
<h2>Conclusion</h2>
<p>This article was developed on Ubuntu 16.04 with R 3.4 and Windows 7 with SAS 9.4.</p>
<p>See also the article: <a href="http://heuristicandrew.blogspot.com/2014/02/type-i-error-rates-in-test-of-normality.html">Type I error rates in test of normality by simulation</a>.</p>
<h1>R: InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'</h1>
<p><em>2016-03-10</em></p>
<p>Today the two-year-old TLS security certificate for <a rel="external nofollow" href="https://cran.r-project.org">cran.r-project.org</a> expired, so suddenly in R you are getting errors running <b>install.packages</b> or <b>update.packages</b>.</p>
<p>The error looks like this:</p>
<pre class="prettyprint lang-r">
> update.packages()
--- Please select a CRAN mirror for use in this session ---
Error in download.file(url, destfile = f, quiet = TRUE) :
cannot open URL 'https://cran.r-project.org/CRAN_mirrors.csv'
In addition: Warning message:
In download.file(url, destfile = f, quiet = TRUE) :
InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'
</pre>
<p>The workaround is simple: choose another repository! For example:</p>
<pre class="prettyprint lang-r">
options("repos"="https://cran.revolutionanalytics.com/")
update.packages(ask=T)
install.packages('gbm')
</pre>
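<p>If you want the replacement repository to persist across R sessions, one option is to set it in your <b>~/.Rprofile</b> startup file (a sketch; adjust the mirror to taste):</p>
<pre class="prettyprint lang-r">
# in ~/.Rprofile, run at the start of every session
options(repos=c(CRAN="https://cran.revolutionanalytics.com/"))
</pre>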
<p>This is bad timing with the release of R 3.2.4 today. If you need to download R using your web browser, visit a mirror, such as <a href="https://cran.revolutionanalytics.com" rel="external nofollow">cran.revolutionanalytics.com</a>.</p>
<p>Tested with R 3.2.3 on Windows 7 and Windows Server 2012.</p>
<h1>List of user-installed R packages and their versions</h1>
<p><em>2015-06-09</em></p>
<p>This R command lists all the packages installed by the user (ignoring packages that come with R such as <em>base</em> and <em>foreign</em>) and the package versions.</p>
<pre class="prettyprint lang-r">
ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)
</pre>
<p>Example output</p>
<pre>
Package Version
bitops 1.0-6
BradleyTerry2 1.0-6
brew 1.0-6
brglm 0.5-9
car 2.0-25
caret 6.0-47
coin 1.0-24
colorspace 1.2-6
crayon 1.2.1
devtools 1.8.0
dichromat 2.0-0
digest 0.6.8
earth 4.4.0
evaluate 0.7
[..snip..]
</pre>
<p>Tested with R 3.2.0.</p>
<p>This is a small step towards managing package versions: for a better solution, see the <em>checkpoint</em> package. You could also use the first column to reinstall user-installed R packages after an R upgrade, as sketched below.</p>
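<p>A minimal sketch of that reinstall step; the CSV file name here is hypothetical:</p>
<pre class="prettyprint lang-r">
# before the upgrade: save the list of user-installed packages
write.csv(ip, "user-packages.csv", row.names=FALSE)
# after the upgrade: reinstall from the first column
ip <- read.csv("user-packages.csv")
install.packages(as.character(ip$Package))
</pre>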
<h1>Autocommit with ceODBC is slow</h1>
<p><em>2015-02-10</em></p>
<p>You already know that in Python it is faster to call executemany() than to call execute() repeatedly for the same number of INSERTs, because executemany() avoids rebinding the parameters. But what about the effect of autocommit on performance? While the effect is probably not specific to ceODBC, using autocommit is astonishingly slow. Here is how slow.</p>
<p>First, the Python code to run the benchmark:</p>
<pre class="prettyprint lang-python">
import ceODBC
import datetime
import time

connection_string = "driver=sql server;database=database;server=server;"
print connection_string

conn = None
cursor = None

def init_db():
    global conn
    global cursor
    conn = ceODBC.connect(connection_string)
    cursor = conn.cursor()

def table_exists():
    cursor.execute("select count(1) from information_schema.tables "
                   "where table_name='zzz_ceodbc_test'")
    return cursor.fetchone()[0] == 1

def create_table():
    print('create_table')
    create_sql = """
    CREATE TABLE zzz_ceodbc_test (
        col1 INT,
        col2 VARCHAR(50)
    )"""
    try:
        cursor.execute(create_sql)
        assert table_exists()
    except:
        import traceback
        traceback.print_exc()

# 10,000 identical rows to insert in each benchmark run
rows = [(i, 'abcd') for i in xrange(0, 10000)]

def log_speed(start_time, end_time, records):
    elapsed_seconds = end_time - start_time
    if elapsed_seconds > 0:
        records_second = int(records / elapsed_seconds)
        # make elapsed_seconds an integer to shorten the string format
        elapsed_str = str(
            datetime.timedelta(seconds=int(elapsed_seconds)))
        print("{:,} records; {} records/sec; {} elapsed".format(
            records, records_second, elapsed_str))
    else:
        print("counter: %i records " % records)

def benchmark(bulk, autocommit):
    init_db()
    global conn
    global cursor
    conn.autocommit = True
    if not table_exists():  # create the test table on first run
        create_table()
    cursor.execute('truncate table zzz_ceodbc_test')
    conn.autocommit = autocommit
    insert_sql = 'insert into zzz_ceodbc_test (col1, col2) values (?,?)'
    start_time = time.time()
    if bulk:
        # one round trip binds all the parameter sets
        cursor.executemany(insert_sql, rows)
    else:
        for row in rows:
            cursor.execute(insert_sql, row)
    conn.commit()
    end_time = time.time()
    cursor.execute("select count(1) from zzz_ceodbc_test")
    assert cursor.fetchone()[0] == len(rows)
    log_speed(start_time, end_time, len(rows))
    conn.autocommit = True
    del cursor
    del conn
    return end_time - start_time

def benchmark_repeat(bulk, autocommit, repeats=5):
    description = "%s, autocommit=%s" % (
        'bulk' if bulk else 'one at a time', autocommit)
    print '\n******* %s' % description
    results = []
    for x in xrange(0, repeats):
        results.append(benchmark(bulk, autocommit))
    print results

benchmark_repeat(True, False)   # bulk, manual commit
benchmark_repeat(True, True)    # bulk, autocommit
benchmark_repeat(False, True)   # one at a time, autocommit
</pre>
<p>And to graph the results in R:</p>
<pre class="prettyprint lang-r">
results_table <- 'group seconds
bulk_manual 0.6710000038146973
bulk_manual 0.6710000038146973
bulk_manual 0.9830000400543213
bulk_manual 0.7330000400543213
bulk_manual 0.6710000038146973
bulk_auto 8.486999988555908
bulk_auto 8.269000053405762
bulk_auto 8.980999946594238
bulk_auto 8.453999996185303
bulk_auto 8.480999946594238
one_at_a_time 24.391000032424927
one_at_a_time 23.70300006866455
one_at_a_time 71.66299986839294
one_at_a_time 23.58899998664856
one_at_a_time 37.18400001525879'
results <- read.table(textConnection(results_table), header = TRUE)
closeAllConnections()
library(ggplot2)
ggplot(results, aes(group, seconds)) + geom_boxplot()
</pre>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCBHXcV0YSMgmS_kc2Xy8GKQlB1xJKebqaRJnCIP4bpZfZsmsplh6c094lYlAvC2cgbir4tdT-_AFDmNGGIFx3V4sVYX6a4ck_boA_GS1lYgNPVg_9ohhlvl0zc-vP3DFfrDvB0tKhQFB-/s1600/ceodbc_autocommit.png" />
<p><b>Conclusion: comparing mean times, executemany() with autocommit is 76% faster than execute(), and executemany() without autocommit is 91% faster than executemany() with autocommit.</b> Also, executemany() gives more consistent performance.</p>
<p>Ran on Windows 7 Pro 64-bit, Python 2.7.9 32-bit, ceODBC 2.0.1, Microsoft SQL Server 11.0 SP1, R 3.1.2.</p>
<h1>Fibonacci sequence in R and SAS</h1>
<p><em>2014-12-05</em></p>
<p>Because the Fibonacci sequence is simply defined by recursion, it makes for an elegant programming exercise. Here is one way to do it in SAS, and another way to do it in R. I've also included unit testing code to check that it works.</p>
<p>Fibonacci sequence in SAS using a recursive macro:</p>
<pre class="prettyprint">
%macro fib(n);
  %if &n = 1 %then 1; * first seed value;
  %else %if &n = 2 %then 1; * second seed value;
  %else %eval(%fib(%eval(&n-1))+%fib(%eval(&n-2))); * use recursion;
%mend;

* show values 1-5;
%put %fib(1);
%put %fib(2);
%put %fib(3);
%put %fib(4);
%put %fib(5);

* check values 1-10;
%macro check_fib;
  %if %fib(1) ne 1 %then %abort;
  %if %fib(2) ne 1 %then %abort;
  %if %fib(3) ne 2 %then %abort;
  %if %fib(4) ne 3 %then %abort;
  %if %fib(5) ne 5 %then %abort;
  %if %fib(6) ne 8 %then %abort;
  %if %fib(7) ne 13 %then %abort;
  %if %fib(8) ne 21 %then %abort;
  %if %fib(9) ne 34 %then %abort;
  %if %fib(10) ne 55 %then %abort;
  %put NOTE: OK!;
%mend;
%check_fib;
</pre>
<p>Fibonacci sequence in R using a recursive function that supports either single integers or a vector of integers:</p>
<pre class="prettyprint language-r">
fib <- function(n)
{
  if (length(n) > 1) return(sapply(n, fib)) # accept a numeric vector
  if (n == 1) return(1)  # first seed value
  if (n == 2) return(1)  # second seed value
  return(fib(n-1) + fib(n-2)) # use recursion
}
# print first five Fibonacci numbers
fib(1)
fib(2)
fib(3)
fib(4)
fib(5)
# verify the Fibonacci sequence 1 through 10
(actual <- fib(1:10))
(expected <- c(1,1,2,3,5,8,13,21,34,55))
all.equal(actual,expected)
</pre>
<p>For alternative implementations, see <a rel="external nofollow" href="http://sas-and-r.blogspot.com/2009/06/create-fibonacci-series.html">SAS and R: Example 7.1: Create a Fibonacci sequence</a>. In SAS, Nick Horton calculates the Fibonacci sequence using a DATA step, and in R he uses a for loop.</p>
<p>Adam Rich responded with his post <a href="http://adamleerich.com/2014/12/07/fibonacci-sequence-in-r-with-memoization/">Fibonacci Sequence in R with Memoization</a> which gives a performance boost by caching the results.</p>
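<p>The idea behind memoization can be sketched in a few lines, using an environment as a cache (an illustration, not Adam's exact code):</p>
<pre class="prettyprint language-r">
memo <- new.env()
fib.memo <- function(n)
{
  key <- as.character(n)
  if (exists(key, envir=memo, inherits=FALSE))
    return(get(key, envir=memo)) # cache hit: no recursion needed
  result <- if (n <= 2) 1 else fib.memo(n-1) + fib.memo(n-2)
  assign(key, result, envir=memo)
  result
}
fib.memo(30) # fast; the plain recursive fib makes exponentially many calls
</pre>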
<p>In the comments below, Rick Wicklin referred to his <a href="http://blogs.sas.com/content/iml/2010/09/30/twitter-and-the-fibonacci-sequence/">SAS/IML solution that generates the Fibonacci sequence iteratively</a> and <a href="http://blogs.sas.com/content/iml/2010/10/05/matrices-eigenvalues-fibonacci-and-the-golden-ratio/">Matrices, eigenvalues, Fibonacci, and the golden ratio</a>.</p>
<h1>Visualizing principal components with R and Sochi Olympic Athletes</h1>
<p><em>2014-03-27</em></p>
<p>Principal Components Analysis (PCA) is used as a dimensionality reduction method. Here we explain PCA step by step, using data about Sochi Olympic curlers.</p>
<p>It is hard to visualize a high dimensional space. When I took linear algebra, the book and teachers spoke about it as if it were easy to visualize a hyperspace, but later when I took the Coursera course Neural Networks for Machine Learning, Geoffrey Hinton gave the wise advice, "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it." In other words, people cannot visualize a high dimensional space, so we use a simpler problem (two dimensions of Olympic athlete data) to explain PCA.</p>
<p>First, we have one-dimensional data, where the only dimension is the curler's height.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBTBMdCF-JjjAelKOtBBXWoCe7xHeu_aM88Jvst9ThBXh2F_spWRy88khvBf_WYbgzlAgTSAx9TzBFVNRkTC4MMYkBmPrdTxW1EP7J0ZAQb8_opXYBemjEGOt6EsGEymFVV9P4C_MiH9bK/s1600/pca1-stripchart.png" />
<p>Next, we add a second dimension: the curler's weight. Notice there is a strong correlation between height and weight. Because of this redundancy, two dimensions are not necessary to represent most of the information.</p>
<p>By the way, if you look carefully at the first two images, notice the horizontal placement of the curlers is identical: adding the second axis moves the curlers only vertically.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmFARw6Si9LcAD7mxoJQdOE2QqPU2yWwzeQV_rmwYmQYT9H8wuH6hvCl5Wp2yOvk3Be2ebTXO9nmOKa9rPy6_E7CpcFGpzv_qyjZypJyWqE2Dk1zYgbY3kWzKehJ8rR5mdwKc7Uq85W0jC/s1600/pca2-scatterplot.png" />
<p>After performing PCA, there are two principal components. Because we want to simplify two dimensions into one dimension, we ignore the second principal component and plot the data onto the first component as red squares. The black lines join each original point (green) to its projection (red) onto a one-dimensional line.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzE9BHBqlAPeDL82Kbr6E48Vr4MaKZXor6h7YXBghSVoG2Xtb64mnn9C4l8BlOqm2S17ULlFIzxN5lLWNW8nj1w6sJl870aQLb36w44NcTLm_WVuPidUn9obAtT5dU7MatazplOKQbMKtm/s1600/pca3-pca-projection.png" />
<p>The blue line illustrates the first principal component. It's on this one-dimensional line that the two-dimensional space is projected.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTn_vAr8nJ6qUGb7qzsewpgxjdKtSQXIU7mnm10jOrbSbg33y9RDPiFjLub_R47Ihntm2YRVEUW7cEc1dwyAnj9mdU3FsA3mG2Pk1Z61JRTkW-tK-NfELe2LXsZ_FAwLiAH7v2fjP5ASLu/s1600/pca4-first-component-on-scatterplot.png" />
<p>Now we can show the same projections from the previous graph on their own one-dimensional strip chart, which captures most of the variation of the two-dimensional space in one dimension.</p>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSKBtrj4FE7yaDeHnEwOa4wwo3khxqfNlb6KTtUfDodg0X1ZRwUkjbXPY1yRGsTfc0uE_LDfNv-dq5l0yBK_Wr_whPC8OtZ1wUWoB4Q2rfacx6W3dxK4GiwMEMe2ImS4ClFNeblpVC7kYC/s1600/pca5-first-component-stripchart.png" />
<p>So in general PCA reduces the number of dimensions by projecting high dimensional data into a lower dimensional space. With higher dimensional data, it is often useful to keep more of the principal components. For graphing, two or three principal components are retained. For other purposes, the optimal number of components may be chosen using a scree plot or the minimum number of components that captures some percentage of the variation, say 90%.</p>
<p>Here is the R code.</p>
<pre class="prettyprint lang-r">
# Read data from CSV
# Download from http://www.danasilver.org/static/assets/sochi-2014-athletes/athletes.csv
# See below for a faster option.
athletes <- read.csv('athletes.csv')

# Subset data
ath <- athletes[athletes$sport=='Curling', c('height','weight')]
ath <- ath[complete.cases(ath),]

# ALTERNATIVELY, instead of downloading:
ath <- structure(list(height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93,
  1.7, 1.69, 1.84, 1.75, 1.83, 1.8, 1.8, 1.64), weight = c(66L,
  84L, 74L, 66L, 73L, 80L, 58L, 60L, 88L, 85L, 80L, 71L, 85L, 69L
  )), .Names = c("height", "weight"), row.names = c(536L, 624L,
  640L, 820L, 930L, 949L, 1191L, 1632L, 1818L, 2349L, 2583L, 2609L,
  2641L, 2696L), class = "data.frame")

# Plot 1 dimension (just height)
png('pca1-stripchart.png')
stripchart(ath$height, col="green", pch=19, cex=2,
           xlab="Height (m)",
           main="Curlers at Sochi 2014 Winter Olympics")
dev.off()

# Plot 2 dimensions
x <- as.matrix(ath)
plot2d <- function(col=3)
{
  plot(x, asp = 0, col = col, pch = 19, cex = 2,
       xlab="Height (m)",
       ylab="Weight (kg)",
       main="Curlers at Sochi 2014 Winter Olympics")
}
png('pca2-scatterplot.png')
plot2d()
dev.off()

# Perform PCA
pcX <- prcomp(x, retx = TRUE, scale = FALSE, center = TRUE)

# Project each point onto the first principal component
# (the first *column* of the rotation matrix), then shift
# back to the original center
transformed <- pcX$x[,1] %*% t(pcX$rotation[,1])
transformed <- scale(transformed, center = -pcX$center, scale = FALSE)

# Plot PCA projection
plot_pca <- function()
{
  plot2d()
  points(transformed, col = 2, pch = 15, cex = 2)
  segments(x[,1], x[,2], transformed[,1], transformed[,2])
}
png('pca3-pca-projection.png')
plot_pca()
dev.off()

# Draw first principal component over scatterplot
png('pca4-first-component-on-scatterplot.png')
plot_pca()
lm.fit <- lm(transformed[,2] ~ transformed[,1])
abline(lm.fit, col="blue", cex=1.5)
dev.off()

# Plot first principal component by itself
png('pca5-first-component-stripchart.png')
stripchart(pcX$x[,1], col="red", cex=2, pch=15,
           xlab="First principal component")
dev.off()
</pre>
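<p>To choose the number of components by the percentage of variation captured, as described above, here is a short sketch using the <b>pcX</b> object from the code:</p>
<pre class="prettyprint lang-r">
# proportion of variance explained by each component
summary(pcX)
# smallest number of components capturing at least 90% of the variance
cum.var <- cumsum(pcX$sdev^2) / sum(pcX$sdev^2)
which(cum.var >= 0.90)[1]
</pre>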
<p>This was tested on R 3.0.2 (64-bit). Thank you to Dana Silver for the <a href="http://www.danasilver.org/2014/02/25/sochi-2014-athletes/">Sochi athlete data</a> and to cbeleites for explaining <a rel="external nofollow" href="https://stats.stackexchange.com/questions/69251/r-function-prcomp-doesnt-project-my-2d-cloud-onto-the-principal-vector-as-expec">how to plot PCA projections with line segments</a>.</p>
<h1>Type I error rates in test of normality by simulation</h1>
<p><em>2014-02-26</em></p>
<p>This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.</p>
<p>First, we run a simulation in R. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.</p>
<pre class="prettyprint lang-r">
# type I error
alpha <- 0.05
# number of simulations
n.simulations <- 10000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error <- replicate(n.simulations,
  shapiro.test(rnorm(n.obs))$p.value) < alpha
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in a data frame for plotting
sim <- data.frame(
  n.simulations = 1:n.simulations,
  type.one.error.rate = cumsum(type.one.error) /
    seq_along(type.one.error))
# plot type I error as a function of the number of simulations
plot(sim, xlab="number of simulations",
     ylab="cumulative type I error rate")
# a line for the true error rate
abline(h=alpha, col="red")
# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
  geom_line() +
  xlab('number of simulations') +
  ylab('cumulative type I error rate') +
  ggtitle('Simulation of type I error in Shapiro-Wilk test') +
  geom_abline(intercept = 0.05, slope=0, col='red') +
  theme_bw()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijh6_JhqKkhYbMrqvTEchJ-NPUwITZp8oSkGv6hWTfzlh5HazJyuqrIbZLMP-YY1iJK3lhHA8UER5nPt1f2gLASD0EbF50o1qUo94onuxPPWASHjC_XpHYmfC4kBQNDn-v7H_i1WuAkmMj/s1600/shapiro-wilk2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijh6_JhqKkhYbMrqvTEchJ-NPUwITZp8oSkGv6hWTfzlh5HazJyuqrIbZLMP-YY1iJK3lhHA8UER5nPt1f2gLASD0EbF50o1qUo94onuxPPWASHjC_XpHYmfC4kBQNDn-v7H_i1WuAkmMj/s1600/shapiro-wilk2.png" /></a></div>
<p>As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.</p>
<p>It's elegant that the whole simulation can be condensed to 60 characters:</p>
<pre class="prettyprint lang-r">
mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)
</pre>
<p>Likewise, we now do a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.</p>
<pre class="prettyprint lang-sas">
data normal;
  length simulation 4 i 3; /* save space and time */
  do simulation = 1 to 10000;
    do i = 1 to 100;
      x = rand('normal');
      output;
    end;
  end;
run;

proc univariate data=normal noprint;
  by simulation;
  var x;
  output out=univariate n=n mean=mean std=std
    NormalTest=NormalTest probn=probn;
run;

data univariate;
  set univariate;
  type_one_error = probn<0.05;
run;

/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
  table type_one_error/nocum;
run;
</pre>
<p>In my SAS simulation the type I error rate was 5.21%.</p>
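<p>As a quick sanity check on that 5.21%: the Monte Carlo standard error of an estimated rate is sqrt(alpha*(1-alpha)/n), so with 10,000 simulations the estimate should typically fall within a couple tenths of a percent of the nominal 5%.</p>
<pre class="prettyprint lang-r">
# Monte Carlo standard error of the estimated type I error rate
sqrt(0.05 * 0.95 / 10000) # about 0.0022, so 5.21% is within ~1 standard error
</pre>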
<p>Tested with R 3.0.2 and SAS 9.3 on Windows 7.</p>