Sunday, January 28, 2018

Type I error rates in two-sample t-test by simulation

What do you do when analyzing data is fun, but you don't have any new data? You make it up.

This simulation tests the type I error rates of two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminders the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.

A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact it is not true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If there were no errors in the hypothesis tests in this simulation, we would never reject the null hypothesis, but by design it is normal to reject it according to alpha, the significance level. The de facto standard for alpha is 0.05.

R

First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.

Wednesday, January 10, 2018

Condition execution on row count

Use this code as a template for scenarios when you want to change how a SAS program runs depending on whether a data set is empty or not empty. For example, when a report is empty, you may want to not send an email with what would be a blank report. In other words, the report sends only when it has information.

On the other hand, you may want to send an email when a data set is empty if that means an automated SAS program had an error that requires manual intervention.

In general, it's good practice in automated SAS programs to check the size of a data sets in case they are empty or otherwise have the wrong number of observations. With one easy tweak, you could check for a specific minimum number of observations that is greater than zero. (This is left as an exercise for the reader.)