Heuristic Andrew: r

Showing posts with label r. Show all posts

Sunday, January 28, 2018

Type I error rates in two-sample t-test by simulation

What do you do when analyzing data is fun, but you don't have any new data? You make it up.

This simulation tests the type I error rates of two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminders the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.

A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact it is not true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If there were no errors in the hypothesis tests in this simulation, we would never reject the null hypothesis, but by design it is normal to reject it according to alpha, the significance level. The de facto standard for alpha is 0.05.

R

First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.

R: InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'

Today the two-year-old TLS security certificate for cran.r-project.org expired, so suddenly in R you are getting errors running install.packages or update.packages.

The error looks like this:

> update.packages()
--- Please select a CRAN mirror for use in this session ---
Error in download.file(url, destfile = f, quiet = TRUE) : 
  cannot open URL 'https://cran.r-project.org/CRAN_mirrors.csv'
In addition: Warning message:
In download.file(url, destfile = f, quiet = TRUE) :
  InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'

The workaround is simple: choose another repository! For example:

List of user-installed R packages and their versions

This R command lists all the packages installed by the user (ignoring packages that come with R such as base and foreign) and the package versions.

ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)

Example output

       Package   Version
        bitops     1.0-6
 BradleyTerry2     1.0-6
          brew     1.0-6
         brglm     0.5-9
           car    2.0-25
         caret    6.0-47
          coin    1.0-24
    colorspace     1.2-6
        crayon     1.2.1
      devtools     1.8.0
     dichromat     2.0-0
        digest     0.6.8
         earth     4.4.0
      evaluate       0.7
[..snip..]

Tested with R 3.2.0.

This is a small step towards managing package versions: for a better solution, see the checkpoint package. You could also use the first column to reinstall user-installed R packages after an R upgrade.

Tuesday, February 10, 2015

Autocommit with ceODBC is slow

You already know that in Python it is faster to call executemany() than repeatedly calling execute() to INSERT the same number of rows because executemany() avoids rebinding the parameters, but what about the effect of autocommit on performance? While this is probably not specific to ceODBC, using autocommit is astonishingly slow. Here is how slow.

First, the Python code to run the benchmark:

import ceODBC
import datetime
import os
import time

connection_string="driver=sql server;database=database;server=server;" 
print connection_string

conn = None
cursor = None
def init_db():
    import ceODBC
    global conn
    global cursor
    conn = ceODBC.connect(connection_string)
    cursor = conn.cursor()

def table_exists():
    cursor.execute("select count(1) from information_schema.tables where table_name='zzz_ceodbc_test'")
    return cursor.fetchone()[0] == 1

def create_table():
    print('create_table')
    create_sql="""
CREATE TABLE zzz_ceodbc_test (
    col1 INT,
    col2 VARCHAR(50)
) """
    try:
        cursor.execute(create_sql)
        assert(table_exists())
    except:
        import traceback
        traceback.print_exc()

rows = []
for i in xrange(0,10000):
    rows.append((i,'abcd'))

def log_speed(start_time, end_time, records):
    elapsed_seconds = end_time - start_time
    if elapsed_seconds > 0:
        records_second = int(records / elapsed_seconds)
        # make elapsed_seconds an integer to shorten the string format
        elapsed_str = str(
            datetime.timedelta(seconds=int(elapsed_seconds)))
        print("{:,} records; {} records/sec; {} elapsed".format(records, records_second, elapsed_str))
    else:
        print("counter: %i records " % records)

 
 
def benchmark(bulk, autocommit):
    init_db()
    global conn
    global cursor
    conn.autocommit=True
    cursor.execute('truncate table zzz_ceodbc_test')
    
    conn.autocommit = autocommit
    insert_sql = 'insert into zzz_ceodbc_test (col1, col2) values (?,?)'
    
    start_time = time.time()
    if bulk:
        cursor.executemany(insert_sql, rows)
    else:
        for row in rows:
            cursor.execute(insert_sql, row)
    conn.commit()
    end_time = time.time()
    
    cursor.execute("select count(1) from zzz_ceodbc_test")
    assert cursor.fetchone()[0] == len(rows)
    
    log_speed(start_time, end_time, len(rows))
    conn.autocommit=True
    
    del cursor
    del conn
    return end_time - start_time


def benchmark_repeat(bulk, autocommit, repeats=5):
    description = "%s, autocommit=%s" % ('bulk' if bulk else 'one at a time', autocommit)
    print '\n******* %s' % description
    results = []
    for x in xrange(0, repeats):
        results.append(benchmark(bulk, autocommit))
    print results

benchmark_repeat(True, False)
benchmark_repeat(True, True)
benchmark_repeat(False, True)

And to graph the results in R:

results_table <- 'group seconds
bulk_manual 0.6710000038146973
bulk_manual 0.6710000038146973
bulk_manual 0.9830000400543213
bulk_manual 0.7330000400543213
bulk_manual 0.6710000038146973
bulk_auto 8.486999988555908
bulk_auto 8.269000053405762
bulk_auto 8.980999946594238
bulk_auto 8.453999996185303
bulk_auto 8.480999946594238
one_at_a_time 24.391000032424927
one_at_a_time 23.70300006866455
one_at_a_time 71.66299986839294
one_at_a_time 23.58899998664856
one_at_a_time 37.18400001525879'

results <- read.table(textConnection(results_table), header = TRUE)
closeAllConnections() 

library(ggplot2)
ggplot(results, aes(group, seconds)) + geom_boxplot()

Conclusion: executemany() with autocommit is 76% faster than execute(), and executemany() without autocommit is 91% faster than executemany() with autocommit. Also, executemany() gives more consistent performance.

Ran on Windows 7 Pro 64-bit, Python 2.7.9 32-bit, ceODBC 2.0.1, Microsoft SQL Server 11.0 SP1, R 3.1.2.

Heuristic Andrew

Sunday, January 28, 2018

Type I error rates in two-sample t-test by simulation

R

Thursday, March 10, 2016

R: InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'

Tuesday, June 9, 2015

List of user-installed R packages and their versions

Tuesday, February 10, 2015

Autocommit with ceODBC is slow

Friday, December 5, 2014

Fibonacci sequence in R and SAS

Thursday, March 27, 2014

Visualizing principal components with R and Sochi Olympic Athletes

Wednesday, February 26, 2014

Type I error rates in test of normality by simulation

Benchmarking the TP-Link TL-WPA7510 Powerline kit