Tuesday, February 10, 2015

Autocommit with ceODBC is slow

You already know that in Python it is faster to call executemany() than repeatedly calling execute() to INSERT the same number of rows because executemany() avoids rebinding the parameters, but what about the effect of autocommit on performance? While this is probably not specific to ceODBC, using autocommit is astonishingly slow. Here is how slow.

First, the Python code to run the benchmark:

import ceODBC
import datetime
import os
import time

connection_string="driver=sql server;database=database;server=server;" 
print connection_string

conn = None
cursor = None
def init_db():
    import ceODBC
    global conn
    global cursor
    conn = ceODBC.connect(connection_string)
    cursor = conn.cursor()

def table_exists():
    cursor.execute("select count(1) from information_schema.tables where table_name='zzz_ceodbc_test'")
    return cursor.fetchone()[0] == 1

def create_table():
CREATE TABLE zzz_ceodbc_test (
    col1 INT,
    col2 VARCHAR(50)
) """
        import traceback

rows = []
for i in xrange(0,10000):

def log_speed(start_time, end_time, records):
    elapsed_seconds = end_time - start_time
    if elapsed_seconds > 0:
        records_second = int(records / elapsed_seconds)
        # make elapsed_seconds an integer to shorten the string format
        elapsed_str = str(
        print("{:,} records; {} records/sec; {} elapsed".format(records, records_second, elapsed_str))
        print("counter: %i records " % records)

def benchmark(bulk, autocommit):
    global conn
    global cursor
    cursor.execute('truncate table zzz_ceodbc_test')
    conn.autocommit = autocommit
    insert_sql = 'insert into zzz_ceodbc_test (col1, col2) values (?,?)'
    start_time = time.time()
    if bulk:
        cursor.executemany(insert_sql, rows)
        for row in rows:
            cursor.execute(insert_sql, row)
    end_time = time.time()
    cursor.execute("select count(1) from zzz_ceodbc_test")
    assert cursor.fetchone()[0] == len(rows)
    log_speed(start_time, end_time, len(rows))
    del cursor
    del conn
    return end_time - start_time

def benchmark_repeat(bulk, autocommit, repeats=5):
    description = "%s, autocommit=%s" % ('bulk' if bulk else 'one at a time', autocommit)
    print '\n******* %s' % description
    results = []
    for x in xrange(0, repeats):
        results.append(benchmark(bulk, autocommit))
    print results

benchmark_repeat(True, False)
benchmark_repeat(True, True)
benchmark_repeat(False, True)

And to graph the results in R:

results_table <- 'group seconds
bulk_manual 0.6710000038146973
bulk_manual 0.6710000038146973
bulk_manual 0.9830000400543213
bulk_manual 0.7330000400543213
bulk_manual 0.6710000038146973
bulk_auto 8.486999988555908
bulk_auto 8.269000053405762
bulk_auto 8.980999946594238
bulk_auto 8.453999996185303
bulk_auto 8.480999946594238
one_at_a_time 24.391000032424927
one_at_a_time 23.70300006866455
one_at_a_time 71.66299986839294
one_at_a_time 23.58899998664856
one_at_a_time 37.18400001525879'

results <- read.table(textConnection(results_table), header = TRUE)

ggplot(results, aes(group, seconds)) + geom_boxplot()

Conclusion: executemany() with autocommit is 76% faster than execute(), and executemany() without autocommit is 91% faster than executemany() with autocommit. Also, executemany() gives more consistent performance.

Ran on Windows 7 Pro 64-bit, Python 2.7.9 32-bit, ceODBC 2.0.1, Microsoft SQL Server 11.0 SP1, R 3.1.2.

Wednesday, February 4, 2015

Party like it's 19999 (SAS)

On 03OCT2014 I must have missed the party in Cary, NC.

data _null_;
 format date date9.;
 date = 19999;
 put date=;

So the next party is 03JUN2568?

data _null_;
 format date date9.;
 date = 222222;
 put date=;