You already know that in Python it is faster to call executemany() than repeatedly calling execute() to INSERT the same number of rows because executemany() avoids rebinding the parameters, but what about the effect of autocommit on performance? While this is probably not specific to ceODBC, using autocommit is astonishingly slow. Here is how slow.
First, the Python code to run the benchmark:
import ceODBC import datetime import os import time connection_string="driver=sql server;database=database;server=server;" print connection_string conn = None cursor = None def init_db(): import ceODBC global conn global cursor conn = ceODBC.connect(connection_string) cursor = conn.cursor() def table_exists(): cursor.execute("select count(1) from information_schema.tables where table_name='zzz_ceodbc_test'") return cursor.fetchone()[0] == 1 def create_table(): print('create_table') create_sql=""" CREATE TABLE zzz_ceodbc_test ( col1 INT, col2 VARCHAR(50) ) """ try: cursor.execute(create_sql) assert(table_exists()) except: import traceback traceback.print_exc() rows = [] for i in xrange(0,10000): rows.append((i,'abcd')) def log_speed(start_time, end_time, records): elapsed_seconds = end_time - start_time if elapsed_seconds > 0: records_second = int(records / elapsed_seconds) # make elapsed_seconds an integer to shorten the string format elapsed_str = str( datetime.timedelta(seconds=int(elapsed_seconds))) print("{:,} records; {} records/sec; {} elapsed".format(records, records_second, elapsed_str)) else: print("counter: %i records " % records) def benchmark(bulk, autocommit): init_db() global conn global cursor conn.autocommit=True cursor.execute('truncate table zzz_ceodbc_test') conn.autocommit = autocommit insert_sql = 'insert into zzz_ceodbc_test (col1, col2) values (?,?)' start_time = time.time() if bulk: cursor.executemany(insert_sql, rows) else: for row in rows: cursor.execute(insert_sql, row) conn.commit() end_time = time.time() cursor.execute("select count(1) from zzz_ceodbc_test") assert cursor.fetchone()[0] == len(rows) log_speed(start_time, end_time, len(rows)) conn.autocommit=True del cursor del conn return end_time - start_time def benchmark_repeat(bulk, autocommit, repeats=5): description = "%s, autocommit=%s" % ('bulk' if bulk else 'one at a time', autocommit) print '\n******* %s' % description results = [] for x in xrange(0, repeats): results.append(benchmark(bulk, autocommit)) print results benchmark_repeat(True, False) benchmark_repeat(True, True) benchmark_repeat(False, True)
And to graph the results in R:
results_table <- 'group seconds bulk_manual 0.6710000038146973 bulk_manual 0.6710000038146973 bulk_manual 0.9830000400543213 bulk_manual 0.7330000400543213 bulk_manual 0.6710000038146973 bulk_auto 8.486999988555908 bulk_auto 8.269000053405762 bulk_auto 8.980999946594238 bulk_auto 8.453999996185303 bulk_auto 8.480999946594238 one_at_a_time 24.391000032424927 one_at_a_time 23.70300006866455 one_at_a_time 71.66299986839294 one_at_a_time 23.58899998664856 one_at_a_time 37.18400001525879' results <- read.table(textConnection(results_table), header = TRUE) closeAllConnections() library(ggplot2) ggplot(results, aes(group, seconds)) + geom_boxplot()
Conclusion: executemany() with autocommit is 76% faster than execute(), and executemany() without autocommit is 91% faster than executemany() with autocommit. Also, executemany() gives more consistent performance.
Ran on Windows 7 Pro 64-bit, Python 2.7.9 32-bit, ceODBC 2.0.1, Microsoft SQL Server 11.0 SP1, R 3.1.2.
This isn't an example of "bulk loading" as understood by most RDBMSs, but rather the result of transaction scoping. The example does single row inserts within one transaction.
ReplyDeleteEach RDBMS has its own syntax for loading data from various forms of external flatfiles. (There is also "standard" SQL to copy rows from one table to another.) This method is (in my experience) always the fastest method. Depending on the RDBMS, one may lose (by choice or requirement) transaction semantics, of course.
Robert, yes, you are right. I updated the article that executemany() binds the query parameter once, and this is different than bulk loading.
Delete