Heuristic Andrew: 2015

Wednesday, September 2, 2015

SAS macros always have a global scope

SAS allows the programmer to declare the scope of macro variables using %LOCAL or %GLOBAL, but the macros themselves are always created in the global scope.

Say you have a macro that in another language, say Python, would be considered a function. Within the macro you want a sub-macro (i.e., sub-function) to be used only within the outer macro.

%macro outer;
%put NOTE: outer;

/* This "sub-macro" is defined within the outer macro and is
   intended only for use within the outer macro. */
 %macro inner(foo);
 %put NOTE: inner &foo;
 %mend;

%inner(1);
%inner(2);
%mend;

%outer;

/* If the "sub-macro" has a local scope, the next step would fail */
%inner(3);

/* However, it succeeds */

This can lead to conflicts if the macro %inner is defined somewhere else in the same session. One way of dealing with this is to be careful to give the inner macro a unique name like __outer_inner where the underscores in the prefix suggest a local scope, and adding outer to the macro name indicates the macro is to be used only in the outer macro.

Gotcha with SAS, regular expressions, and end-of-line matching

Regular expressions are essential for sophisticated text processing, and it is generally easy to transfer knowledge of Perl regular expressions to the SAS functions prxparse, prxmatch, prxposn, etc. However, use caution with the end of line character ($) because of how SAS treats whitespace.

For demonstration I will run what looks like an equivalent use of regular expressions in Python, JavaScript, and SAS, but notice that only SAS does not match the string.

# Python 2.7
import re
first_name ='Andrew  '
first_name = first_name.strip()
if re.search(r"^Andrew$", first_name):
    print 'match' # it does match
else:
    print 'no match'

/*JavaScript */
first_name ='Andrew  ';
first_name = first_name.trim();
if (first_name.match(/^Andrew$/)) 
    alert('match'); /* it does match */
    else alert('no match');

(JavaScript fiddle for this code.)

data x;
 first_name='Andrew  ';
 first_name=strip(first_name);
 match=prxmatch('/^Andrew$/', first_name); /* it does not match (match=0) */
run;

In SAS ignore the trailing whitespace using the trim() function:

data x;
 first_name='Andrew  ';
 match=prxmatch('/^Andrew$/', trim(first_name)); /* it does match */
run;

SAS, however, does not distinguish a string that was inserted with trailing spaces from a string that was inserted without trailing spaces. In the following SAS-only example imagine the table was created and populated using a non-SAS system like MySQL or Microsoft SQL Server.

proc sql;
 create table names (
  first_name varchar(8)
 );

 insert into names values ('Andrew'); /* no trailing spaces */
 insert into names values ('Andrew '); /* one trailing space */
 insert into names values ('Andrew  ');
quit;

data names;
 set names;
 length=length(first_name);
 match1=prxmatch('/^Andrew$/', first_name);
 match2=prxmatch('/^Andrew$/', trim(first_name));
run;

This was tested with SAS 9.4M3 on Microsoft Windows 7.

Tuesday, June 9, 2015

List of user-installed R packages and their versions

This R command lists all the packages installed by the user (ignoring packages that come with R such as base and foreign) and the package versions.

ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)

Example output

       Package   Version
        bitops     1.0-6
 BradleyTerry2     1.0-6
          brew     1.0-6
         brglm     0.5-9
           car    2.0-25
         caret    6.0-47
          coin    1.0-24
    colorspace     1.2-6
        crayon     1.2.1
      devtools     1.8.0
     dichromat     2.0-0
        digest     0.6.8
         earth     4.4.0
      evaluate       0.7
[..snip..]

Tested with R 3.2.0.

This is a small step towards managing package versions: for a better solution, see the checkpoint package. You could also use the first column to reinstall user-installed R packages after an R upgrade.

Tuesday, March 3, 2015

SAS 9.4 crash with MySQL ODBC pass-through queries

SAS 9.4 (TS1M2) on X64_DS08R2 (Windows Server 2008 64-bit) always crashes with certain pass-through queries using MySQL Connector/ODBC 5.3.4. When it crashes, the SAS log shows some red messages, but SAS closes immediately.

The crash is not reproducible with other ODBC drivers, on SAS 9.3 64-bit, or SAS 9.4 32-bit.

Workarounds include: using an ODBC DSN instead of the connection string, not using pass-through queries, or using SAS 9.3.

SAS agreed to fix the bug.

Autocommit with ceODBC is slow

You already know that in Python it is faster to call executemany() than repeatedly calling execute() to INSERT the same number of rows because executemany() avoids rebinding the parameters, but what about the effect of autocommit on performance? While this is probably not specific to ceODBC, using autocommit is astonishingly slow. Here is how slow.

First, the Python code to run the benchmark:

import ceODBC
import datetime
import os
import time

connection_string="driver=sql server;database=database;server=server;" 
print connection_string

conn = None
cursor = None
def init_db():
    import ceODBC
    global conn
    global cursor
    conn = ceODBC.connect(connection_string)
    cursor = conn.cursor()

def table_exists():
    cursor.execute("select count(1) from information_schema.tables where table_name='zzz_ceodbc_test'")
    return cursor.fetchone()[0] == 1

def create_table():
    print('create_table')
    create_sql="""
CREATE TABLE zzz_ceodbc_test (
    col1 INT,
    col2 VARCHAR(50)
) """
    try:
        cursor.execute(create_sql)
        assert(table_exists())
    except:
        import traceback
        traceback.print_exc()

rows = []
for i in xrange(0,10000):
    rows.append((i,'abcd'))

def log_speed(start_time, end_time, records):
    elapsed_seconds = end_time - start_time
    if elapsed_seconds > 0:
        records_second = int(records / elapsed_seconds)
        # make elapsed_seconds an integer to shorten the string format
        elapsed_str = str(
            datetime.timedelta(seconds=int(elapsed_seconds)))
        print("{:,} records; {} records/sec; {} elapsed".format(records, records_second, elapsed_str))
    else:
        print("counter: %i records " % records)

 
 
def benchmark(bulk, autocommit):
    init_db()
    global conn
    global cursor
    conn.autocommit=True
    cursor.execute('truncate table zzz_ceodbc_test')
    
    conn.autocommit = autocommit
    insert_sql = 'insert into zzz_ceodbc_test (col1, col2) values (?,?)'
    
    start_time = time.time()
    if bulk:
        cursor.executemany(insert_sql, rows)
    else:
        for row in rows:
            cursor.execute(insert_sql, row)
    conn.commit()
    end_time = time.time()
    
    cursor.execute("select count(1) from zzz_ceodbc_test")
    assert cursor.fetchone()[0] == len(rows)
    
    log_speed(start_time, end_time, len(rows))
    conn.autocommit=True
    
    del cursor
    del conn
    return end_time - start_time


def benchmark_repeat(bulk, autocommit, repeats=5):
    description = "%s, autocommit=%s" % ('bulk' if bulk else 'one at a time', autocommit)
    print '\n******* %s' % description
    results = []
    for x in xrange(0, repeats):
        results.append(benchmark(bulk, autocommit))
    print results

benchmark_repeat(True, False)
benchmark_repeat(True, True)
benchmark_repeat(False, True)

And to graph the results in R:

results_table <- 'group seconds
bulk_manual 0.6710000038146973
bulk_manual 0.6710000038146973
bulk_manual 0.9830000400543213
bulk_manual 0.7330000400543213
bulk_manual 0.6710000038146973
bulk_auto 8.486999988555908
bulk_auto 8.269000053405762
bulk_auto 8.980999946594238
bulk_auto 8.453999996185303
bulk_auto 8.480999946594238
one_at_a_time 24.391000032424927
one_at_a_time 23.70300006866455
one_at_a_time 71.66299986839294
one_at_a_time 23.58899998664856
one_at_a_time 37.18400001525879'

results <- read.table(textConnection(results_table), header = TRUE)
closeAllConnections() 

library(ggplot2)
ggplot(results, aes(group, seconds)) + geom_boxplot()

Conclusion: executemany() with autocommit is 76% faster than execute(), and executemany() without autocommit is 91% faster than executemany() with autocommit. Also, executemany() gives more consistent performance.

Ran on Windows 7 Pro 64-bit, Python 2.7.9 32-bit, ceODBC 2.0.1, Microsoft SQL Server 11.0 SP1, R 3.1.2.

Wednesday, February 4, 2015

Party like it's 19999 (SAS)

On 03OCT2014 I must have missed the party in Cary, NC.

data _null_;
 format date date9.;
 date = 19999;
 put date=;
run;

So the next party is 03JUN2568?

data _null_;
 format date date9.;
 date = 222222;
 put date=;
run;

Thursday, January 29, 2015

LimeSurvey is allergic to Cloudflare Rocket Loader

In case you use LimeSurvey with Cloudflare, you may want to disable Rocket Loader, which "automatically asynchronously load all JavaScript resources." In LimeSurvey it causes problem saving questions (the save button does not do anything), disables tooltips for buttons in the administrative interface (so the icons are hard to interpret), and maybe causes other problems.

If you are not sure whether Rocket Loader is enabled, just look at the HTML source. If it is enabled, you will see "rocketloader" in the HTML source.

Cloudflare's Auto Minify seems safe to use.

I tested with LimeSurvey Version 2.05+ Build 141229, Firefox 35, and Google Chrome 40.

Heuristic Andrew