Wednesday, February 3, 2016

Undocumented SAS feature: Bulkloading to Netezza with ODBC interface

The SAS/ACCESS Interface to ODBC in SAS 9.4M4 states it supports bulk loading only to "Microsoft SQL Server data on Windows platforms." However, in practice on the Windows platform it also supports bulk loading to Netezza.

Bulk loading is amazingly fast. In some of my benchmarks the duration of the whole bulk loading operation is independent of the number of rows inserted!

By default on Netezza the bulk loading interface delimits values using a pipe character, and for cases where the values contain a pipe, SAS Access Interface to ODBC unofficially supports the BL_DELIMITER option to specify an alternate delimiter. For the ODBC interface, this option is undocumented.

However, there are nuances with the BL_DELIMITER option. According to the SAS Access Interface to Netezza:

You can use any 7-bit ASCII character as a delimiter. The default is the pipe symbol (ǀ). To use a printable ASCII character, enclose it in quotation marks (for example, BL_DELIMITER="|"). However, to use an extended character, use the three-digit decimal number representation of the ASCII character for this option. For example, set BL_DELIMITER=202 to use ASCII character 202 as a delimiter. You must specify decimal number delimiters as three digits even if the first two digits would be zero. For example, specify BL_DELIMITER=003, not BL_DELIMITER=3 or BL_DELIMITER=03.

First, notice a contradiction in the documentation. Because 7-bit characters are in the range 1-127 implies that 8-bit characters in the range 128-256 are not supported, but the documentation gives an example in this range (BL_DELIMITER=202).

Second, the syntax for the ODBC interface (which is not covered by the documentation for the ODBC interface) supports only a single character, so specifying a delimiter using the three-digit decimal notation will always cause an error.

For example, this works with the ODBC interface


options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/*
 Because a data set with one variable does not require
 a delimiter, this data set has two variables.
*/
data has_pipe;
/*
 Because the first character in 122 used below is a 1,
 here we test the number 1.
*/ 
 number=1;
/*
 Because by default the delimiter is a pipe, one of the
  values has a pipe.
*/
 char='I|have|a|pipe.';
 output;
run;


/* Use a lowercase z as the alternate delimiter */
data nz.has_pipe(bulkload=yes bl_delimiter='z');
    set has_pipe;
run;

However, the decimal representation fails.


/* 122 is the decimal representation of the lowercase z */
data nz.has_pipe(bulkload=yes bl_delimiter=122);
 set has_pipe;
run;

The SAS log shows SAS treats the decimal representation as a literal character and truncates it to the first character.


ODBC_25: Executed: on connection 8
CREATE EXTERNAL TABLE EXT_HAS_PIPE SAMEAS ADMIN.HAS_PIPE USING
(DATAOBJECT('\\.\pipe\BL_HAS_PIPE_3') DELIMITER '1' REMOTESOURCE 'ODBC' )

If you have it licensed, the SAS/ACCESS Interface to Netezza should support the decimal notation: in this case, I would suggest using a tab delimiter with BL_DELIMITER=009.

If not, you must either disable bulkloading or use a single, printable ASCII character as a delimiter. If your data set requires a full range of characters but never all characters on the same row (for example, some rows have a pipe while other rows have a caret), split your data set into two data sets, and then bulk load each data set using separate delimiters.

5 comments:

  1. Hi Andrew - have you got bulk loading from R working? Using the RODBC package loads the rows one at a time and is very slow!
    Cheers,
    Scott

    ReplyDelete
  2. Scott, I have used Netezza only with Python and SAS. In Python I created a CSV file and then used the CREATE EXTERNAL TABLE, and you should be able to do the same in R with RODBC. Actually, CREATE EXTERNAL TABZLE is what SAS does behind the scenes for bulkloading to Netezza too.

    ReplyDelete
  3. Thanks Andrew. I've got nzload working on the server which I call in R by the system() function. It essentially does what you said - bulk loads a csv file to a table.

    If anyone else stumbles upon this, I use sqlSave to create the table in Netezza with the first row of my table, then delete that row and use nzload to load the whole table.

    ReplyDelete
  4. Scott, thank you for posting your solution. Have you tried rownames=FALSE with sqlSave?

    Also for anyone reading: there is a command-line utility called nzload, but alternatively you can use external tables in a SQL command through the ODBC driver. It seems the nzload command does this anyway behind the scenes.

    Here is an example of an insert that loads a local CSV file /tmp/foo.csv into a remote Netezza table:
    INSERT INTO some_external_table
    SELECT *
    FROM EXTERNAL '/tmp/foo.csv' (
    a INT
    ,b INT
    ) USING

    (
    QUOTEDVALUE 'DOUBLE'
    delimiter ','
    DateDelim '/'
    Nullvalue ''
    DateStyle 'MDY'
    REMOTESOURCE 'ODBC'
    LOGDIR '/tmp/log'
    skiprows 1
    maxerrors 0
    )

    ReplyDelete
  5. Yeh I tried rownames=F, but it had no effect. Still loaded rows one by one. Interesting about the query command - haven't tested if there is any performance difference.
    Thanks!

    ReplyDelete