Heuristic Andrew: sas

Showing posts with label sas. Show all posts

Thursday, February 3, 2022

Generate random names and addresses from SAS

For testing data processing systems (e.g., CRM, record linkage), you may need to generate fake people. SAS makes it uniquely easy to generate an unlimited count of fake US residents because it comes with a data set of US zip codes, which include the city and state name.

The system uses four data sets: first names, last names, street names, and US zip codes. Initials are randomly generated from letters. The street addresses probably do not exist in the given zip codes.

You could extend this by:

Add street directions (i.e., N, S, E, W)
Add street post type (e.g., Dr., Ct.)
Add units (e.g., Apt B, Ste 101)
Add post office boxes and private mail boxes
Spell out the middle name
Add name prefix (e.g., Dr., Mr.)
Add name suffix (e.g., Jr., Sr.)

Estimating birth date from age

This code demonstrates an algorithm for estimating birth date from age. We cannot know the exact birth date, but we can get close: the maximum error is half a year, and the typical error is one quarter of a year.


/* The %age macro was taken from the Internet---maybe from here http://support.sas.com/kb/24/808.html ? */
%macro age(date,birth);
floor ((intck('month',&birth,&date) - (day(&date) < day(&birth))) / 12)
%mend age;

/*
Generate 10000 fake people with random birth dates and random perspective days
on which their age was measured. Then, calculate age from that perspective date.
In reality, there is some seasonality to births (e.g., more births in July), but 
here we assume each day of the year has an equal distribution of births.
*/
data person;
	format birth_date submit_date yymmdd10.;
	do i = 1 to 10000;
		birth_date = %randbetween(19000,20500);
		submit_date = birth_date + %randbetween(0,100*365);
		age = %age(submit_date, birth_date);
		output;
	end;
	drop i;
%runquit;

/* Work in reverse from age to estimated birth date. */
data reverse;
	set person;
	format birth_date_min birth_date_max yymmdd10.;
	birth_date_min = intnx('years', submit_date, -1 * (age+1), 's') - 1;
	birth_date_max = intnx('years',birth_date_min,1,'s') + 1;

    /* check range of estimates for errors */
	min_error = (birth_date > birth_date_min);
	max_error = (birth_date < birth_date_max);

    /* estimate birth date as the middle of the range */
	birth_date_avg = mean(birth_date_min, birth_date_max);
    
    /* calculate variance */
	abs_days_error = abs(birth_date - birth_date_avg);
%runquit;

/* Both errors should always be zero. */
proc freq data=reverse;
	table min_error max_error;
quit;

/* Error of estimates range from 0 to 183.5 with a median of 92 and average of 91.*/
proc means data=reverse n nmiss min median mean max;
	var abs_days_error;
quit;

/* Distribution of errors is uniform */
proc sgplot data=reverse;
	histogram abs_days_error;
quit;

Tested with SAS 9.4M6

Monday, January 28, 2019

SAS Message Log with ODBC: COMMIT performed on connection #

When using SAS to develop high-performance queries against remote SQL databases, it is helpful to see the exact ODBC messages that SAS passes to the driver. Sometimes the implicit SQL poorly translates a query, which can be optimized. To see these message, enable the SAS trace like this:

options sastrace=',,,d' sastraceloc=saslog nostsuffix;

However, when closing the SAS process, there can be a pop-up dialog window with the title "SAS Message Log" with entries like this:

ODBC: COMMIT performed on connection #6.
ODBC: COMMIT performed on connection #5.
ODBC: COMMIT performed on connection #4.
ODBC: COMMIT performed on connection #3.
ODBC: COMMIT performed on connection #2.
ODBC: COMMIT performed on connection #1.
ODBC: COMMIT performed on connection #0.

When running SAS interactively, this is a minor nuisance. When running SAS in an automated batch, this can be a serious problem because the dialog will wait indefinitely for human interaction, so the sas.exe process will never terminate.

Type I error rates in two-sample t-test by simulation

What do you do when analyzing data is fun, but you don't have any new data? You make it up.

This simulation tests the type I error rates of two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminders the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.

A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact it is not true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If there were no errors in the hypothesis tests in this simulation, we would never reject the null hypothesis, but by design it is normal to reject it according to alpha, the significance level. The de facto standard for alpha is 0.05.

R

First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.

Condition execution on row count

Use this code as a template for scenarios when you want to change how a SAS program runs depending on whether a data set is empty or not empty. For example, when a report is empty, you may want to not send an email with what would be a blank report. In other words, the report sends only when it has information.

On the other hand, you may want to send an email when a data set is empty if that means an automated SAS program had an error that requires manual intervention.

In general, it's good practice in automated SAS programs to check the size of a data sets in case they are empty or otherwise have the wrong number of observations. With one easy tweak, you could check for a specific minimum number of observations that is greater than zero. (This is left as an exercise for the reader.)

SAS ERROR: Cannot load SSL support. on Microsoft Windows

When using SAS with HTTPS or FTPS, which requires SSL/TLS support, you may see this error message in the SAS log.

ERROR: Cannot load SSL support.

Here is an example of code that can trigger the error.

filename myref url "https://www.google.com";
data _null_; 
infile myref; 
run;

The cause was that

SAS error "insufficient memory" on remote queries with wide rows

SAS can give the error The SAS System stopped processing this step because of insufficient memory when querying a single, wide row from a remote SQL Server. The following code fully demonstrates the problem and shows a workaround. Also, I eliminate the explanation that SAS data sets in general do not support rows this wide.

Reusing calculated columns in Netezza and SAS queries

Netezza and SAS allow a query to reference a calculated column by name in the SELECT, WHERE, and ORDER BY clauses. Based on the DRY principle, this reduces code and makes code easier to read and maintain.

Some people call calculated columns derived or computed columns.

In Microsoft SQL Server, SQLite, and other RDBMSs you cannot exactly do this: a workaround is to reference a subquery or view. In Microsoft SQL Server, you can also define a computed column on a table.

Below is an example tested with Netezza 7.2. Notice height_m is used in the SELECT clause, and bmi is used in the WHERE and ORDER BY clauses.

In case of error in SAS program, send email and stop

Any automated program should check for errors and unexpected conditions, such as inability to access a resource and presence of invalid values. Unlike traditional programming languages such as Python and C# that stop processing when an error occurs, SAS barrels ahead through the rest of the program. Therefore, carelessly-written SAS programs can create unwanted side effects, such as overwriting an output data set with bad data.

Previously I wrote about a robust solution for checking SAS error codes which wraps the entire program in a macro and invokes %GOTO EXIT in case an error. This is still the ideal solution when some part of the program must continue, but it comes at a cost: wrapping SAS code in a macro disables syntax highlighting in the SAS Enhanced Editor (though not in SAS Studio). Also, it can be awkward to work with the large code block delimited by the macro, so this post focuses on two alternatives.

ISO 3166-1 alpha-2 (two-letter country code) format for SAS

Here is the widely-used ISO 3166-1 alpha-2 format for use in SAS. It is commonly called the two-letter country code format.

The PROC FORMAT code generates a character format, so where the raw data contains a code, such as US, it expands it to the pretty name, such as United States. As with any SAS format, applying the format does not change the underlying data.

Undocumented SAS feature: Bulkloading to Netezza with ODBC interface

The SAS/ACCESS Interface to ODBC in SAS 9.4M4 states it supports bulk loading only to "Microsoft SQL Server data on Windows platforms." However, in practice on the Windows platform it also supports bulk loading to Netezza.

Bulk loading is amazingly fast. In some of my benchmarks the duration of the whole bulk loading operation is independent of the number of rows inserted!

By default on Netezza the bulk loading interface delimits values using a pipe character, and for cases where the values contain a pipe, SAS Access Interface to ODBC unofficially supports the BL_DELIMITER option to specify an alternate delimiter. For the ODBC interface, this option is undocumented.

However, there are nuances with the BL_DELIMITER option. According to the SAS Access Interface to Netezza:

Frequency of individual characters from SAS data set

This script counts the frequencies of individual ASCII characters in a single column in a SAS data set and then prints an easy-to-read report.

My initial motivation relates to delimiters. By default bulkloading data from Netezza to SAS (which is very fast) uses the pipe character as a delimiter, but my data set contained values with the pipe character, so this macro identifies alternative delimiters.

Another potential use is cracking a message encrypted using a simple letter substitution cipher.

To begin, this code creates an example data set courtesy of William Shakespeare.

SAS macros always have a global scope

SAS allows the programmer to declare the scope of macro variables using %LOCAL or %GLOBAL, but the macros themselves are always created in the global scope.

Say you have a macro that in another language, say Python, would be considered a function. Within the macro you want a sub-macro (i.e., sub-function) to be used only within the outer macro.

%macro outer;
%put NOTE: outer;

/* This "sub-macro" is defined within the outer macro and is
   intended only for use within the outer macro. */
 %macro inner(foo);
 %put NOTE: inner &foo;
 %mend;

%inner(1);
%inner(2);
%mend;

%outer;

/* If the "sub-macro" has a local scope, the next step would fail */
%inner(3);

/* However, it succeeds */

This can lead to conflicts if the macro %inner is defined somewhere else in the same session. One way of dealing with this is to be careful to give the inner macro a unique name like __outer_inner where the underscores in the prefix suggest a local scope, and adding outer to the macro name indicates the macro is to be used only in the outer macro.

Gotcha with SAS, regular expressions, and end-of-line matching

Regular expressions are essential for sophisticated text processing, and it is generally easy to transfer knowledge of Perl regular expressions to the SAS functions prxparse, prxmatch, prxposn, etc. However, use caution with the end of line character ($) because of how SAS treats whitespace.

For demonstration I will run what looks like an equivalent use of regular expressions in Python, JavaScript, and SAS, but notice that only SAS does not match the string.

# Python 2.7
import re
first_name ='Andrew  '
first_name = first_name.strip()
if re.search(r"^Andrew$", first_name):
    print 'match' # it does match
else:
    print 'no match'

/*JavaScript */
first_name ='Andrew  ';
first_name = first_name.trim();
if (first_name.match(/^Andrew$/)) 
    alert('match'); /* it does match */
    else alert('no match');

(JavaScript fiddle for this code.)

data x;
 first_name='Andrew  ';
 first_name=strip(first_name);
 match=prxmatch('/^Andrew$/', first_name); /* it does not match (match=0) */
run;

In SAS ignore the trailing whitespace using the trim() function:

data x;
 first_name='Andrew  ';
 match=prxmatch('/^Andrew$/', trim(first_name)); /* it does match */
run;

SAS, however, does not distinguish a string that was inserted with trailing spaces from a string that was inserted without trailing spaces. In the following SAS-only example imagine the table was created and populated using a non-SAS system like MySQL or Microsoft SQL Server.

proc sql;
 create table names (
  first_name varchar(8)
 );

 insert into names values ('Andrew'); /* no trailing spaces */
 insert into names values ('Andrew '); /* one trailing space */
 insert into names values ('Andrew  ');
quit;

data names;
 set names;
 length=length(first_name);
 match1=prxmatch('/^Andrew$/', first_name);
 match2=prxmatch('/^Andrew$/', trim(first_name));
run;

This was tested with SAS 9.4M3 on Microsoft Windows 7.

Tuesday, March 3, 2015

SAS 9.4 crash with MySQL ODBC pass-through queries

SAS 9.4 (TS1M2) on X64_DS08R2 (Windows Server 2008 64-bit) always crashes with certain pass-through queries using MySQL Connector/ODBC 5.3.4. When it crashes, the SAS log shows some red messages, but SAS closes immediately.

The crash is not reproducible with other ODBC drivers, on SAS 9.3 64-bit, or SAS 9.4 32-bit.

Workarounds include: using an ODBC DSN instead of the connection string, not using pass-through queries, or using SAS 9.3.

SAS agreed to fix the bug.

Party like it's 19999 (SAS)

On 03OCT2014 I must have missed the party in Cary, NC.

data _null_;
 format date date9.;
 date = 19999;
 put date=;
run;

So the next party is 03JUN2568?

data _null_;
 format date date9.;
 date = 222222;
 put date=;
run;

Wednesday, December 17, 2014

SAS crash with BULKLOAD and ODBC Driver 11 for SQL Server

SAS 9.4 (TS1M2) crashes hard when using BULKLOAD=YES with the latest Microsoft ODBC Driver 11 for SQL Server. For a brief moment you may see in the SAS log ERROR: BCP initialize error: [Microsoft][ODBC Driver 11 for SQL Server]Connection is not enabled for BCP, and then the SAS window closes.

The Windows Event Viewer has the following information:

Faulting application name: sas.exe, version: 9402.0.21456.21239, time stamp: 0x53d05339
Faulting module name: ntdll.dll, version: 6.1.7601.18229, time stamp: 0x51fb1072
Exception code: 0xc0000374
Fault offset: 0x000ce753
Faulting process id: 0x5724
Faulting application start time: 0x01d004d90519e3fc
Faulting application path: C:\Program Files\SASHome2\x86\SASFoundation\9.4\sas.exe
Faulting module path: C:\Windows\SysWOW64\ntdll.dll
Report Id: 4c84f206-70cc-11e4-bd42-0205857feb80

SAS Support acknowledged the crash and will address it in the next maintenance release. At this time I do not see any KB article.

Fibonacci sequence in R and SAS

Because the Fibonacci sequence is simply defined by recursion, it makes for an elegant programming exercise. Here is one way to do it in SAS, and another way to do it in R. I've also included unit testing code to check that it works.

Change the process priority in SAS

This SAS code programmatically changes the process priority for the current SAS program. You may want to use this for SAS batch jobs.

For example, when I run multiprocessor jobs using Ian J. Ghent %multiThreadDataStep macro, the Windows user interface becomes less responsive and my batch job competes with other processes, but using this method addresses both issues.

Simply run the line corresponding to the desired priority level. To decrease process priority, choose one of the last two.

x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'high priority'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'above normal'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'normal'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'below normal'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'idle'";

Tested on SAS 9.4 on Windows 7 and SAS 9.3 on Windows 2008.

I don't have SAS for Linux, but I guess it would look like this to increase the niceness and therefore reduce the process priority:

x "renice -p 16 &SYSJOBID";

Wednesday, February 26, 2014

Type I error rates in test of normality by simulation

This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.

Thursday, February 3, 2022

Thursday, December 10, 2020

Monday, January 28, 2019

Sunday, January 28, 2018

R

Wednesday, January 10, 2018

Tuesday, August 30, 2016

Thursday, June 23, 2016

Wednesday, June 8, 2016

Tuesday, March 15, 2016

Monday, February 22, 2016

Wednesday, February 3, 2016

Friday, January 29, 2016

Wednesday, September 2, 2015

Monday, August 31, 2015

Tuesday, March 3, 2015

Wednesday, February 4, 2015

Wednesday, December 17, 2014

Friday, December 5, 2014

Monday, June 23, 2014

Wednesday, February 26, 2014