Monday, August 31, 2015

Gotcha with SAS, regular expressions, and end-of-line matching

Regular expressions are essential for sophisticated text processing, and it is generally easy to transfer knowledge of Perl regular expressions to the SAS functions prxparse, prxmatch, prxposn, etc. However, use caution with the end of line character ($) because of how SAS treats whitespace.

For demonstration I will run what looks like an equivalent use of regular expressions in Python, JavaScript, and SAS, but notice that only SAS does not match the string.

# Python 2.7
import re
first_name ='Andrew  '
first_name = first_name.strip()
if re.search(r"^Andrew$", first_name):
    print 'match' # it does match
else:
    print 'no match'
/*JavaScript */
first_name ='Andrew  ';
first_name = first_name.trim();
if (first_name.match(/^Andrew$/)) 
    alert('match'); /* it does match */
    else alert('no match');

(JavaScript fiddle for this code.)

data x;
 first_name='Andrew  ';
 first_name=strip(first_name);
 match=prxmatch('/^Andrew$/', first_name); /* it does not match (match=0) */
run;

In SAS ignore the trailing whitespace using the trim() function:

data x;
 first_name='Andrew  ';
 match=prxmatch('/^Andrew$/', trim(first_name)); /* it does match */
run;

SAS, however, does not distinguish a string that was inserted with trailing spaces from a string that was inserted without trailing spaces. In the following SAS-only example imagine the table was created and populated using a non-SAS system like MySQL or Microsoft SQL Server.

proc sql;
 create table names (
  first_name varchar(8)
 );

 insert into names values ('Andrew'); /* no trailing spaces */
 insert into names values ('Andrew '); /* one trailing space */
 insert into names values ('Andrew  ');
quit;

data names;
 set names;
 length=length(first_name);
 match1=prxmatch('/^Andrew$/', first_name);
 match2=prxmatch('/^Andrew$/', trim(first_name));
run;
Screenshot of SAS data set

This was tested with SAS 9.4M3 on Microsoft Windows 7.