Monday, August 31, 2015

Gotcha with SAS, regular expressions, and end-of-line matching

Regular expressions are essential for sophisticated text processing, and it is generally easy to transfer knowledge of Perl regular expressions to the SAS functions prxparse, prxmatch, prxposn, etc. However, use caution with the end of line character ($) because of how SAS treats whitespace.

For demonstration I will run what looks like an equivalent use of regular expressions in Python, JavaScript, and SAS, but notice that only SAS does not match the string.

# Python 2.7
import re
first_name ='Andrew  '
first_name = first_name.strip()
if re.search(r"^Andrew$", first_name):
    print 'match' # it does match
else:
    print 'no match'
/*JavaScript */
first_name ='Andrew  ';
first_name = first_name.trim();
if (first_name.match(/^Andrew$/)) 
    alert('match'); /* it does match */
    else alert('no match');

(JavaScript fiddle for this code.)

data x;
 first_name='Andrew  ';
 first_name=strip(first_name);
 match=prxmatch('/^Andrew$/', first_name); /* it does not match (match=0) */
run;

In SAS ignore the trailing whitespace using the trim() function:

data x;
 first_name='Andrew  ';
 match=prxmatch('/^Andrew$/', trim(first_name)); /* it does match */
run;

SAS, however, does not distinguish a string that was inserted with trailing spaces from a string that was inserted without trailing spaces. In the following SAS-only example imagine the table was created and populated using a non-SAS system like MySQL or Microsoft SQL Server.

proc sql;
 create table names (
  first_name varchar(8)
 );

 insert into names values ('Andrew'); /* no trailing spaces */
 insert into names values ('Andrew '); /* one trailing space */
 insert into names values ('Andrew  ');
quit;

data names;
 set names;
 length=length(first_name);
 match1=prxmatch('/^Andrew$/', first_name);
 match2=prxmatch('/^Andrew$/', trim(first_name));
run;
Screenshot of SAS data set

This was tested with SAS 9.4M3 on Microsoft Windows 7.

3 comments:

  1. Thanks Andrew... I expected the opposite to be true. Knowing SAS doesn't care about trailing spaces, I figured that surely $ would match the end of text in a line, rather than the absolute end of the length of the variable (which is kind of useless). The workaround of adding trim() to the input variable within the prxmatch function got it to work the way I'd expect.

    ReplyDelete
  2. Thanks a lot Andrew. I thought that I will have to rewrite my test cases if I would not have found your article. I was surprised to see this behaviour in sas and couldn't find a way out.This is the beauty of sharing, you took the trouble of writing this article that saved me huge trouble of searching an alternative to regex for writing test cases

    ReplyDelete
  3. Thank you for this post! I spent hours trying to figure out why SAS wouldn't match what I wanted to find. I already had applied the strip function to the sting in a previous data step but apparently that did not do the trick. It was not observable from copying the string to an editor either.

    ReplyDelete

Get HTML of iframes in Microsoft Playwright

Playwright is a powerful framework for web testing and automation. This article demonstrates how to extract the HTML content of child IFRAME...