tag:blogger.com,1999:blog-76660668160411906612024-03-16T12:49:34.082-06:00Heuristic AndrewR, SAS, machine learning, and statisticsAndrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.comBlogger20125tag:blogger.com,1999:blog-7666066816041190661.post-91226769287869879342022-02-03T10:03:00.002-07:002022-02-03T10:03:52.049-07:00Generate random names and addresses from SAS<p>For testing data processing systems (e.g., CRM, record linkage), you may need to generate fake people. SAS makes it
uniquely easy to generate an unlimited count of fake US residents because it comes with a data set of
US zip codes, which include the city and state name.</p>
<p>The system uses four data sets: first names, last names, street names, and US zip codes. Initials are randomly generated from letters. The street addresses probably do not exist in the given zip codes. </p>
<p>You could extend this by:</p>
<ul>
<li>Add street directions (i.e., N, S, E, W)</li>
<li>Add street post type (e.g., Dr., Ct.)</li>
<li>Add units (e.g., Apt B, Ste 101)</li>
<li>Add post office boxes and private mail boxes</li>
<li>Spell out the middle name</li>
<li>Add name prefix (e.g., Dr., Mr.)</li>
<li>Add name suffix (e.g., Jr., Sr.)</li>
</ul>
<a name='more'></a>
<pre class="prettyprint">
%let mv_person_count = 10000; /* how many people to make */
%let mv_max_street_num = 20000; /* largest street number */
/* https://www.ssa.gov/OACT/babynames/decades/century.html */
data first;
format first_name $20.;
input first_name $;
first_name_id = _n_;
datalines;
James
Robert
John
Michael
William
David
Richard
Joseph
Thomas
Charles
Christopher
Daniel
Matthew
Anthony
Mark
Donald
Steven
Paul
Andrew
Joshua
Kenneth
Kevin
Brian
George
Edward
Ronald
Timothy
Jason
Jeffrey
Ryan
Jacob
Gary
Nicholas
Eric
Jonathan
Stephen
Larry
Justin
Scott
Brandon
Benjamin
Samuel
Gregory
Frank
Alexander
Raymond
Patrick
Jack
Dennis
Jerry
Tyler
Aaron
Jose
Adam
Henry
Nathan
Douglas
Zachary
Peter
Kyle
Walter
Ethan
Jeremy
Harold
Keith
Christian
Roger
Noah
Gerald
Carl
Terry
Sean
Austin
Arthur
Lawrence
Jesse
Dylan
Bryan
Joe
Jordan
Billy
Bruce
Albert
Willie
Gabriel
Logan
Alan
Juan
Wayne
Roy
Ralph
Randy
Eugene
Vincent
Russell
Elijah
Louis
Bobby
Philip
Johnny
Mary
Patricia
Jennifer
Linda
Elizabeth
Barbara
Susan
Jessica
Sarah
Karen
Nancy
Lisa
Betty
Margaret
Sandra
Ashley
Kimberly
Emily
Donna
Michelle
Dorothy
Carol
Amanda
Melissa
Deborah
Stephanie
Rebecca
Sharon
Laura
Cynthia
Kathleen
Amy
Shirley
Angela
Helen
Anna
Brenda
Pamela
Nicole
Emma
Samantha
Katherine
Christine
Debra
Rachel
Catherine
Carolyn
Janet
Ruth
Maria
Heather
Diane
Virginia
Julie
Joyce
Victoria
Olivia
Kelly
Christina
Lauren
Joan
Evelyn
Judith
Megan
Cheryl
Andrea
Hannah
Martha
Jacqueline
Frances
Gloria
Ann
Teresa
Kathryn
Sara
Janice
Jean
Alice
Madison
Doris
Abigail
Julia
Judy
Grace
Denise
Amber
Marilyn
Beverly
Danielle
Theresa
Sophia
Marie
Diana
Brittany
Natalie
Isabella
Charlotte
Rose
Alexis
Kayla
Homer
Marge
Bart
Lisa
Maggie
;
/* https://www.thoughtco.com/most-common-us-surnames-1422656 */
data last;
format last_name $20.;
input last_name $;
last_name_id = _n_;
datalines;
Smith
Johnson
Williams
Brown
Jones
Garcia
Miller
Davis
Rodriguez
Martinez
Hernandez
Lopez
Gonzales
Wilson
Anderson
Thomas
Taylor
Moore
Jackson
Martin
Lee
Perez
Thompson
White
Harris
Sanchez
Clark
Ramirez
Lewis
Robinson
Walker
Young
Allen
King
Wright
Scott
Torres
Nguyen
Hill
Flores
Green
Adams
Nelson
Baker
Hall
Rivera
Campbell
Mitchell
Carter
Roberts
Gomez
Phillips
Evans
Turner
Diaz
Parker
Cruz
Edwards
Collins
Reyes
Stewart
Morris
Morales
Murphy
Cook
Rogers
Gutierrez
Ortiz
Morgan
Cooper
Peterson
Bailey
Reed
Kelly
Howard
Ramos
Kim
Cox
Ward
Richardson
Watson
Brooks
Chavez
Wood
James
Bennet
Gray
Mendoza
Ruiz
Hughes
Price
Alvarez
Castillo
Sanders
Patel
Myers
Long
Ross
Foster
Jimenez
Simpson
;
/* https://www.nlc.org/resource/most-common-u-s-street-names/ */
data street;
format street_name $20.;
input street_name $;
street_name_id = _n_;
datalines;
Second
Third
First
Fourth
Park
Fifth
Main
Sixth
Oak
Seventh
Pine
Maple
Cedar
Eighth
Elm
View
Washington
Ninth
Lake
Hill
Evergreen
;
data person0;
do i = 1 to &mv_person_count;
first_name_id = %RandBetween(1, 205);
last_name_id = %RandBetween(1, 101);
street_name_id = %RandBetween(1, 21);
zip_code_id = %RandBetween(1, 40000);
output;
end;
drop i;
run;
data zip;
set sashelp.zipcode(keep=zip city statecode);
zip_code_id = _n_;
run;
proc sql;
create table person1 as
select
f.first_name,
l.last_name,
s.street_name,
z.city '',
z.statecode as state '',
z.zip as zip_numeric ''
from person0 as p
join first as f on
f.first_name_id = p.first_name_id
join last as l on
l.last_name_id = p.last_name_id
join street as s on
s.street_name_id = p.street_name_id
join zip as z on
z.zip_code_id = p.zip_code_id;
quit;
data person2;
format name street city state zip $50.;
set person1;
initial = byte(int(65+26*ranuni(0)));
name = catx(' ', first_name, initial, last_name);
/* RandBetween from https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
street_num = put(%RandBetween(1,&mv_max_street_num),10.);
street = catx(' ',street_num,street_name);
zip = put(zip_numeric, z5.);
drop zip_numeric street_name street_num first_name initial last_name;
run;
</pre>
<p>Example output table with ten randomly generated fake people:</p>
<style>.mmt-table,.mmt-table-custom{width:100%;table-layout:fixed;}.mmt-table tr td, .mmt-table thead tr th{height:24px!important;padding:4px;border:1px solid #aaa;overflow: auto;} .mmt-table thead tr th {background: #e8a647; color: #fff; text-align: inherit;}.mmt-table-custom-parent{margin:0;padding:4px}.mmt-table tr:nth-child(odd) {background: #ededed} .mmt-table tr:nth-child(even) {background: #cdcdcd} .mmt-table td:nth-child(odd) {color: #545454} .mmt-table td:nth-child(even) {color: #000000}.mmt-table tr td, .mmt-table thead tr th {border : 1px solid #aaaaaa}.mmt-table thead tr th {background: #e8a647; color: #fff; text-align: inherit;}</style><div class="mmt-custom-content col-sm-12 mmt-table-custom-parent col-lg-12">
<table id="TABLE-1" class="mmt-table mmt-table-custom selected">
<thead><tr><th class="mmt-selected-col">name</th><th>street</th><th>city</th><th>state</th><th>zip</th></tr></thead>
<tbody>
<tr class="">
<td class="mmt-selected-col">Steven I Murphy</td><td>8206 Fifth</td><td>Frankfort</td><td>KY</td><td>40619</td></tr><tr><td class="mmt-selected-col">Mary M Williams</td><td>5076 Seventh</td><td>Evensville</td><td>TN</td><td>37332</td></tr><tr><td class="mmt-selected-col">Jeffrey Y Lopez</td><td>3485 Third</td><td>Henning</td><td>IL</td><td>61848</td></tr><tr><td class="mmt-selected-col">Richard Z Sanders</td><td>5500 Sixth</td><td>Kimball</td><td>NE</td><td>69145</td></tr><tr><td class="mmt-selected-col">Russell M Smith</td><td>16425 Sixth</td><td>Lexington</td><td>KY</td><td>40515</td></tr><tr><td class="mmt-selected-col">Johnny R Carter</td><td>11949 Eighth</td><td>Mount Hope</td><td>OH</td><td>44660</td></tr><tr><td class="mmt-selected-col">Raymond V Green</td><td>4659 Park</td><td>West Helena</td><td>AR</td><td>72390</td></tr><tr><td class="mmt-selected-col">Megan N Anderson</td><td>8437 Third</td><td>Chico</td><td>CA</td><td>95927</td></tr><tr><td class="mmt-selected-col">Isabella A Ross</td><td>8151 Evergreen</td><td>Barstow</td><td>MD</td><td>20610</td></tr><tr><td class="mmt-selected-col">Sharon Q Flores</td><td>3022 Lake</td><td>Poth</td><td>TX</td><td>78147</td></tr></tbody></table></div>
<p>Want to generate people names with Python instead of SAS? See <a href="https://github.com/az0/entity-metadata/blob/master/code/combine_people_names.py">combine_people_names.py</a> for a system that generates random people using <a href="https://sourceforge.net/projects/entity-metadata/files/wikidata/">Wikidata biographies</a>.</p> <div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-86224278290762592722020-12-10T17:56:00.005-07:002020-12-10T18:00:42.144-07:00Estimating birth date from age<p>This code demonstrates an algorithm for estimating birth date from age. We cannot know the exact birth date, but we can get close: the maximum error is half a year, and the typical error is one quarter of a year.</p>
<pre>
/* The %age macro was taken from the Internet---maybe from here http://support.sas.com/kb/24/808.html ? */
%macro age(date,birth);
floor ((intck('month',&birth,&date) - (day(&date) < day(&birth))) / 12)
%mend age;
/*
Generate 10000 fake people with random birth dates and random perspective days
on which their age was measured. Then, calculate age from that perspective date.
In reality, there is some seasonality to births (e.g., more births in July), but
here we assume each day of the year has an equal distribution of births.
*/
data person;
format birth_date submit_date yymmdd10.;
do i = 1 to 10000;
birth_date = %randbetween(19000,20500);
submit_date = birth_date + %randbetween(0,100*365);
age = %age(submit_date, birth_date);
output;
end;
drop i;
%runquit;
/* Work in reverse from age to estimated birth date. */
data reverse;
set person;
format birth_date_min birth_date_max yymmdd10.;
birth_date_min = intnx('years', submit_date, -1 * (age+1), 's') - 1;
birth_date_max = intnx('years',birth_date_min,1,'s') + 1;
/* check range of estimates for errors */
min_error = (birth_date > birth_date_min);
max_error = (birth_date < birth_date_max);
/* estimate birth date as the middle of the range */
birth_date_avg = mean(birth_date_min, birth_date_max);
/* calculate variance */
abs_days_error = abs(birth_date - birth_date_avg);
%runquit;
/* Both errors should always be zero. */
proc freq data=reverse;
table min_error max_error;
quit;
/* Error of estimates range from 0 to 183.5 with a median of 92 and average of 91.*/
proc means data=reverse n nmiss min median mean max;
var abs_days_error;
quit;
/* Distribution of errors is uniform */
proc sgplot data=reverse;
histogram abs_days_error;
quit;
</pre>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXORffS0VsJYmAHrwOVEMA3cwgGBC7RcI_3VSwmKLLYqUIPdxNCN6Uw08OO5jgAA-JPB0lxcFJCgpql_t-fDH4-slMB2cUvOScK4fRw62CBcCDkdzb4ZihyphenhyphenLorOibNyLah_XJQyZycYgZJ/s0/SGPlot.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="480" data-original-width="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXORffS0VsJYmAHrwOVEMA3cwgGBC7RcI_3VSwmKLLYqUIPdxNCN6Uw08OO5jgAA-JPB0lxcFJCgpql_t-fDH4-slMB2cUvOScK4fRw62CBcCDkdzb4ZihyphenhyphenLorOibNyLah_XJQyZycYgZJ/s0/SGPlot.png"/></a></div>
<p>Tested with SAS 9.4M6<div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-68142783651821582612019-01-28T10:14:00.000-07:002019-01-28T10:14:35.657-07:00SAS Message Log with ODBC: COMMIT performed on connection #<p>When using SAS to develop high-performance queries against remote SQL databases, it is helpful to see the exact ODBC messages that SAS passes to the driver. Sometimes the implicit SQL poorly translates a query, which can be optimized. To see these message, enable the SAS trace like this:</p>
<pre>
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
</pre>
<p>However, when closing the SAS process, there can be a pop-up dialog window with the title "SAS Message Log" with entries like this:</p>
<pre>
ODBC: COMMIT performed on connection #6.
ODBC: COMMIT performed on connection #5.
ODBC: COMMIT performed on connection #4.
ODBC: COMMIT performed on connection #3.
ODBC: COMMIT performed on connection #2.
ODBC: COMMIT performed on connection #1.
ODBC: COMMIT performed on connection #0.
</pre>
<p>When running SAS interactively, this is a minor nuisance. When running SAS in an automated batch, this can be a serious problem because the dialog will wait indefinitely for human interaction, so the sas.exe process will never terminate.</p>
<a name='more'></a>
<p>This isn't exactly a bug, but it can feel like it. Sadly, SAS provides no convenient options like these:</p>
<ul>
<li>Never show the SAS message pop-up dialog when the SAS editor has closed.</li>
<li>Automatically close the pop-up dialog after 60 seconds of inactivity.</li>
<li>Filter all traces with the text "ODBC commit."</li>
</ul>
<p>The SAS developer has these options:</p>
<ul>
<li>Disable the SAS trace.</li>
<li>Send the SAS trace to a file like this: <pre>options sastrace=',,,d' sastraceloc=file 'c:\sastest\mytrace.log' nostsuffix;</pre>
</li>
<li>Manually close the SAS Message Log whenever it appears.</li>
<li>Use <a href="https://gist.github.com/az0/06be331b83ba82f6bdbe8763a3cdeb4a">my Python script to automatically close the "SAS Message Log" dialog</a> whenever it appears.</li>
</ul>
<p>Tested on SAS 9.4M5 and Python 3.7 on Windows 10 and Windows Server 2008.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-13682260196422501612018-01-28T14:51:00.001-07:002018-01-28T15:26:16.925-07:00 Type I error rates in two-sample t-test by simulation <p>What do you do when analyzing data is fun, but you don't have any new data? You make it up.</p>
<p>This simulation tests the type I error rates of two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminders the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.<p>
<p>A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact it is not true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If there were no errors in the hypothesis tests in this simulation, we would never reject the null hypothesis, but by design it is normal to reject it according to <i>alpha</i>, the significance level. The de facto standard for alpha is 0.05.</p>
<h2>R</h2>
<p>First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.</p>
<a name='more'></a>
<pre class="prettyprint lang-r">
# type I error
alpha.p <- 0.05
# number of simulations
n.simulations <- 1000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error<-replicate(n.simulations, t.test(rnorm(n.obs),rnorm(n.obs),
var.equal=TRUE)$p.value)<alpha.p
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in data frame for plotting
sim <- data.frame(
n.simulations = 1:n.simulations,
type.one.error.rate = cumsum(type.one.error) / seq_along(type.one.error))
# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
geom_line() +
xlab('Number of simulations') +
ylab('Cumulative type I error rate') +
ggtitle('Simulation of type I error in t-test') +
geom_abline(intercept = alpha.p, slope=0, col='red') +
theme_bw()
</pre>
<div class="separator" style="clear: both; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrEh_hiU1LlPja3-mAisRhfUsnmdkMI4S9tEaMxIU5sdAngvMsv9U4v5RjdL7cgYFL-EdqXvMf9jYrchzcX9jyGnZRAuSGCGDS8Xz1moUFz3YcJo0DgiNAFQ93f-TmlC5u8uV8sDW-QYIR/s1600/t_test_2sample_r.png" data-original-width="550" data-original-height="550" /></div>
<h2>SAS</h2>
<p>Likewise, here is the equivalent code to do the same in SAS. Notice the simulation is implemented not as a slow SAS macro. Instead, it uses the BY statement in PROC TTEST.</p>
<pre class="prettyprint lang-sas">
/*
Create a data set with 1000 simulations. Each simulation
has 100 observations in each of two groups.
*/
data normal;
length simulation 4 i 3; /* save space and time */
do simulation = 1 to 1000;
do i = 1 to 100;
group='A';
/* The values are normally distributed */
x = rand('normal');
output;
group='B';
x = rand('normal');
output;
end;
end;
run;
/*
Run two-sample t-test once for each simulation, and output to
a data set called ttests.
*/
ods _all_ close;
ods output ttests=ttests;
proc ttest plots=none data=normal;
by simulation;
class group;
var x;
run;
data ttests;
set ttests;
/* Limit the rows */
if variances='Equal';
/* Define the error as a boolean */
type_one_error = probt<0.05;
/* cumulative error */
retain cumulative_error_count;
format cumulative_error_rate percent10.2;
label cumulative_error_rate = 'Cumulative error rate';
if simulation eq 1 then cumulative_error_count = 0;
cumulative_error_count+type_one_error;
cumulative_error_rate = cumulative_error_count /simulation;
run;
/* Summarize the type I error rates for this simulation */
ods html;
proc freq data=ttests;
table type_one_error/nocum;
run;
/* Draw a line plot */
proc sgplot data=ttests;
series x=simulation y=cumulative_error_rate;
refline 0.05 /axis=y lineattrs=(color=red);
run;
</pre>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb2XD-5tfMB8unjUV_XNAmXS49v7QinRqHosp1OnTKN-6ERtHsHL6VP3P_YTVWpDMfchWf9ZiAxV4aGUJHhZ8uOJBtgwICH_-KEq4v1PzpjtmxI0pB4u80FEJHlfC5Jz79bbwmXBuSvVww/s1600/sas_type_i_two_sample_t_test.png" data-original-width="640" data-original-height="480" />
<h2>Sawtooth</h2>
<p>Did you notice the sawtooth pattern in the error rate? The incidence of a false positive is relatively rare, and when it happens, there is a spike in the error rate. Then for each simulation in which there is no false positive, the rate drops by a steady rate because the count of simulations (the denominator) is an integer.</p>
<h2>Conclusion</h2>
<p>This article was developed on Ubuntu 16.04 with R 3.4 and Windows 7 with SAS 9.4.</p>
<p>See also the article: <a href="http://heuristicandrew.blogspot.com/2014/02/type-i-error-rates-in-test-of-normality.html"> Type I error rates in test of normality by simulation </a>.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com1tag:blogger.com,1999:blog-7666066816041190661.post-15634471371669628062018-01-10T15:06:00.000-07:002018-01-12T13:04:13.150-07:00Condition execution on row count<p>Use this code as a template for scenarios when you want to change how a SAS program runs depending on whether a data set is empty or not empty. For example, when a report is empty, you may want to not send an email with what would be a blank report. In other words, the report sends only when it has information.</p>
<p>On the other hand, you may want to send an email when a data set is empty if that means an automated SAS program had an error that requires manual intervention.</p>
<p>In general, it's good practice in automated SAS programs to check the size of a data sets in case they are empty or otherwise have the wrong number of observations. With one easy tweak, you could check for a specific minimum number of observations that is greater than zero. (This is left as an exercise for the reader.)</p>
<a name='more'></a>
<pre class="prettyprint">
/*
This creates a sample data set with one record.
*/
data mydata;
input x;
datalines;
1
;
/*
This creates a sample data set with zero records.
It has the same name as above, so if you want to test the scenario
with a non-empty data set, simply do not run this step.
*/
data mydata;
input x;
datalines;
;
/*
Count the number of observations, and store the count in a
macro variable.
*/
data _null_;
if 0 then set mydata nobs=record_count;
call symput('mv_record_count', put(record_count, 20.));
stop;
run;
/* Print count to the log. */
%put &=mv_record_count;
/* Define a macro */
%macro conditional_run;
%if &mv_record_count gt 0 %then %do;
%put NOTE: this runs when the data set is not empty;
%end;
%else %do;
%put NOTE: this runs when the data set is empty;
%end;
%mend;
/* Run macro */
%conditional_run;
</pre>
<p>I tested this on SAS 9.4 on Windows 7, though it should work on practically all SAS systems.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-62393259808289034962016-08-30T11:52:00.000-06:002016-08-30T11:52:56.330-06:00SAS ERROR: Cannot load SSL support. on Microsoft Windows<p>When using SAS with HTTPS or FTPS, which requires SSL/TLS support, you may see this error message in the SAS log.</p>
<pre class="prettyprint">
ERROR: Cannot load SSL support.
</pre>
<p>Here is an example of code that can trigger the error.
<pre class="prettyprint">
filename myref url "https://www.google.com";
data _null_;
infile myref;
run;
</pre>
<p>The cause was that
<a name='more'></a>
<em>SAS/Secure Client Components</em> was not installed, so I resolved the issue by running the SAS Deployment Wizard to install <em>SAS/Secure Client Components</em>.</p>
<p>Tested with SAS 9.4 M3 on Microsoft Windows 7. The error may also happen with encrypted SMTP, but I did not test SMTP.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-74811397327626646592016-06-23T13:04:00.000-06:002016-06-23T13:06:53.689-06:00SAS error "insufficient memory" on remote queries with wide rows<p>SAS can give the error <b>The SAS System stopped processing this step because of insufficient memory</b> when querying a single, wide row from a remote SQL Server. The following code fully demonstrates the problem and shows a workaround. Also, I eliminate the explanation that SAS data sets in general do not support rows this wide.</p>
<a name='more'></a>
<pre class="prettyprint">
proc sql;
/* Connect to Microsoft SQL Server */
connect using sbox;
execute (
/* Create the table */
create table dbo.zzz_varchar_max (
id int,
txt1 varchar(max),
txt2 varchar(max)
);
/* Insert a single row */
insert into zzz_varchar_max values (1, 'foo', 'bar');
) by sbox;
quit;
/* This step triggers the error */
data _null_;
set sbox.zzz_varchar_max;
run;
/* This step does NOT trigger the error */
data _null_;
set sbox.zzz_varchar_max(drop=txt2);
run;
/* ERROR: The SAS System stopped processing this step because of insufficient memory. */
/* This step inspects the metadata. SAS considers each character column as having
the width 32767, which is the maximum string size for a SAS data set according to
https://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001336069.htm
*/
proc contents data=sbox.zzz_varchar_max;
run;
/* This step shows that SAS allows creating a data set that is even
wider than the query that fails, so the error isn't a fundamental limitation
of SAS. */
data wide;
id =1;
format txt1 txt2 txt3 $32767.;
txt1='foo';
txt2='bar';
txt3='';
run;
</pre>
<p>If the data set has a single column with NVARCHAR(MAX) or VARCHAR(MAX), there is no error. It happens only when there are (at least) two such wide columns.</p>
<p>Another workarounds include: use the KEEP option on the data set to KEEP only one of the wide columns, use a PROC SQL statement to query only one of the columns, or use a remote SQL query (maybe with SUBSTR) to truncate the columns.</p>
<p>Another workaround is to switch from the modern Microsoft ODBC driver (<tt>driver=ODBC Driver 11 for SQL Server</tt>) to the ancient driver (<tt>driver=sql server</tt>) by changing the ODBC connection string or DSN.<p>
<p>I tested with SAS 9.4 TS1M3 32-bit, Microsoft SQL Server 2012 (11.0 SP2), and the ODBC Driver 11 for SQL Server (2014.120.2000.08).</p>
<div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-20785613602467047292016-06-08T08:29:00.001-06:002016-06-08T10:26:21.795-06:00Reusing calculated columns in Netezza and SAS queries<p>Netezza and SAS allow a query to reference a <i>calculated</i> column by name in the SELECT, WHERE, and ORDER BY clauses. Based on the DRY principle, this reduces code and makes code easier to read and maintain.</p>
<p>Some people call calculated columns <i>derived</i> or <i>computed</i> columns.</p>
<p>In Microsoft SQL Server, SQLite, and other RDBMSs you cannot exactly do this: a workaround is to reference a subquery or view. In Microsoft SQL Server, you can also define a computed column on a table.</p>
<p>Below is an example tested with Netezza 7.2. Notice <tt>height_m</tt> is used in the SELECT clause, and <tt>bmi</tt> is used in the WHERE and ORDER BY clauses.</p>
<a name='more'></a>
<pre class="prettyprint lang-sql">
CREATE TEMP TABLE people (weight_kg INT, height_m float);
INSERT INTO people
VALUES (50, 1.6);
INSERT INTO people
VALUES (70, 1.8);
INSERT INTO people
VALUES (150, 1.8);
SELECT weight_kg
,height_m
,height_m*height_m as height_m_squared
,weight_kg/(<span style="background:yellow">height_m_squared</span>)::int as bmi
FROM people
WHERE <span style="background:yellow">bmi</span> < 30
ORDER BY <span style="background:yellow">bmi</span>;
</pre>
<p>Below is an example tested with SAS 9.4.</p>
<pre class="prettyprint">
data people;
input weight_kg height_m;
datalines;
50 1.6
70 1.8
150 1.8
;
proc sql;
select
weight_kg,
height_m,
height_m*height_m as height_m_squared,
weight_kg/(calculated height_m_squared) as bmi
from
people
where
calculated bmi < 30
order by
calculated bmi;
quit;
</pre><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-29907743668078706252016-03-15T16:47:00.000-06:002016-03-18T13:27:10.059-06:00In case of error in SAS program, send email and stop<p>Any automated program should check for errors and unexpected conditions, such as inability to access a resource and presence of invalid values. Unlike traditional programming languages such as Python and C# that stop processing when an error occurs, SAS barrels ahead through the rest of the program. Therefore, carelessly-written SAS programs can create unwanted side effects, such as overwriting an output data set with bad data.</p>
<p>Previously I wrote about a <a href="https://heuristically.wordpress.com/2012/02/09/return-codes-errors-sas/">robust solution for checking SAS error codes</a> which wraps the entire program in a macro and invokes %GOTO EXIT in case an error. This is still the ideal solution when some part of the program must continue, but it comes at a cost: wrapping SAS code in a macro disables syntax highlighting in the SAS Enhanced Editor (though not in SAS Studio). Also, it can be awkward to work with
the large code block delimited by the macro, so this post focuses on two alternatives.</p>
<a name='more'></a>
<p>The easiest method is to enable <tt>OPTIONS ERRORABEND</tt>, which exits the program in case of an error. The major benefit is it requires only a single line of code to enable for the rest of the program, regardless of how many steps there are. For example:</p>
<pre class="prettyprint">
OPTIONS ERRORABEND;
/* Misspelled data set */
data female;
set sashelp.clss;
if sex='F';
run;
</pre>
<p>However, ERRORABEND can be painful while developing a program in interactive mode because it immediately exits the SAS session, and it cannot push error notifications.</p>
<p>This motivates the following approach. After each DATA STEP or procedure, call a macro that checks the sanity of preceding steps. In case of an error, the universal error handler can send a notification email, clean up, etc. Like ERRORABEND, syntax highlighting still works. Like the other options, it can check errors for DATA STEPs and various procedures. Though it requires code after each step, it is less code than the first solution proposed in the <a href="https://heuristically.wordpress.com/2012/02/09/return-codes-errors-sas/">original article</a>.</p>
<pre class="prettyprint">
/* This macro is called from two other macros below. It defines any
common error handling code, and it should be customized.
In some programs, you may want to clean up here.
*/
%macro email_and_abort;
/* Send an email */
filename mymail email "andrew@example.com" subject="error in SAS program";
data _null_;
file mymail;
put 'Check the SAS logs';
run;
/* Stop further processing */
%abort cancel;
%mend;
/* This macro asserts the last procedure did not throw an error
or a warning. We will invoke this macro after each step. */
%macro expect_no_syserr;
/* If there is an error or warning... */
%if &syserr ne 0 %then %do;
/* Write the error code to the SAS log */
%put ERROR: &=syserr;/
%email_and_abort;
%end;
%mend;
/* Assuming SASHELP.CLASS exists and contains the character
variable SEX, this DATA STEP will succeed. */
data female;
set sashelp.class;
if sex='F';
run;
%expect_no_syserr;
/* Likewise, the macro works with procedures such as PROC FREQ. */
proc freq data=sashelp.class noprint;
table age/out=age_freq;
run;
%expect_no_syserr;
/* This macro asserts PROC SQL creates at least one observation. */
%macro expect_any_obs;
%if &sqlobs eq 0 %then %do;
%put ERROR: no observations;
%email_and_abort;
%end;
%mend;
/* This PROC SQL shows that the assertions working with PROC
SQL and that two assertions can be combined. */
proc sql;
create table age_count as
select
age,
count(1) as count
from
sashelp.class
group by
age;
quit;
%expect_no_syserr;
%expect_any_obs;
/* Misspelled data set. This will fail to demonstrate the macros. */
data female;
set sashelp.clss;
if sex='F';
run;
%expect_no_syserr;
</pre>
<p>Similar assertion macros can check that a data set exists, variables exist in a a data set, observations exist in a data set, a path is writable, a file exists, and so on.</p>
<p>Tested with SAS 9.4M3.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com3tag:blogger.com,1999:blog-7666066816041190661.post-81296235930403431272016-02-22T13:13:00.001-07:002016-02-22T13:13:58.166-07:00ISO 3166-1 alpha-2 (two-letter country code) format for SAS<p>Here is the widely-used ISO 3166-1 alpha-2 format for use in SAS. It is commonly called the two-letter country code format.</p>
<p>The PROC FORMAT code generates a character format, so where the raw data contains a code, such as US, it expands it to the pretty name, such as United States. As with any SAS format, applying the format does not change the underlying data.</p>
<a name='more'></a>
<pre class="prettyprint">
proc format;
/* ISO 3166-1 alpha-2 two letter country codes */
value $ iso3166alphatwo
'AF' = 'Afghanistan'
'AX' = 'Åland Islands'
'AL' = 'Albania'
'DZ' = 'Algeria'
'AS' = 'American Samoa'
'AD' = 'Andorra'
'AO' = 'Angola'
'AI' = 'Anguilla'
'AQ' = 'Antarctica'
'AG' = 'Antigua and Barbuda'
'AR' = 'Argentina'
'AM' = 'Armenia'
'AW' = 'Aruba'
'AU' = 'Australia'
'AT' = 'Austria'
'AZ' = 'Azerbaijan'
'BS' = 'Bahamas'
'BH' = 'Bahrain'
'BD' = 'Bangladesh'
'BB' = 'Barbados'
'BY' = 'Belarus'
'BE' = 'Belgium'
'BZ' = 'Belize'
'BJ' = 'Benin'
'BM' = 'Bermuda'
'BT' = 'Bhutan'
'BO' = 'Bolivia, Plurinational State of'
'BQ' = 'Bonaire, Sint Eustatius and Saba'
'BA' = 'Bosnia and Herzegovina'
'BW' = 'Botswana'
'BV' = 'Bouvet Island'
'BR' = 'Brazil'
'IO' = 'British Indian Ocean Territory'
'BN' = 'Brunei Darussalam'
'BG' = 'Bulgaria'
'BF' = 'Burkina Faso'
'BI' = 'Burundi'
'KH' = 'Cambodia'
'CM' = 'Cameroon'
'CA' = 'Canada'
'CV' = 'Cape Verde'
'KY' = 'Cayman Islands'
'CF' = 'Central African Republic'
'TD' = 'Chad'
'CL' = 'Chile'
'CN' = 'China'
'CX' = 'Christmas Island'
'CC' = 'Cocos (Keeling) Islands'
'CO' = 'Colombia'
'KM' = 'Comoros'
'CG' = 'Congo'
'CD' = 'Congo, the Democratic Republic of the'
'CK' = 'Cook Islands'
'CR' = 'Costa Rica'
'CI' = 'Côte d''Ivoire'
'HR' = 'Croatia'
'CU' = 'Cuba'
'CW' = 'Curaçao'
'CY' = 'Cyprus'
'CZ' = 'Czech Republic'
'DK' = 'Denmark'
'DJ' = 'Djibouti'
'DM' = 'Dominica'
'DO' = 'Dominican Republic'
'EC' = 'Ecuador'
'EG' = 'Egypt'
'SV' = 'El Salvador'
'GQ' = 'Equatorial Guinea'
'ER' = 'Eritrea'
'EE' = 'Estonia'
'ET' = 'Ethiopia'
'FK' = 'Falkland Islands (Malvinas)'
'FO' = 'Faroe Islands'
'FJ' = 'Fiji'
'FI' = 'Finland'
'FR' = 'France'
'GF' = 'French Guiana'
'PF' = 'French Polynesia'
'TF' = 'French Southern Territories'
'GA' = 'Gabon'
'GM' = 'Gambia'
'GE' = 'Georgia'
'DE' = 'Germany'
'GH' = 'Ghana'
'GI' = 'Gibraltar'
'GR' = 'Greece'
'GL' = 'Greenland'
'GD' = 'Grenada'
'GP' = 'Guadeloupe'
'GU' = 'Guam'
'GT' = 'Guatemala'
'GG' = 'Guernsey'
'GN' = 'Guinea'
'GW' = 'Guinea-Bissau'
'GY' = 'Guyana'
'HT' = 'Haiti'
'HM' = 'Heard Island and McDonald Mcdonald Islands'
'VA' = 'Holy See (Vatican City State)'
'HN' = 'Honduras'
'HK' = 'Hong Kong'
'HU' = 'Hungary'
'IS' = 'Iceland'
'IN' = 'India'
'ID' = 'Indonesia'
'IR' = 'Iran, Islamic Republic of'
'IQ' = 'Iraq'
'IE' = 'Ireland'
'IM' = 'Isle of Man'
'IL' = 'Israel'
'IT' = 'Italy'
'JM' = 'Jamaica'
'JP' = 'Japan'
'JE' = 'Jersey'
'JO' = 'Jordan'
'KZ' = 'Kazakhstan'
'KE' = 'Kenya'
'KI' = 'Kiribati'
'KP' = 'Korea, Democratic People''s Republic of'
'KR' = 'Korea, Republic of'
'KW' = 'Kuwait'
'KG' = 'Kyrgyzstan'
'LA' = 'Lao People''s Democratic Republic'
'LV' = 'Latvia'
'LB' = 'Lebanon'
'LS' = 'Lesotho'
'LR' = 'Liberia'
'LY' = 'Libya'
'LI' = 'Liechtenstein'
'LT' = 'Lithuania'
'LU' = 'Luxemourg'
'MO' = 'Macao'
'MK' = 'Macedonia, the Former Yugoslav Republic of'
'MG' = 'Madagascar'
'MW' = 'Malawi'
'MY' = 'Malaysia'
'MV' = 'Maldives'
'ML' = 'Mali'
'MT' = 'Malta'
'MH' = 'Marshall Islands'
'MQ' = 'Martinique'
'MR' = 'Mauritania'
'MU' = 'Mauritius'
'YT' = 'Mayotte'
'MX' = 'Mexico'
'FM' = 'Micronesia, Federated States of'
'MD' = 'Moldova, Republic of'
'MC' = 'Monaco'
'MN' = 'Mongolia'
'ME' = 'Montenegro'
'MS' = 'Montserrat'
'MA' = 'Morocco'
'MZ' = 'Mozambique'
'MM' = 'Myanmar'
'NA' = 'Namibia'
'NR' = 'Nauru'
'NP' = 'Nepal'
'NL' = 'Netherlands'
'NC' = 'New Caledonia'
'NZ' = 'New Zealand'
'NI' = 'Nicaragua'
'NE' = 'Niger'
'NG' = 'Nigeria'
'NU' = 'Niue'
'NF' = 'Norfolk Island'
'MP' = 'Northern Mariana Islands'
'NO' = 'Norway'
'OM' = 'Oman'
'PK' = 'Pakistan'
'PW' = 'Palau'
'PS' = 'Palestine, State of'
'PA' = 'Panama'
'PG' = 'Papua New Guinea'
'PY' = 'Paraguay'
'PE' = 'Peru'
'PH' = 'Philippines'
'PN' = 'Pitcairn'
'PL' = 'Poland'
'PT' = 'Portugal'
'PR' = 'Puerto Rico'
'QA' = 'Qatar'
'RE' = 'Réunion'
'RO' = 'Romania'
'RU' = 'Russian Federation'
'RW' = 'Rwanda'
'BL' = 'Saint Barthélemy'
'SH' = 'Saint Helena, Ascension and Tristan da Cunha'
'KN' = 'Saint Kitts and Nevis'
'LC' = 'Saint Lucia'
'MF' = 'Saint Martin (French part)'
'PM' = 'Saint Pierre and Miquelon'
'VC' = 'Saint Vincent and the Grenadines'
'WS' = 'Samoa'
'SM' = 'San Marino'
'ST' = 'Sao Tome and Principe'
'SA' = 'Saudi Arabia'
'SN' = 'Senegal'
'RS' = 'Serbia'
'SC' = 'Seychelles'
'SL' = 'Sierra Leone'
'SG' = 'Singapore'
'SX' = 'Sint Maarten (Dutch part)'
'SK' = 'Slovakia'
'SI' = 'Slovenia'
'SB' = 'Solomon Islands'
'SO' = 'Somalia'
'ZA' = 'South Africa'
'GS' = 'South Georgia and the South Sandwich Islands'
'SS' = 'South Sudan'
'ES' = 'Spain'
'LK' = 'Sri Lanka'
'SD' = 'Sudan'
'SR' = 'Suriname'
'SJ' = 'Svalbard and Jan Mayen'
'SZ' = 'Swaziland'
'SE' = 'Sweden'
'CH' = 'Switzerland'
'SY' = 'Syryan Arab Republic'
'TW' = 'Taiwan, Province of China'
'TJ' = 'Tajikistan'
'TZ' = 'Tanzania, United Republic of'
'TH' = 'Thailand'
'TL' = 'Timor-Leste'
'TG' = 'Togo'
'TK' = 'Tokelau'
'TO' = 'Tonga'
'TT' = 'Trinidad and Tobago'
'TN' = 'Tunisia'
'TR' = 'Turkey'
'TM' = 'Turkmenistan'
'TC' = 'Turks and Caicos Islands'
'TV' = 'Tuvalu'
'UG' = 'Uganda'
'UA' = 'Ukraine'
'AE' = 'United Arab Emirates'
'GB' = 'United Kingdom'
'US' = 'United States'
'UM' = 'United States Minor Outlying Islands'
'UY' = 'Uruguay'
'UZ' = 'Uzbekistan'
'VU' = 'Vanuatu'
'VE' = 'Venezuela, Bolivarian Republic of'
'VN' = 'Vietnam'
'VG' = 'Virgin Islands, British'
'VI' = 'Virgin Islands, U.S.'
'WF' = 'Wallis and Futuna'
'EH' = 'Western Sahara'
'YE' = 'Yemen'
'ZM' = 'Zambia'
'ZW' = 'Zimbabwe'
;
quit;
/* Example usage */
data country;
format country_code $iso3166alphatwo.;
country_code = 'US';
output;
country_code='GB';
output;
run;
proc print data=country;
run;
</pre>
<p>This list is <a href="https://support.cloudflare.com/hc/en-us/articles/205072537-What-are-the-two-letter-country-codes-for-the-Access-Rules-">from Cloudflare</a> published 2015.
<p>Tested with SAS 9.4M3 on Microsoft Windows.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-48506702183320023752016-02-03T14:01:00.000-07:002016-02-03T14:08:10.739-07:00Undocumented SAS feature: Bulkloading to Netezza with ODBC interface<p>The SAS/ACCESS Interface to ODBC in SAS 9.4M4 states it supports <a rel="external nofollow" href="https://support.sas.com/documentation/cdl/en/acreldb/68028/HTML/default/viewer.htm#n0qddeciyqg3qfn1tosrb42y47os.htm">bulk loading</a> only to "Microsoft SQL Server data on Windows platforms." However, in practice on the Windows platform it also supports bulk loading to Netezza.</p>
<p>Bulk loading is amazingly fast. In some of my benchmarks the duration of the whole bulk loading operation is independent of the number of rows inserted!<?p>
<p>By default on Netezza the bulk loading interface delimits values using a pipe character, and for cases where the values contain a pipe, <i>SAS Access Interface to ODBC</i> unofficially supports the BL_DELIMITER option to specify an alternate delimiter. For the ODBC interface, this option is undocumented.</p>
<p>However, there are nuances with the BL_DELIMITER option. According to the <a rel="external nofollow" href="https://support.sas.com/documentation/cdl/en/acreldb/68028/HTML/default/viewer.htm#n04jpcu1ws9wjrn15vii00ifx5v6.htm">SAS Access Interface to Netezza</a>:</p>
<a name='more'></a>
<blockquote>You can use any 7-bit ASCII character as a delimiter. The default is the pipe symbol (ǀ). To use a printable ASCII character, enclose it in quotation marks (for example, BL_DELIMITER="|"). However, to use an extended character, use the three-digit decimal number representation of the ASCII character for this option. For example, set BL_DELIMITER=202 to use ASCII character 202 as a delimiter. You must specify decimal number delimiters as three digits even if the first two digits would be zero. For example, specify BL_DELIMITER=003, not BL_DELIMITER=3 or BL_DELIMITER=03. </blockquote>
<p>First, notice a contradiction in the documentation. Because 7-bit characters are in the range 1-127 implies that 8-bit characters in the range 128-256 are not supported, but the documentation gives an example in this range (BL_DELIMITER=202).</p>
<p>Second, the syntax for the ODBC interface (which is not covered by the documentation for the ODBC interface) supports only a single character, so specifying a delimiter using the three-digit decimal notation will always cause an error.</p>
<p>For example, this works with the ODBC interface</p>
<pre class="prettyprint">
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
/*
Because a data set with one variable does not require
a delimiter, this data set has two variables.
*/
data has_pipe;
/*
Because the first character in 122 used below is a 1,
here we test the number 1.
*/
number=1;
/*
Because by default the delimiter is a pipe, one of the
values has a pipe.
*/
char='I|have|a|pipe.';
output;
run;
/* Use a lowercase z as the alternate delimiter */
data nz.has_pipe(bulkload=yes bl_delimiter='z');
set has_pipe;
run;
</pre>
<p>However, the decimal representation fails.</p>
<pre class="prettyprint">
/* 122 is the decimal representation of the lowercase z */
data nz.has_pipe(bulkload=yes bl_delimiter=122);
set has_pipe;
run;
</pre>
<p>The SAS log shows SAS treats the decimal representation as a literal character and truncates it to the first character.</p>
<pre class="prettyprint">
ODBC_25: Executed: on connection 8
CREATE EXTERNAL TABLE EXT_HAS_PIPE SAMEAS ADMIN.HAS_PIPE USING
(DATAOBJECT('\\.\pipe\BL_HAS_PIPE_3') DELIMITER '1' REMOTESOURCE 'ODBC' )
</pre>
<p>If you have it licensed, the <i>SAS/ACCESS Interface to Netezza</i> should support the decimal notation: in this case, I would suggest using a tab delimiter with BL_DELIMITER=009.</p>
<p>If not, you must either disable bulkloading or use a single, printable ASCII character as a delimiter. If your data set requires a full range of characters but never all characters on the same row (for example, some rows have a pipe while other rows have a caret), split your data set into two data sets, and then bulk load each data set using separate delimiters.</p>
<div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com5tag:blogger.com,1999:blog-7666066816041190661.post-34943995675146043022016-01-29T13:31:00.000-07:002016-02-08T12:44:21.513-07:00Frequency of individual characters from SAS data set<p>This script counts the frequencies of individual ASCII characters in a single column in a SAS data set and then prints an easy-to-read report.</p>
<p>My initial motivation relates to delimiters. By default bulkloading data from Netezza to SAS (which is very fast) uses the pipe character as a delimiter, but my data set contained values with the pipe character, so this macro identifies alternative delimiters.</p>
<p>Another potential use is cracking a message encrypted using a simple letter substitution cipher.</p>
<p>To begin, this code creates an example data set courtesy of William Shakespeare.</p>
<a name='more'></a>
<pre>
data sonnet18;
input line $60.;
datalines;
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed,
And every fair from fair sometime declines,
By chance, or nature's changing course untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow'st,
Nor shall death brag thou wander'st in his shade,
When in eternal lines to time thou grow'st,
So long as men can breathe, or eyes can see,
So long lives this, and this gives life to thee.
;
</pre>
<p>Next, here is the macro that counts all the printable ASCII characters in all rows of the data set and makes a new data set with total counts by ASCII character.</p>
<pre class="prettyprint">
%macro character_histogram(dataset, column);
data histogram_tmp;
set &dataset;
/* Characters 32 through 126 are printable ASCII. */
%do i = 32 %to 126;
/* Count the number of characters in the column. */
/* Store each count in separate column. */
count_chr_&i = count(trim(&column), byte(&i));
%end;
run;
/* Sum the character counts from all the rows. */
proc means data=histogram_tmp noprint;
var count_chr_:;
output out=histogram_wide sum=sum/autoname;
run;
/* Clean up */
proc sql;
drop table histogram_tmp;
quit;
/* Switch from long to wide. */
proc transpose
data=histogram_wide(keep=count_chr:)
out=histogram_long
;
run;
/* Make pretty. */
data histogram_long;
set histogram_long;
character_decimal = input(compress(_name_, , 'kd'), 3.);
drop _name_;
character = byte(character_decimal);
rename col1=count_characters;
run;
%mend;
</pre>
<p>Finally, this code invokes the macro and prints the report.</p>
<pre class="prettyprint">
/* Run the histogram macro */
%character_histogram(sonnet18, line);
/* Print the final report as a table. */
proc print data=histogram_long noobs;
var character_decimal character count_characters ;
run;
/* Barchart, what some people would call a histogram of the letters. */
proc sgplot data=histogram_long;
hbar character/freq=count_characters;
run;
</pre>
<p>This is the final report.</p>
<table class="table" cellspacing="0" cellpadding="5" rules="all" frame="box" bordercolor="#C1C1C1" summary="Procedure Print: Data Set WORK.HISTOGRAM_LONG">
<thead>
<tr>
<th class="r header" scope="col">character_decimal</th>
<th class="l header" scope="col">character</th>
<th class="r header" scope="col">count_characters</th>
</tr>
</thead>
<tbody>
<tr>
<td class="r data">32</td>
<td class="l data"> </td>
<td class="r data">100</td>
</tr>
<tr>
<td class="r data">33</td>
<td class="l data">!</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">34</td>
<td class="l data">"</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">35</td>
<td class="l data">#</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">36</td>
<td class="l data">$</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">37</td>
<td class="l data">%</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">38</td>
<td class="l data">&</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">39</td>
<td class="l data">'</td>
<td class="r data">6</td>
</tr>
<tr>
<td class="r data">40</td>
<td class="l data">(</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">41</td>
<td class="l data">)</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">42</td>
<td class="l data">*</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">43</td>
<td class="l data">+</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">44</td>
<td class="l data">,</td>
<td class="r data">12</td>
</tr>
<tr>
<td class="r data">45</td>
<td class="l data">-</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">46</td>
<td class="l data">.</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">47</td>
<td class="l data">/</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">48</td>
<td class="l data">0</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">49</td>
<td class="l data">1</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">50</td>
<td class="l data">2</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">51</td>
<td class="l data">3</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">52</td>
<td class="l data">4</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">53</td>
<td class="l data">5</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">54</td>
<td class="l data">6</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">55</td>
<td class="l data">7</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">56</td>
<td class="l data">8</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">57</td>
<td class="l data">9</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">58</td>
<td class="l data">:</td>
<td class="r data">3</td>
</tr>
<tr>
<td class="r data">59</td>
<td class="l data">;</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">60</td>
<td class="l data"><</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">61</td>
<td class="l data">=</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">62</td>
<td class="l data">></td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">63</td>
<td class="l data">?</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">64</td>
<td class="l data">@</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">65</td>
<td class="l data">A</td>
<td class="r data">3</td>
</tr>
<tr>
<td class="r data">66</td>
<td class="l data">B</td>
<td class="r data">2</td>
</tr>
<tr>
<td class="r data">67</td>
<td class="l data">C</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">68</td>
<td class="l data">D</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">69</td>
<td class="l data">E</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">70</td>
<td class="l data">F</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">71</td>
<td class="l data">G</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">72</td>
<td class="l data">H</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">73</td>
<td class="l data">I</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">74</td>
<td class="l data">J</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">75</td>
<td class="l data">K</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">76</td>
<td class="l data">L</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">77</td>
<td class="l data">M</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">78</td>
<td class="l data">N</td>
<td class="r data">2</td>
</tr>
<tr>
<td class="r data">79</td>
<td class="l data">O</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">80</td>
<td class="l data">P</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">81</td>
<td class="l data">Q</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">82</td>
<td class="l data">R</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">83</td>
<td class="l data">S</td>
<td class="r data">4</td>
</tr>
<tr>
<td class="r data">84</td>
<td class="l data">T</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">85</td>
<td class="l data">U</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">86</td>
<td class="l data">V</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">87</td>
<td class="l data">W</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">88</td>
<td class="l data">X</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">89</td>
<td class="l data">Y</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">90</td>
<td class="l data">Z</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">91</td>
<td class="l data">[</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">92</td>
<td class="l data">\</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">93</td>
<td class="l data">]</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">94</td>
<td class="l data">^</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">95</td>
<td class="l data">_</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">96</td>
<td class="l data">`</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">97</td>
<td class="l data">a</td>
<td class="r data">37</td>
</tr>
<tr>
<td class="r data">98</td>
<td class="l data">b</td>
<td class="r data">3</td>
</tr>
<tr>
<td class="r data">99</td>
<td class="l data">c</td>
<td class="r data">9</td>
</tr>
<tr>
<td class="r data">100</td>
<td class="l data">d</td>
<td class="r data">20</td>
</tr>
<tr>
<td class="r data">101</td>
<td class="l data">e</td>
<td class="r data">63</td>
</tr>
<tr>
<td class="r data">102</td>
<td class="l data">f</td>
<td class="r data">10</td>
</tr>
<tr>
<td class="r data">103</td>
<td class="l data">g</td>
<td class="r data">10</td>
</tr>
<tr>
<td class="r data">104</td>
<td class="l data">h</td>
<td class="r data">31</td>
</tr>
<tr>
<td class="r data">105</td>
<td class="l data">i</td>
<td class="r data">26</td>
</tr>
<tr>
<td class="r data">106</td>
<td class="l data">j</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">107</td>
<td class="l data">k</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">108</td>
<td class="l data">l</td>
<td class="r data">23</td>
</tr>
<tr>
<td class="r data">109</td>
<td class="l data">m</td>
<td class="r data">22</td>
</tr>
<tr>
<td class="r data">110</td>
<td class="l data">n</td>
<td class="r data">31</td>
</tr>
<tr>
<td class="r data">111</td>
<td class="l data">o</td>
<td class="r data">44</td>
</tr>
<tr>
<td class="r data">112</td>
<td class="l data">p</td>
<td class="r data">4</td>
</tr>
<tr>
<td class="r data">113</td>
<td class="l data">q</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">114</td>
<td class="l data">r</td>
<td class="r data">28</td>
</tr>
<tr>
<td class="r data">115</td>
<td class="l data">s</td>
<td class="r data">38</td>
</tr>
<tr>
<td class="r data">116</td>
<td class="l data">t</td>
<td class="r data">39</td>
</tr>
<tr>
<td class="r data">117</td>
<td class="l data">u</td>
<td class="r data">13</td>
</tr>
<tr>
<td class="r data">118</td>
<td class="l data">v</td>
<td class="r data">5</td>
</tr>
<tr>
<td class="r data">119</td>
<td class="l data">w</td>
<td class="r data">4</td>
</tr>
<tr>
<td class="r data">120</td>
<td class="l data">x</td>
<td class="r data">1</td>
</tr>
<tr>
<td class="r data">121</td>
<td class="l data">y</td>
<td class="r data">8</td>
</tr>
<tr>
<td class="r data">122</td>
<td class="l data">z</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">123</td>
<td class="l data">{</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">124</td>
<td class="l data">|</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">125</td>
<td class="l data">}</td>
<td class="r data">0</td>
</tr>
<tr>
<td class="r data">126</td>
<td class="l data">~</td>
<td class="r data">0</td>
</tr>
</tbody>
</table>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAY2EFbvRPo8y9yhOWXrqunY_VrW-jnakHpTWdNx4PymaqxBcbEwwStLyy8s8FaXVZRegHIA609m0vllosiYZNojFp3KsGBjFZLM56CKxAlBnwrqvG36ZwIOkRnvOLSLCiKdBXoA4ZESDK/s800-Ic42/histogram_characters.png" alt="Histogram of individual characters made with SAS SGPLOT" width="640" height="480">
<p>This script was tested with SAS 9.4M3 on Windows 7.</p>
<div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-19282710871606224682015-09-02T10:29:00.002-06:002016-01-29T14:37:39.366-07:00SAS macros always have a global scope<p>SAS allows the programmer to declare the scope of macro variables using <tt>%LOCAL</tt> or <tt>%GLOBAL</tt>, but the macros themselves are always created in the global scope.</p>
<p>Say you have a macro that in another language, say Python, would be considered a function. Within the macro you want a sub-macro (i.e., sub-function) to be used only within the outer macro.</p>
<pre class="prettyprint">
%macro outer;
%put NOTE: outer;
/* This "sub-macro" is defined within the outer macro and is
intended only for use within the outer macro. */
%macro inner(foo);
%put NOTE: inner &foo;
%mend;
%inner(1);
%inner(2);
%mend;
%outer;
/* If the "sub-macro" has a local scope, the next step would fail */
%inner(3);
/* However, it succeeds */
</pre>
<p>This can lead to conflicts if the macro <tt>%inner</tt> is defined somewhere else in the same session. One way of dealing with this is to be careful to give the inner macro a unique name like <tt>__outer_inner</tt> where the underscores in the prefix suggest a local scope, and adding <tt>outer</tt> to the macro name indicates the macro is to be used only in the <tt>outer</tt> macro.</p>
<a name='more'></a>
<p>Another option is to use the <tt>%sysmacdelete</tt> to delete the inner macro:</p>
<pre class="prettyprint">
%macro outer;
%put NOTE: outer;
%macro inner(foo);
%put NOTE: inner &foo;
%mend;
%inner(1);
%inner(2);
/* Delete the inner macro */
%SYSMACDELETE inner;
%mend;
%outer;
/* This fails because of SYSMACDELETE */
%inner(3);
</pre>
<p>Tested with SAS 9.4M3 on Windows 7.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-46289311495100508102015-08-31T14:40:00.001-06:002015-08-31T14:41:25.024-06:00Gotcha with SAS, regular expressions, and end-of-line matching<p>Regular expressions are essential for sophisticated text processing, and it is generally easy to transfer knowledge of Perl regular expressions to the SAS functions prxparse, prxmatch, prxposn, etc. However, use caution with the end of line character ($) because of how SAS treats whitespace.</p>
<p>For demonstration I will run what looks like an equivalent use of regular expressions in Python, JavaScript, and SAS, but notice that only SAS does not match the string.</p>
<pre class="prettyprint lang-py">
# Python 2.7
import re
first_name ='Andrew '
first_name = first_name.strip()
if re.search(r"^Andrew$", first_name):
print 'match' # it does match
else:
print 'no match'
</pre>
<pre class="prettyprint lang-js">
/*JavaScript */
first_name ='Andrew ';
first_name = first_name.trim();
if (first_name.match(/^Andrew$/))
alert('match'); /* it does match */
else alert('no match');
</pre>
<p>(<a rel="external nofollow" href="https://jsfiddle.net/rwj2fp7k/">JavaScript fiddle for this code</a>.)</p>
<pre class="prettyprint">
data x;
first_name='Andrew ';
first_name=strip(first_name);
match=prxmatch('/^Andrew$/', first_name); /* it does not match (match=0) */
run;
</pre>
<p>In SAS ignore the trailing whitespace using the trim() function:</p>
<pre class="prettyprint">
data x;
first_name='Andrew ';
match=prxmatch('/^Andrew$/', trim(first_name)); /* it does match */
run;
</pre>
<p>SAS, however, does not distinguish a string that was inserted with trailing spaces from a string that was inserted without trailing spaces. In the following SAS-only example imagine the table was created and populated using a non-SAS system like MySQL or Microsoft SQL Server.</p>
<pre class="prettyprint">
proc sql;
create table names (
first_name varchar(8)
);
insert into names values ('Andrew'); /* no trailing spaces */
insert into names values ('Andrew '); /* one trailing space */
insert into names values ('Andrew ');
quit;
data names;
set names;
length=length(first_name);
match1=prxmatch('/^Andrew$/', first_name);
match2=prxmatch('/^Andrew$/', trim(first_name));
run;</pre>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvke4q7NvYYzNVmuBO0-HUevTofAydTbeCuA34jjvWugzQo_6QSu_goOWdeU3N4wflz5I2pHCu5cSViDH9SXNYmhVHQiDJw7poSGVWz1HXhYOIC8lWmpR6p7g2-DG6MyIvk6fvHabsce-M/s1600/sas_names.png" alt="Screenshot of SAS data set"/>
<p>This was tested with SAS 9.4M3 on Microsoft Windows 7.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com3tag:blogger.com,1999:blog-7666066816041190661.post-42599446120622864692015-03-03T09:26:00.002-07:002016-01-29T14:37:50.330-07:00SAS 9.4 crash with MySQL ODBC pass-through queries<p>SAS 9.4 (TS1M2) on X64_DS08R2 (Windows Server 2008 64-bit) always crashes with certain pass-through queries using MySQL Connector/ODBC 5.3.4. When it crashes, the SAS log shows some red messages, but SAS closes immediately. </p>
<p>The crash is not reproducible with other ODBC drivers, on SAS 9.3 64-bit, or SAS 9.4 32-bit.</p>
<p>Workarounds include: using an ODBC DSN instead of the connection string, not using pass-through queries, or using SAS 9.3.</p>
<p>SAS agreed to fix the bug.</p>
<a name='more'></a>
<p>This shows how to procedure it
<pre class="prettyprint">
/* Trace log */
options sastrace=',,d,d' sastraceloc=file 'c:\temp\mytracefile.log';
/* This does not crash */
libname ensembl odbc
required="Driver={MySQL ODBC 5.3 Unicode Driver};Server=ensembldb.ensembl.org;Database=aedes_aegypti_core_48_1b;Uid=anonymous;interactive=1;";
/* This does not crash */
proc sql;
create table x as
select *
from ensembl.analysis;
quit;
/* This crashes */
proc sql;
connect using ensembl;
create table x as
select *
from connection to ensembl (
show databases;
);
quit;
</pre>
<p>This bug was present in SAS 9.4M2, and it was fixed in SAS 9.4M3.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-67287999903663485652015-02-04T14:35:00.001-07:002015-02-04T14:48:42.289-07:00Party like it's 19999 (SAS)<p>On 03OCT2014 I must have missed the party in Cary, NC.</p>
<pre class="prettyprint">
data _null_;
format date date9.;
date = 19999;
put date=;
run;
</pre>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ3fne8ivdLCUjBVhH3y22HtNdZKIPo77r5pB2LF-fk-fTVq7hUpy-Wz66C8yaVa4VyAiQ5mwaf_2gdZHfCQp5bDZs3opcVDHdaNQ26cojsOkR1X6s-S-XLeC1feNQDaYepanuNsVEVg1m/s1600/prince19999.jpg" />
<p>So the next party is 03JUN2568?</p>
<pre class="prettyprint">
data _null_;
format date date9.;
date = 222222;
put date=;
run;
</pre><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-7372564325646714612014-12-17T10:27:00.003-07:002016-01-29T14:38:00.255-07:00SAS crash with BULKLOAD and ODBC Driver 11 for SQL Server<p>SAS 9.4 (TS1M2) crashes hard when using <b>BULKLOAD=YES</b> with the latest <a rel="external nofollow" href="http://www.microsoft.com/en-us/download/details.aspx?id=36434">Microsoft ODBC Driver 11 for SQL Server</a>. For a brief moment you may see in the SAS log <b>ERROR: BCP initialize error: [Microsoft][ODBC Driver 11 for SQL Server]Connection is not enabled for BCP</b>, and then the SAS window closes.</p>
<p>The Windows Event Viewer has the following information:</p>
<pre>
Faulting application name: sas.exe, version: 9402.0.21456.21239, time stamp: 0x53d05339
Faulting module name: ntdll.dll, version: 6.1.7601.18229, time stamp: 0x51fb1072
Exception code: 0xc0000374
Fault offset: 0x000ce753
Faulting process id: 0x5724
Faulting application start time: 0x01d004d90519e3fc
Faulting application path: C:\Program Files\SASHome2\x86\SASFoundation\9.4\sas.exe
Faulting module path: C:\Windows\SysWOW64\ntdll.dll
Report Id: 4c84f206-70cc-11e4-bd42-0205857feb80
</pre>
<p>SAS Support acknowledged the crash and will address it in the next maintenance release. At this time I do not see any KB article.</p>
<a name='more'></a>
<p>If you upload large data sets from SAS to SQL Server, make sure to try either BULKLOAD or INSERTBUFF. The performance improvement can be dramatic.</p>
<p>Until the maintenance release, SAS suggested using larger values of READBUFF instead of BULKLOAD. In a few simple tests, larger values of READBUFF are much better than the default READBUFF=1, but they are slower than BULKLOAD=YES. Also, huge values of READBUFF are not better, and it may depend on the workload.</p>
<p>Microsoft Windows 7 comes with a SQL Server driver version 6, which is from 1996. The new driver is useful to use date-only columns. With the old driver, date-only columns are converted to character format (see <a rel="external nofollow" href="http://support.sas.com/kb/40/080.html">SAS note 40080</a>).</p>
<p>Both SQL Server drivers can be installed and use simultaneously using different connection strings, so one option is bulkloading using the old driver (for data sets without date-only variables).</p>
<p><b>This post first appeared on <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</b></p>
<div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-81473268845505267372014-12-05T09:54:00.004-07:002016-01-29T14:38:10.957-07:00Fibonacci sequence in R and SAS<p>Because the Fibonacci sequence is simply defined by recursion, it makes for an elegant programming exercise. Here is one way to do it in SAS, and another way to do it in R. I've also included unit testing code to check that it works.</p>
<a name='more'></a>
<p>Fibonacci sequence in SAS using a recursive macro:</p>
<pre class="prettyprint">
%macro fib(n);
%if &n = 1 %then 1; * first seed value;
%else %if &n = 2 %then 1; * second seed value;
%else %eval(%fib(%eval(&n-1))+%fib(%eval(&n-2))); * use recursion;
%mend;
* show values 1-5;
%put %fib(1);
%put %fib(2);
%put %fib(3);
%put %fib(4);
%put %fib(5);
* check values 1-10;
%macro check_fib;
%if %fib(1) ne 1 %then %abort;
%if %fib(2) ne 1 %then %abort;
%if %fib(3) ne 2 %then %abort;
%if %fib(4) ne 3 %then %abort;
%if %fib(5) ne 5 %then %abort;
%if %fib(6) ne 8 %then %abort;
%if %fib(7) ne 13 %then %abort;
%if %fib(8) ne 21 %then %abort;
%if %fib(9) ne 34 %then %abort;
%if %fib(10) ne 55 %then %abort;
%put NOTE: OK!;
%mend;
%check_fib;
</pre>
<p>Fibonacci sequence in R using a recursive function that supports either single integers or a vector of integers:</p>
<pre class="prettyprint language-r">
fib <- function(n)
{
if (length(n) > 1) return(sapply(n, fib)) # accept a numeric vector
if (n == 1) return(1) # first seed value
if (n == 2) return(1) # second seed value
return(fib(n-1)+fib(n-2)) # use recursion
}
# print first five Fibonacci numbers
fib(1)
fib(2)
fib(3)
fib(4)
fib(5)
# verify the Fibonacci sequence 1 through 10
(actual <- fib(1:10))
(expected <- c(1,1,2,3,5,8,13,21,34,55))
all.equal(actual,expected)
</pre>
<p>For alternative implements, see <a rel="external nofollow" href="http://sas-and-r.blogspot.com/2009/06/create-fibonacci-series.html">SAS and R: Example 7.1: Create a Fibonacci sequence</A>. In SAS, Nick Horton calculates the Fibonacci sequence using a DATA STEP, and in R he uses a FOR loop.</p>
<p>Adam Rich responded with his post <a href="http://adamleerich.com/2014/12/07/fibonacci-sequence-in-r-with-memoization/">Fibonacci Sequence in R with Memoization</a> which gives a performance boost by caching the results.</p>
<p>In the comments below, Rick Wicklin referred to his <a href="http://blogs.sas.com/content/iml/2010/09/30/twitter-and-the-fibonacci-sequence/">SAS/IML solution that generates the Fibonacci sequence iteratively</a> and <a href="http://blogs.sas.com/content/iml/2010/10/05/matrices-eigenvalues-fibonacci-and-the-golden-ratio/">Matrices, eigenvalues, Fibonacci, and the golden ratio</a>.</p>
<p><b>This post first appeared on <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</b></p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com5tag:blogger.com,1999:blog-7666066816041190661.post-55413700720847164632014-06-23T16:09:00.001-06:002014-06-23T16:24:27.466-06:00Change the process priority in SAS<p>This SAS code programmatically changes the process priority for the current SAS program. You may want to use this for SAS batch jobs. </p>
<p>For example, when I run multiprocessor jobs using Ian J. Ghent %multiThreadDataStep macro, the Windows user interface becomes less responsive and my batch job competes with other processes, but using this method addresses both issues.</p>
<p>Simply run the line corresponding to the desired priority level. To decrease process priority, choose one of the last two.</p>
<pre class="prettyprint">
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'high priority'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'above normal'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'normal'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'below normal'";
x "wmic.exe process where processid=""&SYSJOBID"" call setpriority 'idle'";
</pre>
<p>Tested on SAS 9.4 on Windows 7 and SAS 9.3 on Windows 2008.</p>
<p>I don't have SAS for Linux, but I guess it would look like this to increase the niceness and therefore reduce the process priority:</p>
<pre class="prettyprint">
x "renice -p 16 &SYSJOBID";
</pre><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com0tag:blogger.com,1999:blog-7666066816041190661.post-18750347734824944102014-02-26T09:39:00.000-07:002014-03-10T13:22:20.612-06:00Type I error rates in test of normality by simulation<p>This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS. </p>
<a name='more'></a>
<p>First, we run a simulation in R. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.</p>
<pre class="prettyprint lang-r">
# type I error
alpha <- 0.05
# number of simulations
n.simulations <- 10000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error <- replicate(n.simulations,
shapiro.test(rnorm(n.obs))$p.value)<alpha
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in data frame for plotting
sim <- data.frame(
n.simulations = 1:n.simulations,
type.one.error.rate = cumsum(type.one.error) /
seq_along(type.one.error))
# plot type I error as function of the number of simulations
plot(sim, xlab="number of simulations",
ylab="cumulative type I error rate")
# a line for the true error rate
abline(h=alpha, col="red")
# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
geom_line() +
xlab('number of simulations') +
ylab('cumulative type I error rate') +
ggtitle('Simulation of type I error in Shapiro-Wilk test') +
geom_abline(intercept = 0.05, slope=0, col='red') +
theme_bw()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijh6_JhqKkhYbMrqvTEchJ-NPUwITZp8oSkGv6hWTfzlh5HazJyuqrIbZLMP-YY1iJK3lhHA8UER5nPt1f2gLASD0EbF50o1qUo94onuxPPWASHjC_XpHYmfC4kBQNDn-v7H_i1WuAkmMj/s1600/shapiro-wilk2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijh6_JhqKkhYbMrqvTEchJ-NPUwITZp8oSkGv6hWTfzlh5HazJyuqrIbZLMP-YY1iJK3lhHA8UER5nPt1f2gLASD0EbF50o1qUo94onuxPPWASHjC_XpHYmfC4kBQNDn-v7H_i1WuAkmMj/s1600/shapiro-wilk2.png" /></a></div>
<p>As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.</p>
<p>It's elegant the whole simulation can be condensed to 60 characters:</p>
<pre class="prettyprint lang-r">
mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)
</pre>
<p>Likewise, we now do a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.</p>
<pre class="prettyprint lang-sas">
data normal;
length simulation 4 i 3; /* save space and time */
do simulation = 1 to 10000;
do i = 1 to 100;
x = rand('normal');
output;
end;
end;
run;
proc univariate data=normal noprint ;
by simulation;
var x;
output out=univariate n=n mean=mean std=std NormalTest=NormalTest probn=probn;
run;
data univariate;
set univariate;
type_one_error = probn<0.05;
run;
/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
table type_one_error/nocum;
run;
</pre>
<p>In my SAS simulation the type I error rate was 5.21%.</p>
<p>Tested with R 3.0.2 and SAS 9.3 on Windows 7.</p><div class="blogger-post-footer"><p>For more posts like this, see <a href="http://heuristicandrew.blogspot.com/">Heuristic Andrew</a>.</p></div>Andrew Zhttp://www.blogger.com/profile/10108637160465346326noreply@blogger.com3