Heuristic Andrew

2024-02-14T17:08:00.002-07:00

I bought a dashcam from Temu for $6.31 in January 2024, and here is sample footage that includes three scenes: daytime, dusk, and daytime.

Product benefits

Easy mounting with suction cup
Easy power with 12V cigarette plug adapter
Battery lasts a few moments after car turns off
MicroSD card included
Cheap price

Problems

Cheap quality
Terrible video quality (despite product description)
Narrow field of view (despite product description)

Notes

It was a weird choice for it to record in an .avi container instead of .mp4.
The product was discontinued on Temu.

Model (box): Y320
Manufacturer (box): shenzhen Hengxin Weiye Digital Co., LTD
Product title (Temu):
Dash Camera For Cars With 32G Memory Cards Wide Angle Full 1080P Driving Record...

Below is the product menu (PDF on Temu):
Intelligent voice reminder, built-in multinational voice pronunciation, no need to worry about language barriers
1080P high-definition night vision, even in the weak light environment, can also shoot clearly
Loop recording, no missing seconds, segmented storage, automatic monitoring of storage space, when the memory is full, automatically delete the earliest recorded video and save the new video
Built-in gravity sensor, when a sudden brake or collision is sensed, the current video is instantly locked to prevent overwriting important files during loop recording
Supported languages: English, French, German, Russian, Japanese, etc.

Timestamp precision in Snowflake

2023-07-12T16:17:00.005-06:00

Timestamps in Snowflake have precisions 0 to 9 with a default of 9, which is a nanosecond, but the Snowflake documentation is not clear on precisions 0 to 8.

Storage difference

I did an empirical test by creating tables. Each table had one million rows and one column with random timestamps. The values have an original precision of one nanosecond, and I used random values because otherwise Snowflake would compress any number of rows with the same value down to a few KB.

create or replace table zzz_timestamp9 as
select dateadd(nanosecond, uniform(1,3e17, random()), current_timestamp())::timestamp(9) as time1
from TABLE(GENERATOR(ROWCOUNT => 1e6))
;
create or replace table zzz_timestamp0 as
select dateadd(nanosecond, uniform(1,3e17, random()), current_timestamp())::timestamp(0) as time1
from TABLE(GENERATOR(ROWCOUNT => 1e6))
;

The storage difference for 1 million rows was 3.5MB vs 7.0MB.

Precision difference

Again, I generated random rows and then copied the value into columns with varied precisions.

select
    dateadd(nanosecond, uniform(1,3e17, random()), current_timestamp())::timestamp(9) as "precision 9",
    "precision 9"::timestamp(3) as "precision 3",
    "precision 9"::timestamp(2) as "precision 2",
    "precision 9"::timestamp(1) as "precision 1",
    "precision 9"::timestamp(0) as "precision 0",
    datediff(ms, "precision 3", "precision 9") as "Precisions 9 vs 3 in milliseconds", /* always zero */
    datediff(second, "precision 0", "precision 9") as "Precisions 9 vs 0 in seconds" /* always zero */
from TABLE(GENERATOR(ROWCOUNT => 10))

Precision 0 is 1 second, precision 1 is 100 ms, precision 2 is 10 ms, precision 3 is 1 ms, etc.
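
The relationship between precision and time unit can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Snowflake code: the function names are hypothetical, and it assumes that casting to a lower precision truncates (which is consistent with the zero differences in the query above, but is not confirmed here by the Snowflake documentation).

```python
def precision_unit_ns(precision: int) -> int:
    """Size of one tick, in nanoseconds, for a given timestamp precision (0 to 9)."""
    if not 0 <= precision <= 9:
        raise ValueError("precision must be between 0 and 9")
    return 10 ** (9 - precision)


def truncate_ns(nanoseconds: int, precision: int) -> int:
    """Drop the digits finer than the given precision (assumes truncation, not rounding)."""
    unit = precision_unit_ns(precision)
    return nanoseconds - nanoseconds % unit


if __name__ == "__main__":
    for p in (0, 1, 2, 3, 9):
        print(p, precision_unit_ns(p), "ns per tick")
```

For example, precision 3 gives a tick of 1,000,000 ns, which is 1 ms.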

openwrt ssh connection refused

2023-05-20T23:14:00.002-06:00

Symptom

Normal connection attempt

$ ssh root@192.168.1.1
Connection to 192.168.1.1 closed.

End of log with verbose ssh

debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: /home/z/.ssh/id_ed25519 ED25519 SHA256:XX//XX/XX agent
debug1: Server accepts key: /home/z/.ssh/id_ed25519 ED25519 SHA256:XX//XX/XX agent
Authenticated to 192.168.1.1 ([192.168.1.1]:22) using "publickey".
debug1: channel 0: new [client-session]
debug1: Entering interactive session.
debug1: pledge: filesystem
debug1: Sending environment.
debug1: channel 0: setting env LANG = "en_US.UTF-8"
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 1
Connection to 192.168.1.1 closed.
Transferred: sent 2488, received 1012 bytes, in 0.0 seconds
Bytes per second: sent 346691.3, received 141017.5
debug1: Exit status 1

System log snippet in OpenWRT's LuCI interface

Sat May 20 23:02:33 2023 authpriv.notice dropbear[7690]: Pubkey auth succeeded for 'root' with ssh-ed25519 key SHA256:XX//XX/XX from 192.168.1.X:X
Sat May 20 23:02:33 2023 authpriv.info dropbear[7691]: Exit (root) from <192.168.1.X:X>: Child failed
Sat May 20 23:02:33 2023 authpriv.info dropbear[7690]: Exit (root) from <192.168.1.X:X>: Disconnect received

Background

A few weeks ago, I installed OpenWRT 22.03.3 on my Belkin RT3200. SSH and everything else was working fine until I updated it to OpenWRT 22.03.5, and then SSH immediately stopped working: authentication succeeded, but the session closed right away.

Solution

I read a GitHub conversation about a similar problem. In their case, zsh was missing, and I remembered that I had installed Bash before the update. I reinstalled Bash via LuCI, and then SSH worked again.
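
A quick way to spot this kind of problem is to check whether every login shell listed in /etc/passwd still exists. The sketch below is a hypothetical helper, not something from OpenWRT or dropbear, and it assumes the cause is a login shell path that points to a missing binary (the post does not confirm that root's shell was set to Bash). It is written in Python for illustration; on the router itself you would do the same check by reading /etc/passwd in LuCI or over a serial console.

```python
import os
from typing import Callable, List, Tuple


def missing_login_shells(
    passwd_text: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[Tuple[str, str]]:
    """Return (user, shell) pairs whose login shell path does not exist."""
    missing = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 7:
            continue  # not a standard seven-field passwd entry
        user, shell = fields[0], fields[6]
        if shell and not exists(shell):
            missing.append((user, shell))
    return missing


if __name__ == "__main__":
    with open("/etc/passwd") as handle:
        for user, shell in missing_login_shells(handle.read()):
            print(f"{user}: login shell {shell} is missing")
```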

Enable 5G standalone (NR SA) on Samsung Galaxy A13

2022-12-04T15:06:00.024-07:00

I bought two Samsung Galaxy A13 5G phones from Google Fi for $100 each on promo, but by default, they connected in NR NSA (non-standalone) mode, which requires an LTE anchor. Performance was okay in NR NSA except in a few geographic areas where LTE bands 2, 4, and 66 had high packet loss of 50-100% because of low signal or interference.

With a little effort, Samsung Galaxy phones can be put into NR SA ("pure 5G") mode. The process is similar to the Galaxy S22 (see guide by peacey8 at end), but there are a few differences. In particular, the A13 5G uses a MediaTek Dimensity 700 (MT6833) chipset instead of the Qualcomm SM8450 Snapdragon 8 Gen 1 on the Galaxy S22, so the service menus differ.

Pros and cons

Pros

Lower latency and packet loss (sometimes/generally).
Latest generation of cell network technology.

Cons

Requires compatible carrier. (In the USA, it's only T-Mobile and their MVNOs such as Google Fi and Mint Mobile.)
Requires somewhat modern phone.
Requires compatible cell tower. (In my area, 5G coverage is excellent.)
T-Mobile does not support voice yet on NR SA mode. If band locking is set to NR SA, phone calls will not work. If band locking allows LTE and NR NSA, it will automatically fall back to LTE/NR NSA.
Allowing NR SA mode requires some initial technical set up.
Without band locking, the phone may often favor NR NSA over NR SA, even when NR SA has better network performance.
Band locking settings (optional) get lost after phone reboot.
Set up varies by chipset.

Guide

This guide assumes your phone has the XAA (unlocked) CSC profile. If not, see peacey8's guide at the end. If you have TMB profile, then NR SA mode should already work. Once you have XAA, proceed here.

On the phone, enable developer mode: go to Settings (gear) - About Phone - Software Information - tap Build number seven times.
In phone Settings (gear) - Developer Options - enable USB debugging.
Install Samsung USB driver for Windows. (Sorry, you need a laptop or desktop.)
Launch the SamFW FRP tool on Windows.
Connect phone by USB cable to Windows machine.
In SamFW, enable secret code for Verizon. (You do not need to use Verizon as a carrier.)
SamFW should stop at "Waiting for DIAG."
In the Samsung Phone app (with the green icon), call *#0808#. (It does not work for me in the Google Phone app with the blue icon.) After entering the last pound sign (and before pressing the call button), the USB settings menu should appear.
If you have two options, then you did not enable USB debugging. Start over.
Tap DM+ACM+ADB and tap OK. (The Galaxy S22 has way more USB debugging options, by the way.)
SamFW should stop at "Disabling DIAG."
Then call *#0808#, and switch back to MTP.
Phone can be disconnected from USB, if you wish.
In the Samsung Phone app, use the dialer code *#27663368378#.
UE SETTING AND INFO - SETTING - PROTOCOL - NR - ALLOW LIST - ALLOW LIST OFF
Three dots (top right), back
NR5G SA / NSA mode control - SA / NSA enable
Reboot phone.

Once it is working, you can disable the USB debugging and developer options menu on the phone, and you can uninstall SamFW.

Troubleshooting

Unlike the Galaxy S22, the *#2263# menu will not look any different on the A13. Even when NR SA is active, this service menu looks the same. It doesn't distinguish NR NSA and NR SA, so to confirm you are on SA, some options are:

Dial *#0011#. Pick relevant SIM. Check fourth line "Serving PLMN." If you see Nr5G, it's SA. If LTE, then it's not.
Use app such as CellMapper, Network Cell Info Lite, or NetMonster. They will all show the network type on the main page.

If you are still on LTE or NR NSA, some options:

Check your carrier supports NR SA. Right now, T-Mobile does, while Verizon and AT&T do not.
Check you have a 5G SIM and a 5G plan.
Turn off wifi. (It may use LTE only for power savings.)
Move closer to the tower.
Move to another tower.
Reset the connection like this: phone settings (gear) - Connections - Mobile networks - LTE/3G/2G. Wait a moment and set back to 5G/LTE/3G/2G. On our two A13, S22, and S22+ phones, we often need to reset the connection like this, and airplane mode does not seem to help. Otherwise, it may stick to LTE when NR NSA or NR SA are available.
Disable LTE like this: in the Samsung Phone app (green app icon), call *#2263#. Tap the relevant SIM. Tap CLEAR ALL BANDS. Tap "BLOCK SET BY AP," so the asterisk goes away. Tap NR menu to enter it. Select NR ALL. Go back to main. Apply selection. This may reset after rebooting the phone.
Alternate way to force NR: in Google Phone app (blue app icon), call *#*#4636#*#* - Phone Information - Set preferred network Type - NR only.

I tested this on two Galaxy A13 phones running Android 12 with T-Mobile via Google Fi. In case you want to join Google Fi, here's a referral code to get a $20 credit when you join: 2RD2V5.

On the Galaxy S22, voice calls do not work with NR SA: incoming calls go straight to voicemail, and outgoing calls stop a moment after dialing because VoNR apparently is not enabled. It is probably the same on the A13.

Many thanks to peacey8 for the NR SA for Galaxy S22 guide and to molexs for the comment about the SCR01 hotspot.

Generate random names and addresses from SAS

2022-02-03T10:03:00.002-07:00

For testing data processing systems (e.g., CRM, record linkage), you may need to generate fake people. SAS makes it uniquely easy to generate an unlimited count of fake US residents because it comes with a data set of US zip codes, which include the city and state name.

The system uses four data sets: first names, last names, street names, and US zip codes. Initials are randomly generated from letters. The street addresses probably do not exist in the given zip codes.

You could extend this in several ways:

Add street directions (i.e., N, S, E, W)
Add street post type (e.g., Dr., Ct.)
Add units (e.g., Apt B, Ste 101)
Add post office boxes and private mail boxes
Spell out the middle name
Add name prefix (e.g., Dr., Mr.)
Add name suffix (e.g., Jr., Sr.)



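%let mv_person_count = 10000; /* how many people to make */
%let mv_max_street_num = 20000; /* largest street number */

/* Random integer between min and max, inclusive.
   RandBetween from https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
%macro RandBetween(min, max);
	(&min + floor((1 + &max - &min) * rand("uniform")))
%mend;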

/* https://www.ssa.gov/OACT/babynames/decades/century.html */
data first;
	format first_name $20.;
	input first_name $;
	first_name_id = _n_;
datalines;
James
Robert
John
Michael
William
David
Richard
Joseph
Thomas
Charles
Christopher
Daniel
Matthew
Anthony
Mark
Donald
Steven
Paul
Andrew
Joshua
Kenneth
Kevin
Brian
George
Edward
Ronald
Timothy
Jason
Jeffrey
Ryan
Jacob
Gary
Nicholas
Eric
Jonathan
Stephen
Larry
Justin
Scott
Brandon
Benjamin
Samuel
Gregory
Frank
Alexander
Raymond
Patrick
Jack
Dennis
Jerry
Tyler
Aaron
Jose
Adam
Henry
Nathan
Douglas
Zachary
Peter
Kyle
Walter
Ethan
Jeremy
Harold
Keith
Christian
Roger
Noah
Gerald
Carl
Terry
Sean
Austin
Arthur
Lawrence
Jesse
Dylan
Bryan
Joe
Jordan
Billy
Bruce
Albert
Willie
Gabriel
Logan
Alan
Juan
Wayne
Roy
Ralph
Randy
Eugene
Vincent
Russell
Elijah
Louis
Bobby
Philip
Johnny
Mary
Patricia
Jennifer
Linda
Elizabeth
Barbara
Susan
Jessica
Sarah
Karen
Nancy
Lisa
Betty
Margaret
Sandra
Ashley
Kimberly
Emily
Donna
Michelle
Dorothy
Carol
Amanda
Melissa
Deborah
Stephanie
Rebecca
Sharon
Laura
Cynthia
Kathleen
Amy
Shirley
Angela
Helen
Anna
Brenda
Pamela
Nicole
Emma
Samantha
Katherine
Christine
Debra
Rachel
Catherine
Carolyn
Janet
Ruth
Maria
Heather
Diane
Virginia
Julie
Joyce
Victoria
Olivia
Kelly
Christina
Lauren
Joan
Evelyn
Judith
Megan
Cheryl
Andrea
Hannah
Martha
Jacqueline
Frances
Gloria
Ann
Teresa
Kathryn
Sara
Janice
Jean
Alice
Madison
Doris
Abigail
Julia
Judy
Grace
Denise
Amber
Marilyn
Beverly
Danielle
Theresa
Sophia
Marie
Diana
Brittany
Natalie
Isabella
Charlotte
Rose
Alexis
Kayla
Homer
Marge
Bart
Lisa
Maggie
;

/* https://www.thoughtco.com/most-common-us-surnames-1422656 */
data last;
	format last_name $20.;
	input last_name $;
	last_name_id = _n_;
datalines;
Smith
Johnson
Williams
Brown
Jones
Garcia
Miller
Davis
Rodriguez
Martinez
Hernandez
Lopez
Gonzales
Wilson
Anderson
Thomas
Taylor
Moore
Jackson
Martin
Lee
Perez
Thompson
White
Harris
Sanchez
Clark
Ramirez
Lewis
Robinson
Walker
Young
Allen
King
Wright
Scott
Torres
Nguyen
Hill
Flores
Green
Adams
Nelson
Baker
Hall
Rivera
Campbell
Mitchell
Carter
Roberts
Gomez
Phillips
Evans
Turner
Diaz
Parker
Cruz
Edwards
Collins
Reyes
Stewart
Morris
Morales
Murphy
Cook
Rogers
Gutierrez
Ortiz
Morgan
Cooper
Peterson
Bailey
Reed
Kelly
Howard
Ramos
Kim
Cox
Ward
Richardson
Watson
Brooks
Chavez
Wood
James
Bennet
Gray
Mendoza
Ruiz
Hughes
Price
Alvarez
Castillo
Sanders
Patel
Myers
Long
Ross
Foster
Jimenez
Simpson
;

/* https://www.nlc.org/resource/most-common-u-s-street-names/ */
data street;
	format street_name $20.;
	input street_name $;
	street_name_id = _n_;
datalines;
Second
Third
First
Fourth
Park
Fifth
Main
Sixth
Oak
Seventh
Pine
Maple
Cedar
Eighth
Elm
View
Washington
Ninth
Lake
Hill
Evergreen
;


data person0;
	do i = 1 to &mv_person_count;
		first_name_id = %RandBetween(1, 205);
		last_name_id = %RandBetween(1, 101);
		street_name_id = %RandBetween(1, 21);
		zip_code_id = %RandBetween(1, 40000);
		output;
	end;
	drop i;
run;

data zip;
	set sashelp.zipcode(keep=zip city statecode);
	zip_code_id = _n_;
run;

proc sql;
	create table person1 as
	select
		f.first_name,
		l.last_name,
		s.street_name,
		z.city '',
		z.statecode as state '',
		z.zip as zip_numeric ''
	from person0 as p
	join first as f on
		f.first_name_id = p.first_name_id
	join last as l on
		l.last_name_id = p.last_name_id
	join street as s on
		s.street_name_id = p.street_name_id
	join zip as z on
		z.zip_code_id = p.zip_code_id;
quit;

data person2;
	format name street city state zip $50.;
	set person1;
	initial = byte(int(65+26*ranuni(0)));
	name = catx(' ', first_name, initial, last_name);
	/* RandBetween from https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
	street_num = put(%RandBetween(1,&mv_max_street_num),10.);
	street = catx(' ',street_num,street_name);
	zip = put(zip_numeric, z5.);
	drop zip_numeric street_name street_num first_name initial last_name;
run;

Example output table with ten randomly generated fake people:

name	street	city	state	zip
Steven I Murphy	8206 Fifth	Frankfort	KY	40619
Mary M Williams	5076 Seventh	Evensville	TN	37332
Jeffrey Y Lopez	3485 Third	Henning	IL	61848
Richard Z Sanders	5500 Sixth	Kimball	NE	69145
Russell M Smith	16425 Sixth	Lexington	KY	40515
Johnny R Carter	11949 Eighth	Mount Hope	OH	44660
Raymond V Green	4659 Park	West Helena	AR	72390
Megan N Anderson	8437 Third	Chico	CA	95927
Isabella A Ross	8151 Evergreen	Barstow	MD	20610
Sharon Q Flores	3022 Lake	Poth	TX	78147
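
For readers without SAS, the same sampling idea (pick a random first name, initial, last name, street number, street name, and zip code row) can be sketched in Python. This is a minimal illustration and not the code behind the table above: the function names are hypothetical, the name lists are shortened, and the three zip code rows are stand-ins copied from the example output because Python has no counterpart to sashelp.zipcode.

```python
import random
import string

# Shortened stand-ins for the SAS data sets above (assumption: tiny lists for illustration).
FIRST_NAMES = ["James", "Mary", "Robert", "Patricia"]
LAST_NAMES = ["Smith", "Johnson", "Williams", "Brown"]
STREET_NAMES = ["Second", "Third", "Park", "Main"]
ZIP_CODES = [
    ("40515", "Lexington", "KY"),
    ("95927", "Chico", "CA"),
    ("78147", "Poth", "TX"),
]


def make_person(rng: random.Random, max_street_num: int = 20000) -> dict:
    """Build one fake person from independent random draws."""
    initial = rng.choice(string.ascii_uppercase)
    zip_code, city, state = rng.choice(ZIP_CODES)
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {initial} {rng.choice(LAST_NAMES)}",
        "street": f"{rng.randint(1, max_street_num)} {rng.choice(STREET_NAMES)}",
        "city": city,
        "state": state,
        "zip": zip_code,
    }


def make_people(count: int, seed=None) -> list:
    """Generate `count` fake people; pass a seed for reproducible output."""
    rng = random.Random(seed)
    return [make_person(rng) for _ in range(count)]


if __name__ == "__main__":
    for person in make_people(3, seed=1):
        print(person)
```

Keeping city, state, and zip together in one row mirrors the SAS join on the zip code table, so the three fields always agree with each other.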

Want to generate people names with Python instead of SAS? See combine_people_names.py for a system that generates random people using Wikidata biographies.

Estimating birth date from age

2020-12-10T17:56:00.005-07:00

This code demonstrates an algorithm for estimating birth date from age. We cannot know the exact birth date, but we can get close: the maximum error is half a year, and the typical error is one quarter of a year.
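
The idea can be sketched in Python before reading the SAS code: someone who is a given age on a given day was born within a one-year window, and the midpoint of that window is the estimate. This is a minimal sketch with hypothetical function names, and its handling of February 29 (falling back to February 28) is an assumption that may differ slightly from the SAS INTNX behavior used below.

```python
from datetime import date, timedelta


def shift_years(d: date, years: int) -> date:
    """Move a date by whole years; Feb 29 falls back to Feb 28 (assumption)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def birth_date_range(age: int, measured_on: date):
    """Return (earliest, latest, midpoint) birth dates consistent with the age."""
    # Born one day earlier than `earliest` and the person would already be age + 1.
    earliest = shift_years(measured_on, -(age + 1)) + timedelta(days=1)
    # Born one day later than `latest` and the person would still be age - 1.
    latest = shift_years(measured_on, -age)
    midpoint = earliest + (latest - earliest) // 2
    return earliest, latest, midpoint


if __name__ == "__main__":
    print(birth_date_range(30, date(2020, 12, 10)))
```

Because the window is about 365 days wide, the midpoint can be off by at most about half a year.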


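/* The %age macro was taken from the Internet---maybe from here http://support.sas.com/kb/24/808.html ? */
%macro age(date,birth);
floor ((intck('month',&birth,&date) - (day(&date) < day(&birth))) / 12)
%mend age;

/* Random integer between min and max, inclusive; used below to generate dates. */
%macro randbetween(min, max);
	(&min + floor((1 + &max - &min) * rand("uniform")))
%mend randbetween;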

/*
Generate 10000 fake people with random birth dates and random perspective days
on which their age was measured. Then, calculate age from that perspective date.
In reality, there is some seasonality to births (e.g., more births in July), but 
here we assume each day of the year has an equal distribution of births.
*/
data person;
	format birth_date submit_date yymmdd10.;
	do i = 1 to 10000;
		birth_date = %randbetween(19000,20500);
		submit_date = birth_date + %randbetween(0,100*365);
		age = %age(submit_date, birth_date);
		output;
	end;
	drop i;
run;

/* Work in reverse from age to estimated birth date. */
data reverse;
	set person;
	format birth_date_min birth_date_max yymmdd10.;
	birth_date_min = intnx('years', submit_date, -1 * (age+1), 's') - 1;
	birth_date_max = intnx('years',birth_date_min,1,'s') + 1;

	/* check range of estimates: a flag is 1 when the true birth date falls outside the range */
	min_error = (birth_date < birth_date_min);
	max_error = (birth_date > birth_date_max);

	/* estimate birth date as the middle of the range */
	birth_date_avg = mean(birth_date_min, birth_date_max);

	/* absolute error of the estimate in days */
	abs_days_error = abs(birth_date - birth_date_avg);
run;

/* Both errors should always be zero. */
proc freq data=reverse;
	table min_error max_error;
quit;

/* Error of estimates range from 0 to 183.5 with a median of 92 and average of 91.*/
proc means data=reverse n nmiss min median mean max;
	var abs_days_error;
quit;

/* Distribution of errors is uniform */
proc sgplot data=reverse;
	histogram abs_days_error;
quit;

Tested with SAS 9.4M6

How to connect from Linux to BleemSync 1.1

2019-07-22T22:44:00.000-06:00

After installing BleemSync 1.1 on my PlayStation Classic, I could connect to the BleemSync UI from Windows but not from Linux (Ubuntu 19.04). Google Chrome reported "unable to connect," and ping to 169.254.215.100 reported a network error.

dmesg showed that Linux identified the device and that RNDIS networking started:

[23945.137399] usb 1-3: new high-speed USB device number 31 using xhci_hcd
[23945.286069] usb 1-3: New USB device found, idVendor=04e8, idProduct=6863, bcdDevice=ff.ff
[23945.286075] usb 1-3: New USB device strings: Mfr=3, Product=4, SerialNumber=5
[23945.286079] usb 1-3: Product: classic
[23945.286082] usb 1-3: Manufacturer: BleemSync
[23945.290633] rndis_host 1-3:1.0 usb0: register 'rndis_host' at usb-0000:00:14.0-3, RNDIS device, 8a:04:6f:1c:f9:72
[23945.291271] cdc_acm 1-3:1.2: ttyACM0: USB ACM device
[23945.339113] rndis_host 1-3:1.0 enp0s20f0u3: renamed from usb0

However, ifconfig showed it did not have an IPv4 address. This indicates a DHCP failure.

$ ifconfig enp0s20f0u3
enp0s20f0u3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::adf4:a447:4b1d:f96c  prefixlen 64  scopeid 0x20<link>
        ether 72:b7:b8:ae:c8:00  txqueuelen 1000  (Ethernet)
        RX packets 8  bytes 536 (536.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 81  bytes 12347 (12.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I bypassed DHCP by manual IP configuration.

sudo route add -net 169.254.215.0 netmask 255.255.255.0 metric 1024 dev enp0s20f0u3
sudo ifconfig enp0s20f0u3 169.254.215.2
ping 169.254.215.100

Now Google Chrome, ping, and even telnet worked.

Note: your interface name may vary. Mine was enp0s20f0u3.
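
The manual configuration works because the address assigned to the Linux side sits in the same /24 network as the console's 169.254.215.100. A small Python check with the standard ipaddress module illustrates this; the function name is hypothetical, and the /24 prefix is taken from the 255.255.255.0 netmask in the route command above.

```python
import ipaddress


def same_subnet(host: str, peer: str, prefix_len: int = 24) -> bool:
    """True when `peer` falls inside the network formed by `host` and the prefix length."""
    network = ipaddress.ip_interface(f"{host}/{prefix_len}").network
    return ipaddress.ip_address(peer) in network


if __name__ == "__main__":
    print(same_subnet("169.254.215.2", "169.254.215.100"))
    print(ipaddress.ip_address("169.254.215.100").is_link_local)
```

Any other free address in 169.254.215.0/24 should work equally well for the Linux side.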

This solution worked until rebooting Ubuntu. Later I found a permanent solution: BleemSync 1.0 on Ubuntu thanks to DDFoster96.

SAS Message Log with ODBC: COMMIT performed on connection #

2019-01-28T10:14:00.000-07:00

When using SAS to develop high-performance queries against remote SQL databases, it is helpful to see the exact ODBC messages that SAS passes to the driver. Sometimes the implicit SQL translates a query poorly, and seeing the translation shows what can be optimized. To see these messages, enable the SAS trace like this:

options sastrace=',,,d' sastraceloc=saslog nostsuffix;

However, when closing the SAS process, there can be a pop-up dialog window with the title "SAS Message Log" with entries like this:

ODBC: COMMIT performed on connection #6.
ODBC: COMMIT performed on connection #5.
ODBC: COMMIT performed on connection #4.
ODBC: COMMIT performed on connection #3.
ODBC: COMMIT performed on connection #2.
ODBC: COMMIT performed on connection #1.
ODBC: COMMIT performed on connection #0.

When running SAS interactively, this is a minor nuisance. When running SAS in an automated batch, this can be a serious problem because the dialog will wait indefinitely for human interaction, so the sas.exe process will never terminate.

This isn't exactly a bug, but it can feel like it. Sadly, SAS provides no convenient options like these:

Never show the SAS message pop-up dialog when the SAS editor has closed.
Automatically close the pop-up dialog after 60 seconds of inactivity.
Filter all traces with the text "ODBC commit."

The SAS developer has these options:

Disable the SAS trace.

Send the SAS trace to a file like this:

options sastrace=',,,d' sastraceloc=file 'c:\sastest\mytrace.log' nostsuffix;

Manually close the SAS Message Log whenever it appears.
Use my Python script to automatically close the "SAS Message Log" dialog whenever it appears.
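
As a rough idea of how the last option could work, the sketch below looks for a top-level window by its title and asks it to close, using the Win32 functions FindWindowW and PostMessageW through ctypes. It is not the script mentioned above: the function name and the polling loop are hypothetical, and it assumes the dialog is a top-level window whose title is exactly "SAS Message Log".

```python
import ctypes
import sys
import time

WM_CLOSE = 0x0010  # standard Win32 message asking a window to close


def close_window_by_title(title: str, user32=None) -> bool:
    """Post WM_CLOSE to the first top-level window with this exact title."""
    if user32 is None:
        if sys.platform != "win32":
            raise OSError("this sketch needs Windows (user32.dll)")
        user32 = ctypes.windll.user32
    hwnd = user32.FindWindowW(None, title)
    if not hwnd:
        return False  # no such window right now
    user32.PostMessageW(hwnd, WM_CLOSE, 0, 0)
    return True


if __name__ == "__main__":
    # Poll every few seconds while the SAS batch job runs (interval is arbitrary).
    while True:
        if close_window_by_title("SAS Message Log"):
            print("Closed a SAS Message Log dialog")
        time.sleep(5)
```

The optional user32 argument exists only so the logic can be exercised without Windows; in normal use, leave it out.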

Tested on SAS 9.4M5 and Python 3.7 on Windows 10 and Windows Server 2008.

Type I error rates in two-sample t-test by simulation

2018-01-28T14:51:00.001-07:00

What do you do when analyzing data is fun, but you don't have any new data? You make it up.

This simulation tests the type I error rates of the two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminds the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.

A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact it is true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If there were no errors in the hypothesis tests in this simulation, we would never reject the null hypothesis, but by design we expect to reject it at a rate equal to alpha, the significance level. The de facto standard for alpha is 0.05.
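
To make the idea concrete without R or SAS, here is a minimal Python sketch that uses only the standard library. It is not the code from this article: the function names are hypothetical, and instead of computing a p-value it compares the pooled t statistic against a hard-coded critical value of about 1.972, which is an approximation for a two-sided test at alpha 0.05 with 198 degrees of freedom (two groups of 100).

```python
import math
import random

# Approximate two-sided 5% critical value for 198 degrees of freedom (assumed constant).
T_CRITICAL_DF198 = 1.972


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


def type_one_error_rate(n_simulations=1000, n_obs=100, seed=None,
                        t_critical=T_CRITICAL_DF198):
    """Share of simulations that reject a null hypothesis that is true by design.

    The default critical value is only appropriate for n_obs=100.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_simulations):
        a = [rng.gauss(0, 1) for _ in range(n_obs)]
        b = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pooled_t_statistic(a, b)) > t_critical:
            rejections += 1
    return rejections / n_simulations


if __name__ == "__main__":
    print(type_one_error_rate(seed=2018))
```

Both groups are drawn from the same distribution, so every rejection is a false positive, and the printed rate should land near 0.05.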

R

First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.

# significance level (alpha)
alpha.p <- 0.05

# number of simulations
n.simulations <- 1000

# number of observations in each simulation
n.obs <- 100

# a vector of test results
type.one.error<-replicate(n.simulations, t.test(rnorm(n.obs),rnorm(n.obs),
   var.equal=TRUE)$p.value)<alpha.p

# type I error for the whole simulation
mean(type.one.error)

# Store cumulative results in data frame for plotting
sim <- data.frame(
 n.simulations = 1:n.simulations, 
 type.one.error.rate = cumsum(type.one.error) / seq_along(type.one.error))

# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) + 
    geom_line() + 
    xlab('Number of simulations') +
    ylab('Cumulative type I error rate') + 
    ggtitle('Simulation of type I error in t-test') +
    geom_abline(intercept = alpha.p, slope=0, col='red') +
    theme_bw()

SAS

Likewise, here is the equivalent code to do the same in SAS. Notice the simulation is not implemented as a slow SAS macro. Instead, it uses the BY statement in PROC TTEST.

/*
Create a data set with 1000 simulations. Each simulation
has 100 observations in each of two groups.
*/
data normal;
 length simulation 4 i 3; /* save space and time */
 do simulation = 1 to 1000;
  do i = 1 to 100;
   /* The values are normally distributed */
   group='A';
   x = rand('normal');
   output;
   group='B';
   x = rand('normal');
   output;
  end;
 end;
run;

/*
Run two-sample t-test once for each simulation, and output to
a data set called ttests.
*/
ods _all_ close;
ods output ttests=ttests;
proc ttest plots=none data=normal;
 by simulation;
 class group;
 var x;
run;

data ttests;
 set ttests;

 /* Limit the rows */
 if variances='Equal';

 /* Define the error as a boolean */
 type_one_error = probt<0.05;

 /* cumulative error */
 retain cumulative_error_count;
 format cumulative_error_rate percent10.2;
 label cumulative_error_rate = 'Cumulative error rate';
 if simulation eq 1 then cumulative_error_count = 0;
 cumulative_error_count+type_one_error;
 cumulative_error_rate = cumulative_error_count /simulation;
run;

/* Summarize the type I error rates for this simulation */
ods html;
proc freq data=ttests;
 table type_one_error/nocum;
run;

/* Draw a line plot */
proc sgplot data=ttests;
 series x=simulation y=cumulative_error_rate;
 refline 0.05 /axis=y lineattrs=(color=red);
run;

Sawtooth

Did you notice the sawtooth pattern in the error rate? A false positive is relatively rare, and when one happens, the cumulative error rate spikes. Then, for each subsequent simulation without a false positive, the rate declines gradually because the count of false positives (the numerator) stays the same while the count of simulations (the denominator) grows by one.
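
The arithmetic behind the sawtooth is easy to see with a tiny, made-up sequence of test results. In this Python sketch (the function name and the sequence are hypothetical), the only false positive occurs in the tenth simulation.

```python
from itertools import accumulate


def cumulative_rate(errors):
    """Running share of True values: errors so far divided by simulations so far."""
    counts = accumulate(int(e) for e in errors)
    return [count / n for n, count in enumerate(counts, start=1)]


if __name__ == "__main__":
    # Nine clean simulations, one false positive, then ten more clean simulations.
    errors = [False] * 9 + [True] + [False] * 10
    for n, rate in enumerate(cumulative_rate(errors), start=1):
        print(n, round(rate, 4))
```

The rate jumps from 0 to 1/10 at the tenth simulation and then decays as 1/11, 1/12, and so on, until the next false positive produces another spike.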

Conclusion

This article was developed on Ubuntu 16.04 with R 3.4 and Windows 7 with SAS 9.4.

See also the article: Type I error rates in test of normality by simulation.

Condition execution on row count

2018-01-10T15:06:00.000-07:00

Use this code as a template for scenarios when you want to change how a SAS program runs depending on whether a data set is empty or not empty. For example, when a report is empty, you may want to not send an email with what would be a blank report. In other words, the report sends only when it has information.

On the other hand, you may want to send an email when a data set is empty if that means an automated SAS program had an error that requires manual intervention.

In general, it's good practice in automated SAS programs to check the size of data sets in case they are empty or otherwise have the wrong number of observations. With one easy tweak, you could check for a specific minimum number of observations that is greater than zero. (This is left as an exercise for the reader.)



/*
This creates a sample data set with one record.
*/
data mydata;
 input x;
datalines;
1
;
 
/*
This creates a sample data set with zero records.
It has the same name as above, so if you want to test the scenario
with a non-empty data set, simply do not run this step.
*/
data mydata;
 input x;
datalines;
;
 
/*
Count the number of observations, and store the count in a
macro variable.
*/
data _null_;
 if 0 then set mydata nobs=record_count;
 /* symputx strips the leading blanks that symput with put() would keep */
 call symputx('mv_record_count', record_count);
 stop;
run;
 
/* Print count to the log. */
%put &=mv_record_count;
 
/* Define a macro */
%macro conditional_run;
%if &mv_record_count gt 0 %then %do;
 %put NOTE: this runs when the data set is not empty;
%end;
%else %do;
 %put NOTE: this runs when the data set is empty;
%end;
%mend;
 
/* Run macro */
%conditional_run;

I tested this on SAS 9.4 on Windows 7, though it should work on practically all SAS systems.

SAS ERROR: Cannot load SSL support. on Microsoft Windows

2016-08-30T11:52:00.000-06:00

When using SAS with HTTPS or FTPS, which requires SSL/TLS support, you may see this error message in the SAS log.

ERROR: Cannot load SSL support.

Here is an example of code that can trigger the error.

filename myref url "https://www.google.com";
data _null_; 
infile myref; 
run;

The cause was that SAS/Secure Client Components was not installed, so I resolved the issue by running the SAS Deployment Wizard to install SAS/Secure Client Components.

Tested with SAS 9.4 M3 on Microsoft Windows 7. The error may also happen with encrypted SMTP, but I did not test SMTP.

SAS error "insufficient memory" on remote queries with wide rows

2016-06-23T13:04:00.000-06:00

SAS can give the error "The SAS System stopped processing this step because of insufficient memory" when querying a single, wide row from a remote SQL Server. The following code fully demonstrates the problem and shows a workaround. It also rules out the explanation that SAS data sets in general do not support rows this wide.


proc sql;
 /* Connect to Microsoft SQL Server */
 connect using sbox;

 execute (
  /* Create the table */
  create table dbo.zzz_varchar_max (
   id int,
   txt1 varchar(max),
   txt2 varchar(max)
  );

  /* Insert a single row */
  insert into zzz_varchar_max values (1, 'foo', 'bar');
 ) by sbox;
quit;

/* This step triggers the error */
data _null_;
 set sbox.zzz_varchar_max;
run;
/* ERROR: The SAS System stopped processing this step because of insufficient memory. */

/* This step does NOT trigger the error */
data _null_;
 set sbox.zzz_varchar_max(drop=txt2);
run;

/* This step inspects the metadata. SAS considers each character column as having
   the width 32767, which is the maximum string size for a SAS data set according to
   https://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001336069.htm
*/
proc contents data=sbox.zzz_varchar_max;
run;

/* This step shows that SAS allows creating a data set that is even 
   wider than the query that fails, so the error isn't a fundamental limitation
   of SAS. */

data wide;
 id = 1;
 format txt1 txt2 txt3 $32767.;
 txt1='foo';
 txt2='bar';
 txt3='';
run;

If the data set has a single column with NVARCHAR(MAX) or VARCHAR(MAX), there is no error. It happens only when there are (at least) two such wide columns.

Other workarounds include: use the KEEP option on the data set to keep only one of the wide columns, use a PROC SQL statement to query only one of the columns, or use a remote SQL query (maybe with SUBSTR) to truncate the columns.

Another workaround is to switch from the modern Microsoft ODBC driver (driver=ODBC Driver 11 for SQL Server) to the ancient driver (driver=sql server) by changing the ODBC connection string or DSN.

I tested with SAS 9.4 TS1M3 32-bit, Microsoft SQL Server 2012 (11.0 SP2), and the ODBC Driver 11 for SQL Server (2014.120.2000.08).

Reusing calculated columns in Netezza and SAS queries

2016-06-08T08:29:00.001-06:00

Netezza and SAS allow a query to reference a calculated column by name in the SELECT, WHERE, and ORDER BY clauses. Based on the DRY principle, this reduces code and makes code easier to read and maintain.

Some people call calculated columns derived or computed columns.

In Microsoft SQL Server, SQLite, and other RDBMSs you cannot exactly do this: a workaround is to reference a subquery or view. In Microsoft SQL Server, you can also define a computed column on a table.
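
The subquery workaround can be sketched with SQLite through Python's built-in sqlite3 module. This is an illustration only: the function name is hypothetical, and the table and values simply mirror the people example used in this post. The calculated column is defined once in the inner query, and the outer query refers to it by name.

```python
import sqlite3


def people_with_bmi_under(limit: float = 30.0):
    """Return (weight_kg, height_m, bmi) rows with bmi below the limit, lowest first."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE people (weight_kg INTEGER, height_m REAL)")
        con.executemany(
            "INSERT INTO people VALUES (?, ?)",
            [(50, 1.6), (70, 1.8), (150, 1.8)],
        )
        # The inner query computes bmi once; the outer query filters and sorts on it.
        rows = con.execute(
            """
            SELECT weight_kg, height_m, bmi
            FROM (
                SELECT weight_kg,
                       height_m,
                       weight_kg / (height_m * height_m) AS bmi
                FROM people
            ) AS t
            WHERE bmi < ?
            ORDER BY bmi
            """,
            (limit,),
        ).fetchall()
    finally:
        con.close()
    return rows


if __name__ == "__main__":
    for row in people_with_bmi_under():
        print(row)
```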

Below is an example tested with Netezza 7.2. Notice height_m_squared is used in the SELECT clause, and bmi is used in the WHERE and ORDER BY clauses.

CREATE TEMP TABLE people (weight_kg INT, height_m float);

INSERT INTO people
VALUES (50, 1.6);

INSERT INTO people
VALUES (70, 1.8);

INSERT INTO people
VALUES (150, 1.8);

SELECT weight_kg
 ,height_m
 ,height_m*height_m as height_m_squared
 ,(weight_kg/height_m_squared)::int as bmi
FROM people
WHERE bmi < 30
ORDER BY bmi;

Below is an example tested with SAS 9.4.

data people;
  input weight_kg height_m;
datalines;
50 1.6
70 1.8
150 1.8
;

proc sql;
  select
    weight_kg,
    height_m,
    height_m*height_m as height_m_squared,
    weight_kg/(calculated height_m_squared) as bmi
  from
    people
  where
    calculated bmi < 30
  order by
     calculated bmi;
quit;

In case of error in SAS program, send email and stop

2016-03-15T16:47:00.000-06:00

Any automated program should check for errors and unexpected conditions, such as inability to access a resource and presence of invalid values. Unlike traditional programming languages such as Python and C# that stop processing when an error occurs, SAS barrels ahead through the rest of the program. Therefore, carelessly-written SAS programs can create unwanted side effects, such as overwriting an output data set with bad data.

Previously I wrote about a robust solution for checking SAS error codes which wraps the entire program in a macro and invokes %GOTO EXIT in case of an error. This is still the ideal solution when some part of the program must continue, but it comes at a cost: wrapping SAS code in a macro disables syntax highlighting in the SAS Enhanced Editor (though not in SAS Studio). Also, it can be awkward to work with the large code block delimited by the macro, so this post focuses on two alternatives.

The easiest method is to enable OPTIONS ERRORABEND, which exits the program in case of an error. The major benefit is it requires only a single line of code to enable for the rest of the program, regardless of how many steps there are. For example:

OPTIONS ERRORABEND;

/* Misspelled data set */
data female;
 set sashelp.clss;
 if sex='F';
run;

However, ERRORABEND can be painful while developing a program in interactive mode because it immediately exits the SAS session, and it cannot push error notifications.

This motivates the following approach. After each DATA STEP or procedure, call a macro that checks the sanity of preceding steps. In case of an error, the universal error handler can send a notification email, clean up, etc. Like ERRORABEND, syntax highlighting still works. Like the other options, it can check errors for DATA STEPs and various procedures. Though it requires code after each step, it is less code than the first solution proposed in the original article.

/* This macro is called from two other macros below. It defines any
   common error handling code, and it should be customized. 

   In some programs, you may want to clean up here.
*/
%macro email_and_abort;
 /* Send an email */
 filename mymail email "andrew@example.com" subject="error in SAS program";
 data _null_;
  file mymail;
  put 'Check the SAS logs';
 run; 

 /* Stop further processing */
 %abort cancel;
%mend;

/* This macro asserts the last procedure did not throw an error 
   or a warning. We will invoke this macro after each step. */
%macro expect_no_syserr;
/* If there is an error or warning... */
%if &syserr ne 0 %then %do;
 /* Write the error code to the SAS log */
 %put ERROR: &=syserr;

 %email_and_abort;
 %end;
%mend;


/* Assuming SASHELP.CLASS exists and contains the character 
   variable SEX, this DATA STEP will succeed. */
data female;
 set sashelp.class;
 if sex='F';
run;
%expect_no_syserr;

/* Likewise, the macro works with procedures such as PROC FREQ. */
proc freq data=sashelp.class noprint;
 table age/out=age_freq;
run;
%expect_no_syserr;


/* This macro asserts PROC SQL creates at least one observation. */
%macro expect_any_obs;
%if &sqlobs eq 0 %then %do;
 %put ERROR: no observations;

 %email_and_abort;
 %end;
%mend;

/* This PROC SQL shows that the assertions work with PROC
   SQL and that two assertions can be combined. */
proc sql;
 create table age_count as
 select
  age,
  count(1) as count
 from
  sashelp.class
 group by
  age;
quit;
%expect_no_syserr;
%expect_any_obs;

/* Misspelled data set. This will fail to demonstrate the macros. */
data female;
 set sashelp.clss;
 if sex='F';
run;
%expect_no_syserr;

Similar assertion macros can check that a data set exists, variables exist in a data set, observations exist in a data set, a path is writable, a file exists, and so on.

Tested with SAS 9.4M3.

R: InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'

2016-03-10T10:42:00.000-07:00

Today the two-year-old TLS security certificate for cran.r-project.org expired, so suddenly in R you are getting errors running install.packages or update.packages.

The error looks like this:

> update.packages()
--- Please select a CRAN mirror for use in this session ---
Error in download.file(url, destfile = f, quiet = TRUE) : 
  cannot open URL 'https://cran.r-project.org/CRAN_mirrors.csv'
In addition: Warning message:
In download.file(url, destfile = f, quiet = TRUE) :
  InternetOpenUrl failed: 'The date in the certificate is invalid or has expired'

The workaround is simple: choose another repository! For example:

options("repos"="https://cran.revolutionanalytics.com/")
update.packages(ask=T)
install.packages('gbm')

This is bad timing with the release of R 3.2.4 today. If you need to download R using your web browser, visit a mirror, such as cran.revolutionanalytics.com.

Tested with R 3.2.3 on Windows 7 and Windows Server 2012.

ISO 3166-1 alpha-2 (two-letter country code) format for SAS

2016-02-22T13:13:00.001-07:00

Here is the widely-used ISO 3166-1 alpha-2 format for use in SAS. It is commonly called the two-letter country code format.

The PROC FORMAT code generates a character format, so where the raw data contains a code, such as US, it expands it to the pretty name, such as United States. As with any SAS format, applying the format does not change the underlying data.

proc format;
 /* ISO 3166-1 alpha-2 two letter country codes */
 value $ iso3166alphatwo
  'AF' = 'Afghanistan'
  'AX' = 'Åland Islands'
  'AL' = 'Albania'
  'DZ' = 'Algeria'
  'AS' = 'American Samoa'
  'AD' = 'Andorra'
  'AO' = 'Angola'
  'AI' = 'Anguilla'
  'AQ' = 'Antarctica'
  'AG' = 'Antigua and Barbuda'
  'AR' = 'Argentina'
  'AM' = 'Armenia'
  'AW' = 'Aruba'
  'AU' = 'Australia'
  'AT' = 'Austria'
  'AZ' = 'Azerbaijan'
  'BS' = 'Bahamas'
  'BH' = 'Bahrain'
  'BD' = 'Bangladesh'
  'BB' = 'Barbados'
  'BY' = 'Belarus'
  'BE' = 'Belgium'
  'BZ' = 'Belize'
  'BJ' = 'Benin'
  'BM' = 'Bermuda'
  'BT' = 'Bhutan'
  'BO' = 'Bolivia, Plurinational State of'
  'BQ' = 'Bonaire, Sint Eustatius and Saba'
  'BA' = 'Bosnia and Herzegovina'
  'BW' = 'Botswana'
  'BV' = 'Bouvet Island'
  'BR' = 'Brazil'
  'IO' = 'British Indian Ocean Territory'
  'BN' = 'Brunei Darussalam'
  'BG' = 'Bulgaria'
  'BF' = 'Burkina Faso'
  'BI' = 'Burundi'
  'KH' = 'Cambodia'
  'CM' = 'Cameroon'
  'CA' = 'Canada'
  'CV' = 'Cape Verde'
  'KY' = 'Cayman Islands'
  'CF' = 'Central African Republic'
  'TD' = 'Chad'
  'CL' = 'Chile'
  'CN' = 'China'
  'CX' = 'Christmas Island'
  'CC' = 'Cocos (Keeling) Islands'
  'CO' = 'Colombia'
  'KM' = 'Comoros'
  'CG' = 'Congo'
  'CD' = 'Congo, the Democratic Republic of the'
  'CK' = 'Cook Islands'
  'CR' = 'Costa Rica'
  'CI' = 'Côte d''Ivoire'
  'HR' = 'Croatia'
  'CU' = 'Cuba'
  'CW' = 'Curaçao'
  'CY' = 'Cyprus'
  'CZ' = 'Czech Republic'
  'DK' = 'Denmark'
  'DJ' = 'Djibouti'
  'DM' = 'Dominica'
  'DO' = 'Dominican Republic'
  'EC' = 'Ecuador'
  'EG' = 'Egypt'
  'SV' = 'El Salvador'
  'GQ' = 'Equatorial Guinea'
  'ER' = 'Eritrea'
  'EE' = 'Estonia'
  'ET' = 'Ethiopia'
  'FK' = 'Falkland Islands (Malvinas)'
  'FO' = 'Faroe Islands'
  'FJ' = 'Fiji'
  'FI' = 'Finland'
  'FR' = 'France'
  'GF' = 'French Guiana'
  'PF' = 'French Polynesia'
  'TF' = 'French Southern Territories'
  'GA' = 'Gabon'
  'GM' = 'Gambia'
  'GE' = 'Georgia'
  'DE' = 'Germany'
  'GH' = 'Ghana'
  'GI' = 'Gibraltar'
  'GR' = 'Greece'
  'GL' = 'Greenland'
  'GD' = 'Grenada'
  'GP' = 'Guadeloupe'
  'GU' = 'Guam'
  'GT' = 'Guatemala'
  'GG' = 'Guernsey'
  'GN' = 'Guinea'
  'GW' = 'Guinea-Bissau'
  'GY' = 'Guyana'
  'HT' = 'Haiti'
  'HM' = 'Heard Island and McDonald Islands'
  'VA' = 'Holy See (Vatican City State)'
  'HN' = 'Honduras'
  'HK' = 'Hong Kong'
  'HU' = 'Hungary'
  'IS' = 'Iceland'
  'IN' = 'India'
  'ID' = 'Indonesia'
  'IR' = 'Iran, Islamic Republic of'
  'IQ' = 'Iraq'
  'IE' = 'Ireland'
  'IM' = 'Isle of Man'
  'IL' = 'Israel'
  'IT' = 'Italy'
  'JM' = 'Jamaica'
  'JP' = 'Japan'
  'JE' = 'Jersey'
  'JO' = 'Jordan'
  'KZ' = 'Kazakhstan'
  'KE' = 'Kenya'
  'KI' = 'Kiribati'
  'KP' = 'Korea, Democratic People''s Republic of'
  'KR' = 'Korea, Republic of'
  'KW' = 'Kuwait'
  'KG' = 'Kyrgyzstan'
  'LA' = 'Lao People''s Democratic Republic'
  'LV' = 'Latvia'
  'LB' = 'Lebanon'
  'LS' = 'Lesotho'
  'LR' = 'Liberia'
  'LY' = 'Libya'
  'LI' = 'Liechtenstein'
  'LT' = 'Lithuania'
  'LU' = 'Luxembourg'
  'MO' = 'Macao'
  'MK' = 'Macedonia, the Former Yugoslav Republic of'
  'MG' = 'Madagascar'
  'MW' = 'Malawi'
  'MY' = 'Malaysia'
  'MV' = 'Maldives'
  'ML' = 'Mali'
  'MT' = 'Malta'
  'MH' = 'Marshall Islands'
  'MQ' = 'Martinique'
  'MR' = 'Mauritania'
  'MU' = 'Mauritius'
  'YT' = 'Mayotte'
  'MX' = 'Mexico'
  'FM' = 'Micronesia, Federated States of'
  'MD' = 'Moldova, Republic of'
  'MC' = 'Monaco'
  'MN' = 'Mongolia'
  'ME' = 'Montenegro'
  'MS' = 'Montserrat'
  'MA' = 'Morocco'
  'MZ' = 'Mozambique'
  'MM' = 'Myanmar'
  'NA' = 'Namibia'
  'NR' = 'Nauru'
  'NP' = 'Nepal'
  'NL' = 'Netherlands'
  'NC' = 'New Caledonia'
  'NZ' = 'New Zealand'
  'NI' = 'Nicaragua'
  'NE' = 'Niger'
  'NG' = 'Nigeria'
  'NU' = 'Niue'
  'NF' = 'Norfolk Island'
  'MP' = 'Northern Mariana Islands'
  'NO' = 'Norway'
  'OM' = 'Oman'
  'PK' = 'Pakistan'
  'PW' = 'Palau'
  'PS' = 'Palestine, State of'
  'PA' = 'Panama'
  'PG' = 'Papua New Guinea'
  'PY' = 'Paraguay'
  'PE' = 'Peru'
  'PH' = 'Philippines'
  'PN' = 'Pitcairn'
  'PL' = 'Poland'
  'PT' = 'Portugal'
  'PR' = 'Puerto Rico'
  'QA' = 'Qatar'
  'RE' = 'Réunion'
  'RO' = 'Romania'
  'RU' = 'Russian Federation'
  'RW' = 'Rwanda'
  'BL' = 'Saint Barthélemy'
  'SH' = 'Saint Helena, Ascension and Tristan da Cunha'
  'KN' = 'Saint Kitts and Nevis'
  'LC' = 'Saint Lucia'
  'MF' = 'Saint Martin (French part)'
  'PM' = 'Saint Pierre and Miquelon'
  'VC' = 'Saint Vincent and the Grenadines'
  'WS' = 'Samoa'
  'SM' = 'San Marino'
  'ST' = 'Sao Tome and Principe'
  'SA' = 'Saudi Arabia'
  'SN' = 'Senegal'
  'RS' = 'Serbia'
  'SC' = 'Seychelles'
  'SL' = 'Sierra Leone'
  'SG' = 'Singapore'
  'SX' = 'Sint Maarten (Dutch part)'
  'SK' = 'Slovakia'
  'SI' = 'Slovenia'
  'SB' = 'Solomon Islands'
  'SO' = 'Somalia'
  'ZA' = 'South Africa'
  'GS' = 'South Georgia and the South Sandwich Islands'
  'SS' = 'South Sudan'
  'ES' = 'Spain'
  'LK' = 'Sri Lanka'
  'SD' = 'Sudan'
  'SR' = 'Suriname'
  'SJ' = 'Svalbard and Jan Mayen'
  'SZ' = 'Swaziland'
  'SE' = 'Sweden'
  'CH' = 'Switzerland'
  'SY' = 'Syrian Arab Republic'
  'TW' = 'Taiwan, Province of China'
  'TJ' = 'Tajikistan'
  'TZ' = 'Tanzania, United Republic of'
  'TH' = 'Thailand'
  'TL' = 'Timor-Leste'
  'TG' = 'Togo'
  'TK' = 'Tokelau'
  'TO' = 'Tonga'
  'TT' = 'Trinidad and Tobago'
  'TN' = 'Tunisia'
  'TR' = 'Turkey'
  'TM' = 'Turkmenistan'
  'TC' = 'Turks and Caicos Islands'
  'TV' = 'Tuvalu'
  'UG' = 'Uganda'
  'UA' = 'Ukraine'
  'AE' = 'United Arab Emirates'
  'GB' = 'United Kingdom'
  'US' = 'United States'
  'UM' = 'United States Minor Outlying Islands'
  'UY' = 'Uruguay'
  'UZ' = 'Uzbekistan'
  'VU' = 'Vanuatu'
  'VE' = 'Venezuela, Bolivarian Republic of'
  'VN' = 'Vietnam'
  'VG' = 'Virgin Islands, British'
  'VI' = 'Virgin Islands, U.S.'
  'WF' = 'Wallis and Futuna'
  'EH' = 'Western Sahara'
  'YE' = 'Yemen'
  'ZM' = 'Zambia'
  'ZW' = 'Zimbabwe'
;
quit;

/* Example usage */
data country;
 format country_code $iso3166alphatwo.;

 country_code = 'US';
 output;
 country_code='GB';
 output;
run;

proc print data=country;
run;

This list is from Cloudflare, published in 2015.

Tested with SAS 9.4M3 on Microsoft Windows.

Undocumented SAS feature: Bulkloading to Netezza with ODBC interface

2016-02-03T14:01:00.000-07:00

The documentation for the SAS/ACCESS Interface to ODBC in SAS 9.4M4 states that it supports bulk loading only to "Microsoft SQL Server data on Windows platforms." However, in practice on the Windows platform it also supports bulk loading to Netezza.

Bulk loading is amazingly fast. In some of my benchmarks the duration of the whole bulk loading operation is independent of the number of rows inserted!

By default on Netezza the bulk loading interface delimits values using a pipe character, and for cases where the values contain a pipe, the SAS/ACCESS Interface to ODBC unofficially supports the BL_DELIMITER option to specify an alternate delimiter. For the ODBC interface, this option is undocumented.

However, there are nuances with the BL_DELIMITER option. According to the documentation for the SAS/ACCESS Interface to Netezza:

You can use any 7-bit ASCII character as a delimiter. The default is the pipe symbol (ǀ). To use a printable ASCII character, enclose it in quotation marks (for example, BL_DELIMITER="|"). However, to use an extended character, use the three-digit decimal number representation of the ASCII character for this option. For example, set BL_DELIMITER=202 to use ASCII character 202 as a delimiter. You must specify decimal number delimiters as three digits even if the first two digits would be zero. For example, specify BL_DELIMITER=003, not BL_DELIMITER=3 or BL_DELIMITER=03.

First, notice a contradiction in the documentation. Restricting delimiters to 7-bit ASCII characters (decimal codes 0-127) implies that 8-bit characters in the range 128-255 are not supported, but the documentation gives an example in this range (BL_DELIMITER=202).

Second, the syntax for the ODBC interface (which is not covered by the documentation for the ODBC interface) supports only a single character, so specifying a delimiter using the three-digit decimal notation will always cause an error.

For example, this works with the ODBC interface:


options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/*
 Because a data set with one variable does not require
 a delimiter, this data set has two variables.
*/
data has_pipe;
/*
 Because the first character in 122 used below is a 1,
 here we test the number 1.
*/ 
 number=1;
/*
 Because by default the delimiter is a pipe, one of the
  values has a pipe.
*/
 char='I|have|a|pipe.';
 output;
run;


/* Use a lowercase z as the alternate delimiter */
data nz.has_pipe(bulkload=yes bl_delimiter='z');
    set has_pipe;
run;

However, the decimal representation fails.


/* 122 is the decimal representation of the lowercase z */
data nz.has_pipe(bulkload=yes bl_delimiter=122);
 set has_pipe;
run;

The SAS log shows SAS treats the decimal representation as a literal character and truncates it to the first character.


ODBC_25: Executed: on connection 8
CREATE EXTERNAL TABLE EXT_HAS_PIPE SAMEAS ADMIN.HAS_PIPE USING
(DATAOBJECT('\\.\pipe\BL_HAS_PIPE_3') DELIMITER '1' REMOTESOURCE 'ODBC' )
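
To make the two notations concrete, here is a minimal Python sketch (not SAS) of the three-digit decimal form described in the quoted documentation, followed by the truncation the trace above suggests. The function name three_digit_code is hypothetical, and the last step only imitates what the log shows; it is not a description of SAS internals.

```python
def three_digit_code(delimiter):
    """Return the zero-padded, three-digit decimal code for one character."""
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    return "{:03d}".format(ord(delimiter))

# The notation described in the quoted Netezza documentation.
print(three_digit_code("z"))   # lowercase z
print(three_digit_code("\t"))  # tab
print(three_digit_code("|"))   # pipe

# What the trace suggests the ODBC interface did with 122:
# keep only the first character of the text "122".
first_character = str(122)[0]
print(first_character)
```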

If you have it licensed, the SAS/ACCESS Interface to Netezza should support the decimal notation: in this case, I would suggest using a tab delimiter with BL_DELIMITER=009.

If not, you must either disable bulkloading or use a single, printable ASCII character as a delimiter. If your data set requires a full range of characters but never all characters on the same row (for example, some rows have a pipe while other rows have a caret), split your data set into two data sets, and then bulk load each data set using separate delimiters.

Frequency of individual characters from SAS data set

2016-01-29T13:31:00.000-07:00

This script counts the frequencies of individual ASCII characters in a single column in a SAS data set and then prints an easy-to-read report.

My initial motivation relates to delimiters. By default bulkloading data from Netezza to SAS (which is very fast) uses the pipe character as a delimiter, but my data set contained values with the pipe character, so this macro identifies alternative delimiters.

Another potential use is cracking a message encrypted using a simple letter substitution cipher.

To begin, this code creates an example data set courtesy of William Shakespeare.

data sonnet18;
 input line $60.;
datalines;
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed,
And every fair from fair sometime declines,
By chance, or nature's changing course untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow'st,
Nor shall death brag thou wander'st in his shade,
When in eternal lines to time thou grow'st,
   So long as men can breathe, or eyes can see,
   So long lives this, and this gives life to thee.
 ;

Next, here is the macro that counts all the printable ASCII characters in all rows of the data set and makes a new data set with total counts by ASCII character.

%macro character_histogram(dataset, column);
data histogram_tmp;
 set &dataset;
 /* Characters 32 through 126 are printable ASCII. */
 %do i = 32 %to 126;
  /* Count the number of characters in the column. */
  /* Store each count in separate column. */
  count_chr_&i = count(trim(&column), byte(&i));
 %end;
run;

/* Sum the character counts from all the rows. */
proc means data=histogram_tmp noprint;
 var count_chr_:;
 output  out=histogram_wide sum=sum/autoname;
run;

/* Clean up */
proc sql;
 drop table histogram_tmp;
quit;

/* Switch from wide to long. */
proc transpose
 data=histogram_wide(keep=count_chr:)
 out=histogram_long
 ;
run;

/* Make pretty. */
data histogram_long;
 set histogram_long;
 character_decimal = input(compress(_name_, , 'kd'), 3.);
 drop _name_;
 character = byte(character_decimal);
 rename col1=count_characters;
run;
%mend;

Finally, this code invokes the macro and prints the report.

/* Run the histogram macro  */
%character_histogram(sonnet18, line);

/* Print the final report as a table. */
proc print data=histogram_long noobs;
 var character_decimal character count_characters ;
run;

/* Barchart, what some people would call a histogram of the letters. */
proc sgplot data=histogram_long;
 hbar character/freq=count_characters;
run;

This is the final report.

character_decimal	character	count_characters
32		100
33	!	0
34	"	0
35	#	0
36	$	0
37	%	0
38	&	0
39	'	6
40	(	0
41	)	0
42	*	0
43	+	0
44	,	12
45	-	0
46	.	1
47	/	0
48	0	0
49	1	0
50	2	0
51	3	0
52	4	0
53	5	0
54	6	0
55	7	0
56	8	0
57	9	0
58	:	3
59	;	0
60	<	0
61	=	0
62	>	0
63	?	1
64	@	0
65	A	3
66	B	2
67	C	0
68	D	0
69	E	0
70	F	0
71	G	0
72	H	0
73	I	1
74	J	0
75	K	0
76	L	0
77	M	1
78	N	2
79	O	0
80	P	0
81	Q	0
82	R	1
83	S	4
84	T	1
85	U	0
86	V	0
87	W	1
88	X	0
89	Y	0
90	Z	0
91	[	0
92	\	0
93	]	0
94	^	0
95	_	0
96	`	0
97	a	37
98	b	3
99	c	9
100	d	20
101	e	63
102	f	10
103	g	10
104	h	31
105	i	26
106	j	0
107	k	1
108	l	23
109	m	22
110	n	31
111	o	44
112	p	4
113	q	0
114	r	28
115	s	38
116	t	39
117	u	13
118	v	5
119	w	4
120	x	1
121	y	8
122	z	0
123	{	0
124	|	0
125	}	0
126	~	0

This script was tested with SAS 9.4M3 on Windows 7.
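
For the delimiter use case, any character with a count of zero is a candidate delimiter. The same idea can be sketched in a few lines of Python. This is not a port of the macro: the function name unused_printable_ascii and the sample values are made up for illustration.

```python
from collections import Counter

def unused_printable_ascii(values):
    """Return printable ASCII characters (32-126) absent from all values."""
    counts = Counter()
    for value in values:
        counts.update(value)  # counts each character of the string
    return [chr(code) for code in range(32, 127) if counts[chr(code)] == 0]

# Hypothetical sample values: one contains pipes, the other a caret.
values = ["I|have|a|pipe.", "caret^here"]
candidates = unused_printable_ascii(values)
print("|" in candidates)  # the pipe is used, so it is not a candidate
print("~" in candidates)  # the tilde is unused, so it is a candidate
```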

SAS macros always have a global scope

2015-09-02T10:29:00.002-06:00

SAS allows the programmer to declare the scope of macro variables using %LOCAL or %GLOBAL, but the macros themselves are always created in the global scope.

Say you have a macro that in another language, say Python, would be considered a function. Within the macro you want a sub-macro (i.e., sub-function) to be used only within the outer macro.

%macro outer;
%put NOTE: outer;

/* This "sub-macro" is defined within the outer macro and is
   intended only for use within the outer macro. */
 %macro inner(foo);
 %put NOTE: inner &foo;
 %mend;

%inner(1);
%inner(2);
%mend;

%outer;

/* If the "sub-macro" has a local scope, the next step would fail */
%inner(3);

/* However, it succeeds */

This can lead to conflicts if the macro %inner is defined somewhere else in the same session. One way of dealing with this is to be careful to give the inner macro a unique name like __outer_inner where the underscores in the prefix suggest a local scope, and adding outer to the macro name indicates the macro is to be used only in the outer macro.

Another option is to use %SYSMACDELETE to delete the inner macro:

%macro outer;
%put NOTE: outer;

 %macro inner(foo);
 %put NOTE: inner &foo;
 %mend;

%inner(1);
%inner(2);
/* Delete the inner macro */
%SYSMACDELETE inner;
%mend;

%outer;

/* This fails because of SYSMACDELETE */
%inner(3);

Tested with SAS 9.4M3 on Windows 7.

Gotcha with SAS, regular expressions, and end-of-line matching

2015-08-31T14:40:00.001-06:00

Regular expressions are essential for sophisticated text processing, and it is generally easy to transfer knowledge of Perl regular expressions to the SAS functions prxparse, prxmatch, prxposn, etc. However, use caution with the end of line character ($) because of how SAS treats whitespace.

For demonstration I will run what looks like an equivalent use of regular expressions in Python, JavaScript, and SAS, but notice that only SAS does not match the string.

# Python 2.7
import re
first_name ='Andrew  '
first_name = first_name.strip()
if re.search(r"^Andrew$", first_name):
    print 'match' # it does match
else:
    print 'no match'

/*JavaScript */
first_name ='Andrew  ';
first_name = first_name.trim();
if (first_name.match(/^Andrew$/)) 
    alert('match'); /* it does match */
    else alert('no match');

(JavaScript fiddle for this code.)

data x;
 first_name='Andrew  ';
 first_name=strip(first_name);
 match=prxmatch('/^Andrew$/', first_name); /* it does not match (match=0) */
run;

In SAS ignore the trailing whitespace using the trim() function:

data x;
 first_name='Andrew  ';
 match=prxmatch('/^Andrew$/', trim(first_name)); /* it does match */
run;
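
The reason strip() does not help in the first SAS example is that a SAS character variable has a fixed length, so the stripped value is padded with blanks again when it is assigned back to first_name. The following Python sketch imitates that behavior. The helper store_fixed_width is a hypothetical stand-in for assignment to a SAS character variable of length 8, not a real SAS or Python API.

```python
import re

def store_fixed_width(value, width=8):
    """Imitate assignment to a fixed-length character variable:
    truncate to the width, then pad with blanks on the right."""
    return value[:width].ljust(width)

# Stripping before the assignment has no lasting effect because the
# value is padded back to the full width.
first_name = store_fixed_width("Andrew  ".strip())
print(repr(first_name))

# The padding sits between "Andrew" and the end of the string,
# so the $ anchor does not match.
padded_match = re.search(r"^Andrew$", first_name) is not None

# Removing trailing blanks inside the expression, as trim() does in
# the SAS example, lets the pattern match.
trimmed_match = re.search(r"^Andrew$", first_name.rstrip(" ")) is not None
print(padded_match, trimmed_match)
```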

SAS, however, does not distinguish a string that was inserted with trailing spaces from a string that was inserted without trailing spaces. In the following SAS-only example imagine the table was created and populated using a non-SAS system like MySQL or Microsoft SQL Server.

proc sql;
 create table names (
  first_name varchar(8)
 );

 insert into names values ('Andrew'); /* no trailing spaces */
 insert into names values ('Andrew '); /* one trailing space */
 insert into names values ('Andrew  '); /* two trailing spaces */
quit;

data names;
 set names;
 length=length(first_name);
 match1=prxmatch('/^Andrew$/', first_name);
 match2=prxmatch('/^Andrew$/', trim(first_name));
run;

This was tested with SAS 9.4M3 on Microsoft Windows 7.

List of user-installed R packages and their versions

2015-06-09T14:02:00.002-06:00

This R command lists all the packages installed by the user (ignoring packages that come with R such as base and foreign) and the package versions.

ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)

Example output

       Package   Version
        bitops     1.0-6
 BradleyTerry2     1.0-6
          brew     1.0-6
         brglm     0.5-9
           car    2.0-25
         caret    6.0-47
          coin    1.0-24
    colorspace     1.2-6
        crayon     1.2.1
      devtools     1.8.0
     dichromat     2.0-0
        digest     0.6.8
         earth     4.4.0
      evaluate       0.7
[..snip..]

Tested with R 3.2.0.

This is a small step towards managing package versions: for a better solution, see the checkpoint package. You could also use the first column to reinstall user-installed R packages after an R upgrade.

SAS 9.4 crash with MySQL ODBC pass-through queries

2015-03-03T09:26:00.002-07:00

SAS 9.4 (TS1M2) on X64_DS08R2 (Windows Server 2008 64-bit) always crashes with certain pass-through queries using MySQL Connector/ODBC 5.3.4. When it crashes, the SAS log shows some red messages, but SAS closes immediately.

The crash is not reproducible with other ODBC drivers, on SAS 9.3 64-bit, or SAS 9.4 32-bit.

Workarounds include: using an ODBC DSN instead of the connection string, not using pass-through queries, or using SAS 9.3.

SAS agreed to fix the bug.

This shows how to reproduce it:


/* Trace log */
options sastrace=',,d,d' sastraceloc=file 'c:\temp\mytracefile.log';

/* This does not crash */
libname ensembl odbc
    required="Driver={MySQL ODBC 5.3 Unicode Driver};Server=ensembldb.ensembl.org;Database=aedes_aegypti_core_48_1b;Uid=anonymous;interactive=1;";

/* This does not crash */
proc sql;
 create table x as 
    select *
 from ensembl.analysis;
quit;

/* This crashes */
proc sql;
 connect using ensembl;
 create table x as 
    select *
 from connection to ensembl (
            show databases;
        );
quit;

This bug was present in SAS 9.4M2, and it was fixed in SAS 9.4M3.

Autocommit with ceODBC is slow

2015-02-10T16:07:00.000-07:00

You already know that in Python it is faster to call executemany() than repeatedly calling execute() to INSERT the same number of rows because executemany() avoids rebinding the parameters, but what about the effect of autocommit on performance? While this is probably not specific to ceODBC, using autocommit is astonishingly slow. Here is how slow.

First, the Python code to run the benchmark:

import ceODBC
import datetime
import os
import time

connection_string="driver=sql server;database=database;server=server;" 
print connection_string

conn = None
cursor = None
def init_db():
    import ceODBC
    global conn
    global cursor
    conn = ceODBC.connect(connection_string)
    cursor = conn.cursor()

def table_exists():
    cursor.execute("select count(1) from information_schema.tables where table_name='zzz_ceodbc_test'")
    return cursor.fetchone()[0] == 1

def create_table():
    print('create_table')
    create_sql="""
CREATE TABLE zzz_ceodbc_test (
    col1 INT,
    col2 VARCHAR(50)
) """
    try:
        cursor.execute(create_sql)
        assert(table_exists())
    except:
        import traceback
        traceback.print_exc()

rows = []
for i in xrange(0,10000):
    rows.append((i,'abcd'))

def log_speed(start_time, end_time, records):
    elapsed_seconds = end_time - start_time
    if elapsed_seconds > 0:
        records_second = int(records / elapsed_seconds)
        # make elapsed_seconds an integer to shorten the string format
        elapsed_str = str(
            datetime.timedelta(seconds=int(elapsed_seconds)))
        print("{:,} records; {} records/sec; {} elapsed".format(records, records_second, elapsed_str))
    else:
        print("counter: %i records " % records)

 
 
def benchmark(bulk, autocommit):
    init_db()
    global conn
    global cursor
    conn.autocommit=True
    cursor.execute('truncate table zzz_ceodbc_test')
    
    conn.autocommit = autocommit
    insert_sql = 'insert into zzz_ceodbc_test (col1, col2) values (?,?)'
    
    start_time = time.time()
    if bulk:
        cursor.executemany(insert_sql, rows)
    else:
        for row in rows:
            cursor.execute(insert_sql, row)
    conn.commit()
    end_time = time.time()
    
    cursor.execute("select count(1) from zzz_ceodbc_test")
    assert cursor.fetchone()[0] == len(rows)
    
    log_speed(start_time, end_time, len(rows))
    conn.autocommit=True
    
    del cursor
    del conn
    return end_time - start_time


def benchmark_repeat(bulk, autocommit, repeats=5):
    description = "%s, autocommit=%s" % ('bulk' if bulk else 'one at a time', autocommit)
    print '\n******* %s' % description
    results = []
    for x in xrange(0, repeats):
        results.append(benchmark(bulk, autocommit))
    print results

benchmark_repeat(True, False)
benchmark_repeat(True, True)
benchmark_repeat(False, True)

And to graph the results in R:

results_table <- 'group seconds
bulk_manual 0.6710000038146973
bulk_manual 0.6710000038146973
bulk_manual 0.9830000400543213
bulk_manual 0.7330000400543213
bulk_manual 0.6710000038146973
bulk_auto 8.486999988555908
bulk_auto 8.269000053405762
bulk_auto 8.980999946594238
bulk_auto 8.453999996185303
bulk_auto 8.480999946594238
one_at_a_time 24.391000032424927
one_at_a_time 23.70300006866455
one_at_a_time 71.66299986839294
one_at_a_time 23.58899998664856
one_at_a_time 37.18400001525879'

results <- read.table(textConnection(results_table), header = TRUE)
closeAllConnections() 

library(ggplot2)
ggplot(results, aes(group, seconds)) + geom_boxplot()

Conclusion: comparing mean elapsed times, executemany() with autocommit takes 76% less time than execute() with autocommit, and executemany() without autocommit takes 91% less time than executemany() with autocommit. Also, executemany() gives more consistent performance.
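
The percentages in the conclusion can be reproduced from the timings above, assuming they compare the mean elapsed time of each group. This is a short Python 3 sketch (the benchmark itself ran on Python 2.7). The helper percent_less_time is a name made up here, and the timings are rounded to three decimals.

```python
from statistics import mean

# Elapsed seconds copied from the R results table above, rounded to
# three decimals.
seconds = {
    "bulk_manual": [0.671, 0.671, 0.983, 0.733, 0.671],
    "bulk_auto": [8.487, 8.269, 8.981, 8.454, 8.481],
    "one_at_a_time": [24.391, 23.703, 71.663, 23.589, 37.184],
}

def percent_less_time(fast, slow):
    """Percent reduction in mean elapsed time of fast relative to slow."""
    return 100 * (1 - mean(seconds[fast]) / mean(seconds[slow]))

bulk_vs_single = percent_less_time("bulk_auto", "one_at_a_time")
manual_vs_auto = percent_less_time("bulk_manual", "bulk_auto")
print(round(bulk_vs_single), round(manual_vs_auto))
```

Because the one-at-a-time group has an outlier near 72 seconds, using medians instead of means would give a somewhat different first percentage.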

Ran on Windows 7 Pro 64-bit, Python 2.7.9 32-bit, ceODBC 2.0.1, Microsoft SQL Server 11.0 SP1, R 3.1.2.

Party like it's 19999 (SAS)

2015-02-04T14:35:00.001-07:00

On 03OCT2014 I must have missed the party in Cary, NC.

data _null_;
 format date date9.;
 date = 19999;
 put date=;
run;

So the next party is 03JUN2568?

data _null_;
 format date date9.;
 date = 222222;
 put date=;
run;
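
SAS stores a date as the number of days since 01JAN1960, so both dates can be cross-checked outside SAS. Below is a minimal Python sketch. The function name sas_date_to_date is made up for this example.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0

def sas_date_to_date(days):
    """Convert a SAS date value (days since 01JAN1960) to a calendar date."""
    return SAS_EPOCH + timedelta(days=days)

print(sas_date_to_date(19999))
print(sas_date_to_date(222222))
```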

LimeSurvey is allergic to Cloudflare Rocket Loader

2015-01-29T16:26:00.001-07:00

In case you use LimeSurvey with Cloudflare, you may want to disable Rocket Loader, which "automatically asynchronously load all JavaScript resources." In LimeSurvey it causes problems saving questions (the save button does not do anything), disables tooltips for buttons in the administrative interface (so the icons are hard to interpret), and maybe causes other problems.

If you are not sure whether Rocket Loader is enabled, just look at the HTML source. If it is enabled, you will see "rocketloader" in the HTML source.

Cloudflare's Auto Minify seems safe to use.

I tested with LimeSurvey Version 2.05+ Build 141229, Firefox 35, and Google Chrome 40.
