Friday, January 29, 2016

Frequency of individual characters from SAS data set

This script counts the frequencies of individual ASCII characters in a single column in a SAS data set and then prints an easy-to-read report.

My initial motivation relates to delimiters. By default bulkloading data from Netezza to SAS (which is very fast) uses the pipe character as a delimiter, but my data set contained values with the pipe character, so this macro identifies alternative delimiters.

Another potential use is cracking a message encrypted using a simple letter substitution cipher.

To begin, this code creates an example data set courtesy of William Shakespeare.

data sonnet18;
 input line $60.;
datalines;
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed,
And every fair from fair sometime declines,
By chance, or nature's changing course untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow'st,
Nor shall death brag thou wander'st in his shade,
When in eternal lines to time thou grow'st,
   So long as men can breathe, or eyes can see,
   So long lives this, and this gives life to thee.
 ;

Next, here is the macro that counts all the printable ASCII characters in all rows of the data set and makes a new data set with total counts by ASCII character.

%macro character_histogram(dataset, column);
data histogram_tmp;
 set &dataset;
 /* Characters 32 through 126 are printable ASCII. */
 %do i = 32 %to 126;
  /* Count the number of characters in the column. */
  /* Store each count in separate column. */
  count_chr_&i = count(trim(&column), byte(&i));
 %end;
run;

/* Sum the character counts from all the rows. */
proc means data=histogram_tmp noprint;
 var count_chr_:;
 output  out=histogram_wide sum=sum/autoname;
run;

/* Clean up */
proc sql;
 drop table histogram_tmp;
quit;

/* Switch from long to wide. */
proc transpose
 data=histogram_wide(keep=count_chr:)
 out=histogram_long
 ;
run;

/* Make pretty. */
data histogram_long;
 set histogram_long;
 character_decimal = input(compress(_name_, , 'kd'), 3.);
 drop _name_;
 character = byte(character_decimal);
 rename col1=count_characters;
run;
%mend;

Finally, this code invokes the macro and prints the report.

/* Run the histogram macro  */
%character_histogram(sonnet18, line);

/* Print the final report as a table. */
proc print data=histogram_long noobs;
 var character_decimal character count_characters ;
run;

/* Barchart, what some people would call a histogram of the letters. */
proc sgplot data=histogram_long;
 hbar character/freq=count_characters;
run;

This is the final report.

character_decimal character count_characters
32   100
33 ! 0
34 " 0
35 # 0
36 $ 0
37 % 0
38 & 0
39 ' 6
40 ( 0
41 ) 0
42 * 0
43 + 0
44 , 12
45 - 0
46 . 1
47 / 0
48 0 0
49 1 0
50 2 0
51 3 0
52 4 0
53 5 0
54 6 0
55 7 0
56 8 0
57 9 0
58 : 3
59 ; 0
60 < 0
61 = 0
62 > 0
63 ? 1
64 @ 0
65 A 3
66 B 2
67 C 0
68 D 0
69 E 0
70 F 0
71 G 0
72 H 0
73 I 1
74 J 0
75 K 0
76 L 0
77 M 1
78 N 2
79 O 0
80 P 0
81 Q 0
82 R 1
83 S 4
84 T 1
85 U 0
86 V 0
87 W 1
88 X 0
89 Y 0
90 Z 0
91 [ 0
92 \ 0
93 ] 0
94 ^ 0
95 _ 0
96 ` 0
97 a 37
98 b 3
99 c 9
100 d 20
101 e 63
102 f 10
103 g 10
104 h 31
105 i 26
106 j 0
107 k 1
108 l 23
109 m 22
110 n 31
111 o 44
112 p 4
113 q 0
114 r 28
115 s 38
116 t 39
117 u 13
118 v 5
119 w 4
120 x 1
121 y 8
122 z 0
123 { 0
124 | 0
125 } 0
126 ~ 0
Histogram of individual characters made with SAS SGPLOT

This script was tested with SAS 9.4M3 on Windows 7.

No comments:

Post a Comment