Thursday, February 3, 2022

Generate random names and addresses from SAS

For testing data processing systems (e.g., CRM, record linkage), you may need to generate fake people. SAS makes it easy to generate any number of fake US residents because it ships with a data set of US ZIP codes (sashelp.zipcode) that includes the city and state for each code.

The program uses four data sets: first names, last names, street names, and US ZIP codes. Middle initials are randomly generated letters. The street numbers and names are random, so the resulting addresses probably do not exist in the given ZIP codes.
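
The program picks rows by a random ID, so it needs to know how many rows each data set has, and the row count of sashelp.zipcode may differ between SAS releases. As a minimal sketch (not part of the original program), the count could be read from the data set instead of being typed in by hand. The macro variable name mv_zip_count is a hypothetical choice.

```sas
/* Sketch: store the row count of sashelp.zipcode in a macro variable. */
/* mv_zip_count is a hypothetical name, not used in the original program. */
data _null_;
	/* "if 0" never executes SET, but NOBS= is filled in at compile time */
	if 0 then set sashelp.zipcode nobs=zip_count;
	call symputx('mv_zip_count', zip_count);
	stop;
run;

%put NOTE: sashelp.zipcode has &mv_zip_count rows;
```

The upper bound for the random ZIP code ID could then be written as &mv_zip_count rather than a fixed number.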

You could extend this in several ways:

  • Add street directions (i.e., N, S, E, W)
  • Add street post type (e.g., Dr., Ct.)
  • Add units (e.g., Apt B, Ste 101)
  • Add post office boxes and private mail boxes
  • Spell out the middle name
  • Add name prefix (e.g., Dr., Mr.)
  • Add name suffix (e.g., Jr., Sr.)


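%let mv_person_count = 10000; /* how many people to make */
%let mv_max_street_num = 20000; /* largest street number */

/* Random integer between min and max, inclusive. */
/* From https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
%macro RandBetween(min, max);
	(&min + floor((1 + &max - &min) * rand("uniform")))
%mend;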

/* https://www.ssa.gov/OACT/babynames/decades/century.html */
data first;
	format first_name $20.;
	input first_name $;
	first_name_id = _n_;
datalines;
James
Robert
John
Michael
William
David
Richard
Joseph
Thomas
Charles
Christopher
Daniel
Matthew
Anthony
Mark
Donald
Steven
Paul
Andrew
Joshua
Kenneth
Kevin
Brian
George
Edward
Ronald
Timothy
Jason
Jeffrey
Ryan
Jacob
Gary
Nicholas
Eric
Jonathan
Stephen
Larry
Justin
Scott
Brandon
Benjamin
Samuel
Gregory
Frank
Alexander
Raymond
Patrick
Jack
Dennis
Jerry
Tyler
Aaron
Jose
Adam
Henry
Nathan
Douglas
Zachary
Peter
Kyle
Walter
Ethan
Jeremy
Harold
Keith
Christian
Roger
Noah
Gerald
Carl
Terry
Sean
Austin
Arthur
Lawrence
Jesse
Dylan
Bryan
Joe
Jordan
Billy
Bruce
Albert
Willie
Gabriel
Logan
Alan
Juan
Wayne
Roy
Ralph
Randy
Eugene
Vincent
Russell
Elijah
Louis
Bobby
Philip
Johnny
Mary
Patricia
Jennifer
Linda
Elizabeth
Barbara
Susan
Jessica
Sarah
Karen
Nancy
Lisa
Betty
Margaret
Sandra
Ashley
Kimberly
Emily
Donna
Michelle
Dorothy
Carol
Amanda
Melissa
Deborah
Stephanie
Rebecca
Sharon
Laura
Cynthia
Kathleen
Amy
Shirley
Angela
Helen
Anna
Brenda
Pamela
Nicole
Emma
Samantha
Katherine
Christine
Debra
Rachel
Catherine
Carolyn
Janet
Ruth
Maria
Heather
Diane
Virginia
Julie
Joyce
Victoria
Olivia
Kelly
Christina
Lauren
Joan
Evelyn
Judith
Megan
Cheryl
Andrea
Hannah
Martha
Jacqueline
Frances
Gloria
Ann
Teresa
Kathryn
Sara
Janice
Jean
Alice
Madison
Doris
Abigail
Julia
Judy
Grace
Denise
Amber
Marilyn
Beverly
Danielle
Theresa
Sophia
Marie
Diana
Brittany
Natalie
Isabella
Charlotte
Rose
Alexis
Kayla
Homer
Marge
Bart
Lisa
Maggie
;

/* https://www.thoughtco.com/most-common-us-surnames-1422656 */
data last;
	format last_name $20.;
	input last_name $;
	last_name_id = _n_;
datalines;
Smith
Johnson
Williams
Brown
Jones
Garcia
Miller
Davis
Rodriguez
Martinez
Hernandez
Lopez
Gonzalez
Wilson
Anderson
Thomas
Taylor
Moore
Jackson
Martin
Lee
Perez
Thompson
White
Harris
Sanchez
Clark
Ramirez
Lewis
Robinson
Walker
Young
Allen
King
Wright
Scott
Torres
Nguyen
Hill
Flores
Green
Adams
Nelson
Baker
Hall
Rivera
Campbell
Mitchell
Carter
Roberts
Gomez
Phillips
Evans
Turner
Diaz
Parker
Cruz
Edwards
Collins
Reyes
Stewart
Morris
Morales
Murphy
Cook
Rogers
Gutierrez
Ortiz
Morgan
Cooper
Peterson
Bailey
Reed
Kelly
Howard
Ramos
Kim
Cox
Ward
Richardson
Watson
Brooks
Chavez
Wood
James
Bennett
Gray
Mendoza
Ruiz
Hughes
Price
Alvarez
Castillo
Sanders
Patel
Myers
Long
Ross
Foster
Jimenez
Simpson
;

/* https://www.nlc.org/resource/most-common-u-s-street-names/ */
data street;
	format street_name $20.;
	input street_name $;
	street_name_id = _n_;
datalines;
Second
Third
First
Fourth
Park
Fifth
Main
Sixth
Oak
Seventh
Pine
Maple
Cedar
Eighth
Elm
View
Washington
Ninth
Lake
Hill
Evergreen
;


data person0;
	do i = 1 to &mv_person_count;
		first_name_id = %RandBetween(1, 205);
		last_name_id = %RandBetween(1, 101);
		street_name_id = %RandBetween(1, 21);
		zip_code_id = %RandBetween(1, 40000);
		output;
	end;
	drop i;
run;

data zip;
	set sashelp.zipcode(keep=zip city statecode);
	zip_code_id = _n_;
run;

proc sql;
	create table person1 as
	select
		f.first_name,
		l.last_name,
		s.street_name,
		z.city '',
		z.statecode as state '',
		z.zip as zip_numeric ''
	from person0 as p
	join first as f on
		f.first_name_id = p.first_name_id
	join last as l on
		l.last_name_id = p.last_name_id
	join street as s on
		s.street_name_id = p.street_name_id
	join zip as z on
		z.zip_code_id = p.zip_code_id;
quit;

data person2;
	format name street city state zip $50.;
	set person1;
	initial = byte(int(65+26*ranuni(0)));
	name = catx(' ', first_name, initial, last_name);
	/* RandBetween from https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
	street_num = put(%RandBetween(1,&mv_max_street_num),10.);
	street = catx(' ',street_num,street_name);
	zip = put(zip_numeric, z5.);
	drop zip_numeric street_name street_num first_name initial last_name;
run;

Example output table with ten randomly generated fake people:

name               street          city         state  zip
Steven I Murphy    8206 Fifth      Frankfort    KY     40619
Mary M Williams    5076 Seventh    Evensville   TN     37332
Jeffrey Y Lopez    3485 Third      Henning      IL     61848
Richard Z Sanders  5500 Sixth      Kimball      NE     69145
Russell M Smith    16425 Sixth     Lexington    KY     40515
Johnny R Carter    11949 Eighth    Mount Hope   OH     44660
Raymond V Green    4659 Park       West Helena  AR     72390
Megan N Anderson   8437 Third      Chico        CA     95927
Isabella A Ross    8151 Evergreen  Barstow      MD     20610
Sharon Q Flores    3022 Lake       Poth         TX     78147
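
Two of the extensions suggested earlier, street directions and street post types, could be sketched roughly as follows. This is a self-contained illustration, not a change to the program above: the data set name address_demo, the seed, and the short lists of directions and post types are all assumptions, and the street name is fixed to Main to keep the example short.

```sas
/* Sketch: random street direction and post type (hypothetical lists). */
data address_demo;
	call streaminit(2022); /* fixed seed so the sketch is repeatable */
	array dir[4] $1 _temporary_ ('N', 'S', 'E', 'W');
	array post_type[5] $4 _temporary_ ('St', 'Ave', 'Dr', 'Ct', 'Blvd');
	length street $50;
	do i = 1 to 5;
		/* rand('uniform') is strictly between 0 and 1, */
		/* so ceil(n * rand('uniform')) is an integer from 1 to n */
		street_num = ceil(20000 * rand('uniform'));
		street = catx(' ',
			street_num,
			dir[ceil(dim(dir) * rand('uniform'))],
			'Main',
			post_type[ceil(dim(post_type) * rand('uniform'))]);
		output;
	end;
	drop i street_num;
run;
```

In the full program, the same two array lookups could be added to the catx call that builds street in the last DATA step.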

Want to generate people's names with Python instead of SAS? See combine_people_names.py, which generates random people from Wikidata biographies.
