For testing data processing systems (e.g., CRM, record linkage), you may need to generate fake people. SAS makes it easy to generate any number of fake US residents because it ships with a data set of US zip codes (sashelp.zipcode) that includes the city and state for each zip code.
The system uses four data sets: first names, last names, street names, and US zip codes. Each part is picked at random and independently of the others, and the middle initial is a random letter from A to Z. As a result, the street addresses probably do not exist in the given zip codes.
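The random initial can be shown in isolation. This is a minimal sketch, assuming an ASCII host where character codes 65 to 90 are the letters A to Z. The data set name `initials_demo` and the seed are made up for illustration, and it uses `rand("uniform")` with `call streaminit` instead of the `ranuni` function in the main program.

```sas
/* Sketch only: draw five random uppercase initials. */
data initials_demo;
   call streaminit(2024);   /* arbitrary seed, so the draws are repeatable */
   do i = 1 to 5;
      /* rand("uniform") lies strictly between 0 and 1,
         so code is an integer from 65 to 90 */
      code = 65 + floor(26 * rand("uniform"));
      initial = byte(code);  /* 65 -> A, ..., 90 -> Z on ASCII hosts */
      output;
   end;
   drop i;
run;

proc print data=initials_demo;
run;
```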
You could extend this to:
- Add street directions (e.g., N, S, E, W)
- Add street post type (e.g., Dr., Ct.)
- Add units (e.g., Apt B, Ste 101)
- Add post office boxes and private mail boxes
- Spell out the middle name
- Add name prefix (e.g., Dr., Mr.)
- Add name suffix (e.g., Jr., Sr.)
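The first two extensions could be sketched as follows. This is an illustration, not part of the original program: the data set `street_demo`, the arrays `dirs` and `types`, and the short lists of directions and street types are all assumptions, and the three sample addresses are typed in directly so the step runs on its own.

```sas
/* Sketch only: add a direction and a street type to a street address. */
data street_demo;
   length street $50;
   /* Illustrative lookup lists; extend them as needed. */
   array dirs[4]  $1 _temporary_ ('N' 'S' 'E' 'W');
   array types[4] $4 _temporary_ ('St.' 'Ave.' 'Dr.' 'Ct.');
   input street_num street_name :$20.;
   /* 1 + floor(4 * u) is a uniform random integer from 1 to 4 */
   direction   = dirs[1 + floor(4 * rand("uniform"))];
   street_type = types[1 + floor(4 * rand("uniform"))];
   street = catx(' ', street_num, direction, street_name, street_type);
   datalines;
8206 Fifth
5076 Seventh
3485 Third
;
run;
```

The same pattern (a small lookup array plus a random index) would also work for name prefixes and suffixes. For those, you would probably want to leave most people without one, for example by drawing a second random number and only adding the suffix when it falls below some threshold.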
```sas
%let mv_person_count = 10000;   /* how many people to make */
%let mv_max_street_num = 20000; /* largest street number */

/* Random integer between min and max, inclusive. The original listing calls
   this macro without defining it; this definition is assumed to match
   https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
%macro RandBetween(min, max);
   (&min + floor((1 + &max - &min) * rand("uniform")))
%mend;

/* https://www.ssa.gov/OACT/babynames/decades/century.html */
/* The trailing @@ lets INPUT read several names from each data line. */
data first;
   format first_name $20.;
   input first_name $ @@;
   first_name_id = _n_;
   datalines;
James Robert John Michael William David Richard Joseph Thomas Charles
Christopher Daniel Matthew Anthony Mark Donald Steven Paul Andrew Joshua
Kenneth Kevin Brian George Edward Ronald Timothy Jason Jeffrey Ryan
Jacob Gary Nicholas Eric Jonathan Stephen Larry Justin Scott Brandon
Benjamin Samuel Gregory Frank Alexander Raymond Patrick Jack Dennis Jerry
Tyler Aaron Jose Adam Henry Nathan Douglas Zachary Peter Kyle
Walter Ethan Jeremy Harold Keith Christian Roger Noah Gerald Carl
Terry Sean Austin Arthur Lawrence Jesse Dylan Bryan Joe Jordan
Billy Bruce Albert Willie Gabriel Logan Alan Juan Wayne Roy
Ralph Randy Eugene Vincent Russell Elijah Louis Bobby Philip Johnny
Mary Patricia Jennifer Linda Elizabeth Barbara Susan Jessica Sarah Karen
Nancy Lisa Betty Margaret Sandra Ashley Kimberly Emily Donna Michelle
Dorothy Carol Amanda Melissa Deborah Stephanie Rebecca Sharon Laura Cynthia
Kathleen Amy Shirley Angela Helen Anna Brenda Pamela Nicole Emma
Samantha Katherine Christine Debra Rachel Catherine Carolyn Janet Ruth Maria
Heather Diane Virginia Julie Joyce Victoria Olivia Kelly Christina Lauren
Joan Evelyn Judith Megan Cheryl Andrea Hannah Martha Jacqueline Frances
Gloria Ann Teresa Kathryn Sara Janice Jean Alice Madison Doris
Abigail Julia Judy Grace Denise Amber Marilyn Beverly Danielle Theresa
Sophia Marie Diana Brittany Natalie Isabella Charlotte Rose Alexis Kayla
Homer Marge Bart Lisa Maggie
;

/* https://www.thoughtco.com/most-common-us-surnames-1422656 */
data last;
   format last_name $20.;
   input last_name $ @@;
   last_name_id = _n_;
   datalines;
Smith Johnson Williams Brown Jones Garcia Miller Davis Rodriguez Martinez
Hernandez Lopez Gonzales Wilson Anderson Thomas Taylor Moore Jackson Martin
Lee Perez Thompson White Harris Sanchez Clark Ramirez Lewis Robinson
Walker Young Allen King Wright Scott Torres Nguyen Hill Flores
Green Adams Nelson Baker Hall Rivera Campbell Mitchell Carter Roberts
Gomez Phillips Evans Turner Diaz Parker Cruz Edwards Collins Reyes
Stewart Morris Morales Murphy Cook Rogers Gutierrez Ortiz Morgan Cooper
Peterson Bailey Reed Kelly Howard Ramos Kim Cox Ward Richardson
Watson Brooks Chavez Wood James Bennet Gray Mendoza Ruiz Hughes
Price Alvarez Castillo Sanders Patel Myers Long Ross Foster Jimenez
Simpson
;

/* https://www.nlc.org/resource/most-common-u-s-street-names/ */
data street;
   format street_name $20.;
   input street_name $ @@;
   street_name_id = _n_;
   datalines;
Second Third First Fourth Park Fifth Main Sixth Oak Seventh
Pine Maple Cedar Eighth Elm View Washington Ninth Lake Hill
Evergreen
;

/* Pick a random row ID from each lookup table for every person.
   The upper bounds are the row counts of the tables above (205 first names,
   101 last names, 21 streets); 40000 assumes sashelp.zipcode has at least
   that many rows. */
data person0;
   do i = 1 to &mv_person_count;
      first_name_id  = %RandBetween(1, 205);
      last_name_id   = %RandBetween(1, 101);
      street_name_id = %RandBetween(1, 21);
      zip_code_id    = %RandBetween(1, 40000);
      output;
   end;
   drop i;
run;

/* Number the zip code rows so they can be joined by ID. */
data zip;
   set sashelp.zipcode(keep=zip city statecode);
   zip_code_id = _n_;
run;

/* Look up the names, street, and zip code for each person.
   The '' after a column clears its label. */
proc sql;
   create table person1 as
   select
      f.first_name,
      l.last_name,
      s.street_name,
      z.city '',
      z.statecode as state '',
      z.zip as zip_numeric ''
   from person0 as p
   join first  as f on f.first_name_id  = p.first_name_id
   join last   as l on l.last_name_id   = p.last_name_id
   join street as s on s.street_name_id = p.street_name_id
   join zip    as z on z.zip_code_id    = p.zip_code_id;
quit;

/* Assemble the full name and street address. */
data person2;
   format name street city state zip $50.;
   set person1;
   initial = byte(int(65+26*ranuni(0))); /* random letter A-Z */
   name = catx(' ', first_name, initial, last_name);
   /* RandBetween from https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
   street_num = put(%RandBetween(1,&mv_max_street_num),10.);
   street = catx(' ',street_num,street_name);
   zip = put(zip_numeric, z5.); /* keep leading zeros */
   drop zip_numeric street_name street_num first_name initial last_name;
run;
```
Example output table with ten randomly generated fake people:
name | street | city | state | zip |
---|---|---|---|---|
Steven I Murphy | 8206 Fifth | Frankfort | KY | 40619 |
Mary M Williams | 5076 Seventh | Evensville | TN | 37332 |
Jeffrey Y Lopez | 3485 Third | Henning | IL | 61848 |
Richard Z Sanders | 5500 Sixth | Kimball | NE | 69145 |
Russell M Smith | 16425 Sixth | Lexington | KY | 40515 |
Johnny R Carter | 11949 Eighth | Mount Hope | OH | 44660 |
Raymond V Green | 4659 Park | West Helena | AR | 72390 |
Megan N Anderson | 8437 Third | Chico | CA | 95927 |
Isabella A Ross | 8151 Evergreen | Barstow | MD | 20610 |
Sharon Q Flores | 3022 Lake | Poth | TX | 78147 |
Want to generate people's names with Python instead of SAS? See combine_people_names.py, which generates random people from Wikidata biographies.