Soundex (Sed)
From LiteratePrograms
Contents |
Theory
The | Soundex | algorithm | is | lossy | but | useful | for | hashing |
T00000 | S53200 | A42635 | I20000 | L20000 | B30000 | U21400 | F60000 | H25200 |
As Soundex was originally developed ca. 1920, it probably (when automated) ran on card processing hardware significantly less powerful than sed
.
Practice
We perform a direct transcription of the algorithm given in Patterson's Soundex (Rexx) implementation.
Logical processing
First, a straightforward lookup table, with y
<<convert characters to numerics>>= #/* (1) Convert characters to numerics using table: */ #/* 0 = A,E,H,I,O,U,W,Y */ #/* 1 = B,F,P,V */ #/* 2 = C,G,J,K,Q,S,X,Z */ #/* 3 = D,T */ #/* 4 = L */ #/* 5 = M,N */ #/* 6 = R */ y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/01230120022455012623010202/
Now for a couple of iterated transformations -- we use : and t to apply the substitutions up to a fixpoint.
<<make multiple digits single>>= #/* (2) Make multiple digits single. */ :setify s/\(.\)\1/\1/ tsetify <<remove trailing zeroes>>= #/* (3) Remove zeroes after first position. */ :denull s/\(.\)0/\1/ tdenull
Question 1. if f is an operator applying an injective map, and f2 is the map applied twice, what is ?
Exercise 2. note that denull
commutes with setify
; furthermore, the individual steps of each commute with each other. Rewrite this section to combine the loops.
Question 3. what is the minimum number of loops needed in an iterative program?
Physical processing
Then we pad or truncate to produce a 6 digit code.
<<pad with zeroes to 6 chars>>= #/* (4) Fill on right with zeroes to make 6 characters. */ s/$/000000/ s/\(......\).*/\1/
Finally we restore the leading character. Assuming we have a copy of the alphabetical input in the hold space, we can append it to the pattern space with g. The replacement then becomes a simple matter of taking the tail of the numeric string and appending it to the head of the alphabetic:
<<restore first character>>= #/* (5) Replace first digit with first character of name. */ G s/.\(.....\).*\n\(.\).*/\2\1/
Wrapping up
In order to use a copy of the input from the hold space, we must first have stashed it there with the h command. Also, we accept lowercased input while stripping punctuation and other non-alphabetics.
<<soundex.sed>>= y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/ s/[^A-Z]//g h convert characters to numerics make multiple digits single remove trailing zeroes pad with zeroes to 6 chars restore first character
Download code |