Soundex (Sed)

From LiteratePrograms

Jump to: navigation, search
Other implementations: C | Rexx | Sed

Contents

Theory

The Soundex algorithm is lossy but useful for hashing
T00000 S53200 A42635 I20000 L20000 B30000 U21400 F60000 H25200

As Soundex was originally developed ca. 1920, it probably (when automated) ran on card processing hardware significantly less powerful than sed.

Practice

We perform a direct transcription of the algorithm given in Patterson's Soundex (Rexx) implementation.

Logical processing

First, a straightforward lookup table, with y

<<convert characters to numerics>>=
#/* (1) Convert characters to numerics using table:                    */
#/*     0 = A,E,H,I,O,U,W,Y                                            */
#/*     1 = B,F,P,V                                                    */
#/*     2 = C,G,J,K,Q,S,X,Z                                            */
#/*     3 = D,T                                                        */
#/*     4 = L                                                          */
#/*     5 = M,N                                                        */
#/*     6 = R                                                          */
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/01230120022455012623010202/

Now for a couple of iterated transformations -- we use : and t to apply the substitutions up to a fixpoint.

<<make multiple digits single>>=
#/* (2) Make multiple digits single.                                   */
:setify
s/\(.\)\1/\1/
tsetify
<<remove trailing zeroes>>=
#/* (3) Remove zeroes after first position.                            */
:denull
s/\(.\)0/\1/
tdenull

Question 1. if f is an operator applying an injective map, and f2 is the map applied twice, what is ?

Exercise 2. note that denull commutes with setify; furthermore, the individual steps of each commute with each other. Rewrite this section to combine the loops.

Question 3. what is the minimum number of loops needed in an iterative program?

Physical processing

Then we pad or truncate to produce a 6 digit code.

<<pad with zeroes to 6 chars>>=
#/* (4) Fill on right with zeroes to make 6 characters.                */
s/$/000000/
s/\(......\).*/\1/

Finally we restore the leading character. Assuming we have a copy of the alphabetical input in the hold space, we can append it to the pattern space with g. The replacement then becomes a simple matter of taking the tail of the numeric string and appending it to the head of the alphabetic:

<<restore first character>>=
#/* (5) Replace first digit with first character of name.              */
G
s/.\(.....\).*\n\(.\).*/\2\1/

Wrapping up

In order to use a copy of the input from the hold space, we must first have stashed it there with the h command. Also, we accept lowercased input while stripping punctuation and other non-alphabetics.

<<soundex.sed>>=
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
s/[^A-Z]//g
h
convert characters to numerics
make multiple digits single
remove trailing zeroes
pad with zeroes to 6 chars
restore first character
Download code
Views