Hash function comparison (C, sh)
From LiteratePrograms
This program is under development.
Please help to debug it. When debugging
is complete, remove the {{develop}} tag.
In this article, we compare how some simple hash functions perform. More specifically, we compare the number of collisions for different hash table sizes and the speed.
Contents |
The program
This program will read a list of words, one for each line, from stdin and print the corresponding hash values. The hash function and table size are specified as command line arguments.
<<hashcmp.c>>= #include<stdio.h> #include<string.h> #include<stdlib.h> typedef unsigned long hash_t; hashfuncs struct hashfunc_s { const char *name; hash_t (*func)(const unsigned char *key); } funcs[]={ {"add", add_hash}, {"xor", xor_hash}, {"sax", sax_hash}, {"sdbm", sdbm_hash}, {"bernstein", bernstein_hash} }; int main(int argc, char *argv[]) { char line[256]; char *p; hash_t (*hash)(const unsigned char *key)=NULL; hash_t hsize; int i; if(argc>1) { for(i=0; i<sizeof funcs/sizeof(struct hashfunc_s); ++i) { if(!strcmp(argv[1], funcs[i].name)) { hash=funcs[i].func; break; } } if(!funcs) { fprintf(stderr, "ERROR: Could not find the %s hash function\n", argv[1]); exit(EXIT_FAILURE); } } else hash=funcs[0].func; if(argc>2) hsize=atol(argv[2]); else hsize=32768; while(fgets(line, 256, stdin)) { if((p=strchr(line, '\r'))) *p='\0'; if((p=strchr(line, '\n'))) *p='\0'; printf("%lu\n", hash((unsigned char *)line)%hsize); } return EXIT_SUCCESS; }
<<Makefile>>= all: hashcmp hashcmp: hashcmp.c cc -o hashcmp -Wall -ansi -pedantic hashcmp.c
Functions
The hash functions take one argument, a '\0'-terminated string, and return the calculated hash value. This value is later redused to the hash table size before it is printed.
Additive
This function simply sums up the values of each character in the string.
<<add_hash>>= hash_t add_hash(const unsigned char *key) { hash_t h=0; while(*key) h+=*key++; return h; }
XOR
This function applies bitwise exclusive-or on all the characters. Note that the generated hash value will never be larger than UCHAR_MAX (normally 255) with this method. If key only consists of ASCII values, it will be maximum 127. So this function is a bad choice for large hash tables.
<<xor_hash>>= hash_t xor_hash(const unsigned char *key) { hash_t h=0; while(*key) h^=*key++; return h; }
Shift-add-XOR
This function combines the first two functions (add and XOR) with a shift operation.
<<sax_hash>>= hash_t sax_hash(const unsigned char *key) { hash_t h=0; while(*key) h^=(h<<5) + (h>>2) + *key++; return h; }
SDBM
This is an implementation of the [sdbm hash function].
<<sdbm_hash>>= hash_t sdbm_hash(const unsigned char *key) { hash_t h=0; while(*key) h=*key++ + (h<<6) + (h<<16) - h; return h; }
Bernstein
This is an efficient hash function created by Daniel J. Bernstein.
<<bernstein_hash>>= hash_t bernstein_hash(const unsigned char *key) { hash_t h=0; while(*key) h=33*h + *key++; return h; }
<<hashfuncs>>= add_hash xor_hash sax_hash sdbm_hash bernstein_hash
Collisions
To check for collisions, we use the following shell script. The file words is created by running shuffle -f /usr/share/dict/words >words
on a NetBSD system. The generated file can be downloaded here.
If the parameter max is specified, the output is the size of the "biggest" collision (most keys giving the same hash value). With the avg parameter, the average collission size is calculated. Without a parameter, the output is the total number of keys involved in collsions.
<<collcmp.sh>>= #!/bin/sh case "$1" in "max") AWKCMD='{if($1>max) max=$1} END {print max?max:0}' FMT=" %7s" ;; "avg") AWKCMD='{sum+=$1} END {print sum/NR}' FMT=" % 7.1f" ;; *) AWKCMD='{if($1>1) ncoll+=$1} END {print ncoll?ncoll:0}' FMT=" %7s" esac echo " func:size 10 50 100 500 1000 5000 10000 50000 100000 200000" for hashfunc in add xor sax sdbm bernstein ; do printf "% 10s:" "$hashfunc" for size in 10 50 100 500 1000 5000 10000 50000 100000 200000; do head -n $size <words | ./hashcmp $hashfunc $size | sort -n | uniq -c | awk "$AWKCMD" | xargs printf "$FMT" done echo done
Output
./collcmp.sh
func:size 10 50 100 500 1000 5000 10000 50000 100000 200000 add: 7 27 64 317 700 4639 9719 49799 99823 199810 xor: 7 34 84 478 980 5000 10000 50000 100000 200000 sax: 4 28 71 311 623 3175 6321 31651 63180 126568 sdbm: 4 30 56 317 627 3218 6239 31537 63242 126320 bernstein: 6 31 67 309 608 3164 6312 31835 63093 126034
./collcmp.sh avg
func:size 10 50 100 500 1000 5000 10000 50000 100000 200000 add: 2.0 1.5 1.6 1.6 1.8 4.2 7.1 27.8 51.5 95.6 xor: 1.7 1.8 2.4 5.3 8.8 39.1 78.1 390.6 781.2 1562.5 sax: 1.2 1.5 1.8 1.6 1.6 1.6 1.6 1.6 1.6 1.6 sdbm: 1.2 1.5 1.5 1.6 1.6 1.6 1.6 1.6 1.6 1.6 bernstein: 1.4 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6
./collcmp.sh max
func:size 10 50 100 500 1000 5000 10000 50000 100000 200000 add: 4 4 5 6 8 22 34 148 277 543 xor: 3 4 7 16 24 89 167 768 1488 2915 sax: 2 3 4 5 5 7 8 7 8 8 sdbm: 2 4 3 4 6 6 6 8 7 7 bernstein: 2 4 4 6 6 6 6 6 8 10
Speed
The following script compares the time usage for the different hash functions.
<<timecmp.sh>>= #!/bin/sh cat words words words words words words words words words words >words10 for hashfunc in add xor sax sdbm bernstein ; do printf "% 10s: " "$hashfunc" time ./hashcmp $hashfunc <words10 >/dev/null done
Output
add: 5.04 real 4.88 user 0.05 sys xor: 4.06 real 3.93 user 0.05 sys sax: 6.16 real 5.90 user 0.08 sys sdbm: 6.08 real 5.87 user 0.09 sys bernstein: 6.14 real 5.92 user 0.09 sys
External links
- The Art of Hashing
== JAVA CODE ===
//sdbm hash function
public int hashFunction(String s){ int hash=0; for(int i=0; i<s.length(); i++){ hash = s.charAt(i) + (hash << 6) + (hash << 16) - hash; } return Math.abs(hash%p); }
public void insert(String word){ if(contains(word))return; int index = hashFunction(word); Element m = new Element(word, null); if(tabel[index]==null)tabel[index]=m; else { Element n = tabel[index]; while(n.next!=null){ n=n.next; } n.next=m; } }
public void delete(String word){ if(!contains(word))return; int index = hashFunction(word); Element e = tabel[index]; if(word.equals(e.word))tabel[index]=null; if(e.next!=null && word.equals(e.next.word)) { e.next=e.next.next; return; }
while(e.next!=null){ e=e.next; if(e.next!=null && word.equals(e.next.word)) { e.next=e.next.next; return; } } }
public boolean contains(String word){ int index = hashFunction(word); if(tabel[index]==null)return false; Element e = tabel[index]; while(e!=null){ if(word.equals(e.word))return true; e=e.next; } return false; }
public int indexOf(String word){ int index = -1; if(contains(word))index = hashFunction(word); return index; } }
Download code |