Hash function comparison (C, sh)

From LiteratePrograms

Jump to: navigation, search

This program is under development.
Please help to debug it. When debugging
is complete, remove the {{develop}} tag.


In this article, we compare how some simple hash functions perform. More specifically, we compare the number of collisions for different hash table sizes and the speed.

Contents

The program

This program will read a list of words, one for each line, from stdin and print the corresponding hash values. The hash function and table size are specified as command line arguments.

<<hashcmp.c>>=
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
typedef unsigned long hash_t;
hashfuncs
struct hashfunc_s {
	const char *name;
	hash_t (*func)(const unsigned char *key);
} funcs[]={
	{"add", add_hash},
	{"xor", xor_hash},
	{"sax", sax_hash},
	{"sdbm", sdbm_hash},
	{"bernstein", bernstein_hash}
};
int main(int argc, char *argv[])
{
	char line[256];
	char *p;
	hash_t (*hash)(const unsigned char *key)=NULL;
	hash_t hsize;
	int i;
	if(argc>1) {
		for(i=0; i<sizeof funcs/sizeof(struct hashfunc_s); ++i) {
			if(!strcmp(argv[1], funcs[i].name)) {
				hash=funcs[i].func;
				break;
			}
		}
		if(!funcs) {
			fprintf(stderr, "ERROR: Could not find the %s hash function\n", argv[1]);
			exit(EXIT_FAILURE);
		}
	} else hash=funcs[0].func;
	if(argc>2) hsize=atol(argv[2]);
	else hsize=32768;
	while(fgets(line, 256, stdin)) {
		if((p=strchr(line, '\r'))) *p='\0';
		if((p=strchr(line, '\n'))) *p='\0';
		printf("%lu\n", hash((unsigned char *)line)%hsize);
	}
	return EXIT_SUCCESS;
}
<<Makefile>>=
all: hashcmp

hashcmp: hashcmp.c
	cc -o hashcmp -Wall -ansi -pedantic hashcmp.c

Functions

The hash functions take one argument, a '\0'-terminated string, and return the calculated hash value. This value is later redused to the hash table size before it is printed.

Additive

This function simply sums up the values of each character in the string.

<<add_hash>>=
hash_t add_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h+=*key++;
	return h;
}

XOR

This function applies bitwise exclusive-or on all the characters. Note that the generated hash value will never be larger than UCHAR_MAX (normally 255) with this method. If key only consists of ASCII values, it will be maximum 127. So this function is a bad choice for large hash tables.

<<xor_hash>>=
hash_t xor_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h^=*key++;
	return h;
}

Shift-add-XOR

This function combines the first two functions (add and XOR) with a shift operation.

<<sax_hash>>=
hash_t sax_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h^=(h<<5) + (h>>2) + *key++;
	return h;
}

SDBM

This is an implementation of the [sdbm hash function].

<<sdbm_hash>>=
hash_t sdbm_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h=*key++ + (h<<6) + (h<<16) - h;
	return h;
}

Bernstein

This is an efficient hash function created by Daniel J. Bernstein.

<<bernstein_hash>>=
hash_t bernstein_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h=33*h + *key++;
	return h;
}


<<hashfuncs>>=
add_hash
xor_hash
sax_hash
sdbm_hash
bernstein_hash

Collisions

To check for collisions, we use the following shell script. The file words is created by running shuffle -f /usr/share/dict/words >words on a NetBSD system. The generated file can be downloaded here.

If the parameter max is specified, the output is the size of the "biggest" collision (most keys giving the same hash value). With the avg parameter, the average collission size is calculated. Without a parameter, the output is the total number of keys involved in collsions.

<<collcmp.sh>>=
#!/bin/sh
case "$1" in
	"max")
		AWKCMD='{if($1>max) max=$1} END {print max?max:0}'
		FMT=" %7s"
	;;
	"avg")
		AWKCMD='{sum+=$1} END {print sum/NR}'
		FMT=" % 7.1f"
	;;
	*)
		AWKCMD='{if($1>1) ncoll+=$1} END {print ncoll?ncoll:0}'
		FMT=" %7s"
esac
echo "      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000"
for hashfunc in add xor sax sdbm bernstein ; do
	printf "% 10s:" "$hashfunc"
	for size in 10 50 100 500 1000 5000 10000 50000 100000 200000; do
		head -n $size <words |
			./hashcmp $hashfunc $size |
			sort -n | 
			uniq -c | 
			awk "$AWKCMD" |
			xargs printf "$FMT"
	done
	echo
done

Output

./collcmp.sh
      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000
       add:       7      27      64     317     700    4639    9719   49799   99823  199810
       xor:       7      34      84     478     980    5000   10000   50000  100000  200000
       sax:       4      28      71     311     623    3175    6321   31651   63180  126568
      sdbm:       4      30      56     317     627    3218    6239   31537   63242  126320
 bernstein:       6      31      67     309     608    3164    6312   31835   63093  126034
./collcmp.sh avg
      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000
       add:     2.0     1.5     1.6     1.6     1.8     4.2     7.1    27.8    51.5    95.6
       xor:     1.7     1.8     2.4     5.3     8.8    39.1    78.1   390.6   781.2  1562.5
       sax:     1.2     1.5     1.8     1.6     1.6     1.6     1.6     1.6     1.6     1.6
      sdbm:     1.2     1.5     1.5     1.6     1.6     1.6     1.6     1.6     1.6     1.6
 bernstein:     1.4     1.6     1.6     1.6     1.6     1.6     1.6     1.6     1.6     1.6
./collcmp.sh max
      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000
       add:       4       4       5       6       8      22      34     148     277     543
       xor:       3       4       7      16      24      89     167     768    1488    2915
       sax:       2       3       4       5       5       7       8       7       8       8
      sdbm:       2       4       3       4       6       6       6       8       7       7
 bernstein:       2       4       4       6       6       6       6       6       8      10

Speed

The following script compares the time usage for the different hash functions.

<<timecmp.sh>>=
#!/bin/sh
cat words words words words words words words words words words >words10
for hashfunc in add xor sax sdbm bernstein ; do
	printf "% 10s: " "$hashfunc"
	time ./hashcmp $hashfunc <words10 >/dev/null
done

Output

       add:         5.04 real         4.88 user         0.05 sys
       xor:         4.06 real         3.93 user         0.05 sys
       sax:         6.16 real         5.90 user         0.08 sys
      sdbm:         6.08 real         5.87 user         0.09 sys
 bernstein:         6.14 real         5.92 user         0.09 sys

External links

  • The Art of Hashing


  == JAVA CODE ===


 //sdbm hash function

public int hashFunction(String s){ int hash=0; for(int i=0; i<s.length(); i++){ hash = s.charAt(i) + (hash << 6) + (hash << 16) - hash; } return Math.abs(hash%p); }

public void insert(String word){ if(contains(word))return; int index = hashFunction(word); Element m = new Element(word, null); if(tabel[index]==null)tabel[index]=m; else { Element n = tabel[index]; while(n.next!=null){ n=n.next; } n.next=m; } }

public void delete(String word){ if(!contains(word))return; int index = hashFunction(word); Element e = tabel[index]; if(word.equals(e.word))tabel[index]=null; if(e.next!=null && word.equals(e.next.word)) { e.next=e.next.next; return; }

while(e.next!=null){ e=e.next; if(e.next!=null && word.equals(e.next.word)) { e.next=e.next.next; return; } } }

public boolean contains(String word){ int index = hashFunction(word); if(tabel[index]==null)return false; Element e = tabel[index]; while(e!=null){ if(word.equals(e.word))return true; e=e.next; } return false; }

public int indexOf(String word){ int index = -1; if(contains(word))index = hashFunction(word); return index; } }

Download code
Views
Personal tools