Huffman coding (Python)

From LiteratePrograms

Jump to: navigation, search
Other implementations: C++ | Java | Python

Given an arbitrary set of symbols (the english alphabet is the example that will be used here), Huffman coding is a way of creating the most efficient (smallest) binary code for that set of symbols. It exploits the fact that in most data some symbols occur more frequently than others and by shortening the code for that symbol, space can be saved. There are other interesting properties of Huffman codes as well as some "gotchas" with respect to their use, so the above link is worth reading.

The algorithm itself is fairly simple; it's a special method of building a binary tree. Initially, one starts with a set of nodes, each consisting of a symbol and its associated frequency. When the tree is finished, each of these nodes will be a leaf in the tree. The algorithm then selects the two least frequent nodes, and creates a parent node with a frequency equal to the sum of the children's frequencies, and replaces the children with the parent (the parent can be thought of as representing the probabilities of either child occurring). This is done until there is only one node remaining in the set, and at that point the binary tree is complete.

We will use tuples of the form (frequency, symbol) for the leaf nodes and (frequency, child1, child2) for nodes internal to the tree. The reason for the odd order of values is that in a sorted list, tuples are lexicographically sorted; this makes the code much simpler, as we do not have to write any special code to sort the nodes in order of frequency.

Because Python uses pass by reference function call semantics for objects (including lists), we're first going to make a copy, so as not to mangle the original.

<<Huffman tree builder>>=
def makeHuffTree(symbolTupleList):
   trees = list(symbolTupleList)

At this point, we can implement the actual tree construction. While there is more than one item on the list, we simply combine the two lowest frequency nodes. To efficiently do this, we will use a priority queue. which allows us to extract the minimal element very efficiently. There is a Python library that implements priority queues with binary heaps, so we will include it:

import heapq

Again, we pop the two least frequent children, sum their frequencies, pack them into a parent, and then put that parent back into the queue with the new frequency. This is continued until there is only one element left in the list.

<<Huffman tree builder>>=
   while len(trees) > 1:
      childR, childL = heapq.heappop(trees), heapq.heappop(trees)
      parent = (childL[0] + childR[0], childL, childR)
      heapq.heappush(trees, parent)

With the tree entirely built, we can just return the single item in the list that contains the tree:

<<Huffman tree builder>>=
   return trees[0]

Now to generate the actual binary codes, we traverse the tree until we reach one of the symbols while tracking the path taken through the tree. By convention, '0' represents following the left child and '1' following the right. We will make our print out these codes, but it is equally easy to drop them into whatever data structure fits your needs.

We'll use recursion, as it is easier to reason about for tree traversal. Our base case for traversal is when we hit a leaf node (which has only 2 elements). We will also use a variable prefix which contains the path taken through the tree (a.k.a. the Huffman code).

<<Huffman tree printer>>=
def printHuffTree(huffTree, prefix = ''):
   if len(huffTree) == 2:
      print huffTree[1], prefix

Otherwise, we'll simply continue on down the children, noting whether we went right or left.

<<Huffman tree printer>>=
      printHuffTree(huffTree[1], prefix + '0')
      printHuffTree(huffTree[2], prefix + '1')

Here are some example letter frequencies for the English language to use as input.

<<Example data>>=
exampleData = [
  (0.124167  , 'e'),   
  (0.0969225 , 't'),   
  (0.0820011 , 'a'),   
  (0.0768052 , 'i'),   
  (0.0764055 , 'n'),   
  (0.0714095 , 'o'),   
  (0.0706768 , 's'),   
  (0.0668132 , 'r'),   
  (0.0448308 , 'l'),   
  (0.0363709 , 'd'),   
  (0.0350386 , 'h'),   
  (0.0344391 , 'c'),   
  (0.028777  , 'u'),   
  (0.0281775 , 'm'),   
  (0.0235145 , 'f'),   
  (0.0203171 , 'p'),   
  (0.0189182 , 'y'),   
  (0.0181188 , 'g'),   
  (0.0135225 , 'w'),   
  (0.0124567 , 'v'),   
  (0.0106581 , 'b'),   
  (0.00393019, 'k'),   
  (0.00219824, 'x'),   
  (0.0019984 , 'j'),   
  (0.0009325 , 'q'),   
  (0.000599  , 'z')   

To tie it all together, here's a program to print out the Huffman codes, given those symbols and frequencies:

Huffman tree builder
Huffman tree printer
Example data
if __name__ == '__main__':
   huffTree = makeHuffTree(exampleData)

The output of this program for the example data is:

a 0000
i 0001
n 0010
y 001100
g 001101
d 00111
o 0100
s 0101
h 01100
c 01101
r 0111
e 100
u 10100
m 10101
w 101100
v 101101
f 10111
t 110
l 1110
p 11110
b 111110
j 111111000
q 1111110010
z 1111110011
x 11111101
k 1111111

Note the extremely short encodings for common letters like 'e' and 't', and the longer encodings for uncommon letters like 'q' and 'z'.

Download code