Lossless compression (Python)
From LiteratePrograms
Lossless (de)compression.
Contents |
theory
We implement LZ78, an example of lossless compression via a dictionary coder. Modulo the instantiation of the dictionary, accomplished here via side effect, both encoding and decoding are anamorphic reductions.
implementation
In the compressed domain, the data is represented as a sequence of tokens, consisting of a word already in the dictionary followed by a novel character.
decoding
The decoder is straightforward: for each token, it produces the concatenation of the dictionary entry for the word and the subsequent character.
<<decompress>>= [dodict(d,d[w]+c) for (w,c) in ts]
En passant, it enters the token produced into the dictionary.
<<dodictd>>= dodict = lambda d,v: d.__setitem__(len(d),v) or v
We need only initialize the dictionary and arrange to join the resulting list of decoded substrings.
<<decoder>>= def decode(ts): d,j = {0:''}, ''.join dodictd return j(decompress)
Question: in what monad are we computing?
encoding
The encoder is more complex. For each character, it checks the token pairing it with the word coding for the characters already seen. If this token is already in the dictionary, the information would be redundant for the decoder, so the token is only produced when it is novel.
<<compress>>= [tok for c in cs for tok in [(w,c)] for w in [dodict(d,tok)] if not w] + [(w,'')]
Question: Why do we include the final (w,")
term?
Exercise: modify the entire expression so last term of the final output uses the final character of the input instead of potentially being (0,")
In checking a token, we either return its codeword, or reset the codeword to 0, and, en passant, enter the token into the dictionary.
<<dodicte>>= dodict = lambda d,k: d.get(k) or d.__setitem__(k,len(d)) or 0
Of course, we must initialize the dictionary and arrange to start in the reset state.
<<encoder>>= def encode(cs): d,w = {0:''}, 0 dodicte return compress
wrapping up
Finally, we include the boilerplate for a simple test case:
<<lz78.py>>= encoder decoder if __name__ == '__main__': test = "wow! how now brown plow cow" print encode(test) print decode(encode(test))
The output should be similar to:
[(0, 'w'), (0, 'o'), (1, '!'), (0, ' '), (0, 'h'), (2, 'w'), (4, 'n'), (6, ' '), (0, 'b'), (0, 'r'), (6, 'n'), (4, 'p'), (0, 'l'), (8, 'c'), (6, '')] wow! how now brown plow cow
Download code |