Word count (Haskell)

From LiteratePrograms


An implementation of the UNIX wc tool, in Haskell.

The wc tool counts characters, words and lines in text files or stdin. When invoked without any options, it will print all three values. These options are supported:

  • -c - Only count characters
  • -w - Only count words
  • -l - Only count lines

If the tool is invoked without any file name parameters, it will use stdin.

This is a somewhat more complex implementation of wc than other sample Haskell implementations on the web. This added complexity stems from two things:

  1. Adding command line options;
  2. Avoiding loading an entire file into memory to count it (and instead reading the file line-by-line).
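
For contrast, the simpler style of wc can be sketched in a few lines. This is only an illustration under stated assumptions, not one of the implementations referred to above; the names simpleCount and runSimpleWc are hypothetical. It supports no options and reads only from stdin.

```haskell
-- A minimal sketch of a "simple" wc: no options, stdin only.
-- simpleCount and runSimpleWc are hypothetical names used for illustration.

-- Count lines, words, and characters of the whole input in one go.
simpleCount :: String -> (Int, Int, Int)
simpleCount text = (length (lines text), length (words text), length text)

-- Read all of stdin, count it, and print the three totals on one line.
runSimpleWc :: IO ()
runSimpleWc = interact $ \text ->
    let (ls, ws, cs) = simpleCount text
    in  unwords (map show [ls, ws, cs]) ++ "\n"
```

Because text is traversed three times here, the whole input is likely to stay in memory until the last traversal finishes, which is the behaviour the implementation below tries to avoid.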


Counting and printing

The core of wc's functionality is to count various quantities, and print the results to the screen. This task is accomplished by the wc function, with the help of several supporting functions:

<<counting and printing>>=
<<wc>>
<<countAndPrintFile>>
<<countFile>>
<<getCount>>
<<printCountLine>>

The wc function takes a group of option settings and a list of filenames, and for each file in the list prints the count totals specified by the options.

<<wc>>=
wc :: Opts -> [FilePath] -> IO ()

If the file list is empty, or contains only a "-", the input is assumed to arrive via stdin. We therefore simply get the entire contents of the stdin handle lazily using getContents, get the count totals for that, and then print the count.

<<wc>>=
wc opts []     = wc opts ["-"]
wc opts ["-"]  = do
    text <- getContents
    let count = getCount text
    printCountLine opts "" count

If the file list contains only a single filename, we simply open and count that file, and print the resulting count totals. The situation when the file list has multiple elements is a little more complex. In this case, we need to both find and print the count for each file in the list, and accumulate a total count for all of the files. This is accomplished by folding the function countAndPrintFile across the file list.

<<wc>>=
wc opts [file] = do
    count <- countFile file
    printCountLine opts file count
wc opts files  = do
    totalcount <- foldM (countAndPrintFile opts) (0,0,0) files
    printCountLine opts "total" totalcount
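
The way foldM threads an accumulator through a list while performing IO at each step can be seen in isolation in the following sketch. The names addAndReport and sumWithReport are hypothetical and are not part of wc.hs; only foldM itself comes from Control.Monad.

```haskell
import Control.Monad (foldM)

-- Hypothetical step function: print the item, then return the updated total.
addAndReport :: Int -> Int -> IO Int
addAndReport total n = do
    putStrLn ("adding " ++ show n)
    return (total + n)

-- Thread a running total through the list, performing IO at each step.
sumWithReport :: [Int] -> IO Int
sumWithReport = foldM addAndReport 0
```

Running sumWithReport [1, 2, 3] prints one "adding" line per element and returns 6. In wc the accumulator is a count triple rather than a single Int, and the step prints a count line rather than the element.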

countAndPrintFile

At each step of the fold, countAndPrintFile takes the running total count and a filename, prints the results of counting the file, and returns a new total count (which serves as the input to the next step of the fold).

<<countAndPrintFile>>=
countAndPrintFile :: Opts -> WordCount -> FilePath -> IO WordCount
countAndPrintFile opts total@(tls,tws,tcs) file = do
    count@(ls,ws,cs) <- countFile file
    printCountLine opts file count
    return (tls+ls,tws+ws,tcs+cs)

The WordCount type is simply an alias for a tuple consisting of three integers representing the current line, word, and character counts for the corresponding file.

<<WordCount>>=
type WordCount = (Int, Int, Int)
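
The component-wise addition of two counts could also be factored into a small helper. The following is a sketch only; addCounts and totalCounts are hypothetical names that do not appear in wc.hs.

```haskell
type WordCount = (Int, Int, Int)

-- Hypothetical helper: add two counts component by component.
addCounts :: WordCount -> WordCount -> WordCount
addCounts (l1, w1, c1) (l2, w2, c2) = (l1 + l2, w1 + w2, c1 + c2)

-- Total of several per-file counts, starting from the zero count.
totalCounts :: [WordCount] -> WordCount
totalCounts = foldr addCounts (0, 0, 0)
```

For example, totalCounts [(1, 2, 3), (4, 5, 6)] evaluates to (5, 7, 9).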

countFile

The countFile function takes a filename as its argument, and returns the line, word, and character counts for that file. The actual counting is handled by getCount, while countFile is responsible for lazily reading the contents of the file to be counted.

<<countFile>>=
countFile :: FilePath -> IO WordCount
countFile file = do
    text <- readFile file
    return $ getCount text

getCount

The getCount function is the workhorse of the counting operation. It receives a string containing the entire contents of the file to be counted (lazily, of course), splits it into lines, and folds across the lines, accumulating the counts. For each line, the line count is incremented by 1, and the word and character counts are increased by the number of words and characters in the line, respectively. The character count is incremented by a further 1 because lines strips the newline character from each line. As a consequence, a final line with no terminating newline is still counted as if it had one.

<<getCount>>=
getCount :: String -> WordCount
getCount = foldl' (\(l, w, c) x -> (l+1, w+length (words x), c+length x+1))
                  (0, 0, 0)
                  . lines
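
The per-line fold described above can be checked on a small input. The sketch below uses a hypothetical stand-in named countText, so that it can be read independently of the rest of wc.hs. The result is ordered (lines, words, characters), which is an assumption of this example.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for the per-line fold described above.
-- The result is ordered (lines, words, characters).
countText :: String -> (Int, Int, Int)
countText = foldl' step (0, 0, 0) . lines
  where
    -- One more line, the words on it, and its characters plus the newline.
    step (l, w, c) x = (l + 1, w + length (words x), c + length x + 1)
```

For the input "one two\nthree\n", lines yields ["one two", "three"], and the fold produces (2, 3, 14): 8 characters for the first line including its newline, and 6 for the second. The same result is produced for "one two\nthree", which has only 13 characters, because the extra 1 is added for every line whether or not a newline was present.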

printCountLine

The printCountLine function prints out the line, word, and character counts for the file f, in accordance with the option settings.

<<printCountLine>>=
printCountLine :: Opts -> FilePath -> WordCount -> IO ()
printCountLine opts f (ls,ws,cs) =
    putStrLn ("\t" ++ (if showLines opts then (show ls) ++ "\t" else "")
                   ++ (if showWords opts then (show ws) ++ "\t" else "")
                   ++ (if showChars opts then (show cs) ++ "\t" else "")
                   ++ f)
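
To see exactly what string is printed for a given combination of options, the formatting can be separated from the IO. The following is a sketch under assumptions: formatCountLine is a hypothetical name, and the three option settings are passed as a plain tuple of booleans instead of an Opts record so that the example is self-contained.

```haskell
-- Hypothetical pure variant of the formatting logic: build the output line
-- as a String so it can be inspected without performing IO.
formatCountLine :: (Bool, Bool, Bool) -> FilePath -> (Int, Int, Int) -> String
formatCountLine (showL, showW, showC) f (ls, ws, cs) =
    -- Keep only the counts whose flag is True, each followed by a tab.
    "\t" ++ concat [ show n ++ "\t" | (True, n) <- flags ] ++ f
  where
    flags = [(showL, ls), (showW, ws), (showC, cs)]
```

With lines and characters enabled, formatCountLine (True, False, True) "a.txt" (3, 10, 42) gives "\t3\t42\ta.txt".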

Option handling

The handling of command line options makes use of the System.Console.GetOpt library. The approach to processing the options is inspired by a Haskell mailing list post by Tomasz Zielonka (see References).

We first define a new record datatype Opts, which contains three boolean fields representing the different options.

<<options setup>>=
data Opts = Opts { showLines, showWords, showChars :: Bool }

We also define a GetOpt option description list. For each supported option, this defines both short (e.g. -l) and long (e.g. --lines) command line flags, a function which operates on the Opts datatype and is applied when the flag is detected, and a usage message.

<<options setup>>=
options :: [OptDescr (Opts -> Opts)]
options =
 [ Option ['l'] ["lines"] 
    (NoArg (\o -> o {showLines = True})) "show line count"   
 , Option ['w'] ["words"] 
    (NoArg (\o -> o {showWords = True})) "show word count"       
 , Option ['c'] ["chars"] 
    (NoArg (\o -> o {showChars = True})) "show character count"
 ]

Parsing of the command line flags makes use of the getOpt function, which returns a triple containing a list of options, a list of non-option arguments (here, the files), and a list of error messages. If the error list is non-empty, we simply fail with the error messages followed by a usage message.

<<parseOpts>>=
parseOpts :: [String] -> IO ([Opts -> Opts], [String])
parseOpts args = 
    case getOpt RequireOrder options args of
        (opts,files,[]) -> return (opts,files)
        (_,_,errs)      -> fail (concat errs ++ usageInfo header options)
    where 
        header = "Usage: wc [OPTION...] [files...]" 
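
The shape of getOpt's result is easiest to see on a reduced example. The sketch below uses a hypothetical one-flag option table, demoOptions, whose option value is a plain String instead of an Opts -> Opts function; demoParse is likewise a hypothetical name.

```haskell
import System.Console.GetOpt

-- Hypothetical single-flag option table, used only for this illustration.
demoOptions :: [OptDescr String]
demoOptions = [ Option ['l'] ["lines"] (NoArg "lines") "show line count" ]

-- Split an argument list into recognised flags, non-option arguments, and errors.
demoParse :: [String] -> ([String], [String], [String])
demoParse = getOpt RequireOrder demoOptions
```

Here demoParse ["-l", "a.txt"] gives (["lines"], ["a.txt"], []), while an unknown flag such as "-x" produces a non-empty error list. Because of RequireOrder, option processing stops at the first non-option argument, so demoParse ["a.txt", "-l"] gives ([], ["a.txt", "-l"], []).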

The tricky part of the option handling is the processing of the options. The getOpt function returns a list of option values, but we would prefer not to have to scan the list every time we want to check whether an option is set. Fortunately, the way in which the options list was defined provides us with a convenient way around this problem. The list was defined such that the "value" of each option in the list returned by getOpt is actually a function that transforms Opts records. As a result, it is possible to use a fold to thread an Opts record through the list of option functions. If no option flags are set on the command line, we default to having all options "on".

<<processOpts>>=
processOpts :: [Opts -> Opts] -> Opts
processOpts []   = allOpts True
processOpts opts = foldl (flip ($)) (allOpts False) opts
allOpts :: Bool -> Opts
allOpts b =  Opts { showLines = b, 
                    showWords = b, 
                    showChars = b } 
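
The fold over a list of record-updating functions can be tried out on a smaller record. This is a sketch only: Flags and applyAll are hypothetical names standing in for Opts and the fold inside processOpts.

```haskell
-- Hypothetical two-field record standing in for Opts.
data Flags = Flags { verbose, quiet :: Bool } deriving (Eq, Show)

-- Apply each update function in turn, left to right, to a starting record.
applyAll :: Flags -> [Flags -> Flags] -> Flags
applyAll = foldl (flip ($))
```

For example, applyAll (Flags False False) [\f -> f { verbose = True }] yields Flags { verbose = True, quiet = False }, and an empty list of functions returns the starting record unchanged. This is why processOpts handles the empty list as a separate case: folding over it would otherwise leave every option off.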

Putting it all together

The remainder of wc.hs is fairly straightforward. It does two things:

  1. Imports the necessary supporting libraries;
  2. Defines a short main function that gathers the command line arguments, handles any command line options, and then applies wc to the list of files provided on the command line.

<<wc.hs>>=
module Main where
import System.Environment
import System.Console.GetOpt
import Control.Monad
import Data.List
<<WordCount>>
<<options setup>>
<<counting and printing>>
<<parseOpts>>
<<processOpts>>
main :: IO ()        
main = do
    args <- getArgs
    (optList, files) <- parseOpts args
    let opts = processOpts optList
    wc opts files

References

  • High-level technique for program options handling
  • Simple 'wc' using 'interact'