Word count (Haskell)
From LiteratePrograms
An implementation of the UNIX wc tool, in Haskell.
The wc tool counts characters, words and lines in text files or stdin. When invoked without any options, it will print all three values. These options are supported:
- -c - Only count characters
- -w - Only count words
- -l - Only count lines
If the tool is invoked without any file name parameters, it will use stdin.
This is a somewhat more complex implementation of wc than other sample Haskell implementations on the web. This added complexity stems from two things:
- Adding command line options;
- Avoiding loading an entire file into memory to count it (and instead reading the file line-by-line).
Counting and printing
The core of wc's functionality is to count various quantities and print the results to the screen. This task is accomplished by the wc function, with the help of several supporting functions:
<<counting and printing>>=
<<wc>>
<<countAndPrintFile>>
<<countFile>>
<<getCount>>
<<printCountLine>>
The wc function takes a group of option settings and a list of filenames, and for each file in the list prints the count totals specified by the options.
<<wc>>=
wc :: Opts -> [FilePath] -> IO ()
If the file list is empty, or contains only a "-", the input is assumed to arrive via stdin. We therefore read the entire contents of the stdin handle lazily using getContents, compute the count totals for that text, and then print them.
<<wc>>=
wc opts [] = wc opts ["-"]
wc opts ["-"] = do
    text <- getContents
    let count = getCount text
    printCountLine opts "" count
If the file list contains only a single filename, we simply open and count that file, and print the resulting count totals. The situation when the file list has multiple elements is a little more complex. In this case, we need to both find and print the count for each file in the list, and accumulate a total count for all of the files. This is accomplished by folding the function countAndPrintFile across the file list.
<<wc>>=
wc opts [file] = do
    count <- countFile file
    printCountLine opts file count
wc opts files = do
    totalcount <- foldM (countAndPrintFile opts) (0,0,0) files
    printCountLine opts "total" totalcount
countAndPrintFile
At each step of the fold operation, countAndPrintFile takes a current total count and a filename, prints the results of counting the file, and returns a new total count (which serves as an input to the next step of the fold).
<<countAndPrintFile>>=
countAndPrintFile :: Opts -> WordCount -> FilePath -> IO WordCount
countAndPrintFile opts total@(tls,tws,tcs) file = do
    count@(ls,ws,cs) <- countFile file
    printCountLine opts file count
    return (tls+ls,tws+ws,tcs+cs)
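The accumulation pattern can be tried in isolation. The following is a minimal sketch, not the article's code: addCounts and countAndReport are hypothetical names, and in-memory strings stand in for files so that the example runs without touching the file system.

```haskell
import Control.Monad (foldM)

-- (lines, words, characters), mirroring the article's WordCount alias.
type WordCount = (Int, Int, Int)

-- Hypothetical helper: component-wise sum of two counts.
addCounts :: WordCount -> WordCount -> WordCount
addCounts (l1, w1, c1) (l2, w2, c2) = (l1 + l2, w1 + w2, c1 + c2)

-- Hypothetical fold step: report one named text and return the updated total.
-- The (name, text) pair stands in for a file on disk.
countAndReport :: WordCount -> (String, String) -> IO WordCount
countAndReport total (name, text) = do
  let count = (length (lines text), length (words text), length text)
  putStrLn (name ++ ": " ++ show count)
  return (addCounts total count)

main :: IO ()
main = do
  let inputs = [("a.txt", "one two\nthree\n"), ("b.txt", "four\n")]
  -- foldM threads the running total through each IO step, left to right.
  total <- foldM countAndReport (0, 0, 0) inputs
  putStrLn ("total: " ++ show total)
```

Each step both performs output and hands its result to the next step, which is why foldM is used here rather than a pure fold.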
The WordCount type is simply an alias for a tuple consisting of three integers representing the current line, word, and character counts for the corresponding file.
<<WordCount>>=
type WordCount = (Int, Int, Int)
countFile
The countFile function takes a filename as its argument, and returns the line, word, and character counts for that file. The actual counting operation is handled by getCount, while countFile handles lazily reading the contents of the file to be counted.
<<countFile>>=
countFile :: FilePath -> IO WordCount
countFile file = do
    text <- readFile file
    return $ getCount text
getCount
The getCount function is the workhorse of the counting operation. It receives a string containing the entire contents of the file to be counted (lazily, of course). It splits the file into lines and folds across all the lines, accumulating the counts. For each line, the line count is incremented by 1, and the word and character counts are increased by the numbers of words and characters in the line, respectively. The character count is further incremented by 1, to account for the fact that lines strips the newline character off each line. As a consequence, input whose final line lacks a trailing newline is reported with one character too many.
<<getCount>>=
getCount :: String -> WordCount
-- The accumulator is ordered (lines, words, chars), matching WordCount.
getCount = foldl' (\(l, w, c) x -> (l+1, w+length (words x), c+length x+1)) (0, 0, 0) . lines
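To make the per-line arithmetic concrete, here is a small self-contained sketch of the same fold. The name countText is hypothetical, and the accumulator is assumed to follow the (lines, words, characters) order of WordCount.

```haskell
import Data.List (foldl')

type WordCount = (Int, Int, Int)  -- (lines, words, characters)

-- Hypothetical name; the same per-line fold as described above.
countText :: String -> WordCount
countText = foldl' step (0, 0, 0) . lines
  where
    -- The +1 on the character count restores the newline that 'lines' removed.
    step (l, w, c) x = (l + 1, w + length (words x), c + length x + 1)

main :: IO ()
main = do
  print (countText "hello world\nfoo\n")  -- prints (2,3,16)
  print (countText "no newline")          -- prints (1,2,11)
```

The second call shows the off-by-one on the character count: the string has 10 characters, but the unconditional +1 per line reports 11 because there is no trailing newline to account for.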
printCountLine
The printCountLine function prints out the line, word, and character counts for the file f, in accordance with the option settings.
<<printCountLine>>=
printCountLine :: Opts -> FilePath -> WordCount -> IO ()
printCountLine opts f (ls,ws,cs) =
    putStrLn ("\t" ++ (if showLines opts then (show ls) ++ "\t" else "") ++
              (if showWords opts then (show ws) ++ "\t" else "") ++
              (if showChars opts then (show cs) ++ "\t" else "") ++
              f)
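The output layout is easier to inspect when the line is built as a value rather than printed directly. The sketch below is an illustration only: formatCountLine is a hypothetical pure variant, and the Opts record is redeclared locally so that the block stands alone.

```haskell
-- Local stand-in for the article's Opts record.
data Opts = Opts { showLines, showWords, showChars :: Bool }

-- Hypothetical pure variant of printCountLine: build the line instead of printing it.
formatCountLine :: Opts -> FilePath -> (Int, Int, Int) -> String
formatCountLine opts f (ls, ws, cs) =
  "\t" ++ concat [ show n ++ "\t" | (enabled, n) <- columns, enabled ] ++ f
  where
    -- Each column pairs its option switch with the number it would display.
    columns = [(showLines opts, ls), (showWords opts, ws), (showChars opts, cs)]

main :: IO ()
main = do
  putStrLn (formatCountLine (Opts True True True) "notes.txt" (2, 3, 16))
  putStrLn (formatCountLine (Opts False True False) "notes.txt" (2, 3, 16))
```

With only the word option enabled, the resulting string is a tab, the word count, a tab, and the file name.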
Option handling
The handling of command line options makes use of the System.Console.GetOpt library. It uses an approach to process the options inspired by a Haskell mailing list post by Tomasz Zielonka (see References).
We first define a new record datatype Opts, which contains three boolean fields representing the different options.
<<options setup>>=
data Opts = Opts { showLines, showWords, showChars :: Bool }
We also define a GetOpt option description list. This defines both short (e.g. -l) and long (e.g. --lines) command line flags for each supported option, a function which operates on the Opts datatype and is called when a particular flag is detected, and a usage message for each flag.
<<options setup>>=
options :: [OptDescr (Opts -> Opts)]
options =
    [ Option ['l'] ["lines"] (NoArg (\o -> o {showLines = True})) "show line count"
    , Option ['w'] ["words"] (NoArg (\o -> o {showWords = True})) "show word count"
    , Option ['c'] ["chars"] (NoArg (\o -> o {showChars = True})) "show character count"
    ]
Parsing of the command line flags makes use of the getOpt function, which returns a triple containing a list of options, a list of non-option arguments (here, the files), and a list of error messages. If the error list is non-empty, we simply fail with the error messages followed by a usage message.
<<parseOpts>>=
parseOpts :: [String] -> IO ([Opts -> Opts], [String])
parseOpts args =
    case getOpt RequireOrder options args of
        (opts,files,[]) -> return (opts,files)
        (_,_,errs)      -> fail (concat errs ++ usageInfo header options)
    where header = "Usage: wc [OPTION...] [files...]"
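The shape of getOpt's result is easiest to see with option values that can be printed. The sketch below deliberately departs from the article's approach: it uses a hypothetical Flag enumeration instead of Opts -> Opts functions, purely so that the parsed list can be shown.

```haskell
import System.Console.GetOpt

-- Hypothetical flag type for this sketch only; the article uses functions instead.
data Flag = Lines | Words | Chars deriving (Eq, Show)

options :: [OptDescr Flag]
options =
  [ Option ['l'] ["lines"] (NoArg Lines) "show line count"
  , Option ['w'] ["words"] (NoArg Words) "show word count"
  , Option ['c'] ["chars"] (NoArg Chars) "show character count"
  ]

main :: IO ()
main = do
  -- getOpt returns (recognised options, non-option arguments, error messages).
  let (flags, files, errs) =
        getOpt RequireOrder options ["-l", "--words", "a.txt", "b.txt"]
  print flags  -- prints [Lines,Words]
  print files  -- prints ["a.txt","b.txt"]
  print errs   -- prints []
  -- An unrecognised flag is reported in the third component.
  let (_, _, errs') = getOpt RequireOrder options ["-x"]
  mapM_ putStr errs'
```

With RequireOrder, option processing stops at the first non-option argument, so any flag appearing after a file name is treated as a file name.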
The tricky part of the option handling is the processing of the options. The getOpt function returns a list of option values, but we would prefer not to have to scan the list every time we want to check whether an option is set. Fortunately, the way in which the options list was defined provides us with a convenient way around this problem. This list was defined such that the "value" of each option in the list returned by getOpt is actually a function that transforms Opts records. As a result, it is possible to use a fold to thread an Opts record through the list of option functions. If there are no option flags set on the command line, then we default to having all options "on".
<<processOpts>>=
processOpts :: [Opts -> Opts] -> Opts
processOpts [] = allOpts True
processOpts opts = foldl (flip ($)) (allOpts False) opts

allOpts :: Bool -> Opts
allOpts b = Opts { showLines = b, showWords = b, showChars = b }
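The effect of the fold can be checked on a hand-written list of option functions. This sketch repeats the definitions above so that it runs on its own; the deriving clause is an addition made here only so that the resulting records can be printed and compared.

```haskell
-- Same record as in the article, plus Eq and Show for demonstration purposes.
data Opts = Opts { showLines, showWords, showChars :: Bool } deriving (Eq, Show)

allOpts :: Bool -> Opts
allOpts b = Opts { showLines = b, showWords = b, showChars = b }

processOpts :: [Opts -> Opts] -> Opts
processOpts [] = allOpts True
-- flip ($) applies each function in turn to the record built so far.
processOpts fs = foldl (flip ($)) (allOpts False) fs

main :: IO ()
main = do
  -- These two setters correspond to passing "-l -c" on the command line.
  let setters = [\o -> o { showLines = True }, \o -> o { showChars = True }]
  print (processOpts setters)
  -- No flags at all: every count is shown.
  print (processOpts [])
```

Starting from a record with every field False, each setter switches on exactly one field, so after the fold only the requested counts are enabled.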
Putting it all together
The remainder of wc.hs is fairly straightforward. It does two things:
- Imports the necessary supporting libraries;
- Defines a short main function that gathers the command line arguments, handles any command line options, and then applies wc to the list of files provided on the command line.
<<wc.hs>>=
module Main where

import System.Environment
import System.Console.GetOpt
import Control.Monad
import Data.List

<<WordCount>>
<<options setup>>
<<counting and printing>>
<<parseOpts>>
<<processOpts>>

main :: IO ()
main = do
    args <- getArgs
    (optList, files) <- parseOpts args
    let opts = processOpts optList
    wc opts files
References
- High-level technique for program options handling
- Simple 'wc' using 'interact'