Word count (OCaml)
From LiteratePrograms
An implementation of the UNIX wc tool, in OCaml.
The wc tool counts characters, words and lines in text files or stdin. When invoked without any options, it will print all three values. These options are supported:
- -c - Only count characters
- -w - Only count words
- -l - Only count lines
If the tool is invoked without any file name parameters, it will use stdin.
This is a somewhat more complex implementation of wc than other sample OCaml implementations on the web. This added complexity stems from two things:
- Adding command line options;
- Avoiding loading an entire file into memory to count it (and instead reading the file line-by-line).
Counting and printing
The core of wc's functionality is to count various quantities and print the results to the screen. This task is accomplished by the wc function, with the help of several supporting functions:
<<counting and printing>>=
<<getCount>>
<<countFile>>
<<printCountLine>>
<<countAndPrintFile>>
<<wc>>
The wc function takes a group of option settings and a list of filenames, and for each file in the list prints the count totals specified by the options.
<<wc>>=
let wc opts = function
If the file list is empty, or contains only a "-", the input is assumed to arrive via stdin. We therefore simply get the count totals for the stdin handle, and print that count.
<<wc>>=
  [] | ["-"] -> printCountLine opts (getCount stdin ("",0,0,0))
If the file list contains only a single filename, we simply open and count that file, and print the resulting count totals. The situation when the file list has multiple elements is a little more complex. In this case, we need to both find and print the count for each file in the list, and accumulate a total count for all of the files. This is accomplished by folding the function countAndPrintFile across the file list.
<<wc>>=
  | [file] -> printCountLine opts (countFile file)
  | files ->
      let totalcount =
        List.fold_left (countAndPrintFile opts) ("total",0,0,0) files in
      printCountLine opts totalcount
countAndPrintFile
At each step of the fold operation, countAndPrintFile takes a current total count and a filename, prints the results of counting the file, and returns a new total count (which serves as an input to the next step of the fold).
<<countAndPrintFile>>=
let countAndPrintFile opts (tf,tls,tws,tcs) file =
  let (f,ls,ws,cs) as count = countFile file in
  printCountLine opts count;
  (tf, tls+ls, tws+ws, tcs+cs)
The type of the second argument to countAndPrintFile is a tuple consisting of a label (a filename, or "total" for the running total) and three integers representing the line, word, and character counts accumulated so far.
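The accumulation step of the fold can be tried in isolation, without touching any files. The following is a minimal sketch and not part of wc.ml: the helper name add_counts and the sample counts are made up for illustration.

```ocaml
(* Hypothetical helper: add one file's counts to the running total,
   keeping the label of the running total. *)
let add_counts (tf, tls, tws, tcs) (_, ls, ws, cs) =
  (tf, tls + ls, tws + ws, tcs + cs)

(* Made-up per-file counts: (filename, lines, words, chars). *)
let sample = [ ("a.txt", 2, 5, 20); ("b.txt", 1, 3, 10) ]

(* Fold the helper across the list, starting from an empty "total" count. *)
let total = List.fold_left add_counts ("total", 0, 0, 0) sample

let () =
  let (f, ls, ws, cs) = total in
  Printf.printf "\t%d\t%d\t%d\t%s\n" ls ws cs f
```

countAndPrintFile does the same addition, but also counts and prints each file as a side effect of every fold step.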
countFile
The countFile function takes a filename as its argument, and returns the line, word, and character counts for that file. The actual counting operation is handled by getCount, while countFile handles opening and closing the file to be counted.
<<countFile>>=
let countFile file =
  let hdl = open_in file in
  let count = getCount hdl (file,0,0,0) in
  close_in hdl;
  count
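If the counting function were to raise an exception other than End_of_file, the channel opened by countFile would not be closed. One possible way to guard against that is sketched below, assuming OCaml 4.08 or later for Fun.protect. The names with_input_file, count_lines and lines_in_file are hypothetical, and count_lines is only a short stand-in for getCount.

```ocaml
(* Hypothetical helper: open [file], apply [f] to the channel, and close the
   channel even if [f] raises. Fun.protect requires OCaml 4.08 or later. *)
let with_input_file file f =
  let hdl = open_in file in
  Fun.protect ~finally:(fun () -> close_in_noerr hdl) (fun () -> f hdl)

(* Stand-in for getCount: counts lines only, to keep the sketch short. *)
let rec count_lines hdl n =
  match input_line hdl with
  | _ -> count_lines hdl (n + 1)
  | exception End_of_file -> n

(* Usage: the channel is opened and closed around the counting call. *)
let lines_in_file file = with_input_file file (fun hdl -> count_lines hdl 0)
```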
getCount
The getCount function is the workhorse of the counting operation. At each iteration, it tries to read a line from the open input channel hdl. If the attempted read shows that the end-of-file has been reached, getCount returns the total number of lines, words, and characters in the file. If a line was successfully read, the line count is incremented by 1, and the word and character counts are respectively increased by the numbers of words and characters in the line. The character count is further incremented by 1, to account for the fact that input_line strips the newline character off of any line it returns.
We include a utility function words to split a string into words.
<<getCount>>=
let words = Str.split (Str.regexp "[ \t\n]+")

let rec getCount hdl (f,ls,ws,cs) =
  try
    let line = input_line hdl in
    let ls = ls + 1
    and ws = ws + List.length (words line)
    and cs = cs + String.length line + 1 in
    getCount hdl (f,ls,ws,cs)
  with End_of_file -> (f,ls,ws,cs)
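Two properties of this chunk are worth keeping in mind. The recursive call sits inside the try block, so it is not a tail call and the stack grows with the number of lines. Also, 1 is added for every line, so a final line that lacks a trailing newline is counted as if it had one. The sketch below is one possible variant, not the page's own code: it assumes OCaml 4.02 or later for the exception case in match, and replaces the Str-based words function with a hypothetical standard-library-only count_words so that it runs without linking Str.

```ocaml
(* Hypothetical stand-in for the Str-based words function: count maximal runs
   of characters other than space, tab and newline. *)
let count_words line =
  let is_space c = c = ' ' || c = '\t' || c = '\n' in
  let n = ref 0 and in_word = ref false in
  String.iter
    (fun c ->
      if is_space c then in_word := false
      else if not !in_word then begin
        in_word := true;
        incr n
      end)
    line;
  !n

(* Same accumulation as getCount, but the recursive call sits outside the
   exception handler, so it is a tail call (needs OCaml 4.02 or later). *)
let rec get_count hdl (f, ls, ws, cs) =
  match input_line hdl with
  | line ->
      get_count hdl
        (f, ls + 1, ws + count_words line, cs + String.length line + 1)
  | exception End_of_file -> (f, ls, ws, cs)
```

Like the original, this variant still adds 1 per line for the stripped newline.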
printCountLine
The printCountLine function prints out the line, word, and character counts for the file f, in accordance with the option settings.
<<printCountLine>>=
let printCountLine opts (f,ls,ws,cs) =
  if opts.showLines then Printf.printf "\t%d" ls;
  if opts.showWords then Printf.printf "\t%d" ws;
  if opts.showChars then Printf.printf "\t%d" cs;
  Printf.printf "\t%s\n" f
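To see which fields a given option record selects without printing anything, the same formatting can be written as a function that returns a string. This is only a sketch: format_count_line is a hypothetical name, the opts type is repeated here so the block is self-contained, and the sample counts are made up.

```ocaml
type opts = { showLines : bool; showWords : bool; showChars : bool }

(* Hypothetical variant of printCountLine that returns the line as a string
   (without the trailing newline) instead of printing it. *)
let format_count_line opts (f, ls, ws, cs) =
  (* A field contributes a tab and its number only when its flag is set. *)
  let field show n = if show then Printf.sprintf "\t%d" n else "" in
  field opts.showLines ls ^ field opts.showWords ws
  ^ field opts.showChars cs ^ "\t" ^ f

let () =
  let only_lines =
    { showLines = true; showWords = false; showChars = false } in
  print_endline (format_count_line only_lines ("a.txt", 2, 5, 20))
```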
Option handling
The handling of command line options makes use of the Arg module.
We first define a new record datatype opts, which contains three boolean fields representing the different options.
<<options type>>=
type opts = { showLines : bool; showWords : bool; showChars : bool }
Parsing of the command line flags makes use of the Arg.parse function, which unfortunately does not return anything, so the information parsed from the flags will have to be stored by mutating references. In the following we set up the things we need in order to call Arg.parse. We have four references, one boolean for each of the three flags and a list for the list of filenames.
We define a function process_arg, which will be used as the argument to Arg.parse for handling regular (non-flag) arguments; in this case it adds the filename to the list of files.
We also define an option description list. Each element is a tuple consisting of the short (e.g. -l) command line flag for each supported option, a specification for it to set the appropriate reference, and the usage message for each flag. We also define an additional option descriptor to handle the special argument "-", because otherwise Arg.parse will think it is a flag and not handle it correctly.
<<options setup>>=
let lines = ref false
and words = ref false
and chars = ref false
and files = ref [] in
let process_arg file = files := file :: !files in
let options = [
  "-l", Set lines, "show line count";
  "-w", Set words, "show word count";
  "-c", Set chars, "show character count";
  "-", Unit (fun () -> process_arg "-"), "input via stdin"
] in
The parseOpts function is our main option-handling code. After the above, we call Arg.parse with the appropriate arguments we set up earlier. Then we need to inspect the flags to see if they are all false, and if so set them all true, as per the default behavior of wc. We construct the opts record containing the flag truth values. Then we return a tuple of the opts structure and the list of files. Note that we reverse the filenames list because that list was constructed in reverse (process_arg adds new items to the front of the list).
<<parseOpts>>=
let header = "Usage: wc [OPTION...] [files...]"

let parseOpts () =
  <<options setup>>
  parse options process_arg header;
  (* begin ... end is needed here: without it, only the first assignment
     would be conditional and the other two would always run *)
  if not !lines && not !words && not !chars then begin
    lines := true;
    words := true;
    chars := true
  end;
  { showLines = !lines; showWords = !words; showChars = !chars }, List.rev !files
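Because Arg.parse reads Sys.argv directly, the option handling is awkward to try out on its own. The sketch below is a possible testable variant rather than the page's own code: it uses Arg.parse_argv, which takes the argument vector explicitly and raises Arg.Bad or Arg.Help instead of exiting. The name parse_opts_from is hypothetical, and the opts type is repeated so the block is self-contained. Note that in OCaml `if c then a; b` parses as `(if c then a); b`, so the three default assignments are wrapped in begin ... end.

```ocaml
type opts = { showLines : bool; showWords : bool; showChars : bool }

(* Hypothetical testable variant of parseOpts: takes the argument vector
   explicitly instead of reading Sys.argv. *)
let parse_opts_from argv =
  let lines = ref false and words = ref false and chars = ref false
  and files = ref [] in
  let process_arg file = files := file :: !files in
  let options =
    [ ("-l", Arg.Set lines, "show line count");
      ("-w", Arg.Set words, "show word count");
      ("-c", Arg.Set chars, "show character count");
      ("-", Arg.Unit (fun () -> process_arg "-"), "input via stdin") ]
  in
  (* ~current:(ref 0) keeps the sketch independent of the global Arg.current. *)
  Arg.parse_argv ~current:(ref 0) argv options process_arg
    "Usage: wc [OPTION...] [files...]";
  (* begin ... end keeps all three assignments inside the conditional. *)
  if not (!lines || !words || !chars) then begin
    lines := true;
    words := true;
    chars := true
  end;
  ({ showLines = !lines; showWords = !words; showChars = !chars },
   List.rev !files)
```

For example, `parse_opts_from [| "wc"; "-l"; "a.txt"; "-"; "b.txt" |]` selects only the line count and returns the three non-flag arguments in command line order.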
Putting it all together
The remainder of wc.ml is fairly straightforward. It does two things:
- Loads the necessary supporting libraries (the words function uses functions from the Str regular-expression library, which is not loaded by default), and "opens" the Arg module, bringing its names into our namespace (otherwise we would have to prefix a lot of things with Arg.).
- Defines a short "main function" that gathers the command line options, and then applies wc to the list of files provided on the command line.
Note that OCaml doesn't really have an explicit "main function"; it just evaluates all code at the top level of the source file, in order. One convention is shown below, where we use let () =. This accomplishes two purposes:
- It allows us to avoid using the ugly ;; terminator syntax, which is usually necessary to separate statements of code at the top level, but is not required before certain definition constructs, including let.
- It pattern-matches the result of the block of code against (), the unit value, so the type checker ensures that the block of code doesn't "return" anything, since anything returned at the top level will be discarded anyway.
<<wc.ml>>=
#load "str.cma"
open Arg

<<options type>>
<<counting and printing>>
<<parseOpts>>

let () =
  let opts, files = parseOpts () in
  wc opts files