Word count (Python)

From LiteratePrograms


An implementation of the UNIX wc tool, in Python.

The wc tool counts characters, words and lines in text files or stdin. When invoked without any options, it will print all three values. These options are supported:

  • -c - Only count characters
  • -w - Only count words
  • -l - Only count lines

If the tool is invoked without any file name parameters, it will use stdin.


General structure

#!/usr/bin/env python
from optparse import OptionParser
import sys
configure parser
support functions

Configuring the command line parser

First, the supported options are defined. Each of the short options -c, -w and -l also gets a long form: --char, --words and --lines. The option parser automatically adds a further option -h, with long form --help, which prints a short summary of all options. That is why a usage string and a help string for each option are supplied.

<<configure parser>>=
parser = OptionParser(usage="usage: %prog [options] [file1 file2 ...]")
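# Each option is a boolean flag; dest names the attribute that is read later on.
parser.add_option("-c", "--char", action="store_true", dest="characters",
                  default=False, help="Only count characters")
parser.add_option("-w", "--words", action="store_true", dest="words",
                  default=False, help="Only count words")
parser.add_option("-l", "--lines", action="store_true", dest="lines",
                  default=False, help="Only count lines")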

After all supported options are defined, the supplied options can be parsed.

<<configure parser>>=
(options, args) = parser.parse_args()

If none of the options were given, we actually want all of them.

<<configure parser>>=
if not (options.characters or options.words or options.lines):
    options.characters, options.words, options.lines = True, True, True
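
The option handling above can be tried in isolation. The following is a minimal sketch, not the program's own code: the helper name `parse` is hypothetical, and the `action`, `dest` and `default` arguments are an assumption about how the three flags are meant to be stored. It passes an explicit argument list to `parse_args` instead of relying on `sys.argv`.

```python
from optparse import OptionParser

def parse(argv):
    """Parse a wc-style argument list (hypothetical helper for illustration)."""
    parser = OptionParser(usage="usage: %prog [options] [file1 file2 ...]")
    # Assumption: each option is a boolean flag stored under the given dest.
    parser.add_option("-c", "--char", action="store_true", dest="characters",
                      default=False, help="Only count characters")
    parser.add_option("-w", "--words", action="store_true", dest="words",
                      default=False, help="Only count words")
    parser.add_option("-l", "--lines", action="store_true", dest="lines",
                      default=False, help="Only count lines")
    options, args = parser.parse_args(argv)
    # No flag given means that all three counts are wanted.
    if not (options.characters or options.words or options.lines):
        options.characters = options.words = options.lines = True
    return options, args

options, args = parse(["-l", "a.txt"])
print(options.lines, options.words, options.characters, args)
# prints: True False False ['a.txt']
```

Calling `parse([])` turns on all three flags and returns an empty argument list, which is the case in which the program reads standard input.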

The main program

If any file arguments are given, we loop through them, keeping track of the total number of lines, words and characters and of the number of files processed. If more than one file was processed, the totals are printed as well; with a single file they would merely repeat that file's counts, so they are omitted. If no file arguments are given, standard input is processed.

if args:
    total_lines = 0
    total_words = 0
    total_chars = 0
    file_count = 0
    for file_string in args:
        process single file argument
    if file_count > 1:
        print_count(total_lines, total_words, total_chars, "total")
else:
    process stdin

If the file name contains a file glob (wildcard), that glob is expanded. On POSIX-compatible systems this is the wrong thing to do, because the shell has already expanded wildcards there. If a glob symbol still reaches the program on such a system, then either it was quoted (expansion was explicitly not wanted), the glob did not match any file name, or it came out of the shell's own expansion (on POSIX systems a star can legally be part of a file name). Systems like Microsoft Windows, however, do not expand globs in the shell, so there it is the program's job. This requires a list of platforms on which file globbing has to be performed.

<<process single file argument>>=
#this comment is there to prevent a literateprograms bug messing up the indentation (which is fatal in python)
    platforms_needing_glob = ["win32"] # please populate!
    if sys.platform in platforms_needing_glob:
        import glob
        file_list = glob.glob(file_string)
    else:
        file_list = [file_string]
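
The effect of the two branches can be seen in a small, self-contained sketch. The function name `expand_argument` and its `needs_glob` parameter are hypothetical stand-ins for the platform test above, and the result is sorted only to make the demonstration deterministic. As in the chunk above, a pattern that matches nothing yields an empty list, so such an argument is skipped silently.

```python
import glob
import os
import tempfile

def expand_argument(file_string, needs_glob):
    """Return the file names one command line argument stands for."""
    if needs_glob:
        # The program has to expand wildcards itself (e.g. on win32).
        return sorted(glob.glob(file_string))
    # The shell has already expanded wildcards; take the name literally.
    return [file_string]

with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "notes.md"):
        open(os.path.join(tmp, name), "w").close()
    pattern = os.path.join(tmp, "*.txt")
    print([os.path.basename(p) for p in expand_argument(pattern, True)])
    # prints: ['a.txt', 'b.txt']
    print(expand_argument(pattern, False) == [pattern])
    # prints: True
```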

Next, we loop through all file names corresponding to the argument, read each file's entire content into memory, count its characters, words and lines, print them and update the running totals.

<<process single file argument>>=
    for file_name in file_list:
        # The with statement closes the file as soon as it has been read.
        with open(file_name) as input_file:
            data = input_file.readlines()
        lines, words, chars = get_count(data)
        print_count(lines, words, chars, file_name)
        total_lines += lines
        total_words += words
        total_chars += chars
        file_count += 1
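
Reading the whole file with readlines is simple but keeps every line in memory at once. As an alternative (a different technique from the one used in this program), the counts can be accumulated while iterating over the file object line by line. The sketch below assumes a hypothetical function `count_stream`; it accepts any iterable of lines, so an open file, `sys.stdin` or an `io.StringIO` object all work.

```python
import io

def count_stream(stream):
    """Count lines, words and characters without storing the lines."""
    lines = words = chars = 0
    for line in stream:
        lines += 1
        words += len(line.split())   # whitespace-separated words
        chars += len(line)           # includes the trailing newline
    return lines, words, chars

print(count_stream(io.StringIO("one two\nthree\n")))
# prints: (2, 3, 14)
```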

Processing standard input works the same as processing a file, except that there are no totals to update. For the file name in the output we pass the empty string.

<<process stdin>>=
file = sys.stdin
data = file.readlines()
lines, words, chars = get_count(data)
print_count(lines, words, chars, "")

Support functions

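We need two support functions: get_count computes the three counts from a list of lines, and print_count prints those counts that were selected by the options.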

<<support functions>>=
def get_count(data):
    lines = len(data)
    words = sum(len(x.split()) for x in data)
    chars = sum(len(x) for x in data)
    return lines, words, chars
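
def print_count(lines, words, chars, filename):
    # Emit only the requested counts, tab-separated, followed by the name.
    fields = []
    if options.lines:
        fields.append(str(lines))
    if options.words:
        fields.append(str(words))
    if options.characters:
        fields.append(str(chars))
    fields.append(filename)
    print("\t" + "\t".join(fields))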
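
To see what get_count returns, it can be called on a hand-written list of lines in the form readlines would produce. The sample data is made up for illustration; the function body is the one defined above.

```python
def get_count(data):
    lines = len(data)
    words = sum(len(x.split()) for x in data)
    chars = sum(len(x) for x in data)
    return lines, words, chars

# Three lines as readlines would return them; the last has no newline.
data = ["hello world\n", "\n", "last line"]
print(get_count(data))
# prints: (3, 4, 22)
```

Note that the final line is counted even though it lacks a trailing newline. POSIX wc -l counts newline characters instead, so it would report 2 lines for the same input.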