Shunting yard algorithm (Python)

From LiteratePrograms

Other implementations: C | Perl | Python

In this article, we describe an implementation of the Shunting yard algorithm in Python. The algorithm is a simple way of parsing expressions in infix notation.

In this implementation we generate a form of RPN which is readable by the dc UNIX tool.

Tokens

The first step is to split the input string containing the expression into tokens. The only token types we support are numbers and operators (+, -, *, /) and parentheses. This is easily accomplished with the re.split function.

<<tokens>>=
tokens = re.split(' *([+\-*/()]|\d+\.\d+|\d+) *', expr)

Operators

We use a hash to store the operator precedences. The unary - operator is called -u to distinguish it from the binary operator with the same name.

<<operators>>=
prec = {'-u': 5, '*': 4, '/': 3, '+': 2, '-': 1, '(': 0, '': 9}

Right-associative operators must be treated specially. In this example, we only have one (the unary -), and store it in another hash.

<<operators>>=
right = {'-u': 1}

We provide a function to easily find the operator precedence.

<<operators>>=
def getprec(op):
    return prec.get(op, -1)

The way we used the split function, will leave some empty tokens. We just ignore them.

A unary operator is the first token or any operator that is preceded by another operator (not parentheses).

<<tokens-preprocessing>>=
if not token:
    continue
if token == '-' and getprec(last) >= 0:
    token = '-u'

The parser

The shunting yard algorithm is quite simple. All numbers are added to the output stream (here represented by @rpn). Operators are pushed on a stack. Each time we reach a new operator, we pop operators from the stack until we reach one that has lower precedence. In the case of a right associative operator, we also stop if we reach an operator of the same precedence.

All popped operators are appended to the output stream.

When we reach a right parenthesis, we pop all operators until the matching left parenthesis. The parentheses are thrown away.

<<parsing>>=
# Parsing
op_stack = []
rpn = []
last = ''
for token in tokens:
    tokens-preprocessing
    if re.match('^\d+$|^\d+\.\d+$', token):
        if re.match('^\d+$|^\d+\.\d+$', last) or last == ')':
            raise Exception, "Value tokens must be separated by an operator"
        rpn.append(token)
    elif token == '(':
        op_stack.append(token)
    elif token == ')':
        while op_stack[-1] != '(':
            t = op_stack.pop()
            rpn.append(t)
        if op_stack.pop() != '(':
            raise Exception, "No matching ("
    elif getprec(token) > 0:
        pr = getprec(token)
        if token in right:
            while op_stack and pr < getprec(op_stack[-1]):
                rpn.append(op_stack.pop())
        else:
            while op_stack and pr <= getprec(op_stack[-1]):
               rpn.append(op_stack.pop())
        op_stack.append(token)
    else:
        raise Exception, "Unknown token: \"%s\"" % token
    last = token

When we have reached the end of the input stream, all remaining operators are popped and appended to the output stream.

<<parsing>>=
while op_stack:
    rpn.append(op_stack.pop())

Output

As we want to generate code for dc, we have to do some small adjustments in the output stream.

The unary - operator doesn't exist in dc, so we have to fake it by multiplying with -1, which is encoded as _1 in dc.

We also want the result to be printed, so we append the p command.

<<output>>=
for token in rpn:
    if token == '-u':
        print '_1*',
    else:
        print token,
print 'p'

The program

This program expects an expression on the command line. The output can be piped to dc like this: ./shunting-yard.py '1+2*3' | dc.

<<shunting-yard.py>>=
#!/usr/bin/env python
import re, sys
try:
    expr = sys.argv[1]
except:
    print "Usage %s expression" % sys.argv[0]
    sys.exit(1)
tokens
operators
parsing
output