Shunting yard algorithm (Python)
From LiteratePrograms
In this article, we describe an implementation of the Shunting yard algorithm in Python. The algorithm is a simple way of parsing expressions in infix notation.
In this implementation we generate a form of RPN which is readable by the dc UNIX tool.
Contents |
Tokens
The first step is to split the input string containing the expression into tokens. The only token types we support are numbers and operators (+, -, *, /) and parentheses. This is easily accomplished with the re.split function.
<<tokens>>= tokens = re.split(' *([+\-*/()]|\d+\.\d+|\d+) *', expr)
Operators
We use a hash to store the operator precedences. The unary - operator is called -u to distinguish it from the binary operator with the same name.
<<operators>>= prec = {'-u': 5, '*': 4, '/': 3, '+': 2, '-': 1, '(': 0, '': 9}
Right-associative operators must be treated specially. In this example, we only have one (the unary -), and store it in another hash.
<<operators>>= right = {'-u': 1}
We provide a function to easily find the operator precedence.
<<operators>>= def getprec(op): return prec.get(op, -1)
The way we used the split function, will leave some empty tokens. We just ignore them.
A unary operator is the first token or any operator that is preceded by another operator (not parentheses).
<<tokens-preprocessing>>= if not token: continue if token == '-' and getprec(last) >= 0: token = '-u'
The parser
The shunting yard algorithm is quite simple. All numbers are added to the output stream (here represented by @rpn). Operators are pushed on a stack. Each time we reach a new operator, we pop operators from the stack until we reach one that has lower precedence. In the case of a right associative operator, we also stop if we reach an operator of the same precedence.
All popped operators are appended to the output stream.
When we reach a right parenthesis, we pop all operators until the matching left parenthesis. The parentheses are thrown away.
<<parsing>>= # Parsing op_stack = [] rpn = [] last = '' for token in tokens: tokens-preprocessing if re.match('^\d+$|^\d+\.\d+$', token): if re.match('^\d+$|^\d+\.\d+$', last) or last == ')': raise Exception, "Value tokens must be separated by an operator" rpn.append(token) elif token == '(': op_stack.append(token) elif token == ')': while op_stack[-1] != '(': t = op_stack.pop() rpn.append(t) if op_stack.pop() != '(': raise Exception, "No matching (" elif getprec(token) > 0: pr = getprec(token) if token in right: while op_stack and pr < getprec(op_stack[-1]): rpn.append(op_stack.pop()) else: while op_stack and pr <= getprec(op_stack[-1]): rpn.append(op_stack.pop()) op_stack.append(token) else: raise Exception, "Unknown token: \"%s\"" % token last = token
When we have reached the end of the input stream, all remaining operators are popped and appended to the output stream.
<<parsing>>= while op_stack: rpn.append(op_stack.pop())
Output
As we want to generate code for dc, we have to do some small adjustments in the output stream.
The unary - operator doesn't exist in dc, so we have to fake it by multiplying with -1, which is encoded as _1 in dc.
We also want the result to be printed, so we append the p command.
<<output>>= for token in rpn: if token == '-u': print '_1*', else: print token, print 'p'
The program
This program expects an expression on the command line. The output can be piped to dc like this: ./shunting-yard.py '1+2*3' | dc
.
<<shunting-yard.py>>= #!/usr/bin/env python import re, sys try: expr = sys.argv[1] except: print "Usage %s expression" % sys.argv[0] sys.exit(1) tokens operators parsing output
Download code |