100 Languages Speedrun: Episode 18: PLY Python Lex-Yacc

In episodes 8 and 14 we created a simple language without any tools - using hand-written regexp-based tokenizer, and hand-written recursive descent parser.

If you want to experiment with creating your own programming languages, I'd generally recommend starting this way - it's more intuitive, more flexible, supports a lot better error reporting, and mostly uses skills you already have, like writing regexps and writing functions which call other functions.

But an alternative way exist - writing a specification and using a tool compile it to a tokenizer and a parser.

Tools vs Manual Parsers

There are complicated Computer Science reasons for why these tradeoffs exist, but let's just focus on the facts.

For tokenizers, tool-generated ones are potentially much faster than long-list-of-regexp-to-match-in-order style we used.

If you have 100 token types in your language, a tool can take 100 regexps for these tokens, and turn them into what's sort of a monster regexp that will check all 100 at once. Meanwhile our hand-written tokenizer will have to check each token type one by one, which could be 100 times slower. By the way this is the reason why most such tools match tokens "longest match first" not in any priority order - there's no order in this monster regexp.

The main downside is that you're much less flexible in the kind of regexps you can use, and for many languages where context is needed for tokenization (like in C++, a*b could be either multiplication, or definition of variable b with type a*), it's often difficult to give the tools the right context.

But for a simple language, the tools are usually perfectly adequate.

For parser generator tools, the advantage is mainly that it can support more rule types. For example parsing tools usually have no trouble with rules that refer to themselves on the left like expression: expression + product, while writing them by hand you have to rewrite them to avoid an infinite loop.

The main downside of tool-generated parsers is drastically worse error reporting. It's not a fixable issue. It's fundamental to the way these tools work, and it's simply impossible for them to give human-readable errors like "expected + or - but got NUMBER" (without basically adding a second backup parser just for error reporting). It's usually really clear which language uses hand-written recursive descent vs tool-generated parser, from just the messages you get in case of syntax error. That's just the cost you'll have to pay.

It's also possible to mix these approaches - with hand-crafted tokenizer and generated parser (not what I'd recommend), or with tool-generated tokenizer and hand-crafted parser (much more reasonable combination).

All that is just about traditional ("LR" type) parser generator tools. There are some alternative tools like ANTLR ("LL" type) which fall somewhere in between these, and generate parsers closer to what a human would make, with better error reporting, at cost of more restrictive grammar rules.

Tokenizer

We're going to implement pretty much the same language as back in episode 14 - it will read some numbers from inputs, do some math, and print them out. Let's start with the tokenizer:

#!/usr/bin/env python3

import sys
import ply.lex as lex

tokens = (
  "DIVIDE",
  "EQUALS",
  "ID",
  "LPAREN",
  "MINUS",
  "NUMBER",
  "PLUS",
  "PRINT",
  "READ",
  "RPAREN",
  "SET",
  "TIMES",
)

# Tokens
t_PLUS    = r"\+"
t_MINUS   = r"-"
t_TIMES   = r"\*"
t_DIVIDE  = r"/"
t_EQUALS  = r"="
t_LPAREN  = r"\("
t_RPAREN  = r"\)"

def t_NUMBER(t):
  r"-?\d+(\.\d*)?"
  t.value = float(t.value)
  return t

# We cannot make 't_READ' etc. have higher priority than 't_ID'
# so we need to match them together and figure out token type
# in a function.
def t_NAME(t):
  r"[a-zA-Z_][a-zA-Z0-9_]*"
  keywords = {
    "set" : "SET",
    "read" : "READ",
    "print" : "PRINT",
  }
  t.type = keywords.get(t.value, "ID")
  return t

def t_ignoreme(t):
  r'[ \t\n]|\#.*'
  pass

def t_error(t):
  print(f"Illegal character {t.value[0]!r}")
  t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

with open(sys.argv[1]) as file:
  while line := file.readline():
    print(f"Line: {line}", end="")
    lexer.input(line)
    while True:
      tok = lexer.token()
      if not tok:
        break
      print(f"  {tok}")

Which does this:

$ ./math_tokenizer.py miles_to_km.math
Line: # Program for converting miles to km
Line: read miles
  LexToken(READ,'read',1,0)
  LexToken(ID,'miles',1,5)
Line: set kms = miles * 1.60934
  LexToken(SET,'set',1,0)
  LexToken(ID,'kms',1,4)
  LexToken(EQUALS,'=',1,8)
  LexToken(ID,'miles',1,10)
  LexToken(TIMES,'*',1,16)
  LexToken(NUMBER,1.60934,1,18)
Line: print kms
  LexToken(PRINT,'print',1,0)
  LexToken(ID,'kms',1,6)

The first thing to notice is that it's much longer than the hand-written one.

The second is that we need special handling of keywords, as those regexps are not in any priority order, so set being keyword (SET) and set being variable name (ID) are both possible matches, and we cannot rely on the tokenizer prioritizing one over the other. To be honest we probably should have done this two step process in our original parser as well, as this way makes it clear that settings shouldn't be parsed as two tokens set tings, while our previous implementation had a bug here (which could also have been addressed with by matching /set\b/ instead of /set/).

At least for this simple language, I don't think either of these approaches is significantly better than the other. Both hand-written and tool-generated tokenizers are perfectly fine.

Parser

Let's write the parser now. I'll keep the tokenizer in the same file:

...
# Build the lexer
lexer = lex.lex()

def p_line_statement(p):
  'line : statement'
  p[0] = p[1]

def p_line_empty(p):
  'line :'
  p[0] =  {"op": "ignore"}

def p_statement_set(p):
  "statement : SET ID EQUALS expression"
  p[0] = {"op": "set", "var": p[2], "value": p[4]}

def p_statement_print(p):
  "statement : PRINT expression"
  p[0] = {"op": "print", "value": p[2]}

def p_statement_read(p):
  "statement : READ ID"
  p[0] = {"op": "read", "var": p[2]}

def p_expression_plus(p):
  "expression : expression PLUS product"
  p[0] = {"op": "+", "left": p[1], "right": p[3]}

def p_expression_minus(p):
  "expression : expression MINUS product"
  p[0] = {"op": "-", "left": p[1], "right": p[3]}

def p_expression_product(p):
  "expression : product"
  p[0] = p[1]

def p_product_times(p):
  "product : product TIMES factor"
  p[0] = {"op": "*", "left": p[1], "right": p[3]}

def p_product_divide(p):
  "product : product DIVIDE factor"
  p[0] = {"op": "/", "left": p[1], "right": p[3]}

def p_product_factor(p):
  "product : factor"
  p[0] = p[1]

def p_factor_number(p):
  "factor : NUMBER"
  p[0] = {"op": "number", "value": p[1]}

def p_factor_var(p):
  "factor : ID"
  p[0] = {"op": "var", "value": p[1]}

def p_factor_paren(p):
  "factor : LPAREN expression RPAREN"
  p[0] = p[2]

def p_error(p):
  print(f"Syntax error at {p!r}")

import ply.yacc as yacc
yacc.yacc()

with open(sys.argv[1]) as file:
  while line := file.readline():
    print(f"Line: {line}", end="")
    r = yacc.parse(line)
    print(f"  {r}")

Which does this:

./math_parser.py miles_to_km.math
Line: # Program for converting miles to km
  {'op': 'ignore'}
Line: read miles
  {'op': 'read', 'var': 'miles'}
Line: set kms = miles * 1.60934
  {'op': 'set', 'var': 'kms', 'value': {'op': '*', 'left': {'op': 'var', 'value': 'miles'}, 'right': {'op': 'number', 'value': 1.60934}}}
Line: print kms
  {'op': 'print', 'value': {'op': 'var', 'value': 'kms'}}

This is also somewhat longer than our hand-written parser.

I put every rule in its own function. PLY also allows defining one function for multiple rules, but then the function code can get quite messy.

I'm a bit confused why PLY does p[0] = X instead of the more obvious return X. It might have something to do with the way it does code generation.

The nice thing is that we managed to add any rules we wanted here without modifications. Our hand-written parser couldn't handle left-recursive rules like "expression : expression + product".

To allow empty matches, we had to add additional concept of line which can be either statement or empty. Our original parser just handled lines without tokens in a special way, so it implicitly had this logic.

Run it!

And finally the code to run our math interpreter. This part isn't really specific to PLY, but it will be a bit more verbose in Python than it was in Ruby.

...
vars = {}

def evaluate_statement(s):
  if s["op"] == "set":
    vars[s["var"]] = evaluate(s["value"])
  elif s["op"] == "print":
    print(evaluate(s["value"]))
  elif s["op"] == "read":
    vars[s["var"]] = float(input("Enter a number: "))
  elif s["op"] == "ignore":
    pass
  else:
    raise Exception(f"Unknown statement type {s['op']}")

def evaluate(expr):
  if expr["op"] == "+":
    return evaluate(expr["left"]) + evaluate(expr["right"])
  elif expr["op"] == "-":
    return evaluate(expr["left"]) - evaluate(expr["right"])
  elif expr["op"] == "*":
    return evaluate(expr["left"]) * evaluate(expr["right"])
  elif expr["op"] == "/":
    return evaluate(expr["left"]) / evaluate(expr["right"])
  elif expr["op"] == "number":
    return expr["value"]
  elif expr["op"] == "var":
    return vars[expr["value"]]
  else:
    raise Exception(f"Unknown expression type {expr['op']}")

with open(sys.argv[1]) as file:
  while line := file.readline():
    r = yacc.parse(line)
    evaluate_statement(r)

Which does just what we'd expect:

$ ./math.py miles_to_km.math
Enter a number: 20
32.1868

It's not a perfect comparison, as our hand-written code was in Ruby, and this is in Python, but counting non-empty lines without end-only lines, hand-written one had 96 line, and PLY one has 125, so it's somewhat more verbose, but nothing too bad.

Should you use PLY?

PLY is a bit short on documentation, but it's perfectly fine for Python-based parsers for simple languages. I think you'd generally be better off writing a recursive descent parser, but if you don't want to do that, PLY is a reasonable choice too.

I'm not really a fan of LR-style parser generators, as they work in highly non-intuitive way, and when they can't parse something, their error reporting is universally terrible. They also have fairly rigid structure, so for more complex languages it might be a lot harder to do the kind of parsing you want. Interestingly, a lot of programming languages like GCC's C, C++, Go and so on started with a tool-based parser, but generally they moved on to a hand-written one. Python is about the only language I know that went the opposite direction, from a hand-written to a tool-based parser (but PEG not LR, we'll get to why it matters in a future episode).

Code

11