100 Languages Speedrun: Episode 14: Recursive Descent Parser

Back in episode 8, we took our first steps towards creating a new programming language, and implemented a tokenizer. If you didn't read that episode, I recommend doing it now, otherwise this episode might be a bit confusing.

Our language

We'll create a very simple language for doing math.

It will have one statement per line, and there will only be these statements:

reading number from user and saving it to variable read name

printing variable print name

setting variable set name = expression

Any empty line will be ignored. # will start a comment that will go until end of current line.

So a program could look like this:

# Program for converting miles to km
read miles
set kms = miles * 1.60934
print kms

What is an expression?

It's very clear what a statement is, but what is an expression?

Let's write down the possibilities. These would be enough for our simple language:

a number like 1.60934

( expression )

expression + expression

expression - expression

expression * expression

expression / expression

And we have the whole language!

Well, almost. There's just tiny problem of ambiguity. Right now the language has no idea if 3 + 4 * 5 means 3 + ( 4 * 5 ) or (3 + 4) * 5. It also doesn't know if 2 - 3 - 4 is 2 - (3 - 4) or (2 - 3) - 4.

There are two approaches to solving this in use. One is to tell our parser "operator precedence" (if + or * goes first), and for each operator its associativity (left or right or not allowed - it determines which way 2 - 3 - 4 is parsed).

The other is to rewrite grammar to get rid of such ambiguity, and that's generally the simpler approach.

Unambiguous grammar

Instead of having expression with 6 possibilities, it will now only have these:

expression + product

expression - product

product

Then product is:

product * factor

product / factor

factor

And factor is:

( expression )

number

Expressed this way, it's impossible to have ambiguities. You can verify it yourself. A product cannot possibly have anything with + on either side, unless we use parentheses, because none of the rules it refers to has +. And 2 - 3 - 4 cannot possibly mean 2 - (3 - 4) because the right side of - is a product, so it cannot have any - in it, without explicit parentheses.

By the way don't think too much about names like product or factor. I don't find them terribly intuitive myself. You could call it expression2, expression3 or such.

A lot of "parser generator" tools accept definitions like what we just came up with. Unfortunately for this episode we won't be using a parser generator, we'll be writing Recursive Descent Parser ourselves, and it requires grammars to be tweaked a bit.

Grammar for Recursive Descent Parser

I didn't discuss what Recursive Descent Parser are yet. We'll get to it soon.

But a big limitation of Recursive Descent Parser is that none of the parsing rules can refer to itself on the left, because then it would goes into infinite recursive loop.

So let's rewrite our definitions in a way that won't have this issue:

expression is:

product ( ("+"|"-") product )*

product is:

factor ( ("*"|"/") factor ]*)

And factor is:

"(" expression ")"

number

I had to extend the notation a bit. a|b means a or b. ( ... )* means zero or more of whatever's in the parentheses. And as there are special characters now, I'm quoting all literal symbols like "+", or "(".

Tokenizer

OK, let's start with tokenizer. It will use StringScanner from Ruby standard library, and some regexp. Here's tokenizer program for our math language:

#!/usr/bin/env ruby

require "strscan"
require "pathname"

class MathLanguage
  def initialize(path)
    @lines = Pathname(path).readlines
  end

  def tokenize(string)
    scanner = StringScanner.new(string)
    tokens = []
    until scanner.eos?
      if scanner.scan(/\s+/)
        # do nothing
      elsif scanner.scan(/#.*/)
        # comments - ignore rest of line
      elsif scanner.scan(/-?\d+(?:\.\d*)?/)
        tokens << { type: "number", value: scanner.matched.to_f }
      elsif scanner.scan(/[\+\-\*\/=()]|set|read|print/)
        tokens << { type: scanner.matched }
      elsif scanner.scan(/[a-zA-Z][a-zA-Z0-9]*/)
        tokens << { type: "id", value: scanner.matched }
      else
        raise "Invalid character #{scanner.rest}"
      end
    end
    tokens
  end

  def call
    @lines.each do |line|
      tokens = tokenize(line)
      puts "Line: \"#{line.chomp}\" has tokens:"
      tokens.each do |token|
        p token
      end
    end
  end
end

unless ARGV.size == 1
  STDERR.puts "Usage: #{$0} file.math"
  exit 1
end

MathLanguage.new(ARGV[0]).call

If we run it, we can see it turns all lines into tokens, and skips comments:

$ ./math_tokenizer.rb miles_to_km.math
Line: "# Program for converting miles to km" has tokens:
Line: "read miles" has tokens:
{:type=>"read"}
{:type=>"id", :value=>"miles"}
Line: "set kms = miles * 1.60934" has tokens:
{:type=>"set"}
{:type=>"id", :value=>"kms"}
{:type=>"="}
{:type=>"id", :value=>"miles"}
{:type=>"*"}
{:type=>"number", :value=>1.60934}
Line: "print kms" has tokens:
{:type=>"print"}
{:type=>"id", :value=>"kms"}

What is a Recursive Descent Parser?

We have a list of definitions. Now what is this Recursive Descent Parser I've been alluding to?

It's simply a bunch of method (or functions if you don't like OOP terminology) - one per grammar rule. Each method consumes some of the tokens from the left, and returns parsed content.

For Recursive Descent Parser, methods for rule X should expect that X will be on the left of the list, and raise exception if it isn't - but they should be OK with there being some leftover tokens.

As we're only parsing and not executing anything, for now our loop will look like this:

def call
    @lines.each do |line|
      @tokens = tokenize(line)
      next if @tokens.empty?
      statement = parse_statement
      raise "Extra tokens left over" unless @tokens.empty?
      pp statement
    end
  end

I'm saving @tokens to instance variable, so we don't need to pass it to every method.

Also to simplify the code, here's some helper methods. next_token_is? checks if next token is of any of the expected types and return true or false. expect_token shifts token if it's one of expected types, or raises exception if it's of a wrong type:

def next_token_is?(*types)
    @tokens[0] && types.include?(@tokens[0][:type])
  end

  def expect_token(*types)
    raise "Parse error" unless next_token_is?(*types)
    @tokens.shift
  end

And with that, here's our parser:

def parse_factor
    token = expect_token("number", "id", "(")
    case token[:type]
    when "number", "id"
      token
    when "("
      result = parse_expression
      expect_token(")")
      result
    end
  end

  def parse_product
    result = parse_factor
    while next_token_is?("*", "/")
      op_token = @tokens.shift
      result = {type: op_token[:type], left: result, right: parse_factor}
    end
    result
  end

  def parse_expression
    result = parse_product
    while next_token_is?("+", "-")
      op_token = @tokens.shift
      result = {type: op_token[:type], left: result, right: parse_product}
    end
    result
  end

  def parse_statement
    token = expect_token("read", "set", "print")
    case token[:type]
    when "read"
      token = expect_token("id")
      {type: "read", id: token[:value]}
    when "set"
      var_token = expect_token("id")
      expect_token("=")
      expr = parse_expression
      {type: "set", id: var_token[:value], expr: expr}
    when "print"
      token = expect_token("id")
      {type: "print", id: token[:value]}
    end
  end

I recommend taking a good look at this code. Recursive Descent Parsers are very readable like that, but since it's probably not the type of code you've seen before, it might be a bit confusing initially.

We can give it a try:

$  ./math_parser.rb miles_to_km.math
{:type=>"read",
 :id=>"miles"}
{:type=>"set",
 :id=>"kms",
 :expr=>
  {:type=>"*",
   :left=>{:type=>"id", :value=>"miles"},
   :right=>{:type=>"number", :value=>1.60934}}}
{:type=>"print",
 :id=>"kms"}

Error handling

One huge advantage of Recursive Descent Parsers over parser generators is potential for great error reporting. Recursive Descent is much closer to the way humans think, so if there's an error, we can say something like At line 20 char 1: expected one of "read", "set", or "print", but got token: "number".

For complicated technical reasons, parsers created by parser generators often have atrociously bad error reporting.

I'm mentioning this, as for sake of simplicity, for this episode we won't be doing anything beyond raise "Parse error", but we could, and we will in the future.

Run it!

To run the code we change the loop slightly:

def call
    @variables = {}
    @lines.each do |line|
      @tokens = tokenize(line)
      next if @tokens.empty?
      statement = parse_statement
      raise "Extra tokens left over" unless @tokens.empty?
      eval_statement(statement)
    end
  end

And the code to evaluate parsed statements and expressions:

def eval_expression(expr)
    case expr[:type]
    when "number"
      expr[:value]
    when "id"
      @variables[expr[:value]]
    when "+", "-", "*", "/"
      left = eval_expression(expr[:left])
      right = eval_expression(expr[:right])
      left.send(expr[:type], right)
    else
      raise
    end
  end

  def eval_statement(statement)
    case statement[:type]
    when "read"
      @variables[statement[:id]] = STDIN.readline.to_f
    when "print"
      puts @variables[statement[:id]]
    when "set"
      @variables[statement[:id]] = eval_expression(statement[:expr])
    else
      raise
    end
  end

I left the else raise there to catch programming errors. If we coded things correctly, it should never be reached.

If you're not familiar with Ruby, left.send(expr[:type], right) is equivalent to left + right, left - right, left * right, or left / right depending on expr[:type].

And that's all we needed to create our proper programming language:

$ ./math.rb miles_to_km.math
500
804.67

Where do we go from here?

We didn't need any special tooling, just some regular expressions for tokenizer, and some plain recursive methods for parser and for execution.

This approach is perfectly fine for simple programming languages. You can just add more token types, more grammar rules, some error handling, and some interesting interpreter, and you can have a decent language for your needs. Nothing about it is specific to Ruby - you can use any language. Regular expression support is definitely a big plus, but people have been writing tokenizers without regular expressions too, looking one character at a time.

In the future we'll try to do something similar with tools like PLY (Python Lex-Yacc), or ANTLR.

Code