100 Languages Speedrun: Episode 14: Recursive Descent Parser

Back in episode 8, we took our first steps towards creating a new programming language, and implemented a tokenizer. If you didn't read that episode, I recommend doing it now, otherwise this episode might be a bit confusing.
Our language
We'll create a very simple language for doing math.
It will have one statement per line, and there will only be these statements:
  • reading number from user and saving it to variable read name
  • printing variable print name
  • setting variable set name = expression
  • Any empty line will be ignored. # will start a comment that will go until end of current line.
    So a program could look like this:
    # Program for converting miles to km
    read miles
    set kms = miles * 1.60934
    print kms
    What is an expression?
    It's very clear what a statement is, but what is an expression?
    Let's write down the possibilities. These would be enough for our simple language:
  • a number like 1.60934
  • ( expression )
  • expression + expression
  • expression - expression
  • expression * expression
  • expression / expression
  • And we have the whole language!
    Well, almost. There's just tiny problem of ambiguity. Right now the language has no idea if 3 + 4 * 5 means 3 + ( 4 * 5 ) or (3 + 4) * 5. It also doesn't know if 2 - 3 - 4 is 2 - (3 - 4) or (2 - 3) - 4.
    There are two approaches to solving this in use. One is to tell our parser "operator precedence" (if + or * goes first), and for each operator its associativity (left or right or not allowed - it determines which way 2 - 3 - 4 is parsed).
    The other is to rewrite grammar to get rid of such ambiguity, and that's generally the simpler approach.
    Unambiguous grammar
    Instead of having expression with 6 possibilities, it will now only have these:
  • expression + product
  • expression - product
  • product
  • Then product is:
  • product * factor
  • product / factor
  • factor
  • And factor is:
  • ( expression )
  • number
  • Expressed this way, it's impossible to have ambiguities. You can verify it yourself. A product cannot possibly have anything with + on either side, unless we use parentheses, because none of the rules it refers to has +. And 2 - 3 - 4 cannot possibly mean 2 - (3 - 4) because the right side of - is a product, so it cannot have any - in it, without explicit parentheses.
    By the way don't think too much about names like product or factor. I don't find them terribly intuitive myself. You could call it expression2, expression3 or such.
    A lot of "parser generator" tools accept definitions like what we just came up with. Unfortunately for this episode we won't be using a parser generator, we'll be writing Recursive Descent Parser ourselves, and it requires grammars to be tweaked a bit.
    Grammar for Recursive Descent Parser
    I didn't discuss what Recursive Descent Parser are yet. We'll get to it soon.
    But a big limitation of Recursive Descent Parser is that none of the parsing rules can refer to itself on the left, because then it would goes into infinite recursive loop.
    So let's rewrite our definitions in a way that won't have this issue:
    expression is:
  • product ( ("+"|"-") product )*
  • product is:
  • factor ( ("*"|"/") factor ]*)
  • And factor is:
  • "(" expression ")"
  • number
  • I had to extend the notation a bit. a|b means a or b. ( ... )* means zero or more of whatever's in the parentheses. And as there are special characters now, I'm quoting all literal symbols like "+", or "(".
    Tokenizer
    OK, let's start with tokenizer. It will use StringScanner from Ruby standard library, and some regexp. Here's tokenizer program for our math language:
    #!/usr/bin/env ruby
    
    require "strscan"
    require "pathname"
    
    class MathLanguage
      def initialize(path)
        @lines = Pathname(path).readlines
      end
    
      def tokenize(string)
        scanner = StringScanner.new(string)
        tokens = []
        until scanner.eos?
          if scanner.scan(/\s+/)
            # do nothing
          elsif scanner.scan(/#.*/)
            # comments - ignore rest of line
          elsif scanner.scan(/-?\d+(?:\.\d*)?/)
            tokens << { type: "number", value: scanner.matched.to_f }
          elsif scanner.scan(/[\+\-\*\/=()]|set|read|print/)
            tokens << { type: scanner.matched }
          elsif scanner.scan(/[a-zA-Z][a-zA-Z0-9]*/)
            tokens << { type: "id", value: scanner.matched }
          else
            raise "Invalid character #{scanner.rest}"
          end
        end
        tokens
      end
    
      def call
        @lines.each do |line|
          tokens = tokenize(line)
          puts "Line: \"#{line.chomp}\" has tokens:"
          tokens.each do |token|
            p token
          end
        end
      end
    end
    
    unless ARGV.size == 1
      STDERR.puts "Usage: #{$0} file.math"
      exit 1
    end
    
    MathLanguage.new(ARGV[0]).call
    If we run it, we can see it turns all lines into tokens, and skips comments:
    $ ./math_tokenizer.rb miles_to_km.math
    Line: "# Program for converting miles to km" has tokens:
    Line: "read miles" has tokens:
    {:type=>"read"}
    {:type=>"id", :value=>"miles"}
    Line: "set kms = miles * 1.60934" has tokens:
    {:type=>"set"}
    {:type=>"id", :value=>"kms"}
    {:type=>"="}
    {:type=>"id", :value=>"miles"}
    {:type=>"*"}
    {:type=>"number", :value=>1.60934}
    Line: "print kms" has tokens:
    {:type=>"print"}
    {:type=>"id", :value=>"kms"}
    What is a Recursive Descent Parser?
    We have a list of definitions. Now what is this Recursive Descent Parser I've been alluding to?
    It's simply a bunch of method (or functions if you don't like OOP terminology) - one per grammar rule. Each method consumes some of the tokens from the left, and returns parsed content.
    For Recursive Descent Parser, methods for rule X should expect that X will be on the left of the list, and raise exception if it isn't - but they should be OK with there being some leftover tokens.
    As we're only parsing and not executing anything, for now our loop will look like this:
    def call
        @lines.each do |line|
          @tokens = tokenize(line)
          next if @tokens.empty?
          statement = parse_statement
          raise "Extra tokens left over" unless @tokens.empty?
          pp statement
        end
      end
    I'm saving @tokens to instance variable, so we don't need to pass it to every method.
    Also to simplify the code, here's some helper methods. next_token_is? checks if next token is of any of the expected types and return true or false. expect_token shifts token if it's one of expected types, or raises exception if it's of a wrong type:
    def next_token_is?(*types)
        @tokens[0] && types.include?(@tokens[0][:type])
      end
    
      def expect_token(*types)
        raise "Parse error" unless next_token_is?(*types)
        @tokens.shift
      end
    And with that, here's our parser:
    def parse_factor
        token = expect_token("number", "id", "(")
        case token[:type]
        when "number", "id"
          token
        when "("
          result = parse_expression
          expect_token(")")
          result
        end
      end
    
      def parse_product
        result = parse_factor
        while next_token_is?("*", "/")
          op_token = @tokens.shift
          result = {type: op_token[:type], left: result, right: parse_factor}
        end
        result
      end
    
      def parse_expression
        result = parse_product
        while next_token_is?("+", "-")
          op_token = @tokens.shift
          result = {type: op_token[:type], left: result, right: parse_product}
        end
        result
      end
    
      def parse_statement
        token = expect_token("read", "set", "print")
        case token[:type]
        when "read"
          token = expect_token("id")
          {type: "read", id: token[:value]}
        when "set"
          var_token = expect_token("id")
          expect_token("=")
          expr = parse_expression
          {type: "set", id: var_token[:value], expr: expr}
        when "print"
          token = expect_token("id")
          {type: "print", id: token[:value]}
        end
      end
    I recommend taking a good look at this code. Recursive Descent Parsers are very readable like that, but since it's probably not the type of code you've seen before, it might be a bit confusing initially.
    We can give it a try:
    $  ./math_parser.rb miles_to_km.math
    {:type=>"read",
     :id=>"miles"}
    {:type=>"set",
     :id=>"kms",
     :expr=>
      {:type=>"*",
       :left=>{:type=>"id", :value=>"miles"},
       :right=>{:type=>"number", :value=>1.60934}}}
    {:type=>"print",
     :id=>"kms"}
    Error handling
    One huge advantage of Recursive Descent Parsers over parser generators is potential for great error reporting. Recursive Descent is much closer to the way humans think, so if there's an error, we can say something like At line 20 char 1: expected one of "read", "set", or "print", but got token: "number".
    For complicated technical reasons, parsers created by parser generators often have atrociously bad error reporting.
    I'm mentioning this, as for sake of simplicity, for this episode we won't be doing anything beyond raise "Parse error", but we could, and we will in the future.
    Run it!
    To run the code we change the loop slightly:
    def call
        @variables = {}
        @lines.each do |line|
          @tokens = tokenize(line)
          next if @tokens.empty?
          statement = parse_statement
          raise "Extra tokens left over" unless @tokens.empty?
          eval_statement(statement)
        end
      end
    And the code to evaluate parsed statements and expressions:
    def eval_expression(expr)
        case expr[:type]
        when "number"
          expr[:value]
        when "id"
          @variables[expr[:value]]
        when "+", "-", "*", "/"
          left = eval_expression(expr[:left])
          right = eval_expression(expr[:right])
          left.send(expr[:type], right)
        else
          raise
        end
      end
    
      def eval_statement(statement)
        case statement[:type]
        when "read"
          @variables[statement[:id]] = STDIN.readline.to_f
        when "print"
          puts @variables[statement[:id]]
        when "set"
          @variables[statement[:id]] = eval_expression(statement[:expr])
        else
          raise
        end
      end
    I left the else raise there to catch programming errors. If we coded things correctly, it should never be reached.
    If you're not familiar with Ruby, left.send(expr[:type], right) is equivalent to left + right, left - right, left * right, or left / right depending on expr[:type].
    And that's all we needed to create our proper programming language:
    $ ./math.rb miles_to_km.math
    500
    804.67
    Where do we go from here?
    We didn't need any special tooling, just some regular expressions for tokenizer, and some plain recursive methods for parser and for execution.
    This approach is perfectly fine for simple programming languages. You can just add more token types, more grammar rules, some error handling, and some interesting interpreter, and you can have a decent language for your needs. Nothing about it is specific to Ruby - you can use any language. Regular expression support is definitely a big plus, but people have been writing tokenizers without regular expressions too, looking one character at a time.
    In the future we'll try to do something similar with tools like PLY (Python Lex-Yacc), or ANTLR.
    Code

    31

    This website collects cookies to deliver better user experience

    100 Languages Speedrun: Episode 14: Recursive Descent Parser