26
100 Languages Speedrun: Episode 14: Recursive Descent Parser
Back in episode 8, we took our first steps towards creating a new programming language, and implemented a tokenizer. If you didn't read that episode, I recommend doing it now, otherwise this episode might be a bit confusing.
We'll create a very simple language for doing math.
It will have one statement
per line, and there will only be these statements:
- reading number from user and saving it to variable
read name
- printing variable
print name
- setting variable
set name = expression
Any empty line will be ignored. #
will start a comment that will go until end of current line.
So a program could look like this:
# Program for converting miles to km
read miles
set kms = miles * 1.60934
print kms
It's very clear what a statement
is, but what is an expression
?
Let's write down the possibilities. These would be enough for our simple language:
- a number like
1.60934
( expression )
expression + expression
expression - expression
expression * expression
expression / expression
And we have the whole language!
Well, almost. There's just tiny problem of ambiguity. Right now the language has no idea if 3 + 4 * 5
means 3 + ( 4 * 5 )
or (3 + 4) * 5
. It also doesn't know if 2 - 3 - 4
is 2 - (3 - 4)
or (2 - 3) - 4
.
There are two approaches to solving this in use. One is to tell our parser "operator precedence" (if +
or *
goes first), and for each operator its associativity (left or right or not allowed - it determines which way 2 - 3 - 4
is parsed).
The other is to rewrite grammar to get rid of such ambiguity, and that's generally the simpler approach.
Instead of having expression
with 6 possibilities, it will now only have these:
expression + product
expression - product
product
Then product
is:
product * factor
product / factor
factor
And factor
is:
( expression )
number
Expressed this way, it's impossible to have ambiguities. You can verify it yourself. A product cannot possibly have anything with +
on either side, unless we use parentheses, because none of the rules it refers to has +
. And 2 - 3 - 4
cannot possibly mean 2 - (3 - 4)
because the right side of -
is a product
, so it cannot have any -
in it, without explicit parentheses.
By the way don't think too much about names like product
or factor
. I don't find them terribly intuitive myself. You could call it expression2
, expression3
or such.
A lot of "parser generator" tools accept definitions like what we just came up with. Unfortunately for this episode we won't be using a parser generator, we'll be writing Recursive Descent Parser ourselves, and it requires grammars to be tweaked a bit.
I didn't discuss what Recursive Descent Parser are yet. We'll get to it soon.
But a big limitation of Recursive Descent Parser is that none of the parsing rules can refer to itself on the left, because then it would goes into infinite recursive loop.
So let's rewrite our definitions in a way that won't have this issue:
expression
is:
product ( ("+"|"-") product )*
product
is:
factor ( ("*"|"/") factor ]*)
And factor
is:
"(" expression ")"
number
I had to extend the notation a bit. a|b
means a
or b
. ( ... )*
means zero or more of whatever's in the parentheses. And as there are special characters now, I'm quoting all literal symbols like "+"
, or "("
.
OK, let's start with tokenizer. It will use StringScanner
from Ruby standard library, and some regexp. Here's tokenizer program for our math language:
#!/usr/bin/env ruby
require "strscan"
require "pathname"
class MathLanguage
def initialize(path)
@lines = Pathname(path).readlines
end
def tokenize(string)
scanner = StringScanner.new(string)
tokens = []
until scanner.eos?
if scanner.scan(/\s+/)
# do nothing
elsif scanner.scan(/#.*/)
# comments - ignore rest of line
elsif scanner.scan(/-?\d+(?:\.\d*)?/)
tokens << { type: "number", value: scanner.matched.to_f }
elsif scanner.scan(/[\+\-\*\/=()]|set|read|print/)
tokens << { type: scanner.matched }
elsif scanner.scan(/[a-zA-Z][a-zA-Z0-9]*/)
tokens << { type: "id", value: scanner.matched }
else
raise "Invalid character #{scanner.rest}"
end
end
tokens
end
def call
@lines.each do |line|
tokens = tokenize(line)
puts "Line: \"#{line.chomp}\" has tokens:"
tokens.each do |token|
p token
end
end
end
end
unless ARGV.size == 1
STDERR.puts "Usage: #{$0} file.math"
exit 1
end
MathLanguage.new(ARGV[0]).call
If we run it, we can see it turns all lines into tokens, and skips comments:
$ ./math_tokenizer.rb miles_to_km.math
Line: "# Program for converting miles to km" has tokens:
Line: "read miles" has tokens:
{:type=>"read"}
{:type=>"id", :value=>"miles"}
Line: "set kms = miles * 1.60934" has tokens:
{:type=>"set"}
{:type=>"id", :value=>"kms"}
{:type=>"="}
{:type=>"id", :value=>"miles"}
{:type=>"*"}
{:type=>"number", :value=>1.60934}
Line: "print kms" has tokens:
{:type=>"print"}
{:type=>"id", :value=>"kms"}
We have a list of definitions. Now what is this Recursive Descent Parser I've been alluding to?
It's simply a bunch of method (or functions if you don't like OOP terminology) - one per grammar rule. Each method consumes some of the tokens from the left, and returns parsed content.
For Recursive Descent Parser, methods for rule X
should expect that X
will be on the left of the list, and raise exception if it isn't - but they should be OK with there being some leftover tokens.
As we're only parsing and not executing anything, for now our loop will look like this:
def call
@lines.each do |line|
@tokens = tokenize(line)
next if @tokens.empty?
statement = parse_statement
raise "Extra tokens left over" unless @tokens.empty?
pp statement
end
end
I'm saving @tokens
to instance variable, so we don't need to pass it to every method.
Also to simplify the code, here's some helper methods. next_token_is?
checks if next token is of any of the expected types and return true
or false
. expect_token
shifts token if it's one of expected types, or raises exception if it's of a wrong type:
def next_token_is?(*types)
@tokens[0] && types.include?(@tokens[0][:type])
end
def expect_token(*types)
raise "Parse error" unless next_token_is?(*types)
@tokens.shift
end
And with that, here's our parser:
def parse_factor
token = expect_token("number", "id", "(")
case token[:type]
when "number", "id"
token
when "("
result = parse_expression
expect_token(")")
result
end
end
def parse_product
result = parse_factor
while next_token_is?("*", "/")
op_token = @tokens.shift
result = {type: op_token[:type], left: result, right: parse_factor}
end
result
end
def parse_expression
result = parse_product
while next_token_is?("+", "-")
op_token = @tokens.shift
result = {type: op_token[:type], left: result, right: parse_product}
end
result
end
def parse_statement
token = expect_token("read", "set", "print")
case token[:type]
when "read"
token = expect_token("id")
{type: "read", id: token[:value]}
when "set"
var_token = expect_token("id")
expect_token("=")
expr = parse_expression
{type: "set", id: var_token[:value], expr: expr}
when "print"
token = expect_token("id")
{type: "print", id: token[:value]}
end
end
I recommend taking a good look at this code. Recursive Descent Parsers are very readable like that, but since it's probably not the type of code you've seen before, it might be a bit confusing initially.
We can give it a try:
$ ./math_parser.rb miles_to_km.math
{:type=>"read",
:id=>"miles"}
{:type=>"set",
:id=>"kms",
:expr=>
{:type=>"*",
:left=>{:type=>"id", :value=>"miles"},
:right=>{:type=>"number", :value=>1.60934}}}
{:type=>"print",
:id=>"kms"}
One huge advantage of Recursive Descent Parsers over parser generators is potential for great error reporting. Recursive Descent is much closer to the way humans think, so if there's an error, we can say something like At line 20 char 1: expected one of "read", "set", or "print", but got token: "number"
.
For complicated technical reasons, parsers created by parser generators often have atrociously bad error reporting.
I'm mentioning this, as for sake of simplicity, for this episode we won't be doing anything beyond raise "Parse error"
, but we could, and we will in the future.
To run the code we change the loop slightly:
def call
@variables = {}
@lines.each do |line|
@tokens = tokenize(line)
next if @tokens.empty?
statement = parse_statement
raise "Extra tokens left over" unless @tokens.empty?
eval_statement(statement)
end
end
And the code to evaluate parsed statements and expressions:
def eval_expression(expr)
case expr[:type]
when "number"
expr[:value]
when "id"
@variables[expr[:value]]
when "+", "-", "*", "/"
left = eval_expression(expr[:left])
right = eval_expression(expr[:right])
left.send(expr[:type], right)
else
raise
end
end
def eval_statement(statement)
case statement[:type]
when "read"
@variables[statement[:id]] = STDIN.readline.to_f
when "print"
puts @variables[statement[:id]]
when "set"
@variables[statement[:id]] = eval_expression(statement[:expr])
else
raise
end
end
I left the else raise
there to catch programming errors. If we coded things correctly, it should never be reached.
If you're not familiar with Ruby, left.send(expr[:type], right)
is equivalent to left + right
, left - right
, left * right
, or left / right
depending on expr[:type]
.
And that's all we needed to create our proper programming language:
$ ./math.rb miles_to_km.math
500
804.67
We didn't need any special tooling, just some regular expressions for tokenizer, and some plain recursive methods for parser and for execution.
This approach is perfectly fine for simple programming languages. You can just add more token types, more grammar rules, some error handling, and some interesting interpreter, and you can have a decent language for your needs. Nothing about it is specific to Ruby - you can use any language. Regular expression support is definitely a big plus, but people have been writing tokenizers without regular expressions too, looking one character at a time.
In the future we'll try to do something similar with tools like PLY (Python Lex-Yacc), or ANTLR.
26