Having covered tokens vs. syntax, we can now talk about what “kind” of parser generator MGrammar is. As I said in a previous post, if you want the full discussion of parser generators you should really read the Dragon book, but here’s sort of how things go in a nutshell:
There are traditionally two ways to interpret grammars: LL and LR. In both cases the first "L" means that the input is read left-to-right. The second letter describes the derivation the parser builds: "L" for a leftmost derivation and "R" for a rightmost derivation (produced in reverse). An LL parser is also called a "top-down" parser because it starts with the top-most (or "start") syntax production of the language and then starts trying to fill in from there. So if you've got a grammar for, say, Visual Basic, an LL parser starts by saying "OK, I'm parsing a file. Well, a file can start with an Option statement, so I'll start looking there." If it doesn't find an Option statement, it says "OK, no Option statements, the next thing I can see is an Imports statement, so I'll start looking for those." And so on, down the grammar. This is also called "predictive" parsing: the parser always knows what it expects next, and either it finds it or it doesn't.
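To make that concrete, here's a minimal sketch of the predictive style in Python. The toy grammar (a file is just Option statements followed by Imports statements), the token kinds, and the helpers are all invented for illustration; the point is the "expect what comes next" flow, and the "expected X, found Y" style of error message that prediction makes easy.

```python
# A minimal sketch of predictive (top-down, LL) parsing, in the spirit of the
# Visual Basic example above. The toy grammar, token kinds, and helpers are
# all made up for illustration; this is not the real VB grammar.
#
#   File    ::= Option* Imports*
#   Option  ::= OPTION IDENT        e.g. "Option Strict"
#   Imports ::= IMPORTS IDENT       e.g. "Imports System"

def parse_file(tokens):
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(kind):
        # Prediction makes error messages easy: we always know what we wanted.
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        tok = tokens[pos]
        pos += 1
        return tok

    options, imports = [], []
    while peek() == "OPTION":          # first, look for Option statements...
        expect("OPTION")
        options.append(expect("IDENT")[1])
    while peek() == "IMPORTS":         # ...then Imports statements, and so on.
        expect("IMPORTS")
        imports.append(expect("IDENT")[1])
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()} at file level")
    return options, imports

# Tokens are (kind, text) pairs for "Option Strict" / "Imports System":
print(parse_file([("OPTION", "Option"), ("IDENT", "Strict"),
                  ("IMPORTS", "Imports"), ("IDENT", "System")]))
# -> (['Strict'], ['System'])
```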
An LL parser is nice because it fits with how we humans tend to think about parsing text. In fact, both the Visual Basic and C# parsers are handwritten LL parsers. LL parsers also have some nice properties for error recovery: since you always know what you're expecting, it's easy to tell the user what you were expecting to see. On the other hand, there are situations where prediction can fail you when trying to figure out what went wrong, usually when something is really out of place (say, a class declaration in the middle of a function declaration). This is because an LL parser only knows what it expects next; it doesn't think about all the things that could occur somewhere else.
An LR parser is the reverse of an LL parser. Instead of starting with the syntax and trying to fit tokens into it, it starts with the tokens and tries to piece together something that fits into the syntax. So if we're parsing a Visual Basic file with an LR parser, it would start by saying "OK, I found an 'Option' token. Does that fit anywhere in the grammar?" Since it hasn't made a complete anything yet, it shifts 'Option' onto a stack and then looks at the next part of the input. "OK, I found a 'Strict' token. Does a 'Strict' token combined with an 'Option' token fit anywhere in the grammar?" Since there is an 'Option Strict' statement, it can shift 'Strict' onto the stack and then reduce both elements to an 'Option Strict statement'. Then it would ask "Does an 'Option Strict statement' fit anywhere in the grammar?" and so on.
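To show the mechanics, here's that shift/reduce walkthrough as a heavily simplified Python sketch. It hard-codes a single invented rule and is nowhere near a real LR parser; it just demonstrates shifting tokens onto a stack and reducing them when they form a complete production.

```python
# A hand-rolled illustration of shift/reduce for the walkthrough above.
# One made-up rule: OptionStatement ::= "Option" "Strict"

def shift_reduce(tokens):
    stack = []
    for tok in tokens:
        stack.append(tok)                        # shift the token onto the stack
        if stack[-2:] == ["Option", "Strict"]:   # does the top form a complete rule?
            del stack[-2:]
            stack.append("OptionStatement")      # reduce it to the nonterminal
    return stack

print(shift_reduce(["Option", "Strict"]))        # ['OptionStatement']
```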
As you can see, an LR parser builds up the parse tree from the bottom: first it figures out the leaves and then tries to assemble those leaves into a coherent tree. An LR grammar is nice because it's very easy to generate a set of tables that describe a machine that can process the grammar. So it's easy to generate an efficient program to recognize the grammar. A downside of an LR parser is that when things go wrong you don't have the context that an LL parser has; you don't necessarily know what you're expecting to come next, so you can have a harder time recovering from problems.
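And here's the "set of tables that describe a machine" part in miniature: a hand-written ACTION/GOTO table for the same one-rule grammar (written here as S ::= "Option" "Strict"), driven by a small generic loop. A real LR parser generator would compute these tables from the grammar; the states and entries below are just for illustration.

```python
# A miniature table-driven LR recognizer for the one-rule grammar
#   S ::= "Option" "Strict"
# The tables are written by hand here; a parser generator would derive them.

ACTION = {
    (0, "Option"): ("shift", 1),
    (1, "Strict"): ("shift", 2),
    (2, "$"):      ("reduce", "S", 2),   # pop 2 symbols, then goto on S
    (3, "$"):      ("accept",),
}
GOTO = {(0, "S"): 3}

def lr_recognize(tokens):
    tokens = tokens + ["$"]              # end-of-input marker
    states, pos = [0], 0
    while True:
        action = ACTION.get((states[-1], tokens[pos]))
        if action is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r}")
        if action[0] == "shift":
            states.append(action[1])
            pos += 1
        elif action[0] == "reduce":
            _, lhs, size = action
            del states[-size:]           # pop the handle off the state stack
            states.append(GOTO[(states[-1], lhs)])
        else:                            # accept
            return True

print(lr_recognize(["Option", "Strict"]))    # True
```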
Another downside of an LR parser is that it may find ambiguities in the grammar that wouldn't exist for an LL parser. For example, a token sequence might reduce into two different syntax productions in the grammar that are totally distinct if you know what the context is. But since an LR parser doesn't know what syntax production it's looking for, the sequence is ambiguous to it. This means that coming up with a grammar for an LR parser can be difficult, because you get lots of conflicts that are not obvious to a human reading the grammar.
But all is not lost. That's because there's an improvement on an LR parser, called a GLR parser, that resolves many of these ambiguities. The "G" stands for "Generalized" and what a GLR parser does differently from a regular LR parser is allow for multiple threads of parsing. In other words, when you run into a conflict where, say, a set of tokens can reduce into two different productions, you simply explore both possibilities. The odds are that in most grammars the ambiguity will resolve itself pretty quickly (i.e. one of the threads will reach an error state), so you just run all possibilities in parallel and wait for one of the threads to survive the others. This fixes most of the conflicts caused solely by the nature of bottom-up parsing.
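Here's a very stripped-down sketch of that "multiple threads" idea in Python. The toy grammar is invented and deliberately has the kind of reduce/reduce ambiguity described earlier: after shifting an "a", the parser can't tell whether it has just seen an A or a B until the next token arrives, so the recognizer forks a stack for each possibility and lets the dead ends die off. (Real GLR implementations are far more sophisticated; among other things, they share the forked stacks in a graph rather than copying them.)

```python
# A much-simplified sketch of the GLR idea: keep several candidate parse
# stacks alive at once, fork whenever more than one action is possible, and
# drop any thread that hits a dead end. The grammar is made up:
#
#   S ::= A "x" | B "y"      A ::= "a"      B ::= "a"

RULES = [("A", ["a"]), ("B", ["a"]), ("S", ["A", "x"]), ("S", ["B", "y"])]

def reductions(stack):
    """All rules whose right-hand side matches the top of this stack."""
    return [(lhs, len(rhs)) for lhs, rhs in RULES if stack[-len(rhs):] == rhs]

def glr_recognize(tokens):
    stacks = [[]]                                    # one live parse thread
    for tok in tokens:
        forked = []
        for stack in stacks:
            forked.append(stack + [tok])             # thread: just shift
            for lhs, size in reductions(stack):      # thread(s): reduce, then shift
                forked.append(stack[:-size] + [lhs, tok])
        stacks = forked
    survivors = []
    for stack in stacks:                             # finish off each thread
        while True:
            reds = reductions(stack)
            if not reds:
                break
            lhs, size = reds[0]                      # simplification: take the first
            stack = stack[:-size] + [lhs]
        if stack == ["S"]:
            survivors.append(stack)
    return len(survivors) > 0

print(glr_recognize(["a", "x"]))    # True: the "A" thread survives, the "B" thread dies
print(glr_recognize(["a", "y"]))    # True: the "B" thread survives this time
print(glr_recognize(["a", "z"]))    # False: every thread dead-ends
```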
As you might have guessed from the way I've written this entry, MGrammar is a GLR parser generator. There are some tweaks we make to the algorithm, which I can discuss later, but the core of what we do is GLR. I wasn't actually part of the original design team, so I can't comment on exactly what led us to choose GLR over, say, LL, but my thinking above probably touches on some of the reasons.
I'd also add that LL and LR are not the last word in parser technology. There are other approaches to parsing, such as Parsing Expression Grammars (PEGs) and the related implementation strategy, packrat parsing. And research continues today, so I'm sure there'll be even more strategies in the future.