LL vs. LR vs. GLR

Having covered tokens vs. syntax, we can now talk about what “kind” of parser generator MGrammar is. As I said in a previous post, if you want the full discussion of parser generators you should really read the Dragon book, but here’s sort of how things go in a nutshell:

There are traditionally two ways to interpret grammars: LL and LR. In both cases the first “L” means that the parser reads the input left-to-right. The second letter indicates the kind of derivation the parser produces: “L” for a leftmost derivation and “R” for a rightmost derivation (built in reverse). An LL parser is also called a “top-down” parser because it starts with the top-most (or “start”) syntax production of the language and then starts trying to fill in from there. So if you’ve got a grammar for, say, Visual Basic, an LL parser starts by saying “OK, I’m parsing a file. Well, a file can start with an Option statement, so I’ll start looking there.” If it doesn’t find an Option statement, it says “OK, no Option statements; the next thing I can see is an Import statement, so I’ll start looking for those.” And so on, down the grammar. This is also called “predictive” parsing: the parser always knows what it expects next, and either it finds it or it doesn’t.
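
To make the prediction idea concrete, here’s a minimal sketch of a predictive parser in Python. It’s purely illustrative: the token names and the tiny “file” grammar are made up for this post, not taken from MGrammar or the real Visual Basic grammar.

```python
# Illustrative predictive (LL) parser: the parser always knows which
# statements it expects next, and checks for them in grammar order.
# Token and node names here are invented for this sketch.

def parse_file(tokens):
    """A 'file' is zero or more Option statements, then Import statements."""
    pos = 0
    statements = []
    # Predict: the file may start with Option statements...
    while pos < len(tokens) and tokens[pos] == "Option":
        statements.append(("OptionStatement", tokens[pos + 1]))
        pos += 2
    # ...after which we expect Import statements.
    while pos < len(tokens) and tokens[pos] == "Import":
        statements.append(("ImportStatement", tokens[pos + 1]))
        pos += 2
    if pos != len(tokens):
        raise SyntaxError(f"expected Option or Import, got {tokens[pos]!r}")
    return statements

print(parse_file(["Option", "Strict", "Import", "System"]))
# [('OptionStatement', 'Strict'), ('ImportStatement', 'System')]
```

When the input doesn’t match the prediction, the error message can name exactly what was expected, which is the nice LL error-reporting property discussed below.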

An LL parser is nice because it fits with how we humans tend to think about parsing text. In fact, both the Visual Basic and C# parsers are handwritten LL parsers. LL parsers also have some nice properties for error recovery: since you always know what you’re expecting, it’s easy to tell the user what you were expecting to see. On the other hand, there are situations where prediction can fail you when trying to figure out what went wrong, usually when something is really out of place, say, a class declaration in the middle of a function declaration. This is because an LL parser only knows what it expects next; it doesn’t think about all the things that could occur somewhere else.

An LR parser is the reverse of an LL parser. Instead of starting with the syntax and trying to fit tokens into it, it starts with the tokens and tries to piece together something that fits into the syntax. So if we’re parsing a Visual Basic file with an LR parser, it would start by saying “OK, I found an ‘Option’ token. Does that fit anywhere in the grammar?” Since that doesn’t make a complete anything yet, it shifts ‘Option’ onto a stack and then looks at the next part of the input. “OK, I found a ‘Strict’ token. Does a ‘Strict’ token combined with an ‘Option’ token fit anywhere in the grammar?” Since there is an ‘Option Strict’ statement, it can shift ‘Strict’ onto the stack and then reduce both elements to an ‘Option Strict’ statement. Then it would ask “Does an ‘Option Strict’ statement fit anywhere in the grammar?” and so on.
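
That shift/reduce loop can be sketched in a few lines of Python. This is a hand-rolled toy, not how a table-driven LR parser is actually generated; the one production it knows about is the ‘Option Strict’ example above.

```python
# Toy shift/reduce loop tracing how `Option Strict` is recognized
# bottom-up. A real LR parser drives this loop from generated tables.

def lr_trace(tokens):
    stack, trace = [], []
    for tok in tokens:
        stack.append(tok)                 # shift the next token
        trace.append(f"shift {tok}")
        # Does the top of the stack now match a production?
        if stack[-2:] == ["Option", "Strict"]:
            stack[-2:] = ["OptionStrictStatement"]    # reduce
            trace.append("reduce Option Strict -> OptionStrictStatement")
    return stack, trace

stack, trace = lr_trace(["Option", "Strict"])
print(stack)   # ['OptionStrictStatement']
```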

As you can see, an LR parser builds up the parse tree from the bottom: first it figures out the leaves, and then it tries to assemble those leaves into a coherent tree. An LR grammar is nice because it’s very easy to generate a set of tables that describe a machine that can process the grammar, so it’s easy to generate an efficient program to recognize the grammar. A downside of an LR parser is that when things go wrong you don’t have the context that an LL parser has; you don’t necessarily know what you’re expecting to come next, so you can have a harder time recovering from problems.

Another downside of an LR parser is that it may find ambiguities in the grammar that wouldn’t exist for an LL parser. For example, a token sequence might reduce into two different syntax productions in the grammar that are totally distinct if you know what the context is. But since an LR parser doesn’t know what syntax production it’s looking for, it’s ambiguous. This means that coming up with a grammar for an LR parser can be difficult, because you get lots of conflicts that are not obvious to a human reading the grammar.
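
To make that kind of conflict concrete, consider a hypothetical grammar fragment (invented for this post, not from MGrammar) where the same token reduces two different ways depending on context:

```
Parameter ::= Identifier    // inside a method declaration
Argument  ::= Identifier    // inside a method call
```

A top-down parser knows whether it predicted a Parameter or an Argument before it ever sees the Identifier. A bottom-up parser just sees an Identifier on its stack, and unless the surrounding states or lookahead disambiguate the two, the generator reports a reduce-reduce conflict.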

But all is not lost. That’s because there’s an improvement on an LR parser, called a GLR parser, that fixes many of these ambiguities. The “G” stands for “Generalized” and what a GLR parser does that’s different than a regular LR parser is that it allows for multiple threads of parsing. In other words, when you run into a conflict where, say, a set of tokens can reduce into two different productions, you simply explore both possibilities. The odds are that in most grammars the ambiguity will resolve itself pretty quickly (i.e. one of the threads will reach an error state), so you just run all possibilities in parallel and wait for one of the threads to survive the others. This fixes most of the conflicts caused solely by the nature of bottom-up parsing.
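
The forking behavior can be sketched like this. The “table” here is a made-up toy, but the shape of the algorithm (fork on ambiguity, let dead-end forks die) is the GLR idea:

```python
# Toy sketch of GLR forking: keep a set of candidate stacks, fork when
# a token allows more than one action, and drop forks with no action.

def glr_step(stacks, token, table):
    """table maps (top-of-stack, token) -> list of possible new tops."""
    next_stacks = []
    for stack in stacks:
        top = stack[-1] if stack else None
        for new_top in table.get((top, token), []):   # one fork per action
            next_stacks.append(stack + [new_top])
    return next_stacks        # forks with no legal action disappear

# Hypothetical table: after 'a', the token 'b' is ambiguous (B1 or B2),
# but only the B1 reading can consume the following 'c'.
table = {(None, "a"): ["A"], ("A", "b"): ["B1", "B2"], ("B1", "c"): ["C"]}
stacks = [[]]
for tok in ["a", "b", "c"]:
    stacks = glr_step(stacks, tok, table)
print(stacks)   # [['A', 'B1', 'C']]; the B2 fork died at 'c'
```

A real GLR implementation shares common stack prefixes in a graph-structured stack rather than copying whole lists, but the control flow is the same.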

As you might have guessed by the way I’ve written this entry, MGrammar is a GLR parser generator. There are some tweaks we make to the algorithm, which I can discuss later, but the core of what we do is GLR. I wasn’t actually part of the original design team, so I can’t comment on exactly what led us to choose GLR over, say, LL, but my thinking above probably touches on some of the reasons.

I’d also add that LL and LR are not the last word in parser technology. There are also other approaches to parsing, such as Parsing Expression Grammars (PEGs) and the related packrat parsing implementation strategy. And research continues today, so I’m sure there’ll be even more strategies in the future.

12 thoughts on “LL vs. LR vs. GLR”

  1. Pingback: Anonymous

  2. Miguel

    Very good post, Mr. Paul! Really educational and using the perfect approach on the explanation.

    I guess this one, together with "Lexer vs Parser" one, validate half (even more) of my Compiler Design course, back at university days 🙂

    Congrats,
    M.

  3. Evan

    One reason to choose GLR over LL is that it is a more powerful parsing algorithm–for every LL parser there exists a GLR parser which recognizes the same grammar, but the reverse isn’t true.

    Just as the GLR (aka Tomita) algorithm is a generalization of LR, the Earley algorithm is a generalization of LL. You could call it GLL, but I don’t think anybody does. GLR and Earley are equally powerful.

  4. Anonymous

    This is a great high-level introduction to several different styles of parsers. In particular, you’ve done a good job describing LL and GLR parsers. However, I think you’ve made a few mistakes regarding LR parsers.

    You said that "[a] downside of a LR parser is that when things go wrong you don’t have the context that an LL parser has-you don’t necessarily know what you’re expecting to come next, so you can have a harder time recovering from problems."

    This is incorrect. Due to the LR parser’s stack of shifted symbols, it has *more* context available to it than an LL parser does for error reporting or recovery. Additionally, LR parsers tend to detect syntax errors sooner than LL parsers for equivalent languages.

    You also say that "another downside of an LR parser is that it may find ambiguities in the grammar that wouldn’t exist for an LL parser."

    This is also incorrect. The set of languages that are LR(1) is *larger* than the set of languages that are LL(k).

  5. Dietrich Epp

    Check your facts on this one. You say that a downside of an "LR parser is that it may find ambiguities in the grammar that wouldn’t exist for an LL parser." Not true.

    When you say that a grammar is ambiguous, that means that there’s a sentence which can be parsed in two different ways. For example, if you were writing a grammar for a calculator program, you might parse 1 + 2 * 3 as (1 + 2) * 3 or you might parse it as 1 + (2 * 3). If the grammar is ambiguous, then changing between an LL parser and an LR parser isn’t going to fix anything.

    Here’s the real factual error in the article: you seem to have forgotten that all LL grammars are also LR grammars. You can take any LL grammar and run it through an LR parser generator and you will get a parser that produces the same exact results. The reverse is not true. Take a very simple LR grammar, such as one expressing division. Because division is not associative, (1 / 2) / 3 is not the same as 1 / (2 / 3). Languages you’re familiar with always use the first definition. It’s really easy to express that in an LR grammar, but if you want to express it in an LL grammar, you have to get clever. When you get clever, it gets harder to read and understand the grammar.

    There are also other classes of languages and parsers, such as the extremely important LALR class, which is a subset of LR that is easier to write parsers for. Most automatically generated parsers are LALR. There is an even smaller subset, SLR, which is used as a stepping stone when generating LALR grammars. Nobody really uses SLR otherwise. And it’s fairly hard to find an LR parser generator.

    Please please please do some more research before you blog about it like this. The Aho Sethi Ullman book is classic but it’s a bit dense if you’re not used to reading books like that. You can also check out some compilers classes online from colleges that do that sort of thing.

  7. Arkanosis

    Well written, and very clear.

    There is a pathological case with LL parsers, which is not a problem for LR parsers: left-recursive grammars cause infinite recursion, e.g.

    foo ::= foo `a`
          | /* Nothing */

    (equivalent to the regexp "a*")

    For such cases, an LL parser says "Well, the next thing I expect is a ‘foo’, so let’s look for the first thing a ‘foo’ may be. It may be a ‘foo’ followed by the letter ‘a’, so the next thing I expect is a ‘foo’, so…"

    and so on, and so on…

    Best regards,

  8. Guillaume

    Nice post, very informative.
    It refreshes a bit my memory regarding the grammar and languages lessons.

    Any informative and useful links regarding PEG theory and techniques?

  9. a coder

    thank you, interesting article. i’ve been thinking about writing my own language the last few weeks again, and put together an expression evaluating thing and a basic shell. i’d like to learn more about this stuff. i’m going to check out the mgrammar language spec.

  10. Nikos Vaggalis

    Great article, but when you say "For example, a token sequence might reduce into two different syntax productions in the grammar that are totally distinct if you know what the context is. But since a LR parser doesn’t know what syntax production it’s looking for, it’s ambiguous", are you referring to type 2 grammars or type 1? Because type 2 grammars are not context-sensitive, and both LL and LR, although processing them differently, produce the same result.

  11. Nikos Vaggalis

    Great article but when you say :
    "A downside of a LR parser is that when things go wrong you don’t have the context that an LL parser has-you don’t necessarily know what you’re expecting to come next, so you can have a harder time recovering from problems."
    and
    "For example, a token sequence might reduce into two different syntax productions in the grammar that are totally distinct if you know what the context is. But since a LR parser doesn’t know what syntax production it’s looking for, it’s ambiguous. This means that coming up with a grammar for a LR parser can be difficult because you get lots of conflicts that are not obvious to a human reading the grammar."
    are you referring to a type 1 (context-sensitive) grammar? Because with a type 2 (context-free) grammar, an LL or LR parser uses the same grammar and, although it processes the sentence differently, the result would be the same.

