Revisiting “DSLs are a bad idea”…

My post a few months ago entitled “DSLs: Definitely a bad idea!” certainly seemed to touch a nerve for some people. What was intended to be a sort of lighthearted throwaway entry engendered a lot stronger reactions than I’d expected. So I thought it might be worthwhile to back up for a second and revisit the issue, address some (though not all) of the comments, and talk about how they connect to how I see M.

To start with, yes, of course, I was engaging in a bit of hyperbole by saying that DSLs are a “bad idea.” DSLs are just a tool, and tools are neither bad nor good in and of themselves-it’s all in how you use them. I don’t believe that DSLs are so dangerous that only highly qualified and well studied people should be allowed to attempt them. But I do believe that they have enough of a capacity to cause mischief that they should not be entered into lightly unless you are just fooling around. I mean, if you’re just creating a DSL for your own amusement, then who cares? Have a ball! In fact, I think it’s a great way to learn about programming languages and how they are put together. But if you’re in an environment where you’re going to depend on that DSL for something, or if other people are going to have to depend on it, then you should really take a step back and think carefully before heading down that path. Because DSLs have a nasty habit of taking on lives of their own if you’re not really careful about it.

The question then arises: “Paul, you work on the DSL component of M. Why are you telling us we should think twice before creating a DSL? Aren’t you speaking against your own interest?” I don’t see it that way, for this reason: what I’m mainly talking about here is that one should pause before one creates an entirely new DSL. Maybe this was imprecision in my previous piece that I should have corrected, but I think one can have much less concern when one is working in a domain where there is already a language of some sort defined.

This, to my mind, is where M’s DSL capabilities really shine. If you’ve got a domain that has no domain-specific language associated with it, think carefully before you embark on creating one. But if you’re working in some domain that already has some domain-specific language associated with it, the damage is already done, so to speak, so you might as well jump in and try to make things better! In particular, if you’ve got a domain that has a language associated with it that is more machine-friendly than human-friendly (*cough* angle brackets *cough*), then I think that M’s DSL capabilities are going to be hugely useful for you.

So overall, I’m just cautioning against the hammeritis that often infects the programming community (you know, “when all you have is a hammer, .”). DSLs are a wonderful tool, and I’m hoping to soon start talking about how you put them together. Just, you know, hammer responsibly.

LL vs. LR vs. GLR

Having covered tokens vs. syntax, we can now talk about what “kind” of parser generator MGrammar is. As I said in a previous post, if you want the full discussion of parser generators you should really read the Dragon book, but here’s sort of how things go in a nutshell:

There are traditionally two ways to interpret grammars: LL and LR. In both cases the first “L” means that the grammars read left-to-right. The second character indicates how the grammar is interpreted: “L” for left-to-right and “R” for right-to-left. An LL parser is also called a “top-down” parser because it starts with the top-most (or “start”) syntax production of the language and then starts trying to fill in from there. So if you’ve got a grammar for, say, Visual Basic, a LL parser starts by saying “OK, I’m parsing a file. Well, a file can start with an Option statement, so I’ll start looking there.” If it doesn’t find an Option statement, it says “OK, no Option statements, the next thing I can see is an Import statement, so I’ll start looking for those.” And so on, down the grammar. This is also called “predictive” parsing-the parser always knows what it expects next and either it finds it or it doesn’t.

An LL parser is nice because it fits with how we humans tend to think about parsing text. In fact, both the Visual Basic and C# parsers are handwritten LL parsers. LL parsers also have some nice properties for error recovery-since you always know what you’re expecting, it’s easy to tell the user what you were expecting to see. On the other hand, there are situations where prediction can fail you when trying to figure out what went wrong, usually when something is really out of place-say, a class declaration in the middle of a function declaration. This is because an LL parser only knows what it expects next-it doesn’t think about all the things that could occur somewhere else.

A LR parser is the reverse of a LL parser. Instead of starting with the syntax and trying to fit tokens into it, it starts with the tokens and tries to piece together something that fits into the syntax. So if we’re parsing a Visual Basic file with a LR parser, it would start by saying “OK, I found an `Option’ token. Does that fit anywhere in the grammar?” Since it didn’t make a complete anything yet, it’d shift `Option’ on a stack and then look at the next part of the input. “OK, I found a `Strict’ token. Does a `Strict’ token combined with an `Option’ token fit anywhere in the grammar?” Since there is an `Option Strict’ statement, it can shift `Strict’ onto the stack and then reduce both elements to an `Option Strict statement’. Then it would ask “Does `Option Strict statement’ fit anywhere in the grammar?” and so on.

As you can see, a LR parser builds up the parse tree from the bottom-first it figures out the leaves and then tries to assemble those leaves into a coherent tree. A LR grammar is nice because it’s very easy to generate a set of tables that describe a machine that can process the grammar. So it’s easy to generate an efficient program to recognize the grammar. A downside of a LR parser is that when things go wrong you don’t have the context that an LL parser has-you don’t necessarily know what you’re expecting to come next, so you can have a harder time recovering from problems.

Another downside of an LR parser is that it may find ambiguities in the grammar that wouldn’t exist for an LL parser. For example, a token sequence might reduce into two different syntax productions in the grammar that are totally distinct if you know what the context is. But since a LR parser doesn’t know what syntax production it’s looking for, it’s ambiguous. This means that coming up with a grammar for a LR parser can be difficult because you get lots of conflicts that are not obvious to a human reading the grammar.

But all is not lost. That’s because there’s an improvement on an LR parser, called a GLR parser, that fixes many of these ambiguities. The “G” stands for “Generalized” and what a GLR parser does that’s different than a regular LR parser is that it allows for multiple threads of parsing. In other words, when you run into a conflict where, say, a set of tokens can reduce into two different productions, you simply explore both possibilities. The odds are that in most grammars the ambiguity will resolve itself pretty quickly (i.e. one of the threads will reach an error state), so you just run all possibilities in parallel and wait for one of the threads to survive the others. This fixes most of the conflicts caused solely by the nature of bottom-up parsing.

As you might have guessed by the way I’ve written this entry, MGrammar is a GLR parser generator. There are some tweaks to the algorithm that we make, which I can discuss later, but the core of what we do is GLR. I wasn’t actually part of the original design team, so I can’t comment on exactly what led us to choose GLR over, say, LL but my thinking above probably touches on some of the reasons.

I’d also add that LL and LR are not the last word in parser technology. There are also other approaches to parsing such as Parsing Expression Grammars (PEGs) and the related implementation strategy packrat parsing. And research continues today, so I’m sure there’ll be even more strategies in the future.

Did Joel Spolksy nearly drive my coworker insane?

Maybe.

Dare had a pointer today to a programming.reddit story that talks about competing versions of a dustup between Joel and Greg Whitten back when Joel used to work for Excel. I could care less who’s right, but the interesting implication of the story is that Joel was the one who came up with the application interface for VBA. I’m not sure that’s entirely true-a lot people worked on VBA-but if it is, he might be responsible for nearly driving a former co-worker insane.

When I started at Microsoft, I worked on Access. Access used a version of BASIC called Embedded Basic (or EB) for its programmability story. EB may not have been perfect, but I always thought it was a really nice language engine-it was small, it was fast, and it worked really well with Access and its programming model. After Access 1.0 shipped, however, EB was discontinued in favor of VBA which was the new official programmability story for Microsoft. There were a few people left on EB whose job, as far as I could tell, were just to tell me “no” when I came to them and asked them to fix bugs. (My interactions with the EB team at this point inspired my program manager to tell me that one of my special skills seemed to be “being able to tell people to f**k off, but in such a nice way that there’s nothing they can say in return.” Suffice it to say, I didn’t have a great relationship with the EB folks.)

By the time we got to Access 95, we knew we had to move over to VBA. The head of the project stopped by and asked me if I was interested in maybe leading the effort to switch over to VBA. In what I can only describe as an extreme fit of sanity, I declined. Another coworker took on the job. Suffice it to say, it was nearly a complete disaster. VBA was designed to work well with Excel, which had a totally different storage, execution, and debugging model than Access. As a result, almost everything having to do with programmability had to be redesigned, often in some pretty hairy ways. From the outside, it seems like a miracle that it ever worked, much less worked well.

As you can imagine, this was pretty hard on my coworker, who was responsible for getting this working in Access. As I remember, the VBA team at the time (Joel was gone by that point, I think) was somewhat sympathetic but not overly helpful-after all, they worked really well with Excel and we had all these “weird” requirements. Anyway, I knew things weren’t going well when my coworker came into my office and told me, “I found this really stupid bug that I made, so I decided to rename all the local variables to say insulting things about me.” Then one day I came in to work and found my coworker wasn’t there. When I asked where he was, I was told that he’d been ordered to take several days off and fly to Vegas to unwind for a while. I believe they were worried he might totally go off the deep end.

In the end, though, we shipped, my coworker did not go insane, and everything ended up OK. But it’s interesting to think what might have happened.