Category Archives: MGrammar

“Oslo” has a May 2009 CTP…

In case you missed it, we pushed out a new CTP of “Oslo” this week. You can get it at the Oslo Developer Center. New stuff includes:

  • The “Quadrant” modeling tool. Use Quadrant to browse and edit models in a repository database.
  • Domain models for the UML 2.1 specification, encompassing Use Case, Activity, Class, Sequence, and Component diagrams, as well as profiles and templates.
  • An XMI importer supporting the UML 2.1 specification and covering the diagrams identified above.
  • A domain model and loader for System.Runtime.

There isn’t a huge amount of change for the language portion of M in this CTP. Most of the major differences are under the hood and will only be apparent if you’re calling the underlying APIs directly. The only big thing I can think of is that the “identifier” keyword is no longer needed in a grammar; we’ll automatically detect the situation for you.

The feature I spent most of our last milestone working on appears to be finally coming together, so I hope to have more to say about that soon.

Catching up on free media

Things have been a bit quiet around Panopticon Central lately because I’ve been heads down developing a particularly gnarly feature in M. More on that if and when it starts to see the light of day. But in the meantime, a few interviews and talks have been posted that I wanted to point out:

Hope you enjoy them!

Revisiting “DSLs are a bad idea”…

My post a few months ago entitled “DSLs: Definitely a bad idea!” certainly seemed to touch a nerve for some people. What was intended as a lighthearted throwaway entry engendered much stronger reactions than I’d expected. So I thought it might be worthwhile to back up for a second and revisit the issue, address some (though not all) of the comments, and talk about how they connect to how I see M.

To start with, yes, of course, I was engaging in a bit of hyperbole by saying that DSLs are a “bad idea.” DSLs are just a tool, and tools are neither bad nor good in and of themselves; it’s all in how you use them. I don’t believe that DSLs are so dangerous that only highly qualified and well studied people should be allowed to attempt them. But I do believe that they have enough of a capacity to cause mischief that they should not be entered into lightly unless you are just fooling around. I mean, if you’re just creating a DSL for your own amusement, then who cares? Have a ball! In fact, I think it’s a great way to learn about programming languages and how they are put together. But if you’re in an environment where you’re going to depend on that DSL for something, or if other people are going to have to depend on it, then you should really take a step back and think carefully before heading down that path. Because DSLs have a nasty habit of taking on lives of their own if you’re not really careful about it.

The question then arises: “Paul, you work on the DSL component of M. Why are you telling us we should think twice before creating a DSL? Aren’t you speaking against your own interest?” I don’t see it that way, for this reason: what I’m mainly talking about here is that one should pause before one creates an entirely new DSL. Maybe this was imprecision in my previous piece that I should have corrected, but I think one can have much less concern when one is working in a domain where there is already a language of some sort defined.

This, to my mind, is where M’s DSL capabilities really shine. If you’ve got a domain that has no domain-specific language associated with it, think carefully before you embark on creating one. But if you’re working in some domain that already has some domain-specific language associated with it, the damage is already done, so to speak, so you might as well jump in and try to make things better! In particular, if you’ve got a domain that has a language associated with it that is more machine-friendly than human-friendly (*cough* angle brackets *cough*), then I think that M’s DSL capabilities are going to be hugely useful for you.

So overall, I’m just cautioning against the hammeritis that often infects the programming community (you know, “when all you have is a hammer…”). DSLs are a wonderful tool, and I’m hoping to soon start talking about how you put them together. Just, you know, hammer responsibly.

LL vs. LR vs. GLR

Having covered tokens vs. syntax, we can now talk about what “kind” of parser generator MGrammar is. As I said in a previous post, if you want the full discussion of parser generators you should really read the Dragon book, but here’s sort of how things go in a nutshell:

There are traditionally two ways to interpret grammars: LL and LR. In both cases the first “L” means that the parser reads the input left-to-right. The second letter describes the derivation the parser builds: “L” for a leftmost derivation, worked out from the top down, and “R” for a rightmost derivation (in reverse), worked out from the bottom up. An LL parser is also called a “top-down” parser because it starts with the top-most (or “start”) syntax production of the language and then starts trying to fill in from there. So if you’ve got a grammar for, say, Visual Basic, an LL parser starts by saying “OK, I’m parsing a file. Well, a file can start with an Option statement, so I’ll start looking there.” If it doesn’t find an Option statement, it says “OK, no Option statements, the next thing I can see is an Import statement, so I’ll start looking for those.” And so on, down the grammar. This is also called “predictive” parsing: the parser always knows what it expects next, and either it finds it or it doesn’t.
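To make “predictive” concrete, here’s a minimal sketch of a hand-written top-down parser for a toy slice of that file structure. It’s plain Python rather than anything MGrammar generates, and the token names and grammar rules are invented for illustration; the point is just that the parser always knows which production it is trying to fill in next.

```python
# A predictive (LL-style, top-down) parser for a toy file grammar:
#   File   ::= Option* Import*
#   Option ::= "Option" ("Strict" | "Explicit")
#   Import ::= "Imports" name
# The parser always knows which production it expects next.

def parse_file(tokens):
    pos = 0
    while pos < len(tokens) and tokens[pos] == "Option":    # predict Option statements first
        pos = parse_option(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "Imports":    # then predict Import statements
        pos = parse_import(tokens, pos)
    if pos != len(tokens):
        raise SyntaxError(f"expected 'Option' or 'Imports', found {tokens[pos]!r}")
    return "parsed a File"

def parse_option(tokens, pos):
    if pos + 1 >= len(tokens) or tokens[pos + 1] not in ("Strict", "Explicit"):
        raise SyntaxError("expected 'Strict' or 'Explicit' after 'Option'")
    return pos + 2                       # consume "Option" plus its modifier

def parse_import(tokens, pos):
    if pos + 1 >= len(tokens):
        raise SyntaxError("expected a name after 'Imports'")
    return pos + 2                       # consume "Imports" plus the name

print(parse_file(["Option", "Strict", "Imports", "System"]))   # parsed a File
```

Notice that when something goes wrong, the parser can say exactly what it was expecting at that point, which is the error-reporting advantage discussed next.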

An LL parser is nice because it fits with how we humans tend to think about parsing text. In fact, both the Visual Basic and C# parsers are handwritten LL parsers. LL parsers also have some nice properties for error recovery: since you always know what you’re expecting, it’s easy to tell the user what you were expecting to see. On the other hand, there are situations where prediction can fail you when trying to figure out what went wrong, usually when something is really out of place (say, a class declaration in the middle of a function declaration). This is because an LL parser only knows what it expects next; it doesn’t think about all the things that could occur somewhere else.

An LR parser is the reverse of an LL parser. Instead of starting with the syntax and trying to fit tokens into it, it starts with the tokens and tries to piece together something that fits into the syntax. So if we’re parsing a Visual Basic file with an LR parser, it would start by saying “OK, I found an ‘Option’ token. Does that fit anywhere in the grammar?” Since it hasn’t made a complete anything yet, it’d shift ‘Option’ onto a stack and then look at the next part of the input. “OK, I found a ‘Strict’ token. Does a ‘Strict’ token combined with an ‘Option’ token fit anywhere in the grammar?” Since there is an ‘Option Strict’ statement, it can shift ‘Strict’ onto the stack and then reduce both elements to an ‘Option Strict’ statement. Then it would ask “Does an ‘Option Strict’ statement fit anywhere in the grammar?” and so on.
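Here’s an equally toy sketch of that shift/reduce loop. Again, this is illustrative Python, not how MGrammar works internally: a real LR parser decides when to shift or reduce from generated state tables, whereas this hard-codes two made-up rules purely to show the stack mechanics.

```python
# A toy shift/reduce loop: shift tokens onto a stack, and reduce whenever
# the top of the stack matches the right-hand side of a rule. The rules
# here are invented for illustration.

RULES = [
    (("Option", "Strict"), "OptionStrictStatement"),
    (("OptionStrictStatement",), "File"),
]

def shift_reduce(tokens):
    stack, remaining = [], list(tokens)
    while remaining or stack != ["File"]:
        for rhs, lhs in RULES:
            if tuple(stack[-len(rhs):]) == rhs:              # top of stack matches a rule
                print(f"reduce {stack[-len(rhs):]} -> {lhs}")
                del stack[-len(rhs):]
                stack.append(lhs)
                break
        else:                                                # no reduction applied: shift
            if not remaining:
                raise SyntaxError(f"stuck with stack {stack}")
            print(f"shift {remaining[0]!r}")
            stack.append(remaining.pop(0))
    return stack[0]

print(shift_reduce(["Option", "Strict"]))    # prints the shifts and reduces, then "File"
```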

As you can see, an LR parser builds up the parse tree from the bottom: first it figures out the leaves and then tries to assemble those leaves into a coherent tree. An LR grammar is nice because it’s very easy to generate a set of tables that describe a machine that can process the grammar, so it’s easy to generate an efficient program to recognize the grammar. A downside of an LR parser is that when things go wrong you don’t have the context that an LL parser has: you don’t necessarily know what you’re expecting to come next, so you can have a harder time recovering from problems.

Another downside of an LR parser is that it may find ambiguities in the grammar that wouldn’t exist for an LL parser. For example, a token sequence might reduce into two different syntax productions that are totally distinct if you know what the context is. But since an LR parser doesn’t know what syntax production it’s looking for, the sequence is ambiguous to it. This means that coming up with a grammar for an LR parser can be difficult, because you get lots of conflicts that are not obvious to a human reading the grammar.

But all is not lost, because there’s an improvement on an LR parser, called a GLR parser, that resolves many of these ambiguities. The “G” stands for “Generalized,” and what a GLR parser does differently from a regular LR parser is allow for multiple threads of parsing. In other words, when you run into a conflict where, say, a set of tokens can reduce into two different productions, you simply explore both possibilities. The odds are that in most grammars the ambiguity will resolve itself pretty quickly (i.e., one of the threads will reach an error state), so you just run all the possibilities in parallel and wait for one of the threads to survive the others. This resolves most of the conflicts caused solely by the nature of bottom-up parsing.
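As a rough illustration of the “multiple threads” idea, the sketch below forks a separate (stack, remaining input) pair whenever more than one action is possible and simply discards threads that dead-end. The two-rule conflict and all of the names are invented for illustration, and this glosses over how a real GLR implementation shares stacks between threads rather than copying them.

```python
from collections import deque

# A deliberately ambiguous toy grammar: a lone "name" token could reduce
# to either TypeName or VariableName, a classic reduce/reduce conflict
# for a plain LR parser. The GLR idea, very roughly: fork on every
# possible action and let the wrong threads die off.
RULES = [
    (("name",), "TypeName"),
    (("name",), "VariableName"),
    (("TypeName", "VariableName"), "Declaration"),
]

def glr_like_parse(tokens, goal="Declaration"):
    threads = deque([((), tuple(tokens))])          # each thread: (stack, remaining input)
    while threads:
        stack, remaining = threads.popleft()
        if stack == (goal,) and not remaining:
            return True                             # one thread survived to a full parse
        for rhs, lhs in RULES:                      # fork a thread per applicable reduction
            if stack[-len(rhs):] == rhs:
                threads.append((stack[:-len(rhs)] + (lhs,), remaining))
        if remaining:                               # and one thread that shifts instead
            threads.append((stack + (remaining[0],), remaining[1:]))
    return False                                    # every thread hit an error state

print(glr_like_parse(["name", "name"]))   # True:  name name => TypeName VariableName => Declaration
print(glr_like_parse(["name"]))           # False: no complete Declaration is possible
```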

As you might have guessed by the way I’ve written this entry, MGrammar is a GLR parser generator. There are some tweaks to the algorithm that we make, which I can discuss later, but the core of what we do is GLR. I wasn’t actually part of the original design team, so I can’t comment on exactly what led us to choose GLR over, say, LL, but my thinking above probably touches on some of the reasons.

I’d also add that LL and LR are not the last word in parser technology. There are other approaches to parsing, such as Parsing Expression Grammars (PEGs) and the related implementation strategy, packrat parsing. And research continues today, so I’m sure there’ll be even more strategies in the future.

Lexing and Parsing

I want to talk in more detail about how the MGrammar parser works, but before I delve too deeply into that I wanted to talk a little about some basic parsing concepts so we can be sure we’re on the same page. This may be a bit basic for some, so just skip ahead if you already know all this.

A language grammar is actually made up of two different grammars: a lexical grammar and a syntactic grammar. Although a grammar may not distinguish between the two in its specification (as we’ll see later), every grammar has these two things contained within it. The lexical grammar translates characters (the raw material of language) into lexical units called tokens. The syntactic grammar then translates the tokens into syntactic structures called syntax trees.

Let’s take a look at this in practical terms. Let’s say you’ve got the following sentence:

Alice kissed Bob.

Fundamentally, this sentence is just a collection of characters: the upper-case letter A, the lower-case letter l, the lower-case letter i, and so on. Lexically analyzing this string takes these characters and shapes them into the basic tokens of English: punctuation and words. So a lexical analysis might return the following tokens: the word Alice, the word kissed, the word Bob, and a period.

One thing to notice here is what’s missing from the list of tokens: every character in the string “Alice kissed Bob.” is represented in the token list except for two, the spaces between the words. We’ve dropped the spaces out of the token list because spaces aren’t significant in English syntax. (As anyone who was taught to type two spaces after a period can attest.) In other words, you can have as many spaces as you want between tokens and they don’t mean anything except as delimiters between tokens. So “AlicekissedBob.” is not the same as “Alice kissed Bob.”, but “Alice   kissed   Bob.” (with extra spaces) is the same as “Alice kissed Bob.”, although it looks a little funny. Since spaces don’t matter once we’ve formed the tokens, we toss them out.

Once we’ve got the tokens, we can then proceed to do the syntactic analysis. In the case of English, a common syntax pattern is “<subject> <verb> <direct object>.” Looking at the sentence, we can pretty easily discern this pattern at work: the subject of the sentence is “Alice,” the verb is “kissed,” the direct object is “Bob,” and the sentence ends with a period. Since the verb links the subject and the direct object, we’d probably come up with a syntax tree that looks something like this:

[Image: syntax tree for “Alice kissed Bob.”, with the verb linking the subject and the direct object]
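Here’s a minimal sketch of both phases in Python: a lexical pass that turns the characters into word and punctuation tokens (throwing the spaces away), followed by a syntactic pass that matches the “<subject> <verb> <direct object>.” pattern and puts the verb at the root. The token names and the tuple-shaped “tree” are just for illustration.

```python
import re

# Phase 1: a lexical grammar as a regular expression. Words and the
# period become tokens; runs of whitespace match the unnamed branch and
# are dropped, which is the "toss out the spaces" step described above.
TOKEN_PATTERN = re.compile(r"\s+|(?P<Word>[A-Za-z]+)|(?P<Period>\.)")

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character {text[pos]!r}")
        if match.lastgroup:                          # lastgroup is None for whitespace
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

# Phase 2: a (very small) syntactic grammar over those tokens:
#   Sentence ::= Word Word Word "."        (subject, verb, direct object)
def parse_sentence(tokens):
    if [kind for kind, _ in tokens] != ["Word", "Word", "Word", "Period"]:
        raise SyntaxError("expected <subject> <verb> <direct object>.")
    subject, verb, obj = (text for _, text in tokens[:3])
    return (verb, subject, obj)                      # verb at the root, linking the other two

tokens = lex("Alice   kissed Bob.")                  # the extra spaces make no difference
print(tokens)          # [('Word', 'Alice'), ('Word', 'kissed'), ('Word', 'Bob'), ('Period', '.')]
print(parse_sentence(tokens))                        # ('kissed', 'Alice', 'Bob')
```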

As I said before, some languages make a bigger deal of the lexical vs. syntactic distinction than others. For example, if you go and read the XML spec, you’ll see that it doesn’t really call out what’s a token and what’s syntax. On the other hand, the Visual Basic language specification is extremely explicit about the distinction between tokens and syntax, even going so far as to have two separate grammars. How you think about tokens vs. syntax has big implications for the implementation of your parser, which I’ll talk about in my next entry.

DSLs: Definitely a bad idea!

Last Friday Rocky posted an entry on his weblog entitled “DSLs – fun, cool, but maybe a bad idea?” and my reaction was:

Maybe a bad idea? Maybe a bad idea? Of course they’re a bad idea!

This may seem strange coming from someone who’s working on the heart of Oslo/M’s DSL capabilities, but stick with me here.

To begin with, writing a new language, domain-specific or not, is a lot like opening a restaurant: everyone thinks they can do it because they’ve eaten at one and it looks like fun. But the truth is that something like 50% of restaurants fail in the first year, and probably 90% fail over the long haul. Just keeping a restaurant going is incredibly difficult and time consuming; doing that while actually coming up with something people want to eat is nearly impossible. I’m amazed new restaurants open up at all, given the odds.

Languages are the same way. It’s mind-bendingly, stupendously difficult to build a new language, even one that’s largely based on an existing one. Yet many programmers think, “hey, I use languages, how hard can this be?” and go at it. Like restaurants, probably over 98% of them fail to ever gain any traction at all, but god bless the optimists because without them we’d never get the 2% of languages that succeed. I’m personally willing to sacrifice the millions of dollars and hours wasted on languages that never make it just so that we can get languages like C# and Java and Ruby and Python and so on. (One language omitted to preserve at least the appearance of humility.)

So the fact that coming up with a new language is a bad idea shouldn’t dissuade people from developing new DSLs, it should just give them pause and hopefully a little humility. The key, I think, is to start small and to stay small. A domain-specific language needs to stay, well, domain-specific and not try and branch out to become a general purpose programming language. The problem is that it’s really, really hard to resist the temptation to do that. You just have to keep in mind that all the really successful DSLs have remained little languages, and repeat that to yourself every day. If you do that, you’ll be fine. Or, put another way:

Remember kids, DSLs don’t kill people, people kill people.

I am not a theorist…

As I’m sure regular followers of my blog are aware, I have a weakness for disclaimers. As I’ve started thinking about how I want to talk in more detail about how the MGrammar parser generator works, I felt compelled to make yet another disclaimer: I am not a computer science theorist.

The field of parser generation is an old and storied one (at least, relative to a lot of other things having to do with computers) and there’s a lot of theory that’s built up around it. I simply ask my readers to keep in mind that I am but a humble practitioner, same as they are, and I make no great claims to deep insight or great knowledge of the full theory of parsers. I cannot hope to even touch the hem of the garment of those who have gone before me, particularly these guys:

[Image: cover of the “Dragon book” (Compilers: Principles, Techniques, and Tools)]

So if you really want to understand the difference between an LL(k) and an LALR(1) parser, go buy the book. Otherwise, you’re stuck with my simplistic explanations.

MGrammar and Rescuing the Princess

One of the major reasons that I decided to come to work on the Oslo team was the experience I had with what was going to become MGrammar. I was interested in prototyping some language and compiler design ideas, and I knew that the Oslo team had some technologies that might help me out, specifically a parser generator. They helpfully pointed me to their source code; I enlisted, built, and started to play around. I’d been building some parser technology by hand, but I quickly discarded it once I started playing with MGrammar in Intellipad. In addition to giving me a bunch of things that I needed, it was also just really… fun. I mean, it was cool to bring up a buffer, start hammering on a syntax, and then put in an example and see the result pop up. Or, as was the case most of the time, not pop up. Tweak, tweak, tweak and then… boom! The result shows up.

This storyline is one that I’ve heard a number of times internally and even seen externally a time or two. People just have a great time playing around with grammars and languages and seeing what they can get to come out. There’s frustration, sure, but then there’s also a big payoff when you finally see it working: look what I created!

I hadn’t really thought a lot about it until I came across a presentation that Daniel Cook (author of the Lost Garden blog) gave at an OfficeLabs meeting entitled “Mixing Games and Applications.” He says it a lot better than I can, but I intuitively think that I (and others) enjoy fooling around with MGrammar in Intellipad precisely because it taps into the game-playing part of our brain. I’m not sure how to capitalize on that further, but it’s interesting food for thought.

Answering some questions about MGrammar….

As some readers have noticed, I’ve been conspicuously silent since I moved over to the Oslo team. Some of this had to do with getting onto a new blogging engine, some of it had to do with various distractions like jury duty, but a lot of it had to do with the fact that Visual Basic 10.0 has gotten to a pretty stable state (so there isn’t so much for me to say any more) and I am just getting off the ground with Oslo. Not having much useful to say, I figured I’d be better off saying nothing at all rather than continuing to blather on and on about nothing in particular (that’s what Twitter is for).

Part of the process of moving over to Oslo was figuring out exactly what my role was going to be. There’s lots of cool stuff going on and a lot of smart people with strong ideas about things, so I knew it would take a while for things to settle out. Things will still probably evolve over time, but the area I’ve spent the most time on, and the thing I’ll have the most influence on for the time being, is MGrammar. It’s been an interesting experience because I’ve had to relearn a bunch of the theory around parser generation that I had forgotten in my time in VB (since we had a hand-built recursive descent parser). But it’s been enjoyable, and I think there are a lot of neat things we can do with it. So stay tuned on that.

Someone forwarded me Kevin Moore’s 20 questions/thoughts around MGrammar, and I thought answering some of those questions might be a good entrée into blogging about my new job. I won’t address all of his questions, but the interesting ones, in my opinion, were:

Could C#/VB implement their compilers using MGrammar? Should they? Would they?

I believe that, as the languages stand today, C# could implement its parser using MGrammar but VB couldn’t. The reason VB couldn’t is a pretty technical issue relating to the embedding of XML into VB, which we may end up solving as well; and given the hacks that we (VB) had to do to the hand-written parser to get it to work, I don’t think that reflects badly on MGrammar (or VB, for that matter). The bottom line is that the parser generator is certainly powerful enough to handle an industrial-strength language. Whether the C# or VB teams should or will do anything with MGrammar is really a question for them. Both languages have hand-built parsers, which allow an extraordinary degree of control over the behavior of a parse and, in particular, over error messages and error recovery. They may be loath to give that up, but we’ll see.

Where is the MGrammar for C#? …for VB? …for CSV? …for VS solution files? …for XML?

In general, the Oslo team is just one little team and can only do so much. Although we endeavor to provide you with the maximum number of samples, in truth we are limited in what we can do at any one time. You should expect to see more samples as time goes on. I’ll also note, though, that some of those languages do present interesting challenges to a parser generator. I’ve already covered VB, but XML also has challenges, which I think is really a subject for another entry.

There should be standardized or de facto escape mechanisms for existing grammars.

If I understand the point here correctly, what’s really being asked for is composability of grammars. This is a difficult, but interesting, area of work. To use an example I’ve already referenced, there were a lot of challenges integrating XML into VB because the languages were very different, even down to the lexical level. Having languages compose in some kind of simple way (even to do things like add or override parts of an existing grammar) is an open question that we’re very interested in but are still working on.

Will there be tools to go back from a logical model (MSchema) to the input format?

This is on our radar, but we don’t have an answer at the moment. It’s a very natural and logical thing to do, but (of course) running a grammar in reverse is not a trivial matter. I’m not sure we can get to a point where we can reverse a parse automatically, but we do want to provide people with a way to do this in as simple a manner as possible.

Where is the community site for MGrammar?

Well, we’ve got the Oslo Developer Center and the Oslo forums. If you want to start a community site, by all means go ahead!

Any news with Mono? Is Miguel interested?

I think you’d have to ask Miguel that.

Are you making a play to make MGrammar a standard? I’ve heard yes, but just wanted to confirm.

What we’ve committed to is releasing the MGrammar specification (well, actually, the whole M language specification) under the Open Specification Promise. That’s the only commitment we’ve made so far regarding MGrammar’s specification, but it’s a pretty big one.

How is the computer science behind MGrammar?

Pretty good, and I expect it to get even better. Expect to hear more here and elsewhere about internal specifics as time goes on (it’s a little more than can be covered in one blog post).

I hope that answers some questions, and feel free to ask more questions here or on the forums. Thanks!