Monthly Archives: February 2009

Lexing and Parsing

I want to talk in more detail about how the MGrammar parser works, but before I delve too deeply in to that I wanted to talk a little bit about some basic parsing concepts so we can be sure we’re on the same page. This may be a bit basic for some, so you can just skip ahead if you already know all this.

A language grammar is actually made up of two different grammars: a lexical grammar and a syntatic grammar. Although a grammar may not distinguish between the two in its specification (as we’ll see later), every grammar has these two things contained within it. The lexical grammar translates characters (the raw material of language) into lexical units called tokens. The syntatic grammar then translates the tokens into syntatic structures called syntax trees.

Let’s take a look at this in practical terms. Let’s say you’ve got the following sentence:

Alice kissed Bob.

Fundamentally, this sentence is just a collection of characters: the upper-case letter A, the lower-case letter l, the lower-case letter i, and so on. Lexically analyzing this string takes these characters and shape them into the basic tokens of English: punctuation and words. So a lexical analysis might return the following tokens: the word Alice, the word kissed, the word Bob, and a period.

One thing to notice here is what’s missing in the list of tokens-every character in the string “Alice kissed Bob.” is represented in the token list except for two characters. We’ve dropped the spaces between the words out of the token list, and we’ve done this because spaces aren’t significant in English syntax. (As anyone who was taught to type two spaces after a period can attest.) In other words, you can have as many spaces as you want between tokens and they don’t mean anything except as delimiters between tokens. In other words, “AlicekissedBob.” is not the same as “Alice kissed Bob.” but “Alice kissed Bob.” is the same as “Alice kissed Bob.” (Although it looks a little funny.) Since spaces don’t matter once we’ve formed the tokens, we toss them out.

Once we’ve got the tokens, we can then proceed to do the syntatic analysis. In the case of English, a common syntax pattern is “<subject> <verb> <direct object>.” Looking at the sentence, we can pretty easily discern this pattern at work: the subject of the sentence is “Alice,” the verb is “kissed,” the direct object is “Bob,” and the sentence ends with a period. Since the verb links the subject and the direct object, we’d probably come up with a syntax tree that looks something like this:

image

As I said before, some languages make a bigger deal of lexical vs. syntatic grammars than others. For example, if you go and read the XML spec, you’ll see that they don’t really call out what’s a token and what’s syntax. On the other hand, the Visual Basic language specification is extremely explicit about the distinction between tokens and syntax, even going so far as to have two separate grammars. How you think about tokens vs. syntax has big implications for the implementation of your parser, which I’ll talk about in my next entry.