Mona Compiler Development, part 1: lexical analysis

As I mentioned in my introductory post, the first part of monac (the Mona compiler) I've built is concerned with lexical analysis. In this post I'll explain what lexical analysis is, what part it plays in the overall compilation process, and how I've gone about implementing it in monac (which is written in Scala). Since my primary motivation for the entire Mona project is educational, I decided to build my own lexer (lexical analyser) from scratch. The main body of work involved creating my own regular expression engine, which is at the heart of the lexer, along with the scanning mechanism (which reads individual characters and handles whitespace and comments) and a caching mechanism for some of the data structures.

Lexical analysis in general

The primary task in lexical analysis is the transformation of source code in textual form (a sequence of characters) into tokens - a series of abstract units of information about a program (similar to words in a sentence). In order to achieve this, the lexer first has to recognize individual lexemes - the character sequences which form the tokens - and then process those characters further to obtain the tokens. The main benefit of implementing a separate lexical analysis step is that later phases of the compiler can work directly with lexical units, not just plain characters - units which have already gone through some form of classification and carry embedded information. Here's a trivial example of the process (taken from the Dragon Book, pg. 6):

  • some code: position = initial + rate * 60
  • tokens: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>

The line position = initial + rate * 60 is a line of source code that serves as the input to the lexical analyser, and the output tokens are written between < and > signs. The tokens are classified into three categories: identifiers (written as <id, [number]>), purely syntactic elements, and constants (the latter two written as <[text]>). The above example is simply a visualization - in practice the tokens are objects, not sequences of characters, in order to make them machine-readable.

Identifier tokens (e.g. <id, 1>) refer to named entities in the program - variables, classes, modules, etc. In this example, it is expected that position, initial, and rate are variables which store some numeric state of the program. Purely syntactic elements - the equals, plus, and times signs - are inherited from the necessary syntactic structure of the language, in this specific case from the rule which states that assignment statements ought to have an equals sign delimiting the left from the right hand side. Constants are the simplest class of tokens as their lexemes convey all of the necessary information about them, as opposed to identifiers which rely on context (i.e. depending on where a lexeme is found, it may refer to different entities).
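
To make that concrete, here's a tiny sketch (in Scala, since that's what monac is written in) of what such token objects might look like - the names and shapes are purely illustrative, not taken from any real compiler:

// A hypothetical sketch of tokens as objects, mirroring the example above.
sealed trait Token
case class Id(symbolTableEntry: Int) extends Token // <id, 1>, <id, 2>, ...
case class Op(symbol: Char)          extends Token // <=>, <+>, <*>
case class Const(value: Int)         extends Token // <60>

// position = initial + rate * 60, tokenized:
val tokenized: List[Token] =
  List(Id(1), Op('='), Id(2), Op('+'), Id(3), Op('*'), Const(60))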

The tokens output by the lexer are then fed into the parser, the component of the compiler which performs syntactic analysis (which is the phase that I'm currently working on).

Implementation in monac

Overall organization of the lexer

In monac, the lexer is essentially a black box which accepts an input buffer (e.g. a sequence of characters from a file), and provides a Scala Stream of tokens as the output. This design decision was driven by the fact that the lexer is merely called by the next compilation phase, and relying on an abstract stream-based interface allows the lexer to remain flexible for under-the-hood changes.
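
As a rough sketch of that contract (the names here are illustrative, not monac's actual API), the idea is simply: characters in, a lazily evaluated Stream of tokens out:

// Minimal sketch of the black-box contract: the next phase pulls
// tokens only as it needs them, and the lexer consumes input lazily.
trait Token

class Lexer(input: Iterator[Char], nextToken: Iterator[Char] => Option[Token]) {
  def tokens: Stream[Token] = nextToken(input) match {
    case Some(tok) => tok #:: tokens // the tail is evaluated on demand
    case None      => Stream.empty   // input exhausted
  }
}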

The lexemes are recognized by FSAs (finite state automata, essentially graphs which represent a process), and are classified into several different token classes: identifiers, character literals, string literals, various numerals, and purely syntactic elements (each of which has a class of its own). There are two types of identifiers: alphanumeric (regular, e.g. ListType) and those consisting of special characters (e.g. /=, the inequality operator). Additionally, some lexemes which would otherwise fall under special identifiers are actually purely syntactic elements (e.g. { and ;).

The FSAs are constructed from regular expressions, in two steps. Firstly, the regular expression string, written by the programmer, is itself analysed and converted into a machine-readable structure, and secondly, this structure is used to create a transition diagram which forms the core of the finite state automaton (this is explained in more detail later).

Once a lexeme is recognized by an FSA, the respective constructor function is called to produce a token, which is then fed into the token stream. Virtual tokens - those without an associated lexeme - can be inserted into the stream as well; at the moment this only happens for statement breaks at the end of lines (explicit semicolons as statement breaks are optional in Mona).
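
Here's an illustrative sketch of that pairing, including a virtual token - again, the names and shapes are assumptions made for the example, not monac's real types:

// Sketch of pairing recognizers with token constructor functions.
sealed trait Token
case class Identifier(name: String) extends Token
case class IntLiteral(value: Int)   extends Token
case object StatementBreak          extends Token // virtual: inserted at line ends

// Each recognizer (reduced here to a predicate over the whole lexeme)
// is mapped to the constructor called once its lexeme is accepted.
val constructors: List[(String => Boolean, String => Token)] = List(
  (lexeme => lexeme.nonEmpty && lexeme.forall(_.isDigit), lexeme => IntLiteral(lexeme.toInt)),
  (lexeme => lexeme.headOption.exists(_.isLetter),        lexeme => Identifier(lexeme))
)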

Scanning mechanism

The scanning mechanism is quite simple - for each consecutive lexeme, characters from the buffer are fed into a set of recognizers (finite state automata created from regular expressions), which are mapped to constructor functions of the relevant token classes. The recognizers can be in three overall states - continuing, accepting, and broken - and they are reset at the beginning of each new lexeme (overall states are what the outside world sees; internally the FSAs also contain more specific numeric states). The overall state is determined by the sequence of characters read so far for the current lexeme: continuing means that the recognizer is still accepting characters, accepting means that the lexeme has been accepted and the constructor function is to be called, while broken means that the FSA cannot recognize the character sequence making up the potential lexeme.

More specifically, as each character is fed into the set of recognizers, the transition diagram (a graph) specific to each recognizer (or FSA, or regex) is used to determine the next overall state. An example transition diagram, the one for regular alphanumeric identifiers, looks like this:

L(L|D)*
Cat(Letter,Kleene(Union(Letter,Digit)))
N L N N 
N N L D 
N N L D 
N N L D
final states: List(1, 2, 3)

While a transition from the current internal state is possible over the fed character, the overall state is continuing. Once a character arrives over which no transition exists, the overall state changes to either accepting or broken, depending on whether the current internal state is in the set of final states (if it is, the lexeme is said to be accepted). For example, if the input characters form the sequence "ListType " (note the trailing space), the above FSA for identifiers ends up in the accepting overall state - the transition over the final e leads it into a final state, and there is no transition over " " (space).
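
To make the overall states and the ListType example more concrete, here's a small self-contained sketch; the transition function is a hand-written simplification of the identifier FSA (two internal states instead of the four generated above):

// Simplified hand-written transition function for the identifier FSA:
// state 0 is the start, state 1 is the single final state.
def step(state: Int, c: Char): Option[Int] = (state, c) match {
  case (0, ch) if ch.isLetter               => Some(1) // first character: a letter
  case (1, ch) if ch.isLetter || ch.isDigit => Some(1) // further letters/digits
  case _                                    => None    // no transition exists
}

val finalStates = Set(1)

// Feed the characters one by one and report the overall state.
def overallState(chars: String): String = {
  var state = 0
  var i = 0
  while (i < chars.length) {
    step(state, chars(i)) match {
      case Some(next) => state = next; i += 1
      case None       => return if (finalStates(state)) "accepting" else "broken"
    }
  }
  "continuing"
}

println(overallState("ListType ")) // accepting - stopped at the space in a final state
println(overallState("ListType"))  // continuing - more characters could still follow
println(overallState("9abc"))      // broken - no transition over '9' from the start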

The comment and whitespace handling in the scanning mechanism is not really worth explaining, but the interested can examine it in the getNextToken function of the lexer. What is interesting is that Mona supports nested multi-line comments, a feature implemented through a simple counting/stack-like mechanism (also evident in the linked function).
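
Still, here's a rough sketch of the counting idea; the /* and */ delimiters are an assumption made purely for the example, and the real logic lives in getNextToken:

// Sketch of skipping a nested multi-line comment by tracking depth.
// Assumed to be called right after the opening delimiter was consumed.
def skipNestedComment(input: BufferedIterator[Char]): Unit = {
  var depth = 1
  while (depth > 0 && input.hasNext) {
    input.next() match {
      case '/' if input.hasNext && input.head == '*' => input.next(); depth += 1
      case '*' if input.hasNext && input.head == '/' => input.next(); depth -= 1
      case _ => // ordinary character inside the comment, ignore it
    }
  }
}

// e.g. skipNestedComment("a /* nested */ b */ rest".iterator.buffered)
// consumes everything up to and including the final */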

Regex preprocessing

As mentioned above, the regular expressions which define Mona lexemes are first converted into machine-readable structures that look like this when printed:

  • input expression: L(L|D)*
  • output structure: Cat(Letter, Kleene(Union(Letter, Digit)))

This particular input expression is used for recognizing regular alphanumeric identifiers - it's a concatenation of a letter and any number of letters or digits after it. In the structure, concatenation is represented by the Cat object (with the two parts to be concatenated as its arguments, in order), acceptance of letters by Letter, arbitrary repetition by Kleene (with the part to be repeated as its only argument), and the choice / or operation by Union (with the choices as its two arguments, where the order doesn't matter).

The structure has the same meaning as the input string, but after this process it is in a machine-readable form rather than a human-readable one - much like compilation as a whole. The process of obtaining this structure can be examined in this source file, and the algorithm can be expressed in the following few steps (a small sketch of the resulting node types follows the list):

  1. Insert implicit concatenation operations
  2. Convert expression to postfix
  3. Sequentially evaluate the regular expression and create nodes for the structure
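
As promised, here's a small sketch of the resulting node types as a Scala ADT - the names mirror the printed output, but the exact definitions are assumptions rather than monac's real classes:

// Regex structure as an algebraic data type (illustrative definitions).
sealed trait Node
case object Letter extends Node                         // character class L
case object Digit  extends Node                         // character class D
case class  Lit(c: Char)                   extends Node // one concrete character
case class  Cat(left: Node, right: Node)   extends Node // concatenation
case class  Union(left: Node, right: Node) extends Node // choice; order irrelevant
case class  Kleene(inner: Node)            extends Node // zero or more repetitions

// L(L|D)* after preprocessing:
val identifierRegex: Node = Cat(Letter, Kleene(Union(Letter, Digit)))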

Afterwards, the regex structure is used to construct a transition diagram for the FSAs (next section). The structure has the form of a tree, which is evident in the output of the regex preprocessor: Cat(Letter, Kleene(Union(Letter, Digit))), and can be visualized like this:

  • Image 1 - a visualization of a parsed regular expression

Letter and Digit are special regex nodes which specify classes of characters. There are two more classes in addition to those: Special and Unicode, used for special ASCII characters (excluding newline) and for Unicode characters (like →), respectively. In the actual string of the expression, classes have letters assigned to them, and as shown in the expression L(L|D)* above, L does not match the actual upper case letter L, but stands for the character class Letter. The regex engine doesn't handle escaping with backslashes, so at the moment these characters are simply reserved and cannot be used for anything else, but this is a low-priority problem since there is little need to match concrete letters.
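
A possible shape for the class membership checks could look like this (the letters S and U, and the exact membership rules, are my assumptions for the sketch):

// Assumed mapping from class letters to character predicates.
def matchesClass(classLetter: Char, c: Char): Boolean = classLetter match {
  case 'L' => c.isLetter
  case 'D' => c.isDigit
  case 'S' => c < 128 && !c.isLetterOrDigit && c != '\n' // special ASCII, no newline
  case 'U' => c >= 128                                   // Unicode characters like →
  case _   => false
}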

Finite state automata (transition diagram) algorithms

After the regular expression has been analysed and converted into a machine-readable form, two algorithms are applied to that structure, in sequence, to obtain the required FSA - Thompson's construction algorithm and the subset construction algorithm - both of which are explained below.

Thompson's construction algorithm

This algorithm recursively evaluates the regular expression nodes in order to create the transition diagram. In monac it is implemented as a composition of an evaluation function (code), which creates the appropriate graph structure depending on the node type being evaluated and recursively evaluates the subnodes, and a merge function (code), which merges the recursively evaluated pieces into a single graph structure.

Here is a useful visualization of the evaluation function in the case of a Union node:

  • Image 2 - visualization of the evaluation function in the case of a union operation (source)

The above image is obtained from an expression in the form s|q, where s and q are some yet-to-be-evaluated subexpressions, and N is the evaluation function. In memory this structure is stored in the form of an adjacency matrix.
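
To make the eval snippet below easier to follow, here's a hedged sketch of what a matrix-backed TransitionDiagram could look like - monac's actual class is richer, and the eta-transition marker in particular is just a placeholder:

// Sketch of a transition diagram backed by an adjacency matrix.
class TransitionDiagram(val size: Int) {
  // matrix(from)(to) holds the character of the transition, if any
  private val matrix = Array.fill[Option[Char]](size, size)(None)

  def addTransition(from: Int, to: Int, over: Char): Unit =
    matrix(from)(to) = Some(over)
}

object TransitionDiagram {
  // placeholder marker for transitions over the empty character
  val EtaTransition: Char = 'ε'
  def apply(size: Int): TransitionDiagram = new TransitionDiagram(size)
}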

It might also be useful to show a code sample of the evaluation function from monac, because I consider it to be quite readable and informative (you can follow the above link and examine the entire function):

private def eval(node: Node): TransitionDiagram = node match {  
  case Lit(lit) => {
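    // 0 is begin, 1 is end, connected by a single transition over lit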
    val result = TransitionDiagram(2)
    result.addTransition(0, 1, lit)
    result
  }

  case Union(left, right) => {
    val leftDiag = eval(left)
    val rightDiag = eval(right)

    // 0 is begin, 1 and 2 are left and right, 3 is end
    val result = TransitionDiagram(4)
    result.addTransition(0, 1, TransitionDiagram.EtaTransition)
    result.addTransition(0, 2, TransitionDiagram.EtaTransition)
    result.addTransition(1, 3, TransitionDiagram.EtaTransition)
    result.addTransition(2, 3, TransitionDiagram.EtaTransition)

    // merge
    expandState(result, 1, leftDiag)
    expandState(result, 2, rightDiag)
    result
  } 
  // ...
}

The above code only demonstrates the cases for Lit and Union nodes, which are used for matching a single specific letter ('Lit' as in literal) and, of course, the union operation, respectively. The merge function (expandState) is more complicated and less conceptually important, so I won't go into the details of explaining it. All that is left of Thompson's construction algorithm are the other cases - concatenation, the Kleene star operation (repetition), and the specific character classes.

The result of this step is an NFA - a nondeterministic finite state automaton. Such an FSA is contained in a transition diagram, but as its name implies, it does not have a determined transition for every state and character combination, which means two things: firstly, that there are eta-transitions, i.e. transitions over the empty character (in the above code, TransitionDiagram.EtaTransition); and secondly, that there can be two different transitions over the same character from a single state. This is problematic since it's more complicated to design a recognizer when there are multiple paths in the transition diagram that need to be considered at the same moment. That is why the next algorithm will reduce this NFA to a DFA - a deterministic automaton.

Subset construction algorithm

As previously mentioned, this step of the construction process converts a nondeterministic finite state automaton to a deterministic one (a DFA). Conceptually it's rather simple - multiple internal NFA states will be mapped into a single DFA state (the subset), so that the subset contains all NFA states that can be reached over a single equivalent transition. For example, if NFA states 2 and 3 can be reached from the NFA state 1 with a transition over a, then 2 and 3 will be merged into a single DFA state, which is then reached over the single, deterministic transition over a. Additionally, internal NFA states reachable over the eta-transition (no transition character) from the current state are also added to the subset.
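
Conceptually, those two operations - the eta-closure and the move over a character - can be sketched like this; the NFA representation is an assumption made for the example, and the real implementation is the nfaToDfa function linked below:

object SubsetConstructionSketch {
  // NFA transitions: (state, Option[character]) -> reachable states;
  // None stands for an eta-transition (transition over the empty character).
  type Nfa = Map[(Int, Option[Char]), Set[Int]]

  // All states reachable from `states` using eta-transitions alone.
  def etaClosure(nfa: Nfa, states: Set[Int]): Set[Int] = {
    val next = states ++ states.flatMap(s => nfa.getOrElse((s, None), Set.empty[Int]))
    if (next == states) states else etaClosure(nfa, next)
  }

  // The DFA state (a subset of NFA states) reached over the character c,
  // e.g. with Map((1, Some('a')) -> Set(2, 3)), move(nfa, Set(1), 'a') == Set(2, 3).
  def move(nfa: Nfa, states: Set[Int], c: Char): Set[Int] =
    etaClosure(nfa, states.flatMap(s => nfa.getOrElse((s, Some(c)), Set.empty[Int])))
}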

I've learned the most about the subset construction algorithm from this PDF document (from Portland State University). My implementation can be examined more thoroughly on monac's GitHub (function nfaToDfa).

Another (optional) step could be added to this part of the lexer - minimization of the transition diagrams - but that isn't really a priority at the moment, as the current implementation works well enough.

Caching transition diagrams

The FSA construction process actually happens only once, prior to running the relevant compiler version. During normal operation the FSAs are loaded from disk.

More specifically, a database with all expressions and their FSAs is saved to an internal compiler file, and when the lexer requests an FSA, the one for the relevant expression is simply read from the database instead of the entire construction process happening each time.
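
Purely as an illustration of the lookup idea, here's an in-memory analogue (monac actually persists the database to a file, and none of the names below come from the real code):

import scala.collection.mutable

// In-memory analogue of the expression -> FSA database: construction
// runs only the first time an expression is requested.
class FsaCache[Fsa](construct: String => Fsa) {
  private val cache = mutable.Map.empty[String, Fsa]

  def fsaFor(expression: String): Fsa =
    cache.getOrElseUpdate(expression, construct(expression))
}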

What's next?

I'm currently working on the second major compiler phase - syntactic analysis. The main components involved are the parser, which is similar to the lexer except that instead of reading characters and recognizing lexemes it reads tokens from the lexer and recognizes patterns and structure in the code; the symbol table, which holds relevant information about identifiers; and accompanying models for various code structures.

Later I will deal with the Mona type system, both in theory and implementation - which falls under semantic analysis. My plan is to cover the parser, the symbol table, and the type system over the next four posts.