Writing an LR(1) Parser Generator

About twenty-five years ago, I wrote an LALR(1) compiler-compiler loosely based on the syntax in “Principles of Compiler Design” by Aho and Ullman.  I wrote that program using Kernighan and Ritchie’s C language.  A few years later, I tried the more ambitious effort of reprogramming the compiler-compiler in Stroustrup’s C++.   Unfortunately, it wasn’t a very “clean” version of C++.  Although C++ is a powerful OOP-type language, it’s hard to wring out the older C-type constructs.

I’ve since lost access to C compilers, so my code sat fallow for many years.  A few weeks ago, I decided to try converting it to Java.  One of my current goals is to write a C-to-Java converter, and it would be a great help to have a modifiable compiler-compiler at my disposal.  That conversion became quite a task (it took about 77 Java classes).

LR parsers are very powerful, but are difficult (if not impossible) to write by hand.  I wrote the original compiler-compiler on an MS Windows-based machine, so I didn’t have access to either YACC or LEX.  This meant I had to bootstrap the LR parser table.  To my amazement, the bootstrap was successful.

For the Java conversion, I still had the last table that my C++ compiler-compiler used to parse (along with the source productions and lex statements).  I was hoping to use this conversion effort as a model for my C-to-Java converter to follow.  Unfortunately, I made some design decisions during the C++ modification that didn’t make the Java conversion straightforward.  Computers back then didn’t have much memory, so I was writing various structures and objects out to files.  Obviously, I/O slows down processing speed, but it’s better than stopping completely with an out-of-memory error.  Modern computers don’t have those old memory restrictions, so I got rid of the old file I/O.  To make a long story shorter, the Java conversion is mostly done.  I’m now adding new features to make my future converter effort easier.  The complete syntax of the compiler-compiler (I called the program CCOM) is here.  The optional operator (?) is available for the regular expressions used by the lexer but not yet available for the production statements of CCOM.

The next step is to bullet-proof the program and add useful diagnostics.  LR parsers are great for identifying input strings that belong to a specific grammar but are poor at error-handling.  The fist occurrence of an error causes them to stop processing.  The trick, then, is to restart the scan and find as many errors as possible.  Thanks to ACM (Association for Computing Machinery), I have access to multiple papers describing LR parser error repair.  I plan to implement one or more of these algorithms in CCOM and report my success (or failure) here.

This entry was posted in Software. Bookmark the permalink.