WACC Compiler Project
Tiberiu Andrei Georgescu
PhD Student - Safe and Trusted AI at Imperial College London
& Semi-Book Review for "Engineering a Compiler"
This year was tough. On top of courses that took a long time to prepare for and understand, we had three big projects to work on:
- Pintos, engineering a big part of an operating system
- WACC, writing a compiler for a mock programming language from ground up
- DRP, designing for real people (a program/app/business plan)
Out of the three, WACC was my favourite in terms of learning experience. It was not the hardest (that was Pintos) or the most rewarding (DRP), but it was a great, well-rounded project. WACC helped me better understand how programming languages are structured as grammars, how they are converted into machine-readable instructions, and how compilers optimise code so that programs run as fast and as smoothly as possible.
While working on this project I did, of course, attend the lectures about compilers. However, what helped me the most was reinforcing what I had learnt by reading Engineering a Compiler.
Not only did it cover each lecture topic in more detail, it also presented more advanced and more efficient options for both the front end and the back end of a compiler.
In retrospect, there are many things I would now do differently when it comes to building the actual compiler. But when it comes to the learning, I do not think there could have been a better way. I had two main sources of input (the lectures and the book) and one way of practising what I had learnt (the actual WACC compiler). On top of that, like all students, we had support from PhD students working on compilers, who cleared up misunderstandings and suggested additional design techniques. One of them recommended that we read a research paper for one part of the back end.
Having all of these sources of information made Compilers my favourite course this year (with Networking a close second). I felt that I understood things more easily and quickly than in the other courses, and it was a very nice challenge: difficult, but manageable with all the resources on offer.
Now, onto the actual WACC Compiler:
The WACC Compiler
The WACC project was split into three parts:
- Front-end
- Back-end
- Extension
The first two parts make up the bulk of the project: a (more or less) basic compiler, able to convert any .wacc file into machine-readable, executable instructions by whatever means necessary. The last part was our choice: essentially anything we felt like doing to upgrade the WACC language specification or the compiler itself, with full creative control (in a previous year, one team wrote a fully-fledged operating system in WACC, using their own compiler to build it).
Choosing a programming language
Having no base code whatsoever, our team had to make one big choice and stick to it until the end of the project: the programming language. We had four candidates:
- Kotlin
One of our teammates proposed this language due to its growing popularity among our peers and its similarity to Java. We discarded it quite quickly, as we either had more experience with the other languages or found them better suited to building a compiler.
- C++
Having a fair amount of experience with C and with Unreal Engine's C++, I finally wanted to learn vanilla C++. It would also have been the fastest language we could have built the compiler in, second only to C (which would have been a nightmare to work in on such a big project on such a tight schedule). The main reason we opted against it was that we wanted to avoid platform compatibility issues, which would have cost us time to fix. Also, since none of us had experience with vanilla C++, we were afraid the learning curve would slow us down even more.
- Java
This was the language we all had experience with. Last year we had many projects in Java, spanning algorithms, data structures and concurrent programming (of which, as you may know, I am a great fan). It would have been our safest option. However, we chose against it in favour of:
- Scala
Scala was the second most popular language for the WACC project this year. It combines object-oriented programming with functional features like pattern matching and for-comprehensions, alongside some wonderful monad-like classes (Option and Try, for the Scala mains out there). Even though none of us had any prior experience with Scala, we chose to write the compiler in it. And oh boy, was it worth it in the end.
Wherever I can, I will explain why Scala made our work so much nicer when working on this compiler.
Front-End
There were three tasks that our compiler had to do for the first part:
- Lexical Analysis, by splitting the input into tokens
- Syntactic Analysis, by parsing the tokens and creating a structure to represent the input
- Semantic Analysis, by checking that the parsed program actually makes sense (types, scoping and so on)
For the Lexical and Syntactic Analysis we used ANTLR to generate the base syntax tree of the program. This was a somewhat misleading part of the project, as it turned out to be a bit too straightforward.
What do I mean by that? Well, the project spec gave us the grammar for WACC directly, and that grammar specification can be translated almost verbatim into the .g4 file that ANTLR uses to generate the classes for the base syntax tree. On the face of it, this meant we had finished two thirds of our task in a very short time.
However, that is where the misleading part comes in. We still had to do the Semantic Analysis, taking each production into consideration and checking that it made semantic sense.
Also, the syntax tree that ANTLR provides is... well, unnecessarily complex. It is bloated with information we would never use. What we had to do was reduce it to an AST (Abstract Syntax Tree) that kept much less information, all of it useful, and was easier to process.
The First Pickle of Many
The first pickle we ran into was that ANTLR does not generate Scala code, only Java. What saved us was that Scala compiles to JVM bytecode, just like Java, which makes the two interoperable.
That meant that whatever classes ANTLR generated in Java, we could reference directly from Scala. It felt like a hack, but if it works, why complain?
Then came the AST.
We wanted to make the AST as compact as possible (for efficiency), and each node from the base syntax tree had to have a direct translation. Since we were working in Scala, we also wanted these classes to play nicely with pattern matching.
Again, Scala saved us a lot of work. Adding a class for each AST node became as easy as copy-pasting one line. One line! We used Scala's case classes, which meant we could keep all the AST class definitions in one file. Within an hour, we had the AST defined.
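To give an idea of what that one-line-per-node style looks like, here is a hedged sketch with made-up node names (our real AST had many more nodes and carried extra information, such as types and source positions):

```scala
// Illustrative only: a handful of hypothetical WACC AST nodes as case classes.
sealed trait AstNode

sealed trait Expr extends AstNode
case class IntLiteral(value: Int)                         extends Expr
case class BoolLiteral(value: Boolean)                    extends Expr
case class Ident(name: String)                            extends Expr
case class BinaryOp(op: String, lhs: Expr, rhs: Expr)     extends Expr

sealed trait Stat extends AstNode
case object Skip                                          extends Stat
case class Assign(lhs: Ident, rhs: Expr)                  extends Stat
case class If(cond: Expr, thenStat: Stat, elseStat: Stat) extends Stat
case class While(cond: Expr, body: Stat)                  extends Stat
case class StatSeq(stats: List[Stat])                     extends Stat
```

The pay-off is pattern matching: a later pass can match on something like BinaryOp("+", IntLiteral(a), IntLiteral(b)) and rewrite it to IntLiteral(a + b), with exhaustiveness checking for free. That is the sort of thing that made later passes (like the constant-folding extension) pleasant to write.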
Semantic Analysis
To minimise traversals as much as possible, we chose to do both the conversion from the base syntax tree to the AST and the Semantic Analysis in a single pass.
For that, we used the visitor classes that ANTLR generates.
The Visitor can be used to "visit" an ANTLR-generated syntax tree and return...whatever we wanted, really.
We made our visitor return an AST node from each visit. Every node it visited was converted into an AST node, letting us progressively build the AST of the whole program from the leaves up to the root.
To make this visitor also perform the Semantic Analysis in the same pass, and to avoid carrying around the redundant information in the ANTLR syntax tree, we ran the semantic checks as the last step of the conversion, immediately before returning the newly created AST node.
And that's it. At the end, our visitor returned the .wacc program as an AST and, if anything was wrong, it threw an error explaining the issue to the user.
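As a rough illustration of the pattern (not our actual code: the class and rule names are assumptions, since ANTLR derives them from the .g4 grammar, and SymbolTable and SemanticError stand in for our real symbol-table and error-reporting machinery):

```scala
// Hypothetical sketch: extend the Java base visitor that ANTLR generates and
// return an AST node from every visit, running the semantic check just before
// the node is returned.
class AstBuilder(symbols: SymbolTable) extends WACCParserBaseVisitor[AstNode] {

  override def visitAssign(ctx: WACCParser.AssignContext): AstNode = {
    // Build the children first, bottom-up, by visiting their subtrees.
    val lhs = visit(ctx.ident()).asInstanceOf[Ident]
    val rhs = visit(ctx.expr()).asInstanceOf[Expr]

    // Semantic analysis as the final step of the conversion: the declared type
    // of the variable must match the type of the right-hand side.
    if (symbols.typeOf(lhs) != symbols.typeOf(rhs))
      throw new SemanticError(s"type mismatch in assignment to ${lhs.name}")

    Assign(lhs, rhs)
  }

  // ...one override like this per grammar rule...
}
```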
Back-End
There was only one task: Generate Machine Readable Code!
More specifically, we had to convert the code into ARM1176JZF-S assembly code.
This was, I think, both our best and our worst task. Our best, because we could finally show our prowess and our understanding of how a compiler is supposed to work. Our worst because, well, we kinda missed the deadline with the code generation... a bit.
What I mean is that we employed fairly complex optimisation techniques which took more time than we had anticipated, so we were not able to pass some of the tests by the deadline. I'll come back to those later.
Some groups chose to translate the AST directly into ARM11 code. This is an entirely valid approach, and it can offer faster compile times and simpler code management. However, when it comes to compile-time optimisation, working solely on an AST is harder, as it does not expose enough information about the program's control flow.
That is why people working on compilers run the code through multiple IRs (Intermediate Representations), which give much better answers to questions like which variables are used when, which parts of the code are reachable from a given point, and what is redundant or repetitive.
One of these IRs is the CFG (Control Flow Graph). It turns the AST into a directed (and possibly cyclic) graph, keeping the relevant information from the AST while exposing the paths execution can take: which ones repeat, and which ones are unused or unreachable.
Another IR is SSA Form, which I first learnt about in Engineering a Compiler. This is a CFG in which each variable is assigned exactly once. From it, we can work out the minimum number of registers needed so that live values never overlap, which reduces how often we have to spill onto the stack and makes the generated code faster and more memory efficient. Most production compilers (LLVM and GCC among them) use SSA Form as an IR.
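To make the "assigned only once" idea concrete, here is a tiny, hypothetical example (not WACC syntax, and not our actual instruction classes): the program x = 1; if (c) { x = x + 2 }; print(x), written with a fresh name for every definition and a phi-function at the join point.

```scala
// Each definition gets a fresh name; the phi selects whichever definition
// actually reached the join point at run time.
sealed trait SsaInstr
case class Const(dest: String, value: Int)              extends SsaInstr
case class Add(dest: String, lhs: String, rhs: String)  extends SsaInstr
case class Phi(dest: String, sources: List[String])     extends SsaInstr
case class Print(src: String)                           extends SsaInstr

val ssaVersion: List[SsaInstr] = List(
  Const("x1", 1),
  Const("t1", 2),
  Add("x2", "x1", "t1"),        // only executed on the branch where c holds
  Phi("x3", List("x1", "x2")),  // at the join: x1 if the branch was skipped, else x2
  Print("x3")
)
```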
SSA Form in One Go
Usually, to construct SSA Form you have to go through another IR and/or do extra analysis on the control flow graph, like computing dominance frontiers. However, we found out that you can go directly from the AST to SSA Form in a single pass. One of the PhD students overseeing us recommended the research paper Simple and Efficient Construction of SSA Form, and it was absolutely worth the read.
To do it all in one pass, we had to initialise the SSA variables lazily. It all comes down to knowing whether multiple paths can reach a node. If they can, we would create the variable's merge point with only the predecessors we had already traversed, and fill in the remaining paths as they were traversed or terminated.
For a better explanation of the algorithm, please refer to the example in section 2.2 of the paper.
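To give a flavour of it, here is a rough Scala sketch of the core of the construction described in that paper (by Braun et al.). The names and data structures here are ours/hypothetical, and the trivial-phi simplification from the paper is left out:

```scala
import scala.collection.mutable

// Minimal supporting types for the sketch.
class Block {
  var isSealed = false                                  // all predecessors known?
  val predecessors = mutable.ListBuffer.empty[Block]
}
trait Value
class Phi(val block: Block) extends Value {
  val operands = mutable.ListBuffer.empty[Value]
  def addOperand(v: Value): Unit = operands += v
}

class SsaBuilder {
  // Latest SSA value of each source variable, per basic block.
  private val currentDef = mutable.Map.empty[(String, Block), Value]
  // Phis created in blocks whose predecessors were not all known yet.
  private val incompletePhis = mutable.Map.empty[Block, mutable.Map[String, Phi]]

  def writeVariable(variable: String, block: Block, value: Value): Unit =
    currentDef((variable, block)) = value

  def readVariable(variable: String, block: Block): Value =
    currentDef.getOrElse((variable, block), readVariableRecursive(variable, block))

  private def readVariableRecursive(variable: String, block: Block): Value = {
    val value: Value =
      if (!block.isSealed) {
        // Not all paths into this block are known yet: create an "incomplete"
        // phi now and fill in its operands later, when the block is sealed.
        val phi = new Phi(block)
        val phisForBlock = incompletePhis.getOrElseUpdate(block, mutable.Map.empty)
        phisForBlock(variable) = phi
        phi
      } else if (block.predecessors.size == 1) {
        // Exactly one way in: no phi needed, just look in the predecessor.
        readVariable(variable, block.predecessors.head)
      } else {
        // Several known ways in: create a phi and fill it from each predecessor.
        val phi = new Phi(block)
        writeVariable(variable, block, phi)
        block.predecessors.foreach(p => phi.addOperand(readVariable(variable, p)))
        phi
      }
    writeVariable(variable, block, value)
    value
  }

  // Called once every predecessor of `block` has been added.
  def sealBlock(block: Block): Unit = {
    for ((variable, phi) <- incompletePhis.getOrElse(block, mutable.Map.empty))
      block.predecessors.foreach(p => phi.addOperand(readVariable(variable, p)))
    block.isSealed = true
  }
}
```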
It took us a while to work the algorithm out from the paper, but it worked! We actually managed to convert the AST into an SSA-form control flow graph.
The Great Pickle
We simply ran out of time for the register allocation algorithm. I do not know quite how, but it happened. So, unfortunately, the SSA form did not deliver the wonderful optimisation we had planned.
Still, the SSA construction itself was done. Register allocation would have been more work, but it is a known, implementable algorithm, and the project is still there, so we can always improve on it.
We had to rush the deconstruction of the SSA form and the translation from the graph to ARM11 code, which we had assumed would be quicker.
For this part, I am very happy to say that we used the Strategy pattern very effectively. After converting the SSA into a DSSA (Deconstructed SSA), we turned it into abstract instructions which can be formatted into a specific flavour of assembly.
What do I mean by that? Basically, an Instruction is a class. To write it out as an assembly instruction, we give it a Formatter of a certain type. That Formatter converts the instruction into ARM11, x86 or whatever flavour of assembly we need (since the abstract instructions translate quite easily into each).
This makes our compiler easy to retarget, so a .wacc program can be compiled for several kinds of processor: just create a new Formatter!
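A hedged sketch of the idea (the instruction set and the exact mnemonics here are simplified illustrations, not our real classes):

```scala
// Instructions are plain data; a Formatter strategy decides how they are
// printed for one particular assembly dialect.
sealed trait Instruction
case class Load(dest: String, addr: String)                extends Instruction
case class Store(src: String, addr: String)                extends Instruction
case class AddRegs(dest: String, lhs: String, rhs: String) extends Instruction

trait Formatter {
  def format(instr: Instruction): String
}

object Arm11Formatter extends Formatter {
  def format(instr: Instruction): String = instr match {
    case Load(d, a)       => s"LDR $d, [$a]"
    case Store(s, a)      => s"STR $s, [$a]"
    case AddRegs(d, l, r) => s"ADD $d, $l, $r"
  }
}

// Emitting a program is just mapping the chosen formatter over the instruction
// list; retargeting means writing one more Formatter, nothing else changes.
def emit(program: List[Instruction], formatter: Formatter): String =
  program.map(formatter.format).mkString("\n")
```

For example, emit(List(Load("r4", "sp"), AddRegs("r4", "r4", "r5")), Arm11Formatter) produces ARM11 text, and swapping in a different formatter retargets the same instruction list.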
All of these optimisations did make the work slow, though, and we were unable to pass all the tests for the back end in time for the deadline.
Even though we did not pass every test in time, we learnt a great deal. We literally turned a research paper into Scala code! And everything we did paved the way for the final task: the Extension.
Extension(s?)
And now, for the extension. Full creative control. What were we able to do?
Well, a lot!
Here's the full list of extensions:
- Constant Folding
- Libraries and Support for Imports
- Concurrent Compilation (ofc I'd do it)
- Function Overloading
- Unused Function Removal
- Peephole Optimisation
- Array Bound Analysis
- Vim Plugin (fun stuff)
- Passing all tests (I guess you could call that an extension)
With so many of them, I will not go into depth on each one. Let me just quickly go through Concurrent Compilation, Function Overloading and the Vim Plugin.
Concurrent Compilation
This one was, maybe, a little too easy. Another reason the back end took us so long was that we wanted to prepare for this part, making sure there would be no contention on shared resources wherever we could avoid it.
After that, Scala did most of the work. Every list processed with foreach or map could be processed in parallel, and every structure accessed from multiple places at the same time just had to be swapped for a thread-safe version. Making the compilation concurrent may have been the easiest part of the whole project.
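For a flavour of what that looks like (the function names and the label map below are hypothetical, not our actual code): with Scala's parallel collections, a sequential map over independent pieces of work becomes parallel by adding .par, and shared mutable structures get swapped for thread-safe ones such as a TrieMap.

```scala
// Note: .par is built in up to Scala 2.12; on 2.13+ it needs the
// scala-parallel-collections module and
//   import scala.collection.parallel.CollectionConverters._
import scala.collection.concurrent.TrieMap

// Hypothetical example: generate code for every function body independently.
val functions: List[String] = List("main", "fib", "printBoard")

// Shared, thread-safe map instead of a plain mutable.Map.
val labels = TrieMap.empty[String, String]

val compiled = functions.par.map { name =>
  labels.putIfAbsent(name, s"f_$name")   // safe to call from several threads
  s"@ compiled body of $name"            // stand-in for the real code generation
}
```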
Function Overloading
A funny quirk of the project is that vanilla WACC has no function overloading, so in the front end it had taken a conscious effort to reject overloaded functions.
Now that we had total creative control, we could finally add function overloading to WACC. So we had to dig up every part of the code that ensured no two functions could share a name, and remove it.
However, this was not enough. There was one more thing to take into consideration: the function's return type. If there were both a "char f()" and an "int f()", we would have to choose one or the other.
Fortunately, WACC also requires that every function returns something and that the result is assigned to a variable. Therefore, calling a function can only be done with this syntax:
<type> x = call <function>(<parameters>)
So, in order to choose the correct function to call, we built a simple type-casting check, which looks at the LHS (Left Hand Side) of the assignment and tries to cast the RHS (Right Hand Side) to it. Simple, yet effective in this situation.
This type casting can be reused if we ever want to add more flexible casting to WACC in the future.
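As a rough sketch of that resolution step (the signature class and the widening rule here are illustrative assumptions, not our exact code):

```scala
// Pick between overloads of `name` that match the argument types, using the
// type expected on the LHS of `<type> x = call name(...)` to break the tie.
case class FuncSig(name: String, returnType: String, paramTypes: List[String])

def canCastTo(from: String, to: String): Boolean =
  from == to || (from == "char" && to == "int")   // illustrative widening rule

def resolveCall(name: String,
                argTypes: List[String],
                expectedType: String,
                candidates: List[FuncSig]): Either[String, FuncSig] =
  candidates.filter(f => f.name == name && f.paramTypes == argTypes) match {
    case Nil => Left(s"no function '$name' matches argument types $argTypes")
    case overloads =>
      overloads.find(f => canCastTo(f.returnType, expectedType))
        .toRight(s"no overload of '$name' returns a value assignable to $expectedType")
  }
```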
VIM Plugin
Shout out to one of my teammates for this one, as she came up with both the idea and the implementation. Following a book I recently read myself, Learn Vimscript the Hard Way, she built a plugin that provides Vim syntax highlighting specifically for WACC.
Not only that, but it provides something us Vim users love: folding. We can automatically fold .wacc files and move through the code much faster.
With such a plugin, we could also add commands to compile a WACC project automatically, assuming the user also uses our compiler.
Other Optimisation Levels
Beyond that, we added more ways to trade longer compilation for faster execution. These can be chosen through flags on the command line:
- -O[n] selects a specific optimisation level for the program.
- -ssa outputs only an abbreviated SSA Form for the user to read (very useful to us during debugging sessions).
- -o <filename> sends the output of the compiled program to a file.
- -h gives a quick overview of all the flags we have set up for the compiler.
Final Thoughts
Overall, this was a very rewarding learning experience for me and my team. Not only did we learn how to build a fully fledged compiler from nothing, optimisations and all, but we have, or at least I have, found a programming language that merges the best of object-oriented programming with functional paradigms, making me a much more effective programmer.
If you have never built a compiler before, start now. It may not be the most powerful or useful program you will ever write, but it gives you a lot of insight into the programming language of your choice, into the assembly language of your processor and into the architecture of your computer. And it can also be fun, especially if you have no deadlines.
And read. Keep your mind busy and interested. Read about what you are doing, not just what you will be doing. I was able to do and learn so much about compilers because I read and learnt from multiple sources. Even if I missed a deadline, it does not mean I did not learn; proof of that is my 91.5% in Compilers this academic year.
So here it is. Hope you liked it! As always, keep yourself busy!
Tibi.