
My Custom RISC VM: Lessons from Building a Compiler in Java

Ever wondered what it takes to build a compiler from scratch? I dove deep, creating a custom RISC VM and a compiler in Java. Here are my key takeaways.


Alex Carter

A software engineer passionate about compilers, low-level systems, and the magic of VMs.


There's a certain magic to writing code in a high-level language like Java or Python. You type `int x = 10;`, and it just... works. But what's happening under the hood? How does that simple statement transform into something a processor can actually understand? I've always been fascinated by this gap, and I decided the best way to understand it was to bridge it myself.

So, I embarked on a project: building a custom compiler in Java for my own simple, RISC-based Virtual Machine. It was a challenging, sometimes frustrating, but ultimately incredibly rewarding journey. It wasn't about building the next GCC or LLVM, but about peeling back the layers of abstraction. Here are the most important lessons I learned along the way.

The Architecture: A Two-Part Harmony

First, I had to define the problem. The project had two distinct but deeply connected parts:

  1. The Virtual Machine (VM): This would be the target. I designed a simple, register-based architecture with a minimal RISC-like instruction set. Think instructions like `LOAD R1, [address]`, `STORE R2, [address]`, `ADD R1, R2, R3`, and conditional jumps like `JMPZ R1, address` (jump if register R1 is zero). This was the CPU my compiler would be talking to.
  2. The Compiler: This would be the translator, written in Java. Its job was to take a simple, C-like source language (which I creatively named "SimpleC") and convert it into a sequence of bytecode instructions that my VM could execute.

This separation was my first, and perhaps most crucial, design decision. It allowed me to work on the language parser and the machine execution logic independently.
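
To make the target concrete, here's a minimal sketch of the fetch-decode-execute loop the VM side boils down to. The opcode numbering, instruction encoding, and register-file size below are illustrative, not the project's exact layout.

```java
// Minimal sketch of a register-based VM dispatch loop (illustrative encoding).
public class SimpleVm {
    private final int[] registers = new int[8];   // R1-R8, modeled as indices 0-7
    private final int[] memory = new int[1024];   // flat data memory

    public void run(int[][] program) {
        int pc = 0; // program counter, indexes into the instruction array
        while (pc < program.length) {
            int[] ins = program[pc];
            switch (ins[0]) {
                case 0 -> { registers[ins[1]] = memory[ins[2]]; pc++; }   // LOAD  reg, [address]
                case 1 -> { memory[ins[2]] = registers[ins[1]]; pc++; }   // STORE reg, [address]
                case 2 -> {                                               // ADD   dest, src1, src2
                    registers[ins[1]] = registers[ins[2]] + registers[ins[3]];
                    pc++;
                }
                case 3 -> pc = (registers[ins[1]] == 0) ? ins[2] : pc + 1; // JMPZ reg, address
                default -> throw new IllegalStateException("Unknown opcode: " + ins[0]);
            }
        }
    }
}
```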

Lesson 1: Your Lexer and Parser Are Everything

Every compiler starts by reading source code. This is a two-step process: lexical analysis (lexing) and syntax analysis (parsing).

Lexing: From a Stream of Characters to a List of Tokens

The lexer scans the raw text and groups characters into meaningful chunks called tokens. For example, the line if (x > 10) becomes a sequence of tokens like:

[IF, LPAREN, IDENTIFIER("x"), GREATER_THAN, INTEGER(10), RPAREN]

My initial thought was to use a bunch of regular expressions. While tempting, that approach quickly becomes messy once you need to manage state (like being inside a string or a comment). I opted to write a simple state-machine-based lexer by hand. It was more verbose but gave me complete control and a deeper understanding of the process. The key takeaway here is that a clean, reliable stream of tokens is non-negotiable. Garbage in, garbage out.
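
For a sense of what that hand-written scanner looks like, here's a trimmed-down sketch in Java. The `Kind` enum and `Token` record are illustrative stand-ins for the real token classes, the explicit states are folded into character-class branches, and it only knows enough of SimpleC to handle the `if (x > 10)` example above.

```java
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    enum Kind { IF, LPAREN, RPAREN, IDENTIFIER, INTEGER, GREATER_THAN, EOF }
    record Token(Kind kind, String text) { }

    public static List<Token> lex(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; }
            else if (c == '(') { tokens.add(new Token(Kind.LPAREN, "(")); i++; }
            else if (c == ')') { tokens.add(new Token(Kind.RPAREN, ")")); i++; }
            else if (c == '>') { tokens.add(new Token(Kind.GREATER_THAN, ">")); i++; }
            else if (Character.isDigit(c)) {              // "in a number" state
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(new Token(Kind.INTEGER, src.substring(start, i)));
            } else if (Character.isLetter(c)) {           // "in an identifier/keyword" state
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                String word = src.substring(start, i);
                tokens.add(new Token(word.equals("if") ? Kind.IF : Kind.IDENTIFIER, word));
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        tokens.add(new Token(Kind.EOF, ""));
        return tokens;
    }
}
```

Calling `Lexer.lex("if (x > 10)")` yields the IF, LPAREN, IDENTIFIER, GREATER_THAN, INTEGER, RPAREN sequence shown earlier, plus a trailing EOF token.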

Parsing: From Tokens to an Abstract Syntax Tree (AST)


The parser takes the token stream and builds a tree-like data structure that represents the code's structure. This is the Abstract Syntax Tree (AST). It's where the grammatical hierarchy of the code is made explicit.

For example, `y = x + 5;` isn't a flat list of tokens anymore. It becomes a tree:

      Assignment
      /        \
  Identifier(y)   BinaryOperation(+)
                  /             \
            Identifier(x)       Number(5)

I wrote a recursive descent parser, which is a very natural way to do it. You have a function for each grammar rule (e.g., `parseExpression()`, `parseStatement()`). The AST is the central data structure for the entire compiler. The type checker, optimizer, and code generator all operate on it. Lesson learned: don't skimp on your AST design. A well-designed AST makes every subsequent step infinitely easier.
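
As a rough illustration (not the project's actual classes), here is how the AST nodes and one recursive-descent rule might look, trimmed down to assignments, identifiers, integer literals, and the `+` operator:

```java
import java.util.List;

// Illustrative AST node types: one small record per grammar construct.
interface Expr { }
record NumberLit(int value) implements Expr { }
record Identifier(String name) implements Expr { }
record BinaryOp(String op, Expr left, Expr right) implements Expr { }
record Assignment(String target, Expr value) implements Expr { }

class Parser {
    private final List<String> tokens; // pre-lexed tokens, e.g. ["y", "=", "x", "+", "5"]
    private int pos = 0;

    Parser(List<String> tokens) { this.tokens = tokens; }

    // assignment := IDENTIFIER '=' expression
    Assignment parseAssignment() {
        String target = consume();
        expect("=");
        return new Assignment(target, parseExpression());
    }

    // expression := primary ('+' primary)*
    Expr parseExpression() {
        Expr left = parsePrimary();
        while (pos < tokens.size() && tokens.get(pos).equals("+")) {
            consume(); // eat '+'
            left = new BinaryOp("+", left, parsePrimary());
        }
        return left;
    }

    // primary := INTEGER | IDENTIFIER
    Expr parsePrimary() {
        String t = consume();
        return t.chars().allMatch(Character::isDigit)
                ? new NumberLit(Integer.parseInt(t))
                : new Identifier(t);
    }

    private String consume() { return tokens.get(pos++); }
    private void expect(String s) {
        if (!consume().equals(s)) throw new IllegalStateException("expected " + s);
    }
}
```

`new Parser(List.of("y", "=", "x", "+", "5")).parseAssignment()` produces the same shape as the tree above: an `Assignment` whose value is a `BinaryOp` over an `Identifier` and a `NumberLit`.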

Lesson 2: Don't Skip the Intermediate Representation (IR)

My first naive attempt was to walk the AST and directly generate my custom assembly code. This was a disaster. An `if-else` statement required complex logic to manage jump labels, nested expressions became a tangled mess of register states, and the code was brittle and impossible to debug.

The solution was to introduce an Intermediate Representation (IR). An IR is a representation of the program that is simpler than the source code but more abstract than the target machine code. I chose a simple Three-Address Code (TAC) format, where each instruction has at most three operands, like result = op1 + op2.

The expression `y = x * 2 + 3;` would first be translated from the AST into this IR:

t1 = 2
t2 = x * t1
t3 = 3
t4 = t2 + t3
y = t4

This was a game-changer. It decoupled the language's complexity (the "front end") from the machine's complexity (the "back end"). It flattened complex expressions into a simple, linear sequence of instructions. This made the next step, code generation, far more manageable. It's also the stage where most serious compilers perform optimizations.
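
The lowering itself is a small recursive walk: lower each child, then emit one TAC instruction whose result lands in a fresh temporary. Here's a sketch that reuses the illustrative `Expr` node types from the parser sketch above:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative AST-to-TAC lowering; builds on the Expr records from the parser sketch.
class TacGenerator {
    private final List<String> code = new ArrayList<>();
    private int tempCounter = 0;

    // Returns the name (temporary, variable, or constant temp) that holds the result.
    String lower(Expr e) {
        if (e instanceof NumberLit n) {
            String t = freshTemp();
            code.add(t + " = " + n.value());       // materialize the constant
            return t;
        }
        if (e instanceof Identifier id) {
            return id.name();                       // variables are referenced by name
        }
        if (e instanceof BinaryOp b) {
            String left = lower(b.left());          // lower children first...
            String right = lower(b.right());
            String t = freshTemp();
            code.add(t + " = " + left + " " + b.op() + " " + right); // ...then emit one instruction
            return t;
        }
        throw new IllegalStateException("unhandled node: " + e);
    }

    private String freshTemp() { return "t" + (++tempCounter); }

    List<String> code() { return code; }
}
```

Feeding it a hand-built tree for `x * 2 + 3` reproduces the `t1` through `t4` lines above; a statement-level pass would then emit the final `y = t4`.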

Lesson 3: Code Generation Is a Glorified Puzzle

This is where the compiler finally starts talking in the VM's language. The goal is to translate the IR into the RISC assembly I designed. The main challenge here is register allocation.

My VM only had a handful of registers (R1-R8). My IR, however, assumed an infinite number of temporary variables (t1, t2, ...). The puzzle is to map those infinite temporaries onto my finite set of physical registers.

I implemented a simple linear-scan register allocation algorithm. It's not the most efficient approach, but it's a classic for a reason: it's relatively easy to understand and implement. The process (sketched in code after this list) involves:

  • Iterating through the IR instructions.
  • For each temporary variable, determining its "live range" (the instructions between its first definition and its last use).
  • Trying to assign a physical register to each temporary.
  • If you run out of registers, you have to "spill" a variable to memory (generate STORE instructions) and load it back later (with LOAD).
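
Here's a compressed Java sketch of that loop, assuming the live ranges have already been computed. It simplifies real linear scan in one way: when no register is free, it just marks the new interval as spilled instead of comparing end points to pick a better victim.

```java
import java.util.*;

class LinearScanAllocator {
    // A temporary's live range, from its defining instruction to its last use.
    record Interval(String temp, int start, int end) { }

    /** Maps each temporary to a register name, or to "SPILL" if none is free. */
    static Map<String, String> allocate(List<Interval> intervals, int numRegisters) {
        List<Interval> sorted = new ArrayList<>(intervals);
        sorted.sort(Comparator.comparingInt(Interval::start)); // process in order of first definition

        Deque<String> freeRegs = new ArrayDeque<>();
        for (int i = 1; i <= numRegisters; i++) freeRegs.push("R" + i);

        Map<String, String> assignment = new HashMap<>();
        List<Interval> active = new ArrayList<>(); // intervals currently holding a register

        for (Interval cur : sorted) {
            // Expire intervals that ended before this one starts, freeing their registers.
            active.removeIf(a -> {
                if (a.end() < cur.start()) {
                    freeRegs.push(assignment.get(a.temp()));
                    return true;
                }
                return false;
            });
            if (!freeRegs.isEmpty()) {
                assignment.put(cur.temp(), freeRegs.pop());
                active.add(cur);
            } else {
                assignment.put(cur.temp(), "SPILL"); // would emit STORE/LOAD around uses
            }
        }
        return assignment;
    }
}
```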

Seeing my compiler smartly juggling registers, spilling to the stack, and then generating the final, runnable assembly was the most satisfying moment of the entire project. It felt like I was solving a Sudoku puzzle on a massive scale.

Lesson 4: Why Java Was a Surprisingly Good Fit

Some might argue that a "systems" project like a compiler should be written in C or C++. But for this project, Java was an excellent choice.

  • Strong Typing and OOP: Designing the AST, IR, and token structures was a breeze with Java's classes and interfaces. Using the visitor pattern to walk the AST felt clean and type-safe (see the sketch after this list).
  • Garbage Collection: The compiler itself generates a lot of temporary objects—AST nodes, IR instructions, symbol table entries. Not having to manually manage the memory for the compiler's own internal workings was a massive productivity boost.
  • Rich Standard Library: Maps for symbol tables, lists for token streams, and robust string manipulation tools were all there, ready to go.
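
To make that visitor point concrete, here's a minimal, self-contained sketch; the node and visitor names are illustrative, not the project's actual classes.

```java
// Each node type knows how to hand itself to a visitor.
interface Node {
    <R> R accept(NodeVisitor<R> visitor);
}

record Num(int value) implements Node {
    public <R> R accept(NodeVisitor<R> v) { return v.visitNum(this); }
}

record Add(Node left, Node right) implements Node {
    public <R> R accept(NodeVisitor<R> v) { return v.visitAdd(this); }
}

// One method per node type; the compiler enforces exhaustiveness.
interface NodeVisitor<R> {
    R visitNum(Num node);
    R visitAdd(Add node);
}

// One compiler pass = one visitor implementation, no instanceof chains.
class ConstantEvaluator implements NodeVisitor<Integer> {
    public Integer visitNum(Num node) { return node.value(); }
    public Integer visitAdd(Add node) {
        return node.left().accept(this) + node.right().accept(this);
    }
}
```

Each pass (type checker, TAC generator, code generator) becomes one visitor implementation, and the Java compiler flags at compile time any pass that forgets to handle a node type.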

The performance of the compiler itself was never an issue. Java is more than fast enough for a project of this scale. The lesson? The best language for a learning project is often the one you're most productive in.

Was It Worth It? Absolutely.

Building a compiler and a VM from scratch is the programming equivalent of taking an engine apart and putting it back together. You gain an intuition for things that were previously just abstract rules.

I now have a much deeper appreciation for what goes into register allocation, calling conventions, and the sheer complexity that modern compilers manage to hide from us. The next time you write a line of code, take a moment to appreciate the incredible journey it takes from your text editor to the CPU. If you're curious enough, I highly recommend trying to build that bridge yourself.
