From Scratch: My 64-bit RISC-V VM and Compiler in Java

A deep-dive into my from-scratch project: building a 64-bit RISC-V virtual machine and a custom compiler entirely in Java. Discover the architecture, challenges, and key learnings.

Alex Carter

A systems programmer passionate about building compilers and virtual machines from the ground up.

There's a unique kind of magic in computing that we often take for granted. You write print("Hello, World!"), hit run, and just like that, text appears on your screen. But what happens in the abyss between your high-level command and the silicon that executes it? I’ve always been fascinated by this question, and I decided the best way to find the answer was to build that abyss myself.

This is the story of my journey building a 64-bit RISC-V virtual machine and a compiler for a custom high-level language, all from scratch, entirely in Java. It’s a project born from pure curiosity—a desire to peel back the layers of abstraction and truly understand how a computer works at a fundamental level.

Why RISC-V and Java? A Surprising Duo

The first two questions I usually get are "Why RISC-V?" and "...Why Java?"

The Case for RISC-V

RISC-V is an open-standard instruction set architecture (ISA). Unlike proprietary ISAs like x86 or ARM, its specification is free to use for any purpose. This openness is fantastic, but the real win for a project like this is its modularity and simplicity. The base integer instruction set, RV64I, is remarkably clean, with only about 50 instructions. This made it the perfect, manageable target for a one-person project. I didn't need to wrestle with decades of legacy features; I could focus on the core concepts of a modern 64-bit CPU.

The Case for Java

Choosing Java for a low-level project like a VM might seem counterintuitive. Wouldn't C++ or Rust be a better fit? For raw performance, absolutely. But my goal wasn't to beat performance benchmarks; it was to maximize learning and development speed. Java offered several key advantages:

  • Automatic Memory Management: Not having to worry about malloc and free for the VM and compiler's own data structures was a huge productivity boost.
  • Excellent Tooling: A rich ecosystem of build tools (Maven), testing frameworks (JUnit), and powerful IDEs made the development process smooth.
  • Portability: The VM runs on any machine with a JVM, which is a neat bonus.
  • The Challenge: Frankly, it was an interesting challenge. It forced me to think about how to represent low-level concepts like registers and memory in a high-level, object-oriented environment.

The Heart of the Machine: The VM Architecture

At its core, a virtual machine that emulates a CPU is surprisingly simple. It's a program that endlessly cycles through three steps: Fetch, Decode, and Execute.

My VM is built around a few key components:

  • Memory: A single, large byte[] array. This represents the entire addressable memory space for the programs my VM will run.
  • Registers: A long[32] array. These are the 32 general-purpose 64-bit registers that form the CPU's working memory. Register x0 is hardwired to zero, as per the RISC-V spec.
  • Program Counter (PC): A long variable that holds the memory address of the next instruction to be executed.
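
The three components above can be sketched in a few lines of Java. This is a minimal illustration, not the project's actual code: the class name `VmState` and the helpers `readReg`, `writeReg`, and `read32` are names I made up for this example. It shows one way to keep x0 pinned to zero and to assemble a little-endian 32-bit instruction word from the byte array.

```java
/** Hypothetical sketch of the VM state: flat memory, 32 registers, and a PC. */
final class VmState {
    final byte[] memory;              // the guest's entire addressable memory
    final long[] regs = new long[32]; // x0..x31, each 64 bits wide
    long pc;                          // address of the next instruction

    VmState(int memorySize) {
        memory = new byte[memorySize];
    }

    /** Reads a register; x0 always reads as zero. */
    long readReg(int index) {
        return index == 0 ? 0L : regs[index];
    }

    /** Writes a register; writes to x0 are silently discarded. */
    void writeReg(int index, long value) {
        if (index != 0) {
            regs[index] = value;
        }
    }

    /** Assembles a 32-bit little-endian word from four consecutive bytes. */
    int read32(long address) {
        int a = Math.toIntExact(address);
        return (memory[a] & 0xFF)
             | (memory[a + 1] & 0xFF) << 8
             | (memory[a + 2] & 0xFF) << 16
             | (memory[a + 3] & 0xFF) << 24;
    }

    public static void main(String[] args) {
        VmState vm = new VmState(1024);
        vm.writeReg(0, 42);  // ignored: x0 stays zero
        vm.writeReg(5, 123); // x5 is t0 in the standard ABI
        vm.memory[0] = 0x13; // low byte of 0x00000013 (addi x0, x0, 0, the canonical NOP)
        System.out.println(vm.readReg(0) + " " + vm.readReg(5) + " "
                + Integer.toHexString(vm.read32(0))); // prints "0 123 13"
    }
}
```

Masking each byte with `0xFF` matters because Java's `byte` is signed; without it, any byte of 0x80 or above would smear sign bits across the assembled word.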

The main loop looks something like this (in pseudocode):


while (running) {
  // 1. Fetch the 32-bit instruction word at the current PC
  int instruction = memory.read32(pc);

  // 2. Decode
  Opcode opcode = decodeOpcode(instruction);
  int rd = decodeRd(instruction);   // Destination register
  int rs1 = decodeRs1(instruction); // Source register 1
  int rs2 = decodeRs2(instruction); // Source register 2
  long immediate = decodeImmediate(instruction);

  // 3. Execute; returns the next PC, which is pc + 4
  //    unless the instruction was a jump or a taken branch
  pc = executeInstruction(opcode, rd, rs1, rs2, immediate, pc);
}
  

The most tedious part was decoding. RISC-V instructions are 32 bits wide. The opcode and the register fields (rd, rs1, rs2) always sit at the same bit positions, but each instruction format (R-type, I-type, S-type, B-type, U-type, J-type) lays out its immediate value differently, and some formats scatter the immediate's bits across the word. This meant lots of bitwise shifting and masking. Getting the sign extension of immediate values correct was a classic stumbling block that took a few tries to nail down.
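
To make the shifting and masking concrete, here is a small sketch of field extraction for the register fields and for the I-type and S-type immediates. The class and method names (`Decoder`, `immI`, `immS`, and so on) are invented for this example and are not taken from the project; the bit positions follow the RISC-V base instruction formats. The sign-extension trick is Java's arithmetic right shift (`>>`), which copies bit 31 downward, as opposed to the logical shift (`>>>`), which fills with zeros.

```java
/** Hypothetical field extractors for 32-bit RISC-V instruction words. */
final class Decoder {
    static int opcode(int insn) { return insn & 0x7F; }          // bits 6..0
    static int rd(int insn)     { return (insn >>> 7) & 0x1F; }  // bits 11..7
    static int funct3(int insn) { return (insn >>> 12) & 0x7; }  // bits 14..12
    static int rs1(int insn)    { return (insn >>> 15) & 0x1F; } // bits 19..15
    static int rs2(int insn)    { return (insn >>> 20) & 0x1F; } // bits 24..20

    /** I-type immediate: bits 31..20, sign-extended by the arithmetic shift. */
    static long immI(int insn) {
        return insn >> 20;
    }

    /** S-type immediate: bits 31..25 hold imm[11:5], bits 11..7 hold imm[4:0]. */
    static long immS(int insn) {
        int upper = (insn >> 25) << 5;   // sign-extended imm[11:5], moved into place
        int lower = (insn >>> 7) & 0x1F; // imm[4:0]
        return upper | lower;
    }

    public static void main(String[] args) {
        int addiMinusOne = 0xFFF00293; // addi t0, zero, -1
        System.out.println(opcode(addiMinusOne) + " " + rd(addiMinusOne)
                + " " + immI(addiMinusOne)); // prints "19 5 -1"
        int storeDouble = 0xFE743C23;  // sd t2, -8(s0)
        System.out.println(immS(storeDouble)); // prints "-8"
    }
}
```

Using `>>>` in `immI` by mistake would turn -1 into 4095, which is exactly the kind of bug that only shows up once a program uses a negative offset.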

Breathing Life into the VM: The Compiler

A VM is useless without programs to run. Writing raw machine code by hand is possible but excruciating. I needed a compiler to translate a more human-friendly language into the RISC-V instructions my VM understands.

I designed a simple, C-like language I call "SimpleLang." Its goal was to be just powerful enough to write interesting little programs.

Table 1: SimpleLang vs. C Subset Feature Comparison

| Feature | My SimpleLang | C Language Subset |
| --- | --- | --- |
| Variable Declaration | var x = 10; | int x = 10; |
| Control Flow | if (x > 5) { ... } | if (x > 5) { ... } |
| Loops | while (i < 10) { ... } | while (i < 10) { ... } |
| Functions | fun add(a, b) { return a + b; } | int add(int a, int b) { return a + b; } |
| Types | 64-bit Integers only | Multiple (int, char, float, etc.) |
| Pointers | No | Yes |

The compiler follows a classic pipeline:

  1. Lexer (Tokenizer): The lexer scans the raw source code string and converts it into a stream of tokens. For example, var x = 10; becomes `[VAR, IDENTIFIER("x"), EQUALS, NUMBER(10), SEMICOLON]`.
  2. Parser: The parser takes the token stream and builds an Abstract Syntax Tree (AST). This is a tree structure that represents the grammatical structure of the code. It's at this stage that syntax errors like a missing semicolon are caught.
  3. Code Generator: This is where the magic happens. The code generator traverses the AST and emits RISC-V machine code. For example, when it sees an addition node in the tree, it emits an ADD instruction. It manages register allocation (deciding which temporary values go into which registers) and maps variable names to memory locations on the stack.
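
As an illustration of the first stage of that pipeline, here is a deliberately tiny lexer sketch. It is not the project's lexer: the class name `MiniLexer`, the `PLUS` token, and the choice to represent tokens as plain strings are all assumptions made to keep the example short. A real lexer would use a token type with position information and cover the full language.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy lexer for a handful of SimpleLang-style tokens. */
final class MiniLexer {
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip spaces, tabs, and newlines
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add("NUMBER(" + src.substring(start, i) + ")");
            } else if (Character.isLetter(c) || c == '_') {
                int start = i;
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) i++;
                String word = src.substring(start, i);
                // Keywords are recognized after scanning the whole word
                tokens.add(word.equals("var") ? "VAR" : "IDENTIFIER(\"" + word + "\")");
            } else if (c == '=') {
                tokens.add("EQUALS");
                i++;
            } else if (c == '+') {
                tokens.add("PLUS");
                i++;
            } else if (c == ';') {
                tokens.add("SEMICOLON");
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [VAR, IDENTIFIER("x"), EQUALS, NUMBER(10), SEMICOLON]
        System.out.println(tokenize("var x = 10;"));
    }
}
```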

Putting It All Together: A Simple Calculation's Journey

Let's trace a very simple program: var result = 123 + 456;

  1. Compilation: The compiler parses this line. The code generator decides to load the number 123 into a register (say, t0), load 456 into another (t1), add them together, and store the result in a third register (t2). It then generates instructions to store the value from t2 into the memory location allocated for the `result` variable on the stack.
  2. Generated Assembly (for humans):
    
    # Load immediate values into temporary registers
    li   t0, 123          # t0 = 123
    li   t1, 456          # t1 = 456
    
    # Perform the addition
    add  t2, t0, t1       # t2 = t0 + t1
    
    # Store the result on the stack (e.g., at an offset from the frame pointer fp)
    sd   t2, -8(fp)       # Memory[fp - 8] = t2 (sd stores all 64 bits; sw would store only the low 32)
          
  3. Execution: The VM loads the compiled binary code into its memory. The PC is set to the address of the first instruction. The VM fetches the first instruction (`li t0, 123` is a pseudo-instruction that the assembler expands to `addi t0, zero, 123`), decodes it, and executes it by placing the value 123 into its internal `long[]` array at index 5, the slot for register `t0` (x5). It proceeds instruction by instruction until the final value (579) is stored in the part of the VM's byte array that represents the stack.
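
The whole trace fits in a toy interpreter. The sketch below is an illustration under several assumptions, not the project's VM: the class `TraceDemo` is hypothetical, it understands only ADDI, ADD, and SD, and it keeps the program in an `int[]` instead of loading it into guest memory. The instruction words are hand-assembled; `li` becomes `addi` from the zero register, the store is `sd` (store doubleword) so that all 64 bits are written, and `fp` is register s0 (x8).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical mini-interpreter that handles only ADDI, ADD, and SD. */
final class TraceDemo {
    static final int[] PROGRAM = {
        0x07B00293, // addi t0, zero, 123   (expansion of li t0, 123)
        0x1C800313, // addi t1, zero, 456   (expansion of li t1, 456)
        0x006283B3, // add  t2, t0, t1
        0xFE743C23, // sd   t2, -8(s0)      (s0 doubles as the frame pointer)
    };

    /** Runs the program and returns the 64-bit value left at fp - 8. */
    static long run(int[] program, int memSize, long fp) {
        ByteBuffer mem = ByteBuffer.allocate(memSize).order(ByteOrder.LITTLE_ENDIAN);
        long[] x = new long[32];
        x[8] = fp; // x8 is s0/fp
        for (int insn : program) { // straight-line code, so no PC is needed here
            int opcode = insn & 0x7F;
            int rd = (insn >>> 7) & 0x1F;
            int funct3 = (insn >>> 12) & 0x7;
            int rs1 = (insn >>> 15) & 0x1F;
            int rs2 = (insn >>> 20) & 0x1F;
            int funct7 = insn >>> 25;
            if (opcode == 0x13 && funct3 == 0) {                       // ADDI
                x[rd] = x[rs1] + (insn >> 20);
            } else if (opcode == 0x33 && funct3 == 0 && funct7 == 0) { // ADD
                x[rd] = x[rs1] + x[rs2];
            } else if (opcode == 0x23 && funct3 == 3) {                // SD
                long imm = ((insn >> 25) << 5) | ((insn >>> 7) & 0x1F);
                mem.putLong(Math.toIntExact(x[rs1] + imm), x[rs2]);
            } else {
                throw new IllegalStateException("Unsupported instruction 0x" + Integer.toHexString(insn));
            }
            x[0] = 0; // keep x0 hardwired to zero
        }
        return mem.getLong(Math.toIntExact(fp - 8));
    }

    public static void main(String[] args) {
        System.out.println(run(PROGRAM, 64, 64)); // prints 579
    }
}
```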

Challenges and Lessons Learned

This project was an incredible learning experience, filled with moments of frustration and triumph.

  • The ISA Spec is Your Bible: The RISC-V specification manual is dense but absolutely essential. I spent countless hours with it, triple-checking bit field layouts and instruction behaviors. My respect for hardware engineers who do this for a living skyrocketed.
  • Function Calls are Hard: Implementing the calling convention—the rules for how functions pass arguments and return values—was the single most complex part. Managing the stack pointer, frame pointer, and saving/restoring caller-saved registers correctly required careful planning and a lot of debugging.
  • The Power of Simplicity: Seeing a high-level `while` loop boil down to little more than a conditional branch that compares two registers (like `BNE`, Branch if Not Equal) and a jump back to the top was a profound moment. It demystified so much about how software actually runs.
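
The loop lowering mentioned in the last point can be sketched as a small assembly emitter. This is an assumed shape, not the project's code generator: `WhileLowering`, `lowerWhileLessThan`, and the label naming scheme are hypothetical, and the operands are taken to be in registers already. For a condition like `i < 10`, the natural RISC-V choice is the inverted branch `bge`, which leaves the loop as soon as the condition stops holding.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical emitter that lowers `while (lhs < rhs) { body }` to assembly text. */
final class WhileLowering {
    static List<String> lowerWhileLessThan(String lhsReg, String rhsReg,
                                           List<String> body, int labelId) {
        String top = ".Lwhile_" + labelId;
        String end = ".Lend_" + labelId;
        List<String> out = new ArrayList<>();
        out.add(top + ":");
        // Inverted condition: exit when lhs >= rhs (signed comparison)
        out.add("    bge  " + lhsReg + ", " + rhsReg + ", " + end);
        for (String line : body) {
            out.add("    " + line);
        }
        out.add("    j    " + top); // unconditional jump back to re-test the condition
        out.add(end + ":");
        return out;
    }

    public static void main(String[] args) {
        // while (i < 10) { i = i + 1; } with i in t0 and the constant 10 in t1
        for (String line : lowerWhileLessThan("t0", "t1", List.of("addi t0, t0, 1"), 0)) {
            System.out.println(line);
        }
    }
}
```

Passing a unique `labelId` per loop keeps labels from colliding when loops are nested or appear more than once in a function.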

Final Thoughts and What's Next

Building a VM and compiler from scratch is a rite of passage for anyone interested in systems programming. It’s not about creating a product to compete with GCC or the JVM. It’s about the journey. It's about replacing a black box with a glass one, allowing you to peer inside and understand its inner workings. The feeling of running a program, compiled by your compiler, on top of your virtual machine, is indescribably rewarding.

So, what's next? The project is far from over. I plan to implement the 'M' extension for multiplication and division instructions, add support for floating-point numbers, and maybe even start working on a tiny operating system to manage multiple processes. The rabbit hole goes deep, and I'm excited to see how far it goes.

If you’ve ever been curious about what lies beneath your code, I can’t recommend a project like this enough. Just pick a simple ISA, a language you love, and start building. You won't regret it.
