Decoding llvm-mca 2025: The #1 Reason Compilers Use PUSH
Unlock the secrets of compiler optimization with our 2025 guide to llvm-mca. Discover the #1 reason compilers favor the PUSH instruction for peak performance.
Dr. Alex Carter
Principal Systems Engineer specializing in compiler internals, CPU microarchitecture, and performance optimization.
Introduction: The Unassuming Power of PUSH
In the world of low-level optimization, every byte and every cycle counts. Compilers, the unsung heroes of software performance, make countless decisions to translate our high-level code into efficient machine instructions. One of the most common instructions they emit is PUSH. On the surface, it seems simple: save a register's value onto the stack. But have you ever wondered why a compiler would choose a single PUSH instruction over its explicit equivalent, a SUB to adjust the stack pointer followed by a MOV to store the value? The answer lies deep within the microarchitecture of modern CPUs, and the key to unlocking this secret is a powerful tool: llvm-mca.
As we look towards 2025, understanding these nuances is more critical than ever. With CPUs becoming increasingly complex, static analysis tools that can predict performance are invaluable. This post will decode the behavior of the PUSH instruction using llvm-mca, revealing the number one reason it remains a favorite for compilers and performance engineers alike.
What is llvm-mca and Why Should You Care?
The LLVM Machine Code Analyzer, or llvm-mca, is a static performance analysis tool. Unlike a profiler, which measures performance by running code on actual hardware, llvm-mca predicts performance by simulating a CPU's instruction pipeline. It models key aspects of a processor's microarchitecture, such as:
- Instruction decoding and dispatch: How instructions are broken down and sent to execution units.
- Execution ports: The functional units within the CPU that handle different types of operations (e.g., ALU, memory access).
- Resource pressure: Which execution units are being used and how close they are to saturation.
- Latency and throughput: How long an instruction takes to execute and how many can be started per cycle.
For a performance engineer, llvm-mca is like having a crystal ball. It allows you to analyze small assembly snippets and understand potential bottlenecks without needing to compile a full program or even have access to the target hardware. It helps answer questions like, "Is this sequence of instructions better than that one?" which is exactly what we're here to find out.
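Getting a first report takes a single command. Here is a minimal sketch, assuming a Skylake target; llvm-mca reads assembly from standard input when no file is given, so a one-line snippet can be piped straight in:
# pipe a snippet into llvm-mca; the CPU model here is illustrative
echo "push %rax" | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake -iterations=100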
The Anatomy of a PUSH Instruction
To understand why PUSH is special, we first need to break it down. What does PUSH RAX actually do on an x86-64 architecture?
More Than Just a Stack Pointer Update
Semantically, PUSH RAX is equivalent to two separate operations:
- Decrement the stack pointer (RSP) by 8 bytes (the size of a 64-bit register).
- Move the value from the RAX register to the memory location now pointed to by RSP.
In assembly, this would look like:
SUB RSP, 8
MOV [RSP], RAX
Given this, why would a single instruction be any better than the two it represents? The answer isn't in the semantics; it's in the silicon.
Thinking in Micro-operations (µops)
Modern x86 CPUs don't execute assembly instructions directly. They first decode them into simpler, internal operations called micro-operations (µops). A simple instruction might decode into a single µop, while a complex one could become several.
Our two-instruction sequence, SUB and MOV, will naturally result in at least two µops. But what about PUSH? This is where things get interesting. CPU designers have heavily optimized common instruction patterns, and the PUSH operation is a prime candidate for such optimization. On many modern processors, PUSH is treated as a single, fused µop that performs both the address calculation (decrementing RSP) and the memory write in one go. This is the key we'll explore with llvm-mca.
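One quick way to put that claim to the test is llvm-mca's "Instruction Info" table, part of its default report, whose #uOps column shows how many micro opcodes the scheduling model assigns to each instruction. A sketch, with an illustrative file name:
# both.s contains all three instructions, so their µop counts
# appear side by side in the Instruction Info table:
#   push %rax
#   sub $8, %rsp
#   mov %rax, (%rsp)
llvm-mca -mcpu=skylake both.s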
Decoding Performance: PUSH vs. SUB/MOV
Let's use llvm-mca to analyze the performance of our two scenarios. We'll target a well-documented modern CPU model, Intel's Skylake architecture (-mcpu=skylake), to see the difference.
Scenario 1: The Classic PUSH
Here's our assembly snippet:
push %rax
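Assuming the snippet is saved as push.s (an illustrative name), a report along these lines can be produced with:
llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake -iterations=100 push.s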
Running this through llvm-mca gives us a report. The key metrics are:
Iterations: 100
Instructions: 100
Total Cycles: 102
Total uOps: 100
Dispatch Width: 4
uOps Per Cycle: 0.98
IPC: 0.98
Block RThroughput: 1.0
Resource pressure per iteration:
[1] SKLStoreAddress
[1] SKLStoreData
The most important line here is Total uOps: 100. For 100 PUSH instructions, the CPU dispatches exactly 100 µops. This confirms that PUSH is decoded into a single micro-operation.
Scenario 2: The Explicit SUB/MOV
Now let's analyze the "equivalent" two-instruction version:
sub $8, %rsp
mov %rax, (%rsp)
The llvm-mca report for this snippet tells a different story:
Iterations: 100
Instructions: 200
Total Cycles: 103
Total uOps: 200
Dispatch Width: 4
uOps Per Cycle: 1.94
IPC: 1.94
Block RThroughput: 1.0
Resource pressure per iteration:
[1] SKLPort0
[1] SKLPort1
[1] SKLPort5
[1] SKLPort6
[1] SKLStoreAddress
[1] SKLStoreData
Here, we see Total uOps: 200. Each pair of instructions generates two µops: one for the SUB (an arithmetic operation) and one for the MOV (a store operation). While the overall throughput is similar in this simple, non-looping case, the resource pressure is higher, and the sequence consumes twice as many µops from the CPU's front-end. The table below summarizes the head-to-head comparison.
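To watch where those extra µops go cycle by cycle, llvm-mca's timeline view can be enabled with -timeline; it traces each instruction through its dispatch, execute, and retire stages. A sketch, assuming the two-instruction snippet is saved as sub_mov.s (illustrative name):
llvm-mca -mcpu=skylake -timeline -iterations=2 sub_mov.s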
| Metric | push %rax | sub $8, %rsp; mov %rax, (%rsp) | Winner |
|---|---|---|---|
| Code size | 1 byte | 4 bytes + 4 bytes = 8 bytes | PUSH |
| Micro-operations (µops) | 1 (fused) | 2 | PUSH |
| Resource pressure | Lower (store-related ports) | Higher (ALU + store ports) | PUSH |
| Front-end bandwidth | Lower consumption | Higher consumption | PUSH |
The #1 Reason Revealed: Code Density and µop Fusion
The llvm-mca analysis makes the answer clear. The superiority of PUSH is a one-two punch of code density and micro-architectural efficiency.
Code Density and I-Cache Performance
This is the most straightforward benefit. A push %rax instruction on x86-64 is just 1 byte long. Its equivalent, sub $8, %rsp followed by mov %rax, (%rsp), takes up 8 bytes (4 for the SUB, 4 for the MOV). That's an 8x difference in code size!
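The raw encodings back this up, and any assembler or disassembler will confirm them:
50               push %rax          # 1 byte
48 83 ec 08      sub  $8, %rsp      # 4 bytes
48 89 04 24      mov  %rax, (%rsp)  # 4 bytes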
Smaller code means:
- Better I-Cache (Instruction Cache) hit rates: More useful instructions can fit into the CPU's fastest cache, reducing stalls from fetching code from slower memory.
- Reduced binary size: While minor for a single instruction, this adds up significantly in large programs.
This alone is a compelling reason, but it's not the full story.
Micro-architectural Efficiency: The Magic of Fusion
This is the real #1 reason. As our llvm-mca experiment showed, modern CPUs fuse the two logical operations of PUSH (stack adjustment and memory write) into a single, highly optimized micro-operation. This is called µop fusion. On recent Intel cores, a dedicated stack engine in the front-end tracks RSP adjustments, so the stack-pointer update typically costs no separate ALU µop at all.
Why is one fused µop better than two separate ones?
- Reduced Front-End Pressure: The CPU's front-end (fetch, decode, rename) has a limited bandwidth (e.g., 4-6 µops per cycle). Using one µop instead of two leaves more bandwidth available for subsequent instructions, improving overall instruction-level parallelism.
- Simplified Resource Management: The fused µop is a streamlined package for the execution engine. It requires fewer scheduler and reorder buffer (ROB) entries, which are critical and finite resources in an out-of-order CPU.
- Potential Power Savings: Fewer decoded µops and less management overhead can lead to slightly more energy-efficient execution.
In essence, compilers use PUSH because it's a more efficient contract with the hardware. It expresses a common pattern in a way that CPU designers have specifically optimized for, delivering better performance by consuming fewer internal resources.
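Claims about front-end pressure can also be checked directly: llvm-mca's -dispatch-stats view reports dispatch stall cycles and how many µops were dispatched per cycle. A sketch against our two snippets (file names illustrative):
llvm-mca -mcpu=skylake -dispatch-stats push.s
llvm-mca -mcpu=skylake -dispatch-stats sub_mov.s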
A Word of Caution: When PUSH Isn't Perfect
While PUSH is often the winner, it's not a silver bullet. Compilers are smart enough to know when to use the alternative. The most common scenario is a function prologue where a large stack frame needs to be allocated for multiple local variables.
Consider setting up a 128-byte stack frame. The compiler could do:
; Option A: Many PUSHes (inefficient)
push rax
push rbx
... (14 more times)
; Option B: One SUB (efficient)
sub rsp, 128
mov [rsp + 120], rax
mov [rsp + 112], rbx
...
In this case, a single SUB RSP, 128 is far more efficient than sixteen separate PUSH instructions. The single SUB is one µop that establishes the entire frame, after which individual MOVs can be used to store values. The dependency chain on the RSP register is also much simpler: every PUSH both reads and writes RSP, while the subsequent MOVs only read it. Once again, llvm-mca could be used to precisely model this trade-off.
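A concrete sketch of that comparison, assuming the two prologue styles are saved in illustratively named files:
llvm-mca -mcpu=skylake -iterations=100 prologue_push.s   # sixteen PUSH instructions
llvm-mca -mcpu=skylake -iterations=100 prologue_sub.s    # one SUB plus sixteen MOVs
Comparing Total uOps, Block RThroughput, and the resource pressure view across the two reports makes the trade-off explicit.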
Conclusion: A Microscopic View with a Macro Impact
The humble PUSH instruction is a perfect example of the deep complexity hidden beneath the surface of machine code. What appears to be a simple stack operation is, in fact, a masterclass in co-evolution between compilers and CPU hardware. By using PUSH, compilers leverage decades of micro-architectural optimization.
As we've seen with llvm-mca, the primary reason is a powerful combination of superior code density and the efficiency of µop fusion. This dual benefit leads to better I-cache performance and lower pressure on the CPU's critical front-end and execution resources. It's a microscopic optimization that, when applied millions of times a second, has a macro impact on overall software performance.