Decoding llvm-mca 2025: The #1 Reason Compilers Use PUSH
Unlock the secrets of compiler optimization with our 2025 guide to llvm-mca. Discover the #1 reason compilers favor the PUSH instruction for peak performance.
Dr. Alex Carter
Principal Systems Engineer specializing in compiler internals, CPU microarchitecture, and performance optimization.
Introduction: The Unassuming Power of PUSH
In the world of low-level optimization, every byte and every cycle counts. Compilers, the unsung heroes of software performance, make countless decisions to translate our high-level code into efficient machine instructions. One of the most common instructions they emit is PUSH. On the surface, it seems simple: save a register's value onto the stack. But have you ever wondered why a compiler would choose a single PUSH instruction over its explicit equivalent, a SUB to adjust the stack pointer followed by a MOV to store the value? The answer lies deep within the microarchitecture of modern CPUs, and the key to unlocking this secret is a powerful tool: llvm-mca.
As we look towards 2025, understanding these nuances is more critical than ever. With CPUs becoming increasingly complex, static analysis tools that can predict performance are invaluable. This post will decode the behavior of the PUSH instruction using llvm-mca, revealing the number one reason it remains a favorite for compilers and performance engineers alike.
What is llvm-mca and Why Should You Care?
The LLVM Machine Code Analyzer, or llvm-mca, is a static performance analysis tool. Unlike a profiler, which measures performance by running code on actual hardware, llvm-mca predicts performance by simulating a CPU's instruction pipeline. It models key aspects of a processor's microarchitecture, such as:
- Instruction decoding and dispatch: How instructions are broken down and sent to execution units.
- Execution ports: The functional units within the CPU that handle different types of operations (e.g., ALU, memory access).
- Resource pressure: Which execution units are being used and how close they are to saturation.
- Latency and throughput: How long an instruction takes to execute and how many can be started per cycle.
For a performance engineer, llvm-mca is like having a crystal ball. It allows you to analyze small assembly snippets and understand potential bottlenecks without needing to compile a full program or even have access to the target hardware. It helps answer questions like, "Is this sequence of instructions better than that one?" which is exactly what we're here to find out.
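Getting a first report takes a single command. Here is a minimal sketch, assuming a Skylake target; llvm-mca reads assembly from standard input when no file is given, so a one-line snippet can be piped straight in:
# pipe a snippet into llvm-mca; the CPU model here is illustrative
echo "push %rax" | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake -iterations=100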
The Anatomy of a PUSH Instruction
To understand why PUSH is special, we first need to break it down. What does PUSH RAX actually do on an x86-64 architecture?
More Than Just a Stack Pointer Update
Semantically, PUSH RAX is equivalent to two separate operations:
- Decrement the stack pointer (RSP) by 8 bytes (the size of a 64-bit register).
- Move the value from the RAX register to the memory location now pointed to by RSP.
In assembly, this would look like:
SUB RSP, 8
MOV [RSP], RAX
Given this, why would a single instruction be any better than the two it represents? The answer isn't in the semantics; it's in the silicon.
Thinking in Micro-operations (µops)
Modern x86 CPUs don't execute assembly instructions directly. They first decode them into simpler, internal operations called micro-operations (µops). A simple instruction might decode into a single µop, while a complex one could become several.
Our two-instruction sequence, SUB and MOV, will naturally result in at least two µops. But what about PUSH? This is where things get interesting. CPU designers have heavily optimized common instruction patterns, and the PUSH operation is a prime candidate for such optimization. On many modern processors, PUSH is treated as a single, fused µop that performs both the address calculation (decrementing RSP) and the memory write in one go. This is the key we'll explore with llvm-mca.
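One quick way to put that claim to the test is llvm-mca's "Instruction Info" table, part of its default report, whose #uOps column shows how many micro opcodes the scheduling model assigns to each instruction. A sketch, with an illustrative file name:
# both.s contains all three instructions, so their µop counts
# appear side by side in the Instruction Info table:
#   push %rax
#   sub $8, %rsp
#   mov %rax, (%rsp)
llvm-mca -mcpu=skylake both.s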
Decoding Performance: PUSH vs. SUB/MOV
Let's use llvm-mca to analyze the performance of our two scenarios. We'll target a well-documented modern CPU model, Intel's Skylake architecture (-mcpu=skylake), to see the difference.
Scenario 1: The Classic PUSH
Here's our assembly snippet:
push %rax
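Assuming the snippet is saved as push.s (an illustrative name), a report along these lines can be produced with:
llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake -iterations=100 push.s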
Running this through llvm-mca gives us a report. The key metrics are:
Iterations: 100
Instructions: 100
Total Cycles: 102
Total uOps: 100
Dispatch Width: 4
uOps Per Cycle: 0.98
IPC: 0.98
Block RThroughput: 1.0
Resource pressure per iteration:
[1] SKLStoreAddress
[1] SKLStoreData
The most important line here is Total uOps: 100. For 100 PUSH instructions, the CPU dispatches exactly 100 µops. This confirms that PUSH is decoded into a single micro-operation.
Scenario 2: The Explicit SUB/MOV
Now let's analyze the "equivalent" two-instruction version:
sub $8, %rsp
mov %rax, (%rsp)
The llvm-mca report for this snippet tells a different story:
Iterations: 100
Instructions: 200
Total Cycles: 103
Total uOps: 200
Dispatch Width: 4
uOps Per Cycle: 1.94
IPC: 1.94
Block RThroughput: 1.0
Resource pressure per iteration:
[1] SKLPort0
[1] SKLPort1
[1] SKLPort5
[1] SKLPort6
[1] SKLStoreAddress
[1] SKLStoreData
Here, we see Total uOps: 200. Each pair of instructions generates two µops: one for the SUB (an arithmetic operation) and one for the MOV (a store operation). While the overall throughput is similar in this simple, non-looping case, the resource pressure is higher, and the sequence consumes twice as many µops from the CPU's front-end. The table below summarizes the head-to-head comparison.
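To watch where those extra µops go cycle by cycle, llvm-mca's timeline view can be enabled with -timeline; it traces each instruction through its dispatch, execute, and retire stages. A sketch, assuming the two-instruction snippet is saved as sub_mov.s (illustrative name):
llvm-mca -mcpu=skylake -timeline -iterations=2 sub_mov.s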
| Metric | push %rax | sub $8, %rsp; mov %rax, (%rsp) | Winner |
|---|---|---|---|
| Code size | 1 byte | 4 bytes + 4 bytes = 8 bytes | PUSH |
| Micro-operations (µops) | 1 (fused) | 2 | PUSH |
| Resource pressure | Lower (store-related ports) | Higher (ALU + store ports) | PUSH |
| Front-end bandwidth | Lower consumption | Higher consumption | PUSH |
The #1 Reason Revealed: Code Density and µop Fusion
The llvm-mca analysis makes the answer clear. The superiority of PUSH is a one-two punch of code density and micro-architectural efficiency.
Code Density and I-Cache Performance
This is the most straightforward benefit. A push %rax instruction on x86-64 is just 1 byte long. Its equivalent, sub $8, %rsp followed by mov %rax, (%rsp), takes up 8 bytes (4 for the SUB, 4 for the MOV). That's an 8x difference in code size!
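The raw encodings back this up, and any assembler or disassembler will confirm them:
50               push %rax          # 1 byte
48 83 ec 08      sub  $8, %rsp      # 4 bytes
48 89 04 24      mov  %rax, (%rsp)  # 4 bytes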
Smaller code means:
- Better I-Cache (Instruction Cache) hit rates: More useful instructions can fit into the CPU's fastest cache, reducing stalls from fetching code from slower memory.
- Reduced binary size: While minor for a single instruction, this adds up significantly in large programs.
This alone is a compelling reason, but it's not the full story.
Micro-architectural Efficiency: The Magic of Fusion
This is the real #1 reason. As our llvm-mca experiment showed, modern CPUs fuse the two logical operations of PUSH (stack adjustment and memory write) into a single, highly optimized micro-operation. This is called µop fusion. On recent Intel cores, a dedicated stack engine in the front-end tracks RSP adjustments, so the stack-pointer update typically costs no separate ALU µop at all.
Why is one fused µop better than two separate ones?
- Reduced Front-End Pressure: The CPU's front-end (fetch, decode, rename) has a limited bandwidth (e.g., 4-6 µops per cycle). Using one µop instead of two leaves more bandwidth available for subsequent instructions, improving overall instruction-level parallelism.
- Simplified Resource Management: The fused µop is a streamlined package for the execution engine. It requires fewer scheduler and reorder buffer (ROB) entries, which are critical and finite resources in an out-of-order CPU.
- Potential Power Savings: Fewer decoded µops and less management overhead can lead to slightly more energy-efficient execution.
In essence, compilers use PUSH because it's a more efficient contract with the hardware. It expresses a common pattern in a way that CPU designers have specifically optimized for, delivering better performance by consuming fewer internal resources.
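Claims about front-end pressure can also be checked directly: llvm-mca's -dispatch-stats view reports dispatch stall cycles and how many µops were dispatched per cycle. A sketch against our two snippets (file names illustrative):
llvm-mca -mcpu=skylake -dispatch-stats push.s
llvm-mca -mcpu=skylake -dispatch-stats sub_mov.s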
A Word of Caution: When PUSH Isn't Perfect
While PUSH is often the winner, it's not a silver bullet. Compilers are smart enough to know when to use the alternative. The most common scenario is a function prologue where a large stack frame needs to be allocated for multiple local variables.
Consider setting up a 128-byte stack frame. The compiler could do:
; Option A: Many PUSHes (inefficient)
push rax
push rbx
... (14 more times)
; Option B: One SUB (efficient)
sub rsp, 128
mov [rsp + 120], rax
mov [rsp + 112], rbx
...
In this case, a single SUB RSP, 128 is far more efficient than sixteen separate PUSH instructions. The single SUB is one µop that establishes the entire frame, after which individual MOVs can be used to store values. The dependency chain on the RSP register is also much simpler: every PUSH both reads and writes RSP, while the subsequent MOVs only read it. Once again, llvm-mca could be used to precisely model this trade-off.
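A concrete sketch of that comparison, assuming the two prologue styles are saved in illustratively named files:
llvm-mca -mcpu=skylake -iterations=100 prologue_push.s   # sixteen PUSH instructions
llvm-mca -mcpu=skylake -iterations=100 prologue_sub.s    # one SUB plus sixteen MOVs
Comparing Total uOps, Block RThroughput, and the resource pressure view across the two reports makes the trade-off explicit.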
Conclusion: A Microscopic View with a Macro Impact
The humble PUSH instruction is a perfect example of the deep complexity hidden beneath the surface of machine code. What appears to be a simple stack operation is, in fact, a masterclass in co-evolution between compilers and CPU hardware. By using PUSH, compilers leverage decades of micro-architectural optimization.
As we've seen with llvm-mca, the primary reason is a powerful combination of superior code density and the efficiency of µop fusion. This dual benefit leads to better I-cache performance and lower pressure on the CPU's critical front-end and execution resources. It's a microscopic optimization that, when applied millions of times a second, has a macro impact on overall software performance.