ARM's MOP Cache: The Decode-Bypass Optimization That Came and Went

🧩 MOps and the MOP Cache — ARM's Transient Front-End Optimization, Explained

CPU Microarchitecture · ARM Cortex Series Deep Dive

Tracing ARM CPU documentation across generations, you'll notice something unusual: starting with the Cortex-A76 and A77, official materials introduce a term called "MOps (Macro-Operations)" and a structure called the "MOP Cache." A few years later, ARM quietly removes it. ARM has long been praised for simple decoding — an inherent RISC advantage. So why did ARM add a structure specifically designed to cache decode results, only to pull it out again? This article untangles that apparent contradiction through the lens of front-end pipeline design.

🧠 TL;DR — A MOp is an intermediate representation ARM placed between a raw architectural instruction and the final execution-level µOp. The MOP Cache stores these intermediate forms so the decode stage can be bypassed entirely on subsequent executions. It peaked in 2019–2021 and was effectively retired by the 2024 generation.

📚 Three Layers: Instruction → MOp → µOp

A high-performance CPU does not fetch an instruction and execute it directly. The path from memory to execution units involves several transformation stages. ARM engineers on the official community forum define the two key terms as follows:

MOps (Macro-Operations): The instructions as defined by the ISA — the exact operations written in assembly or emitted by a compiler. These are architecturally visible.

µOps (Micro-Operations): The minimum execution units the processor creates internally at runtime. These are implementation-specific and invisible to programmers.

On the Cortex-A78, the front-end pipeline flows through the following layers:


graph TD
  A[L1 I-Cache<br/>Raw Instructions] --> B[Decode Stage]
  B --> C[MOps<br/>Intermediate Repr.]
  C --> D[MOP Cache<br/>Stored MOps]
  D --> E[Rename<br/>µOp Expansion]
  E --> F[Execution Units<br/>ALU · LS · FP]
  style A fill:#eaf2f8,stroke:#2980b9
  style B fill:#fef9e7,stroke:#f39c12
  style C fill:#e8f8f5,stroke:#16a085
  style D fill:#eafaf1,stroke:#27ae60
  style E fill:#fef9e7,stroke:#f39c12
  style F fill:#f4ecf7,stroke:#8e44ad

🔗 Diagram summary: The ARM front-end decodes raw instructions from the I-Cache into MOps (an intermediate representation), stores them in the MOP Cache, then expands them into µOps at the Rename stage before dispatch to execution units. MOps sit one level above µOps in the pipeline hierarchy.

The key insight is that a MOp maps almost 1:1 to an architectural instruction: the average ratio is approximately 1:1.06, or about 6% expansion over raw instructions. On the Cortex-A78, decoding four instructions therefore produces roughly 4.24 MOps on average; the six-MOps-per-cycle figure is the rate at which the MOP Cache can supply MOps, not the output of a four-instruction decode. Each MOp is then split into one or two µOps at Rename; µOps are the true minimum execution unit. This matters because it makes MOps a coarser-grained intermediate form than what x86 pipelines store in their µop caches.
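To make the expansion ratio concrete, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the 1.06 ratio and the "one or two µOps per MOp" bound come from the text above, while the function names are made up for this example.

```python
# Back-of-the-envelope expansion arithmetic (illustrative only).
MOPS_PER_INSTRUCTION = 1.06  # average ratio quoted in the article


def expected_mops(instructions: int, ratio: float = MOPS_PER_INSTRUCTION) -> float:
    """Average number of MOps produced by decoding `instructions` instructions."""
    return instructions * ratio


def uop_bounds(mops: float) -> tuple:
    """Each MOp becomes one or two µOps at Rename, so µOps lie in [mops, 2 * mops]."""
    return (mops, 2 * mops)


if __name__ == "__main__":
    mops = expected_mops(4)
    low, high = uop_bounds(mops)
    print(f"4 instructions -> {mops:.2f} MOps on average")
    print(f"µOps after Rename: between {low:.2f} and {high:.2f}")
```

For a four-instruction decode group this gives about 4.24 MOps, and somewhere between 4.24 and 8.48 µOps after Rename. The actual µOp count depends on the instruction mix, which this sketch does not model.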

CISC vs. RISC: Different Starting Points

| Attribute | CISC (x86) | RISC (ARM AArch64) |
|---|---|---|
| Instruction length | Variable (1–15 bytes) | Fixed 4 bytes |
| Decode complexity | Very high (boundary detection, operand mode analysis) | Low (no boundary ambiguity) |
| Internal decomposition | Instruction → 1–4 µops (since Intel P6, 1995) | Instruction ≈ MOp → 1–2 µops |

x86 processors have executed instructions internally as µops since Intel's P6 microarchitecture — because directly executing variable-length CISC instructions is prohibitively inefficient. ARM, in theory, needs no such decomposition. Yet ARM still introduced an intermediate MOp layer. Understanding why — and why ARM later discarded it — is the crux of this article.

🕰️ Historical Context — x86 Ran This Experiment Twice First

Before ARM introduced its MOP Cache, the x86 ecosystem had already explored "caching decode output" in three distinct designs: two from Intel and one from AMD. ARM's MOP Cache is a later entrant to this design pattern.

Pentium 4 Trace Cache (2000) — A pioneering structure that stored decoded µops in execution order. The high invalidation cost on branch mispredictions ultimately led Intel to abandon it with the Core microarchitecture.

Intel Sandy Bridge Decoded ICache / µop Cache (2011) — A ~1.5K µop capacity decoded instruction cache that bypasses the expensive x86 decoder for hot code paths. Carried forward through Ivy Bridge and Haswell.

AMD Zen Op Cache (2017) — AMD's first µop cache, documented to deliver both performance and power savings. "Not just performance, but power efficiency" was a stated design goal.

ARM's timeline moved one step behind — but moved quickly. The foundation was laid in the Cortex-A76 (2018), and the MOP Cache fully materialized in the A77 (2019).

2018 · A76: MOp fusion, no cache
2019 · A77: MOP Cache introduced
2020–21 · X1 · V1: Peak, 3K entries
2022 · A715: Downsized / removed
2024 · X925: Effectively retired

🏗️ MOP Cache Structure — A Serial Layer Below I-Cache

The most common misconception is worth addressing upfront: the MOP Cache is not a parallel alternative to the I-Cache — it is a serial layer downstream of it. The I-Cache holds raw instruction bytes; after those instructions are decoded into MOps, the results populate the MOP Cache. On the next execution of the same code, the decode stage is bypassed entirely.

| Attribute | L1 I-Cache | MOP Cache |
|---|---|---|
| Stored content | Raw instruction bytes | Decoded MOps |
| Pipeline position | Fetch stage (upstream) | Immediately post-decode |
| On hit | Avoids DRAM / upper-cache access | Bypasses the decode stage entirely |
| A78 capacity | 64 KB | 1.5K MOp entries |

The decision flow through the front-end looks like this:


flowchart TD
  A([Instruction Fetch]) --> B{MOP Cache<br/>Hit?}
  B -->|YES| C[Bypass Decode<br/>~1-cycle savings]
  B -->|NO| D[I-Cache → Decode<br/>Fills MOP Cache]
  C --> E([Rename / Execute])
  D --> E
  style A fill:#3498db,stroke:#2980b9,color:#ffffff
  style B fill:#fef9e7,stroke:#f39c12
  style C fill:#eafaf1,stroke:#27ae60,color:#1e8449
  style D fill:#fdedec,stroke:#e74c3c,color:#c0392b
  style E fill:#3498db,stroke:#2980b9,color:#ffffff

🔁 Diagram summary: After an instruction fetch, a MOP Cache hit allows the decode stage to be fully bypassed, saving approximately one cycle. On a miss, execution falls back to the I-Cache→Decode path, which then populates the MOP Cache. Both paths converge at Rename/Execute.
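The hit/miss flow above can be approximated with a toy software model. The sketch below is an assumption-laden illustration, not a description of ARM's hardware: it treats the MOP Cache as a fully associative LRU structure keyed by instruction address, whereas real designs are set-associative and indexed by fetch block. The class name `MopCacheModel` and its methods are hypothetical.

```python
from collections import OrderedDict


class MopCacheModel:
    """Toy LRU model of a decoded-op cache keyed by fetch address (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> decoded MOp (a string here)
        self.hits = 0
        self.misses = 0

    def fetch(self, address: int) -> str:
        """Return the MOp for `address`, decoding only on a miss."""
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)  # mark as most recently used
            return self.entries[address]       # decode stage bypassed
        self.misses += 1
        mop = self._decode(address)            # I-Cache -> Decode path
        self.entries[address] = mop            # fill the cache
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        return mop

    @staticmethod
    def _decode(address: int) -> str:
        # Stand-in for the real decoder: just label the MOp by its address.
        return f"mop@{address:#x}"

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = MopCacheModel(capacity=1536)            # 1.5K entries, as on the A78
    loop_body = range(0x1000, 0x1000 + 4 * 16, 4)   # 16 four-byte instructions
    for _ in range(100):
        for pc in loop_body:
            cache.fetch(pc)
    print(f"hits={cache.hits} misses={cache.misses} hit rate={cache.hit_rate():.2%}")
```

In this model a 16-instruction loop misses only on its first iteration, so the decoder stand-in runs 16 times across 1,600 fetches. That is the intuition behind the power argument for hot loops.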

According to Android Authority's Cortex-A78 analysis, the MOP Cache achieves high hit rates across a range of workloads. For repetitive code such as hot loops and frequently called functions, the decode stage is rarely activated, which delivers meaningful power savings beyond the raw cycle benefit. This matters in mobile silicon, where thermal budget is a hard constraint.

MOP Cache hit rate (representative workloads): 85%+

ARM's I-Cache also stores instructions in an enriched intermediate format rather than pure raw bytes — the A76–X1 generations use approximately 36–40-bit predecode entries per instruction (a 32-bit instruction word plus metadata bits). This predecode information reduces the cost of full decode; the MOP Cache goes one step further by eliminating decode altogether. The functional overlap between predecode and the MOP Cache quietly foreshadows the MOP Cache's eventual removal.
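The storage cost of predecode metadata is easy to estimate. The following sketch uses only the figures quoted in this article (a 32-bit instruction word, 36–40-bit predecoded entries, and a 64 KB L1 I-Cache); the helper names are invented for illustration, and how a real design accounts for the extra bits against nominal capacity is not modeled here.

```python
def predecode_overhead(instr_bits: int, entry_bits: int) -> float:
    """Fractional extra storage of a predecoded entry over the raw instruction."""
    return (entry_bits - instr_bits) / instr_bits


def instructions_in_cache(capacity_kib: int, instr_bytes: int = 4) -> int:
    """Number of fixed-length instructions covered by the nominal cache capacity."""
    return capacity_kib * 1024 // instr_bytes


if __name__ == "__main__":
    for bits in (36, 40):
        extra = predecode_overhead(32, bits)
        print(f"{bits}-bit entry: {extra:.1%} extra bits per instruction")
    print(f"64 KiB of AArch64 code = {instructions_in_cache(64)} instructions")
```

Under these assumptions the metadata adds 12.5% to 25% per instruction. The same function applied to the 76-bit format mentioned later in the article gives a much larger overhead, which hints at how much decode work later generations moved into predecode.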

📊 Generation-by-Generation: Peak and Retirement

Mapping MOP Cache presence and capacity across ARM core generations makes the trajectory clear: 2019–2021 was the peak, and 2022 onward marks a steady retreat.

| Core | Year | MOP Cache |
|---|---|---|
| Cortex-A76 | 2018 | None (fusion only) |
| Cortex-A77 | 2019 | 1.5K (first introduction) |
| Cortex-A78 | 2020 | 1.5K |
| Cortex-X1 | 2020 | 3K (2× expanded) |
| Neoverse V1 | 2021 | 3K (8 MOps/cycle) |
| Cortex-X3 | 2022 | Reduced to 1.5K |
| Cortex-A715 | 2022 | Removed |
| Cortex-X925 / A725 | 2024 | Removed (predecode integrated) |

Visualized as a bar chart, the rise to 3K in the X1 and Neoverse V1 generation — and the subsequent decline — is immediately apparent:

[Bar chart: MOP Cache capacity by generation. A77 (2019): 1.5K · X1 (2020): 3K · X3 (2022): 1.5K · X925 (2024): 0 (removed)]

🔍 Same Structure, Different Rationale — x86 µop Cache vs. ARM MOP Cache

The two structures look superficially similar, but the motivations driving their adoption are nearly opposite.

💼 x86 rationale — "Avoid the expensive decode entirely"

Detecting instruction boundaries in a variable-length stream, resolving operand modes, and splitting one instruction into up to four µops consumes substantial energy and silicon area every cycle. The µop cache exists to eliminate this heavy decoder path for hot code sequences — it is a structural necessity for competitive x86 performance.

🧠 ARM rationale — "Decode is cheap, but there are other gains"

① Mobile power reduction — bypassing even a lightweight decoder improves energy efficiency at the workload level. ② Fused MOp reuse — pairs of simple instructions fused into a single MOp are reused without re-fusing. ③ ~1-cycle front-end latency savings, accelerating branch-misprediction recovery. ④ Higher throughput — 4 instructions/cycle into 6 MOps/cycle increases issue bandwidth.
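Point ② (fused MOp reuse) can be illustrated with a small decoder sketch. The fusion pairs below are placeholders chosen for the example, not ARM's documented fusion rules, and the `MOp` class and `decode_with_fusion` function are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical fusible pairs, chosen only for illustration; real cores
# document their own implementation-specific fusion rules.
FUSIBLE_PAIRS = {("cmp", "b.cond"), ("adrp", "add")}


@dataclass(frozen=True)
class MOp:
    opcodes: tuple  # one opcode, or two when a pair has been fused

    @property
    def fused(self) -> bool:
        return len(self.opcodes) == 2


def decode_with_fusion(opcodes: list) -> list:
    """Greedily fuse adjacent fusible pairs; every other instruction maps 1:1."""
    mops = []
    i = 0
    while i < len(opcodes):
        pair = tuple(opcodes[i:i + 2])
        if pair in FUSIBLE_PAIRS:
            mops.append(MOp(pair))
            i += 2
        else:
            mops.append(MOp((opcodes[i],)))
            i += 1
    return mops


if __name__ == "__main__":
    stream = ["adrp", "add", "ldr", "cmp", "b.cond"]
    for mop in decode_with_fusion(stream):
        print("+".join(mop.opcodes), "(fused)" if mop.fused else "")
```

Here five instructions become three MOps. Caching the fused result means the pairing work is done once rather than on every pass through the loop, which is the reuse benefit the list describes.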

| Concept | ARM | Intel | AMD |
|---|---|---|---|
| Post-decode store | MOP Cache | µop Cache | Op Cache |
| Stored granularity | MOps (1:1.06) | µops (1–4 per insn) | µops |
| First introduced | 2019 | 2011 | 2017 |

The key architectural difference is the abstraction level. x86's cached µops are already close to final execution units: they continue through rename to the scheduler without a further expansion step. ARM's MOps are a higher-level intermediate form that is later expanded into µOps at the Rename stage. This means ARM's MOP Cache stores coarser-grained operations, and the decode-to-execution path still requires a Rename expansion step even on a cache hit. The trade-off: simpler cache entries (fewer bits per MOp), but an extra pipeline step before actual execution.
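The Rename-stage expansion described above can be sketched as a simple data transformation. This is a minimal illustration under stated assumptions: the `CachedMOp` type, the per-MOp µOp counts, and the example of a store splitting into two µOps are all hypothetical and are not taken from ARM documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedMOp:
    name: str
    uop_count: int  # 1 or 2, following the article's "one or two µOps" description


def rename_expand(mops: list) -> list:
    """Expand MOps into µOps; in this model the step runs even on a MOP Cache hit."""
    uops = []
    for mop in mops:
        if mop.uop_count not in (1, 2):
            raise ValueError(f"unexpected µOp count for {mop.name}")
        uops.extend(f"{mop.name}.u{i}" for i in range(mop.uop_count))
    return uops


if __name__ == "__main__":
    # Assumed example: an add stays one µOp, a store splits into two.
    cached = [CachedMOp("add", 1), CachedMOp("str", 2)]
    print(rename_expand(cached))
```

The point of the sketch is the shape of the pipeline rather than the specific counts: the cache holds the compact `CachedMOp` form, and the finer-grained µOps only exist after expansion.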

🔒 Security — When a Performance Structure Becomes an Attack Surface

Decoded instruction caches attracted academic attention as side-channel attack surfaces alongside their performance benefits. The paper "I See Dead µops" (ISCA 2021) demonstrated that Intel and AMD µop caches are vulnerable to Spectre-class side-channel attacks: timing differences introduced by µop cache behavior can be exploited to leak data across process boundaries. ARM's MOP Cache was not directly analyzed in that paper, but the structural similarity makes it a candidate for analogous attacks — a direction that subsequent research continues to explore (primary measurement data targeting ARM's MOP Cache specifically remains limited as of this writing).

🔄 Why It Was Removed — Predecode Does the Same Job, More Cheaply

According to Chips and Cheese's Cortex-X925 analysis, ARM removed the MOP Cache from the X925 (2024) and A725 (2024) for three converging reasons:

Predecode reduces decode cost to the point of redundancy — The I-Cache stores instructions in a 76-bit predecode format (32-bit instruction + 44-bit metadata) in the X925 generation, allowing MOps to be generated quickly without full decode. Running both predecode and the MOP Cache in parallel becomes engineering overhead without proportional gain.

Diminishing returns at lower clock frequencies — Efficiency cores like the A725 operate at lower frequencies where there is sufficient timing margin to decode without a bypass structure. Saving ~1 cycle matters far less when the clock period is longer.

Die area reallocation — The silicon occupied by 1.5K–3K MOP Cache entries is better invested in a larger L1 I-Cache or additional execution units, where ROI is higher in modern workloads.

In short, ARM matured its predecode subsystem to absorb what the MOP Cache was doing. This resolves the apparent paradox raised at the outset — the MOP Cache was not a sign that ARM's decode problem was fundamentally like x86's. It was a time-bounded optimal solution: its ROI was high during the competitive high-clock mobile core era, and it became redundant once predecode reached sufficient sophistication to cover the same use case at lower cost.

🍎 Apple Silicon — A Different Path, Same Destination

Apple's Firestorm high-performance core (M1) features an exceptionally large ~192 KB L1 I-Cache and an 8-wide decode front-end. Based on available public analysis, no separately named "MOP Cache" structure has been reported. Instead, Apple resolves front-end bottlenecks through a combination of a very large I-Cache and an oversized reorder buffer (ROB) with 600+ entries. This represents an alternative approach, in which sufficiently large buffers eliminate the need for a decode bypass cache, and it converges directionally with ARM's 2024 decision to retire the MOP Cache in favor of a stronger predecode foundation.

🎯 Takeaways

A MOp is an intermediate representation — sitting between the raw architectural instruction and the execution-level µOp. The mapping is nearly 1:1, but complex instructions split into two MOps and simple instruction pairs can fuse into one.

The MOP Cache is a serial sub-layer — not a parallel alternative to the I-Cache. It is populated after an I-Cache hit and decode, and bypasses the decode stage on subsequent accesses to the same code region.

Similar structure, opposite motivation — x86's µop cache exists to tame inherently expensive CISC decode. ARM's MOP Cache targeted power savings, fused MOp reuse, and front-end throughput improvement in a RISC context where decode is already cheap.

Peak 2019–2021, retired 2024: once predecode matured to overlap the MOP Cache's function, ARM reclaimed the die area for more productive uses.

Going forward, ARM front-end optimization is expected to evolve toward more sophisticated predecode metadata and improved branch prediction, rather than a decoded-op cache layer. By contrast, x86's µop cache is unlikely to disappear as long as the CISC complexity that necessitates it remains (though AMD Zen 5 is redefining the Op Cache's role as part of a broader front-end redesign). RISC-V high-performance implementations are also beginning to explore decoded instruction caches — suggesting that "caching decode output" will remain a common design question across ISA boundaries, even as the specific form that answer takes continues to evolve.

💬 One-line summary — A MOp is the intermediate representation ARM placed between an architectural instruction and a µOp; the MOP Cache stored those results to bypass decode on repeated execution. Once predecode matured to cover the same ground more cheaply, the MOP Cache became redundant — a time-bounded optimal solution, now quietly retired.

📎 References

• ARM Community Forum — MOps/µops definitions (Cortex-A Community Forum)

• Chips and Cheese — Cortex-X925 MOP Cache removal analysis

• WikiChip — ARM Cortex-A77 / A78 / Neoverse V1 Microarchitecture

• WikiChip — Intel Sandy Bridge µop Cache, AMD Zen Op Cache

• I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches (ISCA 2021)

• System on Chips — ARM Cortex-A78 MOPs/UOPs Instruction Fetch Pipeline

This content is a technical overview based on publicly available microarchitecture documentation and third-party analyses. Specific figures may vary across generations, implementations, and sources. Verify against primary vendor documentation before applying to design decisions.


This post is based on publicly available data and cited sources. Last updated: June 8, 2026.
