NVIDIA Vera CPU Deep Dive: Olympus Core Architecture, Specs, and the x86 Challenge
NVIDIA is moving beyond GPUs to reshape the data-center CPU landscape. Vera — packing 88 custom-designed Olympus cores, 1.2 TB/s memory bandwidth, CPU–GPU coherent coupling via NVLink-C2C, and native FP8 arithmetic — makes a direct challenge to Intel and AMD's x86 server dominance. This article consolidates Vera's published specifications, the internal microarchitecture of the Olympus core, early benchmark data, and a breakdown of who is actually buying in.
🧭 Four Questions This Article Addresses
This analysis breaks down NVIDIA's data-center CPU Vera across four dimensions: (1) published specifications and performance figures, (2) the instruction set architecture (ISA) it is based on, (3) the internal design of the microarchitecture, memory subsystem, and cache hierarchy, and (4) how it differs from incumbent x86 server CPUs and who its target customers are. The underlying question is straightforward: why did NVIDIA abandon off-the-shelf Arm cores to design its own, and does that decision pose a credible threat to Intel and AMD's server CPU stronghold?
🟡 One caveat up front — most publicly available data is based on NVIDIA's own claims from pre-production silicon and a limited early benchmark suite. Mass production is scheduled for H2 2026, so every performance figure in this article must be read with that constraint in mind.
🏭 Why NVIDIA Is Building a CPU
Historical Context
NVIDIA entered the data-center CPU market in 2022 with Grace, a design built around 72 Neoverse V2 cores, Arm's commercial core IP. The goal then was not to sell CPUs on their own merits but to deliver a tightly integrated Grace Hopper Superchip, with CPU and Hopper GPU bound together over NVLink-C2C, and thereby own the complete compute stack of the AI data center. The CPU was the control brain that made the GPU sell better.
Vera is the second generation of that strategy. The pivotal change: NVIDIA replaced the licensed Arm core with a fully in-house microarchitecture called "Olympus", its first custom core for a data-center CPU. This is not simply a refresh; it represents a deliberate pivot toward workloads specific to the agentic AI era: control, inference orchestration, and real-time session management. Designing a core from scratch around a target workload is a far larger strategic commitment than integrating a licensed one.
Vera's Place in the Ecosystem
Vera is not a standalone CPU. As a component of the Vera Rubin platform, it couples with the Rubin GPU over NVLink-C2C to form the Vera Rubin Superchip. That superchip in turn becomes the compute block of the NVL72 rack (72 GPUs + 36 CPUs). Vera's value, therefore, is only fully realized in combination with the GPU — its performance profile is designed around that coupling, not standalone throughput.
graph LR
A[Olympus Core<br/>88 cores · 176 threads] --> B[Vera CPU]
B -->|NVLink-C2C<br/>1.8 TB/s| C[Rubin GPU]
B --> D[Vera Rubin<br/>Superchip]
C --> D
D --> E[NVL72 Rack<br/>72 GPUs · 36 CPUs]
style A fill:#e8f8f5,stroke:#16a085
style B fill:#eaf2f8,stroke:#2980b9
style C fill:#fef9e7,stroke:#f39c12
style D fill:#eafaf1,stroke:#27ae60
style E fill:#f4ecf7,stroke:#8e44ad
🔗 Diagram summary: 88 Olympus cores form the Vera CPU, which binds to the Rubin GPU via NVLink-C2C (1.8 TB/s) to create the Superchip. That Superchip scales out into the NVL72 rack configuration (72 GPUs, 36 CPUs). Vera is a system component whose value is realized through GPU coupling, not as a standalone chip.
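To make the rack-level scale concrete, the counts above can be multiplied out. The sketch below uses only the figures quoted in this article (88 cores, 176 threads, 72 GPUs and 36 CPUs per rack); the function and constant names are illustrative, not NVIDIA terminology.

```python
# Counts quoted in the article for the Vera CPU and the NVL72 rack.
CORES_PER_CPU = 88
THREADS_PER_CPU = 176
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36


def rack_totals(cpus=CPUS_PER_RACK, gpus=GPUS_PER_RACK):
    """Return aggregate CPU-side resources for one rack."""
    return {
        "gpus_per_cpu": gpus / cpus,
        "cores": cpus * CORES_PER_CPU,
        "threads": cpus * THREADS_PER_CPU,
    }


totals = rack_totals()
for key, value in totals.items():
    print(f"{key}: {value}")
```

Under these counts, each Vera CPU is paired with two Rubin GPUs, and one rack exposes a little over three thousand Olympus cores to the orchestration layer.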
⚙️ ISA, Microarchitecture, and the Olympus Core
ISA: Arm v9.2-A, but a 100% custom microarchitecture
Vera executes the Arm v9.2-A instruction set — not x86. The critical distinction from Grace is that NVIDIA licenses only the ISA from Arm; the microarchitecture that executes those instructions is NVIDIA's own custom design, named Olympus. Think of the ISA as a language and the microarchitecture as the engine that runs it: NVIDIA borrows Arm's language while building the engine from scratch. This gives full ISA-level compatibility with the Arm software ecosystem while allowing NVIDIA to optimize every execution stage for its target workloads — a freedom that a licensed core does not offer.
Olympus Core Key Specifications
| Parameter | Details |
|---|---|
| ISA | Arm v9.2-A |
| Core Count | 88 |
| Thread Count | 176 (NVIDIA Spatial Multithreading) |
| Front End | 10-wide instruction fetch & decode |
| Branch Predictor | Neural Branch Predictor; resolves 2 taken branches per cycle |
| Vector Engine | 6 × 128-bit SVE2 |
| FP8 | Native in-core support |
| vs. Previous Generation | Replaces Grace's Arm Neoverse V2 entirely |
Three Defining Microarchitecture Decisions
(1) 10-wide front end — Olympus fetches and decodes 10 instructions per cycle, substantially wider than competing designs. A wider front end is the principal lever for raising single-thread IPC (instructions per clock); NVIDIA cites this as the basis for its claimed +50% IPC improvement over Grace. The decode-width comparison below illustrates the gap.
(2) Neural Branch Predictor — Rather than a conventional TAGE or perceptron-based predictor, Olympus adopts a neural architecture. The motivation is workloads with irregular control flow — AI inference kernels and agent orchestration loops — where branch mispredictions are frequent and the pipeline flush cost is high. Misprediction penalty scales with pipeline depth, so a more accurate predictor directly raises effective throughput for these workloads.
(3) Spatial Multithreading (Spatial MT): Conceptually similar to Hyper-Threading, but Olympus's variant statically partitions physical core resources between threads rather than sharing them dynamically. Eliminating inter-thread resource contention reduces latency variance. For an AI agent server handling hundreds of concurrent sessions, consistent, predictable response latency matters more than peak throughput that occasionally degrades, and Spatial MT targets exactly that operational profile.
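The branch-prediction argument in point (2) can be illustrated with the standard textbook stall model: effective CPI equals base CPI plus branch fraction × misprediction rate × flush penalty. The sketch below is a generic first-order model, not a description of Olympus; every input value (base IPC, branch fraction, penalty) is a hypothetical placeholder, since NVIDIA has not published these figures.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Effective IPC after adding branch-misprediction stall cycles to the base CPI."""
    base_cpi = 1.0 / base_ipc
    stall_cpi = branch_fraction * mispredict_rate * flush_penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


# Hypothetical inputs chosen for illustration only; not Olympus measurements.
BASE_IPC = 4.0          # IPC with perfect branch prediction
BRANCH_FRACTION = 0.20  # one instruction in five is a branch
PENALTY_CYCLES = 15     # cycles lost per pipeline flush

for rate in (0.05, 0.02, 0.01):
    ipc = effective_ipc(BASE_IPC, BRANCH_FRACTION, rate, PENALTY_CYCLES)
    print(f"mispredict rate {rate:.0%}: effective IPC {ipc:.2f}")
```

With these placeholder numbers, cutting the misprediction rate from 5% to 1% lifts effective IPC from 2.50 to about 3.57, which is why predictor accuracy matters more as the pipeline gets wider and deeper.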
🧠 Memory Subsystem, Cache Hierarchy, and Interconnects
Memory — LPDDR5x + SOCAMM
| Parameter | Vera | Grace (prev-gen) |
|---|---|---|
| Memory Type | LPDDR5x | LPDDR5 |
| Maximum Capacity | 1.5 TB | ~480 GB |
| Maximum Bandwidth | 1.2 TB/s | 546 GB/s |
| Bandwidth per Core | 13.6 GB/s | 7.6 GB/s |
| Packaging | SOCAMM (field-upgradeable) | On-board soldered |
The key innovation here is SOCAMM (Small Outline Compression-Attached Memory Module), a new module form factor. Grace's memory was soldered to the board, making capacity expansion impossible over the server's lifetime. SOCAMM makes memory field-upgradeable, directly reducing data-center TCO (total cost of ownership). The design goal — pairing LPDDR's low-power profile with server-class serviceability — is what SOCAMM enables.
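The per-core bandwidth row in the table above is simple division of peak bandwidth by core count. The sketch below reproduces it from the article's own figures (1.2 TB/s over 88 cores for Vera, 546 GB/s over 72 cores for Grace); it assumes decimal units, so 1.2 TB/s is taken as 1200 GB/s.

```python
def bandwidth_per_core_gbs(total_bandwidth_gbs, cores):
    """Peak memory bandwidth divided evenly across cores, in GB/s."""
    return total_bandwidth_gbs / cores


vera_per_core = bandwidth_per_core_gbs(1200, 88)
grace_per_core = bandwidth_per_core_gbs(546, 72)

print(f"Vera:  {vera_per_core:.1f} GB/s per core")
print(f"Grace: {grace_per_core:.1f} GB/s per core")
print(f"Ratio: {vera_per_core / grace_per_core:.2f}x")
```

The ratio comes out at roughly 1.8×, smaller than the 2.2× gain in total bandwidth because Vera also has more cores to feed.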
Memory Bandwidth vs. x86 Competitors
In AI inference, the most common bottleneck is memory bandwidth rather than compute throughput. Vera's 1.2 TB/s is 2.6–4× that of current x86 server platforms, a gap that translates directly into inference throughput for bandwidth-bound models.
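A rough way to see why bandwidth caps inference speed: during single-stream decoding, each generated token requires reading approximately all model weights once, so tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that first-order bound. The 70B-parameter FP8 model is a hypothetical example, the x86 bandwidths are simply back-derived from the 2.6–4× ratio above, and the model ignores KV-cache traffic, batching, and compute limits.

```python
def max_decode_tokens_per_s(bandwidth_gbs, params_billion, bytes_per_param):
    """Upper bound on batch-1 decode rate if every weight is read once per token."""
    model_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return bandwidth_gbs / model_gb


VERA_GBS = 1200.0
# x86 range implied by the article's 2.6-4x ratio (not measured values).
X86_LOW_GBS = VERA_GBS / 4.0
X86_HIGH_GBS = VERA_GBS / 2.6

PARAMS_B = 70        # hypothetical model size, billions of parameters
BYTES_PER_PARAM = 1  # FP8 weights

for name, bw in [("Vera", VERA_GBS), ("x86 (low)", X86_LOW_GBS), ("x86 (high)", X86_HIGH_GBS)]:
    rate = max_decode_tokens_per_s(bw, PARAMS_B, BYTES_PER_PARAM)
    print(f"{name:10s} {bw:7.1f} GB/s -> at most {rate:.1f} tokens/s")
```

Real throughput will sit below these ceilings, but the ordering between platforms follows the bandwidth ratio as long as the workload stays bandwidth-bound.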
Cache Hierarchy
| Level | Size | Notes |
|---|---|---|
| L2 | 2 MB/core | 2× vs. Grace |
| L3 | 164 MB (unified) | Shared across all 88 cores |
164 MB of unified L3 is substantial even by server CPU standards. It does not reach AMD EPYC Turin's 3D V-Cache configurations (up to 768 MB), but is comparable to or larger than standard EPYC configurations (128–256 MB L3) without stacked cache. A larger last-level cache absorbs more of the working set for inference and compiler workloads, reducing pressure on the memory subsystem and improving effective throughput where data reuse is possible.
SCF — 2nd-Gen Scalable Coherency Fabric
SCF is NVIDIA's proprietary on-chip interconnect that ties together the 88 cores, L3 cache, SOCAMM memory, I/O, and NVLink-C2C on a single compute die.
▶ Bidirectional bandwidth: 3.4 TB/s
▶ Deterministic latency under full load (coherency protocol integrated)
Rather than a conventional ring or mesh topology, SCF is designed with cache coherence and high aggregate bandwidth as first-class constraints, analogous to how NVLink treats GPU-to-GPU connectivity.
NVLink-C2C — The CPU–GPU Coupling That Matters Most
| Parameter | Vera | Grace |
|---|---|---|
| NVLink-C2C Bandwidth | 1.8 TB/s | 900 GB/s |
| vs. PCIe | 7× PCIe Gen 6 | — |
| Characteristics | Cache-coherent | — |
The defining property is not raw bandwidth but cache coherence. Because CPU and GPU share a unified memory address space with hardware-managed coherence, software can access GPU memory without explicit DMA transfers. This eliminates the copy-and-synchronize overhead that dominates AI agent loops where the CPU orchestrates inference requests and the GPU executes them: a tight feedback cycle in which handoff latency directly limits throughput.
Vera also adds PCIe Gen 6 (double the bandwidth of Gen 5) and CXL 3.1 support. CXL 3.1 (Compute Express Link) is an open coherent interconnect standard for CPUs, accelerators, and memory expanders. Its inclusion future-proofs the platform for multi-node memory pooling and heterogeneous accelerator integration — workloads that are increasingly relevant as inference infrastructure scales.
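The "7× PCIe Gen 6" and "double Gen 5" figures can be sanity-checked from per-lane signalling rates (32 GT/s for Gen 5, 64 GT/s for Gen 6). The sketch below assumes the comparison is against a x16 link counted bidirectionally and ignores encoding and protocol overhead; that baseline is an assumption, since the article does not state which link width NVIDIA used.

```python
# Raw per-lane signalling rate in GT/s for each PCIe generation.
PCIE_GTS_PER_LANE = {5: 32, 6: 64}


def pcie_bidirectional_gbs(gen, lanes=16):
    """Raw bidirectional bandwidth in GB/s, ignoring encoding/protocol overhead."""
    per_direction = PCIE_GTS_PER_LANE[gen] * lanes / 8  # bits -> bytes
    return 2 * per_direction


NVLINK_C2C_GBS = 1800.0  # Vera figure quoted in the article

gen5 = pcie_bidirectional_gbs(5)
gen6 = pcie_bidirectional_gbs(6)
print(f"PCIe Gen 5 x16: {gen5:.0f} GB/s")
print(f"PCIe Gen 6 x16: {gen6:.0f} GB/s ({gen6 / gen5:.0f}x Gen 5)")
print(f"NVLink-C2C vs Gen 6 x16: {NVLINK_C2C_GBS / gen6:.2f}x")
```

Under that assumption the ratio lands at about 7.0×, consistent with the table.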
📊 Performance Data
Generational Leap Over Grace
Ranked by percentage improvement over Grace, memory capacity (+200%) and bandwidth (+120%) show the largest gains — consistent with the design intent of targeting memory-bandwidth-bound AI inference workloads.
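The ranking can be recomputed from the Vera and Grace figures quoted in the tables of this article. The sketch below treats 1.5 TB as 1500 GB and Grace's "~480 GB" as exactly 480 GB, both of which are rounding assumptions; with them, the capacity gain works out to about +212%, slightly above the rounded +200% cited above.

```python
# (Grace, Vera) pairs taken from the tables in this article.
SPECS = {
    "memory capacity (GB)": (480, 1500),
    "memory bandwidth (GB/s)": (546, 1200),
    "NVLink-C2C bandwidth (GB/s)": (900, 1800),
    "core count": (72, 88),
}


def pct_gain(old, new):
    """Percentage improvement of new over old."""
    return (new / old - 1.0) * 100.0


# Sort metrics from largest to smallest generational gain.
gains = sorted(
    ((name, pct_gain(old, new)) for name, (old, new) in SPECS.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, gain in gains:
    print(f"{name:30s} +{gain:.0f}%")
```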
Against x86 Competition (Phoronix Early Benchmarks)
Phoronix published a limited test set using pre-production silicon, comparing Vera against Intel Xeon Granite Rapids 6980P (single- and dual-socket) and AMD EPYC Turin 9755, 9575F, and 9475F.
| Metric | Result |
|---|---|
| Geomean (overall) | +11% vs. best AMD EPYC config; +55.3% vs. Intel Xeon single-socket |
| vs. 128-core latest-gen x86 | ~1.5× advantage |
| STREAM Triad | 90% of rated peak achieved; 4× per-core bandwidth vs. x86 |
| Linux kernel build | 20 seconds (Phoronix all-time fastest) |
🔴 Critical caveat: NVIDIA hand-selected the benchmark workloads, and withheld both power consumption and operating frequency data. As a result, performance-per-watt comparisons are not possible at this stage. Fully independent benchmarks on production silicon will not be available until the H2 2026 release. These figures should be read with the explicit acknowledgment that the workload selection favors NVIDIA.
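For readers unfamiliar with the "geomean" row: benchmark suites usually summarize per-test speedup ratios with a geometric mean, because it treats a 2× win and a 2× loss as cancelling out, which an arithmetic mean does not. The sketch below shows the calculation on made-up ratios (they are not Phoronix data), and also converts the "90% of rated peak" STREAM result into GB/s using the 1.2 TB/s figure.

```python
from statistics import geometric_mean, mean

# Hypothetical per-benchmark speedup ratios (candidate / baseline); not real data.
speedups = [2.0, 0.5, 1.0]

gm = geometric_mean(speedups)
am = mean(speedups)
print(f"geometric mean:  {gm:.3f}")  # a 2x win and a 2x loss cancel
print(f"arithmetic mean: {am:.3f}")  # overstates the net result

# STREAM Triad: 90% of the 1.2 TB/s rated peak, in GB/s.
RATED_PEAK_GBS = 1200.0
stream_triad_gbs = 0.90 * RATED_PEAK_GBS
print(f"STREAM Triad at 90% of peak: {stream_triad_gbs:.0f} GB/s")
```

Because the geomean depends entirely on which tests are in the list, a vendor-selected workload set can shift it in either direction, which is the practical meaning of the caveat above.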
⚔️ Vera vs. Legacy x86: Where the Designs Diverge
| Parameter | x86 (Xeon / EPYC) | NVIDIA Vera |
|---|---|---|
| ISA | x86-64 | Arm v9.2-A |
| Design Target | General-purpose (server / cloud / HPC) | Optimized for AI agents & inference |
| Memory | DDR5 / select HBM configs | LPDDR5x (low-power, high-bandwidth) |
| Memory Packaging | DIMM | SOCAMM (integrated, field-upgradeable) |
| CPU–GPU Coupling | PCIe (non-coherent) | NVLink-C2C 1.8 TB/s (cache-coherent) |
| Multithreading | Hyper-Threading / SMT (dynamically shared resources) | Spatial MT (statically partitioned) |
| FP8 | Requires external accelerator | Native in-core support |
The most consequential difference is CPU–GPU coupling architecture. In x86 server platforms, the CPU connects to the GPU over PCIe — a bus that requires explicit DMA copies whenever data moves between CPU and GPU memory. Vera eliminates that overhead by sharing a single coherent address space over NVLink-C2C. The second critical differentiator is memory bandwidth strategy. Vera's 1.2 TB/s is 2.6–4× the bandwidth of current x86 platforms, creating a decisive advantage in the inference workloads that are memory-bandwidth-bound rather than compute-bound.
🎯 Who Is Buying Vera
🟢 Tier 1: AI Factories & Hyperscalers
▶ Confirmed deployment — Oracle Cloud Infrastructure (OCI), deploying hundreds of thousands of Vera CPUs beginning in 2026.
▶ Reported collaborations — Alibaba, Meta, ByteDance, CoreWeave, Lambda, Nebius, Nscale.
The common thread is large-scale agentic AI infrastructure — coding assistants, enterprise agents, and consumer chatbots running thousands of concurrent sessions. Vera's thread density and memory bandwidth are directly relevant to this operational pattern.
🟡 Tier 2: Frontier AI Research Labs
▶ Exploratory discussions reported — OpenAI, Anthropic, and others. Organizations whose inference infrastructure bottlenecks in CPU orchestration and memory management are precisely the deployment target Vera's design addresses.
💼 Tier 3: OEM Server Vendors
▶ Official partners — Dell, HPE, Lenovo, Supermicro. These OEMs will supply Vera-based server platforms to enterprise customers.
🔴 Not Targeted: General Enterprise & Legacy Workloads
Vera is not designed for general-purpose servers, Windows Server environments, or workloads with heavy dependency on legacy x86 binaries. A mature Arm binary and container ecosystem is a prerequisite — cloud-native deployments with containerized workloads are the assumed operating environment.
🧩 Takeaways — A Shift in CPU Design Philosophy
NVIDIA Vera represents not an incremental performance update but a deliberate shift in CPU design philosophy. It trades generality for targeted optimization of the specific bottlenecks of the agentic AI era: memory bandwidth, CPU–GPU coherency, and multithreaded latency predictability. The early benchmark leads — +55% over the latest Xeon generation and +11% over AMD's best EPYC configuration — are notable, but carry the caveat of NVIDIA-selected workloads and undisclosed power efficiency data.
🧠 One-line summary — Vera is not "a faster general-purpose CPU" but "the optimal control processor for a Rubin GPU." Its value should be measured not in standalone chip benchmarks but in the combined efficiency of the Vera Rubin Superchip.
Looking Ahead — Deployment Timeline
[Timeline: benchmarks released → independent validation → deployment ramps]
Scenario A (Optimistic) — OCI's large-scale deployment succeeds, and the Vera Rubin platform becomes the de facto standard for agentic AI infrastructure. NVIDIA establishes itself as the critical supplier for AI data-center compute at both the GPU and CPU layers.
Scenario B (Uncertain): x86 software ecosystem inertia, undisclosed power efficiency, and pricing dynamics become real friction in production deployments. Competition from AMD's x86 EPYC + MI350 pairing and from Arm server CPUs such as AWS Graviton and Google Axion, each with its own advantages, adds further uncertainty.
Practical Implications
✓ Procurement and investment decisions should wait for independent validation on production silicon in H2 2026. Current figures are pre-production and NVIDIA-curated.
✓ Arm ISA compatibility is a hard prerequisite — workloads with legacy x86 dependencies will incur containerization and recompile costs that must be factored into migration planning.
✓ CXL 3.1 support opens a forward path to multi-node memory pooling and heterogeneous accelerator integration — relevant to teams planning infrastructure beyond the 2026 horizon.
※ Data still pending: production-silicon power efficiency, unit pricing, and direct benchmark comparisons against competing Arm server CPUs (Graviton, Axion, Cobalt) remain undisclosed at this time and require post-launch independent evaluation.
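As a small starting point for the Arm-compatibility audit mentioned above, a deployment script can check which architecture it is running on before pulling architecture-specific binaries. The sketch below uses Python's standard `platform.machine()`; the name sets and the `classify_machine` helper are illustrative choices, and real migrations would also need to inspect container image manifests and native dependencies.

```python
import platform

# Common machine strings reported by operating systems for each architecture.
ARM64_NAMES = {"aarch64", "arm64"}
X86_64_NAMES = {"x86_64", "amd64"}


def classify_machine(machine):
    """Map a platform.machine() string to a coarse architecture label."""
    name = machine.lower()
    if name in ARM64_NAMES:
        return "arm64"
    if name in X86_64_NAMES:
        return "x86-64"
    return "other"


if __name__ == "__main__":
    arch = classify_machine(platform.machine())
    print(f"Detected architecture: {arch}")
    if arch != "arm64":
        print("Binaries built here will need an arm64 rebuild or cross-compile for Vera.")
```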
📚 Sources: NVIDIA Vera CPU official page · NVIDIA Technical Blog (Agentic Workloads / Performance) · Phoronix Vera Benchmarks · Tom's Hardware (Vera vs. EPYC/Xeon) · ServeTheHome (Vera in Detail) · VideoCardz (Vera Rubin NVL72)
This content is provided for informational purposes based on publicly available technical materials and early benchmark data. It does not constitute a recommendation to purchase any product or make any investment. Performance figures cited are based primarily on pre-production silicon and vendor-disclosed benchmarks; actual performance, power efficiency, and pricing of production units may differ. Readers are solely responsible for any procurement or investment decisions.
I compile and verify materials from a semiconductor and SoC design perspective before publishing.
This article is based on publicly available data and sources. Last updated: June 8, 2026