NVIDIA Vera CPU: Architecture Deep Dive — Olympus Core Specs, Design, and Market Position

NVIDIA is moving beyond GPUs to reshape the data-center CPU landscape. Vera packs 88 custom-designed Olympus cores, 1.2 TB/s of memory bandwidth, coherent CPU–GPU coupling via NVLink-C2C, and native FP8 arithmetic, and it mounts a direct challenge to Intel's and AMD's x86 server dominance. This article consolidates Vera's published specifications, the internal microarchitecture of the Olympus core, early benchmark data, and a breakdown of who is actually buying in.

🧭 Four Questions This Article Addresses

This analysis breaks down NVIDIA's data-center CPU Vera across four dimensions: (1) published specifications and performance figures, (2) the processor ISA (instruction set architecture) it is based on, (3) the internal design of the microarchitecture, memory subsystem, and cache hierarchy, and (4) how it differentiates from incumbent x86 server CPUs and who its target customers are. The underlying question is straightforward — "Why did NVIDIA abandon off-the-shelf ARM cores to design its own, and does that decision pose a credible threat to Intel and AMD's server CPU stronghold?"

🟡 One caveat up front — most publicly available data is based on NVIDIA's own claims from pre-production silicon and a limited early benchmark suite. Mass production is scheduled for H2 2026, so every performance figure in this article must be read with that constraint in mind.

🏭 Why NVIDIA Is Building a CPU

Historical Context

NVIDIA entered the data-center CPU market in 2022 with Grace, a design built around 72 cores of ARM's commercial IP — the Neoverse V2. The goal then was not to sell CPUs on their own merits, but to deliver a tightly integrated Grace Hopper Superchip — CPU and Hopper GPU bound together over NVLink-C2C — and thereby own the complete compute stack of the AI data center. The CPU was the control brain that made the GPU sell better.

Vera is the second generation of that strategy. The pivotal change: NVIDIA replaced the licensed ARM core with a fully in-house microarchitecture called "Olympus" — the company's first custom CPU core. This is not simply a refresh; it represents a deliberate pivot toward workloads specific to the agentic AI era: control, inference orchestration, and real-time session management. The difference between integrating a licensed core and designing one from scratch around your target workload is a strategic commitment of an entirely different order.

Vera's Place in the Ecosystem

Vera is not a standalone CPU. As a component of the Vera Rubin platform, it couples with the Rubin GPU over NVLink-C2C to form the Vera Rubin Superchip. That superchip in turn becomes the compute block of the NVL72 rack (72 GPUs + 36 CPUs). Vera's value, therefore, is only fully realized in combination with the GPU — its performance profile is designed around that coupling, not standalone throughput.


graph LR
  A["Olympus Core<br/>88 cores · 176 threads"] --> B["Vera CPU"]
  B -->|"NVLink-C2C<br/>1.8 TB/s"| C["Rubin GPU"]
  B --> D["Vera Rubin<br/>Superchip"]
  C --> D
  D --> E["NVL72 Rack<br/>72 GPUs · 36 CPUs"]
  style A fill:#e8f8f5,stroke:#16a085
  style B fill:#eaf2f8,stroke:#2980b9
  style C fill:#fef9e7,stroke:#f39c12
  style D fill:#eafaf1,stroke:#27ae60
  style E fill:#f4ecf7,stroke:#8e44ad

🔗 Diagram summary: 88 Olympus cores form the Vera CPU, which binds to the Rubin GPU via NVLink-C2C (1.8 TB/s) to create the Superchip. That Superchip scales out into the NVL72 rack configuration (72 GPUs, 36 CPUs). Vera is a system component whose value is realized through GPU coupling, not as a standalone chip.
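
To make the scale-out concrete, the per-CPU figures quoted in this article can be multiplied up to rack level. This is only a back-of-the-envelope sketch: the class names are hypothetical, and it assumes every one of the 36 CPUs is populated with the maximum 1.5 TB of memory, which real configurations may not be.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VeraCpu:
    # Per-CPU figures quoted in this article.
    cores: int = 88
    threads: int = 176
    memory_tb: float = 1.5      # maximum capacity (assumed fully populated)
    memory_bw_gbs: int = 1200   # 1.2 TB/s


@dataclass(frozen=True)
class Nvl72Rack:
    # Rack composition from the diagram: 72 GPUs and 36 CPUs.
    cpus: int = 36
    gpus: int = 72
    cpu: VeraCpu = field(default_factory=VeraCpu)

    def totals(self) -> dict:
        """Aggregate CPU-side resources across the whole rack."""
        return {
            "gpus_per_cpu": self.gpus / self.cpus,
            "cpu_cores": self.cpus * self.cpu.cores,
            "cpu_threads": self.cpus * self.cpu.threads,
            "cpu_memory_tb": self.cpus * self.cpu.memory_tb,
            "cpu_memory_bw_tbs": self.cpus * self.cpu.memory_bw_gbs / 1000,
        }


for name, value in Nvl72Rack().totals().items():
    print(f"{name}: {value}")
```

Under these assumptions each Vera serves two Rubin GPUs, and the CPU side of one rack adds up to 3,168 cores and 54 TB of LPDDR5x.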

⚙️ ISA, Microarchitecture, and the Olympus Core

ISA: Arm v9.2-A, but a 100% custom microarchitecture

Vera executes the Arm v9.2-A instruction set — not x86. The critical distinction from Grace is that NVIDIA licenses only the ISA from Arm; the microarchitecture that executes those instructions is NVIDIA's own custom design, named Olympus. Think of the ISA as a language and the microarchitecture as the engine that runs it: NVIDIA borrows Arm's language while building the engine from scratch. This gives full ISA-level compatibility with the Arm software ecosystem while allowing NVIDIA to optimize every execution stage for its target workloads — a freedom that a licensed core does not offer.
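
Because compatibility sits at the ISA level, software can check for the features it needs rather than for a specific core. The sketch below parses the `Features` line that Linux exposes in `/proc/cpuinfo` on AArch64 and reports which required flags are absent. The function names are hypothetical, the `sve` and `sve2` strings are the Linux kernel's hwcap names, and what a production Vera system actually reports is an assumption until hardware is available.

```python
from pathlib import Path


def parse_cpu_features(cpuinfo_text):
    """Return the flags from the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, required=("sve", "sve2")):
    """List the required flags that the CPU does not advertise."""
    present = parse_cpu_features(cpuinfo_text)
    return [name for name in required if name not in present]


# On a non-Arm or non-Linux host there is no 'Features' line, so every
# required flag is reported as missing.
cpuinfo = Path("/proc/cpuinfo")
text = cpuinfo.read_text() if cpuinfo.exists() else ""
print("missing features:", missing_features(text))
```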

Olympus Core Key Specifications

Parameter	Details
ISA	Arm v9.2-A
Core Count	88
Thread Count	176 (NVIDIA Spatial Multithreading)
Front End	10-wide instruction fetch & decode
Branch Predictor	Neural Branch Predictor; resolves 2 taken branches per cycle
Vector Engine	6 × 128-bit SVE2
FP8	Native in-core support
vs. Previous Generation	Replaces Grace's ARM Neoverse V2 entirely
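
The vector engine row translates into a simple upper bound on elements processed per cycle. The sketch below assumes all six 128-bit pipes can issue the same operation every cycle and that FP8 runs on those same pipes; neither assumption is confirmed by the published material, and clock frequency is undisclosed, so this yields lanes per cycle rather than FLOPS.

```python
VECTOR_PIPES = 6          # from the spec table: 6 x 128-bit SVE2
VECTOR_WIDTH_BITS = 128


def elements_per_cycle(element_bits, pipes=VECTOR_PIPES, width_bits=VECTOR_WIDTH_BITS):
    """Upper bound on vector elements processed per cycle per core."""
    return pipes * (width_bits // element_bits)


for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16), ("FP8", 8)]:
    print(f"{name}: up to {elements_per_cycle(bits)} elements per cycle per core")
```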

Three Defining Microarchitecture Decisions

(1) 10-wide front end: Olympus fetches and decodes 10 instructions per cycle, substantially wider than competing designs. Front-end width is one of the main levers for raising single-thread IPC (instructions per clock), and NVIDIA cites it as the basis for its claimed +50% IPC improvement over Grace. The decode-width comparison below illustrates the gap.

Design	Decode Width
NVIDIA Olympus	10-wide
Intel Sapphire Rapids	6-wide
AMD Zen 4	4-wide

(2) Neural Branch Predictor: NVIDIA describes the Olympus predictor as neural, in contrast to conventional table-based designs such as TAGE. The motivation is workloads with irregular control flow, such as AI inference kernels and agent orchestration loops, where branch mispredictions are frequent and the pipeline flush cost is high. Each misprediction discards in-flight work, and the penalty grows with pipeline depth, so a more accurate predictor directly raises effective throughput for these workloads.
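
The link between predictor accuracy and throughput follows from the standard textbook CPI model, in which mispredictions add stall cycles on top of a base CPI. The sketch below uses that model with entirely hypothetical numbers; none of the rates or penalties are published Olympus figures.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average stall cycles per instruction from mispredictions."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def speedup_from_better_prediction(base_cpi, branch_fraction, old_rate, new_rate, penalty_cycles):
    """Throughput ratio when only the misprediction rate changes."""
    old = effective_cpi(base_cpi, branch_fraction, old_rate, penalty_cycles)
    new = effective_cpi(base_cpi, branch_fraction, new_rate, penalty_cycles)
    return old / new


# Hypothetical workload: 20% branches, 15-cycle flush, base CPI of 0.25.
gain = speedup_from_better_prediction(0.25, 0.20, old_rate=0.05, new_rate=0.02, penalty_cycles=15)
print(f"speedup from cutting mispredictions 5% -> 2%: {gain:.2f}x")
```

With a low base CPI, as a wide core aims for, the misprediction term makes up a larger share of the total, which is why prediction accuracy matters more as the front end gets wider.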

(3) Spatial Multithreading: Like Intel's Hyper-Threading and other simultaneous multithreading (SMT) designs, it runs two hardware threads per core (88 cores, 176 threads). The difference is that Olympus statically partitions physical core resources between the threads rather than sharing them dynamically. Eliminating inter-thread resource contention reduces latency variance. For an AI agent server handling hundreds of concurrent sessions, consistent, predictable response latency matters more than peak throughput that occasionally degrades, and Spatial Multithreading targets exactly that operational profile.

🧠 Memory Subsystem, Cache Hierarchy, and Interconnects

Memory — LPDDR5x + SOCAMM

Parameter	Vera	Grace (prev-gen)
Memory Type	LPDDR5x	LPDDR5x
Maximum Capacity	1.5 TB	~480 GB
Maximum Bandwidth	1.2 TB/s	546 GB/s
Bandwidth per Core	13.6 GB/s	7.6 GB/s
Packaging	SOCAMM (field-upgradeable)	On-board soldered

The key innovation here is SOCAMM (Small Outline Compression-Attached Memory Module), a new module form factor. Grace's memory was soldered to the board, making capacity expansion impossible over the server's lifetime. SOCAMM makes memory field-upgradeable, directly reducing data-center TCO (total cost of ownership). The design goal — pairing LPDDR's low-power profile with server-class serviceability — is what SOCAMM enables.

Memory Bandwidth vs. x86 Competitors

In AI inference, the most common bottleneck is not compute throughput but memory bandwidth. Vera's 1.2 TB/s is 2.6–4× the bandwidth of current x86 server platforms, a gap that translates directly into inference throughput for bandwidth-bound models.

Platform	Memory Bandwidth
NVIDIA Vera	1.2 TB/s
Grace (prev-gen)	546 GB/s
AMD EPYC 9755	~460 GB/s
Intel Xeon 6980P	~307 GB/s
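
The ratios and per-core figures quoted in this section can be reproduced from the bandwidth numbers above. The last function adds a common rule of thumb for bandwidth-bound decoding: if every generated token has to stream the full set of weights from memory once, tokens per second cannot exceed bandwidth divided by model size. That bound and the 16 GB model size are illustrative assumptions, not measurements.

```python
# Bandwidth figures (GB/s) and core counts as quoted in this article.
BANDWIDTH_GBS = {
    "NVIDIA Vera": 1200,
    "Grace": 546,
    "AMD EPYC 9755": 460,
    "Intel Xeon 6980P": 307,
}
CORES = {"NVIDIA Vera": 88, "Grace": 72}


def vera_ratio(platform):
    """How many times Vera's bandwidth exceeds the given platform's."""
    return BANDWIDTH_GBS["NVIDIA Vera"] / BANDWIDTH_GBS[platform]


def per_core_gbs(platform):
    """Memory bandwidth available per core."""
    return BANDWIDTH_GBS[platform] / CORES[platform]


def max_decode_tokens_per_s(bandwidth_gbs, model_gb):
    """Upper bound when each token reads all weights from memory once."""
    return bandwidth_gbs / model_gb


for platform in ("AMD EPYC 9755", "Intel Xeon 6980P"):
    print(f"Vera vs. {platform}: {vera_ratio(platform):.1f}x")
for platform in CORES:
    print(f"{platform}: {per_core_gbs(platform):.1f} GB/s per core")

MODEL_GB = 16  # hypothetical model size
for platform, bw in BANDWIDTH_GBS.items():
    print(f"{platform}: at most {max_decode_tokens_per_s(bw, MODEL_GB):.1f} tokens/s")
```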

Cache Hierarchy

Level	Size	Notes
L2	2 MB/core	2× vs. Grace
L3	164 MB (unified)	Shared across all 88 cores

164 MB of unified L3 is substantial even by server CPU standards. It does not reach AMD EPYC Turin's 3D V-Cache configurations (up to 768 MB), but is comparable to or larger than standard EPYC configurations (128–256 MB L3) without stacked cache. A larger last-level cache absorbs more of the working set for inference and compiler workloads, reducing pressure on the memory subsystem and improving effective throughput where data reuse is possible.

SCF — 2nd-Gen Scalable Coherency Fabric

SCF is NVIDIA's proprietary on-chip interconnect that ties together the 88 cores, L3 cache, SOCAMM memory, I/O, and NVLink-C2C on a single compute die.

▶ Bidirectional bandwidth: 3.4 TB/s
▶ Deterministic latency under full load (coherency protocol integrated)
Rather than a conventional ring or mesh topology, SCF is designed with cache coherence and high aggregate bandwidth as first-class constraints, analogous to how NVLink treats GPU-to-GPU connectivity.

NVLink-C2C — The CPU–GPU Coupling That Matters Most

Parameter	Vera	Grace
NVLink-C2C Bandwidth	1.8 TB/s	900 GB/s
vs. PCIe	7× PCIe Gen 6	—
Characteristics	Cache-coherent	—

The defining property is not raw bandwidth but cache coherence. Because CPU and GPU share a unified memory address space with hardware-managed coherence, software can access GPU memory without explicit DMA transfers. This removes the copy-and-synchronize overhead that dominates AI agent loops, where the CPU orchestrates inference requests and the GPU executes them. In such a tight feedback cycle, lower handoff latency translates directly into higher throughput.
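
A rough cost model shows what explicit copies can add up to in an orchestration loop. The sketch below treats each handoff as a fixed setup cost plus bytes divided by link bandwidth, and uses the PCIe figure implied by the table's 7× ratio. The payload size, handoff rate, and setup cost are hypothetical, and the model ignores everything else a real DMA path involves. A coherent link still moves the data on access, but it avoids staging a separate copy and the synchronization around it.

```python
NVLINK_C2C_GBS = 1800.0               # 1.8 TB/s, from the table above
PCIE_GEN6_GBS = NVLINK_C2C_GBS / 7    # implied by the table's "7x PCIe Gen 6" row


def handoff_cost_us(payload_bytes, link_gbs, setup_us=0.0):
    """Time for one explicit copy: fixed setup cost plus bytes / bandwidth."""
    transfer_us = payload_bytes / (link_gbs * 1e9) * 1e6
    return setup_us + transfer_us


def copy_time_share(payload_bytes, link_gbs, handoffs_per_second, setup_us=0.0):
    """Fraction of each second spent purely on explicit copies."""
    return handoff_cost_us(payload_bytes, link_gbs, setup_us) * handoffs_per_second / 1e6


payload = 4 * 1024 * 1024      # hypothetical 4 MiB handed from CPU to GPU per step
for setup_us in (0.0, 10.0):   # hypothetical DMA setup and sync cost per copy
    share = copy_time_share(payload, PCIE_GEN6_GBS, handoffs_per_second=10_000, setup_us=setup_us)
    print(f"setup={setup_us:>4} us -> {share:.1%} of each second spent copying over PCIe")
```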

Vera also adds PCIe Gen 6 (double the bandwidth of Gen 5) and CXL 3.1 support. CXL 3.1 (Compute Express Link) is an open coherent interconnect standard for CPUs, accelerators, and memory expanders. Its inclusion future-proofs the platform for multi-node memory pooling and heterogeneous accelerator integration — workloads that are increasingly relevant as inference infrastructure scales.

📊 Performance Data

Generational Leap Over Grace

Ranked by percentage improvement over Grace, memory capacity (+200%) and bandwidth (+120%) show the largest gains — consistent with the design intent of targeting memory-bandwidth-bound AI inference workloads.

Metric	Improvement vs. Grace
Memory Capacity (3×)	+200%
Memory Bandwidth	+120%
NVLink Bandwidth (2×)	+100%
Phoronix Geomean	~+60%
IPC	+50%

Against x86 Competition (Phoronix Early Benchmarks)

Phoronix published a limited test set using pre-production silicon, comparing Vera against Intel Xeon Granite Rapids 6980P (single- and dual-socket) and AMD EPYC Turin 9755, 9575F, and 9475F.

Metric	Result
Geomean (overall)	+11% vs. best AMD EPYC config; +55.3% vs. Intel Xeon single-socket
vs. 128-core latest-gen x86	~1.5× advantage
STREAM Triad	90% of rated peak achieved; 4× per-core bandwidth vs. x86
Linux kernel build	20 seconds (Phoronix all-time fastest)
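
For readers unfamiliar with the STREAM Triad row, the kernel and its bandwidth accounting are simple enough to sketch. Triad computes `a[i] = b[i] + scalar * c[i]` and counts three 8-byte accesses per element (two reads, one write). The pure-Python loop below is bound by the interpreter rather than by memory, so the number it prints only illustrates the accounting; real STREAM runs use compiled, multithreaded code. The function names are hypothetical.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, element_bytes=8):
    """Bytes STREAM counts per Triad pass: two reads and one write per element."""
    return 3 * n * element_bytes


def fraction_of_peak(measured_gbs, peak_gbs):
    """Share of the rated peak bandwidth that a measurement reaches."""
    return measured_gbs / peak_gbs


n = 100_000
a = array("d", [0.0]) * n
b = array("d", [1.0]) * n
c = array("d", [2.0]) * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

measured_gbs = triad_bytes(n) / elapsed / 1e9
print(f"interpreter-bound Triad: {measured_gbs:.3f} GB/s")
print(f"90% of Vera's rated 1200 GB/s: {0.9 * 1200:.0f} GB/s")
```

Read against the table, reaching 90% of the rated 1.2 TB/s peak implies a sustained Triad rate of roughly 1,080 GB/s.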

🔴 Critical caveat: NVIDIA hand-selected the benchmark workloads and withheld both power consumption and operating frequency data. As a result, performance-per-watt comparisons are not possible at this stage. Fully independent benchmarks on production silicon will not be available until the H2 2026 release, so these figures should be read with the explicit understanding that the workload selection favors NVIDIA.

⚔️ Vera vs. Legacy x86: Where the Designs Diverge

Parameter	x86 (Xeon / EPYC)	NVIDIA Vera
ISA	x86-64	Arm v9.2-A
Design Target	General-purpose (server / cloud / HPC)	Optimized for AI agents & inference
Memory	DDR5 / select HBM configs	LPDDR5x (low-power, high-bandwidth)
Memory Packaging	DIMM	SOCAMM (integrated, field-upgradeable)
CPU–GPU Coupling	PCIe (non-coherent)	NVLink-C2C 1.8 TB/s (cache-coherent)
Multithreading	SMT / Hyper-Threading (dynamically shared resources)	Spatial MT (statically partitioned)
FP8	Requires external accelerator	Native in-core support

The most consequential difference is CPU–GPU coupling architecture. In x86 server platforms, the CPU connects to the GPU over PCIe, a bus that requires explicit DMA copies whenever data moves between CPU and GPU memory. Vera eliminates that overhead by sharing a single coherent address space over NVLink-C2C.

The second critical differentiator is memory bandwidth strategy. Vera's 1.2 TB/s is 2.6–4× the bandwidth of current x86 platforms, creating a decisive advantage in the inference workloads that are memory-bandwidth-bound rather than compute-bound.

🎯 Who Is Buying Vera

🟢 Tier 1: AI Factories & Hyperscalers

▶ Confirmed deployment — Oracle Cloud Infrastructure (OCI), deploying hundreds of thousands of Vera CPUs beginning in 2026.
▶ Reported collaborations — Alibaba, Meta, ByteDance, CoreWeave, Lambda, Nebius, Nscale.
The common thread is large-scale agentic AI infrastructure — coding assistants, enterprise agents, and consumer chatbots running thousands of concurrent sessions. Vera's thread density and memory bandwidth are directly relevant to this operational pattern.

🟡 Tier 2: Frontier AI Research Labs

▶ Exploratory discussions reported — OpenAI, Anthropic, and others. Organizations whose inference infrastructure bottlenecks in CPU orchestration and memory management are precisely the deployment target Vera's design addresses.

💼 Tier 3: OEM Server Vendors

▶ Official partners — Dell, HPE, Lenovo, Supermicro. These OEMs will supply Vera-based server platforms to enterprise customers.

🔴 Not Targeted: General Enterprise & Legacy Workloads

Vera is not designed for general-purpose servers, Windows Server environments, or workloads with heavy dependency on legacy x86 binaries. A mature Arm binary and container ecosystem is a prerequisite — cloud-native deployments with containerized workloads are the assumed operating environment.

🧩 Takeaways — A Shift in CPU Design Philosophy

NVIDIA Vera represents not an incremental performance update but a deliberate shift in CPU design philosophy. It trades generality for targeted optimization of the specific bottlenecks of the agentic AI era: memory bandwidth, CPU–GPU coherency, and multithreaded latency predictability. The early benchmark leads — +55% over the latest Xeon generation and +11% over AMD's best EPYC configuration — are notable, but carry the caveat of NVIDIA-selected workloads and undisclosed power efficiency data.

🧠 One-line summary — Vera is not "a faster general-purpose CPU" but "the optimal control processor for a Rubin GPU." Its value should be measured not in standalone chip benchmarks but in the combined efficiency of the Vera Rubin Superchip.

Looking Ahead — Deployment Timeline

Timeframe	Milestone
Now	Pre-production benchmarks released
H2 2026	Production silicon; independent validation
2027+	OCI full-scale deployment ramps

Scenario A (Optimistic) — OCI's large-scale deployment succeeds, and the Vera Rubin platform becomes the de facto standard for agentic AI infrastructure. NVIDIA establishes itself as the critical supplier for AI data-center compute at both the GPU and CPU layers.

Scenario B (Uncertain): x86 software ecosystem inertia, undisclosed power efficiency, and pricing dynamics become real friction in production deployments. Competition adds further uncertainty, both from AMD's EPYC + MI350 pairing on the x86 side and from Arm server CPUs such as AWS Graviton and Google Axion, each with its own advantages.

Practical Implications

✓ Procurement and investment decisions should wait for independent validation on production silicon in H2 2026. Current figures are pre-production and NVIDIA-curated.
✓ Arm ISA compatibility is a hard prerequisite — workloads with legacy x86 dependencies will incur containerization and recompile costs that must be factored into migration planning.
✓ CXL 3.1 support opens a forward path to multi-node memory pooling and heterogeneous accelerator integration — relevant to teams planning infrastructure beyond the 2026 horizon.
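
As a starting point for the migration audit mentioned above, a script can flag which deployed binaries are x86-64 and would need recompiling or replacing. The sketch below reads the ELF header's `e_machine` field (62 for x86-64, 183 for AArch64, per the ELF specification). The function names are hypothetical, and a real audit would also need to cover container images, JIT runtimes, and vendored shared libraries.

```python
import struct

EM_X86_64 = 62     # ELF e_machine value for x86-64
EM_AARCH64 = 183   # ELF e_machine value for AArch64
_NAMES = {EM_X86_64: "x86-64", EM_AARCH64: "aarch64"}


def elf_machine(header):
    """Return the architecture name for an ELF header, or None if not ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # Byte 5 of e_ident gives the data encoding: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return _NAMES.get(machine, f"other ({machine})")


def needs_porting(path):
    """True if the file at `path` is an x86-64 ELF binary."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == "x86-64"


# Synthetic 20-byte headers: 16-byte e_ident, then e_type and e_machine.
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
print(elf_machine(ident + struct.pack("<HH", 2, EM_X86_64)))
print(elf_machine(ident + struct.pack("<HH", 2, EM_AARCH64)))
```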

※ Data still pending: production-silicon power efficiency, unit pricing, and direct benchmark comparisons against competing Arm server CPUs (Graviton, Axion, Cobalt) remain undisclosed at this time and require post-launch independent evaluation.

📚 Sources: NVIDIA Vera CPU official page · NVIDIA Technical Blog (Agentic Workloads / Performance) · Phoronix Vera Benchmarks · Tom's Hardware (Vera vs. EPYC/Xeon) · ServeTheHome (Vera in Detail) · VideoCardz (Vera Rubin NVL72)

This content is provided for informational purposes based on publicly available technical materials and early benchmark data. It does not constitute a recommendation to purchase any product or make any investment. Performance figures cited are based primarily on pre-production silicon and vendor-disclosed benchmarks; actual performance, power efficiency, and pricing of production units may differ. Readers are solely responsible for any procurement or investment decisions.


Semiconductor & SoC Design Notes

I compile and verify materials from a semiconductor and SoC design perspective before publishing.


This article is based on publicly available data and sources. Last updated: June 8, 2026
