Vera Rubin – Extreme Co-Design: An Evolution from Grace Blackwell Oberon

SemiAnalysis 5 信息等级 5 发布：2026-02-25T18:40 抓取：2026-05-03 03:22

AI 半导体算力动态

摘要

英伟达在CES 2026正式发布Vera Rubin平台全系6款芯片产品，涵盖GPU、CPU及网络交换芯片。该平台搭载第二代机架级Oberon架构（VR NVL72），采用模块化设计以提升集成效率与算力吞吐，进一步扩展其在AI服务器全栈硬件领域的布局。

客观事实

英伟达于CES 2026正式发布Vera Rubin平台6款芯片产品。
VR NVL72采用第二代机架级Oberon架构与模块化设计。
该平台产品矩阵涵盖GPU、CPU、NVLink交换机及多款网络芯片。

英伟达 CES 2026 Vera Rubin平台 Rubin GPU Vera CPU NVLink 6 Switch ConnectX-9 BlueField-4 Spectrum-6 VR NVL72 Oberon架构 AMD Google Trainium 3 MI450X Helios Racks TPU Grace Blackwell

原文

At CES 2026, Nvidia officially announced in detail all 6 Rubin platform products: the Rubin GPU, Vera CPU, NVLink 6 Switch, ConnectX-9, BlueField-4, and Spectrum-6. VR NVL72 is the second generation of Nvidia’s rack scale Oberon architecture that takes the stage. With competition catching up on rack scale game, Trainium 3 in the Gen2 UltraServer, AMD MI450X Helios Racks, and Google’s TPU which was at rack scale even before GB200, Nvidia answers with “extreme co-design” supremacy. With extreme co-design, Nvidia takes rack scale integration to the next level. Rack system becomes a unit of compute, a single distributed accelerator, and Nvidia designs the system.
Source: NvidiaFor the Vera Rubin platform, Nvidia is asserting even more control over the system and rack level design. Rack scale integration and assembly have become more challenging, as every component is being pushed to the limit, whilst also optimizing for cost efficiency. VR NVL72 has a much more holistic design with a modular approach compared to Grace Blackwell for the purpose of integration efficiency and throughput.
Nvidia’s competitiveness strengthens with its extreme co-design supremacy. It is the only player with the best in class or close to the best in class silicon product offerings for all the major silicon contents in an Nvidia trail-blazed AI server system design. Nvidia offers the best accelerator, a SOTA scale up switch, the best NIC, and one of the best Ethernet networking switch, and a much improved purpose-designed CPU. No other competitors have such a complete suite of integrated silicon products.
In the sections below, we will discuss the 6 silicon products of the Vera Rubin platform at the silicon level. Then, we will discuss the rack and compute tray evolution from Grace Blackwell to Vera Rubin from the design perspective and the implication to components: cables, connectors, PCB, thermal, mechanical, and power.
Next, we will discuss the major networks of the VR NVL72 system, namely the scale up NVLink 6 network and the backend scale out network. We will discuss the logistical implications of much more limited hyperscaler customisation and the assembly supplier landscape.
Lastly, the report ends with a discussion on the TCO of the VR NVL72 system as well as the BoM and Power Budget estimate supporting the TCO analysis. Behind the paywall, we also provide readers with insight into Nvidia’s plans for their Groq IP. We will also cover some of the challenges with regards to HBM ramp for Micron, SK Hynix, and Samsung.
Today we are also launching the VR NVL72 Component BoM and Power Budget Model. It provides a system-level bill of materials and power budget analysis for the VR NVL72 system covered in this article. This is important because various vendors and their shares of the subcomponents will drive whether they are winners or losers in the $500B Rubin buildout. The downstream impacts and dislocations in the market are large. The model covers:
Nvidia Compute Tray Content: Strata module with Rubin GPU, Vera CPU, SOCAMM memory; BlueField-4; ConnectX-9
NVLink System: NVSwitch, NVLink backplane and cabling, associated connectors, host CPU management module
Liquid Cooling Content: Coldplates, QDs, Manifolds
PCB, Substrate, and Materials Content: key system boards, ABF substrates, CCL content
Connectors: Paladin HD2 Board to Board Connectors, Paladin HD2 NVLink 6.0 Connectors
Power Delivery Content: power shelves, busbars, VRMs, power delivery modules
Mechanical Structure: chassis, loading mechanism, railkits, rack chassis
Management modules: BMC
Networking: Transceivers, CX-9

Source: Nvidia VR NVL72 Component BoM and Power Budget ModelPlease visit this self service portal to purchase the model. Contact sales@semianalysis.com for any questions regarding the product.
Extreme Co-design: 6 Silicon Product Offerings under the Rubin Platform - Chip Floorplans & SpecificationsSource: SemiAnalysis, NvidiaRubin’s dense FP4 and FP8 FLOPs increase by roughly ~3.5× versus GB200, while FP16 FLOPs rise by a more modest ~1.6x, underscoring NVIDIA’s continued emphasis on FP4/FP8 as the primary scaling vector. On the memory side, HBM capacity remains flat from GB300, while HBM bandwidth scales more aggressively at ~2.8x. Overall, the architecture prioritizes bandwidth and low-precision compute.
Source: SemiAnalysis, NvidiaRubinSource: NvidiaRubin’s design is a logical evolution from Blackwell, with the move to a 3nm process and disaggregating I/O into chiplets while keeping the same basic structure of 2 reticle-sized dies with 8 stacks of HBM. 35 PFLOPS dense FP4 is a 3.5x improvement over Blackwell GB200, achieved by:
Increasing SM count from 160 to 224
Doubling Tensor Core width in the SM to 32768 FP4 MACs/clock
Increasing clock speed 25% from 1.90GHz to 2.38GHz
Additionally, Nvidia claims up to an effective 50 PFLOPS of FP4 performance can be achieved with an updated 3rd generation Transformer Engine that replaces 2:4 structured sparsity from prior generations. We will detail this architectural feature for Rubin below.
Notably, the Tensor core width doubling only applies to FP4 and FP8, with BF16 and TF32 remaining the same as Blackwell, resulting in performance scaling only 1.6x of Blackwell. This architectural decision reflects NVIDIA’s belief that most training and inference workloads will move away from TF32 and BF16 and onto FP8 and FP4.
On the memory front, the move to HBM4 means double the bus width per stack, running at 10.8 GT/s for 22TB/s total bandwidth or 2.75x Blackwell at the same 288GB capacity as GB300. Memory bandwidth has been upgraded significantly from the original 13TB/s advertised at GTC 2025. In order to catch up to AMD MI450’s memory bandwidth, Nvidia requested much higher HBM4 pin speeds from the DRAM suppliers - well above the speeds that was in the JEDEC specification for HBM4.
While Nvidia is targeting 22TB/s, we understand that memory suppliers are having challenges hitting Nvidia’s requirements and we see it likely that initial shipments will come in slightly below at closer to 20TB/s. We have discussed the implications to SK Hynix, Samsung, and Micron extensively for Accelerator and HBM model subscribers. Micron is well behind Samsung and Hynix and we believe they are effectively out of the picture for Rubin HBM4. We have more details on qualifications and pin speeds in the Accelerator and HBM model
The NVLink-C2C chiplet houses the SerDes for the Vera CPU connection, doubled in bandwidth to 1.8TB/s, while the larger NVLink 6 chiplet on the other end of the chip features 36 custom ‘400G’ SerDes links for 2x NVLink bandwidth to all 72 Rubin GPUs.
Transistor count has climbed 60% to 336 billion.
A notable omission from Rubin is the mention of Sparse FLOPs. In previous generations, 2:4 structured sparsity was used to double marketing FLOPs numbers. However, adoption was minimal especially at low precisions due to accuracy losses from the rigid sparsity structure forcing half of the values to be zero. Programmers basically ignored structured sparsity as it was not useful, which caused hardware designs to change as well. Blackwell Ultra GB300 added 50% more dense FP4 while keeping sparse FP4 FLOPs the same, while AMD’s MI355X stopped supporting structured sparsity on MXFP8, MXFP6 and MXFP4 formats to save silicon area.
Rubin’s adaptive compression engine in the improved Transformer Engine is a key feature to re-boost naturally sparser inference performance by doing dynamic computation of sparsity in-flight and eliminating zeros in the data stream without zeroing out non-zero values, thus maintaining model accuracy while still boosting performance. This is done automatically on existing models built for Blackwell without the need for a new programming model or specific optimizations. While models that utilize Post Training Quantization or Quantization Aware Training will be tuned to maximize adaptive compression speedups, they are not strictly needed to take advantage of dynamic compression.
This means the sparser the workload, the closer the performance will be to the 50 PFLOPS marketed peak performance. NVIDIA thus brands the 50 PFLOPS figure as FP4 Inference while the 35 PFLOPS FP4 Training number is for dense workloads. As accuracy is preserved, this allows the marketing team to claim 5x FLOPs for Rubin over GB200, comparing 50 PFLOPS dynamically compressed FP4 to 10 PFLOPS dense FP4. Whether actual GEMM performance reaches 50 PFLOPS depends on how many zeros are in the tensor. The more zeros, the closer it can reach. The less zeros in the tensor, the lower the speedup. Overall, we expect to see much greater traction for Rubin’s adaptive sparsity compression as opposed to structured sparsity thanks to the automatic implementation.
With that said, many ML Systems engineer are still skeptical that this new form of sparsity will work well, and it is very possible that Nvidia’s 50 PFLOPS is purely marketing like prior generations
Rubin’s chip level TDP increases up to 2,300W vs 1000-1400W for Blackwell. Supply chain rumors have indicated that there are 2 different “SKUs” with different power and performance profiles: a Max-P variant at 2,300W and a Max-Q variant at 1,800W. However, these are not distinct hardware SKUs but the 2 default power profiles that Nvidia is offering users based on their workload needs. Max-Q is what Nvidia believes offers the best performance per Watt. Max-P offers the greatest absolute performance though this would come with an efficiency penalty. Running the Max-P setting results in a 20% increase in rack power draw but the performance gain fall well short of this 20% power consumption increase.
These power profiles are software managed. Users can also choose whatever max power draw they prefer (as long as it is no more than 2,300W per GPU) and this has been the case for previous GPU generations as well. Several hyperscalers and labs have chosen to run their GPUs at lower power to optimize for performance per Watt as well as taking into account power availability constraints.
Source: Nvidia VR NVL72 Component BoM and Power Budget ModelFor the mechanical structure of the Rubin package, upgrades have also been made with the addition of an upgraded heat spreader and a stiffener. This compares to Blackwell B200 and B300 packages that only have a heat spreader lid. The heat spreader lid allows more equal distribution of heat exiting the package. It also provides mechanical support for the package to prevent warpage.
For Rubin, the heat spreader lid is a module made up of two separate lids. Beside the heat spreader lid, a stiffener is added to the package structure to provide even more mechanical support to avoid warpage. At the surface of the heat spreader lid, there will also be a layer of electroplated gold. The reason for this is to prevent corrosion from liquid metal TIM2, which is between the heat spreader lid and the cold plate.
VeraSource: Nvidia VR NVL72 Component BoM and Power Budget ModelNVIDIA was aggressive on the CPU front, with Vera doubling performance over Grace by moving to a 3nm reticle-sized compute die and disaggregating the memory controllers and I/O into chiplets. Core count increases from 72 to 88, with 91 cores printed on die to leave redundancy for yield improvement. These cores mark the return of NVIDIA’s custom ARM CPU designs, with the ‘Olympus’ core now supporting SMT multi-threading for a total of 176 processing threads. L3 cache also received a 40% capacity bump to 162MB. Memory bus width doubled to 1024-bit and speed increased to 9600MT/s for 2.5x bandwidth, while maximum capacity tripled to 1.5TB with 8 SOCAMM modules. The NVLink-C2C to the Rubin GPUs also doubled in bandwidth to 1.8TB/s. PCIe6 and CXL3.1 are now supported as well. All this results in transistor count increasing 2.2x to 227 billion.
NVLink 6 SwitchSource: Nvidia VR NVL72 Component BoM and Power Budget ModelWhile the rack-level switching bandwidth has doubled, the number of NVLink Switch chips per rack has also doubled to 36, with each switch tray now housing 4 Switch chips. This means the new NVLink 6 Switch chip has the same 28.8T bandwidth as NVLink5 Switch, with half the number of ports but running at double the rate using ‘400G’ bi-directional SerDes. This allows the high bandwidth switch design to remain as a single monolithic die, saving on design complexity. The layout remains the same as NVIDIA’s previous switches, with 2 sides for IO and a central logic section crossbar and 3.6 TFlop SHARP in-network compute acceleration.
ConnectX-9Source: Nvidia VR NVL72 Component BoM and Power Budget ModelThe ConnectX-9 is largely iterative from ConnectX-8, with the same 800G networking bandwidth and 48-lane PCIe6 switch capability. However, CX-9 now supports 800G Ethernet with 4x200G PAM4 SerDes, compared to CX-8 that only supported it on InfiniBand. For the Rubin platform, NVIDIA is doubling the number of NICs per GPU to achieve 2x scale-out bandwidth.
BlueField-4Source: Nvidia VR NVL72 Component BoM and Power Budget ModelBlueField-4’s design departs significantly from BlueField-3. Instead of doing a bespoke tapeout with compute and networking, NVIDIA simply reuses their large Grace CPU die, co-packaged with a ConnectX-9 die to make an 800G DPU with massive compute capabilities. 128GB of LPDDR5 feeds the Grace CPU at half the bandwidth of regular Grace. That is 4x the memory capacity of BlueField-3. BlueField-4 can also function as a storage controller, with four BF-4 chips in each Context Memory Storage system.
Spectrum-6Source: Nvidia VR NVL72 Component BoM and Power Budget ModelWhile not part of the Rubin NVL72 rack, Spectrum-6 CPO enables even larger scale-out clusters with its doubled radix. The design retains the same features as Spectrum-5, with 8 IO chiplets surrounding the main switch die. 102.4T switching bandwidth is achieved with 512x 200G SerDes. 32 3.2T optical engines on the package convert these electrical signals to optical links, each with a detachable fiber connector. The SN6810 features one of these chips, while the SN6800 houses four, multiplexed together to create a 409.6T switch box. There will also be a non-CPO version with pluggable OSFP cages in the SN6600. The non-CPO version will be more common in our view.
Rubin Oberon Rack: NVL72 not NVL144 nor NVL36Since the announcement of GB200 at Nvidia GTC 2024, the concept of an AI server system has shifted from a chassis to a rack scale system. In our GB200 article, we discussed the evolution of Nvidia AI server form factor from HGX (8 GPU per node) to Oberon (NVL72 rack scale). While the HGX form factor still exists, the majority of Nvidia’s Blackwell GPUs are integrated in the Oberon form factor. Rubin will also be offered in both HGX and Oberon systems.
The key difference between the Blackwell and Rubin Oberon architecture is the number of SKUs offered to customers. As Blackwell Oberon was the first ever mass deployment of a rack scale solution with rack power density over 100KW for the GB200 NVL72 SKU, many datacenters did not have the infrastructure ready to support 100kw+ per rack. Nvidia offered two SKUs of Blackwell Oberon: GB200 NVL72 and GB200 NVL36x2. The latter being a lower density SKU offered for customers who did not have the infra ready to handle the thermals of a single high density rack. We discussed the difference between the two form factors in the GB200 article.
Unlike Blackwell, Rubin is only offered in the VR NVL72 SKU. The set up is very similar to that of GB200/GB300 NVL72. Each VR NVL72 system consists of:
72 Rubin GPU packages
36 Vera CPUs
36 NVLink 6 Switch ASICs
Source: Nvidia VR NVL72 BoM and Power Budget ModelOn a side note, VR NVL72 was initially known as VR NVL144 as Jensen math from GTC 2025 defined the number of GPU as the number of GPU compute die in system (with 2 compute dies per package and 72 Rubin packages per Oberon rack = 144 compute die). The naming was changed back to VR NVL72 to represent the 72 Rubin GPU packages in the system in late December. This was right before CES 2026 where the naming was officially confirmed as VR NVL72.
CPX Form FactorSource: Nvidia VR NVL72 BoM and Power Budget ModelAlthough Nvidia initially planned to integrate the CPX accelerator into the VR NVL72 rack, the current development suggests that CPX will only be offered as a standalone rack as we detailed in our article introducing Nvidia CPX . To recap Rubin-era system planning in our previous CPX article, Nvidia initially contemplated three VR NVL72 configurations:
VR NVL72 (Regular): Standard Oberon VR NVL72 without CPX
VR NVL72 CPX (Integrated): Rubin GPU and Rubin CPX within the same compute tray
VR NVL72 CPX (Dual Rack): Rubin CPX deployed in a separate rack alongside the VR NVL72 rack
The standalone/dedicated rack direction materially changes the deployment calculus. A dual-rack approach allows hyperscalers to scale prefill and decode capacity independently, optimize datacenter power envelopes, and reduce system-level failure domains versus tightly coupled trays. More importantly, it formalizes architectural disaggregation between inference prefill (compute-bound) and decode (bandwidth-bound).
Rubin CPX was originally architected as a GDDR7-based accelerator optimized for prefill, based on three key considerations:
Prefill is primarily FLOPs-limited, not bandwidth-limited, making HBM less indispensable.
HBM’s increased bandwidth is structurally underutilized in prefill.
GDDR7 offers materially lower cost per GB and avoids the need for 2.5D packaging,
However, Nvidia began exploring HBM-equipped variants for prefill, either via modified CPX configurations or through lower memory spec (such as using HBM3E) Rubin deployments dedicated to prefill, which we noted this way back in early December last year in our Accelerator & HBM model.
We also think a lot this shift is driven by evolving memory economics. Conventional DRAM pricing has risen sharply: As DDR pricing increases, the relative premium of HBM compresses because pricing is more locked down in long term contracts, narrowing the cost gap between a GDDR-based CPX and lower-spec HBM configurations, therefore eliminating a lot of the cost benefits GDDR offers relative to performance. While memory bandwidth is not as important for pre-fill compare to decode, it is still necessary.
Compute Tray RedesignOne of the major changes with VR NVL72 is within the compute tray. This redesign of the compute tray is centered around simplifying assembly, namely eliminating cables from the compute tray as cables have been the major point of failure of GB200/300 assembly. As Jensen put it at CES 2026, the cableless design reduces the compute tray assembly time from 2 hours to 5 minutes. To achieve this, the VR NVL72 compute tray adopts a modular design with the modules connecting to each via board-to-board connectors.
Source: Nvidia VR NVL72 BoM and Power Budget ModelTo understand the compute tray of VR NVL72 we must first understand the 6 modules that make up the VR NVL72 compute tray:
Strata Module x 2
Orchid Module x4
Compute Tray Midplane x 1
Power Delivery Module x 1
BlueField-4 Module x 1
System Management Module x 1
We break down these components costs and all the subcomponent costs in the Nvidia VR NVL72 BoM and Power Budget Model.
StrataStrata Module, Source: Nvidia VR NVL72 BoM and Power Budget ModelBianca Module. Source: Nvidia VR NVL72 BoM and Power Budget ModelSituated at the back of the chassis, the Strata module of VR NVL72 is the equivalent of the Bianca board of GB200/300. It houses two Rubin GPUs and one Vera CPU. Unlike Bianca, the LPDDR5X memory for Vera is socketed via SOCAMM module. 8 SOCAMM sockets are placed on the left and right of Vera. Two SOCAMM modules of different capacities are offered, 192GByte and 128GByte, for a maximum of 1,534GByte and a minimum of 1,024GByte per Vera. The Connect-X NICs mezzanine module is also taken off the Strata m