NVIDIA has unveiled its next-generation Rubin GPU architecture, claiming significant improvements: a 10x reduction in token cost, a 5x boost in inference performance, and a 3.5x increase in training performance. Yet the transistor count grew by only 1.6x. How does Rubin achieve this “computational magic”?
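The gap between the headline multipliers and the modest transistor growth can be made concrete with simple arithmetic (the figures are NVIDIA's claims as quoted above; the decomposition itself is just illustrative):

```python
# Back-of-the-envelope check of NVIDIA's headline numbers.
transistor_growth = 1.6   # claimed transistor-count increase vs. prior gen
inference_speedup = 5.0   # claimed inference performance gain
training_speedup = 3.5    # claimed training performance gain

# Any performance gain not explained by transistor count alone must come
# from elsewhere: architecture, data formats, memory offload, co-design.
arch_gain_inference = inference_speedup / transistor_growth
arch_gain_training = training_speedup / transistor_growth

print(f"per-transistor inference gain: {arch_gain_inference:.2f}x")  # 3.12x
print(f"per-transistor training gain:  {arch_gain_training:.2f}x")   # 2.19x
```

Roughly 3x of the inference gain, in other words, has to come from design rather than silicon. The article's three "keys" below are where that 3x lives.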
The first key is Chip Co-Design. Rubin represents a complete supercomputing platform, not just a GPU. It integrates six co-designed chips: the core Rubin GPU, the Vera CPU, NVLink 6 (for GPU interconnect), Spectrum-X (Ethernet switch), ConnectX-8 SuperNIC (network interface), and BlueField-4 DPU (data processing unit). All are built on TSMC’s 3nm process and optimized in tandem for AI workloads. The Rubin GPU itself features a third-generation Transformer Engine, HBM4 memory (288GB, 22 TB/s bandwidth), and doubled NVLink bandwidth. The custom Vera CPU (88 cores, 176 threads) is likewise tailored for AI.
The second key is Inference Context Memory, introduced in the BlueField-4 DPU. It offloads the KV (Key-Value) cache—critical for large language model inference—from expensive GPU HBM to larger, cheaper storage. This frees up HBM for model weights and active computation, enabling larger-scale inference tasks.
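The idea behind KV-cache offloading can be sketched as a two-tier cache: a small, fast tier stands in for GPU HBM, and overflow spills to a larger, cheaper tier standing in for host or DPU-attached storage. This is a minimal pure-Python sketch of the concept, not NVIDIA's implementation; the class name, LRU policy, and capacities are all illustrative assumptions:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: hot entries in 'HBM', overflow in 'host'."""

    def __init__(self, hbm_capacity):
        self.hbm = OrderedDict()   # fast tier: token_id -> (key, value)
        self.host = {}             # slow tier: spillover storage
        self.hbm_capacity = hbm_capacity

    def put(self, token_id, kv):
        self.hbm[token_id] = kv
        self.hbm.move_to_end(token_id)          # mark as most recently used
        while len(self.hbm) > self.hbm_capacity:
            old_id, old_kv = self.hbm.popitem(last=False)  # evict LRU entry
            self.host[old_id] = old_kv          # ...into the cheaper tier

    def get(self, token_id):
        if token_id in self.hbm:
            self.hbm.move_to_end(token_id)
            return self.hbm[token_id]
        kv = self.host.pop(token_id)            # KeyError if truly absent
        self.put(token_id, kv)                  # promote back on access
        return kv

cache = TieredKVCache(hbm_capacity=2)
for t in range(4):
    cache.put(t, (f"k{t}", f"v{t}"))
print(sorted(cache.host))   # [0, 1] -- oldest tokens spilled to host tier
print(cache.get(0))         # ('k0', 'v0'), promoted back into 'HBM'
```

The payoff mirrors the article's point: the hot tier never holds more than its capacity, so the scarce resource (HBM) stays free for weights and active computation while the long context lives in cheap storage.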
The third key is full support for the NVFP4 data format. Natively accelerated by Rubin’s 6th-gen Tensor Cores, NVFP4 is a 4-bit floating-point format that uses a two-level scaling mechanism (block-level and tensor-level). It keeps the storage footprint of 4-bit values while achieving computational accuracy close to FP8, significantly boosting performance for both inference and training.
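The two-level scaling idea can be illustrated in a few lines. The sketch below is a simplification, not the real NVFP4 recipe: actual NVFP4 stores block scales in FP8 (E4M3) over 16-element micro-blocks, whereas here the scales are plain floats and the block size is 4 to keep the example small. The value grid is the standard FP4 E2M1 set of representable magnitudes.

```python
# Representable magnitudes of a 4-bit E2M1 float (plus sign bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 4  # simplified; real NVFP4 uses 16-element micro-blocks

def quantize_fp4(values):
    # Level 1: a tensor-wide scale maps the global max onto 6.0 (FP4 max).
    tensor_scale = max(abs(v) for v in values) / 6.0 or 1.0
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = [v / tensor_scale for v in values[i:i + BLOCK]]
        # Level 2: a per-block scale re-centres each block's own max on 6.0,
        # so small-magnitude blocks keep fine resolution despite outliers.
        block_scale = max(abs(v) for v in chunk) / 6.0 or 1.0
        codes = []
        for v in chunk:
            mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / block_scale))
            codes.append(mag if v >= 0 else -mag)  # round to nearest code
        blocks.append((block_scale, codes))
    return tensor_scale, blocks

def dequantize_fp4(tensor_scale, blocks):
    out = []
    for block_scale, codes in blocks:
        out.extend(c * block_scale * tensor_scale for c in codes)
    return out

data = [0.02, -0.5, 1.25, 3.7, 0.01, 0.04, -0.03, 0.08]
ts, bs = quantize_fp4(data)
restored = dequantize_fp4(ts, bs)
```

Note what the second scaling level buys: the large outlier 3.7 in the first block would otherwise crush the resolution of the tiny values in the second block, but with its own block scale the second block recovers values like 0.04 and 0.08 almost exactly.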
However, NVIDIA faces a technological inflection point. Competitors are closing in. Google’s TPU and other specialized accelerators (from Amazon, Meta, Intel, etc.) are gaining ground, especially in the massive inference market. Within the GPU camp, AMD and Chinese firms like Moore Threads and Biren Technology are also advancing. NVIDIA’s dominance, while still strong due to its ecosystem and process advantages, is being challenged.
In response, NVIDIA is adapting. It has opened its NVLink platform so that custom accelerators can integrate with its ecosystem, and it acquired AI chip startup Groq to bolster its inference capabilities. NVIDIA is now advancing on two fronts: pushing GPU performance to its limits for AI while absorbing the techniques of specialized chips to counter them.

