At CES 2024, NVIDIA’s Jensen Huang unveiled Alpamaio, an open-source AI driving stack. The move shifts NVIDIA from merely selling chips to providing a complete solution of models, simulation tools, and software, potentially disrupting the automotive industry. Alpamaio, developed over six to seven years, aims to solve the critical “long-tail” problems in end-to-end autonomous driving that have kept many systems stuck at Level 2/3 autonomy.
The foundation is Cosmos, a World Foundation Model trained on 20 million hours of real-world video. It learns to model the physical world by generating scenes, reasoning about them, and predicting trajectories. Integrated with NVIDIA’s Omniverse for high-fidelity simulation, Cosmos gives the AI a grasp of real-world physics and saves developers immense effort.
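At its core, a world model of this kind is a learned next-state predictor that can be unrolled autoregressively to "imagine" the consequences of a candidate plan. The sketch below is purely illustrative (Cosmos's actual interfaces are not public at this level of detail); `predict_next_state` stands in for the learned model with toy linear dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next_state(state, action):
    """Stand-in for a learned world model mapping (state, action) -> next state.
    Here it is toy linear dynamics plus noise; Cosmos learns this from video."""
    return state + 0.1 * action + 0.01 * rng.standard_normal(state.shape)

def rollout(world_model, state, actions):
    """Autoregressively unroll the world model over a sequence of actions,
    producing a predicted trajectory for the candidate plan."""
    trajectory = [state]
    for a in actions:
        state = world_model(state, a)
        trajectory.append(state)
    return np.stack(trajectory)

# Simulate 10 steps of a "drive straight" plan and inspect the predicted path.
traj = rollout(predict_next_state, np.zeros(2), [np.array([1.0, 0.0])] * 10)
print(traj.shape)  # (11, 2): the initial state plus 10 predicted states
```

The key property is that the planner never touches the real world: it evaluates plans against the model's imagined futures.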
Within Alpamaio, Cosmos serves two key roles:
- Generating vast simulated training data, creating rare or hard-to-capture driving scenarios (e.g., extreme weather, accidents) through a blend of autoregressive and diffusion mechanisms. This embodies the “computation is data” philosophy.
- Acting as the backbone for the reasoning model, specifically the Cosmos Reason branch (~82B parameters). This model translates visual input into text-based “causal chains” for decision-making, forming the core of Alpamaio 2.0.
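The first role, simulated data generation, can be pictured as a loop that spends compute to synthesize exactly the long-tail cases that rarely appear in fleet logs. The sketch below is hypothetical throughout (`generate_scenario` stands in for conditional video generation by the world model; the condition strings are invented examples).

```python
import random

# Hypothetical long-tail conditions that are rare or dangerous to capture live.
RARE_CONDITIONS = ["heavy fog", "black ice", "jaywalking pedestrian at night",
                   "overturned truck blocking two lanes"]

def generate_scenario(condition, seed):
    """Stand-in for conditional generation: the real stack would invoke a
    diffusion/autoregressive world model; here we just return a scenario spec."""
    return {"condition": condition, "seed": seed,
            "frames": f"<synthetic clip for '{condition}'>"}

def build_synthetic_dataset(n):
    """'Computation is data': trade GPU cycles for rare training scenarios
    instead of waiting years to encounter them on real roads."""
    picker = random.Random(42)
    return [generate_scenario(picker.choice(RARE_CONDITIONS), i)
            for i in range(n)]

dataset = build_synthetic_dataset(4)
print(len(dataset), dataset[0]["condition"])
```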
Alpamaio 2.0 has three highlights:
- Causal Chain Dataset: Trained on over 700,000 reasoning trajectories explained in natural language. This helps the system break down novel, complex situations into manageable sub-tasks, improving interpretability and handling of edge cases.
- Diffusion Trajectory Decoder: A ~23B parameter model that converts Cosmos Reason’s high-level reasoning into physically plausible vehicle trajectories, constrained by real vehicle dynamics, planning 6.4 seconds ahead.
- Multi-Stage Training: A four-phase strategy to avoid “black box” issues:
- Phase 1: Train Cosmos Reason as a Vision-Language Model (VLM) on general and driving-specific visual Q&A data.
- Phase 2: Pre-train the full Alpamaio system on 80,000+ hours of general driving data (some with LiDAR) to extend VLM to Vision-Language-Action (VLA) capability.
- Phase 3: Supervised fine-tuning using the massive Causal Chain Dataset (human-machine annotated) to enhance reasoning.
- Phase 4: Reinforcement learning in simulation to align reasoning with action and improve robustness.
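The diffusion trajectory decoder described above can be sketched as an iterative denoising loop: start from noise, then repeatedly nudge the trajectory toward the reasoning model's intent until a smooth 6.4-second plan emerges. Everything below is a toy stand-in for the ~23B-parameter model (the waypoint rate, the guidance format, and `denoise_step` are all assumptions for illustration).

```python
import numpy as np

HORIZON_S, DT = 6.4, 0.1            # assumed 0.1 s waypoint spacing
N_WAYPOINTS = int(HORIZON_S / DT)   # 64 future (x, y) waypoints

def denoise_step(traj, step, total, guidance):
    """Stand-in for one learned denoising step: blend the noisy trajectory
    toward the high-level intent produced by Cosmos Reason (`guidance`)."""
    alpha = (step + 1) / total
    return (1 - alpha) * traj + alpha * guidance

def decode_trajectory(guidance, steps=20, seed=0):
    """Diffusion-style decoding: initialize from Gaussian noise, then
    iteratively denoise into a trajectory conditioned on the causal chain."""
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal((N_WAYPOINTS, 2))
    for s in range(steps):
        traj = denoise_step(traj, s, steps, guidance)
    return traj

# Hypothetical guidance: "continue straight at ~10 m/s" as target waypoints.
target = np.stack([np.arange(N_WAYPOINTS) * DT * 10.0,
                   np.zeros(N_WAYPOINTS)], axis=1)
plan = decode_trajectory(target)
print(plan.shape)  # (64, 2): a 6.4 s plan at 10 Hz
```

In the real system, the denoiser would also be constrained by vehicle dynamics (steering and acceleration limits) so the decoded waypoints remain physically drivable.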
In essence, Alpamaio combines VLA (language-based reasoning) with World Model capabilities. This contrasts with non-VLA, “end-to-end” approaches (such as Tesla’s FSD) that map vision directly to action without an explicit language-based reasoning step. By open-sourcing such a comprehensive stack, NVIDIA offers automakers a potential shortcut, bundling its hardware with a sophisticated AI software suite.
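The architectural contrast can be made concrete by comparing the two policy interfaces. The sketch below is schematic, not either company's actual code: the direct approach exposes only the final controls, while the VLA approach exposes an inspectable language artifact in the middle (the causal chain and control values here are invented examples).

```python
from dataclasses import dataclass

@dataclass
class Controls:
    steer: float   # normalized steering command
    accel: float   # normalized acceleration command

# Direct end-to-end: one opaque mapping from vision to controls.
def end_to_end_policy(camera_frame) -> Controls:
    return Controls(steer=0.0, accel=0.5)  # stand-in for a learned network

# VLA-style: an explicit language-based reasoning step in the middle.
def vla_policy(camera_frame) -> tuple[str, Controls]:
    causal_chain = ("cyclist merging from bike lane -> yield required -> "
                    "ease off throttle, hold lane")   # hypothetical reasoning
    controls = Controls(steer=0.0, accel=-0.2)        # decoded from the chain
    return causal_chain, controls

chain, ctrl = vla_policy(camera_frame=None)
print(chain)  # the intermediate reasoning can be logged and audited
```

The practical difference is debuggability: when a VLA system misbehaves, engineers can inspect the causal chain that preceded the action, whereas a direct mapping offers no such intermediate to examine.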
