
NVIDIA Dynamo Open-Source Library Accelerates and Scales AI Reasoning Models

Rhea-AI Impact: Low
Rhea-AI Sentiment: Positive
Tags: AI

NVIDIA has unveiled NVIDIA Dynamo, new open-source inference software designed to accelerate and scale AI reasoning models in AI factories. The software, which succeeds NVIDIA Triton Inference Server, focuses on maximizing token revenue generation while reducing costs.

Key features of Dynamo include:

  • Doubles performance and revenue for Llama models on the NVIDIA Hopper platform
  • Boosts token generation by over 30x per GPU for the DeepSeek-R1 model on GB200 NVL72 racks
  • Enables dynamic GPU allocation and management
  • Supports disaggregated serving for separate processing phases

The platform includes four main innovations: GPU Planner for dynamic resource management, Smart Router for efficient request direction, Low-Latency Communication Library for optimized data transfer, and Memory Manager for cost-effective data handling. Major companies including AWS, Google Cloud, Microsoft Azure, and Meta will be implementing this technology.


Positive
  • Doubles performance and revenue for Llama models on the existing NVIDIA Hopper platform
  • Over 30x increase in token generation per GPU for the DeepSeek-R1 model
  • Major tech companies already committed to implementation
  • Cost reduction through intelligent resource allocation and management
Negative
  • None.

Insights

NVIDIA's launch of Dynamo represents a significant strategic move to strengthen its AI infrastructure dominance and expand its total addressable market. This open-source inference software directly addresses a critical economic challenge for enterprises operating AI factories: maximizing return on GPU investments through increased token throughput and reduced operational costs.

The financial implications are substantial:

  • Revenue Protection: As competing AI accelerators emerge, Dynamo creates stronger software lock-in for NVIDIA hardware
  • Ecosystem Reinforcement: Integration with major cloud providers (AWS, Google Cloud, Microsoft Azure) and AI companies (Meta, Perplexity) solidifies NVIDIA's position
  • TAM Expansion: By doubling throughput on existing Hopper systems and achieving 30x gains on Blackwell, Dynamo makes AI inference economically viable for more use cases
  • Recurring Revenue Enhancement: While open-source, Dynamo will drive commercial adoption of NVIDIA NIM and AI Enterprise software

Particularly compelling is how Dynamo's disaggregated serving capability allows AI companies to optimize different computational phases independently, maximizing GPU utilization efficiency. This directly translates to improved unit economics for AI providers using NVIDIA hardware—effectively reducing the cost per token while increasing throughput.
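
To make the unit-economics point concrete with deliberately hypothetical figures: at $2 per GPU-hour and 1 million tokens generated per GPU-hour, serving costs $2 per million tokens; doubling per-GPU throughput on the same hardware halves that to $1 per million tokens, and a 30x gain would cut it to roughly $0.07. Actual prices and throughputs vary widely by deployment.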

The timing is strategic, addressing growing concerns about AI inference costs that have been constraining broader enterprise AI adoption. By making reasoning models more economical to deploy at scale, NVIDIA is positioning itself to capture value from the next wave of AI implementation focused on agent-based reasoning systems.

NVIDIA's Dynamo represents a breakthrough in AI inference architecture that solves several critical technical bottlenecks in large-scale AI deployment. The significance extends beyond performance metrics to fundamental changes in how AI workloads are orchestrated.

The technical innovations driving Dynamo's efficiency gains include:

  • Knowledge-Aware Routing: By mapping KV cache across thousands of GPUs and intelligently routing requests to GPUs with matching context, Dynamo eliminates redundant computations that waste processing cycles
  • Dynamic Resource Allocation: The GPU Planner feature introduces genuine elasticity to AI infrastructure, automatically scaling resources based on demand patterns
  • Memory Hierarchy Optimization: Offloading inference data to lower-cost storage creates a tiered memory system optimized for AI workloads (a minimal sketch of this tiering follows the list)
  • Pipeline Parallelism: The disaggregated serving model separates processing stages (understanding vs. generation) onto different GPUs, allowing specialized optimization for each phase
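
To make the memory-hierarchy point concrete, here is a minimal, hypothetical Python sketch of a two-tier KV cache; the names and structure are invented for illustration and are not Dynamo's actual API:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: hot KV blocks stay in a small fast tier
    (standing in for GPU memory); cold blocks spill to a cheaper,
    slower tier (host memory or storage) instead of being discarded."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # key -> KV block, kept in recency order
        self.slow = {}             # overflow tier

    def put(self, key, kv_block):
        self.fast[key] = kv_block
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            cold_key, cold_block = self.fast.popitem(last=False)
            self.slow[cold_key] = cold_block  # offload rather than recompute later

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # refresh recency
            return self.fast[key]
        if key in self.slow:
            self.put(key, self.slow.pop(key))  # reload into the fast tier on reuse
            return self.fast[key]
        return None  # true miss: the caller must recompute

cache = TieredKVCache(fast_capacity=2)
for k in ("req-a", "req-b", "req-c"):
    cache.put(k, f"kv-for-{k}")
print(cache.get("req-a"))  # was offloaded to the slow tier, reloaded on reuse
```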

These technologies directly address the primary inference bottleneck in reasoning models—the massive computational overhead of processing thousands of "thinking" tokens for each user query. Companies like Perplexity and Cohere specifically highlight Dynamo's ability to handle complex, high-token workloads efficiently.

What's particularly significant is that Dynamo appears purpose-built for the emerging class of reasoning and agent-based AI systems, which require more extensive computation than basic generative models. By optimizing specifically for these workflows, NVIDIA is positioning its hardware as the preferred platform for the next evolution of AI applications.

NVIDIA Dynamo Increases Inference Performance While Lowering Costs for Scaling Test-Time Compute; Inference Optimizations on NVIDIA Blackwell Boost Throughput by 30x on DeepSeek-R1

SAN JOSE, Calif., March 18, 2025 (GLOBE NEWSWIRE) -- GTC -- NVIDIA today unveiled NVIDIA Dynamo, an open-source inference software for accelerating and scaling AI reasoning models in AI factories at the lowest cost and with the highest efficiency.

Efficiently orchestrating and coordinating AI inference requests across a large fleet of GPUs is crucial to ensuring that AI factories run at the lowest possible cost to maximize token revenue generation.

As AI reasoning goes mainstream, every AI model will generate tens of thousands of tokens used to “think” with every prompt. Increasing inference performance while continually lowering the cost of inference accelerates growth and boosts revenue opportunities for service providers.

NVIDIA Dynamo, the successor to NVIDIA Triton Inference Server™, is new AI inference-serving software designed to maximize token revenue generation for AI factories deploying reasoning AI models. It orchestrates and accelerates inference communication across thousands of GPUs, and uses disaggregated serving to separate the processing and generation phases of large language models (LLMs) on different GPUs. This allows each phase to be optimized independently for its specific needs and ensures maximum GPU resource utilization.
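
As a rough illustration of that split (a hand-rolled sketch, not Dynamo's interface; all class and field names here are invented), prefill and decode can run in separate worker pools so each phase is sized and tuned independently:

```python
from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache_id: str   # handle to the KV cache built while "understanding" the prompt
    first_token: str

class PrefillWorker:
    """Compute-bound phase: ingest the full prompt and build the KV cache."""
    def run(self, prompt: str) -> PrefillResult:
        kv_cache_id = f"kv-{hash(prompt) & 0xffff:04x}"
        return PrefillResult(kv_cache_id=kv_cache_id, first_token="<t0>")

class DecodeWorker:
    """Memory-bandwidth-bound phase: generate tokens one at a time from the cache."""
    def run(self, kv_cache_id: str, max_tokens: int) -> list[str]:
        return [f"<token-{i}>" for i in range(max_tokens)]

def serve(prompt: str) -> list[str]:
    # In a real disaggregated deployment these pools run on different GPUs,
    # with the KV cache transferred between them over a fast interconnect.
    prefill = PrefillWorker().run(prompt)
    return [prefill.first_token] + DecodeWorker().run(prefill.kv_cache_id, max_tokens=4)

print(serve("Explain disaggregated serving in one sentence."))
```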

“Industries around the world are training AI models to think and learn in different ways, making them more sophisticated over time,” said Jensen Huang, founder and CEO of NVIDIA. “To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories.”

Using the same number of GPUs, Dynamo doubles the performance and revenue of AI factories serving Llama models on today’s NVIDIA Hopper™ platform. When running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, NVIDIA Dynamo’s intelligent inference optimizations also boost the number of tokens generated by over 30x per GPU.

To achieve these inference performance improvements, NVIDIA Dynamo incorporates features that increase throughput and reduce costs. It can dynamically add, remove and reallocate GPUs in response to fluctuating request volumes and types, and can pinpoint the specific GPUs in large clusters best placed to minimize response computations before routing queries to them. It can also offload inference data to more affordable memory and storage devices and quickly retrieve it when needed, minimizing inference costs.

NVIDIA Dynamo is fully open source and supports PyTorch, SGLang, NVIDIA TensorRT™-LLM and vLLM to allow enterprises, startups and researchers to develop and optimize ways to serve AI models across disaggregated inference. It will enable users to accelerate the adoption of AI inference, including at AWS, Cohere, CoreWeave, Dell, Fireworks, Google Cloud, Lambda, Meta, Microsoft Azure, Nebius, NetApp, OCI, Perplexity, Together AI and VAST. 

Inference Supercharged
NVIDIA Dynamo maps the knowledge that inference systems hold in memory from serving prior requests — known as KV cache — across potentially thousands of GPUs.

It then routes new inference requests to the GPUs that have the best knowledge match, avoiding costly recomputations and freeing up GPUs to respond to new incoming requests.
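
A minimal sketch of that routing idea, using a toy token-prefix overlap as the "knowledge match" score (Dynamo's actual cache index and scoring are more sophisticated, and these function names are invented):

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: list[str], gpu_caches: dict[str, list[list[str]]]) -> str:
    """Pick the GPU whose cached sequences overlap the request the most,
    so the matching KV-cache prefix need not be recomputed."""
    def best_overlap(cached: list[list[str]]) -> int:
        return max((shared_prefix_len(request_tokens, seq) for seq in cached), default=0)
    return max(gpu_caches, key=lambda gpu: best_overlap(gpu_caches[gpu]))

# Toy fleet: gpu-1 already served a prompt sharing a long prefix with the new request.
fleet = {
    "gpu-0": [["the", "capital", "of", "france"]],
    "gpu-1": [["explain", "kv", "cache", "routing", "briefly"]],
}
print(route(["explain", "kv", "cache", "routing", "in", "depth"], fleet))  # -> gpu-1
```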

“To handle hundreds of millions of requests monthly, we rely on NVIDIA GPUs and inference software to deliver the performance, reliability and scale our business and users demand,” said Denis Yarats, chief technology officer of Perplexity AI. “We look forward to leveraging Dynamo, with its enhanced distributed serving capabilities, to drive even more inference-serving efficiencies and meet the compute demands of new AI reasoning models.”

Agentic AI
AI provider Cohere is planning to power agentic AI capabilities in its Command series of models using NVIDIA Dynamo.

“Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination and low-latency communication libraries that transfer reasoning contexts seamlessly across memory and storage,” said Saurabh Baji, senior vice president of engineering at Cohere. “We expect NVIDIA Dynamo will help us deliver a premier user experience to our enterprise customers.”

Disaggregated Serving
The NVIDIA Dynamo inference platform also supports disaggregated serving, which assigns the different computational phases of LLMs — including building an understanding of the user query and then generating the best response — to different GPUs. This approach is ideal for reasoning models like the new NVIDIA Llama Nemotron model family, which uses advanced inference techniques for improved contextual understanding and response generation. Disaggregated serving allows each phase to be fine-tuned and resourced independently, improving throughput and delivering faster responses to users.

Together AI, the AI Acceleration Cloud, is looking to integrate its proprietary Together Inference Engine with NVIDIA Dynamo to enable seamless scaling of inference workloads across GPU nodes. This also lets Together AI dynamically address traffic bottlenecks at various stages of the model pipeline.

“Scaling reasoning models cost effectively requires new advanced inference techniques, including disaggregated serving and context-aware routing,” said Ce Zhang, chief technology officer of Together AI. “Together AI provides industry-leading performance using our proprietary inference engine. The openness and modularity of NVIDIA Dynamo will allow us to seamlessly plug its components into our engine to serve more requests while optimizing resource utilization — maximizing our accelerated computing investment. We’re excited to leverage the platform’s breakthrough capabilities to cost-effectively bring open-source reasoning models to our users.”

NVIDIA Dynamo Unpacked
NVIDIA Dynamo includes four key innovations that reduce inference serving costs and improve user experience:

  • GPU Planner: A planning engine that dynamically adds and removes GPUs to adjust to fluctuating user demand, avoiding GPU over- or under-provisioning (a toy version of such a policy is sketched after this list).
  • Smart Router: An LLM-aware router that directs requests across large GPU fleets to minimize costly GPU recomputations of repeat or overlapping requests — freeing up GPUs to respond to new incoming requests.
  • Low-Latency Communication Library: An inference-optimized library that supports state-of-the-art GPU-to-GPU communication and abstracts the complexity of data exchange across heterogeneous devices, accelerating data transfer.
  • Memory Manager: An engine that intelligently offloads and reloads inference data to and from lower-cost memory and storage devices without impacting user experience. 
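
As an illustration of the GPU Planner item above, here is a naive threshold policy invented for clarity; Dynamo's actual planner is far more sophisticated:

```python
def plan_gpu_count(queue_depth: int, target_per_gpu: int = 8,
                   min_gpus: int = 1, max_gpus: int = 72) -> int:
    """Naive autoscaling policy: size the pool so each GPU carries roughly
    `target_per_gpu` queued requests, clamped to the fleet limits."""
    desired = max(min_gpus, -(-queue_depth // target_per_gpu))  # ceiling division
    return min(max_gpus, desired)

# Demand spikes from 16 to 200 queued requests, then falls back.
for depth in (16, 200, 40):
    print(depth, "->", plan_gpu_count(queue_depth=depth))
# 16 -> 2, 200 -> 25, 40 -> 5
```

A production planner would also account for scale-up latency and phase-specific load (prefill versus decode), which this toy policy ignores.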

NVIDIA Dynamo will be made available in NVIDIA NIM™ microservices and supported in a future release by the NVIDIA AI Enterprise software platform with production-grade security, support and stability.

Learn more by watching the NVIDIA GTC keynote, reading this blog on Dynamo and registering for sessions from NVIDIA and industry leaders at the show, which runs through March 21.

About NVIDIA
NVIDIA (NASDAQ: NVDA) is the world leader in accelerated computing.

For further information, contact:
Cliff Edwards
NVIDIA Corporation
+1-415-699-2755
cliffe@nvidia.com

Certain statements in this press release including, but not limited to, statements as to: the benefits, impact, availability, and performance of NVIDIA’s products, services, and technologies; third parties adopting NVIDIA’s products and technologies and the benefits and impact thereof; industries around the world training AI models to think and learn in different ways, making them more sophisticated over time; and to enable a future of custom reasoning AI, NVIDIA Dynamo helping serve these models at scale, driving cost savings and efficiencies across AI factories are forward-looking statements that are subject to risks and uncertainties that could cause results to be materially different than expectations. Important factors that could cause actual results to differ materially include: global economic conditions; our reliance on third parties to manufacture, assemble, package and test our products; the impact of technological development and competition; development of new products and technologies or enhancements to our existing product and technologies; market acceptance of our products or our partners' products; design, manufacturing or software defects; changes in consumer preferences or demands; changes in industry standards and interfaces; unexpected loss of performance of our products or technologies when integrated into systems; as well as other factors detailed from time to time in the most recent reports NVIDIA files with the Securities and Exchange Commission, or SEC, including, but not limited to, its annual report on Form 10-K and quarterly reports on Form 10-Q. Copies of reports filed with the SEC are posted on the company's website and are available from NVIDIA without charge. These forward-looking statements are not guarantees of future performance and speak only as of the date hereof, and, except as required by law, NVIDIA disclaims any obligation to update these forward-looking statements to reflect future events or circumstances.

Many of the products and features described herein remain in various stages and will be offered on a when-and-if-available basis. The statements above are not intended to be, and should not be interpreted as a commitment, promise, or legal obligation, and the development, release, and timing of any features or functionalities described for our products is subject to change and remains at the sole discretion of NVIDIA. NVIDIA will have no liability for failure to deliver or delay in the delivery of any of the products, features or functions set forth herein.

© 2025 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, NVIDIA Hopper, NVIDIA NIM, NVIDIA Triton Inference Server and TensorRT are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. Features, pricing, availability and specifications are subject to change without notice.

A photo accompanying this announcement is available at https://www.globenewswire.com/NewsRoom/AttachmentNg/e82546dd-6224-4ebb-8d5a-3476d18e97d0


FAQ

What performance improvements does NVIDIA Dynamo offer for NVDA's AI inference?

Dynamo doubles performance for Llama models on the Hopper platform and boosts token generation by over 30x per GPU for the DeepSeek-R1 model on GB200 NVL72 racks.

How does NVIDIA Dynamo's disaggregated serving benefit AI processing?

It separates processing and generation phases of LLMs on different GPUs, allowing independent optimization and maximum GPU resource utilization.

Which major tech companies are adopting NVIDIA's Dynamo technology?

AWS, Cohere, Google Cloud, Meta, Microsoft Azure, Perplexity, and several other major tech companies are implementing Dynamo.

What are the four key innovations in NVIDIA Dynamo's architecture?

GPU Planner, Smart Router, Low-Latency Communication Library and Memory Manager, all aimed at reducing inference serving costs and improving user experience.