NVIDIA Ethernet Networking Accelerates World’s Largest AI Supercomputer, Built by xAI
NVIDIA announced that xAI's Colossus supercomputer, featuring 100,000 NVIDIA Hopper Tensor Core GPUs, has achieved massive scale using the NVIDIA Spectrum-X Ethernet networking platform. The system, located in Memphis, Tennessee, is being used to train xAI's Grok language models and is currently being expanded to 200,000 GPUs. The supercomputer was built in just 122 days and achieved 95% data throughput with zero application latency degradation. The system utilizes NVIDIA's Spectrum SN5600 Ethernet switch, supporting speeds up to 800Gb/s, paired with BlueField-3 SuperNICs for optimal performance.
- Successful deployment of world's largest AI supercomputer with 100,000 NVIDIA GPUs
- System expansion in progress to double capacity to 200,000 GPUs
- Achieved 95% data throughput, significantly outperforming standard Ethernet's 60%
- Rapid deployment completed in 122 days versus typical timeframe of months to years
- None.
Insights
The deployment of a 100,000 NVIDIA Hopper GPU system, with plans to double to 200,000 GPUs, represents a significant technological milestone in AI infrastructure. The system's exceptional
The rapid 122-day construction timeframe and 19-day deployment to training initiation showcase unprecedented speed in supercomputer implementation. The Spectrum-X platform's 800Gb/s port speeds and advanced features like adaptive routing position NVIDIA to capture substantial market share in the growing AI infrastructure sector. This partnership with xAI validates NVIDIA's dominance in both AI hardware and networking solutions, strengthening their competitive moat in the AI ecosystem.
This development significantly strengthens NVIDIA's market position in the AI infrastructure space. By powering xAI's Colossus, the world's largest AI supercomputer, NVIDIA demonstrates its ability to deliver end-to-end solutions for large-scale AI deployments. The successful implementation could accelerate adoption of NVIDIA's Spectrum-X platform among other major AI companies and hyperscalers.
The partnership with Elon Musk's xAI adds considerable prestige and validation to NVIDIA's networking solutions, potentially driving increased demand for their integrated GPU-networking packages. This could lead to higher margins and revenue growth as companies seek to replicate xAI's success in large-scale AI deployments.
NVIDIA Spectrum-X Makes Colossal NVIDIA Hopper 100,000-GPU System Possible
SANTA CLARA, Calif., Oct. 28, 2024 (GLOBE NEWSWIRE) -- NVIDIA today announced that xAI’s Colossus supercomputer cluster comprising 100,000 NVIDIA Hopper Tensor Core GPUs in Memphis, Tennessee, achieved this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform, which is designed to deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet, for its Remote Direct Memory Access (RDMA) network.
Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.
The supporting facility and state-of-the-art supercomputer were built by xAI and NVIDIA in just 122 days, whereas systems of this size typically take many months to years. It took 19 days from the time the first rack rolled onto the floor until training began.
While training the extremely large Grok model, Colossus achieves unprecedented network performance. Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput.
This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput.
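To give a rough sense of why flow collisions arise, the sketch below pins each flow to a uniformly random link and counts how many flows end up sharing a link. The random pinning is only a stand-in for static hash-based placement, and the round-robin alternative is an idealized balanced placement. Neither models how Spectrum-X or any specific switch behaves, and the function names and flow and link counts are hypothetical.

```python
import random
from collections import Counter


def collided_flows_static(num_flows: int, num_links: int, seed: int = 0) -> int:
    """Count flows that share a link when each flow is pinned to a random link.

    Uniform random pinning is an assumption standing in for static hashing.
    """
    rng = random.Random(seed)
    flows_per_link = Counter(rng.randrange(num_links) for _ in range(num_flows))
    return sum(n for n in flows_per_link.values() if n > 1)


def collided_flows_balanced(num_flows: int, num_links: int) -> int:
    """Count flows that share a link under an idealized round-robin spread."""
    flows_per_link = Counter(i % num_links for i in range(num_flows))
    return sum(n for n in flows_per_link.values() if n > 1)


# Hypothetical example: 64 flows competing for 64 equal-cost links.
print("static placement, collided flows:", collided_flows_static(64, 64))
print("balanced placement, collided flows:", collided_flows_balanced(64, 64))
```

With as many links as flows, a balanced spread gives every flow its own link, while random pinning almost always leaves some links shared and others idle.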
“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”
“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”
“xAI has built the world’s largest, most-powerful supercomputer,” said a spokesperson for xAI. “NVIDIA’s Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard.”
At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC. xAI chose to pair the Spectrum-X SN5600 switch with NVIDIA BlueField-3® SuperNICs for unprecedented performance.
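As a back-of-the-envelope illustration, the sketch below applies the release's 95% and 60% throughput figures to a single 800Gb/s port. Treating those percentages as a fraction of per-port line rate is an assumption made here for illustration; the release does not say how throughput was measured. The helper name `effective_gbps` is hypothetical.

```python
PORT_SPEED_GBPS = 800  # maximum SN5600 port speed quoted in the release


def effective_gbps(port_speed_gbps: float, throughput_fraction: float) -> float:
    """Usable bandwidth, assuming throughput is a fraction of line rate."""
    return port_speed_gbps * throughput_fraction


spectrum_x = effective_gbps(PORT_SPEED_GBPS, 0.95)  # Spectrum-X figure
standard = effective_gbps(PORT_SPEED_GBPS, 0.60)    # standard Ethernet figure
ratio = spectrum_x / standard

print(f"Spectrum-X:        {spectrum_x:.0f} Gb/s per port")
print(f"Standard Ethernet: {standard:.0f} Gb/s per port")
print(f"Ratio:             {ratio:.2f}x")
```

Under that assumption, the gap between the two figures is roughly 280Gb/s per port, or about 1.58 times the usable bandwidth.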
Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation — all key requirements for multi-tenant generative AI clouds and large enterprise environments.
About NVIDIA
NVIDIA (NASDAQ: NVDA) is the world leader in accelerated computing.
For further information, contact:
Alex Shapiro
NVIDIA Corporation
+1-415-608-5044
ashapiro@nvidia.com
Certain statements in this press release including, but not limited to, statements as to: the benefits, impact, and performance of NVIDIA’s products, services, and technologies, including NVIDIA Hopper Tensor Core GPUs, NVIDIA Spectrum-X Ethernet networking platform, NVIDIA Spectrum SN5600 Ethernet switch, Spectrum-4 switch ASIC, and NVIDIA BlueField-3 SuperNICs; features of xAI’s Colossus supercomputer cluster; xAI being in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs; the NVIDIA Spectrum-X Ethernet networking platform being designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerating the development, deployment and time to market of AI solutions; NVIDIA’s Hopper GPUs and Spectrum-X allowing xAI to push the boundaries of training AI models at a massive scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard are forward-looking statements that are subject to risks and uncertainties that could cause results to be materially different than expectations. 
Important factors that could cause actual results to differ materially include: global economic conditions; our reliance on third parties to manufacture, assemble, package and test our products; the impact of technological development and competition; development of new products and technologies or enhancements to our existing product and technologies; market acceptance of our products or our partners’ products; design, manufacturing or software defects; changes in consumer preferences or demands; changes in industry standards and interfaces; unexpected loss of performance of our products or technologies when integrated into systems; as well as other factors detailed from time to time in the most recent reports NVIDIA files with the Securities and Exchange Commission, or SEC, including, but not limited to, its annual report on Form 10-K and quarterly reports on Form 10-Q. Copies of reports filed with the SEC are posted on the company’s website and are available from NVIDIA without charge. These forward-looking statements are not guarantees of future performance and speak only as of the date hereof, and, except as required by law, NVIDIA disclaims any obligation to update these forward-looking statements to reflect future events or circumstances.
© 2024 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, NVIDIA Spectrum-X and BlueField are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. Features, pricing, availability and specifications are subject to change without notice.
A photo accompanying this announcement is available at https://www.globenewswire.com/NewsRoom/AttachmentNg/32f7e01d-2845-40ac-9a09-2226d1f79ec0
FAQ
How many NVIDIA GPUs does the xAI Colossus supercomputer currently use?
What is the data throughput achieved by NVIDIA's Spectrum-X in the Colossus supercomputer?
How long did it take to build the xAI Colossus supercomputer using NVIDIA technology?