Meta sheds more light on how it’s evolving Llama 3 training – it’s relying on almost 50,000 Nvidia H100 GPUs for now, but how long before Meta switches to its own AI chip?
Meta has revealed details about its AI training infrastructure, revealing that it currently relies on nearly 50,000 Nvidia H100 GPUs to train its open source Llama 3 LLM.
The company says it will have more than 350,000 Nvidia H100 GPUs in use by the end of 2024, and the computing power will be equivalent to nearly 600,000 H100s when combined with hardware from other sources.
The figures were revealed as Meta shared details of its data center clusters with 24,576 GPUs.
The company explained: “These clusters support our current and next generation AI models, including Llama 3, the successor to Llama 2, our publicly released LLM, as well as AI research and development in GenAI and other areas.”
The clusters are built on Grand Teton (named after the National Park in Wyoming), an internally designed, open GPU hardware platform. Grand Teton integrates power, control, compute and fabric interfaces into a single chassis for better overall performance and scalability.
The clusters also have powerful network structures, allowing them to support larger and more complex models than before. Meta says one cluster uses a direct remote memory access networking solution based on the Arista 7800, while the other has an NVIDIA Quantum2 InfiniBand fabric. Both solutions connect 400 Gbps endpoints together.
“The efficiency of the high-performance networking structures within these clusters, some of the key storage decisions, combined with the 24,576 NVIDIA Tensor Core H100 GPUs in each, ensure that both cluster versions can support models larger and more complex than could be supported in the RSC and pave the way for advancements in GenAI product development and AI research,” said Meg.
Storage is another crucial aspect of AI training, and Meta has developed a Linux file system in Userspace, supported by a version of its ‘Tectonic’ distributed storage solution optimized for Flash media. This solution reportedly allows thousands of GPUs to store and load checkpoints in a synchronized manner, in addition to “providing flexible and high-throughput exabyte-scale storage necessary for data loading.”
While the company’s current AI infrastructure relies heavily on Nvidia GPUs, it’s unclear how long this will continue. As Meta continues to develop its AI capabilities, it will inevitably focus on developing and manufacturing more proprietary hardware. Meta has already announced plans to use its own AI chips, called Artemis, in servers this year, and the company previously announced that it is gearing up to produce custom RISC-V silicon.