Alibaba unveils the network and data center design it uses to train large language models

Alibaba has unveiled the data center design behind its LLM training program: an Ethernet-based network in which each host contains eight GPUs and nine NICs, with each NIC providing two 200 Gb/sec ports.

The tech giant, which also offers one of the leading large language models (LLMs) through its 110-billion-parameter Qwen model, says the design has been in production for eight months and aims to maximize the use of each GPU's PCIe capabilities to increase the network's transmit/receive capacity.
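A back-of-the-envelope sketch of what those figures imply for per-host bandwidth, using only the numbers quoted above (nine NICs per host, two 200 Gb/sec ports per NIC); this is illustrative arithmetic, not Alibaba's own tooling:

```python
# Per-host network capacity implied by the article's figures.
# Assumption: all NIC ports can be driven at line rate simultaneously.

NICS_PER_HOST = 9
PORTS_PER_NIC = 2
PORT_SPEED_GBPS = 200  # Gb/sec per port

total_gbps = NICS_PER_HOST * PORTS_PER_NIC * PORT_SPEED_GBPS
print(f"Aggregate NIC capacity per host: {total_gbps} Gb/s "
      f"({total_gbps / 1000:.1f} Tb/s)")
```

At face value that works out to 3,600 Gb/sec (3.6 Tb/sec) of raw NIC capacity per eight-GPU host, which illustrates why the design focuses on keeping the GPUs' PCIe links saturated.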