The world’s most powerful supercomputer used just over 8% of the GPUs it is equipped with to train a large language model (LLM) containing one trillion parameters – comparable to OpenAI’s GPT-4.
Frontier, based at Oak Ridge National Laboratory, used 3,072 of its AMD Instinct MI250X GPUs to train an AI system at the trillion-parameter scale, and it used 1,024 of these GPUs (about 2.7%) to train a model with 175 billion parameters, essentially the same size as ChatGPT.
According to their paper, the researchers needed at least 14 TB of RAM to achieve these results, but each MI250X GPU only has 64 GB of VRAM, meaning they had to group many GPUs together. Doing so, however, introduced a new challenge: parallelism. The more hardware is devoted to training the LLM, the more efficiently the GPUs have to communicate with one another.
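To put those memory figures in perspective, here is a rough back-of-the-envelope check in Python. The 14 TB and 64 GB numbers come from the article above; everything else is an illustrative assumption.

```python
# Back-of-the-envelope check of why the model cannot fit on a single GPU.
# The 14 TB and 64 GB figures come from the article; the rest is illustrative.

model_memory_tb = 14          # total memory the researchers reported needing
gpu_memory_gb = 64            # VRAM available per GPU, as stated above

model_memory_gb = model_memory_tb * 1000   # using decimal TB for simplicity

# Minimum number of GPUs needed just to hold the model state,
# ignoring communication buffers and other per-GPU overhead.
min_gpus = -(-model_memory_gb // gpu_memory_gb)   # ceiling division

print(f"At least {min_gpus} GPUs are needed to hold {model_memory_tb} TB of model state")
# -> At least 219 GPUs are needed to hold 14 TB of model state
```

In other words, hundreds of GPUs are required before training can even begin, purely to hold the model in memory.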
Putting the world’s most powerful supercomputer to work
LLMs are not typically trained on supercomputers; they are trained on specialized servers and require many more GPUs. ChatGPT, for example, was reportedly trained on more than 20,000 GPUs, according to TrendForce. The researchers wanted to show whether they could train an LLM much faster and more efficiently by using various techniques made possible by the supercomputer's architecture.
The scientists used a combination of tensor parallelism – groups of GPUs sharing slices of the same tensor – and pipeline parallelism – groups of GPUs hosting adjacent layers of the model. They also used data parallelism to process a larger number of tokens simultaneously across a larger pool of computing resources. The overall effect was much shorter training times.
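The sketch below shows, in plain Python, how a pool of GPUs can be split along those three dimensions. Only the 3,072-GPU total comes from the article; the individual group sizes are illustrative assumptions, not the configuration reported in the paper.

```python
# A minimal sketch of splitting a pool of GPUs into tensor-, pipeline- and
# data-parallel groups. Group sizes are assumptions for illustration only.

WORLD_SIZE = 3072        # GPUs used for the trillion-parameter run (from the article)
TENSOR_PARALLEL = 8      # GPUs sharing slices of the same tensor (assumed)
PIPELINE_PARALLEL = 16   # groups hosting adjacent layers of the model (assumed)

# Whatever is left over becomes the data-parallel dimension: each replica
# processes a different shard of the training tokens.
DATA_PARALLEL = WORLD_SIZE // (TENSOR_PARALLEL * PIPELINE_PARALLEL)  # -> 24

def rank_to_coords(rank: int) -> dict:
    """Map a flat GPU rank to its (tensor, pipeline, data) coordinates."""
    tensor_rank = rank % TENSOR_PARALLEL
    pipeline_rank = (rank // TENSOR_PARALLEL) % PIPELINE_PARALLEL
    data_rank = rank // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
    return {"tensor": tensor_rank, "pipeline": pipeline_rank, "data": data_rank}

if __name__ == "__main__":
    assert TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL == WORLD_SIZE
    print(rank_to_coords(0))      # {'tensor': 0, 'pipeline': 0, 'data': 0}
    print(rank_to_coords(3071))   # {'tensor': 7, 'pipeline': 15, 'data': 23}
```

Each GPU thus belongs to one group per dimension: it exchanges tensor slices with its tensor group, passes activations to the next pipeline stage, and averages gradients with its data-parallel peers.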
For the 22-billion-parameter model they achieved 38.38% of peak throughput (73.5 TFLOPS), 36.14% (69.2 TFLOPS) for the 175-billion-parameter model, and 31.96% (61.2 TFLOPS) for the one-trillion-parameter model.
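The three figures are mutually consistent, assuming the TFLOPS numbers are per GPU and the percentages are fractions of each GPU's theoretical peak (an assumption, not something the article states). A quick check:

```python
# Consistency check of the throughput figures quoted above, under the
# assumption that each percentage is a fraction of a per-GPU theoretical peak.

results = {
    "22B":  (73.5, 0.3838),
    "175B": (69.2, 0.3614),
    "1T":   (61.2, 0.3196),
}

for model, (achieved_tflops, fraction_of_peak) in results.items():
    implied_peak = achieved_tflops / fraction_of_peak
    print(f"{model}: {achieved_tflops} TFLOPS at {fraction_of_peak:.2%} "
          f"of peak -> implied peak ~{implied_peak:.1f} TFLOPS")

# All three runs imply the same per-GPU peak of roughly 191.5 TFLOPS.
```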
They also achieved 100% weak scaling efficiency, as well as 89.93% strong scaling efficiency for the 175-billion-parameter model and 87.05% strong scaling efficiency for the one-trillion-parameter model.
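For readers unfamiliar with the two metrics, the definitions are easy to write out as code. The timings in the example are made-up numbers purely to illustrate the formulas; they are not measurements from the paper.

```python
# Definitions of the two scaling metrics quoted above.
# The example timings are invented for illustration, not taken from the paper.

def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_scaled: float, n_scaled: int) -> float:
    """Fixed total problem size: ideal speedup equals the increase in GPU count."""
    speedup = t_base / t_scaled
    ideal_speedup = n_scaled / n_base
    return speedup / ideal_speedup

def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Problem size grows with GPU count: ideally time per step stays constant."""
    return t_base / t_scaled

# Made-up example: doubling the GPUs cuts step time from 10 s to 5.6 s
# at a fixed problem size -> ~89% strong scaling efficiency.
print(f"{strong_scaling_efficiency(10.0, 1024, 5.6, 2048):.2%}")   # 89.29%
```

Hitting 100% weak scaling means that growing the workload in step with the GPU count cost essentially nothing in per-GPU performance.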
While the researchers were open about the computing resources used and the techniques involved, they neglected to mention the timescales involved in training an LLM in this way.
Ny Breaking asked the researchers for the training times, but at the time of writing they had not responded.