The dirty secret of high performance computing


In the decades since Seymour Cray developed what is widely regarded as the world’s first supercomputer, the CDC 6600, there has been an arms race in the high performance computing (HPC) community. The goal: to improve performance, in any way, at any cost.

Powered by advances in compute, storage, networking and software, the performance of leading systems has grown a trillion times since the unveiling of the CDC 6600 in 1964, from millions of floating point operations per second (megaFLOPS) to quintillions (exaFLOPS).

The current holder of the crown, a colossal US supercomputer called Frontier, is able to achieve 1.102 exaFLOPS on the High Performance Linpack (HPL) benchmark, although it is suspected that even more powerful machines are in operation elsewhere, behind closed doors.

The advent of so-called exascale supercomputers is expected to benefit virtually all sectors — from science to cybersecurity, healthcare to finance — and pave the way for powerful new AI models that would otherwise take years to train.

The CDC 6600, widely regarded as the world’s first supercomputer. Credit: Computer History Museum

However, an increase in speeds of this magnitude comes at a price: energy consumption. At full throttle, Frontier draws up to 40MW of power, roughly as much as 40 million desktop PCs.

Supercomputing has always been about pushing the boundaries of the possible. But as the need to minimize emissions becomes more apparent and energy prices continue to rise, the HPC industry will need to re-evaluate whether its original guiding principle is still worth following.

Performance vs Efficiency

One organization leading the way in this area is the University of Cambridge, which has partnered with Dell Technologies to develop multiple supercomputers with energy efficiency at the forefront of design.

The Wilkes3, for example, sits only 100th in the overall performance charts, but ranks third in the Green500, a ranking of HPC systems based on performance per watt of energy consumed.

In conversation with TechRadar Pro, Dr. Paul Calleja, director of Research Computing Services at the University of Cambridge, explained that the institution is far more concerned with building highly productive and efficient machines than extremely powerful ones.

“We are not really interested in large systems, because they are very specific point solutions. But the technologies deployed within them are much more widely applicable and will allow systems an order of magnitude slower to operate in a much more cost- and energy-efficient manner,” said Dr. Calleja.

“That way you democratize access to computing for many more people. We’re interested in using technologies designed for those big flagship systems to create much more sustainable supercomputers for a wider audience.”

The Wilkes3 supercomputer may not be the fastest in the world, but it is one of the most energy-efficient. Credit: University of Cambridge

In the coming years, Dr. Calleja also anticipates an increasingly strong drive for energy efficiency in the HPC sector and the wider data center community, where energy consumption accounts for more than 90% of costs, we are told.

Recent fluctuations in the price of energy due to the war in Ukraine have also made running supercomputers dramatically more expensive, particularly in the context of exascale computing, further illustrating the importance of performance per watt.

In the context of Wilkes3, the university found that there were a number of optimizations that helped improve the efficiency level. For example, by reducing the clock speed at which some components ran depending on the workload, the team was able to reduce energy consumption by 20-30%.

“Within a given architecture family, clock speed has a linear relationship with performance, but a quadratic relationship with power consumption. That’s the killer,” Dr. Calleja explained.

“Reducing the clock speed reduces power consumption much faster than performance, but also increases the time it takes to complete a task. So what we need to look at is not the power consumption during a run, but the actual power consumption per task. There is a sweet spot.”
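As a rough illustration of that sweet spot, the toy model below assumes, per Dr. Calleja’s comments, that runtime scales linearly with the inverse of clock speed and dynamic power quadratically with clock speed, plus a fixed platform draw added purely for illustration (fans, memory and other components that consume power regardless of clock). None of the figures are measurements from Wilkes3; they simply show why energy per task bottoms out at an intermediate frequency.

```cpp
#include <cstdio>

// Toy model of the clock-speed "sweet spot" described above. Assumptions:
// runtime ~ 1/frequency, dynamic power ~ frequency^2, plus a fixed platform
// draw. All numbers are illustrative, not measurements from Wilkes3.
int main() {
    const double nominal_ghz       = 2.0;    // assumed nominal clock
    const double dynamic_power_w   = 300.0;  // assumed dynamic power at nominal clock
    const double platform_power_w  = 100.0;  // assumed fixed draw (fans, memory, etc.)
    const double nominal_runtime_s = 1000.0; // assumed task runtime at nominal clock

    for (double f = 1.0; f <= 3.01; f += 0.25) {
        double s       = f / nominal_ghz;                             // relative clock speed
        double runtime = nominal_runtime_s / s;                       // performance linear in f
        double power   = platform_power_w + dynamic_power_w * s * s;  // power quadratic in f
        double energy  = power * runtime / 1000.0;                    // kJ per completed task
        std::printf("%.2f GHz: %7.1f s  %6.1f W  %7.1f kJ/task\n",
                    f, runtime, power, energy);
    }
    return 0;
}
```

Under these assumed numbers, energy per task falls as the clock drops below nominal and rises again once the fixed draw and longer runtime dominate, which is the kind of sweet spot the quote describes.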

Software is king

In addition to fine-tuning hardware configurations for specific workloads, there are also a number of optimizations that need to be made elsewhere, in the context of storage and networking, and in connected disciplines such as cooling and rack design.

However, when asked where specifically he would like to see resources devoted in order to improve energy efficiency, Dr. Calleja explains that the focus should primarily be on software.

“The hardware is not the problem, it is the efficiency of the application that matters. This will be the main bottleneck going forward,” he said. “Current exascale systems are based on GPU architectures and the number of applications that can run efficiently at scale in GPU systems is small.”

“To really take advantage of today’s technology, we need to pay a lot of attention to application development. The development life cycle spans decades; software in use today was developed 20-30 years ago, and it’s hard when you have such long-lived code that has to be redesigned.”

The problem, however, is that the HPC industry has not made a habit of thinking software first. Historically, much more attention has been paid to the hardware because, in the words of Dr. Calleja, “it’s easy; you just buy a faster chip. You don’t have to think smart.”

“Although we had Moore’s Law, with processor performance doubling every eighteen months, you didn’t have to do anything [on a software level] to increase performance. But those days are over. If we want progress now, we have to go back and redesign the software.”

With Moore’s law beginning to waver, advances in CPU architecture can no longer be relied upon as a source of performance gains. Credit: Alexander_Safonov/Shutterstock

Dr. Calleja gave some credit to Intel in this regard. As the server hardware space becomes more diverse from a vendor perspective (a positive development in most respects), application compatibility risks becoming an issue, but Intel is working on a fix.

“One differentiator I see for Intel is that it invests a lot [of both funds and time] in the oneAPI ecosystem, to develop code portability between silicon types. It is these kinds of toolchains that we need to enable tomorrow’s applications to take advantage of emerging silicon,” he notes.
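To make that idea a little more concrete, below is a minimal, hedged sketch in SYCL, the open standard that oneAPI’s DPC++ compiler implements; the same source can be compiled to run on GPUs or CPUs from different vendors, depending on which backend and device the runtime exposes. It is a generic vector-add illustration, not code taken from the Cambridge systems or from Intel.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    // The queue picks whatever device the runtime exposes (GPU, CPU, ...),
    // which is what makes the same source portable across vendors' silicon.
    sycl::queue q;
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    {
        sycl::buffer<float> bufA(a.data(), sycl::range<1>(n));
        sycl::buffer<float> bufB(b.data(), sycl::range<1>(n));
        sycl::buffer<float> bufC(c.data(), sycl::range<1>(n));

        q.submit([&](sycl::handler& h) {
            sycl::accessor A(bufA, h, sycl::read_only);
            sycl::accessor B(bufB, h, sycl::read_only);
            sycl::accessor C(bufC, h, sycl::write_only);
            // Simple vector add executed on whichever device was selected.
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    } // Buffers go out of scope here, copying results back to the host vectors.

    std::cout << "c[0] = " << c[0] << "\n"; // expect 3
    return 0;
}
```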

Apart from that, Dr. Calleja calls for a stronger focus on “scientific need”. Too often, things get “lost in translation”, creating a mismatch between hardware and software architectures and the actual needs of the end user.

A more energetic approach to cross-sector collaboration, he says, would create a “virtuous circle” of users, service providers and vendors, which would translate into benefits from both a performance and an efficiency perspective.

A Zettascale Future

Naturally, with the symbolic milestone of exascale having fallen, attention will now turn to the next: zettascale.

“Zettascale is just the next flag in the ground,” said Dr. Calleja, “a totem highlighting the technologies needed to reach the next milestone in computing progress, unattainable today.”

“The fastest systems in the world are extremely expensive for what you get, in terms of scientific output. But they are important because they demonstrate the art of the possible and move the industry forward.”

Pembroke College, University of Cambridge, headquarters of the Open Zettascale Lab. Credit: University of Cambridge

Whether systems capable of achieving one zettaFLOPS of performance, a thousand times more powerful than the current crop, can be developed in a way that aligns with sustainability goals depends on industry ingenuity.

There is no binary trade-off between performance and energy efficiency, but a healthy dose of craft will be required in every subdiscipline to deliver the necessary performance increase within an appropriate power envelope.

In theory, there is a golden ratio between performance and energy consumption, at which the benefits HPC brings to society can be said to justify the carbon emissions it expends.

The exact figure will of course remain elusive in practice, but pursuing the idea is itself a step in the right direction.
