Nvidia’s Biggest Rival Destroys Cloud Giants in AI Performance Once Again; Cerebras Inference is 75x faster than AWS, 32x faster than Google on Llama 3.1 405B
- Cerebras achieves 969 tokens/second on Llama 3.1 405B, 75x faster than AWS
- Claims the industry's lowest latency at 240ms, nearly twice as fast as Google Vertex
- Cerebras Inference runs on the CS-3 with the WSE-3 AI processor
Cerebras Systems says it has set a new benchmark in AI performance with Meta’s Llama 3.1 405B model, achieving an unprecedented generation rate of 969 tokens per second.
Third-party benchmarking firm Artificial Analysis reports that this performance is up to 75 times faster than GPU-based offerings from major hyperscalers. It was almost six times faster than SambaNova at 164 tokens per second, 32 times faster than Google Vertex at 30 tokens per second, and far surpassed Azure at just 20 tokens per second and AWS at 13 tokens per second.
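As a quick sanity check, the short Python sketch below reproduces those multipliers from the throughput figures quoted above; the figures come from Artificial Analysis, while the script itself is purely illustrative.

```python
# Sanity check of the speedup multipliers, using the Artificial Analysis
# throughput figures (tokens per second) quoted above.
throughput = {
    "Cerebras": 969,
    "SambaNova": 164,
    "Google Vertex": 30,
    "Azure": 20,
    "AWS": 13,
}

for provider, tps in throughput.items():
    print(f"Cerebras vs {provider}: {throughput['Cerebras'] / tps:.1f}x")
# Cerebras vs AWS: 74.5x (~75x); vs Google Vertex: 32.3x; vs SambaNova: 5.9x
```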
Furthermore, the system demonstrated the world's fastest time to first token, clocking in at just 240 milliseconds – almost twice as fast as Google Vertex at 430 milliseconds and well ahead of AWS at 1,770 milliseconds.
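To see how throughput and time to first token combine, here is a simple illustrative model of end-to-end response time: total time is assumed to be TTFT plus output tokens divided by generation rate, which deliberately ignores network overhead and batching.

```python
# Illustrative end-to-end latency model: time to first token (TTFT) plus
# generation time for the output tokens. This deliberately ignores network
# overhead and batching, so treat the results as rough estimates only.
def response_time(ttft_ms: float, tokens_per_sec: float, output_tokens: int) -> float:
    """Estimated end-to-end response time in seconds."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec

# A 1,000-token response, using the figures reported in this article:
print(f"Cerebras: {response_time(240, 969, 1000):5.1f}s")  # ~1.3s
print(f"AWS:      {response_time(1770, 13, 1000):5.1f}s")  # ~78.7s
```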
Extending its lead
“Cerebras holds the world record in Llama 3.1 8B and 70B performance, and with this announcement we extend our lead to Llama 3.1 405B – which delivers 969 tokens per second,” said Andrew Feldman, co-founder and CEO of Cerebras.
“By running the largest models at instantaneous speed, Cerebras enables real-time responses from the world’s leading open frontier model. This opens up powerful new use cases, including reasoning and multi-agent collaboration, across the AI landscape.”
The Cerebras Inference system, powered by the CS-3 supercomputer and its Wafer Scale Engine 3 (WSE-3), supports the model's full 128K context length at 16-bit precision. Billed as the “world’s fastest AI chip,” the WSE-3 features 44 GB of on-chip SRAM, four trillion transistors, and 900,000 AI-optimized cores. It delivers a peak AI performance of 125 petaflops and offers 7,000 times the memory bandwidth of the Nvidia H100.
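A rough back-of-envelope calculation hints at why that bandwidth matters. Assuming dense autoregressive decoding, where generating each token reads every one of the model's weights once (a simplifying assumption on our part, not a Cerebras figure), the implied memory traffic at 969 tokens per second is enormous:

```python
# Back-of-envelope estimate of the memory traffic behind 969 tokens/s.
# Assumption (ours, not Cerebras'): dense autoregressive decoding reads
# every weight once per generated token, ignoring KV-cache traffic.
params = 405e9          # Llama 3.1 405B parameters
bytes_per_param = 2     # 16-bit precision, per the article
tokens_per_sec = 969    # reported generation rate

weights_gb = params * bytes_per_param / 1e9         # ~810 GB of weights
bandwidth_tb_s = weights_gb * tokens_per_sec / 1e3  # TB/s of weight reads
print(f"Weights: {weights_gb:.0f} GB; implied read bandwidth: {bandwidth_tb_s:.0f} TB/s")
# ~810 GB and ~785 TB/s: far beyond any single GPU's memory bandwidth,
# which is where wafer-scale on-chip SRAM earns its keep.
```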
Meta’s GenAI VP Ahmad Al-Dahle also praised Cerebras’ latest results, saying: “Scaling inference is critical to accelerating AI and open source innovation. Thanks to the incredible work of the Cerebras team, the Llama 3.1 405B is now the fastest model in the world. Through the power of Llama and our open approach, super-fast and affordable inference is now within reach for more developers than ever before.”
Customer testing for the system is underway, with general availability planned for the first quarter of 2025. Pricing starts at $6 per million input tokens and $12 per million output tokens.
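For a sense of what those rates mean in practice, here is a minimal cost sketch; the request sizes in the example are arbitrary, not figures from Cerebras.

```python
# Cost sketch at the quoted rates: $6 per million input tokens and
# $12 per million output tokens. The token counts below are arbitrary
# examples, not figures from Cerebras.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in dollars for a single request."""
    return input_tokens / 1e6 * 6.00 + output_tokens / 1e6 * 12.00

# e.g. a 10,000-token prompt that produces a 1,000-token reply:
print(f"${request_cost(10_000, 1_000):.3f}")  # $0.072
```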