Cerebras has unveiled its latest AI inference chip, which is being touted as a formidable competitor to Nvidia’s DGX100.
The chip features 44GB of super-fast memory, allowing it to handle AI models with billions to trillions of parameters.
For models that exceed the memory capacity of a single wafer, Cerebras can split them at layer boundaries and distribute them across multiple CS-3 systems. A single CS-3 system can hold 20 billion parameter models, while 70 billion parameter models can be managed by just four systems.
More model support coming soon
Cerebras emphasizes the use of 16-bit model weights to maintain accuracy, unlike some competitors who reduce weight precision to 8-bit, which can degrade performance. According to Cerebras, the 16-bit models perform up to 5% better in multi-turn conversations, math, and reasoning tasks compared to 8-bit models, resulting in more accurate and reliable outputs.
The Cerebras inference platform is available via chat and API access, and is designed to be easily integrated by developers familiar with OpenAI’s Chat Completions format. The platform can execute Llama3.1 70B models at 450 tokens per second, making it the only solution to achieve instantaneous speed for such large models. For developers, Cerebras is offering 1 million free tokens daily at launch, with pricing for large-scale deployments reportedly significantly lower than popular GPU clouds.
Cerebras is initially launching with Llama3.1 8B and 70B models, with plans to add support for larger models such as Llama3 405B and Mistral Large 2 in the near future. The company emphasizes that fast inference capabilities are crucial for enabling more complex AI workflows and improving real-time LLM intelligence, particularly in techniques such as scaffolding, which require significant token usage.
Patrick Kennedy of ServeTheHome saw the product in action at the recent Hot Chips 2024 symposium and noted, “I had the opportunity to sit down with Andrew Feldman (CEO of Cerebras) for the talk and he showed me the demos live. It’s ridiculously fast. The reason this is important is not just so that humans can drive interaction. Instead, in an agent world where computer AI agents are talking to multiple other computer AI agents. Imagine it takes seconds for each agent to come up with output, and there are multiple steps in that pipeline. If you think about automated AI agent pipelines, you need fast inference to reduce the time for the entire chain.”
Cerebras positions its platform as a new standard in open LLM development and deployment, with record-breaking performance, competitive pricing, and broad API access. You can try it out by visiting inference.cerebras.ai or by scanning the QR code in the slide below.