Apple embraces Nvidia GPUs to accelerate LLM inference via its open source ReDrafter technology
- ReDrafter delivers 2.7x more tokens per second than traditional auto-regressive decoding
- ReDrafter can reduce latency for users while using fewer GPUs
- Apple has not said whether or when ReDrafter will come to competing AI GPUs from AMD and Intel
Apple has announced a partnership with Nvidia to accelerate large language model (LLM) inference using the open source technology Recurrent Drafter (or ReDrafter for short).
The partnership aims to address the computational cost of auto-regressive token generation, in which the model produces only one token per forward pass; cutting that cost is critical for improving efficiency and reducing latency in real-time LLM applications.
Introduced by Apple in November 2024, ReDrafter takes a speculative decoding approach, combining a recurrent neural network (RNN) draft model with beam search and dynamic tree attention. Apple’s benchmarks show that this method generates 2.7x more tokens per second than traditional auto-regressive decoding.
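For readers unfamiliar with speculative decoding, here is a minimal sketch of the general draft-and-verify loop, not Apple’s actual ReDrafter code: a cheap draft model (an RNN in ReDrafter’s case) proposes several tokens, and the expensive target model verifies them, accepting the longest matching prefix. All names and the greedy verification scheme are illustrative assumptions.

```python
# Minimal sketch of greedy speculative decoding (illustrative only; this is
# not Apple's ReDrafter implementation). `draft_model` and `target_model`
# are stand-ins for a cheap RNN drafter and an expensive LLM: each takes a
# token list and returns the next token.

def speculative_decode(target_model, draft_model, prompt, max_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_tokens:
        # 1. Draft: the cheap model proposes k tokens auto-regressively.
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: the target model checks the k drafted positions.
        #    (Simulated here with k calls; on a GPU this is one batched pass.)
        accepted = []
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the target's token and stop accepting.
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_tokens]

# Toy demo: both models follow the same repeating pattern, so every draft
# is accepted and each loop iteration emits k tokens at once.
pattern = [1, 2, 3, 4]
target_model = lambda ctx: pattern[len(ctx) % 4]
draft_model = lambda ctx: pattern[len(ctx) % 4]
print(speculative_decode(target_model, draft_model, [1], 8))
```

Each iteration costs roughly one expensive target-model pass but can emit up to k tokens, which is where throughput gains like the 2.7x figure Apple reports come from; the actual speed-up depends on how often the drafter’s guesses are accepted.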
Could it go beyond Nvidia?
By integrating into Nvidia’s TensorRT-LLM framework, ReDrafter now enables faster LLM inference on the Nvidia GPUs commonly used in production environments.
To accommodate ReDrafter’s algorithms, Nvidia has introduced new operators and modified existing ones within TensorRT-LLM, making the technology available to all developers looking to optimize the performance of large-scale models.
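One pattern such operators must support is the dynamic tree attention mentioned above. The sketch below is an assumption about how tree attention works in general, not Nvidia’s actual TensorRT-LLM kernels: beam-search drafts that share prefixes are merged into a single token tree, and a boolean mask lets each node attend only to its ancestors, so all candidates can be verified in one forward pass of the target model.

```python
# Sketch of the idea behind tree attention (illustrative assumption, not
# Nvidia's TensorRT-LLM operators). Drafts sharing prefixes are merged into
# a tree so the shared tokens are computed once, and an ancestor mask makes
# one batched forward pass behave like separate per-candidate passes.

def build_tree_mask(candidates):
    """candidates: list of draft token sequences from beam search."""
    # Merge shared prefixes into a tree; each node is (token, parent_index).
    nodes, children = [], {}      # children maps (parent, token) -> node id
    for seq in candidates:
        parent = -1               # -1 denotes the common root (the prompt)
        for tok in seq:
            key = (parent, tok)
            if key not in children:
                children[key] = len(nodes)
                nodes.append((tok, parent))
            parent = children[key]
    # Ancestor mask: mask[i][j] is True iff node j is i itself or an ancestor,
    # i.e. the only positions node i may attend to during verification.
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j][1]
    return nodes, mask

# Two beams sharing the prefix [5]: the tree holds 4 nodes instead of 5 tokens.
nodes, mask = build_tree_mask([[5, 7, 9], [5, 8]])
print(nodes)   # [(5, -1), (7, 0), (9, 1), (8, 0)]
```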
In addition to the speed improvements, Apple says ReDrafter has the potential to reduce user latency while requiring fewer GPUs. This efficiency not only lowers computing costs but also cuts energy consumption, a critical factor for organizations managing large-scale AI deployments.
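As a rough back-of-envelope illustration with assumed numbers (neither Apple nor Nvidia has published fleet-sizing figures), a 2.7x per-GPU throughput gain translates directly into fewer GPUs for the same aggregate token rate:

```python
import math

# Back-of-envelope estimate (illustrative only; all inputs are assumptions
# except the 2.7x speed-up figure Apple reports).
baseline_tps_per_gpu = 100   # assumed tokens/sec per GPU, auto-regressive
speedup = 2.7                # Apple's reported ReDrafter throughput gain
target_tps = 10_000          # assumed aggregate tokens/sec a service must sustain

gpus_before = math.ceil(target_tps / baseline_tps_per_gpu)
gpus_after = math.ceil(target_tps / (baseline_tps_per_gpu * speedup))
print(gpus_before, gpus_after)   # 100 -> 38 GPUs for the same throughput
```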
While the focus of this partnership remains on Nvidia’s infrastructure for now, it’s possible that similar performance benefits could be extended to competing GPUs from AMD or Intel at some point in the future.
Breakthroughs like these can help improve the efficiency of machine learning inference. As Nvidia puts it: “This collaboration has made TensorRT-LLM more powerful and flexible, allowing the LLM community to innovate more advanced models and easily deploy them with TensorRT-LLM to achieve unparalleled performance on Nvidia GPUs. These new features open up exciting possibilities, and we eagerly look forward to the next generation of advanced models from the community that leverage the capabilities of TensorRT-LLM, driving further improvements in LLM workloads.”
You can read more about the collaboration with Apple on Nvidia’s developer technical blog.