- Slim-Llama reduces power requirements using binary/ternary quantization
- Achieves a 4.59x energy-efficiency improvement and draws 4.69 mW at 25 MHz, rising to 82.07 mW at 200 MHz
- Supports models with up to 3 billion parameters, with 489 ms latency on Llama 1bit
Large language models (LLMs) often consume excessive power because of frequent external memory accesses. Researchers at the Korea Advanced Institute of Science and Technology (KAIST) have now developed Slim-Llama, an application-specific integrated circuit (ASIC) designed to address this problem through aggressive quantization and careful data management.
Slim-Llama uses binary/ternary quantization, which reduces the precision of model weights to just 1 or 2 bits, significantly reducing computation and memory requirements.
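To make the idea concrete, here is a minimal Python sketch of binary and ternary weight quantization in general. It is not KAIST's implementation: the function names, the shared per-vector scale, and the 0.7 × mean-absolute-value threshold (a heuristic used in some ternary-network work) are all assumptions chosen for illustration.

```python
def binarize(weights):
    """Map every weight to -1 or +1 (1 bit each) plus one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [1 if w >= 0 else -1 for w in weights]
    return codes, scale


def ternarize(weights, threshold_ratio=0.7):
    """Map every weight to -1, 0 or +1 (2 bits each) plus one shared scale.

    threshold_ratio is an assumed heuristic, not a Slim-Llama parameter.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def ternary_dot(codes, scale, activations):
    """Dot product that only adds, subtracts, or skips; one multiply at the end."""
    acc = 0.0
    for code, act in zip(codes, activations):
        if code == 1:
            acc += act
        elif code == -1:
            acc -= act
        # code == 0: the weight was pruned to zero, so no work is done
    return scale * acc


if __name__ == "__main__":
    w = [0.9, -0.05, -0.8, 0.1]
    print(binarize(w))
    print(ternarize(w))
```

With weights restricted to these few values, each multiply-accumulate becomes an add, a subtract, or nothing at all, which is where the reduction in computation and memory comes from.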
To further improve efficiency, it integrates a Sparsity-aware Look-up Table, which speeds up the processing of sparse data and cuts unnecessary calculations. The design also includes an output reuse scheme and index vector reordering, which minimize redundant operations and improve data flow efficiency.
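The article does not describe the chip's look-up table in detail, so the following Python sketch only illustrates the general family of techniques: a look-up-table matrix-vector product for ternary weights that skips all-zero weight groups. The group size of 4, the `(pos_mask, neg_mask)` encoding, and all function names are hypothetical and should not be read as Slim-Llama's actual design.

```python
def subset_sum_table(group):
    """table[mask] = sum of group[j] for every bit j set in mask."""
    table = [0.0] * (1 << len(group))
    for mask in range(1, len(table)):
        low = mask & -mask              # lowest set bit of mask
        j = low.bit_length() - 1        # index of that bit
        table[mask] = table[mask ^ low] + group[j]
    return table


def lut_matvec(rows, activations, group_size=4):
    """Ternary matrix-vector product using look-up tables.

    rows: one list per output, holding a (pos_mask, neg_mask) pair per group.
          Bit j of pos_mask marks a +1 weight, bit j of neg_mask a -1 weight.
    Assumes len(activations) is a multiple of group_size.
    """
    n_groups = len(activations) // group_size
    # Each table is built once per activation group and reused by every row.
    tables = [
        subset_sum_table(activations[g * group_size:(g + 1) * group_size])
        for g in range(n_groups)
    ]
    out = []
    for row in rows:
        acc = 0.0
        for table, (pos, neg) in zip(tables, row):
            if pos == 0 and neg == 0:
                continue                # all-zero weight group: skip the look-up
            acc += table[pos] - table[neg]
        out.append(acc)
    return out


if __name__ == "__main__":
    acts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    rows = [
        [(0b0001, 0b0100), (0, 0)],     # weights [+1, 0, -1, 0 | 0, 0, 0, 0]
        [(0, 0), (0b0011, 0b1100)],     # weights [0, 0, 0, 0 | +1, +1, -1, -1]
    ]
    print(lut_matvec(rows, acts))
```

The design choice this illustrates is that partial sums are computed once per group of activations and then shared across all output rows, while groups whose weights are all zero cost nothing.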
Reduced dependence on external memory
According to the team, the technology shows a 4.59x improvement in benchmark energy efficiency compared to previous state-of-the-art solutions.
Slim-Llama's system power consumption is just 4.69 mW at 25 MHz and rises to 82.07 mW at 200 MHz, maintaining energy efficiency even at higher frequencies. It reaches a peak performance of up to 4.92 TOPS at 1.31 TOPS/W, which further underlines its efficiency.
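As a rough way to read the two power figures, the Python snippet below converts them into average energy per clock cycle. This is back-of-the-envelope arithmetic on the numbers quoted above, not a figure reported by KAIST, and the helper name is made up for illustration.

```python
def energy_per_cycle_pj(power_mw: float, freq_mhz: float) -> float:
    """Average energy per clock cycle in picojoules (mW / MHz = nJ)."""
    return power_mw / freq_mhz * 1000.0


low = energy_per_cycle_pj(4.69, 25)      # about 187.6 pJ per cycle
high = energy_per_cycle_pj(82.07, 200)   # about 410.35 pJ per cycle
print(f"{low:.2f} pJ at 25 MHz, {high:.2f} pJ at 200 MHz")
```

Frequency rises 8x between the two operating points while power rises about 17.5x. This is a per-cycle figure only; the efficiency claims in the article are stated in TOPS/W, which this sketch does not model.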
The chip has a total area of 20.25 mm² and is built on Samsung’s 28 nm CMOS technology. With 500 KB of on-chip SRAM, Slim-Llama reduces dependence on external memory, significantly cutting the energy costs associated with data movement. The system supports an external bandwidth of 1.6 GB/s at 200 MHz.
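For a sense of scale, the Python sketch below estimates how large the weights of a 3-billion-parameter model are at 1 and 1.5 bits per weight, and how long one pass over them would take at 1.6 GB/s. It assumes exactly 3 × 10⁹ parameters, decimal units (1 MB = 10⁶ bytes), and that every weight is streamed from external memory once; it ignores activations and any other data, so treat the results as rough orders of magnitude rather than measured values.

```python
def weight_megabytes(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in MB (10**6 bytes) for a given bit width."""
    return num_params * bits_per_weight / 8 / 1e6


def stream_seconds(megabytes: float, bandwidth_gb_per_s: float) -> float:
    """Time to move `megabytes` once over a link of the given bandwidth."""
    return megabytes * 1e6 / (bandwidth_gb_per_s * 1e9)


for bits in (1.0, 1.5):
    size_mb = weight_megabytes(3e9, bits)
    seconds = stream_seconds(size_mb, 1.6)
    print(f"{bits} bit/weight: {size_mb:.1f} MB, {seconds * 1000:.0f} ms per pass")
```

Even at 1 bit per weight, the model is far larger than 500 KB of on-chip SRAM, so weights still have to come from external memory; the low bit width is what keeps that traffic manageable.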
Slim-Llama supports models such as Llama 1bit and Llama 1.5bit, with up to 3 billion parameters, and KAIST says it delivers benchmark performance that meets the demands of modern AI applications. With a latency of 489 ms for the Llama 1bit model, Slim-Llama is the first ASIC to run models with billions of parameters at such low power consumption.
Although still in its early stages, this breakthrough in energy-efficient computing could pave the way for more sustainable and accessible AI hardware, meeting the growing demand for efficient LLM deployment. The KAIST team will reveal more about Slim-Llama on Wednesday, February 19, at the 2025 IEEE International Solid-State Circuits Conference in San Francisco.