100x less compute power with GPT-level LLM performance: How a little-known open source project could help solve the GPU power conundrum – RWKV looks promising, but challenges remain
Recurrent neural networks (RNNs) are a class of neural network widely used in deep learning. Unlike feedforward networks, RNNs maintain an internal state, a form of memory that records information about what has been processed so far. In other words, they use what they have seen in previous inputs to shape the output they produce next.
RNNs are called “recurrent” because they perform the same task for each element in a sequence, with the output depending on the previous calculations. RNNs are still used to power smart technologies such as Apple’s Siri and Google Translate.
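To make that concrete, here is a minimal, illustrative sketch in Python of a single recurrent step. The parameter names and sizes are placeholders chosen for the example, not taken from any production system:

    import numpy as np

    # Toy RNN cell: the hidden state h carries a summary of everything seen so far.
    # W_xh, W_hh and b are illustrative placeholder parameters.
    rng = np.random.default_rng(0)
    d_in, d_hidden = 8, 16
    W_xh = rng.normal(size=(d_hidden, d_in)) * 0.1
    W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
    b = np.zeros(d_hidden)

    def rnn_step(h, x):
        # The new state depends on the current input AND the previous state,
        # which is how the network "remembers" earlier elements of the sequence.
        return np.tanh(W_xh @ x + W_hh @ h + b)

    h = np.zeros(d_hidden)                 # empty memory before the sequence starts
    for x in rng.normal(size=(5, d_in)):   # process a 5-step sequence one token at a time
        h = rnn_step(h, x)

The entire history of the sequence is compressed into the fixed-size state h, which is the property that keeps RNN inference cheap.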
However, the advent of transformer-based models such as ChatGPT has changed the natural language processing (NLP) landscape. Transformers revolutionized NLP tasks, but their memory and computational costs grow quadratically with sequence length, so longer contexts demand rapidly increasing resources.
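A rough sketch shows where the quadratic term comes from: standard self-attention compares every token with every other token, so the score matrix alone has sequence-length-squared entries. The dimensions below are arbitrary and chosen only for illustration:

    import numpy as np

    def attention_score_count(seq_len, d_model=64):
        # Self-attention builds a (seq_len x seq_len) score matrix, so memory
        # and compute grow with the square of the sequence length.
        rng = np.random.default_rng(1)
        Q = rng.normal(size=(seq_len, d_model))
        K = rng.normal(size=(seq_len, d_model))
        scores = Q @ K.T                    # shape: (seq_len, seq_len)
        return scores.size

    print(attention_score_count(1_000))    # 1,000,000 score entries
    print(attention_score_count(10_000))   # 100,000,000 entries: 10x the tokens, 100x the scores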
Enter RWKV
Now a new open source project, RWKV, offers a promising answer to the GPU power problem. The project, supported by the Linux Foundation, aims to dramatically reduce the computational requirements for GPT-level large language models (LLMs), potentially by up to 100x.
RNNs scale linearly in memory and compute, but they have struggled to match the performance of transformers because they are harder to parallelize during training and to scale up. This is where RWKV comes into play.
RWKV, or Receptance Weighted Key Value, is a new model architecture that combines the parallelizable training of transformers with the efficient inference of RNNs. The result? A model that requires significantly fewer resources (VRAM, CPU, GPU, etc.) to run and train, while maintaining high-quality performance. It also scales linearly to any context length and generally handles languages other than English better than many comparable models.
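The core inference idea can be illustrated with a heavily simplified sketch: instead of a key–value cache that grows with the context, the model keeps a small fixed-size running state. Everything below (names, shapes, the decay term) is an illustrative assumption; the real RWKV time-mixing adds a bonus weight for the current token, token shifting, channel mixing and numerical safeguards:

    import numpy as np

    d = 16
    rng = np.random.default_rng(2)
    # Per-channel decay controls how quickly old information fades; illustrative values only.
    decay = np.exp(-rng.uniform(0.1, 1.0, size=d))

    def rwkv_like_step(state, k, v):
        # state = (weighted sum of past values, sum of past weights). Its size is fixed,
        # so the cost of each step stays constant no matter how long the context grows.
        num, den = state
        w = np.exp(k)
        out = (num + w * v) / (den + w + 1e-8)
        new_state = (decay * num + w * v, decay * den + w)
        return new_state, out

    state = (np.zeros(d), np.zeros(d))
    for k, v in zip(rng.normal(size=(100, d)), rng.normal(size=(100, d))):
        state, y = rwkv_like_step(state, k, v)

During training the same computation can be unrolled over the whole sequence at once, which is what lets RWKV train in parallel like a transformer while inferring like an RNN.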
Despite these promising features, the RWKV model is not without challenges. It is sensitive to prompt formatting and is weaker at tasks that require looking back over earlier parts of the context. However, these issues are being actively addressed, and the potential benefits of the model far outweigh its current limitations.
The implications of the RWKV project are profound. Instead of requiring 100 GPUs to train an LLM, an RWKV model could produce similar results with fewer than 10. This not only makes the technology more accessible, but also opens the door to further advances.