Tech companies are shifting their focus from building the largest language models (LLMs) to developing smaller language models (SLMs) that match or even surpass them.
Meta’s Llama 3 (400 billion parameters), OpenAI’s GPT-3.5 (175 billion parameters), and GPT-4 (an estimated 1.8 trillion parameters) are notoriously large models, while Microsoft’s Phi-3 family ranges from 3.8 billion to 14 billion parameters and Apple Intelligence’s on-device model has ‘only’ around 3 billion parameters.
It may seem like a downgrade to have models with far fewer parameters, but the appeal of SLMs is understandable. They consume less power, can be run locally on devices like smartphones and laptops, and are a good choice for smaller companies and labs that can’t afford expensive hardware setups.
David vs. Goliath
As IEEE Spectrum reports: “The rise of SLMs comes at a time when the performance gap between LLMs is rapidly closing and technology companies are looking to deviate from standard scaling laws and explore other avenues for performance improvements.”
In a recent round of tests performed by Microsoft, Phi-3-mini, the tech giant’s smallest model with 3.8 billion parameters, rivaled Mixtral (8x7 billion parameters) and GPT-3.5 in some areas, despite being small enough to fit on a phone. Microsoft attributes its success to the training dataset, which consisted of “heavily filtered publicly available web data and synthetic data.”
Although SLMs achieve levels of language comprehension and reasoning similar to much larger models, their size still limits them on certain tasks: they simply cannot store as much “factual” knowledge. This limitation can be addressed by pairing the SLM with an online search engine, as sketched below.
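To make that idea concrete, here is a minimal, hypothetical Python sketch of such a pairing. The `web_search` and `slm_generate` functions are placeholders standing in for a real search API and a locally running small model; they are assumptions for illustration, not actual libraries.

```python
# Minimal sketch: let a search engine supply the facts, and let the small
# model do only the language and reasoning work over those facts.

def web_search(query: str, top_k: int = 3) -> list[str]:
    """Placeholder for a real search API: return top_k text snippets for `query`."""
    return [f"(snippet {i + 1} about: {query})" for i in range(top_k)]

def slm_generate(prompt: str) -> str:
    """Placeholder for a locally running SLM: return its completion for `prompt`."""
    return "(answer grounded in the snippets above)"

def answer_with_search(question: str) -> str:
    # Fetch fresh or obscure facts the small model cannot hold in its parameters.
    snippets = web_search(question)
    context = "\n".join(snippets)
    # Ask the SLM to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return slm_generate(prompt)

if __name__ == "__main__":
    print(answer_with_search("Who won the most recent Nobel Prize in Physics?"))
```

The division of labor is the point: the search step compensates for the limited “factual” memory of a few billion parameters, while the SLM remains small enough to run on a phone or laptop.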
IEEE Spectrum’s Shubham Agarwal compares SLMs to the way children learn language, saying, “By the time kids turn 13, they’re exposed to about 100 million words and are better than chatbots at language, with access to just 0.01 percent of the data.” Although, as Agarwal notes, “nobody knows what makes humans so much more efficient,” Alex Warstadt, a computer scientist at ETH Zurich, suggests that “reverse engineering efficient human-like learning at small scales can lead to dramatic improvements when scaled to LLM scales.”