Evaluating large language models: the ‘better together’ approach

As the GenAI era dawns, the use of large language models (LLMs) has grown exponentially. However, as with any technology in its hype cycle, GenAI practitioners risk neglecting the trustworthiness and accuracy of an LLM’s outputs in favor of rapid deployment and use. Developing checks and balances for the safe and socially responsible evaluation and use of LLMs is therefore not only a best business practice, but also crucial to fully understanding their accuracy and performance.

Regular evaluation of large language models helps developers identify their strengths and weaknesses and enables them to detect and mitigate risks, including the misleading or inaccurate code they may generate. However, not all LLMs are created equal, so evaluating their outputs, with all their nuances and complexities, in a way that yields consistent results can be challenging. Below, we explore some considerations to keep in mind when assessing the effectiveness and performance of large language models.

Ellen Brandenberger

Senior Director of Product Innovation, Stack Overflow.

The complexity of evaluating large language models