Large Language Models

Evaluating Large Language Models


Learning Objectives

  • You know some of the ways of evaluating the quality of large language models.

The quality of large language models is typically measured using benchmarks and human evaluations. Popular benchmarks include HellaSwag, Measuring Massive Multitask Language Understanding (MMLU), and TruthfulQA. HellaSwag focuses on commonsense reasoning through multiple-choice questions, MMLU tests the models on a wide range of subjects ranging from mathematics to law and ethics, while TruthfulQA consists of questions designed to elicit false beliefs or misconceptions.
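
To make the idea concrete, below is a minimal sketch of how a multiple-choice benchmark item can be scored: the model assigns a score to each candidate answer, the highest-scoring candidate is taken as the prediction, and accuracy is computed against the reference answers. The example item and the `placeholder_score` function are illustrative stand-ins, not part of any actual benchmark harness.

```python
# Minimal sketch of scoring a multiple-choice benchmark (MMLU-style items).
from typing import Callable

# Hypothetical benchmark items: a question, candidate answers, and the index
# of the correct answer.
ITEMS = [
    {
        "question": "What is the capital of Finland?",
        "choices": ["Stockholm", "Helsinki", "Oslo", "Copenhagen"],
        "answer": 1,
    },
]


def placeholder_score(question: str, choice: str) -> float:
    """Stand-in for a model score, e.g. log P(choice | question)."""
    # A real evaluation would query the model here; this placeholder just
    # prefers longer answers so the example runs end to end.
    return float(len(choice))


def accuracy(items, score: Callable[[str, str], float]) -> float:
    """Pick the highest-scoring choice per item and compare to the reference."""
    correct = 0
    for item in items:
        scores = [score(item["question"], c) for c in item["choices"]]
        predicted = scores.index(max(scores))
        correct += int(predicted == item["answer"])
    return correct / len(items)


print(f"Accuracy: {accuracy(ITEMS, placeholder_score):.2f}")
```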

A challenge with all benchmarks is that they are inherently limited in scope, as they focus on specific tasks. A model that performs well on one benchmark might not perform well on another. It is therefore sensible to use multiple benchmarks when evaluating models, which is why leaderboards often include several of them.
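
As a rough illustration of how a leaderboard can combine results, the sketch below averages per-benchmark scores into a single ranking. The model names and scores are made up for the example; real leaderboards use their own normalisation and weighting.

```python
# Toy aggregation of per-benchmark scores into a leaderboard ranking.
# The model names and scores below are made up for illustration.
scores = {
    "model-a": {"hellaswag": 0.85, "mmlu": 0.70, "truthfulqa": 0.55},
    "model-b": {"hellaswag": 0.80, "mmlu": 0.75, "truthfulqa": 0.60},
}

# Rank models by their average score across the benchmarks.
leaderboard = sorted(
    ((name, sum(s.values()) / len(s)) for name, s in scores.items()),
    key=lambda pair: -pair[1],
)

for name, avg in leaderboard:
    print(f"{name}: {avg:.3f}")
```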


As an example of a leaderboard, see the Hugging Face Open LLM Leaderboard, which covers many openly available large language models. The leaderboard does not include proprietary models such as GPT-4 from OpenAI, the Claude models from Anthropic, or Gemini from Google.

Benchmarks come with another challenge: they can be “gamed”, as the benchmark data can end up in the training data of large language models. A model that has seen the benchmark data during training may perform well on the benchmark while performing considerably less well on real-world tasks.
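
One simplified way to probe for this kind of contamination is to check whether benchmark questions appear nearly verbatim in the training text, for example via word n-gram overlap. The sketch below illustrates the idea under the assumption that the training text is available as a plain string; real contamination analyses are considerably more involved.

```python
# Simplified sketch of an n-gram overlap check between a benchmark item and
# training text. Real contamination studies use far larger corpora and more
# careful matching (tokenisation, punctuation handling, fuzzy matches).

def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Hypothetical example where the benchmark question leaked into training data.
training_sample = "the quiz asked which planet is known as the red planet and the answer was mars"
benchmark_question = "which planet is known as the red planet"

print(overlap_fraction(benchmark_question, training_sample))  # high overlap -> suspicious
```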


In addition to benchmarks, the quality of large language models can be assessed through human evaluation. This can be done, for example, by asking human evaluators to rate the quality of the generated text, by comparing the generated text to human-written text, or by showing evaluators outputs from different large language models for the same task and asking them to pick the output they prefer.

The latter approach is used, for example, in the Chatbot Arena, which also provides a leaderboard for chatbots, available at https://chat.lmsys.org/. The site lets you do pairwise evaluations of large language models by entering prompts and comparing the responses. The leaderboard itself can be reached by clicking “Leaderboard” at the top of the page.
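
Chatbot Arena turns these pairwise votes into a ranking using an Elo-style rating system. The sketch below shows the basic idea of updating two models' ratings after each comparison; the starting ratings, the K-factor, and the example votes are illustrative choices, not the arena's actual methodology.

```python
# Sketch of Elo-style rating updates from pairwise preference votes.
# Constants and example votes are illustrative only.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings for A and B after one comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical votes: (winner, loser) pairs collected from human evaluators.
votes = [("model-x", "model-y"), ("model-x", "model-y"), ("model-y", "model-x")]
ratings = {"model-x": 1000.0, "model-y": 1000.0}

for winner, loser in votes:
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser], a_won=True)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```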

Try visiting the Chatbot Arena site and evaluating the responses of the chatbots. You can, for example, use the prompt “What is the capital of Finland?”, which should be easy; the prompt “Please translate the Finnish sentence ‘Kuusi palaa’ into English”; and the prompt “What day is it today?”.

While the models should be able to answer the question about the Finnish capital, the translation prompt is more difficult. The sentence “Kuusi palaa” has a range of completely valid English translations, including “your moon is on fire”, “the spruce returns”, “the spruce is on fire”, and “six pieces”, and the correct translation depends on the context. Similarly, as large language models have a cut-off date in their training data, they might not be able to answer the question about the current day correctly, unless they have additional mechanisms for providing the current date.