Part 1 of our LLM evaluation series covers the basics of LLM evaluation including popular benchmarks and their metrics.
With the (generative) AI revolution raging full steam ahead, new large language models (LLMs) seem to be spawning every day. There are general models and versions of these general models that are fine-tuned for specific purposes. Hugging Face currently hosts roughly 750,000 different models. But how do you pick the right one for your goals?
Finding the model that performs best for a given task is difficult. Just defining what “good performance” means for some tasks can be challenging. There is a range of benchmarks available that help you compare LLMs. This post explains the basics of LLM evaluation and goes into detail on general evaluation metrics and LLM benchmarks.
💡 A blog post series on LLM benchmarking
Read all the other posts in Symflower’s series on LLM evaluation, and check out our latest deep dive on the best LLMs for code generation.
How do LLM evaluation benchmarks work?
LLM benchmarks help assess a model’s performance by providing a standard (and comparable) way to measure it across a range of tasks. Benchmarks define a specific setup and a variety of carefully designed, relevant tasks (including questions/prompts and datasets) so that LLMs can be compared in a consistent way.
Benchmarks cover a range of skill areas and tasks (see the note on LLM performance below), such as:
- Language generation, conversation: Generating coherent and relevant text in response to prompts, and engaging in human dialogue.
- Understanding & answering questions: Interpreting the meaning of text and providing accurate, relevant answers.
- Translation: Translating text across languages.
- Logical reasoning & common sense: Applying logic (including reasoning skills like inductive and deductive reasoning) and everyday knowledge to solve problems.
- Standardized tests: Tests like SAT, ACT, and other standardized assessments used in human education can also be used to test an LLM’s performance.
- Code generation: Understanding and generating software code.
A variety of metrics help measure and compare performance across different LLMs. Let’s dive into these below. More advanced metrics will be covered in part 2 of this series.
What general metrics (scorers) are used to benchmark LLMs?
Scorers (metrics) help measure LLM performance in a quantitative and comparable way.
🤔 What do we mean by “LLM performance”?
Throughout this post, we’ll use “LLM performance” to refer to an estimate of how useful an LLM is for a given task. Other indicators such as tokens per second, latency, cost, or user engagement metrics are certainly useful, but they are outside the scope of this post.
Symflower’s DevQualityEval takes into account token costs to help select LLMs that produce good test code. Check out our deep dives if you’re looking for a coding LLM that may be useful in a real-life development environment:
- OpenAI’s o1-preview is the king 👑 of code generation but is super slow and expensive (Deep dives from the DevQualityEval v0.6)
- DeepSeek v2 Coder and Claude 3.5 Sonnet are more cost-effective at code generation than GPT-4o! (Deep dives from the DevQualityEval v0.5.0)
- Is Llama-3 better than GPT-4 for generating tests? And other deep dives of the DevQualityEval v0.4.0
- Can LLMs test a Go function that does nothing?
Metrics used in LLM evaluation range from generic statistics-based measures to more complex domain-specific scorers and even scenarios where a model is evaluated by other LLMs (LLM-assisted evaluation).
Note that scorers can easily mislead you, so it is very important to pay attention to the details. An LLM may get a high score on one metric and still provide an unsatisfactory overall result. That’s why it’s important to have well-defined general indicators of LLM performance for your specific application.
Most of these metrics require some kind of ground truth, e.g. a “golden” dataset that defines the expected output for an example task. Training or fine-tuning an LLM usually requires a huge amount of data, and so does benchmarking. After all, humans usually have to go through quite a hassle to get certified for a certain profession, and the same should apply to LLMs that, like humans, learn from experience. Data with example queries and “correct” responses constitutes this golden dataset.
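As a minimal sketch, a golden dataset and the comparison against it could look like the following. The `GoldenExample` type, the `normalize` helper, and all example data are hypothetical and only serve as an illustration. Real benchmarks typically use far larger datasets and more forgiving matching rules than an exact string comparison.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    """One entry of a golden dataset: a query and its expected ("correct") response."""
    query: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_rate(dataset: list[GoldenExample], outputs: list[str]) -> float:
    """Fraction of model outputs that equal the expected response after normalization."""
    if len(dataset) != len(outputs):
        raise ValueError("need exactly one output per golden example")
    if not dataset:
        raise ValueError("golden dataset is empty")
    matches = sum(
        normalize(example.expected) == normalize(output)
        for example, output in zip(dataset, outputs)
    )
    return matches / len(dataset)


# Hypothetical golden dataset and hypothetical model outputs (assumptions for illustration).
golden = [
    GoldenExample("What is 2+2?", "4"),
    GoldenExample("What is the capital of France?", "Paris"),
    GoldenExample("Translate 'Hallo' to English.", "Hello"),
    GoldenExample("What is 10/2?", "5"),
]
model_outputs = ["4", " paris ", "Hi", "5"]

score = exact_match_rate(golden, model_outputs)
print(f"exact match rate: {score:.2f}")  # prints "exact match rate: 0.75"
```

Exact matching only works for short, unambiguous answers. For free-form text, the comparison step is usually replaced by one of the more tolerant scorers.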
During the benchmark, you compare actual LLM output to this ground truth to get the following general metrics:
- Accuracy: The percentage of answers the LLM gets right.
- Factual correctness: Whether something stated by the model is actually correct. You can determine this either manually or by guiding an LLM (e.g. GPT-4o) with an engineered chain-of-thought prompt to judge the correctness of the tested LLM’s responses (LLM-assisted evaluation). For example: ❓ “What is 2+2?” 🤖 “2+2=5” is a factually incorrect response.
- Hallucination: Determines whether the LLM’s output contains information that it shouldn’t have knowledge of (e.g. something that is fake and made up by the LLM). Note that an answer may still be factually correct despite being a hallucination. For example: ❓ “What is 2+2?” 🤖 “2+2=4. You are wearing a red shirt!” might be entirely correct (if your shirt is indeed red) but… how should the LLM know? A hallucination!
- Relevance: How well the LLM’s output addresses the input by providing an informative and relevant answer.
- Perplexity: How “surprised” the LLM is by a given piece of text, i.e. how little probability it assigned to the tokens that actually occur. Lower perplexity means the model predicts the text better. Perplexity is a numerical value that is intrinsic to how LLMs internally deal with text, so it can be computed from the token probabilities produced by the LLM’s underlying neural network.
- Responsible metrics: A range of metrics designed to cover bias and any kind of toxicity in LLM output to filter out potentially harmful or offensive information.
- Human-in-the-loop evaluation: In certain cases, human experts may be required to assess the integrity (quality, relevance, or coherence) of LLM outputs.
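Of the metrics above, perplexity is the most formula-like: it is commonly defined as the exponential of the average negative log-probability that the model assigns to each token of a text. The following is a minimal sketch of that calculation. The log-probabilities are hypothetical values chosen for illustration. In practice, they would come from the model under evaluation.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative (natural) log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


# Hypothetical per-token probabilities for two texts (assumptions for illustration).
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]  # model predicted the text well
uncertain = [math.log(0.2), math.log(0.1), math.log(0.25)]  # model was "surprised"

print(f"confident text: {perplexity(confident):.2f}")
print(f"uncertain text: {perplexity(uncertain):.2f}")
```

A useful sanity check: if a model assigns a probability of 0.25 to every token, its perplexity is 4, as if it were choosing uniformly between four options at each step.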
To make model evaluation easier, there are a few LLM benchmarks that define and track a certain set of such metrics with the goal of providing comparable results across different large language models.
The types of LLM evaluation benchmarks
To best assess the performance of LLMs for your use case, you have to carefully select the type of tasks (and corresponding metrics) you’ll want to use. Benchmarks make this easy by providing structured datasets (consisting of prompts with tasks or questions and their correct answers) across a range of scenarios, topics, tasks, and complexities. They also define scorers to help compare different LLMs on the same set of tasks.
While there are a number of benchmarks out there, the two key evaluation strategies are online and offline evaluation. Offline evaluation is the process of assessing an LLM’s performance before it is publicly deployed. Online evaluation, on the other hand, is the process of ensuring an LLM stays performant during real-world user interaction.
The widely used Open LLM Leaderboard by Hugging Face evaluates models based on six benchmarks:
- IFEval: Instruction-following evaluation for Large Language Models.
- BBH (Big Bench Hard): An evaluation suite of 23 challenging tasks selected from the broader BIG-Bench benchmark.
- MATH: A dataset of 12,500 challenging competition mathematics problems.
- GPQA: A graduate-level Google-proof Q&A benchmark with 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
- MuSR: A dataset for evaluating LLMs on multiple-step soft reasoning tasks specified in natural language.
- MMLU-PRO: Massive Multitask Language Understanding - Professional is an improved dataset that extends the MMLU benchmark by including challenging, reasoning-focused questions with sets of 10 answer options.
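Several of these benchmarks (e.g. GPQA and MMLU-PRO) use multiple-choice questions. As a rough sketch, formatting such a question and extracting the chosen letter from a model’s reply could look like the code below. The helper names, the prompt wording, the accepted answer formats, and the sample question are all assumptions for illustration. This is not how the leaderboard or the benchmarks themselves implement scoring.

```python
import re
import string
from typing import Optional


def format_question(question: str, options: list[str]) -> str:
    """Render a multiple-choice question with lettered options (A, B, C, ...)."""
    if not 2 <= len(options) <= 26:
        raise ValueError("expected between 2 and 26 options")
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def extract_choice(response: str, num_options: int) -> Optional[str]:
    """Extract the option letter from short replies like "C", "(C)", or "The answer is C."."""
    valid = string.ascii_uppercase[:num_options]
    pattern = rf"(?:the\s+)?(?:answer\s*(?:is|:)\s*)?\(?([{valid}])\)?\.?"
    match = re.fullmatch(pattern, response.strip(), re.IGNORECASE)
    return match.group(1).upper() if match else None


# Hypothetical question with 10 options (not taken from any benchmark).
options = ["Mercury", "Venus", "Earth", "Mars", "Jupiter",
           "Saturn", "Uranus", "Neptune", "Pluto", "Ceres"]
prompt = format_question("Which planet is closest to the Sun?", options)
print(prompt)

# Hypothetical model replies, scored against the correct letter "A".
for reply in ["A", "The answer is (B).", "I am not sure."]:
    choice = extract_choice(reply, len(options))
    print(f"{reply!r} -> {choice} (correct: {choice == 'A'})")
```

Replies that don’t fit one of the accepted formats yield `None` and count as wrong here. A stricter or more lenient parser changes the resulting scores, which is one reason why results are only comparable within the same benchmark setup.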
Benchmarking is generally performed in one of the following ways:
- Zero-shot: The LLM is presented with the task without any examples or hints as to how to solve it. This setup best shows the model’s ability to interpret and adapt to new challenges.
- Few-shot: Evaluators provide the model with a few examples of how that specific type of task is correctly completed. This assesses how well the model under evaluation can learn from a small number of examples.
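The difference between the two setups comes down to how the prompt is built. The sketch below illustrates this with a hypothetical `build_prompt` helper and a simple “Q:/A:” layout. Both are assumptions for illustration, since each benchmark defines its own prompt format.

```python
from typing import Optional


def build_prompt(question: str, examples: Optional[list[tuple[str, str]]] = None) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with solved examples)."""
    parts = []
    for example_question, example_answer in examples or []:
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    # The actual task always comes last, with the answer left open for the model.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Zero-shot: only the task itself.
zero_shot = build_prompt("What is 7+5?")

# Few-shot: the same task, preceded by two solved examples.
few_shot = build_prompt(
    "What is 7+5?",
    examples=[("What is 2+2?", "4"), ("What is 3+6?", "9")],
)

print(zero_shot)
print("---")
print(few_shot)
```

The task is identical in both prompts, so any difference in the model’s answers can be attributed to the examples it was shown.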
Summary: LLM benchmarking
This post covers the basics of benchmarking large language models for specific purposes. You should have a better understanding of how to evaluate the performance of LLMs for your specific use case.
Want to go deeper still? Check out the upcoming post in our LLM series which covers the variety of complex statistical and model-based scorers and a few of the many evaluation frameworks that help you assess LLM performance.