Evaluation

Introduction

Evaluating LLM outputs is hard. Natural language is imprecise and language outputs are hard to quantify. For example, we could use embeddings to assess how “close” an actual response is to a reference response, but the embedding is also an output of a language model and the distance metric may not be appropriate for our particular use case.
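To make that caveat concrete, here is a minimal sketch of the embedding-distance approach. The sentence-transformers library and the all-MiniLM-L6-v2 model are illustrative choices, not recommendations.

    # A minimal sketch of scoring "closeness" with embeddings. The library
    # (sentence-transformers) and model (all-MiniLM-L6-v2) are illustrative
    # choices, not recommendations.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    reference = "Refunds are available within 30 days of purchase."
    actual = "You can return the item for a refund up to a month after buying it."

    embeddings = model.encode([reference, actual])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    # A high cosine similarity suggests semantic closeness, but the score
    # inherits the biases of the embedding model and the distance metric.
    print(f"cosine similarity: {similarity:.3f}")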

However, skipping evaluation altogether is not a good alternative. It is a case, I believe, of combining multiple approaches with human review and labelling. Nothing beats looking at the outputs to get a sense of what may or may not be working, but we also need techniques that can be applied at scale.

Hard problems in LLM evaluation

  1. Prompt sensitivity

Are we measuring something intrinsic to the model or is it an artefact of the prompt?

This is one of the reasons Prompt Store was created - to make prompts transparent, versioned, and traceable.

  2. Contamination

Has the model learned to solve problems of this type, or has it simply memorized the answers? Does performance change on time-sensitive benchmarks, i.e. on problems published before versus after the training cutoff? For example, how does the model perform on a coding challenge published before its training cutoff date compared to one published after it?
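One way to probe for contamination is to split a benchmark by publication date relative to the training cutoff and compare pass rates. A rough sketch, where the cutoff date and the columns of the results table are hypothetical:

    # Rough contamination probe: compare pass rates on problems published
    # before vs. after the assumed training cutoff. The cutoff date and the
    # "published"/"passed" columns are hypothetical.
    import pandas as pd

    TRAINING_CUTOFF = pd.Timestamp("2023-10-01")  # assumed cutoff for illustration

    results = pd.DataFrame({
        "published": pd.to_datetime(["2023-03-01", "2023-06-15", "2024-01-10", "2024-02-20"]),
        "passed": [True, True, False, True],
    })

    results["post_cutoff"] = results["published"] > TRAINING_CUTOFF
    print(results.groupby("post_cutoff")["passed"].mean())
    # A sharp drop on post-cutoff problems hints at memorisation rather than skill.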

  3. Reproducibility

If we set a seed on models that support it and/or reduce the temperature parameter to get more reproducible responses, are those settings still valid for the particular use case? To generate variations of marketing content, for example, a low temperature setting may defeat the purpose of using the model, but how can we assess the model otherwise?
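As a sketch, pinning the seed and temperature with the OpenAI Python SDK might look like the following; the model name is an assumption, and seed-based determinism is only best-effort on models that support it.

    # Sketch: reducing output variance with temperature and seed. The model
    # name is an assumption; seed-based determinism is best-effort and only
    # available on models that support it.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model for illustration
        messages=[{"role": "user", "content": "Write a tagline for a coffee brand."}],
        temperature=0,  # more reproducible, but may defeat a creative use case
        seed=42,
    )
    print(response.choices[0].message.content)

For creative use cases, an alternative is to keep production settings and evaluate a sample of outputs statistically, rather than expecting identical responses on every run.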

  4. Who watches the watchers

It is tempting to use models to assess other models and, to be realistic, this needs to be part of the toolkit. Traditional NLP metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) have been criticized over the years, but they fall into the camp of “better than nothing”. They require labelled data, which may not exist, and they may not fit the use case: ROUGE is a proxy metric for abstractive summarization, while BLEU evaluates machine translation.

The evaluation toolkit needs several techniques; there is no “one-size-fits-all” metric.

Some level of human review is useful and can be used to support other techniques.

Prompt Store starts by implementing a number of LLM rubrics - essentially prompting a model to evaluate the output of another model, with or without labelled data. These evaluations can be scheduled to establish a monitoring function.
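As an illustration of the idea (not Prompt Store's implementation), a minimal LLM-as-judge rubric might look like this; the rubric wording, judge model, and 1-5 scale are assumptions.

    # Minimal LLM-as-judge sketch: one model scores another model's output
    # against a rubric. Illustrative only - not Prompt Store's implementation.
    # The rubric wording, judge model, and 1-5 scale are assumptions.
    import json
    from openai import OpenAI

    client = OpenAI()

    RUBRIC = (
        "You are an evaluator. Score the RESPONSE to the TASK on a 1-5 scale "
        "for accuracy and relevance. Reply with JSON: "
        '{"score": <int>, "reason": "<short justification>"}'
    )

    def judge(task, response_text):
        result = client.chat.completions.create(
            model="gpt-4o",  # assumed judge model
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"TASK:\n{task}\n\nRESPONSE:\n{response_text}"},
            ],
            temperature=0,
            response_format={"type": "json_object"},
        )
        return json.loads(result.choices[0].message.content)

    print(judge("Explain the 30-day refund policy.", "Refunds are available for 60 days."))

Run on a schedule against a sample of production traffic, scores like these become the monitoring function mentioned above.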

What is ROUGE?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE is a proxy metric for abstractive summarization.

There are two types of text summarization that a human, and nowadays a machine, can do:

  • Extractive: Words and phrases are directly extracted from the text.
  • Abstractive: Words and phrases are generated to be semantically consistent with the original text, ensuring its key information is maintained.

The algorithm to compute a ROUGE score considers sequences of consecutive tokens, a.k.a. n-grams. The n-grams from one text (e.g. the human-written summary) are compared to the n-grams of the other text (e.g. the machine-written summary). A large overlap of n-grams results in a high ROUGE score, while a low overlap results in a low ROUGE score. There are many variations of ROUGE, such as unigrams (ROUGE-1), bigrams (ROUGE-2), and longest common subsequence (ROUGE-L).
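A from-scratch sketch of ROUGE-N recall as just described: count the overlapping n-grams and divide by the number of n-grams in the reference. Production code would normally use a library such as rouge-score; this is only to show the mechanics.

    # ROUGE-N recall from scratch: overlapping n-grams divided by the number
    # of n-grams in the reference.
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(reference, candidate, n=1):
        ref = ngrams(reference.lower().split(), n)
        cand = ngrams(candidate.lower().split(), n)
        overlap = sum((ref & cand).values())  # clipped count of shared n-grams
        total = sum(ref.values())
        return overlap / total if total else 0.0

    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n_recall(reference, candidate, n=1))  # 5 of 6 unigrams recalled
    print(rouge_n_recall(reference, candidate, n=2))  # 3 of 5 bigrams recalled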

A criticism of ROUGE is that it has lexical bias. That is, the algorithm favours exact word matches (extractive summarization) over semantic matches (abstractive summarization). Abstractive summarization more closely resembles the way a human writes a summary.

What is BLEU?

BLEU stands for Bilingual Evaluation Understudy. BLEU also uses n-gram overlap to measure the similarity between machine-translated text and the corresponding human-translated reference text. BLEU focuses on precision: how many of the words (and/or n-grams) in the candidate model output appear in the human reference. ROUGE focuses on recall: how many of the words (and/or n-grams) in the human reference appear in the candidate model output. The two are complementary, as in the usual precision-recall tradeoff. But the similarity in technique also means that BLEU shares similar drawbacks (a short computation sketch follows the list below).

  • It doesn’t consider meaning
  • Neither ROUGE nor BLEU handles short-form text well, such as marketing copy for display ads or email subject lines
  • It doesn’t handle morphologically rich languages well
  • It doesn’t correlate well with expert human judgments [Callison-Burch et al. (2006), Belz and Reiter (2006), Tan et al. (2015), Smith et al. (2016), Mathur et al. (2020), who received an outstanding paper award from the ACL for their work calling for “retiring BLEU as the de facto standard metric”]
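For comparison with the ROUGE sketch above, here is sentence-level BLEU computed with NLTK. The library choice is an assumption (sacreBLEU is more common for corpus-level reporting), and smoothing is applied because short texts often have no higher-order n-gram matches, which would otherwise zero out the whole score.

    # Sentence-level BLEU with NLTK. The library choice is an assumption;
    # smoothing avoids a zero score when short texts have no higher-order
    # n-gram matches.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "the cat sat on the mat".split()
    candidate = "the cat lay on the mat".split()

    score = sentence_bleu(
        [reference],  # BLEU supports multiple reference translations
        candidate,
        smoothing_function=SmoothingFunction().method1,
    )
    print(f"BLEU: {score:.3f}")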

The Mathur et al. (2020) paper ended by saying “human evaluation must always be the gold standard, and for continuing improvement in translation, to establish significant improvements over prior work, all automatic metrics make for inadequate substitutes.”