tutorial · 6 min read
Using OpenAI `evals` library for prompt engineering - Part 1
Exploring OpenAI's `evals` library, a comprehensive tool for evaluating Large Language Models. This post covers the library's core components and walks through setting up and executing an example evaluation.

What is the OpenAI `evals` library?
Large language models offer remarkable capabilities for generating human-like text, but their utility in production depends on the quality and consistency of their outputs. This poses a challenge due to the non-deterministic nature of LLMs and the numerous configuration options available, from prompt design to model parameters.
The OpenAI `evals` library is designed to meet this need. It provides a structured approach to testing LLM outputs against specific criteria, enabling developers to iterate towards optimal system performance with precision. `evals` supports batch processing of queries and aggregates results for comprehensive analysis.
This article introduces the `evals` library, outlines its key components, and demonstrates its application through a simple example.
Key Components of an Evaluation
Before diving into a practical example, let’s break down the key components of an evaluation process using the `evals` library:
Samples
These are the input data fed into the model. They are stored in a `jsonl` file, where each line is a distinct `json` object containing the test prompts. Depending on the evaluation template you choose, these objects typically require an `input` field (your prompt) and, in some cases, an `ideal` field. The latter represents the expected outcome, serving as a benchmark for the model’s response in evaluations like the `match` template.
Completion Function
You can think of the completion function as the model to evaluate. More generally, it is a function that processes the samples and generates the outputs for assessment. While OpenAI provides pre-built functions for its API models, the library is flexible, allowing you to craft custom completion functions to test any model of your choice.
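To make this concrete, here is a minimal sketch of what a custom completion function could look like. It follows the `CompletionFn`/`CompletionResult` interface from `evals.api` as I understand it (a callable that takes a prompt and returns an object exposing `get_completions()`); the `StubCompletionFn` class and its echo logic are purely hypothetical placeholders, and the exact interface may differ between library versions.

```python
from evals.api import CompletionResult


class StubCompletionResult(CompletionResult):
    """Wraps a raw model response so the eval logic can read it."""

    def __init__(self, response: str):
        self.response = response

    def get_completions(self) -> list:
        # The eval logic reads the model output(s) from this list.
        return [self.response.strip()]


class StubCompletionFn:
    """Hypothetical completion function; replace the stub with a call to your own model."""

    def __call__(self, prompt, **kwargs) -> CompletionResult:
        response = f"(stub answer for: {prompt})"
        return StubCompletionResult(response)
```

To be usable by name from the command line, such a function would also need to be registered with the library; the pre-built OpenAI completion functions (such as `gpt-3.5-turbo`, used later in this post) are already registered.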
Evaluation Logic (`Eval` Class)
The `Eval` class orchestrates the evaluation, executing a defined `eval_sample` method for each piece of input data. This method automates the evaluation flow:
- generating a prompt from your sample,
- applying the completion function to produce an output,
- assessing the output against your criteria,
- logging the results.
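As a rough illustration of this flow, here is a sketch of a custom eval class. It mirrors the structure of the built-in evals as I understand it (subclass `evals.Eval`, implement `eval_sample` and `run`), but the class name, the exact-match criterion, and the specific helper calls are assumptions on my part and may not match every version of the library.

```python
import random

import evals
import evals.metrics


class ExactAnswerEval(evals.Eval):
    """Hypothetical eval: checks the sampled answer against the expected answer."""

    def __init__(self, samples_jsonl, **kwargs):
        super().__init__(**kwargs)
        self.samples_jsonl = samples_jsonl

    def eval_sample(self, sample, rng: random.Random):
        # 1. Generate the prompt from the sample and apply the completion function.
        result = self.completion_fn(prompt=sample["input"], max_tokens=50)
        sampled = result.get_completions()[0]
        # 2. Assess the output against the criterion and log the outcome.
        evals.record_and_check_match(
            prompt=sample["input"],
            sampled=sampled,
            expected=sample["ideal"],
        )

    def run(self, recorder):
        # Evaluate every sample, then aggregate the per-sample events into a report.
        self.eval_all_samples(recorder, self.get_samples())
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
```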
Recorder
The `Recorder` is used to customize the evaluation’s output, determining what data is logged in the final report. It can contain both aggregated metrics and data at the individual sample level.
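Conceptually, a recorder just collects structured events (samplings, matches, per-sample metrics) and writes them out, typically as JSON lines like the logs we will look at below. The toy class here is purely illustrative and is not the library’s `Recorder`; it only shows the kind of bookkeeping involved.

```python
import json


class ToyRecorder:
    """Illustrative stand-in for a recorder: collects events and dumps them as JSON lines."""

    def __init__(self, path: str):
        self.path = path
        self.events = []

    def record_event(self, event_type: str, data: dict):
        # Sample-level data, e.g. a "sampling", "match", or "metrics" event.
        self.events.append({"type": event_type, "data": data})

    def record_final_report(self, report: dict):
        # Aggregated metrics computed over all samples.
        self.events.append({"final_report": report})

    def flush(self):
        with open(self.path, "w") as f:
            for event in self.events:
                f.write(json.dumps(event) + "\n")
```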
Evaluation setup
Now that we understand the basic structure of an eval, let’s see how to execute it in practice by running one of the provided evaluations. OpenAI simplifies this process with the `oaieval` command-line tool, designed to run evaluations seamlessly based on preset templates.
We’ll use the `test-fuzzy-match` eval as an example. As its name suggests, it evaluates response quality by fuzzy matching the output against an expected answer.
Registration
The first step is to register the eval by creating a YAML configuration file in `evals/registry/evals/`.
For `test-fuzzy-match`, the corresponding file is `evals/registry/evals/test-basic.yaml`:
```yaml
test-fuzzy-match:
  id: test-fuzzy-match.s1.simple-v0
  description: Example eval that uses fuzzy matching to score completions.
  metrics: [f1_score]
test-fuzzy-match.s1.simple-v0:
  class: evals.elsuite.basic.fuzzy_match:FuzzyMatch
  args:
    samples_jsonl: test_fuzzy_match/samples.jsonl
```
Notice that the registration points to the eval class (`FuzzyMatch`) and to the samples. The samples are expected to be in `evals/registry/data/<eval_name>/samples.jsonl`. The completion function, however, is set independently when we run the eval. This allows the same eval to be run against different completion functions.
This configuration also defines the evaluation’s ID, description, and the metrics to calculate (in this case, `f1_score`). Here’s the general pattern:
```yaml
<eval_name>:
  id: <eval_name>.<split>.<version>
  description: <description of the eval>
  metrics: [desired_metrics]
<eval_name>.<split>.<version>:
  class: <the evaluation class to use>
  args:
    samples_jsonl: <eval_name>/samples.jsonl
```
The naming convention for evals is `<eval_name>.<split>.<version>`:
- `<eval_name>` is the eval name, used to group evals whose scores are comparable.
- `<split>` is the data split, used to further group evals that are under the same `<base_eval>`. E.g., “val”, “test”, or “dev” for testing.
- `<version>` is the version of the eval, which can be any descriptive text you’d like to use (though it’s best if it does not contain `.`).
Exploring the input data
You’ll find the `samples.jsonl` file in `evals/registry/data/test_fuzzy_match/`. Each line is a JSON object. Let’s take a look at the first sample (formatted for readability):
```json
{
  "input": [
    {
      "role": "system",
      "content": "Answer the following questions as concisely as possible."
    },
    {
      "role": "system",
      "content": "What's the capital of France?",
      "name": "example_user"
    },
    {
      "role": "system",
      "content": "Paris",
      "name": "example_assistant"
    },
    {
      "role": "system",
      "content": "What's 2+2?",
      "name": "example_user"
    },
    {
      "role": "system",
      "content": "4",
      "name": "example_assistant"
    },
    {
      "role": "user",
      "content": "Who is the girl who plays eleven in stranger things?"
    }
  ],
  "ideal": [
    "Millie Bobby Brown"
  ]
}
```
Here, we are assessing the capacity of the model to correctly answer trivia questions after being primed with a few examples (few-shot prompting). The `input` field contains the prompt formatted as a list of messages, which is the prompt format expected by OpenAI’s chat models. The `ideal` field specifies the expected correct answer, used by the fuzzy-match eval to check the model’s answer.
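If you later want to run the same kind of eval on your own data, producing this format takes only a few lines of Python. The snippet below is a minimal, hypothetical example: the questions, answers, and output path are placeholders, and it simply writes the same `input`/`ideal` structure shown above, one JSON object per line.

```python
import json

# Hypothetical trivia questions and their expected answers.
my_samples = [
    ("What is the largest planet in the solar system?", "Jupiter"),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen"),
]

with open("samples.jsonl", "w") as f:
    for question, answer in my_samples:
        sample = {
            "input": [
                {"role": "system", "content": "Answer the following questions as concisely as possible."},
                {"role": "user", "content": question},
            ],
            "ideal": [answer],
        }
        # One JSON object per line, as expected by the evals library.
        f.write(json.dumps(sample) + "\n")
```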
Running the eval with `oaieval`
With the setup in place, executing an evaluation is straightforward. First, make sure that your `OPENAI_API_KEY` is correctly configured in your environment variables. Then, proceed to run your evaluation from the command line, specifying the completion function and the evaluation name:

```bash
oaieval <completion_function> <eval_name>
```
For instance, to evaluate using the `gpt-3.5-turbo` model on the `test-fuzzy-match` evaluation:

```bash
oaieval gpt-3.5-turbo test-fuzzy-match
```
Understanding the Output
The evaluation is processed and results are stored in `/tmp/evallogs/` by default. This location can be changed using the `--record_path` flag.
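Since the log is itself a JSONL file, it is easy to inspect programmatically. The helper below is a small illustration of my own: the file path is a placeholder (substitute the log produced by your own run), and the parsing simply follows the structure described in the next three subsections.

```python
import json

# Placeholder path: use the log file produced by your own run.
log_path = "/tmp/evallogs/<your_run_log>.jsonl"

spec, final_report, records = None, None, []
with open(log_path) as f:
    for line in f:
        event = json.loads(line)
        if "spec" in event:
            spec = event["spec"]                   # evaluation specification
        elif "final_report" in event:
            final_report = event["final_report"]   # aggregated metrics
        else:
            records.append(event)                  # per-sample records

print("Eval:", spec["eval_name"])
print("Final report:", final_report)
print("Number of recorded events:", len(records))
```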
`spec`
The first line contains the specification of the evaluation, including the eval name and the completion functions. Notice the `seed` value, which makes runs repeatable. You can set it manually with the `--seed` flag.
```json
{
  "spec": {
    "completion_fns": [
      "gpt-3.5-turbo"
    ],
    "eval_name": "test-fuzzy-match.s1.simple-v0",
    "base_eval": "test-fuzzy-match",
    "split": "s1",
    "run_config": {
      ...,
      "seed": 20220722,
      ...
    },
    ...
  }
}
```
For additional options and flags supported by `oaieval`, refer to its help documentation by running `oaieval --help`.
`final_report`
Next, the final report aggregates the metrics specified in the registration over all the samples.
```json
{
  "final_report": {
    "accuracy": 1.0,
    "f1_score": 0.787878787878788
  },
  "run_id": "2402031402174JDHMBM7"
}
```
`records`
Lastly, detailed records of each processed sample are available, showing both the prompts used and the model’s responses. The data available in the logs depends on how the recorder was set up. In this case, each sample is recorded three times, each time with distinct data.
The record of type `sampling`, the default, contains the prompt and the sampled output.
```json
{
  "run_id": "2402031402174JDHMBM7",
  "event_id": 0,
  "sample_id": "test-fuzzy-match.s1.0",
  "type": "sampling",
  "data": {
    "prompt": [
      "..."
    ],
    "sampled": [
      "..."
    ]
  },
  "created_by": "",
  "created_at": "2024-02-03 14:02:18.146781+00:00"
}
```
The record of type `match` indicates whether the match was successful.
```json
{
  ...,
  "type": "match",
  "data": {
    "correct": true,
    "expected": "Millie Bobby Brown",
    "picked": [
      "Millie Bobby Brown"
    ]
  },
  ...
}
```
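To build some intuition for how the `correct` field above gets set: fuzzy matching, roughly speaking, normalizes both strings and checks whether one contains the other, rather than requiring an exact match. The function below is my own simplified illustration of that idea, not the library's implementation, which additionally computes the token-level `f1_score` reported per sample.

```python
def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def simple_fuzzy_match(sampled: str, expected: str) -> bool:
    """Illustrative check: does one normalized string contain the other?"""
    s, e = normalize(sampled), normalize(expected)
    return e in s or s in e


# Example: a verbose but correct answer still matches.
print(simple_fuzzy_match("The actress is Millie Bobby Brown.", "Millie Bobby Brown"))  # True
```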
The record of type `metrics` provides the calculated metrics for each sample.
```json
{
  ...,
  "data": {
    "accuracy": 1.0,
    "f1_score": 1.0
  },
  ...
}
```
I hope you can now see why `evals` is a powerful tool for prompt engineering. It can be used to evaluate a prompt, together with a set of model parameters, on a large number of inputs. This ability to systematically iterate and achieve consistently high-quality results is pivotal in the development of robust AI applications.
In this walkthrough, we’ve explained the architecture of an eval and demonstrated running a test evaluation using one of the provided templates. The next steps involve applying `evals` to our own input data, customizing the eval logic to our specific use cases, and writing our own completion function. We’ll then delve into model-graded evals.