tutorial · 6 min read
Using OpenAI `evals` library for prompt engineering - Part 1
Exploring OpenAI's `evals` library, a comprehensive tool for evaluating Large Language Models. This post covers the library's core components and walks through setting up and executing an example evaluation.

What is the OpenAI `evals` library?
Large language models offer remarkable capabilities for generating human-like text, but their utility in production depends on the quality and consistency of their outputs. This poses a challenge due to the non-deterministic nature of LLMs and the numerous configuration options available, from prompt design to model parameters.
The OpenAI `evals` library is designed to meet this need. It provides a structured approach to testing LLM outputs against specific criteria, enabling developers to iterate towards optimal system performance with precision. `evals` supports batch processing of queries and aggregates results for comprehensive analysis.
This article introduces the `evals` library, outlines its key components, and demonstrates its application through a simple example.
Key Components of an Evaluation
Before diving into a practical example, let’s break down the key components of an evaluation process using the `evals` library:
Samples
These are the input data fed into the model. They are stored in a `jsonl` file, where each line is a distinct `json` object containing the test prompts. Depending on the evaluation template you choose, these objects typically require an `input` field (your prompt) and, in some cases, an `ideal` field. The latter represents the expected outcome, serving as a benchmark for the model’s response in evaluations like the `match` template.
Completion Function
You can think of the completion function as the model to evaluate. More generally, it is a function that processes the samples and generates the outputs for assessment. While OpenAI provides pre-built functions for its API models, the library is flexible, allowing you to craft custom completion functions to test any model of your choice.
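To make this concrete, here is a minimal sketch of what a custom completion function could look like. It follows the `CompletionFn`/`CompletionResult` interface from `evals.api` as I understand it (a callable that takes a prompt and returns an object exposing `get_completions()`); the `StubCompletionFn` class and its echo logic are purely hypothetical placeholders, and the exact interface may differ between library versions.

```python
from evals.api import CompletionResult


class StubCompletionResult(CompletionResult):
    """Wraps a raw model response so the eval logic can read it."""

    def __init__(self, response: str):
        self.response = response

    def get_completions(self) -> list:
        # The eval logic reads the model output(s) from this list.
        return [self.response.strip()]


class StubCompletionFn:
    """Hypothetical completion function; replace the stub with a call to your own model."""

    def __call__(self, prompt, **kwargs) -> CompletionResult:
        response = f"(stub answer for: {prompt})"
        return StubCompletionResult(response)
```

To be usable by name from the command line, such a function would also need to be registered with the library; the pre-built OpenAI completion functions (such as `gpt-3.5-turbo`, used later in this post) are already registered.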
Evaluation Logic (`Eval` Class)
The `Eval` class orchestrates the evaluation, executing a defined `eval_sample` method for each piece of input data. This method automates the evaluation flow:
- generating a prompt from your sample,
- applying the completion function to produce an output,
- assessing the output against your criteria,
- logging the results.
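As a rough illustration of this flow, here is a sketch of a custom eval class. It mirrors the structure of the built-in evals as I understand it (subclass `evals.Eval`, implement `eval_sample` and `run`), but the class name, the exact-match criterion, and the specific helper calls are assumptions on my part and may not match every version of the library.

```python
import random

import evals
import evals.metrics


class ExactAnswerEval(evals.Eval):
    """Hypothetical eval: checks the sampled answer against the expected answer."""

    def __init__(self, samples_jsonl, **kwargs):
        super().__init__(**kwargs)
        self.samples_jsonl = samples_jsonl

    def eval_sample(self, sample, rng: random.Random):
        # 1. Generate the prompt from the sample and apply the completion function.
        result = self.completion_fn(prompt=sample["input"], max_tokens=50)
        sampled = result.get_completions()[0]
        # 2. Assess the output against the criterion and log the outcome.
        evals.record_and_check_match(
            prompt=sample["input"],
            sampled=sampled,
            expected=sample["ideal"],
        )

    def run(self, recorder):
        # Evaluate every sample, then aggregate the per-sample events into a report.
        self.eval_all_samples(recorder, self.get_samples())
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
```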
Recorder
The `Recorder` is used to customize the evaluation’s output, determining what data is logged in the final report. It can contain both aggregated metrics and data at the individual sample level.
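Conceptually, a recorder just collects structured events (samplings, matches, per-sample metrics) and writes them out, typically as JSON lines like the logs we will look at below. The toy class here is purely illustrative and is not the library’s `Recorder`; it only shows the kind of bookkeeping involved.

```python
import json


class ToyRecorder:
    """Illustrative stand-in for a recorder: collects events and dumps them as JSON lines."""

    def __init__(self, path: str):
        self.path = path
        self.events = []

    def record_event(self, event_type: str, data: dict):
        # Sample-level data, e.g. a "sampling", "match", or "metrics" event.
        self.events.append({"type": event_type, "data": data})

    def record_final_report(self, report: dict):
        # Aggregated metrics computed over all samples.
        self.events.append({"final_report": report})

    def flush(self):
        with open(self.path, "w") as f:
            for event in self.events:
                f.write(json.dumps(event) + "\n")
```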
Evaluation setup
Now that we understand the basic structure of an eval, let’s see how to execute it in practice by running one of the provided evaluations. OpenAI simplifies this process with the `oaieval` command-line tool, designed to run evaluations seamlessly based on preset templates.
We’ll use the `test-fuzzy-match` eval as an example. As its name suggests, it evaluates response quality by fuzzy matching the output against an expected answer.
Registration
The first step is to register the eval by creating a YAML configuration file in `evals/registry/evals/`.
For `test-fuzzy-match`, the corresponding file is `evals/registry/evals/test-basic.yaml`:
```yaml
test-fuzzy-match:
  id: test-fuzzy-match.s1.simple-v0
  description: Example eval that uses fuzzy matching to score completions.
  metrics: [f1_score]
test-fuzzy-match.s1.simple-v0:
  class: evals.elsuite.basic.fuzzy_match:FuzzyMatch
  args:
    samples_jsonl: test_fuzzy_match/samples.jsonl
```
Notice that the registration points to the eval class (`FuzzyMatch`) and to the samples. The samples are expected to be in `evals/registry/data/<eval_name>/samples.jsonl`. The completion function, however, is set independently when we run the eval. This allows the same eval to be run against different completion functions.
This configuration also defines the evaluation’s ID, description, and the metrics to calculate (in this case, `f1_score`). Here’s the general pattern:
```yaml
<eval_name>:
  id: <eval_name>.<split>.<version>
  description: <description of the eval>
  metrics: [desired_metrics]
<eval_name>.<split>.<version>:
  class: <the evaluation class to use>
  args:
    samples_jsonl: <eval_name>/samples.jsonl
```
The naming convention for evals is `<eval_name>.<split>.<version>`:
- `<eval_name>` is the eval name, used to group evals whose scores are comparable.
- `<split>` is the data split, used to further group evals that are under the same `<base_eval>`. E.g., “val”, “test”, or “dev” for testing.
- `<version>` is the version of the eval, which can be any descriptive text you’d like to use (though it’s best if it does not contain `.`).
Exploring the input data
You’ll find the `samples.jsonl` file in `evals/registry/data/test_fuzzy_match/`. Each line is a JSON object. Let’s take a look at the first sample (formatted for readability):
```json
{
  "input": [
    {
      "role": "system",
      "content": "Answer the following questions as concisely as possible."
    },
    {
      "role": "system",
      "content": "What's the capital of France?",
      "name": "example_user"
    },
    {
      "role": "system",
      "content": "Paris",
      "name": "example_assistant"
    },
    {
      "role": "system",
      "content": "What's 2+2?",
      "name": "example_user"
    },
    {
      "role": "system",
      "content": "4",
      "name": "example_assistant"
    },
    {
      "role": "user",
      "content": "Who is the girl who plays eleven in stranger things?"
    }
  ],
  "ideal": [
    "Millie Bobby Brown"
  ]
}
```
Here, we are assessing the capacity of the model to correctly answer trivia questions after being primed with a few examples (few-shot prompting). The `input` field contains the prompt formatted as a list of messages, which is the prompt format expected by OpenAI’s chat models. The `ideal` field specifies the expected correct answer, used by the fuzzy-match eval to check the model’s answer.
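If you later want to run the same kind of eval on your own data, producing this format takes only a few lines of Python. The snippet below is a minimal, hypothetical example: the questions, answers, and output path are placeholders, and it simply writes the same `input`/`ideal` structure shown above, one JSON object per line.

```python
import json

# Hypothetical trivia questions and their expected answers.
my_samples = [
    ("What is the largest planet in the solar system?", "Jupiter"),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen"),
]

with open("samples.jsonl", "w") as f:
    for question, answer in my_samples:
        sample = {
            "input": [
                {"role": "system", "content": "Answer the following questions as concisely as possible."},
                {"role": "user", "content": question},
            ],
            "ideal": [answer],
        }
        # One JSON object per line, as expected by the evals library.
        f.write(json.dumps(sample) + "\n")
```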
Running the eval with `oaieval`
With the setup in place, executing an evaluation is straightforward. First, make sure that your `OPENAI_API_KEY` is correctly configured in your environment variables. Then, proceed to run your evaluation from the command line, specifying the completion function and the evaluation name:

```bash
oaieval <completion_function> <eval_name>
```
For instance, to evaluate using the `gpt-3.5-turbo` model on the `test-fuzzy-match` evaluation:

```bash
oaieval gpt-3.5-turbo test-fuzzy-match
```
Understanding the Output
The evaluation is processed and results are stored in `/tmp/evallogs/` by default. This location can be changed using the `--record_path` flag.
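Since the log is itself a JSONL file, it is easy to inspect programmatically. The helper below is a small illustration of my own: the file path is a placeholder (substitute the log produced by your own run), and the parsing simply follows the structure described in the next three subsections.

```python
import json

# Placeholder path: use the log file produced by your own run.
log_path = "/tmp/evallogs/<your_run_log>.jsonl"

spec, final_report, records = None, None, []
with open(log_path) as f:
    for line in f:
        event = json.loads(line)
        if "spec" in event:
            spec = event["spec"]                   # evaluation specification
        elif "final_report" in event:
            final_report = event["final_report"]   # aggregated metrics
        else:
            records.append(event)                  # per-sample records

print("Eval:", spec["eval_name"])
print("Final report:", final_report)
print("Number of recorded events:", len(records))
```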
`spec`
The first line contains the specification of the evaluation, including the eval name and the completion functions. Notice the `seed` value, which makes runs repeatable. You can set it manually with the `--seed` flag.
```json
{
  "spec": {
    "completion_fns": [
      "gpt-3.5-turbo"
    ],
    "eval_name": "test-fuzzy-match.s1.simple-v0",
    "base_eval": "test-fuzzy-match",
    "split": "s1",
    "run_config": {
      ...,
      "seed": 20220722,
      ...
    },
    ...
  }
}
```
For additional options and flags supported by `oaieval`, refer to its help documentation by running `oaieval --help`.
`final_report`
Next, the final report aggregates the metrics specified in the registration over all the samples.
```json
{
  "final_report": {
    "accuracy": 1.0,
    "f1_score": 0.787878787878788
  },
  "run_id": "2402031402174JDHMBM7"
}
```
`records`
Lastly, detailed records of each processed sample are available, showing both the prompts used and the model’s responses. The data available in the logs depends on how the recorder was set up. In this case, each sample is recorded three times, each time with distinct data.
The record of type `sampling`, the default, contains the prompt and the sampled output.
```json
{
  "run_id": "2402031402174JDHMBM7",
  "event_id": 0,
  "sample_id": "test-fuzzy-match.s1.0",
  "type": "sampling",
  "data": {
    "prompt": [
      "..."
    ],
    "sampled": [
      "..."
    ]
  },
  "created_by": "",
  "created_at": "2024-02-03 14:02:18.146781+00:00"
}
```
The record of type `match` indicates whether the match was successful.
```json
{
  ...,
  "type": "match",
  "data": {
    "correct": true,
    "expected": "Millie Bobby Brown",
    "picked": [
      "Millie Bobby Brown"
    ]
  },
  ...
}
```
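To build some intuition for how the `correct` field above gets set: fuzzy matching, roughly speaking, normalizes both strings and checks whether one contains the other, rather than requiring an exact match. The function below is my own simplified illustration of that idea, not the library's implementation, which additionally computes the token-level `f1_score` reported per sample.

```python
def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def simple_fuzzy_match(sampled: str, expected: str) -> bool:
    """Illustrative check: does one normalized string contain the other?"""
    s, e = normalize(sampled), normalize(expected)
    return e in s or s in e


# Example: a verbose but correct answer still matches.
print(simple_fuzzy_match("The actress is Millie Bobby Brown.", "Millie Bobby Brown"))  # True
```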
The record of type `metrics` provides the calculated metrics for each sample.
```json
{
  ...,
  "data": {
    "accuracy": 1.0,
    "f1_score": 1.0
  },
  ...
}
```
I hope you can now see why `evals` is a powerful tool for prompt engineering. It can be used to evaluate a prompt, together with a set of model parameters, on a large number of inputs. This ability to systematically iterate and achieve consistently high-quality results is pivotal in the development of robust AI applications.
In this walkthrough, we’ve explained the architecture of an eval and demonstrated running a test evaluation using one of the provided templates. The next steps involve applying `evals` to our own input data, customizing the eval logic to our specific use cases, and writing our own completion function. We’ll then delve into model-graded evals.