Evaluations
Evaluations are only available on the Cloud and Enterprise plans.
Evaluations help you monitor and understand the performance of your Chatflow/Agentflow application. At a high level, an evaluation is a process that takes a set of inputs and corresponding outputs from your Chatflow/Agentflow and generates scores. These scores can be derived by comparing outputs to reference results, for example through string matching, numeric comparison, or even using an LLM as a judge. Evaluations are conducted using Datasets and Evaluators.
Datasets are the inputs that will be used to run your Chatflow/Agentflow, along with the corresponding outputs for comparison. You can add the input and expected output manually, or upload a CSV file with 2 columns: Input and Output.
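For example, a CSV like the following defines a dataset with 2 entries (a minimal sketch; the rows mirror the example dataset shown at the end of this page, and the header row is shown for clarity rather than as a verbatim template):

```csv
Input,Output
What is the capital of UK,Capital of UK is London
How many days are there in a year,There are 365 days in a year
```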
Evaluators are like unit tests. During an evaluation, the inputs from the Datasets are run against the selected flows, and the outputs are evaluated using the selected evaluators (a sketch of the text-based checks appears after this list). There are 3 types of evaluators:
Text Based: string-based checks on the output:
Contains Any
Contains All
Does Not Contain Any
Does Not Contain All
Starts With
Does Not Start With
Numeric Based: numeric checks on run metrics:
Total Tokens
Prompt Tokens
Completion Tokens
API Latency
LLM Latency
Chatflow Latency
Agentflow Latency (coming)
Output Characters Length
LLM Based: using another LLM to grade the output:
Hallucination
Correctness
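As an illustration of what the text-based evaluators check, here is a minimal sketch in TypeScript; the type and function names are assumptions for this example, not Flowise's implementation:

```typescript
// Illustrative sketch of text-based evaluator semantics.
// Names and shapes are assumptions, not Flowise's actual code.
type TextCheck =
  | { kind: "containsAny"; values: string[] }
  | { kind: "containsAll"; values: string[] }
  | { kind: "notContainsAny"; values: string[] }
  | { kind: "startsWith"; value: string };

function evaluateText(output: string, check: TextCheck): boolean {
  switch (check.kind) {
    case "containsAny":
      return check.values.some((v) => output.includes(v));
    case "containsAll":
      return check.values.every((v) => output.includes(v));
    case "notContainsAny":
      return !check.values.some((v) => output.includes(v));
    case "startsWith":
      return output.startsWith(check.value);
  }
}

// Example: pass if the answer mentions "London"
console.log(
  evaluateText("Capital of UK is London", { kind: "containsAny", values: ["London"] })
); // true
```

Numeric-based evaluators work the same way, except the predicate compares a run metric (token counts, latency, output character length) against a threshold.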
Now that we have Datasets and Evaluators prepared, we can start running an evaluation.
1.) Select the dataset and chatflow to evaluate. You can select multiple datasets and chatflows. In the example below, every input from Dataset1 is run against 2 chatflows. Since Dataset1 has 2 inputs, a total of 4 outputs will be produced and evaluated.
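In other words, the number of outputs is the number of dataset inputs multiplied by the number of selected flows. A quick sketch of the pairing (the flow IDs here are hypothetical):

```typescript
// Every dataset input is paired with every selected flow:
// 2 inputs x 2 chatflows = 4 runs, as in the example above.
const inputs = ["What is the capital of UK", "How many days are there in a year"];
const chatflows = ["chatflow-A", "chatflow-B"]; // hypothetical flow IDs

const runs = chatflows.flatMap((flow) =>
  inputs.map((input) => ({ flow, input }))
);
console.log(runs.length); // 4
```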
2.) Select the evaluators. Only text-based and numeric-based evaluators can be selected at this stage.
3.) (Optional) Select an LLM-based evaluator, then start the evaluation:
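Conceptually, an LLM-based evaluator such as Correctness passes the input, the flow's actual output, and the expected output from the dataset to a judge model. The sketch below illustrates the idea; the prompt wording and the `callLLM` parameter are assumptions, not Flowise's actual prompt or API:

```typescript
// Hypothetical LLM-as-judge grader; `callLLM` stands in for any
// chat-completion client and is injected by the caller.
async function gradeCorrectness(
  question: string,
  actual: string,
  expected: string,
  callLLM: (prompt: string) => Promise<string>
): Promise<string> {
  const prompt = [
    "You are grading an answer for correctness.",
    `Question: ${question}`,
    `Expected answer: ${expected}`,
    `Actual answer: ${actual}`,
    "Reply with a single word: PASS or FAIL.",
  ].join("\n");
  return (await callLLM(prompt)).trim();
}
```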
4.) Wait for the evaluation to complete:
5.) After the evaluation has completed, click the graph icon on the right side to view the details:
The 3 charts above show the summary of the evaluation:
Pass/fail rate
Average prompt and completion tokens used
Average latency of the request
The table below the charts shows the details of each execution.
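Under assumed field names for a per-run record (they are illustrative, not Flowise's schema), the three summary charts correspond to simple aggregates over those table rows:

```typescript
// Hypothetical per-run record; field names are assumptions.
interface RunResult {
  passed: boolean;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
}

function summarize(results: RunResult[]) {
  const n = results.length;
  const avg = (f: (r: RunResult) => number) =>
    results.reduce((sum, r) => sum + f(r), 0) / n;
  return {
    passRate: results.filter((r) => r.passed).length / n, // pass/fail chart
    avgPromptTokens: avg((r) => r.promptTokens),           // token chart
    avgCompletionTokens: avg((r) => r.completionTokens),
    avgLatencyMs: avg((r) => r.latencyMs),                 // latency chart
  };
}
```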
When the flows used in an evaluation have been updated or modified, a warning message is shown:
You can re-run the same evaluation using the Re-Run Evaluation button in the top-right corner. You will then be able to see the different versions:
You can also view and compare the results from different versions:
For reference, here is the example dataset used in this walkthrough:

| Input | Output |
| --- | --- |
| What is the capital of UK | Capital of UK is London |
| How many days are there in a year | There are 365 days in a year |