How to Use Evaluation Runs
1. Select an Agent: Choose the agent you want to test.
2. Select a Dataset: Pick or create a dataset of test questions to evaluate the agent against.
3. Run the Evaluation: Define the number of repetitions and launch the evaluation run.
4. Review Results: Inspect answers, correctness, relevance, hallucination percentage, usefulness, and response time.
5. Refine and Re-Test: Adjust prompts, knowledge bases, or integrations based on findings. Run new evaluations until the agent meets your performance threshold.
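Conceptually, steps 3 through 5 form a loop: run the evaluation, review the scores, refine the agent, and repeat until every dimension clears your threshold. A minimal sketch in Python of that loop; `run_evaluation`, `refine_agent`, the dimension names, and the threshold value are illustrative stand-ins, not product functions:

```python
import random

DIMENSIONS = ("groundedness", "relevance", "usefulness", "correctness")
THRESHOLD = 75  # minimum acceptable average score per dimension (illustrative)

def run_evaluation() -> dict:
    """Hypothetical stand-in: pretend to run an evaluation and return average scores."""
    return {dim: random.randint(60, 100) for dim in DIMENSIONS}

def refine_agent(results: dict) -> None:
    """Hypothetical stand-in for adjusting prompts, knowledge bases, or integrations."""
    weakest = min(results, key=results.get)
    print(f"Refining agent; weakest dimension is {weakest} ({results[weakest]})")

results = run_evaluation()
for _ in range(10):  # cap the refine/re-test loop for this sketch
    if min(results.values()) >= THRESHOLD:
        break
    refine_agent(results)
    results = run_evaluation()

print("Final scores:", results)
```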
You can run a maximum of 500 evaluations per run. The total number of evaluations is the number of repetitions you select multiplied by the number of questions in the dataset. For example, if your dataset contains 100 questions, you may select up to 5 repetitions.
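The limit is easy to check up front. A minimal sketch in Python; the constant and function names are illustrative, not part of any product API:

```python
MAX_EVALUATIONS_PER_RUN = 500  # limit described above

def total_evaluations(dataset_size: int, repetitions: int) -> int:
    """Evaluations consumed by a run = questions in the dataset x repetitions."""
    return dataset_size * repetitions

def max_repetitions(dataset_size: int) -> int:
    """Largest repetition count that stays within the per-run limit."""
    return MAX_EVALUATIONS_PER_RUN // dataset_size

dataset_size = 100   # hypothetical dataset with 100 questions
repetitions = 5
planned = total_evaluations(dataset_size, repetitions)
assert planned <= MAX_EVALUATIONS_PER_RUN, (
    f"{planned} evaluations exceed the limit; "
    f"use at most {max_repetitions(dataset_size)} repetitions"
)
print(planned)  # 500, exactly at the limit
```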
Scoring Dimensions
Groundedness
answer vs. context
Measures whether the answer is supported by the provided context, helping prevent hallucinations.
Scores:
- 100-76: Fully grounded; all claims are directly supported by context.
- 75-61: Mostly grounded; majority is supported, minor unsupported details.
- 60-41: Partially grounded; some statements are traceable, but mixed in with unsupported claims.
- 40-21: Weakly grounded; mostly unsupported, only small fragments connect to context.
- 20-0: Not grounded; entirely fabricated, no link to context.
Relevance
context vs. question
Evaluates whether the retrieved context is relevant to the specific question being asked.
Scores:
- 100-76: Fully relevant; context is highly pertinent and sufficient to answer the question.
- 75-61: Mostly relevant; context is useful with minor extra noise.
- 60-41: Partially relevant; some relevant fragments, but lots of extra or distracting content.
- 40-21: Low relevance; mostly noise, little useful info.
- 20-0: Irrelevant; context unrelated.
Usefulness
answer vs. question
Measures how actionable and helpful the answer is.
Scores:
- 100-76: Highly useful; comprehensive, clear, and insightful.
- 75-61: Useful; solid answer, covers key points, and is easy to understand.
- 60-41: Somewhat useful; addresses the question, but lacks details, depth, or clarity.
- 40-21: Barely useful; minimal attempt, vague, or incomplete.
- 20-0: Not useful; doesn’t address the question at all.
Correctness
answer vs. reference
Compares the answer against a reference or gold-standard answer.
Scores:
- 100-76: Fully correct; matches the reference answer (or semantically equivalent).
- 75-61: Mostly correct; small errors or omissions, but main facts are right.
- 60-41: Partially correct; mix of correct and incorrect information.
- 40-21: Mostly incorrect; major errors, minor overlap with correct answer.
- 20-0: Incorrect; completely wrong or contradicts reference.
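All four dimensions share the same 0-100 band boundaries, so a single helper can label any score. A minimal sketch in Python; the function name and generic band labels are illustrative (each dimension's rubric above gives the dimension-specific wording):

```python
def score_band(score: int) -> str:
    """Map a 0-100 evaluation score to the rubric band used above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 76:
        return "Fully meets the dimension (100-76)"
    if score >= 61:
        return "Mostly meets the dimension (75-61)"
    if score >= 41:
        return "Partially meets the dimension (60-41)"
    if score >= 21:
        return "Weakly meets the dimension (40-21)"
    return "Does not meet the dimension (20-0)"

print(score_band(83))  # Fully meets the dimension (100-76)
```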
Response Time
Tracks the duration it takes for the agent to generate an answer.
Some models may respond more slowly if they prioritize depth of reasoning or complex analysis.
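Response time can also be spot-checked locally during development. A minimal sketch in Python; `agent_answer` is a hypothetical placeholder for however you call your agent:

```python
import time

def agent_answer(question: str) -> str:
    """Hypothetical placeholder; replace with your own agent client call."""
    return "..."

question = "How do I reset my password?"
start = time.perf_counter()
answer = agent_answer(question)
elapsed = time.perf_counter() - start
print(f"Response time: {elapsed:.2f}s")
```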
Best Practices
- Use run results to fine-tune prompts, knowledge bases, or integrations.
- Re-run evaluations after updates to confirm ongoing accuracy.
- Always run evaluations before moving an agent into production.

