Datasets are collections of questions used to evaluate an agent’s performance. To build effective datasets, ensure a meaningful sample size that reflects real-world usage — too few questions may fail to expose important gaps. Include variety by mixing common, edge-case, and failure-case questions, and prioritize realism by writing questions as users would naturally phrase them. Aim for balanced coverage across all knowledge bases, prompt instructions, and integrations the agent depends on. Finally, maintain a reusable format by uploading datasets as CSV files, allowing for version control and repeated use in future evaluations.
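Below is a minimal sketch of what such a CSV dataset might look like, written with Python's standard csv module. The filename and the column names ("question", "category") are illustrative assumptions, not a required schema; match whatever format your upload flow expects.

```python
import csv

# Illustrative evaluation questions spanning common, edge-case, and
# failure-case scenarios. The column names and file name are assumptions.
rows = [
    {"question": "How do I reset my password?", "category": "common"},
    {"question": "Can I reset the password of a deactivated user?", "category": "edge-case"},
    {"question": "Why does my reset link say the token has expired?", "category": "failure-case"},
]

with open("agent_eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "category"])
    writer.writeheader()
    writer.writerows(rows)
```

Keeping the dataset in a plain CSV like this makes it easy to check into version control and reuse across evaluation runs.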

Best Practices

  • Use enough questions in your dataset to represent real-world usage. A dataset that is too small will not expose gaps effectively.
  • Include a mix of common, edge-case, and failure-case questions.
  • Cover each knowledge base, prompt instruction, and integration the agent relies on, so that every component is tested evenly; a quick coverage check is sketched below.
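
The following sketch shows one way to sanity-check coverage balance, assuming the CSV layout from the earlier example plus an optional "knowledge_base" column; both column names are illustrative assumptions rather than a prescribed format.

```python
import csv
from collections import Counter

# Count questions per category and per knowledge base to spot coverage gaps.
with open("agent_eval_dataset.csv", newline="") as f:
    rows = list(csv.DictReader(f))

by_category = Counter(row["category"] for row in rows)
by_kb = Counter(row.get("knowledge_base", "unspecified") for row in rows)

print("Questions per category:", dict(by_category))
print("Questions per knowledge base:", dict(by_kb))
```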