Question 1

How do I create test cases?

Accepted Answer

Test cases are defined as input/expected output pairs in the Studio UI or via the SDK. For each test case, you provide the user message (input), the expected agent response (or key elements that should be present), and optional metadata like tags, priority, and evaluation criteria. You can import test cases from CSV, generate them from production conversation logs, or write them manually. Test suites can be organized by capability, risk level, or use case.

Question 2

What scoring methods are available?

Accepted Answer

OrchStack supports multiple scoring methods: semantic similarity (embedding-based comparison between expected and actual output), exact match (for structured outputs like JSON), keyword presence (checking for required information), custom evaluation functions (write your own scorer in TypeScript), and LLM-as-judge (using a separate LLM to evaluate quality). You can combine multiple scorers with configurable weights to create a composite quality score.

Question 3

How does regression testing work?

Accepted Answer

When you publish a new agent version, OrchStack automatically runs your entire test suite against the new version and compares results to the previous version. If any test score drops below a configurable threshold (e.g., more than 5% degradation), the publish is blocked and you are notified. You can configure this as a hard gate (block deployment) or soft gate (warn but allow). Regression results are visualized as version-over-version comparison charts.

Question 4

Can I A/B test different prompts?

Accepted Answer

Yes. Create two or more prompt variations, assign them to test groups, and run the same test suite against each. OrchStack shows side-by-side results with statistical significance testing. You can also run A/B tests in production by routing a percentage of traffic to each variation and comparing business outcomes (not just test scores). The winning variation can be promoted with one click.

Question 5

Does evaluation support multi-turn conversations?

Accepted Answer

Yes. Multi-turn test cases define a sequence of user messages and expected agent responses. The evaluation engine maintains conversation context between turns, just like a real session. This lets you test complex scenarios like booking flows, escalation paths, and multi-step troubleshooting. Each turn is scored independently, and the overall test case score is an aggregate.

Test Agents Before They Go Live

Define Tests, Ship with Confidence

Catch Regressions Before They Ship

Version Comparison

Compare Prompt Variations Side-by-Side

Variant A

Variant B

Score Every Response for Accuracy, Relevance, and Safety

Accuracy

Relevance

Safety

Complete Evaluation Toolkit

Test Suite Builder

Regression Testing

A/B Testing

Accuracy Scoring

Safety Evaluation

Performance Dashboards

Agent Evaluation FAQ

Ship Agents You Trust