Scenarios
Path: Trust Lab → Scenarios (third icon)
A scenario is a single test conversation — a sequence of messages you send to the agent, plus the criteria (evals) that define what a correct response looks like.
Scenario list
| Scenario | Suite | Steps | Evals | Last Run |
|---|---|---|---|---|
| Product overview — happy path | — | 3 | 4 | 4 days ago |
| Create a project — OM vs DIY | Smoke | 5 | 7 | 4 days ago |
| Update a record | Smoke | 2 | 2 | 4 days ago |
| Onboarding walkthrough | Smoke | 1 | 2 | 4 days ago |
Tabs at the top filter by outcome: Passed, Failed, No Runs.
Each row shows:
| Column | Description |
|---|---|
| Scenario | The scenario name (e.g., “RoofSnap: Roofing Toolkit Overview”) |
| Suite | Which suite the scenario belongs to (blank if standalone) |
| Steps | Number of conversation turns |
| Evals | Number of assertions checked per run |
| Last Run | When this scenario was last executed |
Row icons on the right let you Run now, view History, or Delete the scenario.
Creating a scenario
Click + New Scenario. A prompt asks how you’d like to start:
- Option 1 — Start from scratch: build a multi-turn conversation with evals and mock results, turn by turn. Choose this when you want to design a specific test case from the ground up.
- Option 2 — Import Conversation: import a real conversation from the Playground or from production. Evals and mock results are auto-generated from the actual interaction. This is the fastest way to turn a real user conversation into a test case.
Importing a conversation
The import dialog lets you search across all your production conversations by conversation ID, user name, email, or message content. Use the Action, User, and Agent filters to narrow the list. Click any conversation to preview it, then click Import to convert it into a scenario with evals pre-filled.
The scenario editor
The editor has two views: Visual (default) and JSON. The top-right corner also has View Runs, Run, and Save buttons.
Left panel — User Context
Before the conversation starts, you can define who the simulated user is:
| Field | Purpose |
|---|---|
| User ID | The identity passed to the agent via the embed code |
| Name | Simulated user’s display name |
| Email | Simulated user’s email address |
| Role | User’s role (e.g., admin, member) |
| Plan | Subscription plan |
| Locale | Language/region setting |
| Custom Fields | Any additional attributes your product passes |
Setting a realistic User Context ensures the agent is tested with real-world data — including plan gating, locale-specific responses, and role-based behaviors.
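
As a rough illustration, the User Context fields above could be modeled as follows. This is only a sketch: the key names (`user_id`, `custom_fields`, and so on), the default values, and the sample plan and role values are assumptions made for the example, and the editor's JSON view may use different names.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class UserContext:
    """Hypothetical shape for the simulated user; field names are assumptions."""

    user_id: str          # identity passed to the agent via the embed code
    name: str             # display name
    email: str            # email address
    role: str = "member"  # e.g. admin, member
    plan: str = "free"    # subscription plan (sample value, not from the docs)
    locale: str = "en-US"  # language/region setting
    custom_fields: dict = field(default_factory=dict)  # extra product attributes


# A made-up admin user on a made-up "pro" plan, useful for testing plan gating.
admin_on_pro = UserContext(
    user_id="user_123",
    name="Dana Example",
    email="dana@example.com",
    role="admin",
    plan="pro",
    custom_fields={"company_size": 12},
)

# Plain dictionary form, convenient for comparing against a JSON payload.
context_payload = asdict(admin_on_pro)
```

Keeping one such context per persona (for example an admin on a paid plan and a member on a free plan) makes it easier to see which scenarios cover which role and plan combinations.
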
Main panel — Turns and Evals
Each Turn represents one round of the conversation (a user message and the agent’s response). At the top of each turn is a turn description (for example, “Checks that the agent explains RoofSnap as a measurement and estimating tool for contractors, and offers to take the user to their dashboard.”). This is auto-generated when importing, or you can write your own.
- USER SAYS: the message sent to the agent in this turn (up to 10,000 characters).
- EVALS: the checks that must pass for this turn to be considered successful. Click + Add Eval to add one or more.
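
The relationship between a turn and its evals can be sketched as below. The `Turn` class and its method names are hypothetical, not the product's data model; the sketch only encodes two things stated above: the 10,000-character limit on USER SAYS, and the rule that a turn succeeds only when every eval passes.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_USER_MESSAGE_CHARS = 10_000  # limit stated for the USER SAYS field


@dataclass
class Turn:
    """Hypothetical model of one turn: a user message plus its evals."""

    description: str
    user_says: str
    # Each eval is modeled as a function from the agent's reply to pass/fail.
    evals: List[Callable[[str], bool]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.user_says) > MAX_USER_MESSAGE_CHARS:
            raise ValueError("USER SAYS exceeds 10,000 characters")

    def passes(self, agent_response: str) -> bool:
        # The turn is successful only if every eval passes.
        return all(check(agent_response) for check in self.evals)


turn = Turn(
    description="Agent explains the product and offers the dashboard.",
    user_says="What is RoofSnap?",
    evals=[
        lambda reply: "RoofSnap" in reply,
        lambda reply: "dashboard" in reply.lower(),
    ],
)

turn_passed = turn.passes(
    "RoofSnap measures roofs and builds estimates. Want me to open your dashboard?"
)
```

In this sketch a single failing eval fails the whole turn, so it helps to keep each eval focused on one assertion; that way a failed run points at the specific expectation that was not met.
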
Eval types
When you click + Add Eval, a picker appears with two top-level categories.
Response evals — does the agent’s reply say what it should?
| Eval | How it works |
|---|---|
| LLM-as-Judge | A separate LLM evaluates the agent’s response against your written success criteria. Best for nuanced, natural-language assertions (e.g., “Agent explains the product and offers to navigate the user to their dashboard”). |
| Contains | Checks that the agent’s response includes a specific word or phrase. Fast and deterministic. |
| Exact Match | Checks that the agent’s response matches the expected text exactly (case-insensitive). |
| Regex | Validates the response against a regular expression pattern. |
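
To make the difference between the three deterministic evals concrete, here is a minimal sketch of their semantics. It is not the product's implementation. The function names are hypothetical, and several details are assumptions: that Contains is case-sensitive, that Exact Match ignores leading and trailing whitespace, and that Regex matches anywhere in the response rather than requiring a full match. Only the case-insensitivity of Exact Match comes from the table above.

```python
import re


def contains(response: str, expected: str) -> bool:
    # Assumption: case-sensitive substring check.
    return expected in response


def exact_match(response: str, expected: str) -> bool:
    # Case-insensitive, as described in the table.
    # Assumption: surrounding whitespace is ignored.
    return response.strip().casefold() == expected.strip().casefold()


def regex_match(response: str, pattern: str) -> bool:
    # Assumption: the pattern may match anywhere in the response.
    return re.search(pattern, response) is not None


reply = "Your project #4821 was created."

contains_ok = contains(reply, "created")
exact_ok = exact_match(reply, "your project #4821 was created.")
regex_ok = regex_match(reply, r"#\d{4}\b")
anchored_ok = regex_match(reply, r"^Project")  # reply starts with "Your"
```

Under these assumptions the first three checks pass and the anchored pattern fails. Exact Match is the most brittle of the three for free-form agent replies, so Contains or Regex is usually a better fit when only one fragment of the response matters.
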
Tool use evals — did the agent do the right thing?
| Eval | How it works |
|---|---|
| Action Called | Verifies the agent called a specific action and lets you inspect the parameters it passed. |
| Navigate | Confirms the agent navigated to the expected URL or route. |
| Knowledge Search | Checks that the agent searched knowledge and found the expected document. |
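
The tool use evals can be thought of as checks over a log of what the agent did during a turn. The sketch below assumes such a log is available as a list of dictionaries; that record format, the helper names, and the sample action, route, and document names are all hypothetical and are not taken from the product.

```python
from typing import Any, Dict, List, Optional

ToolCall = Dict[str, Any]  # hypothetical record of one agent tool call


def action_called(
    calls: List[ToolCall],
    action_name: str,
    expected_params: Optional[Dict[str, Any]] = None,
) -> bool:
    """True if the named action was called, optionally with the given parameters."""
    for call in calls:
        if call.get("type") != "action" or call.get("name") != action_name:
            continue
        params = call.get("params", {})
        if expected_params is None or all(
            params.get(key) == value for key, value in expected_params.items()
        ):
            return True
    return False


def navigated_to(calls: List[ToolCall], expected_url: str) -> bool:
    """True if the agent navigated to the expected URL or route."""
    return any(
        call.get("type") == "navigate" and call.get("url") == expected_url
        for call in calls
    )


def knowledge_found(calls: List[ToolCall], document: str) -> bool:
    """True if any knowledge search returned the expected document."""
    return any(
        call.get("type") == "knowledge_search"
        and document in call.get("documents", [])
        for call in calls
    )


# Made-up log for a single turn.
recorded_calls: List[ToolCall] = [
    {"type": "knowledge_search", "query": "create project",
     "documents": ["Creating a project"]},
    {"type": "action", "name": "create_project",
     "params": {"mode": "DIY", "name": "Smith roof"}},
    {"type": "navigate", "url": "/projects/new"},
]

called_with_mode = action_called(recorded_calls, "create_project", {"mode": "DIY"})
went_to_new_project = navigated_to(recorded_calls, "/projects/new")
found_doc = knowledge_found(recorded_calls, "Creating a project")
```

Checking only the parameters that matter (here just `mode`) keeps the eval stable when the agent fills in other parameters differently between runs.
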
Writing good LLM-as-Judge success criteria
The success criteria field describes what a correct agent response looks like. Write it in plain English, including an example response. The LLM judge compares the actual response against this description. Be specific:
- Good: “Agent explains that RoofSnap is measuring and estimating software for roofers and contractors, highlighting features like aerial measurements, estimates, contracts, and sketch tools. Offers to navigate the user to their dashboard.”
- Too vague: “Agent gives a good answer about RoofSnap.”