Scenarios

Path: Trust Lab → Scenarios (third icon)

A scenario is a single test conversation — a sequence of messages you send to the agent, plus the criteria (evals) that define what a correct response looks like.

Scenario list

Trust Lab Your Agent

Scenarios

Each scenario is a conversation you expect the agent to handle correctly.

New Scenario

All (4) · Passed (4) · Failed (0) · No Runs (0)

Scenario	Suite	Steps	Evals	Last Run
Product overview — happy path	—	3	4	4 days ago
Create a project — OM vs DIY	Smoke	5	7	4 days ago
Update a record	Smoke	2	2	4 days ago
Onboarding walkthrough	Smoke	1	2	4 days ago

Tabs at the top filter the list by the outcome of each scenario's last run: All, Passed, Failed, and No Runs.

Each row shows:

Column	Description
Scenario	The scenario name (e.g., “RoofSnap: Roofing Toolkit Overview”)
Suite	Which suite the scenario belongs to (blank if standalone)
Steps	Number of conversation turns
Evals	Number of assertions checked per run
Last Run	When this scenario was last executed

Row icons on the right let you Run now, view History, or Delete the scenario.

Creating a scenario

Click + New Scenario. A prompt asks how you’d like to start:

Option 1 — Start from scratch: build a multi-turn conversation with evals and mock results, turn by turn. Choose this when you want to design a specific test case from the ground up.
Option 2 — Import Conversation: import a real conversation from the Playground or from production. Evals and mock results are auto-generated from the actual interaction. This is the fastest way to turn a real user conversation into a test case.

Importing a conversation

The import dialog lets you search across all your production conversations by conversation ID, user name, email, or message content. Use the Action, User, and Agent filters to narrow the list. Click any conversation to preview it, then click Import to convert it into a scenario with evals pre-filled.

The scenario editor

Product overview — happy path · Passed · Evals 2/2 · Agent: Your Agent · Model: dynamic · Duration: 9.4s · Edit scenario · View conversation

Conversation

Hi! I can help you get set up. What would you like to do first?

User

What does this product do?

It's a measuring and estimating tool that helps your team capture measurements, build estimates, and generate contracts in one place. Want me to take you to your dashboard to get started?

Tool Calls

search_knowledge 1 result

consult_navigation_agent 1 result

Evaluations

An LLM judge determined that the response is correct · LLM-as-Judge

Response evaluated against the success criteria and the rubric was satisfied.

Passed

The agent performed a knowledge search as expected · Knowledge Search

Expected knowledge document was retrieved during the turn.

Passed

The editor has two views: Visual (default) and JSON. The top-right corner also has View Runs, Run, and Save buttons.

Left panel — User Context

Before the conversation starts, you can define who the simulated user is:

Field	Purpose
User ID	The identity passed to the agent via the embed code
Name	Simulated user’s display name
Email	Simulated user’s email address
Role	User’s role (e.g., admin, member)
Plan	Subscription plan
Locale	Language/region setting
Custom Fields	Any additional attributes your product passes

Setting a realistic User Context lets you test the agent under real-world conditions such as plan gating, locale-specific responses, and role-based behavior.
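As a rough illustration, the User Context fields above can be thought of as a small record that you vary per scenario. The sketch below is not Trust Lab's actual schema: the class name, attribute names, default values, and sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class UserContext:
    """Hypothetical mirror of the User Context fields in the editor."""
    user_id: str
    name: str
    email: str
    role: str = "member"      # assumed default, for illustration only
    plan: str = "free"        # assumed default, for illustration only
    locale: str = "en-US"     # assumed default, for illustration only
    custom_fields: Dict[str, str] = field(default_factory=dict)


# A baseline simulated user (sample data, not a real account).
base = UserContext(user_id="u_123", name="Dana Example", email="dana@example.com")

# Variants of the same user for plan-gating, role, and locale checks.
admin_on_pro = replace(base, role="admin", plan="pro")
french_member = replace(base, locale="fr-FR")
```

Keeping one baseline user and deriving variants makes it easier to see which single attribute a scenario is meant to exercise.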

Main panel — Turns and Evals

Each Turn represents one round of the conversation (a user message and the agent’s response). At the top of each turn is a turn description (for example, “Checks that the agent explains RoofSnap as a measurement and estimating tool for contractors, and offers to take the user to their dashboard.”). The description is auto-generated when you import a conversation, or you can write your own.

USER SAYS: the message sent to the agent in this turn (up to 10,000 characters).
EVALS: the checks that must pass for this turn to be considered successful. Click + Add Eval to add one or more.
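To make the relationship between scenarios, turns, and evals concrete, here is a minimal data-model sketch. It is an assumption about how the pieces fit together, not the format shown in the editor's JSON view: every class, attribute, and eval kind name below is hypothetical. The only details taken from this page are the 10,000-character limit on USER SAYS and the meaning of the Steps and Evals columns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Eval:
    kind: str      # hypothetical label, e.g. "llm_judge" or "knowledge_search"
    expected: str  # success criteria, phrase, pattern, or document name


@dataclass
class Turn:
    description: str
    user_says: str
    evals: List[Eval] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The editor limits USER SAYS to 10,000 characters.
        if len(self.user_says) > 10_000:
            raise ValueError("USER SAYS is limited to 10,000 characters")


@dataclass
class Scenario:
    name: str
    turns: List[Turn]

    @property
    def steps(self) -> int:
        """Number of conversation turns (the Steps column)."""
        return len(self.turns)

    @property
    def eval_count(self) -> int:
        """Total assertions checked per run (the Evals column)."""
        return sum(len(turn.evals) for turn in self.turns)


scenario = Scenario(
    name="Product overview",
    turns=[
        Turn(
            description="Agent explains the product and offers the dashboard.",
            user_says="What does this product do?",
            evals=[
                Eval("llm_judge", "Explains the product and offers to navigate."),
                Eval("knowledge_search", "Product overview"),
            ],
        )
    ],
)
```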

Eval types

When you click + Add Eval, a picker appears with two top-level categories.

Response evals — does the agent’s reply say what it should?

Eval	How it works
LLM-as-Judge	A separate LLM evaluates the agent’s response against your written success criteria. Best for nuanced, natural-language assertions (e.g., “Agent explains the product and offers to navigate the user to their dashboard”).
Contains	Checks that the agent’s response includes a specific word or phrase. Fast and deterministic.
Exact Match	Checks that the agent’s response matches the expected text exactly (case-insensitive).
Regex	Validates the response against a regular expression pattern.
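The three deterministic response evals in the table can be approximated in a few lines of Python. This is only a sketch of the idea, not Trust Lab's implementation. The function names are hypothetical, and two behaviors are assumptions: that Contains is case-sensitive, and that Regex matches anywhere in the response rather than requiring a full match.

```python
import re


def contains(response: str, phrase: str) -> bool:
    """Contains: the phrase appears somewhere in the response.

    Assumption: the comparison is case-sensitive.
    """
    return phrase in response


def exact_match(response: str, expected: str) -> bool:
    """Exact Match: the full text is equal, ignoring case."""
    return response.casefold() == expected.casefold()


def regex(response: str, pattern: str) -> bool:
    """Regex: the pattern matches somewhere in the response.

    Assumption: a partial match is enough (search, not full match).
    """
    return re.search(pattern, response) is not None


reply = "Want me to take you to your dashboard to get started?"
has_dashboard = contains(reply, "dashboard")
same_text = exact_match("Done.", "done.")
mentions_route = regex(reply, r"\bdashboard\b")
```

Because these checks are exact, they suit short, predictable replies. For longer natural-language answers, LLM-as-Judge is usually the better fit.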

Tool use evals — did the agent do the right thing?

Eval	How it works
Action Called	Verifies the agent called a specific action and lets you inspect the parameters it passed.
Navigate	Confirms the agent navigated to the expected URL or route.
Knowledge Search	Checks that the agent searched knowledge and found the expected document.
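Tool use evals look at what the agent did during a turn rather than what it said. The sketch below shows one way such checks could work over a list of recorded tool calls. It is illustrative only: the ToolCall record, the function names, the "navigate" call, and the parameter layout are assumptions. Only the search_knowledge call name comes from the example run shown earlier.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    """Hypothetical record of one tool call made during a turn."""
    name: str
    params: Dict[str, Any] = field(default_factory=dict)
    results: List[str] = field(default_factory=list)


def action_called(
    calls: List[ToolCall],
    name: str,
    expected_params: Optional[Dict[str, Any]] = None,
) -> bool:
    """True if a call with this name (and, optionally, these params) exists."""
    for call in calls:
        if call.name != name:
            continue
        if expected_params is None:
            return True
        if all(k in call.params and call.params[k] == v
               for k, v in expected_params.items()):
            return True
    return False


def knowledge_search_found(calls: List[ToolCall], document: str) -> bool:
    """True if a knowledge search returned the expected document."""
    return any(
        call.name == "search_knowledge" and document in call.results
        for call in calls
    )


# Sample calls for one turn; "navigate" and its "url" param are hypothetical.
calls = [
    ToolCall("search_knowledge", {"query": "what does this product do"},
             results=["Product overview"]),
    ToolCall("navigate", {"url": "/dashboard"}),
]
```

In this sketch a Navigate check is simply action_called(calls, "navigate", {"url": "/dashboard"}).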

Writing good LLM-as-Judge success criteria

The success criteria field describes what a correct agent response looks like. Write it in plain English, including an example response. The LLM judge compares the actual response against this description. Be specific:

Good: “Agent explains that RoofSnap is measuring and estimating software for roofers and contractors, highlighting features like aerial measurements, estimates, contracts, and sketch tools. Agent offers to navigate the user to their dashboard.”
Too vague: “Agent gives a good answer about RoofSnap.”
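To see why specific criteria matter, it helps to picture how a judge might use them. The sketch below builds a grading prompt from the success criteria and the agent's response, then reads a pass/fail verdict from the judge's reply. It is a guess at the general shape of an LLM-as-Judge check, not the prompt Trust Lab actually sends. Both function names and the PASS/FAIL convention are assumptions, and no model is called.

```python
def build_judge_prompt(success_criteria: str, agent_response: str) -> str:
    """Assemble a hypothetical grading prompt for a judge model."""
    return (
        "You are grading an AI agent's reply.\n\n"
        f"Success criteria:\n{success_criteria}\n\n"
        f"Agent response:\n{agent_response}\n\n"
        "Answer PASS if the response satisfies every point in the "
        "criteria, otherwise answer FAIL."
    )


def parse_verdict(judge_output: str) -> bool:
    """Treat any judge reply that starts with PASS as a passing eval."""
    return judge_output.strip().upper().startswith("PASS")


criteria = (
    "Agent explains that the product is a measuring and estimating tool "
    "and offers to navigate the user to their dashboard."
)
response = (
    "It's a measuring and estimating tool. "
    "Want me to take you to your dashboard?"
)
prompt = build_judge_prompt(criteria, response)
```

With criteria like "Agent gives a good answer", the judge has nothing concrete to check the response against, so verdicts are more likely to vary from run to run.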