Scenarios

Path: Trust Lab → Scenarios (third icon)

A scenario is a single test conversation — a sequence of messages you send to the agent, plus the criteria (evals) that define what a correct response looks like.

Tabs at the top filter by outcome: Passed, Failed, No Runs.

Each row shows:

Column | Description
Scenario | The scenario name (e.g., “RoofSnap: Roofing Toolkit Overview”)
Suite | Which suite the scenario belongs to (blank if standalone)
Steps | Number of conversation turns
Evals | Number of assertions checked per run
Last Run | When this scenario was last executed

Row icons on the right let you Run now, view History, or Delete the scenario.

To create a scenario, click + New Scenario. A prompt asks how you’d like to start:

  • Option 1 — Start from scratch: build a multi-turn conversation with evals and mock results, turn by turn. Choose this when you want to design a specific test case from the ground up.
  • Option 2 — Import Conversation: import a real conversation from the Playground or from production. Evals and mock results are auto-generated from the actual interaction. This is the fastest way to turn a real user conversation into a test case.

The import dialog lets you search across all your production conversations by conversation ID, user name, email, or message content. Use the Action, User, and Agent filters to narrow the list. Click any conversation to preview it, then click Import to convert it into a scenario with evals pre-filled.

[Figure: Multi-turn scenario editor with user context panel and conversation turns with evals. The example shows a User Context for Jane Smith (plan: pro, locale: en-US) and two turns: "What pricing plans do you offer?", checked by an LLM-as-Judge eval, and "Can you take me to the pricing page?", checked by a Contains "pricing" eval and a Navigate /pricing eval.]

The editor has two views: Visual (default) and JSON. The top-right corner also has View Runs, Run, and Save buttons.

Before the conversation starts, you can define who the simulated user is:

Field | Purpose
User ID | The identity passed to the agent via the embed code
Name | Simulated user’s display name
Email | Simulated user’s email address
Role | User’s role (e.g., admin, member)
Plan | Subscription plan
Locale | Language/region setting
Custom Fields | Any additional attributes your product passes

Setting a realistic User Context lets you test behavior that depends on who the user is, including plan gating, locale-specific responses, and role-based behavior.
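As a rough illustration, the User Context fields can be pictured as a small record. The sketch below is not the product's JSON schema: the class name, the snake_case field names, and the sample values are all assumptions chosen to mirror the fields listed above.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class UserContext:
    """Hypothetical model of the User Context panel (field names are assumptions)."""
    user_id: str
    name: str
    email: str
    role: str
    plan: str
    locale: str
    custom_fields: dict = field(default_factory=dict)


# Illustrative sample values only.
ctx = UserContext(
    user_id="usr_8a3f2b",
    name="Jane Smith",
    email="jane@acme.com",
    role="admin",
    plan="pro",
    locale="en-US",
    custom_fields={"company": "Acme", "seats": 12},
)

# Serialize the record to see every attribute the simulated user carries.
print(json.dumps(asdict(ctx), indent=2))
```

Changing a single attribute, such as the plan or the role, and re-running the same scenario is one way to check behavior that is gated on that attribute.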

Each Turn represents one round of the conversation (a user message and the agent’s response). At the top of each turn is a turn description (for example, “Checks that the agent explains RoofSnap as a measurement and estimating tool for contractors, and offers to take the user to their dashboard.”). This is auto-generated when importing, or you can write your own.

  • USER SAYS: the message sent to the agent in this turn (up to 10,000 characters).
  • EVALS: the checks that must pass for this turn to be considered successful. Click + Add Eval to add one or more.
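The relationship between a scenario, its turns, and their evals can be sketched as a small data model. Everything here is hypothetical (class names, the `kind` identifiers, and the example scenario name); only the 10,000-character limit on USER SAYS and the meaning of the Steps and Evals columns come from this page.

```python
from dataclasses import dataclass, field

MAX_USER_MESSAGE_CHARS = 10_000  # limit stated for USER SAYS


@dataclass
class Eval:
    kind: str      # hypothetical identifiers, e.g. "llm_judge", "contains", "navigate"
    expected: str  # criteria text, phrase, or URL, depending on the kind


@dataclass
class Turn:
    description: str
    user_says: str
    evals: list[Eval] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject messages longer than the documented limit.
        if len(self.user_says) > MAX_USER_MESSAGE_CHARS:
            raise ValueError("USER SAYS exceeds 10,000 characters")


@dataclass
class Scenario:
    name: str
    turns: list[Turn]

    @property
    def steps(self) -> int:
        """Number of conversation turns (the Steps column)."""
        return len(self.turns)

    @property
    def eval_count(self) -> int:
        """Total assertions checked per run (the Evals column)."""
        return sum(len(turn.evals) for turn in self.turns)


scenario = Scenario(
    name="Pricing walkthrough",  # hypothetical scenario name
    turns=[
        Turn(
            description="User asks about pricing tiers",
            user_says="What pricing plans do you offer?",
            evals=[Eval("llm_judge", "Agent explains Free, Pro ($49/mo), and Enterprise tiers")],
        ),
        Turn(
            description="User asks to see the pricing page",
            user_says="Can you take me to the pricing page?",
            evals=[Eval("contains", "pricing"), Eval("navigate", "/pricing")],
        ),
    ],
)

print(scenario.steps, scenario.eval_count)
```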

[Figure: Eval type picker showing response evals (LLM-as-Judge, Contains, Exact Match, Regex) and tool use evals (Action Called, Navigate, Knowledge Search).]

When you click + Add Eval, a picker appears with two top-level categories.

Response evals — does the agent’s reply say what it should?

Eval | How it works
LLM-as-Judge | A separate LLM evaluates the agent’s response against your written success criteria. Best for nuanced, natural-language assertions (e.g., “Agent explains the product and offers to navigate the user to their dashboard”).
Contains | Checks that the agent’s response includes a specific word or phrase. Fast and deterministic.
Exact Match | Checks that the agent’s response matches the expected text exactly (case-insensitive).
Regex | Validates the response against a regular expression pattern.
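The three deterministic response evals are simple enough to sketch in a few lines. This is an approximation, not the product's implementation, and it rests on assumptions this page does not confirm: that Contains ignores case by default, that Exact Match trims surrounding whitespace, and that Regex passes when the pattern matches anywhere in the response.

```python
import re


def contains(response: str, phrase: str, case_sensitive: bool = False) -> bool:
    """Pass if the phrase appears anywhere in the response.

    Case handling is an assumption; the page does not say which applies.
    """
    if case_sensitive:
        return phrase in response
    return phrase.casefold() in response.casefold()


def exact_match(response: str, expected: str) -> bool:
    """Pass if the whole response equals the expected text, ignoring case.

    Trimming surrounding whitespace is an assumption of this sketch.
    """
    return response.strip().casefold() == expected.strip().casefold()


def regex_match(response: str, pattern: str) -> bool:
    """Pass if the pattern matches somewhere in the response (assumed search semantics)."""
    return re.search(pattern, response) is not None


reply = "Our Pro plan is $49/mo. See the pricing page for details."
print(contains(reply, "pricing"))
print(exact_match("Hello there", "hello there"))
print(regex_match(reply, r"\$\d+/mo"))
```

Checks like these are cheap and repeatable, which makes them a good fit for facts that must appear verbatim, while LLM-as-Judge covers wording that can legitimately vary.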

Tool use evals — did the agent do the right thing?

Eval | How it works
Action Called | Verifies the agent called a specific action and lets you inspect the parameters it passed.
Navigate | Confirms the agent navigated to the expected URL or route.
Knowledge Search | Checks that the agent searched knowledge and found the expected document.
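Tool use evals look at what the agent did rather than what it said. One way to picture them is as checks over a recorded list of events for the turn. The trace format, the action name `open_ticket`, and the document name `pricing-overview` below are invented for illustration and do not describe the product's actual data.

```python
from typing import Optional

# Hypothetical record of what the agent did during one turn.
trace = [
    {"type": "knowledge_search", "documents": ["pricing-overview"]},
    {"type": "navigate", "url": "/pricing"},
    {"type": "action", "name": "open_ticket", "params": {"priority": "high"}},
]


def action_called(trace: list, name: str) -> Optional[dict]:
    """Return the params of the first call to `name`, or None if it was never called."""
    for event in trace:
        if event["type"] == "action" and event["name"] == name:
            return event["params"]
    return None


def navigated_to(trace: list, url: str) -> bool:
    """Pass if the agent navigated to the expected URL or route."""
    return any(e["type"] == "navigate" and e["url"] == url for e in trace)


def knowledge_found(trace: list, document: str) -> bool:
    """Pass if any knowledge search returned the expected document."""
    return any(
        e["type"] == "knowledge_search" and document in e["documents"]
        for e in trace
    )


print(action_called(trace, "open_ticket"))
print(navigated_to(trace, "/pricing"))
print(knowledge_found(trace, "pricing-overview"))
```

Returning the parameters from `action_called`, rather than a bare pass or fail, mirrors the idea that the eval lets you inspect what the agent passed to the action.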

Writing good LLM-as-Judge success criteria

The success criteria field describes what a correct agent response looks like. Write it in plain English, including an example response. The LLM judge compares the actual response against this description. Be specific:

  • Good: “Agent explains that RoofSnap is a measuring and estimating software for roofers and contractors, highlighting features like aerial measurements, estimates, contracts, and sketch tools. Offers to navigate the user to their dashboard.”
  • Too vague: “Agent gives a good answer about RoofSnap.”
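
To see why specific criteria matter, it helps to picture how a judge might receive them. The sketch below is a guess at the general shape of an LLM-as-Judge check, not the product's actual prompt or verdict format. The model call is passed in as a plain function (`ask_llm`, a hypothetical name) so the example runs without any API.

```python
from typing import Callable


def build_judge_prompt(criteria: str, response: str) -> str:
    """Combine the success criteria and the agent's reply into one grading prompt."""
    return (
        "You are grading an AI agent's reply.\n\n"
        f"Success criteria:\n{criteria}\n\n"
        f"Agent response:\n{response}\n\n"
        "Answer PASS or FAIL on the first line, then give one sentence of reasoning."
    )


def judge(criteria: str, response: str, ask_llm: Callable[[str], str]) -> bool:
    """Return True if the judge's verdict starts with PASS (assumed verdict format)."""
    verdict = ask_llm(build_judge_prompt(criteria, response))
    return verdict.strip().upper().startswith("PASS")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always passes."""
    return "PASS\nThe reply covers the listed features and offers navigation."


criteria = (
    "Agent explains that RoofSnap is a measuring and estimating software for "
    "roofers and contractors. Offers to navigate the user to their dashboard."
)
print(judge(criteria, "RoofSnap helps roofers measure and estimate jobs...", fake_llm))
```

The criteria text is the only description of correctness the judge receives in this setup, so a vague sentence gives it little to compare the response against.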